Revised Chr() & Asc() for UTF-16 surrogate pairs
Posted: Fri Feb 19, 2016 4:51 am
Here is a set of replacement functions for Chr() and Asc() that handle UTF-16 surrogate pairs in Unicode compilations.
The functions use macros to allow the seamless replacement of PureBasic's native functions.
Essentially, PureBasic limits the codepoint values returned from Chr() to 0 <= value <= $FFFF. My Chr() replacement returns a pair of UTF-16 surrogate codepoints for values > $FFFF, which lets it represent Unicode characters (codepoints) as high as $10FFFF. For values in the lower range it returns the same result as PureBasic's Chr().
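As a quick illustration of the arithmetic involved, here is a minimal sketch that works out the surrogate pair for a single sample value by hand. The variable names (cp, v, high, low) and the sample value $1F600 are only chosen for this example.

```purebasic
; Worked example: split one codepoint above $FFFF into a UTF-16 surrogate pair.
; cp, v, high and low are illustrative names, not part of the posted include file.
Define.i cp, v, high, low

cp = $1F600                  ; sample codepoint outside the BMP
v = cp - $10000              ; 20-bit offset past the BMP: $F600
high = v / $400 + $D800      ; upper 10 bits -> high/lead surrogate: $D83D
low = v % $400 + $DC00       ; lower 10 bits -> low/tail surrogate: $DE00

Debug "$" + Hex(cp) + " -> $" + Hex(high) + " $" + Hex(low) ; $1F600 -> $D83D $DE00
```

Each surrogate carries 10 bits of the offset, which is why the pair can reach $10FFFF ($10000 + $FFFFF) and no further.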
PureBasic's Asc() only returns the value of a single codepoint, not the value encoded by the pair of surrogate codepoints needed for characters (codepoints) > $FFFF. My Asc() replacement checks whether its parameter starts with a matching pair of UTF-16 surrogate codepoints and, if so, returns the value encoded by them.
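The reverse direction can be sketched the same way. This is only an illustration with hand-picked values; the variable names and the $FC00 mask test are my choices for the example, and it works on plain integers rather than on a string.

```purebasic
; Worked example: combine a UTF-16 surrogate pair back into one codepoint.
; high, low and cp are illustrative names; the pair below encodes $1F600.
Define.i high, low, cp

high = $D83D                 ; high/lead surrogate ($D800 to $DBFF)
low = $DE00                  ; low/tail surrogate ($DC00 to $DFFF)

; Masking with $FC00 keeps the top 6 bits, which identify the surrogate range.
If high & $FC00 = $D800 And low & $FC00 = $DC00
  cp = (high - $D800) * $400 + (low - $DC00) + $10000
  Debug "$" + Hex(cp)        ; $1F600
Else
  Debug "not a matching surrogate pair"
EndIf
```

If either half fails the range test, the pair is unmatched and the first value should simply be returned as is, which is what the replacement Asc() does.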
@Edit: Made a change to increase speed of the Asc() function by 5%.
@Edit2: Added the full URL to this thread to the source code.
Code:
;File Name: UTF-16 Chr() and Asc() functions.pbi
;Author: Demivec
;Created: 02/18/2016
;Updated: 02/23/2016
;Version: v01.01
;OS: All ;only tested on Windows
;Compiler: PureBasic v5.41 x64
;License: open and free to use and abuse; no guarantees
;Forum: http://www.purebasic.fr/english/viewtopic.php?f=12&t=64947
;Description: Replacements for PureBasic's Chr() and Asc() functions.
; The replacements allow for proper handling of all values in the UTF-16 range.
; Specifically Chr() now returns a surrogate pair of codepoints for values > $FFFF and
; Asc() will return a value for the corresponding surrogate pair of codepoints.
; This allows the full Unicode codepoint range (0 to $10FFFF).
CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Requires compiling as unicode."
CompilerEndIf

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn Chr(high) + Chr(low)
  EndIf
EndProcedure

Macro Chr(v = 0)
  _Chr(v)
EndMacro

Procedure _Asc(u$) ;return a proper codepoint value for a UTF-16 surrogate pair
  Protected *u = @u$, high = PeekU(*u), low
  Select high
    Case 0 To $D7FF, $DC00 To $FFFF ;includes range for low surrogate value ($DC00 to $DFFF)
      ProcedureReturn high ;return value as is (may be an unmatched low surrogate value)
    Case $D800 To $DBFF
      low = PeekU(*u + SizeOf(Unicode))
      ;mask with $FC00 so only $DC00 to $DFFF matches (a $DC00 mask would also accept $FC00 to $FFFF)
      If low & $FC00 = $DC00 ;low >= $DC00 And low <= $DFFF
        ProcedureReturn (high - $D800) * $400 + (low - $DC00) + $10000 ;return decoded surrogate pair
      EndIf
      ProcedureReturn high ;an unmatched high surrogate value, return value as is
  EndSelect
EndProcedure

Macro Asc(u = "")
  _Asc(u)
EndMacro

CompilerIf #PB_Compiler_IsMainFile
  ;Sample range of values starting at the low end of the Unicode BMP (Basic Multilingual Plane),
  ;moving through the high/low surrogate values and ending in the SMP (Supplementary Multilingual Plane).
  Define i, m$, d
  For i = $0 To $11000
    m$ = Chr(i)
    d = Asc(m$)
    Debug "$" + Hex(i) + "; Asc: " + Hex(d) + " Chr: " + m$
  Next
CompilerEndIf