The functions use macros to allow the seamless replacement of PureBasic's native functions.
Essentially, PureBasic's native Chr() only handles values in the range 0 <= value <= $FFFF. My Chr() replacement returns a pair of UTF-16 surrogate code units for values > $FFFF, which covers Unicode codepoints as high as $10FFFF. For the lower range it returns the same result as PureBasic's native function.
PureBasic's native Asc() only returns the value of a single code unit, so it cannot decode the surrogate pair needed for characters (codepoints) > $FFFF. My Asc() replacement checks whether its parameter starts with a matching pair of UTF-16 surrogates and, if so, returns the codepoint value encoded by the pair.
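To make the arithmetic concrete, here is a small worked sketch of the surrogate-pair calculation for a single value. The codepoint $1F600 and the variable names are my own picks for illustration. The snippet uses only plain integer math and Debug, so it does not depend on the replacement functions.

```purebasic
; Worked example (illustrative values): split one codepoint above $FFFF
; into a UTF-16 surrogate pair, then join the pair back together.
Define cp = $1F600                    ; example codepoint outside the BMP
Define offset = cp - $10000           ; $F600, the 20-bit offset into the upper planes
Define high = offset / $400 + $D800   ; $D83D, upper 10 bits -> high/lead surrogate
Define low = offset % $400 + $DC00    ; $DE00, lower 10 bits -> low/tail surrogate

Debug "high: $" + Hex(high) + " low: $" + Hex(low)

; Reverse the calculation to recover the original codepoint
Define decoded = (high - $D800) * $400 + (low - $DC00) + $10000
Debug "decoded: $" + Hex(decoded)     ; $1F600 again
```

Each surrogate carries 10 bits of the offset, which is why $400 (1024) is used for both the division and the remainder.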
;File Name: UTF-16 Chr() and Asc() functions.pbi
;Author: Demivec
;Created: 02/18/2016
;Updated: 02/23/2016
;Version: v01.01
;OS: All ;only tested on Windows
;Compiler: PureBasic v5.41 x64
;License: open and free to use and abuse; no guarantees
;Forum: http://www.purebasic.fr/english/viewtopic.php?f=12&t=64947
;Description: Replacements for PureBasic's Chr() and Asc() functions.
; The replacements allow for proper handling of all values in the UTF-16 range.
; Specifically Chr() now returns a surrogate pair of codepoints for values > $FFFF and
; Asc() will return a value for the corresponding surrogate pair of codepoints.
; This allows the full Unicode codepoint range (0 to $10FFFF).
CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Requires compiling as unicode."
CompilerEndIf

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn Chr(high) + Chr(low)
  EndIf
EndProcedure

Macro Chr(v = 0)
  _Chr(v)
EndMacro

Procedure _Asc(u$) ;return a proper codepoint value for a UTF-16 surrogate pair
  Protected *u = @u$, high = PeekU(*u), low
  Select high
    Case 0 To $D7FF, $DC00 To $FFFF ;includes range for low surrogate value ($DC00 to $DFFF)
      ProcedureReturn high ;return value as is (may be an unmatched low surrogate value)
    Case $D800 To $DBFF
      low = PeekU(*u + SizeOf(Unicode))
      If low >= $DC00 And low <= $DFFF ;only accept a real low surrogate (a plain "& $DC00" mask also matches $FC00 to $FFFF)
        ProcedureReturn (high - $D800) * $400 + (low - $DC00) + $10000 ;return decoded surrogate pair
      EndIf
      ProcedureReturn high ;an unmatched high surrogate value, return value as is
  EndSelect
EndProcedure

Macro Asc(u = "")
  _Asc(u)
EndMacro

CompilerIf #PB_Compiler_IsMainFile
  ;Sample range of values starting at the low end of the Unicode BMP (Basic Multilingual Plane),
  ;moving through the high/low surrogate values and ending at the start of the SMP (Supplementary Multilingual Plane).
  Define i, m$, d
  For i = $0 To $11000
    m$ = Chr(i)
    d = Asc(m$)
    Debug "$" + Hex(i) + "; Asc: " + Hex(d) + " Chr: " + m$
  Next
CompilerEndIf
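For use from another source file, a minimal sketch might look like the following. It assumes the listing above has been saved as "UTF-16 Chr() and Asc() functions.pbi" (the name given in its header) in the same folder as the including file, and that the program is compiled as unicode. The value $1F600 and the variable names are only illustrative.

```purebasic
; Usage sketch (assumed file location: same folder as this source file)
XIncludeFile "UTF-16 Chr() and Asc() functions.pbi"

Define s$ = Chr($1F600)    ; the macro routes this call to _Chr()
Debug Len(s$)              ; number of UTF-16 code units in the returned string
Debug "$" + Hex(Asc(s$))   ; the macro routes this call to _Asc(), which decodes the pair
```

Because the macros are defined after the procedures, only code following the include sees the replacements; the calls to Chr() inside _Chr() still reach the native function.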
@Edit2: Added the full URL to this thread to the source code.