
Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Fri Feb 19, 2016 4:51 am
by Demivec
Here is a set of replacement functions for Chr() and Asc() that handle UTF-16 surrogate code points in Unicode compilations.

The functions use macros to allow the seamless replacement of PureBasic's native functions.

Essentially, PureBasic limits the codepoint values accepted by Chr() to: 0 <= value <= $FFFF. My Chr() replacement returns a pair of UTF-16 surrogate codepoints that represents a Unicode character (codepoint) for values > $FFFF, up to $10FFFF. For the lower range of values it returns the same results as PureBasic.

PureBasic's Asc() only returns the value of a single codepoint, not of the pair of surrogate codepoints needed for characters (codepoints) > $FFFF. My Asc() replacement checks whether its parameter starts with a matching pair of UTF-16 surrogate codepoints and, if so, returns the value encoded by them.
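
As a worked example of the arithmetic described above, the following sketch splits the codepoint $1F600 into its surrogate pair and then decodes it again. It only works on numbers, so it does not depend on Chr() or Asc() at all; the variable names are illustrative and not taken from the include file.

```purebasic
; Illustrative arithmetic only: encode U+1F600 as a UTF-16 surrogate pair
; and decode it again. No strings are involved.
Define cp.i = $1F600
Define offset.i = cp - $10000            ; 20-bit offset above the BMP: $F600
Define high.i = $D800 + (offset >> 10)   ; upper 10 bits -> high/lead surrogate
Define low.i = $DC00 + (offset & $3FF)   ; lower 10 bits -> low/tail surrogate

Debug Hex(high) ; D83D
Debug Hex(low)  ; DE00

; Decoding reverses the steps
Define decoded.i = ((high - $D800) << 10) + (low - $DC00) + $10000
Debug Hex(decoded) ; 1F600
```

The shifts and masks are equivalent to the division and modulo by $400 used in the procedures, since $400 is 2 to the power of 10.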

Code:

;File Name: UTF-16 Chr() and Asc() functions.pbi
;Author: Demivec
;Created: 02/18/2016
;Updated: 02/23/2016
;Version: v01.01
;OS: All ;only tested on Windows
;Compiler: PureBasic v5.41 x64
;License: open and free to use and abuse; no guarantees
;Forum: http://www.purebasic.fr/english/viewtopic.php?f=12&t=64947
;Description: Replacements for PureBasic's Chr() and Asc() functions.
;  The replacements allow for proper handling of all values in the UTF-16 range.
;  Specifically Chr() now returns a surrogate pair of codepoints for values > $FFFF and
;  Asc() will return a value for the corresponding surrogate pair of codepoints.
;  This allows the full unicode codepoint range (0 to $10FFFF).

CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Requires compiling as unicode."
CompilerEndIf

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn Chr(high) + Chr(low)
  EndIf
EndProcedure

Macro Chr(v = 0)
  _Chr(v)
EndMacro


Procedure _Asc(u$)  ;return a proper codepoint value for a UTF-16 surrogate pair
  Protected *u = @u$, high = PeekU(*u), low
  Select high
    Case 0 To $D7FF, $DC00 To $FFFF ;includes range for low surrogate value ($DC00 to $DFFF)
      ProcedureReturn high             ;return value as is (may be an unmatched low surrogate value)
    Case $D800 To $DBFF
      low = PeekU(*u + SizeOf(Unicode))
      If low & $FC00 = $DC00 ;true only if low >= $DC00 And low <= $DFFF (a $DC00 mask would also accept $FC00 to $FFFF)
        ProcedureReturn (high - $D800) * $400 + (low - $DC00) + $10000 ;return decoded surrogate pair
      EndIf
      
      ProcedureReturn high ;an unmatched high surrogate value, return value as is
  EndSelect
EndProcedure

Macro Asc(u = "")
  _Asc(u)
EndMacro

CompilerIf #PB_Compiler_IsMainFile
  ;Sample range of values starting at the low end of the Unicode BMP (Basic Multilingual Plane)
  ;and moving through the high/low surrogate pairs and ending at the start of SMP (Supplementary Multilingual Plane).
  Define i, m$, d
  
  For i = $0 To $11000
    m$ = Chr(i)
    d = Asc(m$)
    Debug  "$" + Hex(i) + "; Asc: " + Hex(d) + " Chr: " + m$
  Next
CompilerEndIf

@Edit: Made a change to increase speed of the Asc() function by 5%.
@Edit2: Added the full URL to this thread to the source code.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Fri Feb 19, 2016 6:00 am
by Little John
Many thanks, Demivec!

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Fri Feb 19, 2016 10:40 am
by davido
@Demivec,
Interesting; something new to learn.
The output seems a little odd, though:
The debug font seems to change to a mono font from time-to-time. At $3001 it appears to become a mono font with a reversion at $3022 and each increment of $65 thereafter!

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Fri Feb 19, 2016 1:59 pm
by Demivec
davido wrote:@Demivec,
Interesting; something new to learn.
The output seems a little odd, though:
The debug font seems to change to a mono font from time-to-time. At $3001 it appears to become a mono font with a reversion at $3022 and each increment of $65 thereafter!
If you are running Windows, it selects a different font if the one you are using doesn't have a glyph for the character you are trying to print. You will notice, though, that there are still many codepoints that don't have a visible glyph or are not included in a font yet.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Fri Feb 19, 2016 5:40 pm
by davido
@Demivec,
I am running Windows 10.
Thank you very much for the explanation.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Mon Apr 09, 2018 10:33 pm
by sevny
Thank you very much for this code. I needed it to display emoji etc. on my Mac, and I had not been able to understand the problem.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Tue Apr 10, 2018 6:40 am
by STARGÅTE
If you work with strings that include characters above $FFFF, you also need "new" functions for Len(), Mid() etc.
Here is my solution: http://www.purebasic.fr/german/viewtopi ... 14#p340514
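
To illustrate why Len() needs a counterpart, here is a minimal sketch of a codepoint-aware length function. It is not the code from the linked post; the name LenCP() is hypothetical, and it assumes a Unicode build where strings are stored as null-terminated UTF-16.

```purebasic
; Hypothetical helper (not from the linked thread): count codepoints instead
; of UTF-16 code units. A matched surrogate pair is counted once.
Procedure.i LenCP(s$)
  Protected *c.Unicode = @s$
  Protected *n.Unicode
  Protected count.i = 0

  If *c = 0 ; guard against a null string pointer
    ProcedureReturn 0
  EndIf

  While *c\u <> 0
    If *c\u >= $D800 And *c\u <= $DBFF     ; high surrogate found
      *n = *c + SizeOf(Unicode)            ; look at the next code unit
      If *n\u >= $DC00 And *n\u <= $DFFF   ; matching low surrogate
        *c = *n                            ; skip it, the pair is one codepoint
      EndIf
    EndIf
    count + 1
    *c + SizeOf(Unicode)
  Wend

  ProcedureReturn count
EndProcedure

; Build "A", U+1F600, "B" without calling Chr() on surrogate values
Define high.u = $D83D, low.u = $DE00
Define s$ = "A" + PeekS(@high, 1, #PB_Unicode) + PeekS(@low, 1, #PB_Unicode) + "B"
Debug Len(s$)   ; length in UTF-16 code units
Debug LenCP(s$) ; length in codepoints
```

Unmatched surrogates are counted as one codepoint each, which mirrors how the Asc() replacement in the first post returns them as is.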

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Tue Apr 10, 2018 6:52 am
by wilbert
STARGÅTE wrote:If you work with strings that include characters above $FFFF, you also need "new" functions for Len(), Mid() etc.
Here is my solution: http://www.purebasic.fr/german/viewtopi ... 14#p340514
Very nice :)

For functions like Left, Right , Mid and Len, you could consider writing asm procedures to make things faster.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Tue Apr 10, 2018 9:10 am
by sevny
Very useful, it works. I have to say there are other problems coming from the system:
a font is searched for to obtain the special glyph, and I often have to call
"LoadFont" again to get my overwritten font back.
These new large Unicode characters seem difficult to display.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Sun Mar 31, 2024 5:36 pm
by Little John
The code in the first post here works fine e.g. with PB 5.73 LTS. Many thanks again to Demivec!

However, when trying to run the code e.g. with PB 6.04 (x64) or PB 6.10 (x64) on Windows, an error is raised. I didn't test with other PB versions, though.
The following snippet

Code:

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
   Protected high, low
   If v < $10000
      ProcedureReturn Chr(v)
   Else
      ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
      v - $10000
      high = v / $400 + $D800 ;high/lead surrogate value
      low = v % $400 + $DC00  ;low/tail surrogate value
      ProcedureReturn Chr(high) + Chr(low)
   EndIf
EndProcedure


Debug _Chr($1F600)  ; Smiley

stops at the 2nd ProcedureReturn line and causes the following error message:
“[ERROR] Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.”

Where is the change to PureBasic's Chr() function documented?
How can we get UTF-16 surrogate pairs for characters > $FFFF e.g. with PB 6.04 and PB 6.10 :?:

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Sun Mar 31, 2024 7:01 pm
by STARGÅTE
Little John wrote: Sun Mar 31, 2024 5:36 pm How can we get UTF-16 surrogate pairs for characters > $FFFF e.g. with PB 6.04 and PB 6.10 :?:
Just replace Chr() with PeekS(@high, 2, #PB_Unicode) and PeekS(@low, 2, #PB_Unicode).
Little John wrote: Sun Mar 31, 2024 5:36 pm Where is the change to PureBasic's Chr() function documented?
please remove range test from chr()

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Sun Mar 31, 2024 8:14 pm
by mk-soft
Maybe ...

Code:

Structure ArrayOfChar
  c.c[0]
EndStructure

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected r1.s{1}, r2.s{2}, *p1.ArrayOfChar
  
  If v < $10000
    *p1 = @r1
    *p1\c[0] = v
    ProcedureReturn r1
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    *p1 = @r2
    v - $10000
    *p1\c[0] = v / $400 + $D800 ;high/lead surrogate value
    *p1\c[1] = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn r2
  EndIf
EndProcedure


a$ = _Chr($1F600) + " Smiley"
Debug a$
a$ = _Chr($0040) + " At"
Debug a$

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Sun Mar 31, 2024 8:53 pm
by Little John
STARGÅTE, you saved my program.
Thank you very much!

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Mon Apr 01, 2024 8:21 am
by STARGÅTE
Little John wrote: Sun Mar 31, 2024 8:53 pm STARGÅTE, you saved my program.
Thank you very much!
Actually, I made a mistake. PeekS only needs to read 1 character.

Code:

ProcedureReturn PeekS(@high, 1, #PB_Unicode) + PeekS(@low, 1, #PB_Unicode)
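
Dropped into the _Chr() procedure from the first post, the corrected line might look like the sketch below. This is an assembled example rather than code posted in the thread. It assumes a Unicode build on a little-endian CPU, so that the first two bytes of the integer variables hold the 16-bit surrogate value that PeekS() reads.

```purebasic
; Sketch: _Chr() from the first post with the PeekS() based return line.
; Assumption: little-endian CPU, so @high and @low point at the low 16 bits.
Procedure.s _Chr(v.i)
  Protected high.i, low.i
  If v < $10000
    ProcedureReturn Chr(v) ; BMP values still go through the native Chr()
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00  ;low/tail surrogate value
    ; Read each surrogate as a one-character string instead of calling Chr()
    ProcedureReturn PeekS(@high, 1, #PB_Unicode) + PeekS(@low, 1, #PB_Unicode)
  EndIf
EndProcedure

Define s$ = _Chr($1F600)
Debug Len(s$)                           ; number of UTF-16 code units in the result
Debug Hex(PeekU(@s$))                   ; first code unit (high surrogate)
Debug Hex(PeekU(@s$ + SizeOf(Unicode))) ; second code unit (low surrogate)
```

Values from $D800 to $DFFF are below $10000, so they still reach the native Chr() and can be expected to trigger the same range error quoted above in the PB versions that perform that check.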

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Posted: Mon Apr 01, 2024 12:03 pm
by Little John
Oh, yes. I had forgotten that PeekS() requires the number of characters rather than the number of bytes. Thanks!