Revised Chr() & Asc() for UTF-16 surrogate pairs

Demivec
Addict
Posts: 4091
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Demivec »

Here is a set of replacement functions for Chr() and Asc() that handle UTF-16 surrogate pairs in Unicode compilations.

The functions use macros to allow the seamless replacement of PureBasic's native functions.

Essentially, PureBasic limits the codepoint values accepted by Chr() to 0 <= value <= $FFFF. My Chr() replacement returns a pair of UTF-16 surrogates for Unicode characters (codepoints) with values > $FFFF, up to $10FFFF. For values in the lower range it returns the same result as PureBasic's Chr().

PureBasic's Asc() only returns the value of a single 16-bit code unit, not the value encoded by the pair of surrogates that is needed for characters (codepoints) > $FFFF. My Asc() replacement checks whether its parameter starts with a matching pair of UTF-16 surrogates and, if so, returns the value encoded by them.
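
As a worked example of the arithmetic behind a surrogate pair, here is a small sketch for the codepoint $1F600 (chosen only for illustration). It uses plain integer math and makes no Chr() call, so it does not depend on how any particular PureBasic version treats surrogate values:

```purebasic
; Worked example: split codepoint $1F600 into its UTF-16 surrogate pair
Define v = $1F600 - $10000        ; offset into the supplementary range = $F600
Define high = v / $400 + $D800    ; upper 10 bits -> high/lead surrogate
Define low = v % $400 + $DC00     ; lower 10 bits -> low/tail surrogate
Debug Hex(high) + " " + Hex(low)  ; D83D DE00

; and back again, the way a decoder combines the two halves
Debug Hex((high - $D800) * $400 + (low - $DC00) + $10000) ; 1F600
```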

Code: Select all

;File Name: UTF-16 Chr() and Asc() functions.pbi
;Author: Demivec
;Created: 02/18/2016
;Updated: 02/23/2016
;Version: v01.01
;OS: All ;only tested on Windows
;Compiler: PureBasic v5.41 x64
;License: open and free to use and abuse; no guarantees
;Forum: http://www.purebasic.fr/english/viewtopic.php?f=12&t=64947
;Description: Replacements for PureBasic's Chr() and Asc() functions.
;  The replacements allow for proper handling of all values in the UTF-16 range.
;  Specifically Chr() now returns a surrogate pair of codepoints for values > $FFFF and
;  Asc() will return a value for the corresponding surrogate pair of codepoints.
;  This allows the full unicode codepoint range (0 to $10FFFF).

CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Requires compiling as unicode."
CompilerEndIf

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn Chr(high) + Chr(low)
  EndIf
EndProcedure

Macro Chr(v = 0)
  _Chr(v)
EndMacro


Procedure _Asc(u$)  ;return a proper codepoint value for a UTF-16 surrogate pair
  Protected *u = @u$, high = PeekU(*u), low
  Select high
    Case 0 To $D7FF, $DC00 To $FFFF ;includes range for low surrogate value ($DC00 to $DFFF)
      ProcedureReturn high             ;return value as is (may be an unmatched low surrogate value)
    Case $D800 To $DBFF
      low = PeekU(*u + SizeOf(Unicode)) 
      If low & $FC00 = $DC00 ;low >= $DC00 And low <= $DFFF (a mask of $DC00 would also accept $FC00 to $FFFF)
        ProcedureReturn (high - $D800) * $400 + (low - $DC00) + $10000 ;return decoded surrogate pair
      EndIf
      
      ProcedureReturn high ;an unmatched high surrogate value, return value as is
  EndSelect
EndProcedure

Macro Asc(u = "")
  _Asc(u)
EndMacro

CompilerIf #PB_Compiler_IsMainFile
  ;Sample range of values starting at the low end of the Unicode BMP (Basic Multilingual Plane),
  ;moving through the high/low surrogate pairs and ending at the start of the SMP (Supplementary Multilingual Plane).
  Define i, m$, d
  
  For i = $0 To $11000
    m$ = Chr(i)
    d = Asc(m$)
    Debug  "$" + Hex(i) + "; Asc: " + Hex(d) + " Chr: " + m$
  Next
CompilerEndIf

@Edit: Made a change to increase speed of the Asc() function by 5%.
@Edit2: Added the full URL to this thread to the source code.
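
A note on testing for a low surrogate with a single mask: every value from $DC00 to $DFFF has the same top six bits (%110111), so masking with $FC00 and comparing against $DC00 is equivalent to the range test. The sketch below is only a brute-force check of that equivalence over all 16-bit values; the variable names are made up for the illustration.

```purebasic
; Compare the range test with the mask test for every 16-bit value
Define u, mismatches = 0
For u = 0 To $FFFF
  If Bool(u >= $DC00 And u <= $DFFF) <> Bool((u & $FC00) = $DC00)
    mismatches + 1
  EndIf
Next
Debug mismatches ; 0
```

A mask of $DC00 instead of $FC00 leaves bit 13 unchecked and would therefore also match $FC00 to $FFFF.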
Last edited by Demivec on Mon Feb 27, 2017 8:37 am, edited 3 times in total.
Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Little John »

Many thanks, Demivec!
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by davido »

@Demivec,
Interesting; something new to learn.
The output seems a little odd, though:
The debug font seems to change to a mono font from time-to-time. At $3001 it appears to become a mono font with a reversion at $3022 and each increment of $65 thereafter!
DE AA EB
Demivec
Addict
Posts: 4091
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Demivec »

davido wrote:@Demivec,
Interesting; something new to learn.
The output seems a little odd, though:
The debug font seems to change to a mono font from time-to-time. At $3001 it appears to become a mono font with a reversion at $3022 and each increment of $65 thereafter!
If you are running Windows, it selects a different font if the one you are using doesn't have a glyph for the character you are trying to print. You will notice, though, that there are still many codepoints that don't have a visible glyph or are not included in any font yet.
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by davido »

@Demivec,
I am running Windows 10.
Thank you very much for the explanation.
sevny
New User
Posts: 6
Joined: Wed Jan 10, 2018 2:33 pm

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by sevny »

Thank you very much for this code. I needed it to display emoji etc. on my Mac, and I had not been able to understand the problem.
MacBookAir M1 2020 Ventura PB6LTS
STARGÅTE
Addict
Posts: 2089
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by STARGÅTE »

If you work with strings that include characters above $FFFF, you also need "new" functions for Len(), Mid(), etc.
Here is my solution: http://www.purebasic.fr/german/viewtopi ... 14#p340514
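
To illustrate why Len() needs a replacement, here is a minimal sketch of a codepoint-aware length function. It is not the code from the linked thread: the name LenCodepoints() is hypothetical, and the approach (a simple scan with PeekU()) is just one way to do it. The test string is built with PeekS() so that no Chr() call receives a surrogate value.

```purebasic
; Hypothetical helper (not STARGÅTE's code): count codepoints, not UTF-16 code units
Procedure.i LenCodepoints(s$)
  Protected *p = @s$, n = Len(s$), i = 0, count = 0
  While i < n
    ; a high surrogate directly followed by a low surrogate forms one codepoint
    If i + 1 < n And (PeekU(*p + i * SizeOf(Unicode)) & $FC00) = $D800 And (PeekU(*p + (i + 1) * SizeOf(Unicode)) & $FC00) = $DC00
      i + 2
    Else
      i + 1 ; BMP character or unmatched surrogate
    EndIf
    count + 1
  Wend
  ProcedureReturn count
EndProcedure

Define high.u = $D83D, low.u = $DE00 ; surrogate pair for $1F600
Define s$ = "A" + PeekS(@high, 1, #PB_Unicode) + PeekS(@low, 1, #PB_Unicode)
Debug Len(s$)           ; 3 (code units)
Debug LenCodepoints(s$) ; 2 (codepoints)
```

Mid(), Left() and Right() would need the same kind of scan to translate a codepoint index into a code unit offset.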
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more
Typeface - Sprite-based font include/module
wilbert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by wilbert »

STARGÅTE wrote: If you work with strings that include characters above $FFFF, you also need "new" functions for Len(), Mid(), etc.
Here is my solution: http://www.purebasic.fr/german/viewtopi ... 14#p340514
Very nice :)

For functions like Left(), Right(), Mid() and Len(), you could consider writing asm procedures to make things faster.
Windows (x64)
Raspberry Pi OS (Arm64)
sevny
New User
Posts: 6
Joined: Wed Jan 10, 2018 2:33 pm

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by sevny »

Very useful, it works. I have to say there are other problems coming from the system:
a font is searched for to obtain the special glyph, and I often have to call
"LoadFont" again to get back my font, which has been overridden.
These new large Unicode characters seem difficult to display.
Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Little John »

The code in the first post here works fine e.g. with PB 5.73 LTS. Many thanks again to Demivec!

However, when trying to run the code e.g. with PB 6.04 (x64) or PB 6.10 (x64) on Windows, an error is raised. I didn't test with other PB versions, though.
The following snippet

Code: Select all

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
   Protected high, low
   If v < $10000
      ProcedureReturn Chr(v)
   Else
      ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
      v - $10000
      high = v / $400 + $D800 ;high/lead surrogate value
      low = v % $400 + $DC00  ;low/tail surrogate value
      ProcedureReturn Chr(high) + Chr(low)
   EndIf
EndProcedure


Debug _Chr($1F600)  ; Smiley
stops at the 2nd ProcedureReturn line and causes the following error message:
“[ERROR] Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.”

Where is the change to PureBasic's Chr() function documented?
How can we get UTF-16 surrogate pairs for characters > $FFFF, e.g. with PB 6.04 and PB 6.10?
STARGÅTE
Addict
Posts: 2089
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by STARGÅTE »

Little John wrote: Sun Mar 31, 2024 5:36 pm How can we get UTF-16 surrogate pairs for characters > $FFFF, e.g. with PB 6.04 and PB 6.10?
Just replace Chr() with PeekS(@high, 2, #PB_Unicode) and PeekS(@low, 2, #PB_Unicode).
Little John wrote: Sun Mar 31, 2024 5:36 pm Where is the change to PureBasic's Chr() function documented?
please remove range test from chr()
mk-soft
Always Here
Posts: 5406
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by mk-soft »

Maybe ...

Code: Select all

Structure ArrayOfChar
  c.c[0]
EndStructure

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected r1.s{1}, r2.s{2}, *p1.ArrayOfChar
  
  If v < $10000
    *p1 = @r1
    *p1\c[0] = v
    ProcedureReturn r1
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    *p1 = @r2
    v - $10000
    *p1\c[0] = v / $400 + $D800 ;high/lead surrogate value
    *p1\c[1] = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn r2
  EndIf
EndProcedure


a$ = _Chr($1F600) + " Smiley"
Debug a$
a$ = _Chr($0040) + " At"
Debug a$
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Little John »

STARGÅTE, you saved my program.
Thank you very much!
STARGÅTE
Addict
Posts: 2089
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by STARGÅTE »

Little John wrote: Sun Mar 31, 2024 8:53 pm STARGÅTE, you saved my program.
Thank you very much!
Actually, I made a mistake. PeekS() only needs to read 1 character.

Code: Select all

ProcedureReturn PeekS(@high, 1, #PB_Unicode) + PeekS(@low, 1, #PB_Unicode)
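
Put into context, the corrected line would sit in the replacement procedure roughly as follows. This is only a sketch that merges Demivec's original procedure with the PeekS() line above. It assumes a little-endian target, since PeekS() reads the low 16 bits of the integer variables, and it still passes values below $10000 to the native Chr(), which on newer versions rejects lone surrogates ($D800 to $DFFF) according to the error message quoted earlier.

```purebasic
Procedure.s _Chr(v.i) ;return a surrogate pair for values outside the BMP
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v) ;native Chr() handles the BMP (lone surrogates excepted)
  Else
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00  ;low/tail surrogate value
    ;read one UTF-16 code unit from each variable instead of calling Chr()
    ProcedureReturn PeekS(@high, 1, #PB_Unicode) + PeekS(@low, 1, #PB_Unicode)
  EndIf
EndProcedure

Define s$ = _Chr($1F600)
Debug Len(s$) ; 2 (code units)
Debug Hex(PeekU(@s$)) + " " + Hex(PeekU(@s$ + SizeOf(Unicode))) ; D83D DE00
```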
Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Little John »

Oh, yes. I had forgotten that PeekS() requires the number of characters rather than the number of bytes. Thanks!