Revised Chr() & Asc() for UTF-16 surrogate pairs

Demivec
Addict
Posts: 4085
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Post by Demivec »

Here is a set of replacement functions for Chr() and Asc() to handle UTF-16 surrogate code points in unicode compilations.

The functions use macros to allow the seamless replacement of PureBasic's native functions.

Essentially, PureBasic's Chr() only handles codepoint values in the range 0 <= value <= $FFFF. My Chr() replacement returns a pair of UTF-16 surrogate codepoints for values > $FFFF, which covers unicode characters (codepoints) as high as $10FFFF. For the lower range of values it returns the same result as PureBasic's native function.

PureBasic's Asc() only returns the value of a single codepoint; it cannot decode the pair of surrogate codepoints needed for characters (codepoints) > $FFFF. My Asc() replacement checks whether its parameter starts with a matching pair of UTF-16 surrogate codepoints and, if so, returns the value encoded by them.
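As a worked example of the arithmetic described above, here is a small sketch that splits one codepoint into its surrogate pair and recombines it. The sample value $1F600 and the variable names are chosen only for illustration; the snippet uses nothing but native integer operations.

```purebasic
; Worked example (illustrative values): split codepoint $1F600 into a surrogate pair, then recombine it.
Define cp = $1F600
Define delta = cp - $10000            ; $F600, the 20-bit offset above the BMP
Define high = delta / $400 + $D800    ; upper 10 bits -> $D83D (high/lead surrogate)
Define low  = delta % $400 + $DC00    ; lower 10 bits -> $DE00 (low/tail surrogate)
Debug "high = $" + Hex(high) + ", low = $" + Hex(low)

; Reverse direction: rebuild the codepoint from the pair.
Define back = (high - $D800) * $400 + (low - $DC00) + $10000
Debug "decoded = $" + Hex(back)       ; shows decoded = $1F600
```

Each surrogate carries 10 bits of the offset, which is why the divisor and modulus are both $400.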

Code: Select all

;File Name: UTF-16 Chr() and Asc() functions.pbi
;Author: Demivec
;Created: 02/18/2016
;Updated: 02/23/2016
;Version: v01.01
;OS: All ;only tested on Windows
;Compiler: PureBasic v5.41 x64
;License: open and free to use and abuse; no guarantees
;Forum: http://www.purebasic.fr/english/viewtopic.php?f=12&t=64947
;Description: Replacements for PureBasic's Chr() and Asc() functions.
;  The replacements allow for proper handling of all values in the UTF-16 range.
;  Specifically Chr() now returns a surrogate pair of codepoints for values > $FFFF and
;  Asc() will return a value for the corresponding surrogate pair of codepoints.
;  This allows the full unicode codepoint range (0 to $10FFFF).

CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Requires compiling as unicode."
CompilerEndIf

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00 ;low/tail surrogate value
    ProcedureReturn Chr(high) + Chr(low)
  EndIf
EndProcedure

Macro Chr(v = 0)
  _Chr(v)
EndMacro


Procedure _Asc(u$)  ;return a proper codepoint value for a UTF-16 surrogate pair
  Protected *u = @u$, high = PeekU(*u), low
  Select high
    Case 0 To $D7FF, $DC00 To $FFFF ;includes range for low surrogate value ($DC00 to $DFFF)
      ProcedureReturn high             ;return value as is (may be an unmatched low surrogate value)
    Case $D800 To $DBFF
      low = PeekU(*u + SizeOf(Unicode)) 
      If (low & $FC00) = $DC00 ;low >= $DC00 And low <= $DFFF (mask $FC00 tests the top 6 bits, so $E000 to $FFFF no longer match)
        ProcedureReturn (high - $D800) * $400 + (low - $DC00) + $10000 ;return decoded surrogate pair
      EndIf
      
      ProcedureReturn high ;an unmatched high surrogate value, return value as is
  EndSelect
EndProcedure

Macro Asc(u = "")
  _Asc(u)
EndMacro

CompilerIf #PB_Compiler_IsMainFile
  ;Sample range of values starting at the low end of the Unicode BMP (Basic Multilingual Plane)
  ;and moving through the high/low surrogate pairs and ending at the start of SMP (Supplementary Multilingual Plane).
  Define i, m$, d
  
  For i = $0 To $11000
    m$ = Chr(i)
    d = Asc(m$)
    Debug  "$" + Hex(i) + "; Asc: " + Hex(d) + " Chr: " + m$
  Next
CompilerEndIf

@Edit: Made a change to increase speed of the Asc() function by 5%.
@Edit2: Added the full URL to this thread to the source code.
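A short usage sketch may help show what changes once the include is in place. It assumes the listing above has been saved under the file name given in its header; the sample codepoint is arbitrary.

```purebasic
; Usage sketch; assumes the listing above is saved under the file name given in its header.
XIncludeFile "UTF-16 Chr() and Asc() functions.pbi"

Define s$ = Chr($1F600)   ; the replacement Chr() builds a surrogate pair
Debug Len(s$)             ; native Len() counts 16-bit units, so this shows 2
Debug Hex(Asc(s$))        ; the replacement Asc() decodes the pair: 1F600
Debug Hex(Asc("A"))       ; BMP characters behave as before: 41
```

Because the include is not the main file here, its built-in test loop is skipped.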
Last edited by Demivec on Mon Feb 27, 2017 8:37 am, edited 3 times in total.
Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Little John »

Many thanks, Demivec!
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by davido »

@Demivec,
Interesting; something new to learn.
The output seems a little odd, though:
The debug font seems to change to a mono font from time to time. At $3001 it appears to become a mono font with a reversion at $3022 and each increment of $65 thereafter!

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by Demivec »

davido wrote:
@Demivec,
Interesting; something new to learn.
The output seems a little odd, though:
The debug font seems to change to a mono font from time to time. At $3001 it appears to become a mono font with a reversion at $3022 and each increment of $65 thereafter!

If you are running Windows, it selects a different font if the one you are using doesn't have a glyph for the character you are trying to print. You will notice, though, that there are still many codepoints that don't have a visible glyph or are not included in a font yet.

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by davido »

@Demivec,
I am running Windows 10.
Thank you very much for the explanation.
sevny
New User
Posts: 6
Joined: Wed Jan 10, 2018 2:33 pm

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by sevny »

Thank you very much for this code. I needed it to display emoji etc. on my Mac; until now I had not been able to understand the problem.
MacBookAir M1 2020 Ventura PB6LTS
STARGÅTE
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by STARGÅTE »

If you work with strings that include characters above $FFFF, you also need "new" functions for Len(), Mid() etc.
Here is my solution: http://www.purebasic.fr/german/viewtopi ... 14#p340514
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more
Typeface - Sprite-based font include/module
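To illustrate why Len() needs a counterpart, here is a minimal sketch of a length function that counts codepoints rather than 16-bit units. It is not the code from the linked post; LenCodePoints is a hypothetical name introduced for this example, and the sketch assumes a unicode compilation.

```purebasic
; Hypothetical helper (not from the linked post): count codepoints instead of 16-bit units.
Procedure.i LenCodePoints(s$)
  Protected *p.Unicode = @s$, count, u
  If *p = 0
    ProcedureReturn 0 ;guard in case an empty string yields a null pointer
  EndIf
  While *p\u
    u = *p\u
    *p + SizeOf(Unicode)
    ;a high surrogate followed by a low surrogate counts as one codepoint
    If u >= $D800 And u <= $DBFF And *p\u >= $DC00 And *p\u <= $DFFF
      *p + SizeOf(Unicode)
    EndIf
    count + 1
  Wend
  ProcedureReturn count
EndProcedure

Define s$ = "A" + Chr($D83D) + Chr($DE00) + "B" ;"A", $1F600 as a surrogate pair, "B"
Debug Len(s$)           ;4 UTF-16 code units
Debug LenCodePoints(s$) ;3 codepoints
```

Unmatched surrogates are counted as one codepoint each, which mirrors how the Asc() replacement in this thread returns them as is.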
wilbert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by wilbert »

STARGÅTE wrote:
If you work with strings that include characters above $FFFF, you also need "new" functions for Len(), Mid() etc.
Here is my solution: http://www.purebasic.fr/german/viewtopi ... 14#p340514

Very nice :)

For functions like Left(), Right(), Mid() and Len(), you could consider writing asm procedures to make things faster.
Windows (x64)
Raspberry Pi OS (Arm64)

Re: Revised Chr() & Asc() for UTF-16 surrogate pairs

Post by sevny »

Very useful, it works. I have to say there are other problems that come from the system:
a different font is searched for to obtain the special glyph, and I often have to call
LoadFont again to get back my own font, which was overwritten.
These new large Unicode characters seem difficult to display.