Chr() support characters > $FFFF

Posted: Wed Oct 26, 2016 5:40 pm
by kenmo
Warning: This post applies to PB on Windows -- I don't know how these examples work on other OSes!

PB's Unicode string mode seems to really be UTF-16, meaning characters over $FFFF can be represented as surrogate pairs (two 16-bit values).
For example, you can use the Unicode RUNNER character U+1F3C3 in your GUI with the pair $D83C $DFC3.
However, it's not simple to use these high characters in your PB code.
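
For reference, the pair above follows from the standard UTF-16 split: subtract $10000, then put the top 10 bits of the offset on $D800 and the low 10 bits on $DC00. A minimal sketch of that arithmetic (the variable names are only for illustration, not anything built into PB):

```purebasic
; Sketch: derive the UTF-16 surrogate pair for U+1F3C3 by hand.
Codepoint.i = $1F3C3
Offset.i    = Codepoint - $10000              ; $F3C3, fits in 20 bits
High.i      = $D800 + ((Offset >> 10) & $3FF) ; top 10 bits ($3C)  -> $D83C
Low.i       = $DC00 + (Offset & $3FF)         ; low 10 bits ($3C3) -> $DFC3

Debug Hex(High) ; D83C
Debug Hex(Low)  ; DFC3
```

This is only the number crunching; turning the two values into a string is where the Chr() limitations come in.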

My request is:
1. Chr() accepts characters > $FFFF and generates the surrogate pair (string with Len() = 2)
AND/OR
2. Chr() accepts surrogate pair constants ($D800-$DFFF) without converting them to REPLACEMENT CHAR ($FFFD)


Try the example below.

Methods 1-3 show three ways of using surrogate pairs (PokeU, Chr with variable inputs, DataSection)
Method 4 passes the high codepoint to Chr(), which truncates it to 16-bit and produces the wrong character :(
Method 5 tries to build the surrogate pair with two Chr() calls, but they are converted to $FFFD instead :(
Method 6 shows a custom "ChrU" procedure which corrects cases 4 and 5!


Please consider this functionality for future PB, especially now that Unicode mode is standard! :)

Code: Select all

; A Unicode character > $FFFF
#Runner_Codepoint     = $1F3C3

; Represented as UTF-16 surrogate pair
#Runner_HighSurrogate = $D83C
#Runner_LowSurrogate  = $DFC3

CompilerIf Not #PB_Compiler_Unicode
  CompilerError "Compile in Unicode mode"
CompilerEndIf
Debug "Use a debugger font like Segoe UI Symbol..."
Debug ""






; Method 1: Poke surrogate pair
Str$ = Space(2)
PokeU(@Str$,     #Runner_HighSurrogate)
PokeU(@Str$ + 2, #Runner_LowSurrogate)
Debug Str$

; Method 2: Chr() with variables
hi = #Runner_HighSurrogate
lo = #Runner_LowSurrogate
Str$ = Chr(hi) + Chr(lo)
Debug Str$

; Method 3: Data Section
DataSection
  UTF16_String:
  Data.u #Runner_HighSurrogate, #Runner_LowSurrogate, #NUL
EndDataSection
Str$ = PeekS(?UTF16_String, -1, #PB_Unicode) ; same as #PB_UTF16
Debug Str$






; Method 4: Chr() with value > $FFFF - TRUNCATED TO 16-BIT
Str$ = Chr(#Runner_Codepoint)
Debug Str$
;ShowMemoryViewer(@Str$, StringByteLength(Str$) + 2)

; Method 5: Chr() with constants - CONVERTED TO $FFFD REPLACEMENT CHARS
Str$ = Chr(#Runner_HighSurrogate) + Chr(#Runner_LowSurrogate)
Debug Str$
;ShowMemoryViewer(@Str$, StringByteLength(Str$) + 2)




; Method 6: Modified Chr() which can output surrogate pairs
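Procedure.s ChrU(Codepoint.i)
  Protected Result.s
  If (Codepoint > $10FFFF)
    ; Beyond the Unicode range: no valid UTF-16 encoding exists
    ProcedureReturn ""
  ElseIf (Codepoint > $FFFF)
    ; Supplementary plane: split the 20-bit offset into a surrogate pair
    Result = "  "
    Codepoint - $10000
    PokeU(@Result, $D800 + ((Codepoint >> 10) & $3FF))
    PokeU(@Result + 2, $DC00 + (Codepoint & $3FF))
    ProcedureReturn Result
  ElseIf (Codepoint >= $0000)
    ; 16-bit value (including a lone surrogate): Chr() with a variable passes it through
    ProcedureReturn Chr(Codepoint)
  Else
    ProcedureReturn ""
  EndIf
EndProcedure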
Str$ = ChrU(#Runner_Codepoint)
Debug Str$
Str$ = ChrU(#Runner_HighSurrogate) + ChrU(#Runner_LowSurrogate)
Debug Str$

Re: Chr() support characters > $FFFF

Posted: Wed Oct 26, 2016 6:45 pm
by Demivec

Re: Chr() support characters > $FFFF

Posted: Wed Oct 26, 2016 7:06 pm
by kenmo
Oh, nice code Demivec. I guess I never saw your post (or I didn't understand UTF-16 surrogate pairs back then).

I wrote my own ChrU() and AscU() procedures like yours, but I did not think to override the native Chr() and Asc() with a macro.
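
For anyone following along, the AscU() direction is just the reverse of ChrU(). A rough sketch, assuming a Unicode build (2-byte characters); this is an illustration I'm writing from the UTF-16 rules, not Demivec's code and not necessarily identical to my own version:

```purebasic
; Hypothetical AscU(): returns the codepoint at the start of a UTF-16 string.
; Assumes Unicode mode, so each string unit is 2 bytes.
Procedure.i AscU(Text.s)
  Protected High.i, Low.i
  If Text = ""
    ProcedureReturn 0
  EndIf
  High = PeekU(@Text)
  If High >= $D800 And High <= $DBFF
    ; Possible high surrogate: look at the next unit
    ; (at worst this reads the terminating null, which is not a low surrogate)
    Low = PeekU(@Text + 2)
    If Low >= $DC00 And Low <= $DFFF
      ProcedureReturn $10000 + ((High - $D800) << 10) + (Low - $DC00)
    EndIf
  EndIf
  ; Plain 16-bit character, or an unpaired surrogate returned as-is
  ProcedureReturn High
EndProcedure

; Build the RUNNER pair with PokeU (as in Method 1) and decode it again
Pair$ = Space(2)
PokeU(@Pair$,     $D83C)
PokeU(@Pair$ + 2, $DFC3)
Debug Hex(AscU(Pair$)) ; 1F3C3
```

Returning an unpaired surrogate unchanged is a design choice in this sketch; returning $FFFD or -1 would be just as defensible.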

Anyway, at a minimum I think Chr() should allow surrogate pairs in the preprocessor (constants) like they are allowed at runtime (vars).

Code: Select all

#High = $D83C
#Low  = $DFC3
 High = #High
 Low  = #Low

Debug Chr(#High) + Chr(#Low)
Debug Chr( High) + Chr( Low)

On a side note, the preprocessor doesn't allow Chr(#SurrogateValue), but why does it map it to TWO replacement chars (U+FFFD)?

Code: Select all

#BadCodepoint = $D83C
Str$ = Chr(#BadCodepoint)
Debug Len(Str$)
ShowMemoryViewer(@Str$, 6) ; why two $FFFD replacement chars?