Replace UCS-2 with UTF-16

Posted: Fri Sep 11, 2015 4:57 pm
by Dadido3
Hello,

I suggest changing PB's internal string representation from UCS-2 to UTF-16. The problem right now is that an application compiled as "Unicode" doesn't fully support Unicode. UCS-2 can only encode the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), whereas UTF-16 can encode the complete range of Unicode code points by using surrogate pairs.

The change isn't as big as it seems, neither for the users of PureBasic nor for its developers. Every valid UCS-2 string is also a valid UTF-16 string, so existing UCS-2 strings can be interpreted as UTF-16 without loss of information.
As the following example shows, UTF-16 is already partly supported (tested on Windows, whose API expects UTF-16; prior to Windows 2000, UCS-2 was used internally):

Code: Select all

*Buffer = AllocateMemory(6)

PokeA(*Buffer+0, $53) ; high surrogate $D853, low byte (little-endian)
PokeA(*Buffer+1, $D8) ; high surrogate $D853, high byte
PokeA(*Buffer+2, $5C) ; low surrogate $DF5C, low byte
PokeA(*Buffer+3, $DF) ; low surrogate $DF5C, high byte

Debug PeekS(*Buffer, -1, #PB_Unicode)
Compiled as a Unicode executable, this code should output the character (U+24F5C). If you only see a rectangle or something else, the font used in the debug window doesn't contain a glyph for this code point. To check, copy and paste the debug output into another editor.
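
For reference, the mapping from the two code units to the code point can be sketched as follows. This is only an illustration: SurrogatePairToCodePoint() is a made-up helper name, not a PureBasic command, and it does not validate its inputs.

```purebasic
; Sketch: combine a UTF-16 surrogate pair into one code point.
; SurrogatePairToCodePoint() is a hypothetical helper, not part of PB.
Procedure.i SurrogatePairToCodePoint(High.i, Low.i)
  ; Each surrogate carries 10 bits of the value (code point - $10000)
  ProcedureReturn $10000 + ((High - $D800) << 10) + (Low - $DC00)
EndProcedure

*Pair = AllocateMemory(4)
PokeU(*Pair + 0, $D853) ; high surrogate
PokeU(*Pair + 2, $DF5C) ; low surrogate

Debug Hex(SurrogatePairToCodePoint(PeekU(*Pair + 0), PeekU(*Pair + 2))) ; 24F5C
FreeMemory(*Pair)
```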

In Unicode mode (on Windows), even peeking a UTF-8 string returns a valid UTF-16 string:

Code: Select all

*Buffer = AllocateMemory(5)

PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)

Debug PeekS(*Buffer, -1, #PB_UTF8) ; (U+24F5C) (Which is internally represented by $D853 $DF5C, a surrogate pair)
The same thing works in the other direction. If you have a UTF-16 string, PokeS() writes the correct 4-byte UTF-8 representation instead of encoding each surrogate separately.

Code: Select all

*Buffer = AllocateMemory(5)
PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)
String.s = PeekS(*Buffer, -1, #PB_UTF8) ; (U+24F5C)
; Had to do it this way, the forum didn't let me insert characters outside of the Basic Multilingual Plane (BMP).

*Buffer = AllocateMemory(100)

Debug PokeS(*Buffer, String, -1, #PB_UTF8) ; Should output 4

Debug Hex(PeekA(*Buffer+0)) ; F0
Debug Hex(PeekA(*Buffer+1)) ; A4
Debug Hex(PeekA(*Buffer+2)) ; BD
Debug Hex(PeekA(*Buffer+3)) ; 9C
Debug Hex(PeekA(*Buffer+4)) ; 00
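
That the four UTF-8 bytes and the surrogate pair describe the same character can be checked by hand. A rough sketch, with a hypothetical helper name and no validation of the input bytes:

```purebasic
; Sketch: decode one 4-byte UTF-8 sequence into a code point.
; Utf8FourBytesToCodePoint() is a hypothetical helper, not part of PB.
Procedure.i Utf8FourBytesToCodePoint(*Mem)
  Protected.i b0, b1, b2, b3
  b0 = PeekA(*Mem + 0) & $07 ; 11110xxx: 3 payload bits
  b1 = PeekA(*Mem + 1) & $3F ; 10xxxxxx: 6 payload bits
  b2 = PeekA(*Mem + 2) & $3F
  b3 = PeekA(*Mem + 3) & $3F
  ProcedureReturn (b0 << 18) | (b1 << 12) | (b2 << 6) | b3
EndProcedure

*Utf8 = AllocateMemory(4)
PokeA(*Utf8 + 0, $F0)
PokeA(*Utf8 + 1, $A4)
PokeA(*Utf8 + 2, $BD)
PokeA(*Utf8 + 3, $9C)

CodePoint.i = Utf8FourBytesToCodePoint(*Utf8)
Debug Hex(CodePoint) ; 24F5C

; Split the same code point into its UTF-16 surrogate pair
Offset.i = CodePoint - $10000
Debug Hex($D800 + (Offset >> 10))  ; D853
Debug Hex($DC00 + (Offset & $3FF)) ; DF5C

FreeMemory(*Utf8)
```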
What needs to be done?
To make PB fully UTF-16 compatible, the following things have to be changed:
  • Chr() and Asc(), see this example:

    Code: Select all

    *Buffer = AllocateMemory(6)
    
    PokeA(*Buffer+0, $53) ; high surrogate $D853, low byte (little-endian)
    PokeA(*Buffer+1, $D8) ; high surrogate $D853, high byte
    PokeA(*Buffer+2, $5C) ; low surrogate $DF5C, low byte
    PokeA(*Buffer+3, $DF) ; low surrogate $DF5C, high byte
    
    Debug Asc(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output 151388 (U+24F5C)
    Debug Chr(151388) ; should output (U+24F5C)
    This can be done easily: a code point outside of the BMP always results in exactly one surrogate pair, i.e. two 16-bit code units. The calculation needed is explained here: https://en.wikipedia.org/wiki/UTF-16.
  • LCase() and UCase() See this:

    Code: Select all

    *Buffer = AllocateMemory(6)
    
    PokeA(*Buffer+0, $06) ; high surrogate $D806, low byte (little-endian)
    PokeA(*Buffer+1, $D8) ; high surrogate $D806, high byte
    PokeA(*Buffer+2, $A0) ; low surrogate $DCA0, low byte
    PokeA(*Buffer+3, $DC) ; low surrogate $DCA0, high byte
    
    Debug PeekS(*Buffer, -1, #PB_Unicode) ; Should output (U+118A0)
    Debug UCase(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output (U+118A0) (It is already uppercase)
    Debug LCase(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output (U+118C0)
    I don't know how complex this change is internally. The different operating systems should provide functions that do this correctly. But even if they don't, the existing functions will keep working for characters in the BMP.
  • All commands with a length or position measured in characters have to be corrected (LSet(), RSet(), FindString(), InsertString(), Left(), Right(), Mid(), Len(), RemoveString(), ReplaceString(), ReadString(), WriteString(), PeekS(), PokeS(), MemoryStringLength(); any others I forgot?). I also don't know how complex the changes are for these functions, but the most important ones are Len() and MemoryStringLength(). Determining the length in characters is easy: check for surrogate pairs and count each pair as one character. Here is an example of how it should work:

    Code: Select all

    *Buffer = AllocateMemory(5)
    PokeA(*Buffer+0, $F0)
    PokeA(*Buffer+1, $A4)
    PokeA(*Buffer+2, $BD)
    PokeA(*Buffer+3, $9C)
    String.s = PeekS(*Buffer, -1, #PB_UTF8) + "a" ; (U+24F5C)
    ; Had to do it this way, the forum didn't let me insert characters outside of the BMP
    
    Debug Len(String) ; Should output 2
    Debug MemoryStringLength(@String) ; Should output 2
    Debug PeekS(@String, 1) ; Should output (U+24F5C)
    Debug PeekS(@String, 2) ; Should output (U+24F5C)a
  • Make the newly introduced #PB_ByteLength flag work with #PB_Unicode. With UTF-16, the number of characters (code points) can be smaller than the number of 16-bit code units. http://www.purebasic.fr/english/viewtop ... =4&t=62981

    Code: Select all

    *Buffer = AllocateMemory(5)
    PokeA(*Buffer+0, $F0)
    PokeA(*Buffer+1, $A4)
    PokeA(*Buffer+2, $BD)
    PokeA(*Buffer+3, $9C)
    String.s = PeekS(*Buffer, -1, #PB_UTF8) + "a" ; (U+24F5C)
    ; Had to do it this way, the forum didn't let me insert characters outside of the BMP
    
    Debug PeekS(@String, 2, #PB_Unicode | #PB_ByteLength) ; Output undefined (Half surrogate pair), maybe (U+D853) or � (U+FFFD)
    Debug PeekS(@String, 4, #PB_Unicode | #PB_ByteLength) ; Should output (U+24F5C)
    Debug PeekS(@String, 6, #PB_Unicode | #PB_ByteLength) ; Should output (U+24F5C)a
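
The surrogate calculation for Chr() and the length check described in the list above can be sketched in plain PB code. CodePointToString() and CodePointLength() are hypothetical names, not existing commands, and the sketch assumes a Unicode build in which Len(), Mid() and Asc() operate on 16-bit code units (the current behaviour).

```purebasic
; Sketch: a Chr()-like helper that accepts code points above $FFFF.
; Hypothetical helper, not part of PB. No range validation is done.
Procedure.s CodePointToString(CodePoint.i)
  Protected Result.s, *Mem
  If CodePoint < $10000
    ProcedureReturn Chr(CodePoint) ; BMP: a single code unit
  EndIf
  CodePoint = CodePoint - $10000
  *Mem = AllocateMemory(6) ; two code units plus terminator, zero-filled
  PokeU(*Mem + 0, $D800 + (CodePoint >> 10))  ; high surrogate
  PokeU(*Mem + 2, $DC00 + (CodePoint & $3FF)) ; low surrogate
  Result = PeekS(*Mem, 2, #PB_Unicode)
  FreeMemory(*Mem)
  ProcedureReturn Result
EndProcedure

; Sketch: a Len()-like helper that counts a surrogate pair as one character.
; Assumes Len(), Mid() and Asc() work on 16-bit code units.
Procedure.i CodePointLength(Text.s)
  Protected i.i, Unit.i, Count.i
  For i = 1 To Len(Text)
    Unit = Asc(Mid(Text, i, 1))
    ; Count every code unit except low surrogates ($DC00..$DFFF),
    ; so each pair contributes exactly one to the total
    If Unit < $DC00 Or Unit > $DFFF
      Count + 1
    EndIf
  Next
  ProcedureReturn Count
EndProcedure

Text.s = CodePointToString($24F5C) + "a"
Debug CodePointLength(Text) ; 2 under the assumptions above
```

Walking the string with Mid() is slow for long strings; a real implementation would more likely scan the string memory directly.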
Links:
https://en.wikipedia.org/wiki/UTF-16
http://hackipedia.org/Character%20sets/ ... /UCS-2.htm

I had to remove some of the characters in this post because I got database errors when I tried to submit.
These characters were:

Any ideas or suggestions? Is there anything important I forgot? And can someone test how these code snippets work on other operating systems?

Re: Replace UCS-2 with UTF-16

Posted: Fri Sep 11, 2015 10:36 pm
by IdeasVacuum
:shock: I am surprised; I thought PB would be UTF-16 under the hood. There is no doubt that if your app is to be Unicode, UTF-16 has to be fully supported whenever the user inputs string data. Perhaps there are other technical reasons why this is currently not the case? Compiler issues? Other OS issues?

Re: Replace UCS-2 with UTF-16

Posted: Fri Sep 11, 2015 11:10 pm
by freak
PB supports UTF-16 in the same way that Java does: A surrogate pair simply counts as two characters in string functions. In my experience, this is close enough for almost all cases because situations in which they need to be treated as a single character are pretty rare (the use of code-points outside of the BMP is pretty exotic in itself).