Replace UCS-2 with UTF-16
Posted: Fri Sep 11, 2015 4:57 pm
Hello,
I suggest changing PB's internal string representation from UCS-2 to UTF-16. The problem right now is that an application compiled as "Unicode" doesn't fully support Unicode. UCS-2 can only encode the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), whereas UTF-16 can encode the complete range of Unicode code points.
The change isn't as big as it seems, neither for PureBasic's users nor for its developers. The two encodings are compatible: any UCS-2 string can be interpreted as UTF-16 without loss of information.
As the following example shows, UTF-16 is already partly supported. (Tested on Windows, whose API expects UTF-16; prior to Windows 2000, UCS-2 was used internally.)
Code: Select all
*Buffer = AllocateMemory(6)
PokeA(*Buffer+0, $53) ; high surrogate $D853, low byte (little-endian)
PokeA(*Buffer+1, $D8) ; high surrogate $D853, high byte
PokeA(*Buffer+2, $5C) ; low surrogate $DF5C, low byte
PokeA(*Buffer+3, $DF) ; low surrogate $DF5C, high byte
Debug PeekS(*Buffer, -1, #PB_Unicode)
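Compiled as a Unicode executable, this code should output the character U+24F5C. If you only see a rectangle or something else, that's because the font used in the debug window doesn't contain this code point; just copy and paste the debug output into another editor.
Even peeking a UTF-8 string returns a valid UTF-16 string in Unicode mode (on Windows):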
Code: Select all
*Buffer = AllocateMemory(5)
PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)
Debug PeekS(*Buffer, -1, #PB_UTF8) ; (U+24F5C) (Which is internally represented by $D853 $DF5C, a surrogate pair)
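For anyone wondering where $D853 and $DF5C come from, here is a minimal sketch of the surrogate arithmetic described in the UTF-16 article. The procedure names HighSurrogate(), LowSurrogate() and CombineSurrogates() are made up for this post and are not PureBasic built-ins; the sketch assumes the input is a code point above U+FFFF (or a valid pair) and does no error checking.

```purebasic
; Hypothetical helpers, not part of PureBasic.
; Assumes CodePoint is in the range $10000..$10FFFF.
Procedure.i HighSurrogate(CodePoint.i)
  ; Upper 10 bits of the 20-bit offset, added to $D800
  ProcedureReturn $D800 + ((CodePoint - $10000) >> 10)
EndProcedure

Procedure.i LowSurrogate(CodePoint.i)
  ; Lower 10 bits of the 20-bit offset, added to $DC00
  ProcedureReturn $DC00 + ((CodePoint - $10000) & $3FF)
EndProcedure

; Assumes High is in $D800..$DBFF and Low is in $DC00..$DFFF.
Procedure.i CombineSurrogates(High.i, Low.i)
  ProcedureReturn $10000 + ((High - $D800) << 10) + (Low - $DC00)
EndProcedure

Debug Hex(HighSurrogate($24F5C))            ; D853
Debug Hex(LowSurrogate($24F5C))             ; DF5C
Debug Hex(CombineSurrogates($D853, $DF5C))  ; 24F5C
```

A UTF-16 aware Chr() would need something like the first two procedures, and Asc() something like the third.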
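The conversion also works in the other direction: if you have a UTF-16 string, PokeS() writes the correct UTF-8 representation instead of encoding each surrogate separately.
Code: Select all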
*Buffer = AllocateMemory(5)
PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)
String.s = PeekS(*Buffer, -1, #PB_UTF8) ; (U+24F5C)
; Had to do it this way, the forum didn't let me insert characters outside of the Basic Multilingual Plane (BMP).
*Buffer = AllocateMemory(100)
Debug PokeS(*Buffer, String, -1, #PB_UTF8) ; Should output 4
Debug Hex(PeekA(*Buffer+0)) ; F0
Debug Hex(PeekA(*Buffer+1)) ; A4
Debug Hex(PeekA(*Buffer+2)) ; BD
Debug Hex(PeekA(*Buffer+3)) ; 9C
Debug Hex(PeekA(*Buffer+4)) ; 00
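To illustrate what "correct" means here, this is a rough sketch of how a surrogate pair turns into one 4-byte UTF-8 sequence. PairToUTF8() is a hypothetical name, not a PureBasic function, and I am not claiming this is what PokeS() does internally. It assumes a valid pair and an output buffer of at least 4 bytes.

```purebasic
; Hypothetical helper, not part of PureBasic.
; Combines a surrogate pair into one code point and writes it as a
; single 4-byte UTF-8 sequence. Returns the number of bytes written.
Procedure.i PairToUTF8(High.i, Low.i, *Out)
  Protected CodePoint.i = $10000 + ((High - $D800) << 10) + (Low - $DC00)
  PokeA(*Out + 0, $F0 | (CodePoint >> 18))          ; lead byte, top 3 bits
  PokeA(*Out + 1, $80 | ((CodePoint >> 12) & $3F))  ; next 6 bits
  PokeA(*Out + 2, $80 | ((CodePoint >> 6) & $3F))   ; next 6 bits
  PokeA(*Out + 3, $80 | (CodePoint & $3F))          ; last 6 bits
  ProcedureReturn 4
EndProcedure

*Bytes = AllocateMemory(4)
Debug PairToUTF8($D853, $DF5C, *Bytes) ; 4
Debug Hex(PeekA(*Bytes + 0)) ; F0
Debug Hex(PeekA(*Bytes + 1)) ; A4
Debug Hex(PeekA(*Bytes + 2)) ; BD
Debug Hex(PeekA(*Bytes + 3)) ; 9C
FreeMemory(*Bytes)
```

Encoding each surrogate on its own would instead produce two 3-byte sequences, six bytes in total, which is not valid UTF-8.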
What needs to be done?
To make PB fully UTF-16 compatible, the following things have to be changed:
- Chr() and Asc(): This can be done easily, because a code point outside of the BMP always results in exactly one surrogate pair (two 16-bit code units). The calculation needed is explained here: https://en.wikipedia.org/wiki/UTF-16. See this example:
Code: Select all
*Buffer = AllocateMemory(6)
PokeA(*Buffer+0, $53) ; high surrogate
PokeA(*Buffer+1, $D8) ; high surrogate
PokeA(*Buffer+2, $5C) ; low surrogate
PokeA(*Buffer+3, $DF) ; low surrogate
Debug Asc(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output 151388 (U+24F5C)
Debug Chr(151388) ; Should output (U+24F5C)
- LCase() and UCase(): I don't know how complex this change is internally. The different operating systems should provide functions to do this correctly. But even without changes, these functions should keep working for the BMP. See this example:
Code: Select all
*Buffer = AllocateMemory(6)
PokeA(*Buffer+0, $06) ; high surrogate
PokeA(*Buffer+1, $D8) ; high surrogate
PokeA(*Buffer+2, $A0) ; low surrogate
PokeA(*Buffer+3, $DC) ; low surrogate
Debug PeekS(*Buffer, -1, #PB_Unicode) ; Should output (U+118A0)
Debug UCase(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output (U+118A0) (It is already uppercase)
Debug LCase(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output (U+118C0)
- All commands with a length or position measured in characters have to be corrected. (LSet(), RSet(), FindString(), InsertString(), Left(), Right(), Mid(), Len(), RemoveString(), ReplaceString(), ReadString(), WriteString(), PeekS(), PokeS(), MemoryStringLength(), any others I forgot?) I don't know how complex the changes are for these functions either, but the most important ones are Len() and MemoryStringLength(). Determining the length in characters is easy: it only requires checking for surrogate pairs. Here is an example of how it should work:
Code: Select all
*Buffer = AllocateMemory(5)
PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)
String.s = PeekS(*Buffer, -1, #PB_UTF8) + "a" ; (U+24F5C)
; Had to do it this way, the forum didn't let me insert characters outside of the BMP
Debug Len(String) ; Should output 2
Debug MemoryStringLength(@String) ; Should output 2
Debug PeekS(@String, 1) ; Should output (U+24F5C)
Debug PeekS(@String, 2) ; Should output (U+24F5C)a
- Make the newly introduced #PB_ByteLength flag work with #PB_Unicode. With UTF-16, the number of characters (code points) can be less than the number of 16-bit code units. http://www.purebasic.fr/english/viewtop ... =4&t=62981
Code: Select all
*Buffer = AllocateMemory(5)
PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)
String.s = PeekS(*Buffer, -1, #PB_UTF8) + "a" ; (U+24F5C)
; Had to do it this way, the forum didn't let me insert characters outside of the BMP
Debug PeekS(@String, 2, #PB_Unicode | #PB_ByteLength) ; Output undefined (Half surrogate pair), maybe (U+D853) or � (U+FFFD)
Debug PeekS(@String, 4, #PB_Unicode | #PB_ByteLength) ; Should output (U+24F5C)
Debug PeekS(@String, 6, #PB_Unicode | #PB_ByteLength) ; Should output (U+24F5C)a
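The surrogate check for Len() and MemoryStringLength() could look roughly like the sketch below. CodePointLength() is a hypothetical name used only for illustration, and I don't know how the real functions are implemented. It assumes a null-terminated, well-formed UTF-16 buffer; a stray low surrogate without a preceding high surrogate would simply not be counted.

```purebasic
; Hypothetical helper, not part of PureBasic.
; Counts code points in a null-terminated UTF-16 buffer by counting
; every code unit that is not a low surrogate ($DC00..$DFFF).
Procedure.i CodePointLength(*String.Unicode)
  Protected Count.i = 0
  While *String\u <> 0
    If *String\u < $DC00 Or *String\u > $DFFF
      Count + 1 ; BMP character or high surrogate: starts a new character
    EndIf
    *String + SizeOf(Unicode) ; advance by one 16-bit code unit
  Wend
  ProcedureReturn Count
EndProcedure

*Text = AllocateMemory(8)  ; zero-filled, so the terminator is already there
PokeU(*Text + 0, $D853)    ; high surrogate of U+24F5C
PokeU(*Text + 2, $DF5C)    ; low surrogate of U+24F5C
PokeU(*Text + 4, 'a')
Debug CodePointLength(*Text) ; 2 (three code units, two characters)
FreeMemory(*Text)
```

The position-based commands (Left(), Mid(), and so on) could use the same test to avoid cutting a pair in half.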
https://en.wikipedia.org/wiki/UTF-16
http://hackipedia.org/Character%20sets/ ... /UCS-2.htm
I had to remove some of the characters in this post, because I got database errors when I tried to submit.
These characters were:
Any ideas or suggestions? Is there anything important I forgot? And can someone test how these code snippets behave on other operating systems?