All times are UTC + 1 hour

 Post subject: Replace UCS-2 with UTF-16
PostPosted: Fri Sep 11, 2015 4:57 pm 
User

Joined: Sat Jan 12, 2008 11:50 pm
Posts: 44
Location: Hessen, Germany
Hello,

I suggest changing PB's internal string representation from UCS-2 to UTF-16. The problem right now is that an application compiled as "Unicode" does not support all of Unicode. UCS-2 can only encode the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), while UTF-16 can encode the complete range of Unicode code points.

The change isn't as big as it seems, neither for the users of PureBasic nor for its developers. The two encodings are compatible: every UCS-2 string can be interpreted as UTF-16 without loss of information.
As the following example shows, UTF-16 is already partly supported (tested on Windows, whose API expects UTF-16; prior to Windows 2000, UCS-2 was used internally):
Code:
*Buffer = AllocateMemory(6)

PokeA(*Buffer+0, $53) ; high surrogate $D853, low byte
PokeA(*Buffer+1, $D8) ; high surrogate $D853, high byte
PokeA(*Buffer+2, $5C) ; low surrogate $DF5C, low byte
PokeA(*Buffer+3, $DF) ; low surrogate $DF5C, high byte

Debug PeekS(*Buffer, -1, #PB_Unicode)

Compiled as a Unicode executable, this code should output the character U+24F5C. If you only see a rectangle or something else, the font used in the debug window doesn't contain this code point. In that case, just copy and paste the debug output into another editor.
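
For reference, the four bytes poked above follow from splitting the code point into a surrogate pair and storing each 16-bit code unit low byte first. A minimal sketch of that arithmetic, assuming a code point above $FFFF (the variable names are just for illustration):
Code:
CodePoint.l = $24F5C
Offset.l = CodePoint - $10000      ; 20-bit offset into the supplementary planes ($14F5C)
High.l = $D800 + (Offset >> 10)    ; upper 10 bits -> high surrogate
Low.l = $DC00 + (Offset & $3FF)    ; lower 10 bits -> low surrogate

Debug Hex(High) ; D853, stored in memory as $53 $D8
Debug Hex(Low)  ; DF5C, stored in memory as $5C $DF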

Even peeking a UTF-8 string returns a valid UTF-16 string in Unicode mode (on Windows):
Code:
*Buffer = AllocateMemory(5)

PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)

Debug PeekS(*Buffer, -1, #PB_UTF8) ; (U+24F5C) (Which is internally represented by $D853 $DF5C, a surrogate pair)


The same thing works in the other direction. If you have a UTF-16 string, PokeS() writes the correct UTF-8 representation instead of encoding each surrogate separately.
Code:
*Buffer = AllocateMemory(5)
PokeA(*Buffer+0, $F0)
PokeA(*Buffer+1, $A4)
PokeA(*Buffer+2, $BD)
PokeA(*Buffer+3, $9C)
String.s = PeekS(*Buffer, -1, #PB_UTF8) ; (U+24F5C)
; Had to do it this way, the forum didn't let me insert characters outside of the Basic Multilingual Plane (BMP).

FreeMemory(*Buffer) ; release the first buffer before the pointer is reused
*Buffer = AllocateMemory(100)

Debug PokeS(*Buffer, String, -1, #PB_UTF8) ; Should output 4

Debug Hex(PeekA(*Buffer+0)) ; F0
Debug Hex(PeekA(*Buffer+1)) ; A4
Debug Hex(PeekA(*Buffer+2)) ; BD
Debug Hex(PeekA(*Buffer+3)) ; 9C
Debug Hex(PeekA(*Buffer+4)) ; 00
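
The bytes F0 A4 BD 9C are what the standard 4-byte UTF-8 encoding produces once the surrogate pair has been combined back into one code point. A rough sketch of both steps, assuming a valid high/low pair as input (the variable names are illustrative, and this is not meant to describe PB's internal implementation):
Code:
High.l = $D853
Low.l = $DF5C
; Combine the two surrogates into one code point ($24F5C)
CodePoint.l = $10000 + ((High - $D800) << 10) + (Low - $DC00)

; 4-byte UTF-8 form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Debug Hex($F0 | (CodePoint >> 18))         ; F0
Debug Hex($80 | ((CodePoint >> 12) & $3F)) ; A4
Debug Hex($80 | ((CodePoint >> 6) & $3F))  ; BD
Debug Hex($80 | (CodePoint & $3F))         ; 9C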


What needs to be done?
To make PB fully UTF-16 compatible, the following things have to be changed:
  • Chr() and Asc(), see this example:
    Code:
    *Buffer = AllocateMemory(6)

    PokeA(*Buffer+0, $53) ; high surrogate
    PokeA(*Buffer+1, $D8) ; high surrogate
    PokeA(*Buffer+2, $5C) ; low surrogate
    PokeA(*Buffer+3, $DF) ; low surrogate

    Debug Asc(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output 151388 (U+24F5C)
    Debug Chr(151388) ; should output (U+24F5C)
    This can be done easily: a code point outside of the BMP always results in exactly one surrogate pair (two 16-bit code units). The calculation needed is explained here: https://en.wikipedia.org/wiki/UTF-16.
  • LCase() and UCase(), see this example:
    Code:
    *Buffer = AllocateMemory(6)

    PokeA(*Buffer+0, $06) ; high surrogate
    PokeA(*Buffer+1, $D8) ; high surrogate
    PokeA(*Buffer+2, $A0) ; low surrogate
    PokeA(*Buffer+3, $DC) ; low surrogate

    Debug PeekS(*Buffer, -1, #PB_Unicode) ; Should output (U+118A0)
    Debug UCase(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output (U+118A0) (It is already uppercase)
    Debug LCase(PeekS(*Buffer, -1, #PB_Unicode)) ; Should output (U+118C0)
    I don't know how complex this change is internally. The operating systems should provide functions that do this correctly. Even if they don't, the existing functions would continue to work for the BMP without any changes.
  • All commands with a length or position measured in characters have to be corrected (LSet(), RSet(), FindString(), InsertString(), Left(), Right(), Mid(), Len(), RemoveString(), ReplaceString(), ReadString(), WriteString(), PeekS(), PokeS(), MemoryStringLength(); any others I forgot?). I don't know how complex the changes are for these functions either, but the most important ones are Len() and MemoryStringLength(). The length in characters can be determined easily by counting each surrogate pair as one character. Here is an example of how it should work:
    Code:
    *Buffer = AllocateMemory(5)
    PokeA(*Buffer+0, $F0)
    PokeA(*Buffer+1, $A4)
    PokeA(*Buffer+2, $BD)
    PokeA(*Buffer+3, $9C)
    String.s = PeekS(*Buffer, -1, #PB_UTF8) + "a" ; (U+24F5C)
    ; Had to do it this way, the forum didn't let me insert characters outside of the BMP

    Debug Len(String) ; Should output 2
    Debug MemoryStringLength(@String) ; Should output 2
    Debug PeekS(@String, 1) ; Should output (U+24F5C)
    Debug PeekS(@String, 2) ; Should output (U+24F5C)a

  • Make the newly introduced #PB_ByteLength flag work with #PB_Unicode. With UTF-16, the number of characters (code points) can be smaller than the number of 16-bit code units, so a byte length can end in the middle of a surrogate pair. http://www.purebasic.fr/english/viewtopic.php?f=4&t=62981
    Code:
    *Buffer = AllocateMemory(5)
    PokeA(*Buffer+0, $F0)
    PokeA(*Buffer+1, $A4)
    PokeA(*Buffer+2, $BD)
    PokeA(*Buffer+3, $9C)
    String.s = PeekS(*Buffer, -1, #PB_UTF8) + "a" ; (U+24F5C)
    ; Had to do it this way, the forum didn't let me insert characters outside of the BMP

    Debug PeekS(@String, 2, #PB_Unicode | #PB_ByteLength) ; Output undefined (Half surrogate pair), maybe (U+D853) or � (U+FFFD)
    Debug PeekS(@String, 4, #PB_Unicode | #PB_ByteLength) ; Should output (U+24F5C)
    Debug PeekS(@String, 6, #PB_Unicode | #PB_ByteLength) ; Should output (U+24F5C)a
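
Counting characters instead of code units could look roughly like this. It is only a sketch: CodePointCount() is a made-up helper, not a PB function, and it assumes a well-formed, null-terminated UTF-16 buffer.
Code:
Procedure.i CodePointCount(*String)
  Protected *Unit.Unicode = *String
  Protected Count.i = 0
  While *Unit\u <> 0
    ; A low surrogate ($DC00-$DFFF) is the second half of a pair, so it is not counted again
    If *Unit\u < $DC00 Or *Unit\u > $DFFF
      Count + 1
    EndIf
    *Unit + SizeOf(Unicode) ; advance by one 16-bit code unit
  Wend
  ProcedureReturn Count
EndProcedure

*Buffer = AllocateMemory(8) ; zero-filled, so the last two bytes act as terminator
PokeU(*Buffer+0, $D853) ; high surrogate
PokeU(*Buffer+2, $DF5C) ; low surrogate
PokeU(*Buffer+4, 'a')

Debug CodePointCount(*Buffer) ; 2 (three code units, two characters)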

Links:
https://en.wikipedia.org/wiki/UTF-16
http://hackipedia.org/Character%20sets/Unicode,%20UTF%20and%20UCS%20encodings/UCS-2.htm

I had to remove some of the characters in this post because I got database errors when I tried to submit it.
These characters were:

Any ideas or suggestions? Is there anything important I forgot? And can someone test how these code snippets work on other operating systems?

 Post subject: Re: Replace UCS-2 with UTF-16
PostPosted: Fri Sep 11, 2015 10:36 pm 
Always Here

Joined: Fri Oct 23, 2009 2:33 am
Posts: 6262
Location: Wales, UK
:shock: I am surprised, I thought PB would be UTF-16 under the hood. There is no doubt that if your app is to be Unicode, then UTF-16 has to be fully supported if the User inputs string data. Perhaps there are other technical reasons why currently this is not the case? Compiler issues? Other OS issues?

_________________
IdeasVacuum
If it sounds simple, you have not grasped the complexity.


 Post subject: Re: Replace UCS-2 with UTF-16
PostPosted: Fri Sep 11, 2015 11:10 pm 
PureBasic Team

Joined: Fri Apr 25, 2003 5:21 pm
Posts: 5815
Location: Germany
PB supports UTF-16 in the same way that Java does: A surrogate pair simply counts as two characters in string functions. In my experience, this is close enough for almost all cases because situations in which they need to be treated as a single character are pretty rare (the use of code-points outside of the BMP is pretty exotic in itself).

_________________
quidquid Latine dictum sit altum videtur

