Question about UCS2

coco2
Enthusiast
Posts: 368
Joined: Mon Nov 25, 2013 5:38 am
Location: Australia

Post by coco2 »

I read that PureBasic uses UCS-2 encoding, which uses two bytes per character. That allows at most 65,536 characters, but according to Wikipedia there are 136,755 Unicode characters. How does it account for the extras?
cas
Enthusiast
Posts: 597
Joined: Mon Nov 03, 2008 9:56 pm

Re: Question about UCS2

Post by cas »

Dadido3
User
Posts: 52
Joined: Sat Jan 12, 2008 11:50 pm
Location: Hessen, Germany

Re: Question about UCS2

Post by Dadido3 »

On Windows, strings are interpreted as UTF-16, which is essentially an extension of UCS-2. That means a pair of special code points (a surrogate pair) is used to represent each character above 65,535 (U+FFFF), i.e. anything that doesn't fit in 16 bits.
These surrogate code points are reserved in the Unicode code space anyway, so the only difference between UCS-2 and UTF-16 is that the latter can use them this way and the former doesn't/can't. Theoretically you could even encode the surrogates themselves in UTF-8, but that's probably the worst thing to do, as UTF-8 can encode the higher code points directly.
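
To make the surrogate-pair idea concrete, here is a small sketch of the standard UTF-16 arithmetic. It is written in Python rather than PureBasic, purely to illustrate the encoding; the function name to_surrogate_pair is made up for this example and is not part of any PB library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600 is an emoji above U+FFFF
utf16_bytes = "\U0001F600".encode("utf-16-le")  # two 16-bit code units
utf8_bytes = "\U0001F600".encode("utf-8")       # encoded directly, no surrogates
print(hex(high), hex(low), utf16_bytes.hex(), utf8_bytes.hex())
```

The UTF-16 bytes are just the two surrogates stored one after the other, while the UTF-8 bytes encode the code point directly without any surrogates.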

What that means for PureBasic is that it handles these surrogate code points like any other 16-bit character. So they will be displayed correctly on Windows as long as the fonts can display them (I don't know if it works similarly on Linux or macOS, because as far as I know they use UTF-32 internally).
But some PB functions may not give correct results, such as getting the number of characters. Also, cutting a string with Mid() or a similar function may make those characters invalid if the cut happens to fall between the two halves of a surrogate pair. Searching and replacing should work fine, though.
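
The counting and cutting problems can be shown outside PureBasic. The Python sketch below only simulates a string stored as 16-bit units; it does not call Len() or Mid(), and all variable names are illustrative assumptions.

```python
text = "a\U0001F600b"                 # 'a', one character above U+FFFF, 'b'
raw = text.encode("utf-16-le")

# View the string the way a 16-bit-per-character runtime would: as code units.
units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

code_points = len(text)               # what a user would call "characters"
code_units = len(units)               # what a per-16-bit length function counts

# Cut after two code units: this lands between the two halves of the pair.
cut_bytes = b"".join(u.to_bytes(2, "little") for u in units[:2])
try:
    cut_bytes.decode("utf-16-le")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False              # the trailing high surrogate is unpaired

print(code_points, code_units, cut_is_valid)
```

The length differs by one for every character above U+FFFF, and the cut string ends in a lone high surrogate, which a strict UTF-16 decoder rejects.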

So if you only need to read/write/display strings with these characters, it works fine (on Windows at least; I don't have much experience with other OSes in this regard). But if you need more, you may have to use the API of your OS or an external library.
kenmo
Addict
Posts: 1967
Joined: Tue Dec 23, 2003 3:54 am

Re: Question about UCS2

Post by kenmo »

Yep, look up "surrogate pairs". Characters > 0xFFFF are encoded as two 16-bit values in the reserved 0xD800 to 0xDFFF range.

Here are some examples:
www.purebasic.fr/english/viewtopic.php?f=3&t=66836
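
Going the other way, a list of 16-bit values can be turned back into code points by spotting values in the 0xD800 to 0xDFFF range. This is a rough Python sketch, not code from the linked thread; the helper names are hypothetical, and passing unpaired surrogates through unchanged is just one possible policy.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def decode_units(units):
    """Convert 16-bit code units to code points.

    A high surrogate followed by a low surrogate becomes one code point;
    anything else, including an unpaired surrogate, is passed through as-is.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            combined = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            code_points.append(combined)
            i += 2                     # consumed both halves of the pair
        else:
            code_points.append(unit)
            i += 1
    return code_points


decoded = decode_units([0x61, 0xD83D, 0xDE00, 0x62])
print([hex(cp) for cp in decoded])
```

The length of the returned list is the number of code points, which is the count a per-16-bit length function would get wrong.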