Unicode oddities

Posted: Fri Mar 27, 2015 3:56 pm
by Roger Hågensen
To avoid polluting the other thread on Unicode, I started this one.

Code:

;16 characters (grapheme clusters)
text$="Приве́т नमस्ते שָׁלוֹם"

Debug Len(text$) ;22 codepoints
Debug StringByteLength(text$) ;44 bytes
Debug StringByteLength(text$, #PB_UTF8) ;48 bytes

text$=text$+#CRLF$+UCase(text$)+#CRLF$+LCase(text$)

MessageRequester("Test", text$)
Three different words, three different languages.

To my eyes it looks like 16 characters in total (14 symbols and 2 spaces).
As stated on http://utf8everywhere.org/, that string actually consists of 22 code points but only 16 grapheme clusters.
For cursor movement, text selection, and the like, grapheme clusters should be used.
In the latest Firefox this seems to hold: the cursor moves 16 places.
But in the PB IDE (v5.31) the cursor moves 22 places.
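
The numbers above can be cross-checked outside PureBasic. The following is a minimal sketch in Python (chosen only because its standard library ships a Unicode database; it is not PureBasic code). The helper `approx_grapheme_count` is a made-up name, and skipping combining marks is only a rough approximation of real grapheme cluster segmentation (Unicode UAX #29), not an implementation of it. It does reproduce the count of 16 quoted above for this particular string.

```python
import unicodedata

# Same text as in the PureBasic snippet, written with escapes so the
# exact code points are unambiguous.
text = (
    "\u041f\u0440\u0438\u0432\u0435\u0301\u0442 "   # Cyrillic word with a combining acute accent
    "\u0928\u092e\u0938\u094d\u0924\u0947 "         # Devanagari word
    "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"    # Hebrew word with vowel points
)

def approx_grapheme_count(s):
    """Count code points that are not combining marks (categories Mn, Mc, Me).

    Assumption: this is a rough stand-in for grapheme cluster
    segmentation, good enough for this example only.
    """
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

code_points = len(text)                        # what Len() reports here
utf16_bytes = len(text.encode("utf-16-le"))    # 2 bytes per code point (all in the BMP)
utf8_bytes = len(text.encode("utf-8"))
clusters = approx_grapheme_count(text)

print(code_points, utf16_bytes, utf8_bytes, clusters)  # 22 44 48 16
```

The six code points that are skipped are the combining acute accent, the Devanagari virama and vowel sign, and the three Hebrew points, which is where the difference between 22 and 16 comes from.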


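The first word is correctly upper- and lowercased in the requester, while the other two words are left unchanged. This is most likely not an issue with the locales installed on the system: the Devanagari and Hebrew scripts have no upper/lower case distinction, so there is nothing to convert.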
On your system this may end up behaving differently.
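
One possible explanation for the unchanged words is that their scripts simply have no letter case. A small Python sketch (again not PureBasic; `has_case` is a hypothetical helper name) checks whether case conversion changes each word at all:

```python
# The three words from the snippet above, keyed by script.
words = {
    "Cyrillic": "\u041f\u0440\u0438\u0432\u0435\u0301\u0442",
    "Devanagari": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "Hebrew": "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd",
}

def has_case(word):
    """True if upper- or lowercasing changes the word at all."""
    return word.upper() != word or word.lower() != word

case_report = {script: has_case(word) for script, word in words.items()}
print(case_report)  # {'Cyrillic': True, 'Devanagari': False, 'Hebrew': False}
```

If UCase() and LCase() follow the Unicode case mappings, the same outcome would be expected in PureBasic regardless of installed locales, though that is an assumption about PB's implementation.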

It would be interesting to hear about other Unicode oddities that people find.

Re: Unicode oddities

Posted: Fri Mar 27, 2015 6:31 pm
by Little John
Roger Hågensen wrote: To avoid polluting the other thread on Unicode, I started this one.
Thank you. :-)
However, even when Unicode oddities are not specific to PureBasic, it's good to be aware of them when using Unicode with PureBasic.
I'll post a link in the other thread that points to this one.

Here is an oddity similar to the one above, but simpler:

Code:

Debug Chr($006E) + Chr($0303)   ; -> ñ  (n + combining tilde, 2 code points)
Debug Chr($00F1)                ; -> ñ  (precomposed, 1 code point)
Anyone who gets a different result on her/his system, please see here.

Stuff like this makes it hard to reliably search for, compare, and sort Unicode strings.
That's why IMHO built-in Unicode normalization in PB would be very useful.
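
To illustrate what normalization would buy us, here is a minimal sketch in Python, whose standard `unicodedata` module provides it (this is not PureBasic and not a proposal for how PB should expose it). The two spellings of ñ compare as different until both are brought to the same normalization form:

```python
import unicodedata

decomposed = "\u006e\u0303"   # n + COMBINING TILDE (2 code points)
precomposed = "\u00f1"        # LATIN SMALL LETTER N WITH TILDE (1 code point)

# A plain comparison sees two different code point sequences.
equal_raw = decomposed == precomposed

# NFC composes where possible, NFD decomposes; either works for
# comparison as long as both sides use the same form.
equal_nfc = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))
equal_nfd = (unicodedata.normalize("NFD", decomposed)
             == unicodedata.normalize("NFD", precomposed))

print(equal_raw, equal_nfc, equal_nfd)  # False True True
```

The same idea applies to searching and sorting: normalize both the stored text and the search key first, then compare.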

The following trick is more fun. ;-)

Code:

Debug Chr('A') + Chr('B')              + Chr('C') + Chr('D')   ; -> ABCD
Debug Chr('A') + Chr('B') + Chr($202E) + Chr('C') + Chr('D')   ; -> ABDC  !!
xkcd puts it this way:

[Image: xkcd comic]
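
For anyone wondering what happens in the ABDC trick: U+202E is the RIGHT-TO-LEFT OVERRIDE control character. It does not reorder anything in memory; it only tells the renderer to lay out the following text right to left. A short Python sketch (not PureBasic) that inspects the string makes this visible:

```python
import unicodedata

RLO = "\u202e"
s = "AB" + RLO + "CD"

print(unicodedata.name(RLO))             # RIGHT-TO-LEFT OVERRIDE
print(len(s))                            # 5
print([f"U+{ord(ch):04X}" for ch in s])  # stored order is still A, B, RLO, C, D

# Dropping format characters (category Cf) recovers the logical text.
visible = "".join(ch for ch in s if unicodedata.category(ch) != "Cf")
print(visible)                           # ABCD
```

Stripping or rejecting such format characters is one way to defend against this kind of trick in file names or user names, at the cost of also removing legitimate bidirectional controls.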