Unicode and PureBasic

Rescator
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: Unicode and PureBasic

Post by Rescator »

From Wikipedia: http://en.wikipedia.org/wiki/UTF-16
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.
UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.
So do note that in UTF-16, which is what Windows uses, a single Unicode code point takes either 2 or 4 bytes, and a user-perceived "character" built from a base letter plus combining marks can take 6, 8 or more.
Never assume a Unicode character is 2 bytes when processing strings (i.e. parsing two bytes at a time would technically be wrong).
The first 65,536 code points (the Basic Multilingual Plane, BMP) fit in two bytes; the higher code points need 4 bytes.
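
To make the sizes concrete, here is a minimal Python sketch (Python is used purely for illustration, and `utf16_byte_length` is a made-up helper name, not a PureBasic or Windows function). It encodes a string as UTF-16 and counts the bytes:

```python
def utf16_byte_length(text: str) -> int:
    """Return the number of bytes `text` occupies in UTF-16 (no BOM)."""
    # "utf-16-le" fixes the byte order, so Python does not prepend a BOM.
    return len(text.encode("utf-16-le"))

print(utf16_byte_length("A"))           # 2: U+0041, inside the BMP
print(utf16_byte_length("\u20ac"))      # 2: U+20AC EURO SIGN, inside the BMP
print(utf16_byte_length("\U0001F600"))  # 4: U+1F600, outside the BMP (surrogate pair)
print(utf16_byte_length("a\u030a"))     # 4: base letter plus one combining mark
```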

This is why UTF-8 is better suited for transmitting and sharing Unicode text: you avoid the endianness issue, and text in most European (and European-derived) languages stays largely readable even if it is mistakenly displayed as ASCII. XML defaults to UTF-8, even very old web browsers handle UTF-8 HTML just fine if you specify the encoding, and 7-bit ASCII fits "as is" into UTF-8.
My rule of thumb is: if in doubt, use UTF-8.
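
A small Python sketch of the two points about UTF-8 (illustration only; `encoded_forms` is a hypothetical helper name): ASCII text is byte-for-byte identical in UTF-8, while UTF-16 has two possible byte orders for the same text.

```python
def encoded_forms(text: str) -> dict:
    """Return the hex bytes of `text` in UTF-8 and in both UTF-16 byte orders."""
    return {
        "utf-8": text.encode("utf-8").hex(),
        "utf-16-le": text.encode("utf-16-le").hex(),
        "utf-16-be": text.encode("utf-16-be").hex(),
    }

print(encoded_forms("A"))
# {'utf-8': '41', 'utf-16-le': '4100', 'utf-16-be': '0041'}

# Pure ASCII text needs no conversion at all to be valid UTF-8.
print("PureBasic".encode("ascii") == "PureBasic".encode("utf-8"))  # True
```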


As to Mac and Linux: I can at least say that I have read that on Linux it varies between UTF-8 and UTF-32 depending on the desktop environment you use (GNOME vs. something else, for example). I'm sure some Linux geek could dig that info up and post it here.

Currently UTF-32 fits every Unicode character that exists in 4 bytes. Since the Unicode code space is capped at U+10FFFF (the highest code point UTF-16 can address), that is not expected to change.

Also note that no normal font has glyphs for all Unicode code points. You may need to compromise on looks to get a font with wide Unicode coverage. As examples of where you may encounter these encodings: ID3 v2.4 supports UTF-8, and Vorbis comments (used by Ogg, FLAC, Opus) are UTF-8.
Danilo
Addict
Posts: 3037
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Unicode and PureBasic

Post by Danilo »

Rescator wrote: So do note that on Windows a unicode "character" may use 2, 4, 6 or 8 bytes to represent a unicode character.
Never assume a unicode character is 2 bytes when doing strings processing (i.e. parsing two and two bytes would technically be wrong).
The first some 65 thousand unicode characters fit in two bytes, the higher unicode characters need 4 bytes.
- PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote: For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
chris319
Enthusiast
Posts: 782
Joined: Mon Oct 24, 2005 1:05 pm

Re: Unicode and PureBasic

Post by chris319 »

This thread is supposed to UNconfuse us?
Roger Hågensen
User
Posts: 47
Joined: Wed Mar 25, 2015 1:06 pm
Location: Norway

Re: Unicode and PureBasic

Post by Roger Hågensen »

Danilo wrote: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote: For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
That could be an issue, as Windows will return UTF-16 with surrogate pairs when converting UTF-8 text that contains characters outside the BMP. In that case a single UTF-16 character actually takes up 4 bytes rather than 2.
So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) per character is a possible problem. A UTF-16 string should be treated the way one treats a UTF-8 string, as a variable-length encoding; only with UTF-16 you also have to keep its endianness in mind.
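
As a rough sketch of what "treating UTF-16 as variable length" means, the following Python function joins surrogate pairs into single code points. It is for illustration only: `code_points_from_utf16` is a made-up name, the input is assumed to be a list of 16-bit code units already read in the correct byte order, and unpaired surrogates are simply passed through.

```python
def code_points_from_utf16(units):
    """Combine UTF-16 code units into code points, joining surrogate pairs."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High + low surrogate form one code point above U+FFFF.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # A BMP code unit (or an unpaired surrogate, kept as is).
            result.append(unit)
            i += 1
    return result

# "A" followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
print([hex(cp) for cp in code_points_from_utf16([0x0041, 0xD83D, 0xDE00])])
# ['0x41', '0x1f600']
```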
4 music albums under CC BY license available for free (any use, even commercial) at Skuldwyrm.no
Danilo
Addict
Posts: 3037
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Unicode and PureBasic

Post by Danilo »

Roger Hågensen wrote: So treating uTF-16 (as I mentioned in the post above) as always 2 bytes (or 16bit) is a possible problem. A UTF-16 string should be treated [...]
Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.

I can't change that, and it's not my fault. :D
Roger Hågensen
User
Posts: 47
Joined: Wed Mar 25, 2015 1:06 pm
Location: Norway

Re: Unicode and PureBasic

Post by Roger Hågensen »

CharNext_() is an interesting WinAPI function. https://msdn.microsoft.com/en-us/librar ... 47469.aspx
This function works with default "user" expectations of characters when dealing with diacritics. For example: a string that contains U+0061 U+030A ("LATIN SMALL LETTER A" + "COMBINING RING ABOVE"), which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304, which looks like "a´^~¯", will advance five code points, not one, and so on.
There is also a CharPrev_().
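
The behaviour described for diacritics can be approximated in a few lines of Python. This is only a sketch of the idea (skip a base character plus any combining marks that follow it), not the algorithm Windows actually uses, and `next_char_end` is a made-up name:

```python
import unicodedata

def next_char_end(text: str, start: int) -> int:
    """Return the index just past the character at `start`,
    including any combining marks that follow it."""
    if start >= len(text):
        return len(text)
    end = start + 1
    # unicodedata.combining() is non-zero for combining marks.
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return end

print(next_char_end("a\u030ab", 0))                  # 2: "a" + COMBINING RING ABOVE
print(next_char_end("a\u0301\u0302\u0303\u0304", 0)) # 5: "a" + four combining marks
print(next_char_end("ab", 0))                        # 1: plain letter
```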


Now, using Windows API calls and dealing with local text is mostly OK. The issue is when you get text from a different locale than the user's (a Spanish guy with an Arabic name, for example): how would a program on an American system display that properly, let alone apply upper and lower case properly?

Then there is Unicode normalization and case folding; the latter treats ß and ss the same to simplify comparisons (like the sort order of filenames, for example).
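
In Python, for illustration, the ß/ss equivalence comes from case folding (`str.casefold`), while `unicodedata.normalize` takes care of composed vs. decomposed forms. A minimal sketch that combines both for comparisons (`loosely_equal` is a made-up name, and this is a simplification of full Unicode caseless matching):

```python
import unicodedata

def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization and case folding."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return fold(a) == fold(b)

print(loosely_equal("Stra\u00dfe", "STRASSE"))  # True: casefold maps ß to ss
print(loosely_equal("a\u030a", "\u00e5"))       # True: NFC composes a + ring to å
print(unicodedata.normalize("NFKC", "\u00df") == "\u00df")  # True: normalization alone leaves ß unchanged
```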
Roger Hågensen
User
Posts: 47
Joined: Wed Mar 25, 2015 1:06 pm
Location: Norway

Re: Unicode and PureBasic

Post by Roger Hågensen »

Danilo wrote: Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.
I can't change that, and it's not my fault. :D
I know. But it's troublesome as Windows uses UTF-16 and if PB treats it as UCS-2 (like Windows NT 4.0 and older did) then text may be handled wrong.

Now, if text is simply stored as UCS-2 in PureBasic but WinAPI functions (on Windows) are used for the string handling, then there are probably no issues, as PureBasic is not doing any UCS-2 text processing at all.

This is similar to how UTF-8 can be stored as if it were an 8-bit ASCII string; you just can't process it as if it were ASCII, that's all.
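
A small Python sketch of the "store it, but don't process it as ASCII" point: cutting a UTF-8 byte string at an arbitrary byte offset can split a character, so the cut has to back up to a sequence boundary. The name `truncate_utf8_bytes` is made up, and the function assumes valid UTF-8 input and a non-negative limit.

```python
def truncate_utf8_bytes(data: bytes, limit: int) -> bytes:
    """Cut `data` to at most `limit` bytes without splitting a UTF-8 sequence."""
    if limit >= len(data):
        return data
    end = limit
    # Continuation bytes look like 10xxxxxx; back up to the start of the sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

raw = "a\u00e5".encode("utf-8")      # b'a\xc3\xa5' (3 bytes)
print(raw[:2])                       # b'a\xc3': naive cut leaves half a character
print(truncate_utf8_bytes(raw, 2))   # b'a': cut moved back to a character boundary
```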
Roger Hågensen
User
Posts: 47
Joined: Wed Mar 25, 2015 1:06 pm
Location: Norway

Re: Unicode and PureBasic

Post by Roger Hågensen »

Here is an interesting read: http://utf8everywhere.org/

Some of these things are worth considering when dealing with PureBasic and Unicode as well.
Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Unicode and PureBasic

Post by Little John »

mariosk8s has posted interesting information about Converting from UTF-8 NFD to NFC & vice versa.
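
For readers who just want to see what NFD and NFC mean, here is a minimal Python sketch using the standard `unicodedata` module (this is not the code from the linked post; `to_nfc` and `to_nfd` are made-up wrapper names):

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Compose base letters and combining marks where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)

def to_nfd(text: str) -> str:
    """Decompose precomposed characters into base letter + combining marks."""
    return unicodedata.normalize("NFD", text)

composed = "\u00e5"      # å as one code point
decomposed = "a\u030a"   # a + COMBINING RING ABOVE

print(to_nfd(composed) == decomposed)     # True
print(to_nfc(decomposed) == composed)     # True
print(len(composed.encode("utf-8")))      # 2 bytes in UTF-8
print(len(decomposed.encode("utf-8")))    # 3 bytes in UTF-8
```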
Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Unicode and PureBasic

Post by Little John »

For another discussion about UCS-2 vs. UTF-16 see here.
In that thread, freak wrote: PB supports UTF-16 in the same way that Java does: A surrogate pair simply counts as two characters in string functions. In my experience, this is close enough for almost all cases because situations in which they need to be treated as a single character are pretty rare (the use of code-points outside of the BMP is pretty exotic in itself).
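
The difference between counting code points and counting 16-bit units can be shown with a short Python sketch. It only models what a string function that counts UTF-16 code units would report; it is not PureBasic's actual implementation, and `utf16_code_units` is a made-up name.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units `text` needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

sample = "a\U0001F600"           # "a" plus one code point outside the BMP

print(len(sample))               # 2: Python counts code points
print(utf16_code_units(sample))  # 3: the surrogate pair counts as two units
```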
Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Unicode and PureBasic

Post by Little John »

Regular Expressions

The PCRE library which ships with PureBasic (tested with 5.71 beta 2 on Windows) does not properly support Unicode: For instance, the anchor \b, as well as the shorthand character classes \w and \W do not work as expected. So I made a related feature request.

Until PureBasic comes with a PCRE library that completely supports Unicode, we can use tricks to work around some of the limitations. For examples, click on the first link in this message. For more information see this tutorial about Unicode Regular Expressions.
Sooraa
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: Unicode and PureBasic

Post by Sooraa »

Hi Little John,

your feature request regarding real Unicode support for \b, \w, \d, \s has led to the integration of PCRE lib 8.44 in PB 5.72, but that alone didn't help.

We have to turn on UCP support in the PCRE compiler in the "CreateRegularExpression" statement by prepending (*UCP) to the regex. For the example "\bglich\b" it is "(*UCP)\bglich\b".

\b, \w, \d, \s work fine with it.
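
To see why Unicode-aware character classes change the result of \b, here is an analogy in Python. Note the assumptions: Python's `re` module is not PCRE and, as far as I know, does not accept the `(*UCP)` prefix; instead, its `re.ASCII` flag roughly mimics the behaviour without UCP, and the default mode for `str` patterns roughly mimics the behaviour with UCP. The helper name `finds_whole_word` and the test word "möglich" are made up for this example.

```python
import re

def finds_whole_word(word: str, text: str, unicode_aware: bool) -> bool:
    """Search for `word` delimited by \\b word boundaries in `text`."""
    flags = 0 if unicode_aware else re.ASCII
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags) is not None

# ASCII-only classes: "ö" is not a word character, so a boundary
# appears inside "möglich" and "glich" is wrongly found as a word.
print(finds_whole_word("glich", "m\u00f6glich", unicode_aware=False))  # True

# Unicode-aware classes: "ö" is a word character, so there is no match.
print(finds_whole_word("glich", "m\u00f6glich", unicode_aware=True))   # False
```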
Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Unicode and PureBasic

Post by Little John »

Cool, thank you!
Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Unicode and PureBasic

Post by Little John »

idle wrote a UTF-16 module.