
All times are UTC + 1 hour




 Post subject: Re: Unicode and PureBasic
Posted: Wed Mar 25, 2015 4:40 am
From Wikipedia: http://en.wikipedia.org/wiki/UTF-16

Quote:
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.


Quote:
UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.


So do note that on Windows (UTF-16) a single Unicode code point takes either 2 or 4 bytes, and what the user sees as one "character" can take 6, 8 or more bytes when it is built from a base character plus combining marks.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. stepping through a string two bytes at a time is technically wrong).
The first 65,536 code points (U+0000 to U+FFFF, the Basic Multilingual Plane) fit in two bytes; the higher code points need 4 bytes (a surrogate pair).
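
To make the byte counts concrete, here is a minimal sketch in Python (not PureBasic) that encodes a few characters as UTF-16 and counts the bytes. The helper name `utf16_byte_length` and the sample characters are my own illustrative choices, not something from this thread.

```python
# Sketch: how many bytes a string occupies once encoded as UTF-16.
# utf16_byte_length is a hypothetical helper name used only for this example.

def utf16_byte_length(text: str) -> int:
    """Return the size of `text` in bytes as UTF-16 (little endian, no BOM)."""
    return len(text.encode("utf-16-le"))

samples = [
    ("A", "U+0041, inside the BMP"),
    ("\u00e5", "U+00E5, inside the BMP"),
    ("\U0001D11E", "U+1D11E, outside the BMP (surrogate pair)"),
    ("a\u030a", "U+0061 U+030A, base letter plus combining ring"),
]

for text, note in samples:
    print(f"{note}: {utf16_byte_length(text)} bytes")
```

The last sample is one visible character made of two code points, which is why byte counts per "character" can go beyond 4.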

This is why UTF-8 is better suited for transmitting/sharing Unicode text: you avoid the endian issue, and text in most European (or European-derived) languages stays largely readable even if it is mistakenly displayed as ASCII. XML defaults to UTF-8, even very old web browsers handle UTF-8 HTML just fine if you specify the encoding, and 7-bit ASCII fits "as is" into UTF-8.
My rule of thumb: if in doubt, use UTF-8.
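
A small Python sketch of those three points, assuming nothing beyond the standard library. The helper `shown_as_ascii` is a made-up name that imitates a viewer which only understands 7-bit ASCII.

```python
# Sketch: ASCII compatibility and byte order, UTF-8 versus UTF-16.

def shown_as_ascii(text: str) -> str:
    """Encode as UTF-8, then render every non-ASCII byte as '?'."""
    return "".join(chr(b) if b < 0x80 else "?" for b in text.encode("utf-8"))

# 1) Plain ASCII text is byte-for-byte identical in UTF-8.
print("Hi".encode("utf-8") == "Hi".encode("ascii"))

# 2) UTF-16 has two byte orders, UTF-8 has only one.
print("Hi".encode("utf-16-le").hex())
print("Hi".encode("utf-16-be").hex())

# 3) Mostly-ASCII text stays largely readable when shown as ASCII.
print(shown_as_ascii("bl\u00e5b\u00e6r"))
```

Only the two non-ASCII letters of the Norwegian word are damaged in the last line; the ASCII letters around them survive.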


As for Mac and Linux: I can at least say I have read that on Linux it varies between UTF-8 and UTF-32 depending on the desktop environment you use (GNOME vs. something else, for example). I'm sure some Linux geek could dig that info up and post it here.

UTF-32 fits every Unicode code point in 4 bytes. The Unicode code space ends at U+10FFFF (the highest value UTF-16 can address), so a UTF-32 character is not expected to ever need more than 4 bytes.
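
For reference, the Unicode code space stops at U+10FFFF, and a quick Python check shows that both ends of it take 4 bytes in UTF-32. The helper name `utf32_byte_length` is just for this sketch.

```python
import sys

# Sketch: every code point is exactly 4 bytes in UTF-32.
# utf32_byte_length is a hypothetical helper name used only for this example.

def utf32_byte_length(text: str) -> int:
    """Return the size of `text` in bytes as UTF-32 (little endian, no BOM)."""
    return len(text.encode("utf-32-le"))

print(utf32_byte_length("A"))            # lowest end of the range
print(utf32_byte_length(chr(0x10FFFF)))  # highest valid code point
print(hex(sys.maxunicode))               # Python's own upper limit
```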

Also note that no normal font has glyphs for all Unicode code points. You may need to compromise on looks to get a font with wide Unicode coverage. As examples of where you may encounter UTF-8: ID3 v2.4 supports it, and Vorbis comments (used by Ogg, FLAC, Opus) are UTF-8.


Post subject: Re: Unicode and PureBasic
Posted: Wed Mar 25, 2015 10:57 am
Rescator wrote:
So do note that on Windows (UTF-16) a single Unicode code point takes either 2 or 4 bytes, and what the user sees as one "character" can take 6, 8 or more bytes when it is built from a base character plus combining marks.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. stepping through a string two bytes at a time is technically wrong).
The first 65,536 code points (U+0000 to U+FFFF, the Basic Multilingual Plane) fit in two bytes; the higher code points need 4 bytes (a surrogate pair).

- PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote:
For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).


Post subject: Re: Unicode and PureBasic
Posted: Thu Mar 26, 2015 7:37 am
This thread is supposed to UNconfuse us?


Post subject: Re: Unicode and PureBasic
Posted: Fri Mar 27, 2015 12:50 pm
Danilo wrote:
PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote:
For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).


That could be an issue, as Windows will return UTF-16 with surrogate pairs when converting UTF-8 text that contains code points outside the BMP. In that case a single UTF-16 character actually takes up 4 bytes rather than 2.
So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) is a possible problem. A UTF-16 string should be treated the way one treats a UTF-8 string, as a variable-width encoding; with UTF-16 you also have to keep its endianness in mind.
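
As an illustration of what "not always 2 bytes" means in practice, here is a rough Python sketch that walks a list of 16-bit UTF-16 code units and joins surrogate pairs back into single code points. The function names `utf16_units` and `code_points` are made up for this example, and passing a lone surrogate through unchanged is my own simplification.

```python
import struct

# Sketch: walk UTF-16 code units and combine surrogate pairs.
# Function names are hypothetical; lone surrogates are passed through as-is.

def utf16_units(text: str) -> list:
    """Return the 16-bit code units of `text` (little endian input, no BOM)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def code_points(units: list) -> list:
    """Turn UTF-16 code units into code points, joining surrogate pairs."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2  # one code point consumed two code units (4 bytes)
        else:
            result.append(unit)
            i += 1  # one code point consumed one code unit (2 bytes)
    return result

units = utf16_units("a\U0001D11E")
print([hex(u) for u in units])
print([hex(cp) for cp in code_points(units)])
```

A UCS-2 style loop would report three characters for this string; the surrogate-aware loop reports two.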

_________________
4 music albums under CC BY license available for free (any use, even commercial) at Skuldwyrm.no


Post subject: Re: Unicode and PureBasic
Posted: Fri Mar 27, 2015 1:32 pm
Roger Hågensen wrote:
So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) is a possible problem. A UTF-16 string should be treated [...]

Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.

I can't change that, and it's not my fault. :D


Post subject: Re: Unicode and PureBasic
Posted: Fri Mar 27, 2015 2:46 pm
CharNext_() is an interesting WinAPI function. https://msdn.microsoft.com/en-us/librar ... 47469.aspx
Quote:
This function works with default "user" expectations of characters when dealing with diacritics. For example: A string that contains U+0061 U+030a ("LATIN SMALL LETTER A" + "COMBINING RING ABOVE"), which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304, which looks like "a´^~¯", will advance five code points, not one, and so on.


There is also a CharPrev_()
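
To show the idea of stepping by what the user sees rather than by code point, here is a simplified Python sketch. It is not how CharNext_() is implemented; it only glues combining marks onto the preceding base character, which is a rough approximation of proper grapheme cluster rules. The name `user_characters` is made up.

```python
import unicodedata

# Sketch: group a base character with the combining marks that follow it.
# This is a simplification, not the algorithm used by the Windows API.

def user_characters(text: str) -> list:
    """Split `text` into base characters, each with its trailing combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # combining mark: attach to the previous cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters

ring = user_characters("a\u030ab")
stacked = user_characters("a\u0301\u0302\u0303\u0304")
print(len(ring), [len(c) for c in ring])
print(len(stacked), [len(c) for c in stacked])
```

The first string is two user characters (the first one made of two code points), and the second is a single user character made of five code points, matching the examples in the quote.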


Now, using Windows API calls and dealing with local text is mostly OK. The issue is text from a different locale than the user's (a Spanish guy with an Arabic name, for example): how would a program on an American system display that properly, let alone apply upper and lower case properly?

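Then there is Unicode normalization and case folding. Normalization makes equivalent sequences compare equal (å as a single code point versus a plus a combining ring), and full case folding treats ß and ss the same, which simplifies comparisons (the list order of filenames, for example).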
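
A short Python sketch of both steps, using only the standard library. Note that in Python the ß to ss mapping comes from case folding (`str.casefold`) and not from the normalization forms; the helper name `loosely_equal` is invented for this example.

```python
import unicodedata

# Sketch: compare strings after normalization (NFC) and case folding.
# loosely_equal is a hypothetical helper name used only for this example.

def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization and full case folding."""
    def key(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return key(a) == key(b)

composed = "\u00e5"      # one code point
decomposed = "a\u030a"   # base letter plus combining ring

print(composed == decomposed)                                # raw comparison
print(unicodedata.normalize("NFC", decomposed) == composed)  # after NFC
print(unicodedata.normalize("NFD", composed) == decomposed)  # after NFD
print(loosely_equal("Stra\u00dfe", "STRASSE"))               # case folding
```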

Post subject: Re: Unicode and PureBasic
Posted: Fri Mar 27, 2015 2:50 pm
Danilo wrote:
Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.
I can't change that, and it's not my fault. :D

I know. But it's troublesome, as Windows uses UTF-16, and if PB treats it as UCS-2 (like Windows NT 4.0 and older did) then text may be handled wrongly.

Now, if text is simply stored as UCS-2 in PureBasic but WinAPI functions (on Windows) are used for the string handling, then there are probably no issues, as PureBasic is not doing any UCS-2 text processing at all.

This is similar to how UTF-8 can be stored as if it were an 8-bit ASCII string; you just can't process it as if it were ASCII, that's all.
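
Here is a small Python sketch of that store-versus-process difference for UTF-8. The helper name `splits_character` and the sample word are assumptions made for the example.

```python
# Sketch: UTF-8 bytes can be stored and copied blindly, but not cut blindly.
# splits_character is a hypothetical helper name used only for this example.

def splits_character(data: bytes, byte_index: int) -> bool:
    """Return True if cutting `data` at `byte_index` lands inside a UTF-8 character."""
    try:
        data[:byte_index].decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

word = "bl\u00e5b\u00e6r"
data = word.encode("utf-8")

print(len(word), len(data))             # characters versus bytes
print(data.decode("utf-8") == word)     # storing and copying is harmless
print(splits_character(data, 3))        # cut in the middle of a character
print(splits_character(data, 4))        # cut on a character boundary
```

The same reasoning applies to UTF-16 stored in a UCS-2 string: storage is fine, but cutting between the two halves of a surrogate pair is not.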

Post subject: Re: Unicode and PureBasic
Posted: Fri Mar 27, 2015 3:25 pm
Here is an interesting read: http://utf8everywhere.org/

Some of these things are worth considering when dealing with PureBasic and Unicode as well.

Post subject: Re: Unicode and PureBasic
Posted: Wed Aug 12, 2015 3:56 pm
mariosk8s has posted interesting information about Converting from UTF-8 NFD to NFC & vice versa.

_________________
Please excuse my flawed English. My native language is PureBasic.
Search
RSBasic's backups


Post subject: Re: Unicode and PureBasic
Posted: Sat Sep 12, 2015 8:56 am
For another discussion about UCS-2 vs. UTF-16 see here.

In that thread, freak wrote:
PB supports UTF-16 in the same way that Java does: A surrogate pair simply counts as two characters in string functions. In my experience, this is close enough for almost all cases because situations in which they need to be treated as a single character are pretty rare (the use of code-points outside of the BMP is pretty exotic in itself).

Post subject: Re: Unicode and PureBasic
Posted: Fri Feb 19, 2016 11:33 am
Demivec wrote a module for Detecting Text File Encoding without BOM,
and he also implemented Revised Chr() & Asc() for UTF-16 surrogate pairs.
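
For readers who have not seen the arithmetic behind such a Chr() replacement, here is a generic sketch in Python of how a code point is split into UTF-16 code units. This is not Demivec's code; the function name `to_utf16_units` is made up, and raising an error for invalid input is my own choice.

```python
# Sketch: split a code point into one or two UTF-16 code units.
# to_utf16_units is a hypothetical name; this is not the linked module.

def to_utf16_units(code_point: int) -> list:
    """Return the UTF-16 code unit(s) for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]              # BMP: a single 16-bit unit
    offset = code_point - 0x10000        # 20 bits left to distribute
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]

print([hex(u) for u in to_utf16_units(0x41)])
print([hex(u) for u in to_utf16_units(0x1F600)])
```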
