chr() and unicode

pdwyer · Post by **pdwyer** » Wed Oct 24, 2007 2:58 am

I think the unicode support is working though; the chars are different. In ascii mode DBCS chars are displayed okay, but when you select them it selects one half at a time (kind of difficult to explain, you'd have to see it, so I'll try to get a pic later when I'm home). In UTF8 mode, the full chars select.
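A rough illustration of the half-character selection, written in Python rather than PureBasic and using cp932 (Shift-JIS) as an assumed example of a DBCS code page. It only shows that one DBCS character is two bytes, so anything that steps through the text one byte at a time will stop in the middle of it:

```python
# Assumption: cp932 (Shift-JIS) stands in for "a DBCS code page".
text = "あ"  # one Japanese character

dbcs_bytes = text.encode("cp932")

# One character on screen, two bytes in memory.
print(len(text), len(dbcs_bytes))

# A byte-oriented (ascii mode) control sees two separate units,
# so a selection can cover just the first or just the second half.
halves = [dbcs_bytes[i:i + 1] for i in range(len(dbcs_bytes))]
print(halves)
```

Whether the editor control really works this way internally is a guess; the sketch only matches the symptom.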

How does this work under the hood then?

The source is UTF8, and when loaded into the IDE it's ???, and when static strings are compiled it's in UTF16? When is the conversion from UTF8 to UTF16 happening?

- At IDE Load?
- At compile time?
- At run time?
- Or am I completely misunderstanding how this works?

Trond · Post by **Trond** » Wed Oct 24, 2007 3:04 pm

Forget UTF-16. It's not used by PB. And forget all about codepages.

pdwyer · Post by **pdwyer** » Wed Oct 24, 2007 3:14 pm

Actually you are wrong, type "Unicode" into the help file! UTF16 (UCS2) is what the win32API "w" APIs are so if PB wasn't using it then it would have to convert everything.

However the docs say:

Unicode and Windows

On Windows, PureBasic internally uses the UCS2 encoding which is the format used by the Windows unicode API, so no conversions are needed at runtime when calling an OS function. When dealing with an API function, PureBasic will automatically use the unicode one if available (for example MessageBox_() will map to MessageBoxW() in unicode mode and MessageBoxA() in Ascii mode). All the API structures and constants supported by PureBasic will also automatically switch to their unicode version. That means that the same API code can be compiled either in unicode or ascii without any change.

Unicode is only natively supported on Windows NT and later (Windows 2000/XP/Vista): a unicode program won't work on Windows 95/98/Me. There is a solution using the 'unicows' wrapper dll, but it is not yet supported by PureBasic. If Windows 9x support is needed, the best approach is to provide two versions of the executable: one compiled in ascii, and another in unicode. As it's only a switch to specify, it should be quite fast.

UTF-8

UTF-8 is another unicode encoding, which is byte based. Unlike UCS2 which always takes 2 bytes per character, UTF-8 uses a variable length encoding for each character (up to 4 bytes can represent one character). The biggest advantage of UTF-8 is the fact that it doesn't include null characters in its coding, so it can be edited like a regular text file. Also the ASCII characters from 1 to 127 are always preserved, so the text is kind of readable as only special characters will be encoded. One drawback is its variable length, so all string operations will be slower due to the pre-processing needed to locate a character in the text.

PureBasic uses UTF-8 by default when writing strings to files in unicode mode (File and Preference libraries), so all texts are fully cross-platform.

The PureBasic compiler also handles both Ascii and UTF-8 files (the UTF-8 files need to have the correct BOM header to be handled correctly). Both can be mixed in a single program without problem: an ascii file can include a UTF-8 file and vice-versa. When developing a unicode program, it's recommended to set the IDE in UTF-8 mode, so all the source files will be unicode ready. As the UTF-8 format doesn't hurt when developing ascii only programs either, there is no need to change this setting back.

So my questions from the previous post still stand:

The source is UTF8, and when loaded into the IDE it's ???, and when static strings are compiled it's in UTF16? When is the conversion from UTF8 to UTF16 happening?

- At IDE Load?
- At compile time?
- At run time?
- Or am I completely misunderstanding how this works?

PS: if you are wondering where UTF8 fits into windows, it's code page 65001.

Edit: It's not that I don't know how unicode works, I've dealt with DBCS for a lot of years, it's that I don't know how PB works. Just so you know where I'm coming from.

Trond · Post by **Trond** » Wed Oct 24, 2007 3:22 pm

pdwyer wrote:Actually you are wrong, type "Unicode" into the help file! UTF16 (UCS2) is what the win32API "w" APIs are so if PB wasn't using it then it would have to convert everything.

Oh my. UTF-16 is not the same as UCS2. :roll:

The manual section on unicode explains exactly which encodings are used where, I don't see the problem. But, if it's easier to understand when I say it:

- The source files (.pb) are saved in UTF-8 format, which is a variable length unicode encoding, where each character takes anything from 1 to 4 bytes. This also applies to files you read and write with the file library.

- Strings in executables (dynamic and static) are stored in UCS2, which is a fixed length unicode encoding, where each character takes 2 bytes.

Both encodings use the same character set, stored in different ways.
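A small sketch of that point, in Python rather than PureBasic. It assumes Python's "utf-16-le" codec as a stand-in for UCS2, which gives the same bytes as long as every character is in the BMP. The sample string is arbitrary:

```python
text = "Aé€"  # ASCII letter, accented Latin letter, euro sign

# Bytes per character in each encoding.
utf8_sizes = [len(ch.encode("utf-8")) for ch in text]
ucs2_sizes = [len(ch.encode("utf-16-le")) for ch in text]

print(utf8_sizes)  # variable length
print(ucs2_sizes)  # always 2 for these characters

# Same character set: both byte strings decode back to the same text.
utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-le")
print(utf8.decode("utf-8") == ucs2.decode("utf-16-le"))
```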

gnozal · Post by **gnozal** » Wed Oct 24, 2007 3:24 pm

My guess :
A UTF8 source when loaded into the IDE is UTF-8 in the scintilla control.
In an EXE compiled in Unicode mode, the static strings are Unicode (UCS2 ?).

Trond · Post by **Trond** » Wed Oct 24, 2007 3:25 pm

gnozal wrote:In an EXE compiled in Unicode mode, the static strings are Unicode (UTF-16 ?).

No, it's not UTF-16, it's UCS2. This is clearly written in the quote from the manual pasted in above.

gnozal · Post by **gnozal** » Wed Oct 24, 2007 3:28 pm

Trond wrote:
In an EXE compiled in Unicode mode, the static strings are Unicode (UTF-16 ?).
No, it's not UTF-16, it's UCS2. This is clearly written in the quote from the manual pasted in above.

Edit too late, I just opened Opera

From Wikipedia :

UTF-16 (16-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire

UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which is a predecessor to UTF-16. The UCS-2 encoding form is nearly identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value.

Because of the technical similarities and upwards compatibility from UCS-2 to UTF-16, the two encodings are often erroneously conflated and used as if interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as being encoded in UCS-2.
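To make the difference concrete, here is a short Python sketch (not PureBasic; the helper name `utf16_units` is made up for this example). For a BMP character, UTF-16 produces a single 16-bit unit, which is exactly what UCS-2 would store. For a character above U+FFFF, UTF-16 produces a surrogate pair, which UCS-2 has no way to represent:

```python
import struct

def utf16_units(s):
    """Return the 16-bit code units of s in little-endian UTF-16."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = utf16_units("\u20ac")         # euro sign, inside the BMP
astral = utf16_units("\U0001F600")  # emoji, outside the BMP

print([hex(u) for u in bmp])     # one unit, same as UCS-2
print([hex(u) for u in astral])  # two units: high and low surrogate
```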

pdwyer · Post by **pdwyer** » Wed Oct 24, 2007 3:56 pm

UTF16 and UCS2 are often used interchangeably, and when referring to UTF16 I have been referring to the 16-bit wide char used by the Windows W APIs (as opposed to UTF8, which is sometimes 1 or more bytes per char but not "ALWAYS 2 BYTES").

Perhaps you should read the whole article http://en.wikipedia.org/wiki/UTF16

The simple point which stands is this. (in spite of my slightly incorrect naming)

UTF8 is NOT used by WIN32APIs which use wide 16bit Chars (I won't name them so as to avoid confusion this time)
The PB source code uses UTF8 and so conversion MUST be happening

My questions still stand.

Where is the conversion happening? :roll:

Trond · Post by **Trond** » Wed Oct 24, 2007 4:09 pm

It's at compile time.

When else would it be? Let's say you've got source.pb in UTF-8 and type pbcompiler source.pb. You now have source.exe in UCS2. So it's obviously not happening in the IDE. And if it happened at run-time then the statement "PureBasic internally uses the UCS2 encoding" would be false.
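Roughly what that compile-time step amounts to, sketched in Python. This is not how pbcompiler is actually written; `embed_literal` is a hypothetical helper, and "utf-16-le" is assumed to be equivalent to UCS2 for BMP text:

```python
import codecs

def embed_literal(utf8_source: bytes) -> bytes:
    """Hypothetical compile step: UTF-8 source text -> null-terminated 16-bit string."""
    # "utf-8-sig" skips a leading BOM if one is present.
    text = utf8_source.decode("utf-8-sig")
    # Two bytes per character, plus a 16-bit null terminator.
    return text.encode("utf-16-le") + b"\x00\x00"

# Stand-in for the contents of a UTF-8 .pb file: BOM followed by the text.
source = codecs.BOM_UTF8 + "héllo".encode("utf-8")
embedded = embed_literal(source)

print(len(source), len(embedded))
```

The conversion happens once, when the executable is built, so nothing needs converting before the string is handed to a W API at run time.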

pdwyer · Post by **pdwyer** » Thu Oct 25, 2007 1:01 am

Thanks for the info,

Like I said, I don't know PB well, what's with the know-it-all attitude all of a sudden?

Are you embarrassed about not knowing that windows unicode was commonly known as UTF-16?

Microsoft uses UCS-2 and UTF-16 interchangeably too though, as one is just an extension of the other.

For example, in this link describing MultiByteToWideChar http://msdn2.microsoft.com/en-us/library/ms776413.aspx they say: (and don't mention UCS)

MultiByteToWideChar
Maps a character string to a wide character (Unicode UTF-16) string. The character string mapped by this function is not necessarily from a multibyte character set.

Note: The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 (code page 65001) or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If use of Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML, XML, and HTTP files allow tagging, but text files do not.

But in this article
http://support.microsoft.com/kb/232580
they mention the same API in terms of UCS and don't even mention UTF16

On Windows NT or Windows 2000, you may use the Win32 functions MultiByteToWideChar and WideCharToMultiByte to convert UTF-8 to and from UCS-2 by passing the constant CP_UTF8 (65001) as the first parameter to the functions.

The world of internationalisation is murky, always has been, and is only getting better slowly. IMO, M$ is probably the best company at supporting intl in apps, whether unicode or otherwise.

Maybe we should just let this thread die before it generates flames