UTF-8 to UTF-16
Posted: Sun Aug 18, 2013 6:21 pm
by Korolev Michael
Save this as UTF-8 and compile it as a Unicode executable. Why does it return a single "o" instead of "onetwo"?
Code:
string$ = "onetwo"
bytes = Len(string$)*2+2
*buf = AllocateMemory(bytes)
MultiByteToWideChar_(65001,0,@string$,Len(string$),*buf,Len(string$))
Debug PeekS(*buf,bytes,#PB_Unicode)
Re: UTF-8 to UTF-16
Posted: Mon Aug 19, 2013 7:25 am
by JHPJHP
Removed; post ignored.
Re: UTF-8 to UTF-16
Posted: Mon Aug 19, 2013 7:35 am
by wilbert
Korolev Michael wrote: Save this as UTF-8 and compile it as a Unicode executable. Why does it return a single "o" instead of "onetwo"?
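Saving as UTF-8 only influences the way your source code is stored.
When you compile in Unicode mode, string$ is already a Unicode string.
So the problem is that you are passing a Unicode string to MultiByteToWideChar_, which expects a multibyte string. With two bytes per character, the second byte of "o" is zero, so the converted output has a null right after the first character and PeekS stops there.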
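A minimal sketch of one way to hand the API real UTF-8 bytes, assuming Windows and a Unicode-mode executable. The variable names utf8Bytes and *utf8 are only illustrative, and this is not necessarily how you would structure it in a real program:

```purebasic
; Sketch only: assumes Windows and a Unicode-mode executable.
string$ = "onetwo"

; Build a real UTF-8 byte sequence from the PureBasic string.
utf8Bytes = StringByteLength(string$, #PB_UTF8) + 1  ; +1 for the terminating null
*utf8 = AllocateMemory(utf8Bytes)
PokeS(*utf8, string$, -1, #PB_UTF8)

; Worst case is one UTF-16 unit per UTF-8 byte, terminator included.
*buf = AllocateMemory(utf8Bytes * 2)

; 65001 = CP_UTF8. Passing -1 as the input length makes the API convert
; up to and including the null terminator.
MultiByteToWideChar_(65001, 0, *utf8, -1, *buf, utf8Bytes)
Debug PeekS(*buf, -1, #PB_Unicode)

FreeMemory(*utf8)
FreeMemory(*buf)
```

Note that the last argument of MultiByteToWideChar_ is the output buffer size in characters, not bytes.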
Re: UTF-8 to UTF-16
Posted: Mon Aug 19, 2013 12:30 pm
by Korolev Michael
@wilbert You mean the string is already in UTF-16 LE format?
Re: UTF-8 to UTF-16
Posted: Mon Aug 19, 2013 12:50 pm
by wilbert
Korolev Michael wrote: @wilbert You mean the string is already in UTF-16 LE format?
Not exactly.
Internally, PureBasic uses either ASCII (character range 0-255) or Unicode (character range 0-65535).
So Unicode characters above 65535 (U+FFFF) aren't supported in Unicode mode, while UTF-16 LE does support them by means of surrogate pairs.
In Unicode mode, every character takes two bytes of memory.
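To see the two-bytes-per-character layout for yourself, here is a small sketch, assuming a Unicode-mode executable. It prints the byte sizes of the same text in both encodings and then dumps the first four bytes of the string as it sits in memory:

```purebasic
; Sketch only: assumes a Unicode-mode executable.
string$ = "onetwo"

Debug StringByteLength(string$, #PB_Unicode)  ; two bytes per character
Debug StringByteLength(string$, #PB_UTF8)     ; one byte per character for plain ASCII text

; Dump the first four bytes of the string as stored in memory.
; In Unicode mode every second byte of this text is zero.
For i = 0 To 3
  Debug Hex(PeekA(@string$ + i))
Next
```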
Re: UTF-8 to UTF-16
Posted: Mon Aug 19, 2013 1:17 pm
by Korolev Michael
OK, I just discovered that PB uses UCS-2, not UTF-16. But knowing this doesn't make Unicode any clearer to me.
The string is stored in UTF-8. Does a PureBasic app compiled in Unicode mode load the string into memory and automatically convert it to UCS-2?
Re: UTF-8 to UTF-16
Posted: Mon Aug 19, 2013 3:04 pm
by wilbert
Korolev Michael wrote: The string is stored in UTF-8. Does a PureBasic app compiled in Unicode mode load the string into memory and automatically convert it to UCS-2?
The code you write in the editor is stored in UTF-8 format on disk if you specify this. This enables you to use foreign language characters in your code.
When compiled, strings are stored in the generated assembler code in ASCII or Unicode, not UTF-8.
Basically, when you want to support foreign languages, you compile your application with Unicode enabled and don't worry about UTF-8.
UTF-8 is only relevant if you access files on disk or embed a UTF-8 file using IncludeBinary.
In that case, some PureBasic commands accept a #PB_UTF8 flag for working with such data.
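As a rough illustration of the file case, here is a sketch that reads a UTF-8 text file line by line. The file name "notes.txt" is hypothetical, and the sketch assumes the file has no byte order mark at the start:

```purebasic
; Sketch only: "notes.txt" is a hypothetical UTF-8 file without a BOM.
If ReadFile(0, "notes.txt")
  While Eof(0) = 0
    ; #PB_UTF8 tells ReadString to decode each line from UTF-8
    ; into the program's internal string format.
    Debug ReadString(0, #PB_UTF8)
  Wend
  CloseFile(0)
Else
  Debug "Could not open the file."
EndIf
```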