UTF-8 to UTF-16

Korolev Michael · Post by **Korolev Michael** » Sun Aug 18, 2013 6:21 pm

Save this as UTF-8 and compile it as a Unicode executable. Why does it return a single "o" instead of "onetwo"?

string$ = "onetwo"
bytes = Len(string$)*2+2
*buf = AllocateMemory(bytes)
MultiByteToWideChar_(65001,0,@string$,Len(string$),*buf,Len(string$))
Debug PeekS(*buf,bytes,#PB_Unicode)

JHPJHP · Post by **JHPJHP** » Mon Aug 19, 2013 7:25 am

Removed; post ignored.

wilbert · Post by **wilbert** » Mon Aug 19, 2013 7:35 am

Korolev Michael wrote: Save this as UTF-8 and compile it as a Unicode executable. Why does it return a single "o" instead of "onetwo"?

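Saving as UTF-8 only influences the way your source code is stored.
When you compile in Unicode mode, string$ is already a Unicode string.
So the problem is that you are passing a Unicode string to MultiByteToWideChar_, which expects a multibyte string. Each character of "onetwo" is stored as two bytes, the second of which is zero, so the converted buffer contains a null right after the "o" and PeekS stops there.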
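
One way to give MultiByteToWideChar_ the input it expects is to encode the string as UTF-8 first. The following is a minimal sketch, assuming Windows and a Unicode-mode executable; the names `*utf8`, `utf8Bytes` and `chars` are illustrative and not from the original post.

```purebasic
; Sketch only: assumes Windows and a Unicode-mode executable.
string$ = "onetwo"

; Produce real UTF-8 bytes first; string$ itself is already 2 bytes per character.
utf8Bytes = StringByteLength(string$, #PB_UTF8) + 1      ; +1 for the terminating null
*utf8 = AllocateMemory(utf8Bytes)
PokeS(*utf8, string$, -1, #PB_UTF8)

; With -1 as the source length the API also converts the terminating null,
; and the returned character count includes it.
chars = MultiByteToWideChar_(65001, 0, *utf8, -1, 0, 0)   ; 65001 = CP_UTF8, size query
*buf = AllocateMemory(chars * 2)                          ; 2 bytes per wide character
MultiByteToWideChar_(65001, 0, *utf8, -1, *buf, chars)

Debug PeekS(*buf, -1, #PB_Unicode)                        ; expected to show "onetwo"

FreeMemory(*utf8)
FreeMemory(*buf)
```

The same result should be obtainable without the API call by reading the UTF-8 buffer directly with `PeekS(*utf8, -1, #PB_UTF8)`.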

Korolev Michael · Post by **Korolev Michael** » Mon Aug 19, 2013 12:30 pm

@wilbert, you mean the string is already in UTF-16 LE format?

wilbert · Post by **wilbert** » Mon Aug 19, 2013 12:50 pm

Korolev Michael wrote: @wilbert, you mean the string is already in UTF-16 LE format?

Not exactly.
Internally PureBasic uses either ASCII (char range 0-255) or Unicode (char range 0-65535).
So Unicode characters above 65535 aren't supported in Unicode mode, whereas UTF-16 LE does support them.
In Unicode mode, every character takes two bytes of memory.
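
The two-bytes-per-character layout can be inspected directly. This is a small sketch, assuming a Unicode-mode build; `text$` and `i` are illustrative names.

```purebasic
; Sketch only: the values in the comments assume a Unicode-mode executable.
text$ = "onetwo"

Debug Len(text$)                          ; 6 characters
Debug StringByteLength(text$)             ; 12 bytes in the native string format
Debug StringByteLength(text$, #PB_UTF8)   ; 6 bytes if the same text were UTF-8
Debug SizeOf(Character)                   ; 2 in Unicode mode (1 in ASCII mode)

; Dump the first four bytes of the string in memory.
; In Unicode mode this shows 6F, 0, 6E, 0: each ASCII letter is followed by a zero byte.
For i = 0 To 3
  Debug Hex(PeekA(@text$ + i))
Next
```

Those zero bytes are why a function that expects single-byte text sees the string end after the first letter.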

Korolev Michael · Post by **Korolev Michael** » Mon Aug 19, 2013 1:17 pm

OK, I just discovered that PB uses UCS-2, not UTF-16. But knowing this doesn't make Unicode any clearer to me.

The string is stored in UTF-8. Does a PureBasic app compiled in Unicode mode load the string into memory and automatically convert it to UCS-2?

wilbert · Post by **wilbert** » Mon Aug 19, 2013 3:04 pm

Korolev Michael wrote: The string is stored in UTF-8. Does a PureBasic app compiled in Unicode mode load the string into memory and automatically convert it to UCS-2?

The code you write in the editor is stored in UTF-8 format on disk if you specify this. This enables you to use foreign language characters in your code.

When compiled, strings are stored in the generated assembler code in ASCII or Unicode, not UTF-8.
Basically, when you want to support foreign languages, you compile your application with Unicode enabled and don't worry about UTF-8.
UTF-8 is only relevant if you access files on disk or embed a UTF-8 file using IncludeBinary.
In this case you can use some PureBasic commands with a #PB_UTF8 flag to work with them.
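
As an illustration of the file case, the sketch below writes a UTF-8 file and reads it back with the #PB_UTF8 flag. The file name `pb_utf8_demo.txt` and the variable `file$` are hypothetical and only serve the example.

```purebasic
; Sketch only: the file name is hypothetical; file number 0 is used for simplicity.
file$ = GetTemporaryDirectory() + "pb_utf8_demo.txt"

; Write one line as UTF-8, preceded by a byte order mark.
If CreateFile(0, file$)
  WriteStringFormat(0, #PB_UTF8)
  WriteStringN(0, "onetwo", #PB_UTF8)
  CloseFile(0)
EndIf

; Read it back; ReadString converts from UTF-8 to the executable's own string format.
If ReadFile(0, file$)
  ReadStringFormat(0)                  ; skips the byte order mark if there is one
  While Eof(0) = 0
    Debug ReadString(0, #PB_UTF8)
  Wend
  CloseFile(0)
EndIf
```

Inside the program the text is an ordinary PureBasic string again, so UTF-8 only matters at the point where data enters or leaves the application.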
