UTF-8 to UTF-16

Korolev Michael
Enthusiast
Posts: 200
Joined: Wed Feb 01, 2012 5:30 pm
Location: Russian Federation

Post by Korolev Michael »

Save this as UTF-8 and compile it as a Unicode executable. Why does it return a single "o" instead of "onetwo"?

string$ = "onetwo"
bytes = Len(string$)*2+2
*buf = AllocateMemory(bytes)
MultiByteToWideChar_(65001,0,@string$,Len(string$),*buf,Len(string$))
Debug PeekS(*buf,bytes,#PB_Unicode)
Former user of pirated PB.
Now registered user :].
JHPJHP
Addict
Posts: 2273
Joined: Sat Oct 09, 2010 3:47 am

Post by JHPJHP »

Removed; post ignored.
Last edited by JHPJHP on Sat May 26, 2018 9:17 pm, edited 4 times in total.

wilbert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Post by wilbert »

Korolev Michael wrote: Save this as UTF-8 and compile it as a Unicode executable. Why does it return a single "o" instead of "onetwo"?
Saving as UTF-8 only affects how your source code is stored on disk.
When you compile in Unicode mode, string$ is already a Unicode string in memory.
So the problem is that you are passing a Unicode string to MultiByteToWideChar_, which expects a multibyte string.
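
To make this concrete: in a Unicode build the first character "o" sits in memory as the bytes 6F 00, so the API treats every second byte of string$ as a zero character, and PeekS stops at the first one. The following is only a sketch of one way the conversion could be done correctly, assuming a Windows Unicode build. It first writes real UTF-8 bytes into a buffer with PokeS and then converts that buffer. The names *utf8, *wide, utf8Bytes and wideChars are made up for illustration.

```purebasic
; Sketch only: compile as a Unicode executable on Windows.
string$ = "onetwo"

; 1. Produce real UTF-8 bytes from the (already Unicode) PureBasic string.
utf8Bytes = StringByteLength(string$, #PB_UTF8) + 1   ; +1 for the terminating zero byte
*utf8 = AllocateMemory(utf8Bytes)
PokeS(*utf8, string$, -1, #PB_UTF8)                   ; writes the text plus a zero terminator

; 2. Ask the API how many wide characters are needed.
;    Passing -1 as the source length makes it count the terminator too.
wideChars = MultiByteToWideChar_(65001, 0, *utf8, -1, 0, 0)
*wide = AllocateMemory(wideChars * 2)                 ; 2 bytes per UTF-16 code unit

; 3. Convert UTF-8 -> UTF-16 (65001 is the UTF-8 code page, as in the original post).
MultiByteToWideChar_(65001, 0, *utf8, -1, *wide, wideChars)

Debug PeekS(*wide, -1, #PB_Unicode)                   ; read back up to the zero terminator

FreeMemory(*wide)
FreeMemory(*utf8)
```

Note that the last argument of MultiByteToWideChar_ is a count of wide characters, not bytes, which is why the buffer is allocated as wideChars * 2.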
Windows (x64)
Raspberry Pi OS (Arm64)

Post by Korolev Michael »

@wilbert, do you mean the string is already in UTF-16 LE format?

Post by wilbert »

Korolev Michael wrote: @wilbert, do you mean the string is already in UTF-16 LE format?
Not exactly.
Internally, PureBasic uses either ASCII (character range 0-255) or Unicode (character range 0-65535).
So Unicode characters above 65535 aren't supported in Unicode mode, whereas UTF-16 LE does support them (as surrogate pairs).
In Unicode mode, every character takes two bytes of memory.
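
The two-bytes-per-character layout can be checked with a few built-in commands. This is a small sketch, assuming a Unicode build; the variable name text$ is arbitrary. In such a build the output should be 2, 6, 12 and 6, followed by the byte values 111 and 0 for the first character.

```purebasic
; Sketch only: compile as a Unicode executable.
text$ = "onetwo"

Debug SizeOf(Character)                  ; bytes per character in the current compile mode
Debug Len(text$)                         ; number of characters
Debug StringByteLength(text$)            ; bytes in the native in-memory format (no terminator)
Debug StringByteLength(text$, #PB_UTF8)  ; bytes the same text would need as UTF-8

; Raw bytes of the first character: the second byte is zero for plain ASCII letters,
; which is what an API expecting a multibyte string sees as an end of string.
Debug PeekA(@text$)
Debug PeekA(@text$ + 1)
```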

Post by Korolev Michael »

OK, I just discovered that PB uses UCS-2, not UTF-16. But knowing this doesn't make Unicode any clearer to me.

The string is stored in UTF-8. Does a PureBasic app compiled in Unicode mode load the string into memory and automatically convert it to UCS-2?
Last edited by Korolev Michael on Mon Aug 19, 2013 5:25 pm, edited 1 time in total.

Post by wilbert »

Korolev Michael wrote: The string is stored in UTF-8. Does a PureBasic app compiled in Unicode mode load the string into memory and automatically convert it to UCS-2?
The code you write in the editor is stored on disk in UTF-8 format if you choose that option. This lets you use non-ASCII characters in your code.

When compiled, strings are stored in the generated assembler code as ASCII or Unicode, not UTF-8.
Basically, when you want to support other languages, compile your application with Unicode enabled and don't worry about UTF-8.
UTF-8 is only relevant if you access files on disk or embed a UTF-8 file using IncludeBinary.
In those cases you can use the PureBasic commands that accept a #PB_UTF8 flag to work with the data.
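
As a rough illustration of the #PB_UTF8 flag, the sketch below decodes UTF-8 bytes held in memory into a native string with PeekS. The DataSection stands in for an embedded UTF-8 file, and the label Utf8Text is hypothetical. The bytes are written out by hand and zero-terminated so that PeekS knows where to stop. A real file embedded with IncludeBinary would not necessarily end in a zero byte.

```purebasic
; Sketch only: compile as a Unicode executable.
text$ = PeekS(?Utf8Text, -1, #PB_UTF8)   ; decode UTF-8 bytes into a native string
Debug text$
Debug Len(text$)                          ; counts characters, not UTF-8 bytes

DataSection
  Utf8Text:                       ; hypothetical label standing in for embedded UTF-8 data
  Data.a $63, $61, $66, $C3, $A9  ; "café" as UTF-8: é is the two bytes C3 A9
  Data.a 0                        ; terminating zero byte
EndDataSection
```

Here five UTF-8 bytes become four characters in memory, each stored in two bytes in a Unicode build.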