UTF-8: I don't get it...

Posted: Tue Mar 09, 2010 3:13 am
by klaver

Code:

mem = AllocateMemory(64)
PokeS(mem, "ĄąĆćĘęŁłŃńŚśŹźŻż", -1, #PB_UTF8)
For i=0 To 16
  Debug Hex(PeekB(mem+i*2) & $FF) +" "+ Hex(PeekB(mem+i*2+1) & $FF)
Next
Debug PeekS(mem, -1, #PB_UTF8)

Debug PeekS(?Start, -1, #PB_UTF8)

DataSection
  Start:
  Data.b $C4, $84;Ą
  Data.b $C4, $85;ą
  Data.b $C4, $86;Ć
  Data.b $C4, $87;ć
  Data.b $C4, $98;Ę
  Data.b $C4, $99;ę
  Data.b $C5, $81;Ł
  Data.b $C5, $82;ł
  Data.b $C5, $83;Ń
  Data.b $C5, $84;ń
  Data.b $C5, $9A;Ś
  Data.b $C5, $9B;ś
  Data.b $C5, $B9;Ź
  Data.b $C5, $BA;ź
  Data.b $C5, $BB;Ż
  Data.b $C5, $BC;ż
  Data.b $00
EndDataSection
Please run this code with Unicode checked and unchecked. Why is nothing working correctly here?
Why is "ń" converted into "C3 B1" when it should be "C5 84"? Link: http://www.utf8-chartable.de/unicode-utf8-table.pl
Why is the string from the DataSection read correctly in Unicode mode, while in ANSI mode it comes out as "?????????????????"?

Can someone explain this in simple words for me? :roll:
Thanks in advance.

Re: UTF-8: I don't get it...

Posted: Tue Mar 09, 2010 6:34 am
by Joakim Christiansen
klaver wrote: Why is the string from DataSection read as it should in Unicode mode, but in ANSI it's "?????????????????"?
Because all those characters are outside the standard ASCII range (0-127).
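
A small sketch of that point, written in Python rather than PureBasic so it can be run anywhere. It takes the bytes from the DataSection in the first post, decodes them as UTF-8, and then forces the result into plain ASCII. The variable names are made up for illustration, and replacing unencodable characters with "?" only imitates what the ANSI build shows; the thread does not confirm how PureBasic does it internally.

```python
# Bytes copied from the DataSection in the first post (trailing $00 left out).
data = bytes([
    0xC4, 0x84, 0xC4, 0x85, 0xC4, 0x86, 0xC4, 0x87,
    0xC4, 0x98, 0xC4, 0x99, 0xC5, 0x81, 0xC5, 0x82,
    0xC5, 0x83, 0xC5, 0x84, 0xC5, 0x9A, 0xC5, 0x9B,
    0xC5, 0xB9, 0xC5, 0xBA, 0xC5, 0xBB, 0xC5, 0xBC,
])

# Decoding as UTF-8 gives 16 characters, each stored as two bytes.
text = data.decode("utf-8")

# None of them has a code point below 128, so none fits into ASCII.
# errors="replace" substitutes "?" for every character that cannot be encoded.
as_ascii = text.encode("ascii", errors="replace")

# Print code points instead of the letters so the output is console-safe.
print(" ".join(f"U+{ord(c):04X}" for c in text))
print(as_ascii.decode("ascii"))
```

Every code point printed is above U+007F, which is why the second line consists of nothing but question marks.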

Re: UTF-8: I don't get it...

Posted: Wed Mar 10, 2010 2:26 pm
by klaver
Thanks, but why are characters converted into UTF-8 incorrectly?

Re: UTF-8: I don't get it...

Posted: Wed Mar 10, 2010 2:40 pm
by srod
Have you set the IDE file format to UTF-8?

Re: UTF-8: I don't get it...

Posted: Wed Mar 10, 2010 3:01 pm
by Trond
Consider:
1. The integrated debugger is compiled in ASCII mode and can't display Unicode-only characters.
2. When you compile your program in ASCII mode, literal strings that contain Unicode characters will inevitably lose those characters.
3. When you compile in ASCII mode and peek UTF-8 strings, the result is put into a PB ASCII string, so Unicode-only characters are lost.
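
One possible explanation for the "C3 B1" from the first post, sketched in Python. It rests on two assumptions that nobody in the thread has confirmed: that the source file was saved in the Windows-1250 (Central European) code page, and that the ASCII-mode build treats each byte of the literal as a Windows-1252 / Latin-1 character before converting it to UTF-8. The function name and its parameters are hypothetical.

```python
def ansi_round_trip(ch, saved_as="cp1250", read_as="cp1252"):
    """Simulate a character stored in one 8-bit code page, misread in
    another, and then converted to UTF-8. Both code pages are assumptions."""
    raw = ch.encode(saved_as)        # the single byte that ends up in the file
    misread = raw.decode(read_as)    # the same byte seen through another code page
    return misread.encode("utf-8")   # what a UTF-8 conversion then produces


def as_hex(data):
    """Format bytes the way the thread writes them, e.g. 'C5 84'."""
    return " ".join(f"{b:02X}" for b in data)


n_acute = "\u0144"                   # the letter "ń"
expected = n_acute.encode("utf-8")   # correct UTF-8 encoding
actual = ansi_round_trip(n_acute)    # result of the assumed mix-up

print(as_hex(expected))  # C5 84
print(as_hex(actual))    # C3 B1
```

Under those assumptions "ń" is the single byte F1 in Windows-1250, the same byte means "ñ" in Windows-1252, and "ñ" encodes to C3 B1 in UTF-8. That matches the value reported in the first post and would fit srod's question about the IDE file format.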