UTF8 and strings...

Joakim Christiansen · Tue Jan 12, 2010 12:44 am

EDIT: For a clearer explanation and the discovery of a bug, jump down to this post: http://www.purebasic.fr/english/viewtop ... 44#p312444

Code: Select all

string$ = "Mélissa"
Debug PeekS(@string$,StringByteLength(string$),#PB_UTF8)

Okay, at first I was unsure whether this was a bug, since I don't know much about UTF-8. But reading Wikipedia, I see that the é character is illegal in UTF-8 since it only supports the 7-bit ASCII codes (0-127). EDIT: What I meant was that it can't be READ as UTF-8.

But should PureBasic really just cut it there?

Oh btw:

RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[7] The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." Many UTF-8 decoders throw an exception if a string has an error in it. In recent times this has been found to be impractical: being unable to work with data means you cannot even try to fix it. One example was Python 3.0 which would exit immediately if the command line had invalid UTF-8 in it.[8] A more useful solution is to translate the first byte to a replacement and continue parsing with the next byte.

http://en.wikipedia.org/wiki/UTF-8#Inva ... _sequences
So "a more useful solution is to translate the first byte to a replacement and continue parsing with the next byte" and I agree. Well, at least that or an error message?

helpy · Tue Jan 12, 2010 10:28 am

Pay attention to the following:

You can choose the encoding of the source file (Plain Text or UTF-8). This option only determines which characters you can use in the source code.
If you choose "Plain Text" as the source code encoding, then the usable characters depend on the font (code page) in use.
The compiler option "Create unicode executable" determines the default character set of strings in the compiled program: ASCII or Unicode!
If you use "UTF-8" as the source code encoding and the compiler option "Create unicode executable" is switched off, all characters in the source code that cannot be represented in the ASCII character set will get lost.

Your example will never work:

source code encoding = Plain Text // compiler option "Create unicode executable" is off
==> "é" can be handled in Plain Text
==> The string will be stored in memory with the ASCII character set

Code: Select all

string$ = "Mélissa"
Debug PeekS(@string$, -1) ; will work
Debug PeekS(@string$, -1, #PB_ASCII) ; will work
Debug PeekS(@string$, -1, #PB_UTF8) ; will not work
Debug PeekS(@string$, -1, #PB_Unicode) ; will not work

source code encoding = Plain Text // compiler option "Create unicode executable" is on
==> "é" can be handled in Plain Text
==> The string will be stored in memory with the Unicode character set

Code: Select all

string$ = "Mélissa"
Debug PeekS(@string$, -1) ; will work
Debug PeekS(@string$, -1, #PB_ASCII) ; will not work
Debug PeekS(@string$, -1, #PB_UTF8) ; will not work
Debug PeekS(@string$, -1, #PB_Unicode) ; will work

source code encoding = UTF-8 // compiler option "Create unicode executable" is off
==> "é" can be handled in UTF-8
==> The string will be stored in memory with the ASCII character set

Code: Select all

string$ = "Mélissa"
Debug PeekS(@string$, -1) ; will work
Debug PeekS(@string$, -1, #PB_ASCII) ; will work
Debug PeekS(@string$, -1, #PB_UTF8) ; will not work
Debug PeekS(@string$, -1, #PB_Unicode) ; will not work

source code encoding = UTF-8 // compiler option "Create unicode executable" is on
==> "é" can be handled in UTF-8
==> The string will be stored in memory with the Unicode character set

Code: Select all

string$ = "Mélissa"
Debug PeekS(@string$, -1) ; will work
Debug PeekS(@string$, -1, #PB_ASCII) ; will not work
Debug PeekS(@string$, -1, #PB_UTF8) ; will not work
Debug PeekS(@string$, -1, #PB_Unicode) ; will work

==> In your case just use PeekS(@string$), because the ASCII or Unicode character set (#PB_ASCII or #PB_Unicode) will be chosen according to the compiler option!

==> If you try these tests with different fonts in the editor and in the debugger output, the results can differ if the debugger font cannot display the same characters as the editor font! The Unicode option and UTF-8 encoding make no sense if you use a non-Unicode font to display strings.

==> I suggest always using UTF-8 for file encoding and "Create unicode executable" as the compiler option! Today there are only rare application areas where you should use plain ASCII file encoding with "Create unicode executable" switched off!

==> Use the optional flags (#PB_ASCII, #PB_UTF8, #PB_Unicode, #PB_UTF16, ...) only in the following cases:

... when reading strings from files or network streams, ...
... when writing strings to files or network streams, ...

==> Pay attention to the fact that PeekS( ..., #PB_xxxxx) returns the string in ASCII or Unicode depending on the compiler option that was used ... regardless of the source format #PB_xxxxx !!!

Joakim Christiansen wrote:Okay, at first I was unsure whether this was a bug, since I don't know much about UTF-8. But reading Wikipedia, I see that the é character is illegal in UTF-8 since it only supports the 7-bit ASCII codes (0-127).

That is WRONG! "é" is supported in UTF-8 ... but it uses more than one byte. A UTF-8 character can be between ONE and FOUR bytes long!
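The one-to-four-byte range is easy to check. The following is a small illustration in Python (not PureBasic); the four sample characters are chosen here only as examples of each length.

```python
# One sample character for each possible UTF-8 sequence length.
samples = ["M", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```

"M" stays a single byte identical to its ASCII code, while "é" becomes the two bytes C3 A9.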

Hope this helps a step further ...
cu, helpy

rsts · Tue Jan 12, 2010 5:59 pm

http://www.joelonsoftware.com/articles/Unicode.html

cheers

helpy · Tue Jan 12, 2010 6:24 pm

Cool! Thank you for this link!

Contest: Find the mistakes in my explanation

...

blueznl · Tue Jan 12, 2010 7:19 pm

http://www.xs4all.nl/~bluez/purebasic/p ... #2_unicode

Joakim Christiansen · Thu Jan 14, 2010 12:43 am

Lots of information, handy.

But I still wonder if PureBasic should cut the string at é. Is it intended behavior? Not that it is a problem or anything.
But as I said, it would be better to either continue parsing the string or throw an error message.
EDIT: To be very clear; I talk about behavior of this function with Unicode in the compiler options set to OFF.

Actually, my father used some similar code to download web pages and store them in a string, and this caused some seriously messed-up results in the program. What happened was that a random part of the string after such a character was cut away, and then it continued with the rest. This was not easy to debug and must also be some bug in PureBasic. I can provide the code if the team wants to look at it, but reproducing the bug isn't very easy.

helpy · Thu Jan 14, 2010 8:11 am

Joakim Christiansen wrote:But I still wonder if PureBasic should cut the string at é. Is it intended behavior? Not that it is a problem or anything.

The string is NOT cut at é !!! It is cut after the first character !!! See the following example:

Code: Select all

string$ = "Mmmmmmélissa"
Debug PeekS(@string$,StringByteLength(string$),#PB_UTF8)

Explanation:
1. If the compiler option "Create unicode executable" is switched on, the string is stored in string memory in the Unicode encoding UCS-2 (UCS = "Universal Character Set")!

2. This means that each character needs TWO bytes in memory!!! If you analyzed the memory, you would see that the string "Mélissa" is stored like this:

Code: Select all

00390668  4D 00 E9 00 6C 00 69 00 73 00 73 00 61 00        M.é.l.i.s.s.a.

3. If you read this memory with PeekS(@string$,StringByteLength(string$),#PB_UTF8), the byte sequence at the memory location is interpreted as UTF-8 and NOT as UCS-2. This means that "hex. 4D" is interpreted as "M" and the second byte "hex. 00" as the NULL character (= end of string)!

4. Your suggestion that "it would be better to either continue parsing the string or throw an error message" would definitely be wrong, because the PureBasic function PeekS(@string$,StringByteLength(string$),#PB_UTF8) does exactly the right thing! PureBasic cannot throw an error if you use the wrong encoding! #PB_UTF8 is definitely wrong in this case!

If you use PeekS(..., ..., CharacterSet) ==> YOU have to choose the right constant! YOU are responsible for telling PureBasic how the memory location passed to PeekS() should be interpreted.
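The effect described in points 1 to 3 can be reproduced outside PureBasic. This is a rough sketch in Python with two assumptions: UTF-16 little-endian stands in for UCS-2 (they produce the same bytes for these characters), and the hypothetical helper `read_c_string_as_utf8` models a reader that stops at the first NULL byte, as the explanation above says PeekS() does.

```python
def read_c_string_as_utf8(memory: bytes) -> str:
    """Decode bytes as UTF-8, stopping at the first NULL byte (hypothetical helper)."""
    terminated = memory.split(b"\x00", 1)[0]
    return terminated.decode("utf-8")

# Two bytes per character, low byte first, as in the memory dump above.
ucs2_memory = "Mélissa".encode("utf-16-le")
hex_dump = ucs2_memory.hex(" ").upper()
print(hex_dump)  # 4D 00 E9 00 6C 00 69 00 73 00 73 00 61 00

# Read as UTF-8, the second byte (00) already ends the string.
print(read_c_string_as_utf8(ucs2_memory))                           # M
print(read_c_string_as_utf8("Mmmmmmélissa".encode("utf-16-le")))  # M
```

Both strings come back as "M", so the cut happens after the first character and has nothing to do with é.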

cu, helpy

[edit]
Sorry for the big red letters!
[/edit]

djes · Thu Jan 14, 2010 10:59 am

rsts wrote:http://www.joelonsoftware.com/articles/Unicode.html

Really nice link, a must read! Thanx!

Joakim Christiansen · Fri Jan 15, 2010 2:41 am

helpy wrote:
Joakim Christiansen wrote:But I still wonder if PureBasic should cut the string at é. Is it intended behavior? Not that it is a problem or anything.
The string is NOT cut at é !!! It is cut after the first character !!! See following example:
Code: Select all
string$ = "Mmmmmmélissa"
Debug PeekS(@string$,StringByteLength(string$),#PB_UTF8)

If you have chosen to compile your executable as UNICODE, then yes, all you say is true. Sorry if I was unclear, but I meant the behavior when UNICODE is NOT selected in the compiler options; then what I say will make more sense. Sorry for the confusion and the maybe silly thread to begin with, but I think it's okay for me to be curious about this. And I don't need lessons in big red text, btw.

helpy · Fri Jan 15, 2010 12:10 pm

Hello Joakim Christiansen,

Sorry for the big red letters ... I should not have done it !

Code: Select all

string$ = "Mélissa"
Debug PeekS(@string$,StringByteLength(string$),#PB_UTF8)

Just one question:
With Unicode option switched off the string "Mélissa" is not stored as UTF-8 string.
==> Why do you try to read it as UTF8 string?

cu, helpy

Joakim Christiansen · Sat Jan 16, 2010 8:03 am

helpy wrote:Just one question:
With Unicode option switched off the string "Mélissa" is not stored as UTF-8 string.
==> Why do you try to read it as UTF8 string?

True, well, it's some code found on the forum which my dad (whom I'm trying to get to use PureBasic) used when downloading web pages into a string, since most web pages are UTF-8 encoded, I guess.

But then apparently he came across a web page which used ASCII and had this é character somewhere in the text. Since the é character is ASCII code 233 (in the extended range, above 127), reading it as UTF-8 is not possible. And that caused some very strange behavior in his program.

So that's why I started thinking about this.
Sure, UTF-8 is backwards compatible with ASCII, so I just changed his code to read it as ASCII and the problem was solved.
But maybe it would be better for PureBasic to throw an error, or just jump over é.
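The fix described here (reading the page with a single-byte encoding when it is not UTF-8) can be sketched as a fallback. This is a Python illustration, not the PureBasic code from the thread. The helper `decode_page` is hypothetical, and Latin-1 is an assumption about what the page used; a real downloader should prefer the charset the server declares.

```python
def decode_page(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1 if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value 0-255 to a character, so this cannot fail.
        return raw.decode("latin-1")

single_byte_page = bytes([84, 101, 115, 116, 32, 233, 32, 49, 50, 51])
utf8_page = "Test é 123".encode("utf-8")

print(decode_page(single_byte_page))  # Test é 123
print(decode_page(utf8_page))         # Test é 123
```

Either way the caller gets the whole text back instead of a silently shortened string.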

Okay, time for some code to really test this stuff.
First run it with unicode off; then it cuts the string at é.
Now run it with unicode on; now it just jumps over é (this is the behavior I want).
Conclusion: both times it reads the ASCII string as UTF-8, so the result should have been the same.

Code: Select all

Debug PeekS(?string,?dataEnd-?string,#PB_UTF8)

DataSection
  string: ;"Test é 123" as ASCII (é which is in the extended range)
    Data.a 84, 101, 115, 116, 32, 233, 32, 49, 50, 51
  dataEnd:
EndDataSection

I admit my first post was not very well written, but at least we now understand each other, and as you see, my curiosity seems to have found a "bug", or at least something which looks like unintended behavior in PureBasic.
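To make the two observed outcomes concrete, here is a minimal sketch in Python that mimics them on the same ten bytes. The helper names are hypothetical, and this only imitates the results reported in this thread; it is not a claim about how PureBasic implements PeekS().

```python
# The same bytes as in the DataSection above: "Test é 123" with é as byte 233.
data = bytes([84, 101, 115, 116, 32, 233, 32, 49, 50, 51])

def decode_truncating(raw: bytes) -> str:
    """Mimic the 'unicode off' result: stop at the first invalid byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return raw[:err.start].decode("utf-8")

def decode_skipping(raw: bytes) -> str:
    """Mimic the 'unicode on' result: drop the invalid byte and continue."""
    return raw.decode("utf-8", errors="ignore")

print(repr(decode_truncating(data)))  # 'Test '
print(repr(decode_skipping(data)))    # 'Test  123'
```

The same input gives two different strings depending on the policy, which is exactly the inconsistency the test above exposes.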

blueznl · Sat Jan 16, 2010 8:06 pm

You're 100% sure it's only 233, and not 233 followed by a 0? Sounds weird to me...

helpy · Sat Jan 16, 2010 10:08 pm

==> Unicode/UTF-8 characterset table (german)

Unicode for "é" ==> U+00E9 ==> dec. 233
UTF-8 encoding for "é" ==> two bytes ==> hex. C3 A9

see also at:
==> http://demo.icu-project.org/icu-bin/convexp?conv=UTF-8
==> http://demo.icu-project.org/icu-bin/con ... ALL#layout

a Unicode converter:
==> http://macchiato.com/unicode/convert.html

another UTF-8 table:
==> http://home.tiscali.nl/t876506/utf8tbl.html

After looking over all this, the value (ONE byte) hex. E9 (dec. 233) is not a valid UTF-8 character! ... so the string ends there!
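Both facts (U+00E9 encodes as C3 A9, and a lone E9 is not valid) follow from the UTF-8 bit layout. Below is a minimal sketch in Python; the two helper functions are hypothetical names written for this illustration, not part of any library.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF use the two-byte form")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: upper 5 bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: lower 6 bits
    return bytes([lead, continuation])

def expected_sequence_length(first_byte: int) -> int:
    """Length of the sequence a byte announces (0 if it cannot start one)."""
    if first_byte < 0x80:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return 0

print(encode_two_byte(0xE9).hex(" "))  # c3 a9
print(expected_sequence_length(0xE9))  # 3
```

Read as UTF-8, the byte E9 announces a three-byte sequence, so it must be followed by two continuation bytes of the form 10xxxxxx. In the test string the next byte is a space (hex 20), which is why the sequence is ill-formed.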

cu, helpy

Joakim Christiansen · Sat Jan 16, 2010 10:35 pm

helpy wrote:After looking over all this, the value (ONE byte) hex. E9 (dec. 233) is not a valid UTF-8 character!

Yes I pointed that out too.

helpy wrote:... so the string ends there!

Hey, did you not try my example?!

blueznl wrote:You're 100% sure it's only 233, and not 233 followed by a 0? Sounds weird to me...

I am! I mean, if you refer to my latest example.

EDIT: Included this for fun:
(please look at the example above before this one)
The code below has nothing to do with the original discussion; it just tests how PureBasic handles a correct UTF-8 string instead.

Code: Select all

Debug PeekS(?string,?dataEnd-?string,#PB_UTF8)

DataSection
  string: ;"Test é 123" as UTF-8, é is now 195,169
    Data.a 84,101,115,116,32,195,169,32,49,50,51
  dataEnd:
EndDataSection

And this code works fine in every case, as it should.

helpy · Mon Jan 18, 2010 8:43 am

Joakim Christiansen wrote:
helpy wrote:... so the string ends there!
Hey, did you not try my example?!

... I must have been a blockhead

You are right there! The result of PeekS(..., ..., #PB_UTF8) is different depending on the unicode compiler switch!

At least the result should be the same!

cu, helpy
