[Implemented] UTF-8 length
It would be nice to have a function that tells us how long a UTF-16 string (a normal PB string in unicode mode) will be once converted to UTF-8, so we can, for example, allocate the needed memory and PokeS the string into that memory with the UTF-8 flag. Because UTF-8 can produce even bigger strings than UTF-16 when higher codepoints are used, this would be the only way to allocate enough memory and prevent memory access errors.
Visit www.sceneproject.org
Does anybody know a way to do this, e.g. storing a UTF-8 string in a pre-allocated memory area with the size of the string (in UTF-8 encoding, with codepoints > 127)?
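Until something built in exists, the length pass is easy to write yourself. A C sketch (not PB, and the function name is my own) that walks the UTF-16 code units and sums the UTF-8 byte counts before any conversion happens:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte length the UTF-8 encoding of a 0-terminated UTF-16 string will
 * need, terminator not included. Unpaired surrogates are counted as the
 * 3-byte replacement character U+FFFD. */
size_t Utf8LengthFromUtf16(const uint16_t *s)
{
    size_t bytes = 0;
    while (*s) {
        uint16_t u = *s++;
        if (u < 0x80) {
            bytes += 1;   /* ASCII: 1 byte */
        } else if (u < 0x800) {
            bytes += 2;   /* U+0080..U+07FF: 2 bytes */
        } else if (u >= 0xD800 && u <= 0xDBFF && *s >= 0xDC00 && *s <= 0xDFFF) {
            bytes += 4;   /* surrogate pair -> codepoint above U+FFFF: 4 bytes */
            s++;          /* consume the low surrogate too */
        } else {
            bytes += 3;   /* rest of the BMP (or a lone surrogate -> U+FFFD) */
        }
    }
    return bytes;
}
```

Allocate that many bytes plus one for the terminator, then convert into the buffer.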
Well! You should be able to get the actual memory size of a UTF-8 string by doing:
MemoryStringLength(*String,#PB_Ascii)
Yeah, I'm not joking: afaik #PB_Ascii will count every byte and stop at the first binary 0.
As far as I know, all PB strings, whether they are Ascii or UTF-8, are 0-terminated.
Using #PB_Ascii simply forces PB to check the length as if it were ascii.
UTF-8 strings do not contain a binary 0 (except the terminator at the end, like ascii strings have).
There may however be an api call to get the estimated length,
but it won't save you much time really.
It's pretty much just as easy to simply convert and then use MemoryStringLength(*String, #PB_Ascii)
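That's the whole trick in C terms: because no byte of a multi-byte UTF-8 sequence is ever 0, a plain `strlen()` on the converted buffer already reports the byte length. A minimal illustration, with the UTF-8 bytes written out by hand:

```c
#include <assert.h>
#include <string.h>

/* "héllo" encoded as UTF-8 by hand: 'h', then C3 A9 for 'é', then "llo".
 * strlen() counts bytes up to the first 0: 6 bytes for 5 characters. */
static const char utf8[] = "h\xC3\xA9llo";
```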
Btw! In unicode mode (Fred, correct me if I'm wrong),
strings are stored internally as UTF-8.
UTF-16 is not supported/used at all; Unicode is, however (always 16-bit pairs).
Unicode always uses 16 bits per character; do not confuse this with UTF-16, because it's not the same thing.
(should be easy to check though, just dump the string memory and see how it's stored)
Unicode strings should always have an even byte length, plus a binary 0 at the end, unless Fred decided to skip that for Unicode strings.
UTF-8, however, is stored just like a normal ascii string; bytes above 127 simply belong to multi-byte sequences.
I.e. UTF-8 is backwards compatible with 7-bit ascii (the lower half of Latin-1 / ISO 8859-1).
Very good, since MemoryStringLength() is only usable after you have put the string into memory, which you can't do because you don't know the size. ^^
And no, I think Unicode strings are stored as UTF-16 in memory, since that's the standard on Windows. However, I also think real UTF-16 would allow characters bigger than 2 bytes, which isn't the case for Windows wide chars; it would be nice to have some clarification though.
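For what it's worth, the "bigger than 2 bytes" case in UTF-16 is the surrogate pair mechanism: codepoints above U+FFFF are split across two 16-bit units. A C sketch of the math (the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Split a codepoint above U+FFFF into a UTF-16 surrogate pair. */
void Utf16SurrogatePair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;                            /* 20 bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: upper 10 bits */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: lower 10 bits  */
}
```

For example, U+1D11E (the musical G clef) becomes the pair D834 DD1E, i.e. 4 bytes in UTF-16 and also 4 bytes in UTF-8.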
*scratches head*
I could have sworn that Fred once said that PB used UTF-8 internally.
*really confused now*
And you're right, Nik, UTF-16 can potentially be larger than 2 bytes per character.
Unicode however is assumed to be 16-bit (similar to Windows widechars but not identical).
I believe there also exists a 32-bit encoding (UTF-32), but by Unicode most people mean the 16-bit characters, or UTF-8. UTF-16 is very rare.
Had to dig up a few links to maintain my own sanity here:
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/UTF-16
Fred, am I correct in assuming that PB in unicode mode uses UCS-2 as its native unicode storage format?
Gah *head explodes*
This is why I stick religiously to UTF-8, as it's backwards compatible with ascii.
Plain 7-bit ascii text can be read transparently as if it were UTF-8,
and UTF-8 is endian safe; as far as I can tell it's the only Unicode storage format that is. (UTF-16 is endian dependent.)
I'd still like to know (and have it mentioned in the PB manual) what PB uses internally when storing unicode strings,
as one might then optimize handling of lots of strings/text more easily.
So one can avoid converting from UTF-16 to UTF-8 to... well, you get the point.
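To make the endianness point concrete, here are the bytes for 'é' (U+00E9) written out by hand in each encoding; the UTF-16 bytes swap with byte order, the UTF-8 bytes never change:

```c
#include <assert.h>

/* 'é' = U+00E9, serialized by hand */
static const unsigned char utf8_eacute[]    = { 0xC3, 0xA9 }; /* same bytes on any machine */
static const unsigned char utf16le_eacute[] = { 0xE9, 0x00 }; /* little-endian unit order  */
static const unsigned char utf16be_eacute[] = { 0x00, 0xE9 }; /* big-endian unit order     */

/* UTF-16 data needs a BOM or an agreed byte order to be read back correctly;
 * UTF-8 is a plain byte stream, so there is nothing to swap. */
```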
Someone really should add the http://en.wikipedia.org/wiki/UTF-8
link to the forum FAQ thread if it hasn't been added yet.
Interesting reading, to say the least.

rescator, did you check the Survival Guide?
unicode is a list of possible characters (codepoints) and has nothing to do with encoding
utf8 = 1..4 bytes for each codepoint
utf16 = 2..4 bytes for each codepoint
windows (xp) uses ucs2, which is almost but not entirely utf16; i'm not sure if it does 4-byte codepoints (dwcs may apply to pre-xp, xp and later may do ucs2, but i cannot find any sure statement)
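Those ranges, spelled out as a pair of small C helpers (my own names; valid codepoints only, i.e. up to U+10FFFF):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to encode one codepoint in UTF-8. */
int Utf8Bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII            */
    if (cp < 0x800)   return 2;   /* U+0080..U+07FF   */
    if (cp < 0x10000) return 3;   /* rest of the BMP  */
    return 4;                     /* U+10000..U+10FFFF */
}

/* Bytes needed to encode one codepoint in UTF-16. */
int Utf16Bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* one unit, or a surrogate pair */
}
```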
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
utf8's original design indeed allowed even more than 4 bytes per char (up to 6, though the current spec caps it at 4); in practice its main disadvantage versus utf16 is that characters from U+0800 to U+FFFF take 3 bytes instead of 2.