Problems converting double encoded strings

Post by Kukulkan »

Hello,

I'm compiling a Unicode executable and I have to work with some program parameters in UTF-8. As a result, the UTF-8 string ends up double encoded in PB (a UTF-8 byte sequence stored as a Unicode string).

For the conversion, I try to poke the input as single bytes and then peek it back as Unicode by reading it as UTF-8, but I get different results on different platforms (Linux, Windows). This is my test source:

Code: Select all

; PB 5.22 LTS, 32 Bit, Unicode (Windows, Linux)
; Here is the UTF-8 representation of "©äΣ丌‡".
; Normally retrieved by ProgramParameter() function.
; Now it is double encoded (as executable is unicode).
t.s = "Â©Ã¤Î£ä¸Œâ€¡"

; DEBUG: output original memory content
l.i = StringByteLength(t.s)
Orig.s = ""
For x.i = 0 To l.i-1
  Orig.s = Orig.s + RSet(Hex(PeekA(@t + x)), 2, "0") + " "
Next
Debug "Original Bytes: [" + Trim(Orig.s) + "]"

; Convert from unicode to single bytes UTF-8
sbp.s = Space(StringByteLength(t) / 2 + 1)
FillMemory(@sbp, StringByteLength(sbp), 0)
pbytes.i = PokeS(@sbp, t.s, -1, #PB_Ascii)

; DEBUG: output poked memory content
pked.s = ""
For x.i = 0 To pbytes.i-1
  pked.s = pked.s + RSet(Hex(PeekA(@sbp + x)), 2, "0") + " "
Next
Debug "Poked Bytes: [" + Trim(pked.s) + "]"

; Convert the single-byte UTF-8 data back to a Unicode string
ret.s = PeekS(@sbp, pbytes.i, #PB_UTF8)

; DEBUG: Output result (should be ©äΣ丌‡)
Debug "Result: [" + ret.s + "]"

End
These are the results on Linux:
Original Bytes: [C2 00 A9 00 C3 00 A4 00 CE 00 A3 00 E4 00 B8 00 52 01 E2 00 AC 20 A1 00]
Poked Bytes: [C2 A9 C3 A4 CE A3 E4 B8 E2 A1]
Result: [©ä]

These are the results on Windows:
Original Bytes: [C2 00 A9 00 C3 00 A4 00 CE 00 A3 00 E4 00 B8 00 52 01 E2 00 AC 20 A1 00]
Poked Bytes: [C2 A9 C3 A4 CE A3 E4 B8 8C E2 80 A1]
Result: [©äS?‡]

I assume the PokeS() command is making the difference, but why? And is there a reliable "PB only" way to do this cross-platform?

(I know the debug window is the reason the result is not displayed correctly on Windows; with file logging it seems okay on Windows.)

Best,

Kukulkan

Post by IdeasVacuum »

It looks as though you have assumed that UTF-8 is single byte. Only the first 128 characters, which match ASCII, are single byte; beyond that, UTF-8 is a variable-width encoding.
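
To make the variable width concrete, here is a minimal sketch that prints how many bytes each character of the test string needs in UTF-8. It assumes a Unicode executable and a source file saved as UTF-8; the variable names are only illustrative.

```purebasic
; Sketch only: assumes a Unicode executable and a UTF-8 encoded source file.
sample.s = "©äΣ丌‡"
total.i = 0

For i.i = 1 To Len(sample)
  ch.s = Mid(sample, i, 1)
  ; Bytes this character needs in UTF-8 (terminating null not counted)
  n.i = StringByteLength(ch, #PB_UTF8)
  total + n
  Debug ch + " -> " + Str(n) + " byte(s)"
Next

Debug "Total: " + Str(total) + " byte(s)"
```

The first three characters take 2 bytes each and the last two take 3 bytes each, which adds up to the 12 bytes seen in the "Poked Bytes" output on Windows.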

Is it not possible to work with Unicode exclusively or, better, UTF-8 exclusively?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.

Post by Kukulkan »

Hi,

no, I am not making that assumption. I treat every byte of the UTF-8 encoding as a single character. In my example, I see it as "Â©Ã¤Î£ä¸Œâ€¡": 12 single-byte characters that represent a UTF-8 encoded string of 5 characters. I know that a single Unicode character may need up to 4 bytes in such a string, but that is not the point.

I do not understand the difference between Linux and Windows. I assume some extra character conversion happens on Linux that does not happen on Windows?

But in fact, I'm looking for a quick, reliable and cross-platform way to decode the double encoded values (unicode + utf-8) into normal unicode.

Best,

Kukulkan

Post by IdeasVacuum »

...looks like a tricky task, Kukulkan. I don't know how Windows and Linux differ in handling Unicode; perhaps it has to do with endianness (byte order).

Your test parameters might need to be more demanding too - at least, all the parameters and combinations thereof that the app allows. Hopefully, someone else has hit this issue and can show you a solution.

Post by Thunder93 »

Just a glance, but isn't the Linux result cut short, as shown in the original post?

The Windows result doesn't even show what the output is supposed to be. However, if you use the x86 compiler, the output is exactly as expected.

Code: Select all

; DEBUG: Output result (should be ©äΣ丌‡)
PB x86
Original Bytes: [C2 00 A9 00 C3 00 A4 00 CE 00 A3 00 E4 00 B8 00 52 01 E2 00 AC 20 A1 00]
Poked Bytes: [C2 A9 C3 A4 CE A3 E4 B8 8C E2 80 A1]
Result: [©äΣ丌‡]

PB x64
Original Bytes: [C2 00 A9 00 C3 00 A4 00 CE 00 A3 00 E4 00 B8 00 52 01 E2 00 AC 20 A1 00]
Poked Bytes: [C2 A9 C3 A4 CE A3 E4 B8 8C E2 80 A1]
Result: [©äS?‡]
ʽʽSuccess is almost totally dependent upon drive and persistence. The extra energy required to make another effort or try another approach is the secret of winning.ʾʾ --Dennis Waitley

Post by Kukulkan »

I found the issue: PureBasic is doing some implicit translation. PB always assumes Windows-1252 encoding, but on Linux it is ISO-8859-1 or something else. For example, the UTF-8 data contains the byte $8C, which is the Œ character in Windows-1252 but does not exist in ISO-8859-1. PureBasic therefore stores it as the Unicode character $0152, and this causes trouble during the conversion: the Linux "Poked Bytes" output above is missing 8C (and likewise 80, the € sign, stored as $20AC).
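
To illustrate the mismatch, here is a minimal sketch that avoids the platform-dependent behaviour of PokeS() with #PB_Ascii by mapping each character back to its byte explicitly. It is only a sketch under assumptions: a Unicode executable, and input that was mis-decoded as Windows-1252 or ISO-8859-1. Cp1252Byte() and DecodeDoubleEncoded() are hypothetical helper names, not PB functions, and this is not the Linux-specific solution mentioned below.

```purebasic
; Hypothetical helper: map one Unicode character back to its Windows-1252 byte.
; Only the $80-$9F range differs from ISO-8859-1, so it needs an explicit table.
Procedure.i Cp1252Byte(c.i)
  Select c
    Case $20AC : ProcedureReturn $80
    Case $201A : ProcedureReturn $82
    Case $0192 : ProcedureReturn $83
    Case $201E : ProcedureReturn $84
    Case $2026 : ProcedureReturn $85
    Case $2020 : ProcedureReturn $86
    Case $2021 : ProcedureReturn $87
    Case $02C6 : ProcedureReturn $88
    Case $2030 : ProcedureReturn $89
    Case $0160 : ProcedureReturn $8A
    Case $2039 : ProcedureReturn $8B
    Case $0152 : ProcedureReturn $8C
    Case $017D : ProcedureReturn $8E
    Case $2018 : ProcedureReturn $91
    Case $2019 : ProcedureReturn $92
    Case $201C : ProcedureReturn $93
    Case $201D : ProcedureReturn $94
    Case $2022 : ProcedureReturn $95
    Case $2013 : ProcedureReturn $96
    Case $2014 : ProcedureReturn $97
    Case $02DC : ProcedureReturn $98
    Case $2122 : ProcedureReturn $99
    Case $0161 : ProcedureReturn $9A
    Case $203A : ProcedureReturn $9B
    Case $0153 : ProcedureReturn $9C
    Case $017E : ProcedureReturn $9E
    Case $0178 : ProcedureReturn $9F
  EndSelect
  If c < $100
    ProcedureReturn c   ; same value in ISO-8859-1 and Windows-1252
  EndIf
  ProcedureReturn '?'   ; not representable as a single byte
EndProcedure

; Hypothetical helper: rebuild the raw UTF-8 bytes, then read them as UTF-8.
Procedure.s DecodeDoubleEncoded(in.s)
  Protected n.i, i.i, result.s, *buf
  n = Len(in)
  *buf = AllocateMemory(n + 1)   ; zero-filled, so the data stays null-terminated
  If *buf = 0
    ProcedureReturn ""
  EndIf
  For i = 1 To n
    PokeA(*buf + i - 1, Cp1252Byte(Asc(Mid(in, i, 1))))
  Next
  result = PeekS(*buf, -1, #PB_UTF8)
  FreeMemory(*buf)
  ProcedureReturn result
EndProcedure

; Intended to give the original five characters for the test string.
Debug DecodeDoubleEncoded("Â©Ã¤Î£ä¸Œâ€¡")
```

Because the table is part of the program, the result should not depend on which single-byte codepage PokeS() picks on a given platform.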

I think PB has a bug in the ProgramParameter() function. It treats the command-line parameters as if they were Unicode, but my locale is set to de_DE.UTF8. It should therefore treat the parameters as UTF-8 and convert them from that encoding to Unicode in the executable, but it doesn't. I found a way on Linux to get the real command line; once I have finished the function, I will post it here.

Best,

Kukulkan