[×] Unicode conversion...

Michael Vogel
Addict
Posts: 2823
Joined: Thu Feb 09, 2006 11:27 pm

Post by Michael Vogel »

Hm,
it seems that I'm doing something wrong, but what?!

I've bought a nice GPS gadget which can load geocaching files. The files are in a Unicode format (UTF-16 little-endian, judging by the FF FE header they start with). The problem is that the files use the wrong bytes for German umlauts, and this does not look good on my Garmin device...

So I started with this code:

Code:

Procedure RepairGPX(filename.s)

	Protected zeile.s

	#Wrong_ae_s="Ã¤"
	#Wrong_Ae_l="Ã"+Chr($84)

	#Wrong_oe_s="Ã¶"
	#Wrong_Oe_l="Ã"+Chr($96)

	#Wrong_ue_s="Ã¼"
	#Wrong_Ue_l="Ã"+Chr($9C)

	#Wrong_sz="Ã"+Chr($9f)


	ReadFile(0,filename)
	Debug ReadStringFormat(0); = 3

	While Eof(0)=0

		zeile=ReadString(0)

		If FindString(zeile,"Ã",1)

			Debug FindString(zeile,"Ã",1);		e.g. 17
			Debug FindString(zeile,"¼",1);	e.g. 18

			; Works:
			;n=1
			;Repeat
			;	n=FindString(zeile,"Ã¼",n)
			;	If n
			;		zeile=Left(zeile,n-1)+"ü"+Mid(zeile,n+2)
			;	EndIf
			;Until n=0

			; Does not work?!
			;zeile=ReplaceString(zeile,#Wrong_ae_s,"ä")
			;zeile=ReplaceString(zeile,#Wrong_Ae_l,"Ä")
			;zeile=ReplaceString(zeile,#Wrong_oe_s,"ö")
			;zeile=ReplaceString(zeile,#Wrong_Oe_l,"Ö")
			zeile=ReplaceString(zeile,#Wrong_ue_s,"ü")
			;zeile=ReplaceString(zeile,#Wrong_Ue_l,"Ü")
			;zeile=ReplaceString(zeile,#Wrong_sz,"ß")
			
			Debug zeile
		EndIf

	Wend

EndProcedure

RepairGPX("C:\...\GC1AT4F.GPX")
Unicode is enabled in the compiler settings and FindString works, but ReplaceString does not change anything :shock:

What am I doing wrong here?

Thanks,
Michael
Last edited by Michael Vogel on Sat Sep 05, 2009 10:45 am, edited 3 times in total.
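
An "Ã" followed by a second odd character is the usual signature of UTF-8 bytes that were read as one character per byte. The following is a minimal sketch in Python (not PureBasic, purely for illustration) of how such text can arise. The helper name `garble` is made up, and the sketch assumes each byte was widened as Latin-1, which matches the C3 00 9F 00 pattern reported later in this thread.

```python
def garble(text):
    """Encode text as UTF-8, then misread every byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


for ch in "äöüÄÖÜß":
    broken = garble(ch)
    # Show the code points of the two characters that replace each umlaut.
    print(ch, "->", " ".join(f"U+{ord(c):04X}" for c in broken))
```

The lowercase umlauts come out as printable pairs such as "Ã¼", while Ä, Ö, Ü and ß end in an invisible control character (U+0084, U+0096, U+009C, U+009F), which would explain why some of the constants in the post above are built with Chr().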

Post by Michael Vogel »

Hm,
everything works now with nearly all chars ('ä', 'ö', ...), but one ('ß') is still doing something strange :evil:

Code:

Procedure.s UTF8(s.s)

	Protected *Buffer
	Protected Result.s

	*Buffer=AllocateMemory(StringByteLength(s,#PB_UTF8)+1)
	PokeS(*Buffer,s,-1,#PB_UTF8)
	Result=PeekS(*Buffer,-1,#PB_Ascii)
	FreeMemory(*Buffer)

	ProcedureReturn Result

EndProcedure

Debug Utf8("ä")
s.s=Utf8("ä")
For i=0 To 3
	Debug Str(i)+": "+Str(PeekB(@s+i)&255)
Next i

Debug Utf8("ß")
s.s=Utf8("ß")
For i=0 To 3
	Debug Str(i)+": "+Str(PeekB(@s+i)&255)
Next i
The code compiled with PB4.3 (unicode enabled) brings up the following table:

ä
0: 195
1: 0
2: 164
3: 0

ß
0: 195
1: 0
2: 120
3: 1

The first part is OK, but shouldn't the 120/1 combination for 'ß' be 159/0?!

What's going on here?
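
One plausible explanation for the 120/1 pair, sketched in Python for illustration only: if the #PB_Ascii read interprets the bytes using the Windows-1252 code page (an assumption about the system, not something stated in this thread), then byte $9F is the character Ÿ (U+0178), and U+0178 stored as a 2-byte little-endian value is 78 01, i.e. 120 and 1. The helper name `ansi_roundtrip` is hypothetical.

```python
def ansi_roundtrip(ch):
    """UTF-8 encode, misread the bytes as Windows-1252, list the UTF-16LE bytes."""
    utf8_bytes = ch.encode("utf-8")            # e.g. C3 9F for 'ß'
    misread = utf8_bytes.decode("cp1252")      # each byte becomes one character
    return list(misread.encode("utf-16-le"))   # bytes as they sit in a 2-byte string


print("ä", ansi_roundtrip("ä"))
print("ß", ansi_roundtrip("ß"))
```

Under that assumption 'ä' yields 195, 0, 164, 0 and 'ß' yields 195, 0, 120, 1, the same values as in the table above. Byte $A4 maps to the same code point in Windows-1252 and Latin-1, which is why only 'ß' looks wrong.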

Post by Michael Vogel »

Seems to be a bug here, checked UTF tables and 'ß' should produce C3 9F which I can't get here?!
Am I doing something wrong or is it really a bug :shock:

Michael
srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Michael Vogel wrote: Seems to be a bug here, checked UTF tables and 'ß' should produce C3 9F which I can't get here?!
Am I doing something wrong or is it really a bug :shock:

Michael

But your code is converting to Ascii and then to Unicode (because of the Unicode switch!). You have lost the UTF-8 encoding!

The following produces $C39F okay:

Code:

Procedure.s UTF8(s.s) 

  Protected *Buffer 
  Protected Result.s 
  Protected byteLen

  byteLen = StringByteLength(s,#PB_UTF8)
  *Buffer=AllocateMemory(byteLen+1) 
  PokeS(*Buffer,s,-1,#PB_UTF8) 

  For i = 0 To byteLen-1
    Debug Hex(PeekB(*Buffer+i), #PB_Byte)
  Next

  FreeMemory(*Buffer) ; release the temporary UTF-8 buffer

EndProcedure 


Utf8("ß") 
I may look like a mule, but I'm not a complete ass.

Post by Michael Vogel »

srod wrote: The following produces $C39F okay [...]

I'm still confused. Your code is OK, but I still can't manage (not my best period :wink:) to apply the conversion twice, as seems to have happened with the geocache files...

Code:

Procedure.s UTF8(s.s)

	Protected *Buffer
	Protected Result.s
	Protected byteLen

	byteLen = StringByteLength(s,#PB_UTF8)
	*Buffer=AllocateMemory(byteLen+1)
	PokeS(*Buffer,s,-1,#PB_UTF8)

	Debug "---"+s+"---"
	For i = 0 To byteLen-1
		Debug Hex(PeekB(*Buffer+i), #PB_Byte)
	Next

	ProcedureReturn PeekS(*Buffer,-1,#PB_Ascii)

EndProcedure

Utf8(Utf8("ß"))
Instead of C39F, the bytes seen in the geocache files are C3009F00, so maybe I only have to "expand" your result by adding the 00 bytes. I will try this now...

Thanks for your help, srod!

Post by srod »

Remember that with the Unicode switch set, your function is returning a Unicode string regardless of whether you filled that string from a utf-8 buffer or an Ascii one etc. That is why you are seeing zeros.
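
To make the difference concrete, here is a small Python sketch (for illustration only, not PureBasic) that prints the same characters in both layouts. It assumes the 2-byte "Unicode" string format behaves like UTF-16 little-endian for these characters. The helper name `layouts` is made up.

```python
def layouts(ch):
    """Return (UTF-8 bytes, UTF-16LE bytes) of one character as hex strings."""
    return ch.encode("utf-8").hex(" "), ch.encode("utf-16-le").hex(" ")


for ch in ("ß", "„", "€"):
    utf8_hex, utf16_hex = layouts(ch)
    print(f"{ch}: UTF-8 = {utf8_hex} | UTF-16LE = {utf16_hex}")
```

In UTF-8 these characters take two or three bytes, whereas in the 2-byte layout each is a single 16-bit unit. The zero bytes only appear when a value below 256 is stored in such a unit.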

Post by Michael Vogel »

I'm still confused (and getting more and more grey hairs :? )...

I found some 3-byte codes (or 6-byte codes when also counting the zero bytes) within the wrong geocache files, but the procedure only generates 2-byte codes?!

I added two examples and the hex codes which should be returned (at least according to what I saw on Wikipedia). Can you help me once again? :roll:

Thanks,
Michael

Code:

Procedure.s UTF8(s.s) 

   Protected *Buffer 
   Protected Result.s 
   Protected byteLen 

   byteLen = StringByteLength(s,#PB_UTF8) 
   *Buffer=AllocateMemory(byteLen+1) 
   PokeS(*Buffer,s,-1,#PB_UTF8) 

   Debug "---"+s+"---" 
   For i = 0 To byteLen-1 
      Debug Hex(PeekB(*Buffer+i), #PB_Byte) 
   Next 

   ProcedureReturn PeekS(*Buffer,-1,#PB_Ascii) 

EndProcedure 

Utf8("ß");	c3 9f 
Utf8("„");	e2 80 9e
Utf8("€");	e2 82 ac
____
The code listing doesn't show the characters (the last line contains the euro symbol), but it should work with copy and paste...

Post by srod »

Michael, the procedure is clearly producing the correct UTF-8 bytes, that isn't a problem.

Your problem, as I have tried to point out above, is that your procedure is returning a PB string (via ProcedureReturn PeekS(*Buffer,-1,#PB_Ascii)).

Now PB strings are either in 1-byte Ascii format or 2-byte Unicode format depending on your compiler settings. PureBasic does not deal with strings internally in UTF-8 format (no sensible compiler would!).

This is why you are seeing 2-byte codes from your procedure return whilst in Unicode mode.

To continue working in UTF-8 you must work with a memory buffer containing your UTF-8 bytes and NOT work with a PB string variable! Again, a PB string variable can contain only Ascii or Unicode characters depending on the compiler settings.

Right, now that is understood ( :wink: ), the following adjustment to your code returns your UTF-8 buffer:

Code:

Procedure.i UTF8(s.s) 

   Protected *Buffer 
   Protected Result.s 
   Protected byteLen 

   byteLen = StringByteLength(s,#PB_UTF8) 
   *Buffer=AllocateMemory(byteLen+1) 
   PokeS(*Buffer,s,-1,#PB_UTF8) 

   ProcedureReturn *buffer

EndProcedure 

*buffer = Utf8("€");   e2 82 ac

;Let us take a look at the UTF-8 characters.
  *ptr.BYTE = *buffer

While *ptr\b
  Debug Hex(*ptr\b, #PB_Byte)
  *ptr+1
Wend

FreeMemory(*buffer)

Post by Michael Vogel »

Thanks, srod, for being so patient. Meanwhile I feel quite awful, but I'm still too silly to get it :shock:

Your last example outputs the following (I thought it would produce the 3-byte code now):
C2
80

And it doesn't matter whether I use Unicode or not. I give up :cry:

Michael


___
PS: I "solved" the whole geocaching thing now, but with a kind of brute-force technique :evil:

Code:

#MaxChars=127
Global Dim WrongChar.s(#MaxChars)
Global Dim CorrectChar.s(#MaxChars)

For i=0 To #MaxChars
	CorrectChar(i)=Chr(i+128)+Chr(0)

	Read.b Bytes

	For z=1 To Bytes
		Read.b b
		b=b&255
		WrongChar(i)=WrongChar(i)+Chr(b)+Chr(0)
	Next z

Next i
:
Data.b 3,$E2,$82,$AC;  	€ (#128)
Data.b 2,$C2,$81;
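
For comparison, the same kind of repair can be sketched without a lookup table. The Python below is only an illustration, not the PureBasic solution used here. It assumes the text has already been read as UTF-16 and that every damaged character is a UTF-8 byte sequence stored one byte per character. The names `repair` and `_SUSPECT` are made up.

```python
import re

# Runs that look like a UTF-8 lead byte plus its continuation bytes,
# each byte stored as one character in the range U+0080..U+00FF.
_SUSPECT = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def repair(text):
    """Replace byte-per-character UTF-8 runs with the characters they encode."""
    def fix(match):
        raw = match.group(0).encode("latin-1")   # back to the original bytes
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)                # not valid UTF-8: leave as is
    return _SUSPECT.sub(fix, text)


broken = "Grüße €".encode("utf-8").decode("latin-1")
print(repair(broken))
```

Text that is already correct, such as a lone "é", does not match the pattern and passes through unchanged. Legitimate text that happens to contain such a pair would be altered, so this is a heuristic rather than a safe general fix.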

Post by srod »

Brings up the 3 bytes okay here.

Make sure you have the Unicode switch set (in order for the € character to be embedded correctly in the data section of the exe) and alter the IDE preferences for the source file to be encoded in utf-8 format.

Also, make sure that the € character was copied from these forums to your source okay.
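
The C2 80 output reported earlier is consistent with a source-encoding mismatch. As a rough Python sketch (illustration only; how the PureBasic IDE and compiler actually handle this is an assumption here): in Windows-1252 the euro sign is the single byte $80, and if that byte is taken as code point U+0080, its UTF-8 form is exactly C2 80. The helper name `literal_as_utf8` and its parameters are hypothetical.

```python
def literal_as_utf8(ch, saved_as, read_as):
    """UTF-8 bytes of a literal saved in one encoding but read in another."""
    on_disk = ch.encode(saved_as)        # bytes written by the editor
    seen = on_disk.decode(read_as)       # what the compiler believes it read
    return seen.encode("utf-8")


print(literal_as_utf8("€", "cp1252", "latin-1").hex(" "))
print(literal_as_utf8("€", "utf-8", "utf-8").hex(" "))
```

The first call gives c2 80, the second the expected e2 82 ac, which fits the advice above to save the source file as UTF-8.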

Post by Michael Vogel »

srod wrote: ... the IDE preferences for the source file to be encoded in utf-8 format...
:oops:

Post by srod »

I take it that it worked okay then?

:)

Post by Michael Vogel »

srod wrote: I take it that it worked okay then?

:)
Yep :wink:

__
...and thanks once again