Unicode conversion...
Posted: Sat Aug 29, 2009 4:02 pm
by Michael Vogel
Hm,
it seems I'm doing something wrong, but what?!
I've bought a nice GPS gadget which can load geocaching files, which are in a Unicode format (starting with the header FF FE, the UTF-16 little-endian byte-order mark). The problem is that the files use the wrong bytes for German umlauts, and this does not look right on my Garmin device...
So I started with this code:
Code:
Procedure RepairGPX(filename.s)
  Protected zeile.s
  #Wrong_ae_s="Ã¤"
  #Wrong_Ae_l="Ã"+Chr($84)
  #Wrong_oe_s="Ã¶"
  #Wrong_Oe_l="Ã"+Chr($96)
  #Wrong_ue_s="Ã¼"
  #Wrong_Ue_l="Ã"+Chr($9c)
  #Wrong_sz="Ã"+Chr($9f)
  ReadFile(0,filename)
  Debug ReadStringFormat(0); = 3
  While Eof(0)=0
    zeile=ReadString(0)
    If FindString(zeile,"Ã",1)
      Debug FindString(zeile,"Ã",1); e.g. 17
      Debug FindString(zeile,"¼",1); e.g. 18
      ; Works:
      ;n=1
      ;Repeat
      ; n=FindString(zeile,"Ã¼",n)
      ; If n
      ; zeile=Left(zeile,n-1)+"ü"+Mid(zeile,n+2)
      ; EndIf
      ;Until n=0
      ; Does not work?!
      ;zeile=ReplaceString(zeile,#Wrong_ae_s,"ä")
      ;zeile=ReplaceString(zeile,#Wrong_Ae_l,"Ä")
      ;zeile=ReplaceString(zeile,#Wrong_oe_s,"ö")
      ;zeile=ReplaceString(zeile,#Wrong_Oe_l,"Ö")
      zeile=ReplaceString(zeile,#Wrong_ue_s,"ü")
      ;zeile=ReplaceString(zeile,#Wrong_Ue_l,"Ü")
      ;zeile=ReplaceString(zeile,#Wrong_sz,"ß")
      Debug zeile
    EndIf
  Wend
EndProcedure
RepairGPX("C:\...\GC1AT4F.GPX")
Unicode is enabled in the compiler settings and FindString works, but ReplaceString does not change anything.
What am I doing wrong here?
Thanks,
Michael
Posted: Fri Sep 04, 2009 8:30 am
by Michael Vogel
Hm,
everything works now with nearly all chars ('ä', 'ö', ...), but one ('ß') is still doing something strange:
Code:
Procedure.s UTF8(s.s)
  Protected *Buffer
  Protected Result.s
  *Buffer=AllocateMemory(StringByteLength(s,#PB_UTF8)+1)
  PokeS(*Buffer,s,-1,#PB_UTF8)
  Result=PeekS(*Buffer,-1,#PB_Ascii)
  FreeMemory(*Buffer)
  ProcedureReturn Result
EndProcedure
Debug Utf8("ä")
s.s=Utf8("ä")
For i=0 To 3
  Debug Str(i)+": "+Str(PeekB(@s+i)&255)
Next i
Debug Utf8("ß")
s.s=Utf8("ß")
For i=0 To 3
  Debug Str(i)+": "+Str(PeekB(@s+i)&255)
Next i
Compiled with PB 4.3 (Unicode enabled), the code produces the following output:
ä
0: 195
1: 0
2: 164
3: 0
ß
0: 195
1: 0
2: 120
3: 1
The first part is ok, but the 120/1 combination for 'ß' should be 159/0 ?!
What's going on here?
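One possible explanation for the 120/1 pair, sketched in Python rather than PureBasic so that it can be checked independently. It assumes the #PB_Ascii read behaves like a Windows-1252 decode on this system, which the thread does not confirm. Under that assumption, byte $9F becomes the character 'Ÿ' (U+0178), which a 2-byte string stores as 78 01, that is 120 and 1. The helper name `ascii_view_of_utf8` is made up for this illustration.

```python
# Assumption: reading the buffer as "Ascii" acts like a Windows-1252 decode.
def ascii_view_of_utf8(ch: str) -> list[int]:
    """Encode ch as UTF-8, misread those bytes as cp1252, return the UTF-16LE bytes."""
    utf8_bytes = ch.encode("utf-8")         # "ß" -> C3 9F
    misread = utf8_bytes.decode("cp1252")   # C3 -> "Ã", 9F -> "Ÿ" (U+0178)
    return list(misread.encode("utf-16-le"))

print(ascii_view_of_utf8("ä"))  # [195, 0, 164, 0]
print(ascii_view_of_utf8("ß"))  # [195, 0, 120, 1]
```

If this reading is right, 'ä' survives only because byte $A4 maps to the same code point in Windows-1252 and in Unicode, while bytes in the $80 to $9F range do not.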
Posted: Sat Sep 05, 2009 8:56 am
by Michael Vogel
There seems to be a bug here: I checked the UTF tables and 'ß' should produce C3 9F, which I can't get here?!
Am I doing something wrong, or is it really a bug?
Michael
Posted: Sat Sep 05, 2009 9:04 am
by srod
Michael Vogel wrote:There seems to be a bug here: I checked the UTF tables and 'ß' should produce C3 9F, which I can't get here?!
Am I doing something wrong, or is it really a bug?
Michael
But your code is converting to Ascii and then to Unicode (because of the Unicode switch!). You have lost the UTF-8 encoding!
The following produces $C39F okay:
Code:
Procedure.s UTF8(s.s)
  Protected *Buffer
  Protected Result.s
  Protected byteLen
  byteLen = StringByteLength(s,#PB_UTF8)
  *Buffer=AllocateMemory(byteLen+1)
  PokeS(*Buffer,s,-1,#PB_UTF8)
  ; Show the raw UTF-8 bytes, one per line
  For i = 0 To byteLen-1
    Debug Hex(PeekB(*Buffer+i), #PB_Byte)
  Next
  FreeMemory(*Buffer) ; release the buffer once the bytes have been shown
EndProcedure
Utf8("ß")
Posted: Sat Sep 05, 2009 10:57 am
by Michael Vogel
srod wrote:The following produces $C39F okay [...]
I'm still confused. Your code is OK, but I'm still not able (not my best period of time) to apply the conversion twice, as seems to have happened with the geocache files...
Code:
Procedure.s UTF8(s.s)
  Protected *Buffer
  Protected Result.s
  Protected byteLen
  byteLen = StringByteLength(s,#PB_UTF8)
  *Buffer=AllocateMemory(byteLen+1)
  PokeS(*Buffer,s,-1,#PB_UTF8)
  Debug "---"+s+"---"
  For i = 0 To byteLen-1
    Debug Hex(PeekB(*Buffer+i), #PB_Byte)
  Next
  ProcedureReturn PeekS(*Buffer,-1,#PB_Ascii)
EndProcedure
Utf8(Utf8("ß"))
The bytes seen in the geocache files instead of C3 9F are C3 00 9F 00, but maybe I only have to "expand" your result and add the 00 bytes. I will try this now...
Thanks for your help, srod!
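The byte pattern reported above (each UTF-8 byte followed by a 00) is what results when every UTF-8 byte is stored as its own 16-bit character. A minimal Python sketch of undoing that damage follows. The function name `repair_mojibake` is hypothetical, and the sketch assumes every damaged character has a code point below 256. Text that does not fit that pattern is returned unchanged.

```python
def repair_mojibake(text: str) -> str:
    """Undo 'one UTF-8 byte per character' damage; return text unchanged if it does not apply."""
    try:
        raw = text.encode("latin-1")   # each character below U+0100 becomes one byte
        return raw.decode("utf-8")     # reinterpret those bytes as UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text                    # not this kind of damage, leave it alone

# C3 00 9F 00 read as UTF-16LE gives the two characters U+00C3 U+009F
damaged = bytes([0xC3, 0x00, 0x9F, 0x00]).decode("utf-16-le")
print(repair_mojibake(damaged))  # ß
```

The all-or-nothing fallback is a design choice for this sketch: a line that mixes damaged and intact non-ASCII characters would be left as it is, so a real tool might need to work on smaller pieces.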
Posted: Sat Sep 05, 2009 11:07 am
by srod
Remember that with the Unicode switch set, your function is returning a Unicode string regardless of whether you filled that string from a utf-8 buffer or an Ascii one etc. That is why you are seeing zeros.
Posted: Sat Sep 05, 2009 10:50 pm
by Michael Vogel
I'm still confused (and getting more and more grey hairs)...
I found some 3-byte codes (or 6-byte codes when also counting the zero bytes) within the wrong geocache files, but the procedure is only generating 2-byte codes?!
I added two examples and the hex codes which should be returned (at least according to what I saw on Wikipedia). Can you help me once again?
Thanks,
Michael
Code:
Procedure.s UTF8(s.s)
  Protected *Buffer
  Protected Result.s
  Protected byteLen
  byteLen = StringByteLength(s,#PB_UTF8)
  *Buffer=AllocateMemory(byteLen+1)
  PokeS(*Buffer,s,-1,#PB_UTF8)
  Debug "---"+s+"---"
  For i = 0 To byteLen-1
    Debug Hex(PeekB(*Buffer+i), #PB_Byte)
  Next
  ProcedureReturn PeekS(*Buffer,-1,#PB_Ascii)
EndProcedure
Utf8("ß"); c3 9f
Utf8("„"); e2 80 9e
Utf8("€"); e2 82 ac
____
The code doesn't show the chars (the last line includes the euro symbol), but with copy and paste it should work...
Posted: Sun Sep 06, 2009 10:03 am
by srod
Michael, the procedure is clearly producing the correct UTF-8 bytes, that isn't a problem.
Your problem, as I have tried to point out above, is that your procedure is returning a PB string (via ProcedureReturn PeekS(*Buffer,-1,#PB_Ascii)).
Now PB strings are either in 1-byte Ascii format or 2-byte Unicode format depending on your compiler settings. PureBasic does not deal with strings internally in UTF-8 format (no sensible compiler would!).
This is why you are seeing 2-byte codes from your procedure return whilst in Unicode mode.
To continue working in utf-8 you must work with a memory buffer containing your utf-8 bytes and NOT work with a PB string variable! Again, a PB string variable can contain only ascii or unicode characters depending on the compiler settings.
Right, now that is understood, the following adjustment to your code returns your UTF-8 buffer:
Code:
Procedure.i UTF8(s.s)
  Protected *Buffer
  Protected Result.s
  Protected byteLen
  byteLen = StringByteLength(s,#PB_UTF8)
  *Buffer=AllocateMemory(byteLen+1)
  PokeS(*Buffer,s,-1,#PB_UTF8)
  ProcedureReturn *buffer
EndProcedure
*buffer = Utf8("€"); e2 82 ac
;Let us take a look at the UTF-8 characters.
*ptr.BYTE = *buffer
While *ptr\b
  Debug Hex(*ptr\b, #PB_Byte)
  *ptr+1
Wend
FreeMemory(*buffer)
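The distinction srod draws between a string variable and a byte buffer can also be seen in a short Python sketch. This is an analogy only: Python's `str` and `bytes` types stand in for a PB string and a memory buffer. The same character has a different byte sequence in each encoding, so the UTF-8 form only exists as bytes.

```python
text = "€"                         # one character, U+20AC

utf8 = text.encode("utf-8")        # the byte buffer you would write to a UTF-8 file
utf16 = text.encode("utf-16-le")   # the 2-byte-per-character form

print(utf8.hex(" "))    # e2 82 ac
print(utf16.hex(" "))   # ac 20
print(len(text), len(utf8), len(utf16))  # 1 3 2
```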
Posted: Sun Sep 06, 2009 10:18 am
by Michael Vogel
Thanks, srod, for being so patient. Meanwhile I feel quite awful, but I'm still too silly to get it.
The output of your last example is (I thought it would produce the 3-byte code now):
C2
80
And it doesn't matter whether I use Unicode or not. I give up.
Michael
___
PS: I "solved" the whole geocaching thing now, but with a kind of brute-force technique:
Code:
#MaxChars=127
Global Dim WrongChar.s(#MaxChars)
Global Dim CorrectChar.s(#MaxChars)
For i=0 To #MaxChars
  CorrectChar(i)=Chr(i+128)+Chr(0)
  Read.b Bytes
  For z=1 To Bytes
    Read.b b
    b=b&255
    WrongChar(i)=WrongChar(i)+Chr(b)+Chr(0)
  Next z
Next i
:
Data.b 3,$E2,$82,$AC; € (#128)
Data.b 2,$C2,$81;
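The Data.b lines of such a table do not have to be typed by hand. The Python sketch below generates them. It assumes the table is indexed by Windows-1252 byte values, and that bytes undefined in that code page (such as $81) fall back to the code point with the same number. That is an inference from the two lines shown, not something the post states. The function name `data_line` is hypothetical.

```python
def data_line(code: int) -> str:
    """Build one 'Data.b' line: the byte count, then the UTF-8 bytes for byte value `code`."""
    byte = bytes([code])
    try:
        ch = byte.decode("cp1252")
    except UnicodeDecodeError:
        ch = byte.decode("latin-1")   # undefined in cp1252 (e.g. $81): keep the same number
    utf8 = ch.encode("utf-8")
    fields = [str(len(utf8))] + [f"${b:02X}" for b in utf8]
    return "Data.b " + ",".join(fields) + f"; (#{code})"

for code in (128, 129):
    print(data_line(code))
# Data.b 3,$E2,$82,$AC; (#128)
# Data.b 2,$C2,$81; (#129)
```

Looping over `range(128, 256)` instead of the two sample values would emit the full table.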
Posted: Sun Sep 06, 2009 10:23 am
by srod
Brings up the 3 bytes okay here.
Make sure you have the Unicode switch set (in order for the € character to be embedded correctly in the data section of the exe) and alter the IDE preferences for the source file to be encoded in utf-8 format.
Also, make sure that the € character was copied from these forums to your source okay.
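One way a C2 80 result could come about, sketched in Python. This is an assumption about how the source file was saved and read, not something confirmed in the thread. In a single-byte Windows encoding '€' is the single byte $80. If that byte is taken as the character U+0080, its UTF-8 form is C2 80 rather than E2 82 AC.

```python
euro_single_byte = "€".encode("cp1252")         # b'\x80' in a Windows-1252 source file
misread = euro_single_byte.decode("latin-1")    # taken as the character U+0080

print(misread.encode("utf-8").hex(" "))   # c2 80
print("€".encode("utf-8").hex(" "))       # e2 82 ac
```

Saving the source as UTF-8, as suggested above, avoids the misreading because the character then reaches the compiler as U+20AC.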
Posted: Sun Sep 06, 2009 10:49 am
by Michael Vogel
srod wrote:... the IDE preferences for the source file to be encoded in utf-8 format...

Posted: Sun Sep 06, 2009 11:00 am
by srod
I take it that it worked okay then?

Posted: Sun Sep 06, 2009 11:33 am
by Michael Vogel
srod wrote:I take it that it worked okay then?

Yep
__
...and thanks once again