UTF-8 file conversion
Posted: Wed Jul 06, 2011 8:41 am
Hi,
I have a text file to read that is probably UTF-8 encoded and contains words in many different languages.
So I read it with ReadString(1,#PB_UTF8) and most of the characters get converted correctly.
But some characters appear wrong.
One sample in Portuguese:
- opened without conversion in ascii: Indústria e Comércio
- opened as UTF-8: Indústria e Comércio
As you can see, the codes have been partly converted, but not correctly.
Can anyone tell me why this happens and how to correct it? Or could it be that my source file is already corrupt?
What I did to fix the problem a little bit is this:
After reading the file line by line, I check each character for being the start of a leftover multi-byte sequence (which should not exist anymore after reading as UTF-8).
If I find such a character, I convert it and the following character again with pokes and peeks.
This somehow seems to work and produces this result: Indústria e Comércio
But this is not straightforward and seems odd.
Here is the not-so-pretty test code (without the rest of the program), to be compiled in unicode mode:
Code:
#window=1
#text=1
#CRLF=Chr(13)+Chr(10)
Global infile$
Global outfile$

; Convert infile$ line by line and write the result to outfile$
Procedure do()
  SetGadgetItemText(#text,0,"Please wait...") ; panel items are numbered from 0
  DisableGadget(#text,1)
  If ReadFile(1,infile$)
    outfile$=infile$+"_a2u_."+GetExtensionPart(infile$)
    Debug outfile$
    error$=""
    If CreateFile(2,outfile$)
      While Eof(1) = 0
        in$=ReadString(1,#PB_UTF8)
        out$=in$
        out2$=""
        ; Mid() counts characters, so loop over Len(), not StringByteLength()
        For i=1 To Len(in$)
          If Mid(in$,i,1)="Ã"
            ; leftover two-character sequence: write both characters back
            ; as single bytes and decode them as UTF-8 a second time
            a$=Mid(in$,i,2)
            buffer$=Space(8)
            PokeS(@buffer$,a$,2,#PB_Ascii)
            out2$+PeekS(@buffer$,2,#PB_UTF8)
            i+1 ; skip the second character of the pair
            error$+in$ ; remember the line that needed fixing
          Else
            out2$+Mid(in$,i,1)
          EndIf
        Next
        If out2$<>""
          out$=out2$
        EndIf
        WriteStringN(2,out$,#PB_Unicode)
      Wend
      MessageRequester("Done!","file "+infile$+" converted to"+#CRLF+"file "+outfile$+#CRLF+"Errors:"+#CRLF+error$)
      CloseFile(2)
    EndIf
    CloseFile(1)
  EndIf
  SetGadgetItemText(#text,0,"Drag'n'Drop the file here")
  DisableGadget(#text,0)
EndProcedure

; Open the small window that accepts dropped files
Procedure myWindow()
  If OpenWindow(#window,0,0,240,120, "convert to unicode",#PB_Window_SystemMenu |#PB_Window_TitleBar | #PB_Window_ScreenCentered )
    PanelGadget(#text,10,10,220,100)
    AddGadgetItem(#text,-1,"Drag'n'Drop the file here")
    CloseGadgetList() ; the panel opens a gadget list that has to be closed again
    EnableGadgetDrop(#text,#PB_Drop_Files,#PB_Drag_Copy|#PB_Drag_Move)
  EndIf
EndProcedure

; Event loop: convert the first dropped file, quit when the window is closed
Procedure myEvents()
  Repeat
    EventID = WaitWindowEvent()
    EventType = EventType()
    Select EventID
      Case #PB_Event_GadgetDrop
        infile$=StringField(EventDropFiles(),1,Chr(10))
        Debug infile$
        do()
      Case #PB_Event_Gadget
        GadgetID = EventGadget()
        Select GadgetID
          Case #text
        EndSelect
    EndSelect
  Until EventID=#PB_Event_CloseWindow
EndProcedure

myWindow()
myEvents()
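The same poke/peek idea could probably be packed into one small procedure that treats a whole line at once. This is only a rough sketch and I have not tested it against the real file. UndoDoubleUTF8() is a name I made up, not a PureBasic function. It assumes unicode mode, and it assumes that every character of an affected line has a code below 256 (as with "Ã" followed by "º"), so that each character can be written back as exactly one byte. It is only meant for lines that are garbled throughout; a line that also contains correctly decoded accented characters would not survive the second decoding.

```purebasic
; Sketch only: UndoDoubleUTF8() is a hypothetical helper, not part of PureBasic.
; Assumes unicode mode and character codes below 256 in affected lines.
Procedure.s UndoDoubleUTF8(text$)
  Protected i, code, result$
  Protected count = Len(text$)
  ; AllocateMemory() returns zero-filled memory, so the bytes stay null-terminated
  Protected *buffer = AllocateMemory(count + 1)

  If *buffer = 0
    ProcedureReturn text$ ; no memory: hand the line back unchanged
  EndIf

  For i = 1 To count
    code = Asc(Mid(text$, i, 1))
    If code > 255
      ; does not fit into one byte, so the assumption fails: leave the line alone
      FreeMemory(*buffer)
      ProcedureReturn text$
    EndIf
    PokeA(*buffer + i - 1, code) ; write the character back as a single byte
  Next

  result$ = PeekS(*buffer, -1, #PB_UTF8) ; decode the bytes as UTF-8 once more
  FreeMemory(*buffer)
  ProcedureReturn result$
EndProcedure

; Chr($C3) + Chr($BA) is the "Ã" + "º" pair from the sample above
Debug UndoDoubleUTF8("Ind" + Chr($C3) + Chr($BA) + "stria")
```

In the loop above it would be called as out$ = UndoDoubleUTF8(in$) right after ReadString(), instead of walking through the line with Mid(). Plain ASCII lines should pass through unchanged, because their bytes are the same in UTF-8.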