UTF-8 file conversion

Post by **wichtel** » Wed Jul 06, 2011 8:41 am

Hi,
I have a text file to read that is probably UTF-8 encoded and contains words in many different languages.
So I read it with ReadString(1,#PB_UTF8), and most of the characters are converted correctly.
But some characters come out wrong.
One sample in Portuguese:
- opened without conversion in ascii: IndÃƒÂºstria e ComÃƒÂ©rcio
- opened as UTF-8: IndÃºstria e ComÃ©rcio
As you can see, the codes have been partly converted, but not correctly.

Can anyone tell me why, and how to correct it? Or could it be that my source file is already corrupt?

What I did to work around the problem is this:
After reading the file line by line, I check each character for being a control sequence (which should not exist anymore after reading as UTF-8).
If I find a control character, I convert it again with pokes and peeks.
This somehow seems to work and produces this result: Indústria e Comércio

But this is not straightforward and seems odd.

Here is the not-so-pretty test code (without the rest of the program), to be compiled in Unicode mode:

Code:

#window=1
#text=1

#CRLF=Chr(13)+Chr(10)

Global infile$
Global outfile$

Procedure do()
  SetGadgetItemText(#text,0,"Please wait...") ; panel items are numbered from 0
  DisableGadget(#text,1)
  If ReadFile(1,infile$)
    outfile$=infile$+"_a2u_."+GetExtensionPart(infile$)  
    Debug outfile$
    error$=""
    
    If CreateFile(2,outfile$)
      While Eof(1) = 0  
        in$=ReadString(1,#PB_UTF8)   
        out$=in$
        
        out2$=""
        For i=1 To Len(in$) ; Mid() counts characters, so loop over Len(), not the byte length
          If Mid(in$,i,1)="Ã"
            a$=Mid(in$,i,2) 
            buffer$=Space(8)
            PokeS(@buffer$,a$,2,#PB_Ascii)
            out2$+PeekS(@buffer$,2,#PB_UTF8)
            i+1
            error$+in$
          Else
            out2$+Mid(in$,i,1)
          EndIf  
        Next
        If out2$<>""
          out$=out2$
        EndIf
        WriteStringN(2,out$,#PB_Unicode)
      Wend
      MessageRequester("Done!","file "+infile$+" converted to"+#CRLF+"file "+outfile$+#CRLF+"Errors:"+#CRLF+error$)
      CloseFile(2)
    EndIf
    CloseFile(1)
  EndIf
  SetGadgetItemText(#text,0,"Drag'n'Drop the file here") ; panel items are numbered from 0
  DisableGadget(#text,0)
EndProcedure

Procedure myWindow()
  If OpenWindow(#window,0,0,240,120, "convert to unicode",#PB_Window_SystemMenu |#PB_Window_TitleBar | #PB_Window_ScreenCentered )
    PanelGadget(#text,10,10,220,100)
      AddGadgetItem(#text,-1,"Drag'n'Drop the file here")
    CloseGadgetList() ; a PanelGadget opens a gadget list that has to be closed again
    EnableGadgetDrop(#text,#PB_Drop_Files,#PB_Drag_Copy|#PB_Drag_Move)
  EndIf
EndProcedure

Procedure myEvents()
  Repeat
    EventID = WaitWindowEvent()
    EventType = EventType()
    Select EventID
      Case #PB_Event_GadgetDrop
        infile$=StringField(EventDropFiles(),1,Chr(10)) 
        Debug infile$ ; filename$ was never assigned; the dropped file is in infile$
        do()
      Case #PB_Event_Gadget
        GadgetID = EventGadget()
        Select GadgetID
          Case #text
        EndSelect
    EndSelect
  Until EventID=#PB_Event_CloseWindow
EndProcedure


myWindow()
myEvents()

Post by **Trond** » Wed Jul 06, 2011 9:06 am

wichtel wrote:I have a text file to read that is probably UTF-8 encoded

It is probably not if the characters appear wrong when decoded as UTF-8. Why don't you post your file (upload it)?

Post by **wichtel** » Wed Jul 06, 2011 10:14 am

Trond wrote:... Why don't you post your file (upload it)?

Because it contains data that I am not allowed to share in public.

I assumed UTF-8 because, as far as I know, UTF-8 uses ASCII characters where possible and control codes only when needed.
And opening it as UTF-8 did work for most characters. For example, all German umlauts were read correctly, and all the French accents as well.
Out of over 6000 lines, only a few caused problems.
The example I gave in my first post should show the problem.

Again:
This is what is in the file (in one line):
IndÃƒÂºstria e ComÃƒÂ©rcio
And after reading as UTF-8
IndÃºstria e ComÃ©rcio

And after converting it a second time
Indústria e Comércio

it finally appears correct.

Do you know of any format that somehow does a double UTF-8 encoding?
Or is it more likely that my source file was exported incorrectly? (Unfortunately, I have no influence on the source file format and no technical contact.)

I tried to open the source file in Notepad++, and it seems to have the same problems.
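
On the question of where a double UTF-8 encoding comes from: a common cause is not a file format but a conversion mistake, where UTF-8 bytes are misread as a single-byte encoding and then encoded as UTF-8 again. The sketch below reproduces that under the assumption that the single-byte encoding was Latin-1; nothing in the thread confirms how the file was actually exported. It is written in Python rather than PureBasic only because the byte-level steps are easy to check there.

```python
# Hypothetical reconstruction of how an export could go wrong.
original = "Indústria"

# Step 1: a correct UTF-8 encoding (ú becomes the two bytes C3 BA).
utf8_once = original.encode("utf-8")

# Step 2 (the assumed mistake): the bytes are read as Latin-1, so C3 and BA
# each turn into a separate character, Ã and º.
misread = utf8_once.decode("latin-1")

# Step 3: the misread text is encoded as UTF-8 a second time.
utf8_twice = misread.encode("utf-8")

print(utf8_once.hex(" "))   # 49 6e 64 c3 ba 73 74 72 69 61
print(utf8_twice.hex(" "))  # 49 6e 64 c3 83 c2 ba 73 74 72 69 61
```

Four bytes (C3 83 C2 BA) for a single ú are the typical signature of this mistake.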

Post by **RASHAD** » Wed Jul 06, 2011 11:21 am

Search the web for Detenc.exe

Post by **wichtel** » Wed Jul 06, 2011 11:57 am

RASHAD wrote:Search the web for Detenc.exe

Thanks a lot.
So UTF-8 still looks like the best match, but the same lines come out wrong as when I convert with PB.
If Detenc is correct, then my source file must be broken somehow.
It seems I need to detect the broken sequences and restore them by trial and error until I have found a working solution to chew through the hundreds of files still to come...

Post by **Trond** » Wed Jul 06, 2011 1:01 pm

wichtel wrote:This is what is in the file (in one line)
IndÃƒÂºstria e ComÃƒÂ©rcio

No. That is one interpretation of what is in the file. It is unusable for anything because you didn't tell us which encoding was used when making this interpretation.
Open the file in a hex editor and give us the bytes that make up the line. That is what really is in the file.

(Honestly, I think your file is broken.)

Post by **infratec** » Wed Jul 06, 2011 1:07 pm

wichtel wrote:And after converting it a second time
Indústria e Comércio

Maybe it is encoded in UTF-16.

But as Trond said:
Show us the hex snippet of these three words.

Post by **wichtel** » Wed Jul 06, 2011 1:23 pm

Thanks guys. No, it is not UTF-16.

Here is the hex code of the lines that do not work when read as UTF-8:

Code:

Brasil IndÃƒÂºst
42 72 61 73 69 6C 20 49 6E 64 C3 83 C2 BA 73 74

ria e ComÃƒÂ©rci
72 69 61 20 65 20 43 6F 6D C3 83 C2 A9 72 63 69 

o de AutopeÃƒ as
6F 20 64 65 20 41 75 74 6F 70 65 C3 83 20 61 73

And here are the codes of something in the same file that does read correctly as UTF-8:

Code:

er MÃ¡quinas e E
65 72 20 4D C3 A1 71 75 69 6E 61 73 20 65 20 45
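
The posted bytes can be checked independently of PureBasic. The following Python sketch decodes each dump. It assumes that the second pass should treat the once-decoded characters as Latin-1 bytes, which is what the PokeS(..., #PB_Ascii) step in the first post appears to do.

```python
# Bytes copied from the hex dumps above.
broken = bytes.fromhex("42 72 61 73 69 6C 20 49 6E 64 C3 83 C2 BA 73 74")
damaged = bytes.fromhex("6F 20 64 65 20 41 75 74 6F 70 65 C3 83 20 61 73")
good = bytes.fromhex("65 72 20 4D C3 A1 71 75 69 6E 61 73 20 65 20 45")

# First dump: one UTF-8 pass still leaves "Ãº"; a second pass yields "ú".
first = broken.decode("utf-8")
second = first.encode("latin-1").decode("utf-8")
print(second)  # Brasil Indúst

# Last dump: a single pass is enough.
print(good.decode("utf-8"))  # er Máquinas e E

# Third dump: after one pass the Ã is followed by a space (byte 20), so the
# second pass finds no continuation byte and cannot succeed.
try:
    damaged.decode("utf-8").encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    print("second pass fails:", exc.reason)
```

So the first two dumps are consistent with text that was UTF-8 encoded twice, while the third has lost a byte and cannot be restored by decoding again.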

Post by **Trond** » Wed Jul 06, 2011 7:32 pm

I think the file is broken. You have to decode them as UTF-8 twice. Except it's a big problem to know which entries are broken and which are not.

Code:

Procedure.s DecodeUTF8Again(String.s)
  Protected Buf.s = Space(Len(String))
  PokeS(@Buf, String, -1, #PB_Ascii)
  ProcedureReturn PeekS(@Buf, -1, #PB_UTF8)
EndProcedure


a.s = PeekS(?String, -1, #PB_UTF8)
If FindString(a, "Ã")
  a = DecodeUTF8Again(a)
EndIf

Debug a


DataSection
  String:
  Data.a $42, $72, $61, $73, $69, $6C, $20, $49, $6E, $64, $C3, $83, $C2, $BA, $73, $74, 0, 0, 0, 0
EndDataSection

Post by **wichtel** » Wed Jul 06, 2011 8:32 pm

Trond wrote:I think the file is broken. You have to decode them as UTF-8 twice. Except it's a big problem to know which entries are broken and which are not.

Thanks.
That is what my code already does, except yours is prettier.

To know what is broken and what is not, I read the file as UTF-8 and then look for the escape sequence character "Ã".
Usually there should be none. When I still find one, I decode again.

Well, so be it. That's all that can be done, I guess.
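
Searching for "Ã" can misfire on text that legitimately contains that letter, such as Portuguese words in upper case. A possible alternative, sketched here in Python under the same Latin-1 assumption, is to attempt the second decode on every line and keep the result only when it succeeds. The function name `repair_double_utf8` is made up for this illustration.

```python
def repair_double_utf8(text: str) -> str:
    """Undo one extra layer of UTF-8 encoding, or return text unchanged."""
    try:
        # Turn each character back into one byte, then decode those bytes.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The line is not a valid second layer of UTF-8, so leave it alone.
        return text


print(repair_double_utf8("IndÃºstria e ComÃ©rcio"))  # Indústria e Comércio
print(repair_double_utf8("Máquinas"))                # Máquinas
print(repair_double_utf8("AÇÃO"))                    # AÇÃO
```

This works on whole lines: a line that mixes broken and correct accented characters, or one that has lost bytes like the "Autope" line, is returned unchanged and still needs manual attention.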
