UTF-8 file conversion

Post by wichtel »

Hi,
I have a text file to read that is probably UTF-8 encoded and contains words in many different languages.
I read it with ReadString(1,#PB_UTF8), and most of the characters are converted correctly.
But some characters appear wrong.
One sample in Portuguese:
- opened without conversion in ASCII: Indústria e Comércio
- opened as UTF-8: Indústria e Comércio
As you can see, the codes have been partly converted, but not correctly.

Can anyone tell me why this happens and how to correct it? Or could it be that my source file is already corrupt?

As a partial fix, I did the following:
After reading the file line by line, I check each character for being the start of a leftover multi-byte sequence (which should not exist anymore after reading as UTF-8).
If I find one, I convert it again with pokes and peeks.
This seems to work and produces this result: Indústria e Comércio

But this is not a clean solution and seems odd.

Here is the not-so-pretty test code (without the rest of the program), to be compiled in Unicode mode:

Code:

#window=1
#text=1

#CRLF=Chr(13)+Chr(10)

Global infile$
Global outfile$

Procedure do()
  SetGadgetItemText(#text,0,"Please wait...") ; panel items are numbered from 0
  DisableGadget(#text,1)
  If ReadFile(1,infile$)
    outfile$=infile$+"_a2u_."+GetExtensionPart(infile$)  
    Debug outfile$
    error$=""
    
    If CreateFile(2,outfile$)
      While Eof(1) = 0  
        in$=ReadString(1,#PB_UTF8)   
        out$=in$
        
        out2$=""
        For i=1 To Len(in$) ; Mid() indexes characters, not bytes
          If Mid(in$,i,1)="Ã"
            a$=Mid(in$,i,2) 
            buffer$=Space(8)
            PokeS(@buffer$,a$,2,#PB_Ascii)
            out2$+PeekS(@buffer$,2,#PB_UTF8)
            i+1
            error$+in$+#CRLF ; one affected line per row in the report
          Else
            out2$+Mid(in$,i,1)
          EndIf  
        Next
        If out2$<>""
          out$=out2$
        EndIf
        WriteStringN(2,out$,#PB_Unicode)
      Wend
      MessageRequester("Done!","file "+infile$+" converted to"+#CRLF+"file "+outfile$+#CRLF+"Errors:"+#CRLF+error$)
      CloseFile(2)
    EndIf
    CloseFile(1)
  EndIf
  SetGadgetItemText(#text,0,"Drag'n'Drop the file here") ; panel items are numbered from 0
  DisableGadget(#text,0)
EndProcedure

Procedure myWindow()
  If OpenWindow(#window,0,0,240,120, "convert to unicode",#PB_Window_SystemMenu |#PB_Window_TitleBar | #PB_Window_ScreenCentered )
    PanelGadget(#text,10,10,220,100)
      AddGadgetItem(#text,-1,"Drag'n'Drop the file here")      
    EnableGadgetDrop(#text,#PB_Drop_Files,#PB_Drag_Copy|#PB_Drag_Move)
  EndIf
EndProcedure

Procedure myEvents()
  Repeat
    EventID = WaitWindowEvent()
    EventType = EventType()
    Select EventID
      Case #PB_Event_GadgetDrop
        infile$=StringField(EventDropFiles(),1,Chr(10)) 
        Debug infile$
        do()
      Case #PB_Event_Gadget
        GadgetID = EventGadget()
        Select GadgetID
          Case #text
        EndSelect
    EndSelect
  Until EventID=#PB_Event_CloseWindow
EndProcedure


myWindow()
myEvents()
PB 5.40 LTS, W7,8,10 64bit and Mint x64

Post by Trond »

wichtel wrote:I have a text file to read that is probably UTF-8 encoded
It is probably not if the characters appear wrong when decoded as UTF-8. Why don't you post your file (upload it)?

Post by wichtel »

Trond wrote:... Why don't you post your file (upload it)?
Because it contains data that I am not allowed to share in public :(

I assumed UTF-8 because, as far as I know, UTF-8 uses plain ASCII bytes where possible and multi-byte sequences only when needed.
Opening it as UTF-8 did work for most characters. For example, all German umlauts were read correctly, and so were all the French accents.
Out of more than 6000 lines, only a few caused problems.
The example in my first post shows the problem.

Again:
This is what is in the file (in one line):
Indústria e Comércio
And after reading as UTF-8:
Indústria e Comércio

And after converting it a second time:
Indústria e Comércio

Now it finally appears correct.

Do you know of any format that does a double UTF-8 encoding?
Or is it more likely that my source file was exported incorrectly? (Unfortunately, I have no influence on the source file format and no technical contact.)

I tried to open the source file in Notepad++, and it shows the same problems.

Post by RASHAD »

Search the web for Detenc.exe

Post by wichtel »

RASHAD wrote:Search the web for Detenc.exe
Thanks a lot.
So UTF-8 still looks like the best match, but the same lines come out wrong as when I convert with PB.
If Detenc is correct, then my source file must be broken somehow.
It seems I need to detect the broken sequences and restore them by trial and error until I have found a working solution to chew through the next few hundred files...

Post by Trond »

wichtel wrote:This is what is in the file (in one line)
Indústria e Comércio
No. That is one interpretation of what is in the file. It is unusable for anything because you didn't tell us which encoding was used when making this interpretation.
Open the file in a hex editor and give us the bytes that make up the line. That is what really is in the file.

(Honestly, I think your file is broken.)

Post by infratec »

wichtel wrote:And after converting it a second time
Indústria e Comércio
Maybe it is coded in UTF-16.

But as Trond said:
Show us the hex snippet of these three words.

Post by wichtel »

Thanks guys. No, it is not UTF-16.

Here are the hex bytes of the lines that do not work when read as UTF-8:

Code:

Brasil Indúst
42 72 61 73 69 6C 20 49 6E 64 C3 83 C2 BA 73 74

ria e Comérci
72 69 61 20 65 20 43 6F 6D C3 83 C2 A9 72 63 69 

o de Autopeà as
6F 20 64 65 20 41 75 74 6F 70 65 C3 83 20 61 73
And here are the bytes of something in the same file that does read correctly as UTF-8:

Code:

er Máquinas e E
65 72 20 4D C3 A1 71 75 69 6E 61 73 20 65 20 45
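
For reference, the four-byte pattern C3 83 C2 BA in the first dump is what one would expect if a correctly encoded "ú" (C3 BA, like the C3 A1 for "á" in the working line) had been misread as single-byte text and then encoded as UTF-8 a second time. The following is only a minimal sketch of that round trip, assuming Unicode compile mode; HexDump() is a helper made up for this example and is not part of the thread's code.

```purebasic
; Sketch only: reproduces the suspected double encoding of "ú".
; Assumes Unicode compile mode. HexDump() is a hypothetical helper.
EnableExplicit

Procedure.s HexDump(*Mem, Count)
  Protected i, Result.s
  For i = 0 To Count - 1
    ; Two hex digits per byte, separated by spaces
    Result + RSet(Hex(PeekA(*Mem + i)), 2, "0") + " "
  Next
  ProcedureReturn Result
EndProcedure

Define *Once  = AllocateMemory(16)
Define *Twice = AllocateMemory(16)
Define Misread.s

If *Once And *Twice
  ; Step 1: encode U+00FA ("ú") as UTF-8, giving two bytes
  PokeS(*Once, Chr($FA), -1, #PB_UTF8)
  Debug HexDump(*Once, 2)

  ; Step 2: misread those two bytes as two single-byte characters
  Misread = PeekS(*Once, -1, #PB_Ascii)

  ; Step 3: encode the misread text as UTF-8 again, giving four bytes
  PokeS(*Twice, Misread, -1, #PB_UTF8)
  Debug HexDump(*Twice, 4)

  FreeMemory(*Once)
  FreeMemory(*Twice)
EndIf
```

If the assumptions hold, the first Debug line should show the two bytes a healthy file would contain, and the second should match the C3 83 C2 BA seen in the broken lines.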

Post by Trond »

I think the file is broken. You have to decode the affected entries as UTF-8 twice. The big problem is knowing which entries are broken and which are not.

Code:

Procedure.s DecodeUTF8Again(String.s)
  Protected Buf.s = Space(Len(String))
  PokeS(@Buf, String, -1, #PB_Ascii)
  ProcedureReturn PeekS(@Buf, -1, #PB_UTF8)
EndProcedure


a.s = PeekS(?String, -1, #PB_UTF8)
If FindString(a, "Ã")
  a = DecodeUTF8Again(a)
EndIf

Debug a


DataSection
  String:
  Data.a $42, $72, $61, $73, $69, $6C, $20, $49, $6E, $64, $C3, $83, $C2, $BA, $73, $74, 0, 0, 0, 0
EndDataSection

Post by wichtel »

Trond wrote:I think the file is broken. You have to decode them as UTF-8 twice. Except it's a big problem to know which entries are broken and which are not.
Thanks.
That is what my code already does, except yours is prettier. :)

To tell what is broken and what is not, I read each line as UTF-8 from the file and then look for the character "Ã", which is what a leftover UTF-8 lead byte looks like.
Usually there should be none. If I still find one, I decode again.

Well, so be it. I guess that's all that can be done.
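
The check for "Ã" described above could be made stricter. One possible approach, sketched here under the assumption that the damage is always a plain second UTF-8 encoding: decode a line again only if every character fits into one byte and those bytes form well-formed UTF-8 with at least one multi-byte sequence. LooksDoubleEncoded() is a name invented for this sketch; DecodeUTF8Again() is copied from Trond's post so that the example is self-contained. Compile in Unicode mode.

```purebasic
; Sketch only, assuming Unicode compile mode.
EnableExplicit

; Hypothetical helper: #True if the characters of Candidate, taken as
; bytes, form well-formed UTF-8 containing a multi-byte sequence.
Procedure.i LooksDoubleEncoded(Candidate.s)
  Protected i, c, Follow, MultiByte = #False
  For i = 1 To Len(Candidate)
    c = Asc(Mid(Candidate, i, 1))
    If c > $FF
      ProcedureReturn #False ; cannot be a misread single byte
    EndIf
    If Follow > 0
      ; A continuation byte ($80..$BF) is required here
      If c < $80 Or c > $BF : ProcedureReturn #False : EndIf
      Follow - 1
    ElseIf c >= $C2 And c <= $DF
      Follow = 1 : MultiByte = #True
    ElseIf c >= $E0 And c <= $EF
      Follow = 2 : MultiByte = #True
    ElseIf c >= $F0 And c <= $F4
      Follow = 3 : MultiByte = #True
    ElseIf c >= $80
      ProcedureReturn #False ; stray continuation or invalid lead byte
    EndIf
  Next
  ProcedureReturn Bool(Follow = 0 And MultiByte)
EndProcedure

; Same procedure as in Trond's post
Procedure.s DecodeUTF8Again(String.s)
  Protected Buf.s = Space(Len(String))
  PokeS(@Buf, String, -1, #PB_Ascii)
  ProcedureReturn PeekS(@Buf, -1, #PB_UTF8)
EndProcedure

; "Indústria" as it looks after the first UTF-8 decoding
Define Sample.s = "Ind" + Chr($C3) + Chr($BA) + "stria"
If LooksDoubleEncoded(Sample)
  Sample = DecodeUTF8Again(Sample)
EndIf
Debug Sample
```

With this check, a correctly decoded line such as the "Máquinas" one is left alone, because "á" followed by "q" is not a valid byte sequence. The third broken sample (C3 83 20) would also be rejected: after the first decoding it is "Ã" followed by a space, which cannot be decoded again, so such lines still need manual repair. This is a simplification and does not cover every UTF-8 rule (overlong and surrogate forms are not checked).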