How to view ASCII file in Scintilla Gadget with UTF-8 CP?

Stefan Schnell
User
Posts: 85
Joined: Wed May 07, 2003 2:53 pm
Location: Germany - Oberirsen

Post by Stefan Schnell »

Hello community,

a tiny question:

I use the Scintilla gadget, and with the following code I want to load ASCII- or UTF-8-encoded files.

Code:

If ReadFile(0, FileName)
  FileFormat = ReadStringFormat(0)
  ScintillaSendMessage(#ScriptEdit, #SCI_CLEARALL)
  While Eof(0) = 0
    If FileFormat = #PB_Ascii
      *Line = Ascii(ReadString(0, #PB_Ascii) + #CRLF$)
      ScintillaSendMessage(#ScriptEdit, #SCI_APPENDTEXT, MemorySize(*Line), *Line)
      FreeMemory(*Line)
    ElseIf FileFormat = #PB_UTF8  
      Line = ReadString(0, #PB_UTF8) + #CRLF$
      ScintillaSendMessage(#ScriptEdit, #SCI_APPENDTEXT, StringByteLength(Line , #PB_UTF8), UTF8(Line))
    EndIf
  Wend
  ScintillaSendMessage(#ScriptEdit, #SCI_SETSAVEPOINT)
  CloseFile(0)  
Else
  MessageRequester("Important hint", "Can not open file", 
    #PB_MessageRequester_Ok)
EndIf

With UTF-8 files everything works, but when I load an ASCII file I get the following result:
Test = "Dies ist ein Tx94x84x81st"
instead of
Test = "Dies ist ein Töäüst"

I set the Scintilla gadget to the UTF8 codepage.
ScintillaSendMessage(#ScriptEdit, #SCI_SETCODEPAGE, #SC_CP_UTF8)

Thanks for tips and hints.

Cheers
Stefan
kenmo
Addict
Posts: 2033
Joined: Tue Dec 23, 2003 3:54 am

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by kenmo »

With the code page set to UTF-8, Scintilla expects UTF-8 text, regardless of whether you read it from an ASCII or a UTF-8 file. So this part is not correct, because it feeds ASCII bytes to Scintilla instead of UTF-8:

Code:

      *Line = Ascii(ReadString(0, #PB_Ascii) + #CRLF$)
      ScintillaSendMessage(#ScriptEdit, #SCI_APPENDTEXT, MemorySize(*Line), *Line)

This handles ASCII and UTF-8 files (and Unicode too?), and it simplifies your code:

Code:

  FileFormat = ReadStringFormat(0)
  *UTF8 = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
  ScintillaSendMessage(#ScriptEdit, #SCI_SETTEXT, 0, *UTF8)
  FreeMemory(*UTF8)
  ScintillaSendMessage(#ScriptEdit, #SCI_SETSAVEPOINT)

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by Stefan Schnell »

Hello kenmo,

thank you very much for your reply. Your snippet is great :)

But now the result is
Test = "Dies ist ein T”„HOPst"

Code:

037805C8  54 65 73 74 20 3D 20 22 44 69 65 73 20 69 73 74  Test = "Dies ist
037805D8  20 65 69 6E 20 54 E2 80 9D E2 80 9E C2 81 73 74   ein T”„st
037805E8  22 00                                            ".

instead of
Test = "Dies ist ein Töäüst"

Code:

037805C8  24 54 65 73 74 20 3D 20 22 44 69 65 73 20 69 73  $Test = "Dies is
037805D8  74 20 65 69 6E 20 54 C3 B6 C3 A4 C3 BC 73 74 22  t ein Töäüst"
The byte sequence of the ASCII file E2 80 9D E2 80 9E C2 81 looks very different from the UTF8 file C3 B6 C3 A4 C3 BC.

What could I do now?

Thanks for tips and hints.

Cheers
Stefan
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by Demivec »

Stefan Schnell wrote: The byte sequence of the ASCII file E2 80 9D E2 80 9E C2 81 looks very different from the UTF8 file C3 B6 C3 A4 C3 BC.

What could I do now?

I think you need to convert the non-ASCII values (i.e. those > 127) into UTF-8 values.

How to convert them depends on the code page the file was saved in.

Here is an example of how to do that:
http://www.purebasic.fr/english/viewtopic.php?f=13&t=64821&p=481696#p481696

You can also use the OS API to do the conversion.
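
As a rough illustration of the API route, here is a minimal sketch for Windows. It assumes a Unicode build of PureBasic (strings are UTF-16 internally) and that the code page of the source bytes is already known. CodePageToUTF8() is a hypothetical helper name, not a PureBasic function; MultiByteToWideChar_() is the Win32 call.

```purebasic
; Sketch only: convert bytes in a known Windows code page to a UTF-8 buffer.
; CodePageToUTF8() is a hypothetical helper. Windows only, Unicode build assumed.
EnableExplicit

Procedure.i CodePageToUTF8(*Source, SourceLength.i, CodePage.i)
  Protected Chars.i, Text.s
  If *Source = 0 Or SourceLength <= 0
    ProcedureReturn 0
  EndIf
  ; Called without an output buffer, the API returns the number of
  ; UTF-16 characters the converted text needs
  Chars = MultiByteToWideChar_(CodePage, 0, *Source, SourceLength, 0, 0)
  If Chars = 0
    ProcedureReturn 0                 ; conversion failed
  EndIf
  Text = Space(Chars)                 ; reserve room for the converted characters
  MultiByteToWideChar_(CodePage, 0, *Source, SourceLength, @Text, Chars)
  ProcedureReturn UTF8(Text)          ; caller must FreeMemory() the result
EndProcedure
```

Called with the raw file bytes and the right code page number, the returned buffer could be passed to #SCI_SETTEXT and then released with FreeMemory(). Picking the code page is still up to the caller; the API cannot guess it.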

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by kenmo »

The three characters "öäü"...

In "ASCII" (really Windows-1252, since plain ASCII stops at 7F) the bytes are:

Code:

F6 E4 FC

In UTF-8 the bytes are:

Code:

C3 B6 C3 A4 C3 BC

(Checked with http://www.ltg.ed.ac.uk/~richard/utf-8.html )
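
The two byte lists can also be checked from PureBasic itself. The following is only a sketch: HexDump() is a hypothetical helper, the characters are built with Chr() so the result does not depend on how the source file is saved, and it assumes that Ascii() and UTF8() return null-terminated buffers, which is why the last byte is skipped.

```purebasic
; Sketch only: dump the bytes PureBasic produces for "öäü" in both encodings.
; HexDump() is a hypothetical helper, not part of PureBasic.
EnableExplicit

Procedure.s HexDump(*Buffer, Length.i)
  Protected i.i, Result.s
  For i = 0 To Length - 1
    Result + RSet(Hex(PeekA(*Buffer + i)), 2, "0") + " "
  Next
  ProcedureReturn Trim(Result)
EndProcedure

Define Text.s = Chr($F6) + Chr($E4) + Chr($FC)   ; ö ä ü as Unicode code points
Define *AsciiBytes = Ascii(Text)
Define *UTF8Bytes  = UTF8(Text)

; MemorySize() includes the terminating null byte, so leave it out of the dump
Debug HexDump(*AsciiBytes, MemorySize(*AsciiBytes) - 1)
Debug HexDump(*UTF8Bytes, MemorySize(*UTF8Bytes) - 1)

FreeMemory(*AsciiBytes)
FreeMemory(*UTF8Bytes)
```

On a Windows-1252 system the two debug lines should correspond to the two byte lists above.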



So where are these bytes coming from? Do you have a short code example I can run?

Code:

E2 80 9D E2 80 9E C2 81

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by Stefan Schnell »

Hello kenmo,

thanks for your reply.

Here is a code example:

Code:

; Begin-----------------------------------------------------------------

  Global FileName.s = "C:\Dummy\Test.txt"
  Global FileFormat.i
  Global *Line

  If ReadFile(0, FileName)
    FileFormat = ReadStringFormat(0)
    If FileFormat = #PB_Ascii
      ;Ascii delivers the correct byte sequence in memory but I can't convert it to UTF8
      ;*Line = Ascii(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
      ;The byte sequence E2 80 9D E2 80 9E C2 81 comes from here, look at memory viewer
      *Line = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
      ShowMemoryViewer(*Line, MemorySize(*Line))
      FreeMemory(*Line)
    Else
      *Line = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
      ShowMemoryViewer(*Line, MemorySize(*Line))
      FreeMemory(*Line)
    EndIf
    CloseFile(0)  
  Else
    MessageRequester("Important hint", "Can not open file", 
      #PB_MessageRequester_Ok)
  EndIf 

; End-------------------------------------------------------------------

Here is the file Test.txt:

Code:

Test = "Dies ist ein Töäüst"

Thanks for your support.

Cheers
Stefan

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by kenmo »

I have a guess what your problem is.

Using Notepad++ I saved your test file in three encodings:
ASCII (Notepad++ calls it ANSI) -- loads OK
UTF-8 with BOM -- loads OK
UTF-8 without BOM -- loads wrong

You are relying on ReadStringFormat(), which is fine for ASCII files and for UTF-8 files with a byte order mark (BOM).

But a UTF-8 text file without the 3-byte BOM at the beginning looks like ASCII to ReadStringFormat(). So PB thinks the file is ASCII and converts those bytes into the wrong UTF-8 characters...


Some options are:
1. Add a UTF-8 BOM to the beginning of your UTF-8 text files
2. Assume text files are UTF-8, even if they don't have a BOM
3. There are code examples that look at the file and guess the encoding
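
Option 3 could look roughly like the sketch below: read the raw bytes and test whether they form valid UTF-8. LooksLikeUTF8() is a hypothetical helper, not part of PureBasic, and the check is simplified (it does not reject overlong forms or surrogates). The file path is the one used earlier in this thread.

```purebasic
; Sketch only: guess whether a BOM-less file is UTF-8 by validating its bytes.
; LooksLikeUTF8() is a hypothetical helper; the check is deliberately simplified.
EnableExplicit

Procedure.i LooksLikeUTF8(*Buffer, Length.i)
  Protected i.i, Follow.i, b.a
  While i < Length
    b = PeekA(*Buffer + i)
    If b < $80
      Follow = 0                      ; plain 7-bit ASCII
    ElseIf b >= $C2 And b <= $DF
      Follow = 1                      ; lead byte of a 2-byte sequence
    ElseIf b >= $E0 And b <= $EF
      Follow = 2                      ; lead byte of a 3-byte sequence
    ElseIf b >= $F0 And b <= $F4
      Follow = 3                      ; lead byte of a 4-byte sequence
    Else
      ProcedureReturn #False          ; stray continuation or invalid lead byte
    EndIf
    i + 1
    While Follow > 0
      If i >= Length
        ProcedureReturn #False        ; sequence cut off at the end of the buffer
      EndIf
      b = PeekA(*Buffer + i)
      If b < $80 Or b > $BF
        ProcedureReturn #False        ; expected a continuation byte (80..BF)
      EndIf
      i + 1
      Follow - 1
    Wend
  Wend
  ProcedureReturn #True
EndProcedure

Define FileName.s = "C:\Dummy\Test.txt"
Define *Raw, Size.i, Format.i

If ReadFile(0, FileName)
  Format = ReadStringFormat(0)        ; detects and skips a BOM if there is one
  Size = Lof(0) - Loc(0)              ; bytes remaining after the BOM
  If Format = #PB_Ascii And Size > 0
    *Raw = AllocateMemory(Size)
    If *Raw
      ReadData(0, *Raw, Size)
      If LooksLikeUTF8(*Raw, Size)
        Format = #PB_UTF8             ; no BOM, but the bytes are valid UTF-8
      EndIf
      FreeMemory(*Raw)
    EndIf
  EndIf
  CloseFile(0)
  Debug Format
EndIf
```

A pure 7-bit ASCII file also passes the check, which is harmless because ASCII is a subset of UTF-8. A file containing the bytes 94 84 81 from the first post would fail it, since 94 is a continuation byte with no lead byte; such a file still needs a code page decision.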

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by Stefan Schnell »

Hello Demivec,

thanks for your support. Now I use a MultiByteToWideChar conversion with code page 437 and it works.

Code:

; Begin-----------------------------------------------------------------

  EnableExplicit

  Global FileName.s = "C:\Dummy\Test.txt"
  Global FileFormat.i
  Global *AnsiLine
  Global *UTF16Line
  Global *UTF8Line
  Global Line.s
  
  If ReadFile(0, FileName)
    FileFormat = ReadStringFormat(0)
    If FileFormat = #PB_Ascii
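      *AnsiLine = Ascii(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
      ShowMemoryViewer(*AnsiLine, MemorySize(*AnsiLine))
      ; Worst case: one UTF-16 character (2 bytes) per input byte
      *UTF16Line = AllocateMemory(MemorySize(*AnsiLine) * 2)
      ; 437 = OEM (DOS) code page; the last argument is a character count, not a byte count
      MultiByteToWideChar_(437, 0, *AnsiLine, MemorySize(*AnsiLine), *UTF16Line, MemorySize(*UTF16Line) / 2)
      ShowMemoryViewer(*UTF16Line, MemorySize(*UTF16Line))
      Line = PeekS(*UTF16Line)
      Debug Line
      ; Release both buffers once the string has been copied out
      FreeMemory(*UTF16Line)
      FreeMemory(*AnsiLine)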
    Else
      *UTF8Line = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
      ShowMemoryViewer(*UTF8Line, MemorySize(*UTF8Line))
      FreeMemory(*UTF8Line)
    EndIf
    CloseFile(0)  
  Else
    MessageRequester("Important hint", "Can not open file", 
      #PB_MessageRequester_Ok)
  EndIf 

; End-------------------------------------------------------------------
Cheers
Stefan

Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP

Post by Stefan Schnell »

Hello kenmo,

thank you very much for your support. :)

You are right and I use your option 3 now.

Cheers
Stefan