How to view ASCII file in Scintilla Gadget with UTF-8 CP?
Posted: Sun Sep 25, 2016 3:57 pm
by Stefan Schnell
Hello community,
a tiny question:
I use the Scintilla gadget, and with the following code I want to load ASCII- or UTF-8-encoded files.
Code: Select all
If ReadFile(0, FileName)
  FileFormat = ReadStringFormat(0)
  ScintillaSendMessage(#ScriptEdit, #SCI_CLEARALL)
  While Eof(0) = 0
    If FileFormat = #PB_Ascii
      *Line = Ascii(ReadString(0, #PB_Ascii) + #CRLF$)
      ScintillaSendMessage(#ScriptEdit, #SCI_APPENDTEXT, MemorySize(*Line), *Line)
      FreeMemory(*Line)
    ElseIf FileFormat = #PB_UTF8
      Line = ReadString(0, #PB_UTF8) + #CRLF$
      *Line = UTF8(Line) ; UTF8() allocates a buffer, so keep the pointer and free it
      ScintillaSendMessage(#ScriptEdit, #SCI_APPENDTEXT, StringByteLength(Line, #PB_UTF8), *Line)
      FreeMemory(*Line)
    EndIf
  Wend
  ScintillaSendMessage(#ScriptEdit, #SCI_SETSAVEPOINT)
  CloseFile(0)
Else
  MessageRequester("Important hint", "Can not open file",
                   #PB_MessageRequester_Ok)
EndIf
With UTF8 all works well but if I load an ASCII file I get the following result:
Test = "Dies ist ein Tx94x84x81st"
instead of
Test = "Dies ist ein Töäüst"
I set the Scintilla gadget to the UTF8 codepage.
ScintillaSendMessage(#ScriptEdit, #SCI_SETCODEPAGE, #SC_CP_UTF8)
Thanks for tips and hints.
Cheers
Stefan
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Sun Sep 25, 2016 6:23 pm
by kenmo
Scintilla expects UTF-8 text, regardless of whether you read it from an ASCII or a UTF-8 file. So this part is not correct, because you are feeding ASCII instead of UTF-8 to Scintilla:
Code: Select all
*Line = Ascii(ReadString(0, #PB_Ascii) + #CRLF$)
ScintillaSendMessage(#ScriptEdit, #SCI_APPENDTEXT, MemorySize(*Line), *Line)
This handles ASCII and UTF8 files (and Unicode too?) plus it simplifies your code:
Code: Select all
FileFormat = ReadStringFormat(0)
*UTF8 = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
ScintillaSendMessage(#ScriptEdit, #SCI_SETTEXT, 0, *UTF8)
FreeMemory(*UTF8)
ScintillaSendMessage(#ScriptEdit, #SCI_SETSAVEPOINT)
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 4:16 am
by Stefan Schnell
Hello kenmo,
thank you very much for your reply. Your snippet is great.
But now the result is
Test = "Dies ist ein T”„[HOP]st"
Code: Select all
037805C8 54 65 73 74 20 3D 20 22 44 69 65 73 20 69 73 74 Test = "Dies ist
037805D8 20 65 69 6E 20 54 E2 80 9D E2 80 9E C2 81 73 74 ein TââÂst
037805E8 22 00 ".
instead of
Test = "Dies ist ein Töäüst"
Code: Select all
037805C8 24 54 65 73 74 20 3D 20 22 44 69 65 73 20 69 73 $Test = "Dies is
037805D8 74 20 65 69 6E 20 54 C3 B6 C3 A4 C3 BC 73 74 22 t ein Töäüst"
The byte sequence of the ASCII file E2 80 9D E2 80 9E C2 81 looks very different from the UTF8 file C3 B6 C3 A4 C3 BC.
What could I do now?
Thanks for tips and hints.
Cheers
Stefan
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 5:29 am
by Demivec
Stefan Schnell wrote:The byte sequence of the ASCII file E2 80 9D E2 80 9E C2 81 looks very different from the UTF8 file C3 B6 C3 A4 C3 BC.
What could I do now?
I think you need to convert the non-ASCII values (i.e. those > 127) into UTF-8 values.
This depends on the code page that was used.
Here is an example of how to do that:
http://www.purebasic.fr/english/viewtopic.php?f=13&t=64821&p=481696#p481696
You can also use an API call to do the conversion.
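As a rough illustration of the API route on Windows, the sketch below wraps MultiByteToWideChar_ in a small helper that turns a buffer of single-byte text into a PureBasic string, which UTF8() can then encode for Scintilla. CodePageToString() is a made-up helper name, not a PureBasic built-in. The sketch assumes a Unicode build of PureBasic, and the caller has to know (or guess) the code page the file was written in.

```purebasic
; Sketch only: convert single-byte text in a given Windows code page to a PureBasic string.
; CodePageToString() is a hypothetical helper; Windows only, Unicode build assumed.
Procedure.s CodePageToString(*Source, ByteCount.i, CodePage.i)
  Protected Result.s, CharCount.i, *Wide

  If *Source = 0 Or ByteCount <= 0
    ProcedureReturn ""
  EndIf

  ; With a zero-sized output buffer the call only returns the number of UTF-16 characters needed
  CharCount = MultiByteToWideChar_(CodePage, 0, *Source, ByteCount, 0, 0)
  If CharCount > 0
    *Wide = AllocateMemory((CharCount + 1) * 2)   ; +1 character for a terminating null
    If *Wide
      MultiByteToWideChar_(CodePage, 0, *Source, ByteCount, *Wide, CharCount)
      Result = PeekS(*Wide, CharCount, #PB_Unicode)
      FreeMemory(*Wide)
    EndIf
  EndIf

  ProcedureReturn Result
EndProcedure

; Usage: the three bytes $94 $84 $81, interpreted as code page 437
Define *Raw = AllocateMemory(3)
If *Raw
  PokeA(*Raw, $94) : PokeA(*Raw + 1, $84) : PokeA(*Raw + 2, $81)
  Debug CodePageToString(*Raw, 3, 437)
  FreeMemory(*Raw)
EndIf
```

In code page 437 those three bytes stand for ö, ä and ü, so the Debug line should show "öäü". Passing the resulting string through UTF8() gives a buffer that can be handed to #SCI_SETTEXT.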
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 5:50 pm
by kenmo
The three characters "öäü"...
In ASCII (Windows-1252) the bytes are: F6 E4 FC
In UTF-8 the bytes are: C3 B6 C3 A4 C3 BC
(Checked with http://www.ltg.ed.ac.uk/~richard/utf-8.html )
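The UTF-8 bytes can also be checked from PureBasic itself. This is a minimal sketch that assumes a Unicode build; the variable names are only illustrative. The characters are built with Chr() so the result does not depend on the encoding of the source file.

```purebasic
; Sketch only: print the UTF-8 bytes of "öäü" as hex (Unicode build assumed)
Define Text.s = Chr($F6) + Chr($E4) + Chr($FC)         ; ö ä ü as code points
Define ByteCount.i = StringByteLength(Text, #PB_UTF8)  ; byte count without the terminating null
Define *UTF8 = UTF8(Text)
Define i.i, Dump.s

If *UTF8
  For i = 0 To ByteCount - 1
    Dump + RSet(Hex(PeekA(*UTF8 + i)), 2, "0") + " "
  Next
  FreeMemory(*UTF8)
EndIf

Debug Trim(Dump)
```

This should print C3 B6 C3 A4 C3 BC, the same sequence as in the memory dump of the UTF-8 file.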
So where are these bytes coming from? Do you have a short example code I can run?
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 9:12 pm
by Stefan Schnell
Hello kenmo,
thanks for your reply.
Here is a code example:
Code: Select all
; Begin-----------------------------------------------------------------
Global FileName.s = "C:\Dummy\Test.txt"
Global FileFormat.i
Global *Line
If ReadFile(0, FileName)
  FileFormat = ReadStringFormat(0)
  If FileFormat = #PB_Ascii
    ;Ascii delivers the correct byte sequence in memory but I can't convert it to UTF8
    ;*Line = Ascii(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
    ;The byte sequence E2 80 9D E2 80 9E C2 81 comes from here, look at memory viewer
    *Line = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
    ShowMemoryViewer(*Line, MemorySize(*Line))
    FreeMemory(*Line)
  Else
    *Line = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
    ShowMemoryViewer(*Line, MemorySize(*Line))
    FreeMemory(*Line)
  EndIf
  CloseFile(0)
Else
  MessageRequester("Important hint", "Can not open file",
                   #PB_MessageRequester_Ok)
EndIf
; End-------------------------------------------------------------------
Here the file Test.txt:
Thanks for your support.
Cheers
Stefan
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 9:44 pm
by kenmo
I have a guess what your problem is.
Using Notepad++ I saved your test file in three encodings:
ASCII (Notepad++ calls it ANSI) -- loads OK
UTF-8 with BOM -- loads OK
UTF-8 without BOM -- loads wrong
You are relying on ReadStringFormat(), which is OK for ASCII files and for UTF-8 files with a Byte Order Mark.
But a UTF-8 text file without the 3-byte BOM at the beginning looks like ASCII to ReadStringFormat(). So PB thinks the file is ASCII and converts those bytes into the wrong UTF-8 characters...
Some options are:
1. Add a UTF-8 BOM to the beginning of your UTF-8 text files
2. Assume text files are UTF-8, even if they don't have a BOM
3. There are code examples that look at the file and guess the encoding
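Option 3 could look roughly like the sketch below: read the raw bytes and test whether they form well-formed UTF-8. IsValidUTF8() is a made-up helper name, not a PureBasic built-in, and the check is simplified (it does not reject overlong forms or surrogate code points). A file that fails the check is most likely in some single-byte code page, while pure 7-bit ASCII always passes.

```purebasic
; Sketch only: returns #True if the buffer holds only well-formed UTF-8 sequences.
; IsValidUTF8() is a hypothetical helper; overlong forms and surrogates are not rejected.
Procedure.i IsValidUTF8(*Buffer, Length.i)
  Protected i.i, b.a, Follow.i

  While i < Length
    b = PeekA(*Buffer + i)
    If b < $80
      Follow = 0                      ; plain 7-bit ASCII
    ElseIf b >= $C2 And b <= $DF
      Follow = 1                      ; lead byte of a 2-byte sequence
    ElseIf b >= $E0 And b <= $EF
      Follow = 2                      ; lead byte of a 3-byte sequence
    ElseIf b >= $F0 And b <= $F4
      Follow = 3                      ; lead byte of a 4-byte sequence
    Else
      ProcedureReturn #False          ; stray continuation byte or invalid lead byte
    EndIf

    i + 1
    While Follow > 0
      If i >= Length
        ProcedureReturn #False        ; sequence is cut off at the end of the buffer
      EndIf
      b = PeekA(*Buffer + i)
      If b < $80 Or b > $BF
        ProcedureReturn #False        ; not a continuation byte (10xxxxxx)
      EndIf
      i + 1
      Follow - 1
    Wend
  Wend

  ProcedureReturn #True
EndProcedure

; Usage: load the whole file as raw bytes and check it
If ReadFile(0, "C:\Dummy\Test.txt")
  Define Size.i = Lof(0)
  If Size > 0
    Define *Raw = AllocateMemory(Size)
    If *Raw
      ReadData(0, *Raw, Size)
      If IsValidUTF8(*Raw, Size)
        Debug "Looks like UTF-8 (or plain ASCII)"
      Else
        Debug "Not UTF-8, probably a single-byte code page"
      EndIf
      FreeMemory(*Raw)
    EndIf
  EndIf
  CloseFile(0)
EndIf
```

A UTF-8 BOM (EF BB BF) is itself a valid 3-byte sequence, so files with a BOM pass the check as well.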
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 10:35 pm
by Stefan Schnell
Hello Demivec,
thanks for your support. Now I use MultiByteToWideChar conversion and it works.
Code: Select all
; Begin-----------------------------------------------------------------
EnableExplicit
Global FileName.s = "C:\Dummy\Test.txt"
Global FileFormat.i
Global *AnsiLine
Global *UTF16Line
Global *UTF8Line
Global Line.s
If ReadFile(0, FileName)
  FileFormat = ReadStringFormat(0)
  If FileFormat = #PB_Ascii
    *AnsiLine = Ascii(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
    ShowMemoryViewer(*AnsiLine, MemorySize(*AnsiLine))
    *UTF16Line = AllocateMemory(MemorySize(*AnsiLine) * 2)
    ; 437 = OEM (DOS) code page; the last argument is a size in characters, not in bytes
    MultiByteToWideChar_(437, 0, *AnsiLine, MemorySize(*AnsiLine), *UTF16Line, MemorySize(*UTF16Line) / 2)
    ShowMemoryViewer(*UTF16Line, MemorySize(*UTF16Line))
    Line = PeekS(*UTF16Line)
    Debug Line
    ; Release both buffers once the string has been copied
    FreeMemory(*UTF16Line)
    FreeMemory(*AnsiLine)
  Else
    *UTF8Line = UTF8(ReadString(0, FileFormat | #PB_File_IgnoreEOL))
    ShowMemoryViewer(*UTF8Line, MemorySize(*UTF8Line))
    FreeMemory(*UTF8Line)
  EndIf
  CloseFile(0)
Else
  MessageRequester("Important hint", "Can not open file",
                   #PB_MessageRequester_Ok)
EndIf
; End-------------------------------------------------------------------
Cheers
Stefan
Re: How to view ASCII file in Scintilla Gadget with UTF-8 CP
Posted: Mon Sep 26, 2016 10:41 pm
by Stefan Schnell
Hello kenmo,
thank you very much for your support.
You are right and I use your option 3 now.
Cheers
Stefan