PB 6.30 b6 - ReadStringFormat() seemingly unreliable?

ColeopterusMaximus
User
Posts: 78
Joined: Fri Oct 29, 2010 11:29 am


Post by ColeopterusMaximus »

ReadStringFormat() seems to be unreliable. Is this a bug? Am I doing something wrong? Can someone else confirm?

If the code below is run:

Code:

Define string.s="9.9.9.9,áéíóúñ,XX,2025"

Define file.i
Define format.i

file = CreateFile(#PB_Any, "test.txt", #PB_UTF8)
If file
    WriteString(file, String, #PB_UTF8)
    CloseFile(file)
EndIf

file = OpenFile(#PB_Any, "test.txt")
If file
    ; Query the format only after the handle is known to be valid
    format = ReadStringFormat(file)
    string = ReadString(file, format)
    CloseFile(file)
EndIf

Debug(string)

CallDebugger
Running it produces test.txt, a correctly formatted UTF-8 file:

Code:

hexdump -C test.txt (UTF8)
00000000  39 2e 39 2e 39 2e 39 2c  c3 a1 c3 a9 c3 ad c3 b3  |9.9.9.9,........|
00000010  c3 ba c3 b1 2c 58 58 2c  32 30 32 35              |....,XX,2025|
0000001c
chardetect test.txt 
test.txt: utf-8 with confidence 0.99
However, ReadStringFormat() reports the file as ASCII, and the code prints:

Code:

9.9.9.9,áéíóúñ,XX,2025
This is very problematic, because when I try to process files from other programs PB often gets confused...
AZJIO
Addict
Posts: 2254
Joined: Sun May 14, 2017 1:48 am


Post by AZJIO »

You need to write the BOM label, then it will be read.

Code:

WriteStringFormat()
viewtopic.php?p=478874
viewtopic.php?p=617388#p617388
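
A minimal sketch of that advice, assuming you control the writer: WriteStringFormat() puts the BOM at the start of the file, and ReadStringFormat() then finds it. The file name "bom_test.txt" and the variable names are only illustrative.

```purebasic
; Sketch: write a UTF-8 BOM so that ReadStringFormat() can recognise the file.
; "bom_test.txt" is a hypothetical file name used only for this example.
Define file.i, format.i, line.s

file = CreateFile(#PB_Any, "bom_test.txt")
If file
  WriteStringFormat(file, #PB_UTF8)        ; writes the UTF-8 BOM (EF BB BF)
  WriteString(file, "áéíóúñ", #PB_UTF8)
  CloseFile(file)
EndIf

file = ReadFile(#PB_Any, "bom_test.txt")
If file
  format = ReadStringFormat(file)          ; checks for a BOM and skips it if present
  line = ReadString(file, format)
  CloseFile(file)

  If format = #PB_UTF8
    Debug "BOM found, read as UTF-8: " + line
  Else
    Debug "No UTF-8 BOM found"
  EndIf
EndIf
```

This only helps for files you write yourself; files without a BOM still need a separate detection step.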
ColeopterusMaximus
User
Posts: 78
Joined: Fri Oct 29, 2010 11:29 am


Post by ColeopterusMaximus »

AZJIO wrote: Mon Jan 12, 2026 4:06 am You need to write the BOM label, then it will be read.

Code:

WriteStringFormat()
viewtopic.php?p=478874
viewtopic.php?p=617388#p617388
Thanks for replying.

The problem with a BOM is that I receive files from third parties. They are sometimes ASCII and sometimes UTF-8, and they do not include a BOM. For other, unrelated reasons I can't write a BOM in my output files, but fortunately writing is not a problem, as I always save my own data consistently in UTF-8.

I know what the encoding problem is. Once I noticed it, I wrote a function similar to the one in the example that scans the entire file, so this is not a problem for me; I'm not asking for help decoding files.

The problem IMHO is that ReadStringFormat() is not reliable. I know that the correct way is to scan the entire file from top to bottom as stated above, but ReadStringFormat() should at least consider the first 1024 bytes of a file to determine the encoding and not be fooled by a few bytes. As it is, ReadStringFormat() is worse than useless, as it causes problems.

So maybe not a bug but an enhancement?
Axolotl
Addict
Posts: 921
Joined: Wed Dec 31, 2008 3:36 pm


Post by Axolotl »

Hi there,
IMHO everything is described in the help section for the command ReadStringFormat(). Please pay particular attention to the remarks section.
Only the statement “...and tries to identify the String encoding used ...” leaves room for interpretation and expectations.
However, it is clear that the return value #PB_Ascii is equivalent to NO_BOM, even if there is no separate constant for it.

And of course, you can also submit a request for an improvement/extension in the wish list.
AZJIO
Addict
Posts: 2254
Joined: Sun May 14, 2017 1:48 am


Post by AZJIO »

ColeopterusMaximus wrote: Mon Jan 12, 2026 10:30 am The problem IMHO is that ReadStringFormat() is not reliable,
You misunderstand what this function is for. It reads the BOM and nothing else; it does not try to work out which encoding the file is in. It reads the beginning of the file and checks whether it matches one of the known BOMs. A BOM is written by whoever produces the file, precisely so that readers do not have to guess the encoding: it tells them which format the file is in. If there is no BOM, you either need to know the format in advance or use a separate function to detect the encoding.

You mentioned 1024 bytes, but ReadStringFormat() reads no more than 4 bytes, because a BOM is at most 4 bytes long. A 1024-byte buffer is only useful to a function that guesses the encoding, and ReadStringFormat() is not such a function.

Look at the BOM table.
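
For reference, the standard BOM signatures can also be checked by hand. The following is only a sketch of that idea, not a description of how ReadStringFormat() is implemented; DetectBOM() is a hypothetical helper name.

```purebasic
; Sketch: identify a BOM from the first (at most) four bytes of a file.
; DetectBOM() is a hypothetical helper, not a PureBasic built-in.
Procedure.s DetectBOM(path.s)
  Protected file.i, count.i
  Protected name.s = "none"
  Protected Dim b.a(3)                     ; four unsigned bytes, initially 0

  file = ReadFile(#PB_Any, path)
  If file
    count = ReadData(file, @b(0), 4)       ; may read fewer than 4 bytes
    CloseFile(file)

    ; The 4-byte signatures are tested first, because the UTF-32 LE BOM
    ; (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    If count >= 4 And b(0) = $FF And b(1) = $FE And b(2) = $00 And b(3) = $00
      name = "UTF-32 LE"
    ElseIf count >= 4 And b(0) = $00 And b(1) = $00 And b(2) = $FE And b(3) = $FF
      name = "UTF-32 BE"
    ElseIf count >= 3 And b(0) = $EF And b(1) = $BB And b(2) = $BF
      name = "UTF-8"
    ElseIf count >= 2 And b(0) = $FF And b(1) = $FE
      name = "UTF-16 LE"
    ElseIf count >= 2 And b(0) = $FE And b(1) = $FF
      name = "UTF-16 BE"
    EndIf
  EndIf

  ProcedureReturn name
EndProcedure

Debug DetectBOM("test.txt")
```

The test.txt from the first post starts with the bytes 39 2e 39, which match none of these signatures, so a check of this kind cannot tell it apart from plain ASCII.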


viewtopic.php?p=636306#p636306
viewtopic.php?p=617348#p617348

Code:

Procedure.s OpenFileToVar(FilePath$)
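	; g_Format is not declared here and is assumed to be a global declared elsewhere;
	; dte:: is assumed to be the encoding-detection module from the linked topic.
	Protected length, oFile, bytes, *mem, Text$
	oFile = ReadFile(#PB_Any, FilePath$)
	If oFile
		g_Format = ReadStringFormat(oFile) ; checks for a BOM and skips it if present
		length = Lof(oFile)
		*mem = AllocateMemory(length + 2) ; 2 extra zero bytes so PeekS() always finds a terminator
		If *mem
			bytes = ReadData(oFile, *mem, length)
			If bytes
				If g_Format = #PB_Ascii ; no BOM found, so guess the encoding from the content
					g_Format = dte::detectTextEncodingInBuffer(*mem, bytes, 0)
					If g_Format = #PB_Ascii
						Text$ = PeekS(*mem, bytes, #PB_Ascii)
					Else
						Text$ = PeekS(*mem, bytes, #PB_UTF8) ; UTF-8 without BOM
					EndIf
				Else
					; ReadStringFormat() can also return #PB_UTF16BE, #PB_UTF32 or #PB_UTF32BE.
					; These formats are rare and unlikely to be encountered, so they are not handled specially here.
					Text$ = PeekS(*mem, bytes, g_Format)
				EndIf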
			EndIf
			FreeMemory(*mem)
		EndIf
		CloseFile(oFile)
	EndIf
	ProcedureReturn Text$
EndProcedure
ColeopterusMaximus
User
Posts: 78
Joined: Fri Oct 29, 2010 11:29 am


Post by ColeopterusMaximus »

Axolotl wrote: Mon Jan 12, 2026 11:05 am Hi there,
IMHO everything is described in the help section for the command ReadStringFormat(). Please pay particular attention to the remarks section.
You're right, I had a completely wrong idea of what ReadStringFormat() does. This is partly because I wrote a library to manage files a very long time ago, and since then I have carried a wrong, abstracted picture in my head of what ReadStringFormat() did.

I started having problems because I have to deal with 3rd party files now and they are very inconsistent, a circumstance I didn't have to deal with in the past.

In my head I had assumed ReadStringFormat() was checking for the leading/intermediate bits to identify UTF vs ASCII and that is clearly not the case.
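
To make the leading/intermediate bit check concrete, here is a small sketch that classifies a single byte by its high bits, following the UTF-8 bit patterns. Utf8ByteRole() is a hypothetical helper name, and the sample bytes are taken from the hexdump in the first post.

```purebasic
; Sketch: classify one byte by its high bits, as UTF-8 defines them.
; Utf8ByteRole() is a hypothetical helper, not a PureBasic built-in.
Procedure.s Utf8ByteRole(b.a)
  If (b & %10000000) = 0
    ProcedureReturn "ASCII (0xxxxxxx)"
  ElseIf (b & %11000000) = %10000000
    ProcedureReturn "intermediate byte (10xxxxxx)"
  ElseIf (b & %11100000) = %11000000
    ProcedureReturn "leading byte of a 2-byte sequence (110xxxxx)"
  ElseIf (b & %11110000) = %11100000
    ProcedureReturn "leading byte of a 3-byte sequence (1110xxxx)"
  ElseIf (b & %11111000) = %11110000
    ProcedureReturn "leading byte of a 4-byte sequence (11110xxx)"
  EndIf
  ProcedureReturn "not valid in UTF-8"
EndProcedure

Debug Utf8ByteRole($39)   ; the byte for "9"
Debug Utf8ByteRole($C3)   ; first byte of "á" in the hexdump
Debug Utf8ByteRole($A1)   ; second byte of "á" in the hexdump
```

A file in which every leading byte is followed by the right number of intermediate bytes is very likely UTF-8; a pure-ASCII file never has the high bit set at all.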
Axolotl wrote: Mon Jan 12, 2026 11:05 am Only the statement “...and tries to identify the String encoding used ...” leaves room for interpretation and expectations.
However, it is clear that the return value #PB_Ascii is equivalent to NO_BOM, even if there is no separate constant for it.

And of course, you can also submit a request for an improvement/extension in the wish list.

Yeah, you're 100% correct. I was very confused about the true purpose of ReadStringFormat().

I solved the problem quickly and wrote a function that finds the format of a file either on the first 1024 bytes or on its entirety; it is not a difficult problem.

Again thanks for your attention, very much appreciated.
highend
Enthusiast
Posts: 170
Joined: Tue Jun 17, 2014 4:49 pm


Post by highend »

and wrote a function that finds the format of a file either on the first 1024 bytes or on its entirety
How about making it available for other users?
ColeopterusMaximus
User
Posts: 78
Joined: Fri Oct 29, 2010 11:29 am


Post by ColeopterusMaximus »

highend wrote: Tue Jan 13, 2026 6:03 pm How about making it available for other users?
Not sure this would be of use to anybody, but here it goes:

Code:

    ; Encode check methods
    #ENCODING_CHECK_SLOW = 0
    #ENCODING_CHECK_FAST = 1

    ; Identifies if the encoding of a file is UTF8 or ASCII
    ; by means of scanning the entire file or just the first
    ; chunk of the file.
    ; Returns -1 on error, #PB_UTF8 or #PB_ASCII on success
    Procedure.i _IdentifyEncoding(str_file_path.s, bol_quick.i = #ENCODING_CHECK_SLOW)
         Protected *ptr_file_buffer
         Protected int_buffer_size.i   = 1024
         Protected hnd_file.i          = 0
         Protected int_read.i          = 0
         Protected int_filesize.i      = 0
         Protected int_total_read.i    = 0
         Protected int_offset.i        = -1
         Protected int_state.i         = 0
         Protected int_result.i        = -1
         Protected asc_byte.a          = 0
         Protected int_leadingbytes.i  = 0
         Protected int_interby_count.i = 0

        Enumeration
            #ENCODINGID_START
            #ENCODINGID_READ_NEXT_FCHUNK
            #ENCODINGID_CHECK_NEXT_BYTE
            #ENCODINGID_FINISH
            #ENCODINGID_END
        EndEnumeration

        ; By default we assume ASCII
        int_result = #PB_Ascii

        int_state = #ENCODINGID_START
        While int_state <> #ENCODINGID_END
            Select int_state
                Case #ENCODINGID_START
                    hnd_file = ReadFile(#PB_Any, str_file_path)
                    If hnd_file
                        int_filesize = FileSize(str_file_path)
                        *ptr_file_buffer = AllocateMemory(int_buffer_size)
                        If Not *ptr_file_buffer
                            ; Go through FINISH so the open file gets closed
                            int_result = -1
                            int_state = #ENCODINGID_FINISH
                            Continue
                        EndIf
                    Else
                        int_result = -1
                        int_state = #ENCODINGID_FINISH
                        Continue
                    EndIf
                    int_state = #ENCODINGID_READ_NEXT_FCHUNK

                Case #ENCODINGID_READ_NEXT_FCHUNK
                    If int_total_read < int_filesize
                        int_read  = ReadData(hnd_file, *ptr_file_buffer, int_buffer_size)
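                        If int_read
                            int_total_read + int_read
                            int_state = #ENCODINGID_CHECK_NEXT_BYTE
                        Else
                            ; Nothing could be read: report an error instead of looping forever
                            int_result = -1
                            int_state = #ENCODINGID_FINISH
                        EndIf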
                    Else
                        int_state = #ENCODINGID_FINISH
                        Continue
                    EndIf

                Case #ENCODINGID_CHECK_NEXT_BYTE
                    int_offset + 1
                    If int_offset >= int_read
                        ; If we're decoding slow we do the entire file
                        ; Otherwise we just read the first chunk and finish
                        If bol_quick = #ENCODING_CHECK_SLOW
                            int_state = #ENCODINGID_READ_NEXT_FCHUNK
                        Else
                            int_state = #ENCODINGID_FINISH
                        EndIf
                        int_offset = -1
                        Continue
                    EndIf

                    asc_byte = PeekA(*ptr_file_buffer + int_offset) ; unsigned read, matching the .a variable

                    ; ASCII
                    If asc_byte & %10000000 = 0
                        int_leadingbytes  = 0 ; no leading
                        int_interby_count = 0 ; reset intermediate count
                        int_state = #ENCODINGID_CHECK_NEXT_BYTE
                        Continue
                    EndIf

                    ; Intermediate byte
                    If asc_byte & %11000000 = %10000000
                        ; If we found a leading byte earlier we count as many
                        ; intermediate bytes as required.

                        If int_leadingbytes
                            ; There was a leading byte
                            int_interby_count + 1
                            ; if we found all intermediate bytes we have our format
                            If int_interby_count >= int_leadingbytes
                                int_result = #PB_UTF8
                                int_state  = #ENCODINGID_FINISH
                                Continue
                            EndIf
                        Else
                            ; There was no leading
                            int_leadingbytes  = 0 ; no leading
                            int_interby_count = 0 ; reset intermediate count
                        EndIf

                        int_state = #ENCODINGID_CHECK_NEXT_BYTE
                        Continue
                    EndIf

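                    ; Leading byte of a 2-byte sequence (110xxxxx): expects 1 intermediate byte
                    If asc_byte & %11100000 = %11000000
                        int_leadingbytes  = 1
                        int_interby_count = 0 ; start counting afresh for this sequence
                        int_state = #ENCODINGID_CHECK_NEXT_BYTE
                        Continue
                    EndIf

                    ; Leading byte of a 3-byte sequence (1110xxxx): expects 2 intermediate bytes
                    If asc_byte & %11110000 = %11100000
                        int_leadingbytes  = 2
                        int_interby_count = 0
                        int_state = #ENCODINGID_CHECK_NEXT_BYTE
                        Continue
                    EndIf

                    ; Leading byte of a 4-byte sequence (11110xxx): expects 3 intermediate bytes
                    If asc_byte & %11111000 = %11110000
                        int_leadingbytes  = 3
                        int_interby_count = 0
                        int_state = #ENCODINGID_CHECK_NEXT_BYTE
                        Continue
                    EndIf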

                Case #ENCODINGID_FINISH
                    If hnd_file
                        CloseFile(hnd_file)
                    EndIf
                    If *ptr_file_buffer
                        FreeMemory(*ptr_file_buffer)
                        *ptr_file_buffer = 0
                    EndIf
                    int_state = #ENCODINGID_END
            EndSelect
        Wend

        ProcedureReturn int_result
    EndProcedure
Demivec
Addict
Posts: 4283
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA


Post by Demivec »

highend wrote: Tue Jan 13, 2026 6:03 pm
and wrote a function that finds the format of a file either on the first 1024 bytes or on its entirety
How about making it available for other users?
Also, this link has some potentially useful code:
Detecting Text File Encoding without BOM