PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Mon Jan 12, 2026 2:22 am
by ColeopterusMaximus
ReadStringFormat() seems to be unreliable. Is this a bug? Am I doing something wrong? Can someone else confirm?
If the code below is run:
Code:
Define string.s = "9.9.9.9,áéíóúñ,XX,2025"
Define file.i
Define format.i
file = CreateFile(#PB_Any, "test.txt", #PB_UTF8)
If file
  WriteString(file, string, #PB_UTF8) ; no BOM is written here
  CloseFile(file)
EndIf
file = OpenFile(#PB_Any, "test.txt")
If file
  format = ReadStringFormat(file) ; only call this on a valid file handle
  string = ReadString(file, format)
  CloseFile(file)
EndIf
Debug string
CallDebugger
The resulting file test.txt is correctly formatted UTF-8:
Code:
hexdump -C test.txt (UTF8)
00000000 39 2e 39 2e 39 2e 39 2c c3 a1 c3 a9 c3 ad c3 b3 |9.9.9.9,........|
00000010 c3 ba c3 b1 2c 58 58 2c 32 30 32 35 |....,XX,2025|
0000001c
chardetect test.txt
test.txt: utf-8 with confidence 0.99
However, ReadStringFormat() reports the file as ASCII (it returns #PB_Ascii).
This is very problematic because when I try to process files from other programs, PB often gets confused...
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Mon Jan 12, 2026 4:06 am
by AZJIO
You need to write the BOM; then the format will be detected.
viewtopic.php?p=478874
viewtopic.php?p=617388#p617388
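For readers who control the writing side, the BOM suggestion can be sketched roughly as follows. This is a minimal illustration, not code from the linked posts: it assumes WriteStringFormat() is called before the text is written, and "test_bom.txt" is a hypothetical file name. With the BOM in place, ReadStringFormat() should report #PB_UTF8.

```purebasic
; Minimal sketch: write a UTF-8 BOM so ReadStringFormat() can identify the encoding.
; "test_bom.txt" is a hypothetical file name used only for this example.
Define text.s = "9.9.9.9,áéíóúñ,XX,2025"
Define file.i
Define format.i
file = CreateFile(#PB_Any, "test_bom.txt")
If file
  WriteStringFormat(file, #PB_UTF8)   ; writes the UTF-8 BOM (EF BB BF)
  WriteString(file, text, #PB_UTF8)   ; then the text itself
  CloseFile(file)
EndIf
file = ReadFile(#PB_Any, "test_bom.txt")
If file
  format = ReadStringFormat(file)     ; reads the BOM, if one is present
  text = ReadString(file, format)
  CloseFile(file)
  Debug text
EndIf
```

This only helps when the writer can add a BOM; files that arrive without one still need a separate encoding check.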
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Mon Jan 12, 2026 10:30 am
by ColeopterusMaximus
Thanks for replying.
The problem with a BOM is that I receive files from 3rd parties; they are sometimes ASCII and sometimes UTF-8, and they do not include a BOM. For other, unrelated reasons I can't write a BOM in the output files, but fortunately writing is not a problem, as I always save my own data in UTF-8.
I know what the encoding problem is. When I noticed it, I wrote a function similar to the one in the example to scan the entire file, so this is not a problem for me. I'm not asking for help decoding files.
The problem IMHO is that ReadStringFormat() is not reliable. I know that the correct way is to scan the entire file from top to bottom, as stated above, but ReadStringFormat() should at least consider the first 1024 bytes of a file to determine the encoding and not be fooled by a few bytes. As it is, ReadStringFormat() is worse than useless, as it causes problems.
So maybe not a bug but an enhancement?
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Mon Jan 12, 2026 11:05 am
by Axolotl
Hi there,
IMHO everything is described in the help section for the command ReadStringFormat(). Please pay particular attention to the remarks section.
Only the statement “...and tries to identify the String encoding used ...” leaves room for interpretation and expectations.
However, it is clear that the return value #PB_Ascii is equivalent to NO_BOM, even if there is no separate constant for it.
And of course, you can also submit a request for an improvement/extension in the wish list.
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Mon Jan 12, 2026 1:28 pm
by AZJIO
ColeopterusMaximus wrote: Mon Jan 12, 2026 10:30 am
The problem IMHO is that ReadStringFormat() is not reliable,
You don't understand the essence of this function. It reads the BOM and nothing else. It does not try to find out which encoding the file is in; it reads the beginning of the file to check whether it matches one of the possible BOMs. The BOM is written by the file's producer to eliminate the need to guess the encoding: it exists specifically to tell readers which format the file is written in. If there is no BOM, you either need to know what format the file is in, or use a function that determines the encoding of the file. You wrote about 1024 bytes, but ReadStringFormat() reads no more than 4 bytes, since a BOM is no more than 4 bytes long. A 1024-byte buffer is only useful to an encoding-guessing function, and ReadStringFormat() is not an encoding-guessing function.
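A BOM check of this kind can be sketched as below. This is an illustration only, not PureBasic's actual implementation: DetectBom() is a hypothetical helper that compares the first bytes of a buffer against the well-known BOM signatures and returns #PB_Ascii when none matches, mirroring the "no BOM" result described above.

```purebasic
; Sketch of a BOM check on a memory buffer.
; DetectBom() is a hypothetical helper, not PureBasic's own code.
; Returns the matching #PB_xxx constant, or #PB_Ascii when no known BOM is found.
Procedure.i DetectBom(*buf, size.i)
  Protected b0.a, b1.a, b2.a, b3.a
  If size >= 1 : b0 = PeekA(*buf) : EndIf
  If size >= 2 : b1 = PeekA(*buf + 1) : EndIf
  If size >= 3 : b2 = PeekA(*buf + 2) : EndIf
  If size >= 4 : b3 = PeekA(*buf + 3) : EndIf
  ; UTF-32 is tested before UTF-16 because FF FE 00 00 starts with the UTF-16 LE BOM
  If size >= 4 And b0 = $FF And b1 = $FE And b2 = 0 And b3 = 0 : ProcedureReturn #PB_UTF32 : EndIf
  If size >= 4 And b0 = 0 And b1 = 0 And b2 = $FE And b3 = $FF : ProcedureReturn #PB_UTF32BE : EndIf
  If size >= 3 And b0 = $EF And b1 = $BB And b2 = $BF : ProcedureReturn #PB_UTF8 : EndIf
  If size >= 2 And b0 = $FF And b1 = $FE : ProcedureReturn #PB_Unicode : EndIf
  If size >= 2 And b0 = $FE And b1 = $FF : ProcedureReturn #PB_UTF16BE : EndIf
  ProcedureReturn #PB_Ascii ; no BOM found
EndProcedure

; Usage: a buffer that starts with the UTF-8 BOM
Define *demo = AllocateMemory(4)
If *demo
  PokeA(*demo, $EF) : PokeA(*demo + 1, $BB) : PokeA(*demo + 2, $BF)
  Debug Bool(DetectBom(*demo, 3) = #PB_UTF8)
  FreeMemory(*demo)
EndIf
```

Note that at most four bytes are inspected, so the content of the rest of the file never influences the result.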
Look at the BOM table.
viewtopic.php?p=636306#p636306
viewtopic.php?p=617348#p617348
Code:
Procedure.s OpenFileToVar(FilePath$)
  ; Note: g_Format is not declared here and is presumably a Global defined elsewhere;
  ; dte::detectTextEncodingInBuffer() comes from a separate module that is not shown.
  Protected length, oFile, bytes, *mem, Text$
  oFile = ReadFile(#PB_Any, FilePath$)
  If oFile
    g_Format = ReadStringFormat(oFile)
    length = Lof(oFile)
    *mem = AllocateMemory(length)
    If *mem
      bytes = ReadData(oFile, *mem, length)
      If bytes
        If g_Format = #PB_Ascii
          ; No BOM found: fall back to guessing the encoding from the content
          g_Format = dte::detectTextEncodingInBuffer(*mem, bytes, 0)
          If g_Format = #PB_Ascii
            Text$ = PeekS(*mem, bytes, #PB_Ascii)
          Else
            Text$ = PeekS(*mem, bytes, #PB_UTF8) ; UTF-8 without BOM
          EndIf
        Else
          ; ReadStringFormat() can also return #PB_UTF16BE, #PB_UTF32 or #PB_UTF32BE;
          ; these formats are not popular, so they are unlikely to be encountered and can be ignored.
          Text$ = PeekS(*mem, bytes, g_Format)
        EndIf
      EndIf
      FreeMemory(*mem)
    EndIf
    CloseFile(oFile)
  EndIf
  ProcedureReturn Text$
EndProcedure
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Tue Jan 13, 2026 10:01 am
by ColeopterusMaximus
Axolotl wrote: Mon Jan 12, 2026 11:05 am
Hi there,
IMHO everything is described in the help section for the command ReadStringFormat(). Please pay particular attention to the remarks section.
You're right, I had a completely wrong idea of what ReadStringFormat() does. This comes in part from having written a library to manage files a very long time ago; I had it all abstracted away, and somehow ended up with the wrong picture of what ReadStringFormat() did.
I started having problems because I have to deal with 3rd party files now and they are very inconsistent, a circumstance I didn't have to deal with in the past.
In my head I had assumed ReadStringFormat() was checking for the leading/intermediate bits to identify UTF vs ASCII and that is clearly not the case.
Axolotl wrote: Mon Jan 12, 2026 11:05 am
Only the statement “...and tries to identify the String encoding used ...” leaves room for interpretation and expectations.
However, it is clear that the return value #PB_Ascii is equivalent to NO_BOM, even if there is no separate constant for it.
And of course, you can also submit a request for an improvement/extension in the wish list.
Yeah, you're 100% correct. I was very confused about the true purpose of ReadStringFormat().
I solved the problem quickly by writing a function that determines the format of a file from either its first 1024 bytes or its entirety. It is not a difficult problem.
Again thanks for your attention, very much appreciated.
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Tue Jan 13, 2026 6:03 pm
by highend
and wrote a function that finds the format of a file either on the first 1024 bytes or on its entirety
How about making it available for other users?
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Thu Jan 15, 2026 5:04 pm
by ColeopterusMaximus
highend wrote: Tue Jan 13, 2026 6:03 pm
How about making it available for other users?
Not sure this would be of use to anybody, but here it goes:
Code:
; Encoding check methods
#ENCODING_CHECK_SLOW = 0
#ENCODING_CHECK_FAST = 1
; Identifies whether the encoding of a file is UTF-8 or ASCII
; by scanning either the entire file or just the first
; chunk of the file.
; Returns -1 on error, #PB_UTF8 or #PB_Ascii on success
Procedure.i _IdentifyEncoding(str_file_path.s, bol_quick.i = #ENCODING_CHECK_SLOW)
  Protected *ptr_file_buffer
  Protected int_buffer_size.i = 1024
  Protected hnd_file.i = 0
  Protected int_read.i = 0
  Protected int_filesize.i = 0
  Protected int_total_read.i = 0
  Protected int_offset.i = -1
  Protected int_state.i = 0
  Protected int_result.i = -1
  Protected asc_byte.a = 0
  Protected int_leadingbytes.i = 0
  Protected int_interby_count.i = 0
  Enumeration
    #ENCODINGID_START
    #ENCODINGID_READ_NEXT_FCHUNK
    #ENCODINGID_CHECK_NEXT_BYTE
    #ENCODINGID_FINISH
    #ENCODINGID_END
  EndEnumeration
  ; By default we assume ASCII
  int_result = #PB_Ascii
  int_state = #ENCODINGID_START
  While int_state <> #ENCODINGID_END
    Select int_state
      Case #ENCODINGID_START
        hnd_file = ReadFile(#PB_Any, str_file_path)
        If hnd_file
          int_filesize = FileSize(str_file_path)
          *ptr_file_buffer = AllocateMemory(int_buffer_size)
          If Not *ptr_file_buffer
            int_result = -1
            ; Go through FINISH (not END) so the open file gets closed
            int_state = #ENCODINGID_FINISH
            Continue
          EndIf
        Else
          int_result = -1
          int_state = #ENCODINGID_FINISH
          Continue
        EndIf
        int_state = #ENCODINGID_READ_NEXT_FCHUNK
      Case #ENCODINGID_READ_NEXT_FCHUNK
        If int_total_read < int_filesize
          int_read = ReadData(hnd_file, *ptr_file_buffer, int_buffer_size)
          If int_read
            int_total_read + int_read
            int_state = #ENCODINGID_CHECK_NEXT_BYTE
          Else
            ; Nothing could be read: stop instead of looping forever
            int_result = -1
            int_state = #ENCODINGID_FINISH
          EndIf
        Else
          int_state = #ENCODINGID_FINISH
          Continue
        EndIf
      Case #ENCODINGID_CHECK_NEXT_BYTE
        int_offset + 1
        If int_offset >= int_read
          ; If we're decoding slow we do the entire file
          ; Otherwise we just read the first chunk and finish
          If bol_quick = #ENCODING_CHECK_SLOW
            int_state = #ENCODINGID_READ_NEXT_FCHUNK
          Else
            int_state = #ENCODINGID_FINISH
          EndIf
          int_offset = -1
          Continue
        EndIf
        ; PeekA() reads an unsigned byte, matching the .a type of asc_byte
        asc_byte = PeekA(*ptr_file_buffer + int_offset)
        ; ASCII
        If asc_byte & %10000000 = 0
          int_leadingbytes = 0 ; no leading
          int_interby_count = 0 ; reset intermediate count
          int_state = #ENCODINGID_CHECK_NEXT_BYTE
          Continue
        EndIf
        ; Intermediate byte
        If asc_byte & %11000000 = %10000000
          ; If we found a leading byte earlier we count as many
          ; intermediate bytes as required.
          If int_leadingbytes
            ; There was a leading byte
            int_interby_count + 1
            ; If we found all intermediate bytes we have our format
            If int_interby_count >= int_leadingbytes
              int_result = #PB_UTF8
              int_state = #ENCODINGID_FINISH
              Continue
            EndIf
          Else
            ; There was no leading
            int_leadingbytes = 0 ; no leading
            int_interby_count = 0 ; reset intermediate count
          EndIf
          int_state = #ENCODINGID_CHECK_NEXT_BYTE
          Continue
        EndIf
        ; 2-byte sequence: one intermediate byte expected
        If asc_byte & %11100000 = %11000000
          int_leadingbytes = 1
          int_interby_count = 0 ; a new sequence starts here
          int_state = #ENCODINGID_CHECK_NEXT_BYTE
          Continue
        EndIf
        ; 3-byte sequence: two intermediate bytes expected
        If asc_byte & %11110000 = %11100000
          int_leadingbytes = 2
          int_interby_count = 0 ; a new sequence starts here
          int_state = #ENCODINGID_CHECK_NEXT_BYTE
          Continue
        EndIf
        ; 4-byte sequence: three intermediate bytes expected
        If asc_byte & %11111000 = %11110000
          int_leadingbytes = 3
          int_interby_count = 0 ; a new sequence starts here
          int_state = #ENCODINGID_CHECK_NEXT_BYTE
          Continue
        EndIf
      Case #ENCODINGID_FINISH
        If hnd_file
          CloseFile(hnd_file)
        EndIf
        If *ptr_file_buffer
          FreeMemory(*ptr_file_buffer)
          *ptr_file_buffer = 0
        EndIf
        int_state = #ENCODINGID_END
    EndSelect
  Wend
  ProcedureReturn int_result
EndProcedure
Re: PB 6.30 b6 - ReadStringFormat() seemingly unreliable?
Posted: Sat Jan 17, 2026 2:54 pm
by Demivec
highend wrote: Tue Jan 13, 2026 6:03 pm
and wrote a function that finds the format of a file either on the first 1024 bytes or on its entirety
How about making it available for other users?
Also, this link has some potentially useful code: Detecting Text File Encoding without BOM