
How to find out if a UTF-8 file is with BOM or not.

Posted: Sat Mar 11, 2023 6:17 pm
by Janni
Hi,
I want to check whether a file has a BOM or not.

If it has a BOM, the first three bytes will be EF BB BF (hex).

How can I check that in PB? I found a way to read the first byte, but what about the second and third?

Code:

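file.i = ReadFile(#PB_Any, "DemoWithBOM.txt") ; ReadFile opens read-only and never creates the file
If file
  signature.a = ReadByte(file) ; .a is an unsigned byte (0 to 255)
  CloseFile(file)
  Debug Hex(signature, #PB_Byte) ; will display EF
EndIf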
Thanks in advance
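
One way to check all three bytes directly is to read them into a small buffer and compare them against EF BB BF. The following is only a minimal sketch; the procedure name HasUTF8BOM() is made up for illustration, and the file name is the one from the snippet above.

```purebasic
; Hypothetical helper: returns #True if the file starts with EF BB BF.
Procedure.i HasUTF8BOM(fileName.s)
  Protected file.i, result.i = #False
  Protected Dim header.a(2)            ; three unsigned bytes: header(0) to header(2)

  file = ReadFile(#PB_Any, fileName)
  If file
    ; ReadData() returns the number of bytes actually read,
    ; so files shorter than three bytes are rejected here.
    If ReadData(file, @header(0), 3) = 3
      If header(0) = $EF And header(1) = $BB And header(2) = $BF
        result = #True
      EndIf
    EndIf
    CloseFile(file)
  EndIf

  ProcedureReturn result
EndProcedure

Debug HasUTF8BOM("DemoWithBOM.txt")
```

Reading the three bytes in one ReadData() call keeps the short-file case simple; three separate ReadByte() calls would need their own end-of-file checks.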

Re: How to find out if a UTF-8 file is with BOM or not.

Posted: Sat Mar 11, 2023 6:27 pm
by HeX0R
[link to the ReadStringFormat() documentation]

Re: How to find out if a UTF-8 file is with BOM or not.

Posted: Sat Mar 11, 2023 6:57 pm
by Janni
Ohh, there is a built-in function :D I totally missed that...

Hmm, how do I interpret the return values? I got values like 2 (BOM) and 24 (no BOM).

Re: How to find out if a UTF-8 file is with BOM or not.

Posted: Sat Mar 11, 2023 7:51 pm
by Demivec
Janni wrote: Sat Mar 11, 2023 6:57 pm ohh, there is a built in function :D I totally missed that....

hmm to interpret the return values ? I got values like 2 (BOM) and 24 (not bom)
Compare the return value to the value of the constants listed in the link to the documentation above under the heading 'Return Value'. :wink:
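
As a rough sketch of that comparison, the result of ReadStringFormat() can be fed into a Select block instead of being printed as a raw number. The file name is the one from the first post, and the Debug texts are only illustrative; the values 2 and 24 reported above would presumably correspond to #PB_UTF8 and #PB_Ascii.

```purebasic
file.i = ReadFile(#PB_Any, "DemoWithBOM.txt")
If file
  ; ReadStringFormat() looks for a BOM at the current file position
  ; and returns one of the #PB_... string format constants.
  Select ReadStringFormat(file)
    Case #PB_Ascii
      Debug "No BOM found"
    Case #PB_UTF8
      Debug "UTF-8 BOM found"
    Case #PB_Unicode
      Debug "UTF-16 (little endian) BOM found"
    Default
      Debug "Another BOM found (UTF-16 BE or UTF-32)"
  EndSelect
  CloseFile(file)
EndIf
```

Comparing against the named constants rather than the numbers keeps the code readable and independent of the actual constant values.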

Re: How to find out if a UTF-8 file is with BOM or not.

Posted: Fri Jan 19, 2024 2:46 pm
by dige
Hi folks,

I have just noticed that UTF-8 files without BOM are recognised as ASCII.

Code:

file.s = "Demo.txt" ; placeholder: path of the file to test
FileID = ReadFile(#PB_Any, file, #PB_File_SharedRead)
If FileID
  FF = ReadStringFormat(FileID) ; detects the format from the BOM only
  Debug FF
  CloseFile(FileID)
EndIf
Notepad++ or Total Commander, for example, recognise the file type correctly. This means that there must be a way to recognise UTF-8 even without a BOM.

But how? Or is this a bug?

To test this, simply create a file with Notepad++ and save it as ASCII, UTF-8 and UTF-8 BOM.
ReadStringFormat() will report both the ASCII and the UTF-8 file as #PB_Ascii, and only the UTF-8 BOM file as #PB_UTF8.

Or just take the files from here:
https://u.pcloud.link/publink/show?code ... VItYmmPMry

Kind regards

Dige

Re: How to find out if a UTF-8 file is with BOM or not.

Posted: Fri Jan 19, 2024 5:43 pm
by Little John
dige wrote: This means that there must be a way to recognise UTF-8 even without a BOM.
AFAIR Demivec posted a procedure here that does do so.

Re: How to find out if a UTF-8 file is with BOM or not.

Posted: Sat Jan 20, 2024 12:02 am
by Demivec
Little John wrote: Fri Jan 19, 2024 5:43 pm
This means that there must be a way to recognise UTF-8 even without a BOM.
AFAIR Demivec posted a procedure here that does do so.
Here's the link: https://www.purebasic.fr/english/viewtopic.php?p=479049&hilit=Text+file#p479049
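
Independent of the procedure behind that link (which is not reproduced here), a common heuristic for BOM-less files is to check whether the bytes form well-formed UTF-8 sequences and whether at least one multi-byte sequence occurs. The sketch below illustrates that idea only. The name LooksLikeUTF8() and the file name "Demo.txt" are made up, and the check is simplified: it does not reject overlong encodings or surrogate code points, and a pure ASCII file deliberately yields #False because it cannot be told apart from ASCII.

```purebasic
; Hypothetical helper: #True if the buffer contains only well-formed
; UTF-8 sequences AND at least one multi-byte sequence.
Procedure.i LooksLikeUTF8(*buffer, length.i)
  Protected *p.Ascii = *buffer
  Protected *end = *buffer + length
  Protected followers.i, multiByteSeen.i = #False

  While *p < *end
    If *p\a < $80
      followers = 0                          ; plain ASCII byte
    ElseIf *p\a >= $C2 And *p\a <= $DF
      followers = 1                          ; lead byte of a 2-byte sequence
    ElseIf *p\a >= $E0 And *p\a <= $EF
      followers = 2                          ; lead byte of a 3-byte sequence
    ElseIf *p\a >= $F0 And *p\a <= $F4
      followers = 3                          ; lead byte of a 4-byte sequence
    Else
      ProcedureReturn #False                 ; stray continuation or invalid lead byte
    EndIf
    *p + 1

    If followers
      multiByteSeen = #True
      While followers > 0
        If *p >= *end : ProcedureReturn #False : EndIf           ; sequence cut off
        If (*p\a & $C0) <> $80 : ProcedureReturn #False : EndIf  ; not a continuation byte
        *p + 1
        followers - 1
      Wend
    EndIf
  Wend

  ProcedureReturn multiByteSeen
EndProcedure

; Usage: load the whole file into memory and test it.
file.i = ReadFile(#PB_Any, "Demo.txt")
If file
  length.i = Lof(file)
  If length > 0
    *mem = AllocateMemory(length)
    If *mem
      bytesRead.i = ReadData(file, *mem, length)
      Debug LooksLikeUTF8(*mem, bytesRead)
      FreeMemory(*mem)
    EndIf
  EndIf
  CloseFile(file)
EndIf
```

Text in a legacy 8-bit code page rarely happens to form valid UTF-8 sequences, so a positive result is a strong hint, but it remains a heuristic rather than proof.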