How to find out if a UTF-8 file is with BOM or not.

Post by Janni »

Hi,
I want to check whether a file has a BOM or not.

If it has a BOM, the first three bytes will be EF BB BF (hex).

How can I check that in PB? I found a way to read the first byte, but what about the second and third?

Code:

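file.i = ReadFile(#PB_Any, "DemoWithBOM.txt") ; read-only; unlike OpenFile(), does not create a missing file
If file
  signature.b = ReadByte(file) ; first byte of the file
  CloseFile(file)
  Debug Hex(signature, #PB_Byte) ; will display EF
EndIf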
Thanks in advance
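
For the second and third byte, one option is simply to keep reading: each read call advances the file position by one byte. The following is a minimal sketch, not a definitive implementation. HasUTF8BOM() is a hypothetical helper name, and the file name is the one from the snippet above.

```purebasic
; Sketch: report whether a file starts with the UTF-8 BOM (EF BB BF).
; HasUTF8BOM() is a hypothetical helper name, not a PureBasic built-in.
Procedure.i HasUTF8BOM(fileName.s)
  Protected file.i, result.i = #False
  Protected b1.a, b2.a, b3.a            ; .a = unsigned byte (0..255)

  file = ReadFile(#PB_Any, fileName)    ; read-only, does not create the file
  If file
    If Lof(file) >= 3                   ; a BOM needs at least three bytes
      b1 = ReadAsciiCharacter(file)     ; each call reads one byte and advances
      b2 = ReadAsciiCharacter(file)
      b3 = ReadAsciiCharacter(file)
      If b1 = $EF And b2 = $BB And b3 = $BF
        result = #True
      EndIf
    EndIf
    CloseFile(file)
  EndIf

  ProcedureReturn result
EndProcedure

Debug HasUTF8BOM("DemoWithBOM.txt")
```

Unsigned bytes (.a) are used so the values can be compared directly with $EF, $BB and $BF; with the signed .b type the same bytes would read as negative numbers.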
Spec: Linux Mint 20.3 Cinnamon, i7-3770K, 16GB RAM, RTX 2070 Super

Post by Janni »

Oh, there is a built-in function :D I totally missed that...

Hmm, how do I interpret the return values? I got values like 2 (BOM) and 24 (no BOM).

Post by Demivec »

Janni wrote: Sat Mar 11, 2023 6:57 pm
Oh, there is a built-in function :D I totally missed that...

Hmm, how do I interpret the return values? I got values like 2 (BOM) and 24 (no BOM).

Compare the return value with the constants listed under the heading 'Return Value' in the documentation linked above. :wink:
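
As a rough sketch of that comparison: the values 2 and 24 reported above are consistent with #PB_UTF8 and #PB_Ascii, so comparing against the constants is more readable than comparing against the numbers. FormatName() is a hypothetical helper, and the file name is only an example.

```purebasic
; Sketch: turn the result of ReadStringFormat() into a readable name.
; FormatName() is a hypothetical helper, not part of PureBasic.
Procedure.s FormatName(format.i)
  Select format
    Case #PB_Ascii   : ProcedureReturn "No BOM found (treated as ASCII)"
    Case #PB_UTF8    : ProcedureReturn "UTF-8 BOM"
    Case #PB_Unicode : ProcedureReturn "UTF-16 little-endian BOM"
    Case #PB_UTF16BE : ProcedureReturn "UTF-16 big-endian BOM"
    Case #PB_UTF32   : ProcedureReturn "UTF-32 little-endian BOM"
    Case #PB_UTF32BE : ProcedureReturn "UTF-32 big-endian BOM"
  EndSelect
  ProcedureReturn "Unknown (" + Str(format) + ")"
EndProcedure

Define file.i = ReadFile(#PB_Any, "DemoWithBOM.txt")
If file
  ; ReadStringFormat() must be called at the start of the file,
  ; because that is where a BOM is located.
  Debug FormatName(ReadStringFormat(file))
  CloseFile(file)
EndIf
```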

Post by dige »

Hi folks,

I have just noticed that UTF-8 files without BOM are recognised as ASCII.

Code:

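; 'file' is assumed to hold the path of the file to test
FileID = ReadFile(#PB_Any, file, #PB_File_SharedRead)
If FileID
   FF = ReadStringFormat(FileID) ; checks for a BOM at the current file position
   Debug FF
   CloseFile(FileID)
EndIf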
Notepad++ and Total Commander, for example, recognise the file type correctly. This means that there must be a way to recognise UTF-8 even without a BOM.

But how? Or is this a bug?

To test this, simply create a file with Notepad++ and save it as ASCII, as UTF-8 and as UTF-8 BOM.
ReadStringFormat() reports both the ASCII file and the BOM-less UTF-8 file as #PB_Ascii; only the UTF-8 BOM file is reported as #PB_UTF8.

Or just take the files from here:
https://u.pcloud.link/publink/show?code ... VItYmmPMry

Kind regards

Dige
"Daddy, I'll run faster, then it is not so far..."

Post by Little John »

"This means that there must be a way to recognise UTF-8 even without a BOM."

AFAIR Demivec posted a procedure here that does so.

Post by Demivec »

Little John wrote: Fri Jan 19, 2024 5:43 pm
"This means that there must be a way to recognise UTF-8 even without a BOM."
AFAIR Demivec posted a procedure here that does so.

Here's the link: https://www.purebasic.fr/english/viewtopic.php?p=479049&hilit=Text+file#p479049
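
The linked procedure is not reproduced here. As a rough, independent sketch of the general idea, a BOM-less file can be tested by checking that all of its bytes form well-formed UTF-8 sequences; text in a legacy 8-bit encoding that contains bytes above $7F will usually fail that test. LooksLikeUTF8() is a hypothetical helper, it loads the whole file into memory, and as a simplification it does not reject overlong forms or surrogates.

```purebasic
; Sketch: guess whether a file is UTF-8 by validating its byte sequences.
; LooksLikeUTF8() is a hypothetical helper. Return values:
;  -1 = file could not be read
;   0 = contains bytes that are not well-formed UTF-8
;   1 = only 7-bit ASCII bytes (valid as ASCII and as UTF-8)
;   2 = well-formed UTF-8 containing multi-byte sequences
Procedure.i LooksLikeUTF8(fileName.s)
  Protected file.i, size.q, *buf, i.q, c.a, follow.i
  Protected multi.i = #False, ok.i = #True

  file = ReadFile(#PB_Any, fileName)
  If file = 0 : ProcedureReturn -1 : EndIf
  size = Lof(file)
  If size = 0 : CloseFile(file) : ProcedureReturn 1 : EndIf

  *buf = AllocateMemory(size)
  If *buf = 0 : CloseFile(file) : ProcedureReturn -1 : EndIf
  size = ReadData(file, *buf, size)     ; number of bytes actually read
  CloseFile(file)

  While i < size And ok = #True
    c = PeekA(*buf + i)                 ; lead byte decides the sequence length
    If c < $80
      follow = 0                        ; plain ASCII
    ElseIf c >= $C2 And c <= $DF
      follow = 1                        ; 2-byte sequence
    ElseIf c >= $E0 And c <= $EF
      follow = 2                        ; 3-byte sequence
    ElseIf c >= $F0 And c <= $F4
      follow = 3                        ; 4-byte sequence
    Else
      ok = #False                       ; stray continuation or invalid lead byte
    EndIf
    i + 1

    If ok = #True And follow > 0
      multi = #True
      If i + follow > size
        ok = #False                     ; sequence is cut off at end of file
      Else
        While follow > 0 And ok = #True
          ; every continuation byte must have the bit pattern 10xxxxxx
          If (PeekA(*buf + i) & $C0) <> $80 : ok = #False : EndIf
          i + 1
          follow - 1
        Wend
      EndIf
    EndIf
  Wend

  FreeMemory(*buf)

  If ok = #False : ProcedureReturn 0 : EndIf
  If multi = #True : ProcedureReturn 2 : EndIf
  ProcedureReturn 1
EndProcedure

Debug LooksLikeUTF8("DemoWithoutBOM.txt") ; hypothetical file name
```

A result of 1 does not distinguish ASCII from UTF-8, because a file containing only 7-bit characters is byte-for-byte identical in both encodings. A UTF-8 BOM, if present, passes the check as an ordinary three-byte sequence.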