ReadString issue with UTF-16 files

Saladlam
New User
Posts: 2
Joined: Wed Mar 21, 2018 12:04 pm

ReadString issue with UTF-16 files

Post by Saladlam »

ReadString() does not process UTF-16 files correctly. If you have a Unicode (UTF-16) text file like
*
Text 1
*
Text 2
*
Text 3
and want to count the occurrences of '*' in the file with

Code: Select all

Define Zk.s, Za
ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
While Not Eof(1)
  Zk=ReadString(1)
  If Zk="*"
    Za+1
  EndIf
Wend
Debug Za
CloseFile(1)
(D:\Texte\Asteriskus.txt is the text file from above), one would expect Debug Za to print '3'; instead, '2' is displayed.

This bug does not occur after converting the file to ANSI or UTF-8 format and changing the flag in the ReadFile command to #PB_Ascii or #PB_UTF8, respectively. This behaviour fits with the fact that IncludeFile does not support Unicode (Russian, Chinese) filenames.
Demivec
Addict
Posts: 4091
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: ReadString issue with UTF-16 files

Post by Demivec »

Does the file have a BOM?
Fred
Administrator
Posts: 16687
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: ReadString issue with UTF-16 files

Post by Fred »

Note: IncludeFile and this issue aren't related. Please upload your file and a small working snippet somewhere so we can test.
kenmo
Addict
Posts: 1967
Joined: Tue Dec 23, 2003 3:54 am

Re: ReadString issue with UTF-16 files

Post by kenmo »

Add

Code: Select all

ReadStringFormat(1)
after your ReadFile() line.

Your text file has a 2-byte invisible BOM at the beginning, which is being read as part of the first "*" line, so it does not equal "*".

ReadStringFormat() will move the file pointer 2 bytes into the file, past the BOM. You can verify this with

Code: Select all

Debug Loc(1)
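
For reference, here is a minimal sketch of the counting loop from the first post with that one line added. It assumes the file is UTF-16 LE with a BOM and reuses the path from the original post; the `If ReadFile(...)` check and the `Else` branch are additions for safety and were not in the original code.

```purebasic
Define Zk.s, Za

; Path taken from the first post; adjust it for your own system.
If ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
  ReadStringFormat(1)          ; skips the BOM if the file starts with one
  While Eof(1) = 0
    Zk = ReadString(1)         ; uses the format given to ReadFile()
    If Zk = "*"
      Za + 1
    EndIf
  Wend
  CloseFile(1)
  Debug Za
Else
  Debug "Could not open the file"
EndIf
```

With the BOM skipped, the first line should compare equal to "*" like the other two.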
Fred
Administrator
Posts: 16687
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: ReadString issue with UTF-16 files

Post by Fred »

Can someone else confirm this bug?
GG
Enthusiast
Posts: 258
Joined: Tue Jul 26, 2005 12:02 pm
Location: Lieusaint (77), France

Re: ReadString issue with UTF-16 files

Post by GG »

Saladlam wrote: ReadString() does not process UTF-16 files correctly. If you have a Unicode (UTF-16) text file like
*
Text 1
*
Text 2
*
Text 3
and want to count the occurrences of '*' in the file with

Code: Select all

Define Zk.s, Za
ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
While Not Eof(1)
  Zk=ReadString(1)
  If Zk="*"
    Za+1
  EndIf
Wend
Debug Za
CloseFile(1)
(D:\Texte\Asteriskus.txt is the text file from above), one would expect Debug Za to print '3'; instead, '2' is displayed.

This bug does not occur after converting the file to ANSI or UTF-8 format and changing the flag in the ReadFile command to #PB_Ascii or #PB_UTF8, respectively. This behaviour fits with the fact that IncludeFile does not support Unicode (Russian, Chinese) filenames.
Same behavior here with Windows 10, PB 5.71 x64.
Purebasic 6.04 64 bits - Windows 11 Pro 64 bits 23H2
PeDe
Enthusiast
Posts: 123
Joined: Sun Nov 26, 2017 3:13 pm
Location: Vienna

Re: ReadString issue with UTF-16 files

Post by PeDe »

PB 5.72 LTS Beta 1 32/64-Bit - Windows 7 64-Bit

Count displayed for each file encoding:
UTF-16 LE with BOM = 2
UTF-16 LE without BOM = 3
UTF-16 BE with BOM = 0
UTF-16 BE without BOM = 0

Peter
Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: ReadString issue with UTF-16 files

Post by Little John »

Saladlam wrote: If you have a Unicode (UTF-16) text file
This is not a precise description of the file format.

In your code, you are using #PB_Unicode, and the documentation of ReadStringFormat() states that #PB_Unicode is for UTF-16 (little endian) files, not for UTF-16 (big endian) files.

When I save the text that you provided as UTF-16 (little endian) without BOM, then your code displays "3", as expected.
When I save the text that you provided as UTF-16 (little endian) with BOM, then after inserting the line

Code: Select all

ReadStringFormat(1)
directly after the ReadFile() line (as kenmo already suggested), the code also displays "3" (tested with PB 5.71 LTS on Windows).

The best practice is to always use ReadStringFormat(): it is sometimes essential, and it never hurts.
Fred wrote: Can someone else confirm this bug?
The OP did not provide sufficient information. With the information that we have so far, I can't see a bug.
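
To illustrate the "always use ReadStringFormat()" advice, here is a rough sketch that handles files with and without a BOM. `CountAsteriskLines` and its `FallbackFormat` parameter are hypothetical names made up for this example, not part of PureBasic. It assumes that ReadStringFormat() leaves the file pointer at 0 when no BOM is found, and that ReadString() accepts only #PB_Ascii, #PB_UTF8 and #PB_Unicode as string formats, so anything else (such as #PB_UTF16BE) is reported as unsupported instead of being decoded.

```purebasic
; Hypothetical helper, not part of PureBasic: counts the lines equal to "*".
; FallbackFormat is only used when the file has no BOM.
; Returns -1 if the file cannot be opened or the format is not handled here.
Procedure.i CountAsteriskLines(FileName.s, FallbackFormat.i = #PB_Unicode)
  Protected Count.i = -1, Format.i, File.i

  File = ReadFile(#PB_Any, FileName)
  If File
    Format = ReadStringFormat(File)   ; consumes the BOM, if there is one
    If Loc(File) = 0                  ; pointer did not move: assume no BOM
      Format = FallbackFormat
    EndIf

    Select Format
      Case #PB_Ascii, #PB_UTF8, #PB_Unicode
        Count = 0
        While Eof(File) = 0
          If ReadString(File, Format) = "*"
            Count + 1
          EndIf
        Wend
      Default
        ; e.g. #PB_UTF16BE: not decoded in this sketch
        Count = -1
    EndSelect

    CloseFile(File)
  EndIf

  ProcedureReturn Count
EndProcedure

; Path taken from the first post.
Debug CountAsteriskLines("D:\Texte\Asteriskus.txt")
```

The fallback matters because a BOM-less file gives ReadStringFormat() nothing to detect, so the caller still has to know the expected encoding.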
kenmo
Addict
Posts: 1967
Joined: Tue Dec 23, 2003 3:54 am

Re: ReadString issue with UTF-16 files

Post by kenmo »

99% sure it's no bug :)

If he had said it was reading 0 asterisks, it might be a PB bug or, more likely, a UTF-16 Big Endian file (read as Little Endian).

Since he said it was reading 2 of 3 asterisks, I am very sure the file was UTF-16 LE with BOM, and the BOM being part of the first line caused it to not match "*".
Not a bug; you just need to be aware of a file's BOM.
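
If you want to check this on your own file, a small diagnostic sketch like the following might help. It deliberately leaves out ReadStringFormat() and inspects the first line as ReadString() returns it. The path is the one from the first post, and `FirstLine` is just a name chosen for this example.

```purebasic
Define FirstLine.s

If ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
  FirstLine = ReadString(1)     ; no ReadStringFormat() here, on purpose
  Debug Len(FirstLine)          ; more than 1 means something precedes the "*"
  Debug Hex(Asc(FirstLine))     ; a UTF-16 BOM decodes to the character FEFF
  CloseFile(1)
Else
  Debug "Could not open the file"
EndIf
```

Asc() returns the value of the first character only, which is enough to tell a leading BOM apart from a plain "*".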
Fred
Administrator
Posts: 16687
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: ReadString issue with UTF-16 files

Post by Fred »

Thanks for confirming!