ReadString issue with UTF-16 files
Posted: Thu Apr 19, 2018 2:04 am
by Saladlam
ReadString() does not process UTF-16 files correctly. If you have a Unicode (UTF-16) text file like
*
Text 1
*
Text 2
*
Text 3
and want to count the occurrences of '*' in the file with
Code:
Define Zk.s, Za
ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
While Not Eof(1)
  Zk = ReadString(1)
  If Zk = "*"
    Za + 1
  EndIf
Wend
Debug Za
CloseFile(1)
(D:\Texte\Asteriskus.txt is the text file from above), one would expect Debug Za to print '3'; instead, '2' is displayed.
This bug does not occur after converting the file to ANSI or UTF-8 format and changing the flag in the ReadFile command to #PB_Ascii or #PB_UTF8, respectively. This behaviour fits with the fact that IncludeFile does not support Unicode (Russian, Chinese) filenames.
Re: ReadString issue with UTF-16 files
Posted: Thu Apr 19, 2018 5:06 am
by Demivec
Does the file have a BOM?
Re: ReadString issue with UTF-16 files
Posted: Thu Apr 19, 2018 8:17 am
by Fred
Note: IncludeFile and this issue aren't related. Please put your file and a small working snippet somewhere so we can test.
Re: ReadString issue with UTF-16 files
Posted: Thu Apr 19, 2018 5:12 pm
by kenmo
Add ReadStringFormat(1) after your ReadFile() line.
Your text file has a 2-byte invisible BOM at the beginning, which is being read as part of the first "*" line, so it does not equal "*".
ReadStringFormat() will move the file pointer 2 bytes into the file, verify with
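A minimal sketch of this suggestion applied to the snippet from the first post. The Loc() check is only an assumption about how the pointer position might be inspected (the verification line of the post above is not preserved here), and the ReadFile() result check is an addition for safety.

```purebasic
Define Zk.s, Za

If ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
  ReadStringFormat(1)   ; moves the file pointer past a BOM, if one is present
  Debug Loc(1)          ; assumed check: current position in the file, in bytes
  While Not Eof(1)
    Zk = ReadString(1)
    If Zk = "*"
      Za + 1
    EndIf
  Wend
  CloseFile(1)
  Debug Za
Else
  Debug "Could not open the file"
EndIf
```

With the BOM skipped, the first line should no longer carry the invisible character and can compare equal to "*".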
Re: ReadString issue with UTF-16 files
Posted: Wed Feb 05, 2020 11:54 am
by Fred
Can someone else confirm this bug?
Re: ReadString issue with UTF-16 files
Posted: Wed Feb 05, 2020 12:23 pm
by GG
Saladlam wrote: ReadString() does not process UTF-16 files correctly. If you have a Unicode (UTF-16) text file like
*
Text 1
*
Text 2
*
Text 3
and want to count the occurrences of '*' in the file with
Code:
Define Zk.s, Za
ReadFile(1, "D:\Texte\Asteriskus.txt", #PB_Unicode)
While Not Eof(1)
  Zk = ReadString(1)
  If Zk = "*"
    Za + 1
  EndIf
Wend
Debug Za
CloseFile(1)
(D:\Texte\Asteriskus.txt is the text file from above), one would expect Debug Za to print '3'; instead, '2' is displayed.
This bug does not occur after converting the file to ANSI or UTF-8 format and changing the flag in the ReadFile command to #PB_Ascii or #PB_UTF8, respectively. This behaviour fits with the fact that IncludeFile does not support Unicode (Russian, Chinese) filenames.
Same behavior here with Windows 10, PB 5.71 x64.
Re: ReadString issue with UTF-16 files
Posted: Wed Feb 05, 2020 12:54 pm
by PeDe
PB 5.72 LTS Beta 1 32/64-Bit - Windows 7 64-Bit
Unicode 16 LE BOM = 2
Unicode 16 LE = 3
Unicode 16 BE BOM = 0
Unicode 16 BE = 0
Peter
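For anyone who wants to reproduce the two little-endian cases, here is a rough sketch that writes the sample text with and without a BOM and then counts the asterisk lines without calling ReadStringFormat(). The file names and the WriteSample() and CountStars() procedures are hypothetical. The big-endian variants are left out because this sketch only writes strings with #PB_Unicode (little endian).

```purebasic
; Hypothetical helper: writes the six sample lines as UTF-16 LE, optionally with a BOM
Procedure WriteSample(FileName.s, WithBOM)
  Protected i
  If CreateFile(0, FileName)
    If WithBOM
      WriteStringFormat(0, #PB_Unicode)   ; writes the UTF-16 LE BOM
    EndIf
    For i = 1 To 3
      WriteStringN(0, "*", #PB_Unicode)
      WriteStringN(0, "Text " + Str(i), #PB_Unicode)
    Next
    CloseFile(0)
  EndIf
EndProcedure

; Hypothetical helper: counts lines equal to "*", deliberately without ReadStringFormat()
Procedure CountStars(FileName.s)
  Protected Count, Row.s
  If ReadFile(0, FileName)
    While Not Eof(0)
      Row = ReadString(0, #PB_Unicode)
      If Row = "*"
        Count + 1
      EndIf
    Wend
    CloseFile(0)
  EndIf
  ProcedureReturn Count
EndProcedure

Define Dir.s = GetTemporaryDirectory()
WriteSample(Dir + "stars_le_bom.txt", #True)
WriteSample(Dir + "stars_le.txt", #False)
Debug CountStars(Dir + "stars_le_bom.txt")
Debug CountStars(Dir + "stars_le.txt")
```

Going by the figures reported in this thread, the file with a BOM would be expected to give the lower count, because the BOM ends up in front of the first "*".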
Re: ReadString issue with UTF-16 files
Posted: Wed Feb 05, 2020 4:58 pm
by Little John
Saladlam wrote: If you have a Unicode (UTF-16) text file
This is not a precise description of the file format.
In your code, you are using #PB_Unicode, and the documentation of ReadStringFormat() says that #PB_Unicode is for UTF-16 (little endian) files, not for UTF-16 (big endian) files.
When I save the text that you provided as UTF-16 (little endian) without BOM, your code displays "3", as expected.
When I save the text that you provided as UTF-16 (little endian) with BOM, then after inserting the line ReadStringFormat(1) directly after the ReadFile() line (as kenmo already suggested), the code also displays "3" (tested with PB 5.71 LTS on Windows).
The best practice is to always use ReadStringFormat(): it is sometimes essential, and it never hurts.
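One possible way to follow this advice is to let the return value of ReadStringFormat() choose the format passed to ReadString(). This is only a sketch: it reuses the file path from the first post, it assumes that a file without a BOM is UTF-16 (little endian), which has to be adjusted per file, and it simply refuses formats such as #PB_UTF16BE instead of converting them.

```purebasic
Define Zk.s, Za, Format

If ReadFile(1, "D:\Texte\Asteriskus.txt")
  Format = ReadStringFormat(1)   ; skips the BOM, if any, and reports the detected format
  Select Format
    Case #PB_Ascii
      ; No BOM found, so the real encoding is unknown.
      ; Assumption for this sketch: the file is UTF-16 (little endian).
      Format = #PB_Unicode
    Case #PB_UTF8, #PB_Unicode
      ; Can be passed to ReadString() as-is
    Default
      ; For example #PB_UTF16BE: not handled in this sketch
      Debug "Unsupported file format"
      Format = -1
  EndSelect

  If Format <> -1
    While Not Eof(1)
      Zk = ReadString(1, Format)
      If Zk = "*"
        Za + 1
      EndIf
    Wend
    Debug Za
  EndIf
  CloseFile(1)
EndIf
```

The explicit fallback matters because a BOM-less file gives ReadStringFormat() nothing to detect, so the caller still has to know the expected encoding.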
Fred wrote: Can someone else confirm this bug?
The OP did not provide sufficient information. With the information that we have so far, I can't see a bug.
Re: ReadString issue with UTF-16 files
Posted: Wed Feb 05, 2020 6:08 pm
by kenmo
99% sure it's no bug.
If he had said it was reading 0 asterisks, it might be a PB bug, or more likely a UTF-16 Big Endian file (read as Little Endian).
Since he said it was reading 2 of 3 asterisks, I am very sure the file was UTF-16 LE with BOM, and the BOM being part of the first line caused it to not match "*".
Not a bug, you just need to be aware of a file's BOM.
Re: ReadString issue with UTF-16 files
Posted: Wed Feb 05, 2020 9:23 pm
by Fred
Thanks for confirming!