Get file content (UTF-8 and UTF-16)?
Posted: Fri Mar 25, 2022 9:22 am
by jacky
Hi,
this procedure works fine (and fast even for larger files) but only for UTF-8.
Is there a way to change it so that it can read both encodings (UTF-8 with or without BOM AND UTF-16 LE with BOM)?
Code: Select all
Procedure.s GetFileContent(file.s)
  Protected.i size
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  Protected.i hFile
  Protected.s content
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    Protected.i length, bytes
    Protected *buffer
    If size >= 1
      length = Lof(hFile)
      *buffer = AllocateMemory(length)
      If *buffer
        bytes = ReadData(hFile, *buffer, length)
        content = PeekS(*buffer, length, #PB_UTF8)
        FreeMemory(*buffer)
      EndIf
    EndIf
    CloseFile(hFile)
  EndIf
  ProcedureReturn content
EndProcedure
Re: Get file content (UTF-8 and UTF-16)?
Posted: Fri Mar 25, 2022 11:54 am
by acreis
Re: Get file content (UTF-8 and UTF-16)?
Posted: Fri Mar 25, 2022 12:13 pm
by infratec
Try this:
Code: Select all
Procedure.s GetFileContent(file.s)
  Protected.i size
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  Protected.i hFile
  Protected.s content
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    Protected.i length, bytes, bom, offset
    Protected *buffer
    If size >= 1
      length = Lof(hFile)
      bom = ReadStringFormat(hFile) ; detects the BOM and skips it if present
      offset = Loc(hFile)           ; number of BOM bytes that were skipped
      length - offset               ; bytes left to read after the BOM
      *buffer = AllocateMemory(length)
      If *buffer
        Debug MemorySize(*buffer)
        ; ReadData() reads from the current file position, i.e. after the BOM
        If ReadData(hFile, *buffer, length) = length
          If bom = #PB_Unicode
            ; UTF-16: two bytes per character
            content = PeekS(*buffer, length / 2, #PB_Unicode)
          Else
            ; UTF-8: length is given in bytes, hence #PB_ByteLength
            content = PeekS(*buffer, length, #PB_UTF8|#PB_ByteLength)
          EndIf
        EndIf
        FreeMemory(*buffer)
      EndIf
    EndIf
    CloseFile(hFile)
  EndIf
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
Btw. you had a bug: the flag #PB_ByteLength is needed if you read #PB_UTF8 in this way, because the length you pass is the number of bytes, not the number of characters.
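To illustrate the #PB_ByteLength remark, here is a minimal sketch (not taken from the posts above). It assumes the source file is saved as UTF-8 so that the Cyrillic literal survives; the variable names are only illustrative. Without the flag, PeekS() treats the length as a character count, so passing a byte count for multi-byte UTF-8 text may make it scan past the end of a buffer that has no terminating null.

```purebasic
; Sketch: PeekS() counts characters unless #PB_ByteLength is combined with #PB_UTF8.
Define text.s = "Привет"                               ; 6 characters, 2 bytes each in UTF-8
Define byteCount.i = StringByteLength(text, #PB_UTF8)  ; byte size without the terminating null
Define *buffer = AllocateMemory(byteCount + 1)         ; +1 byte for the null written by PokeS()

If *buffer
  PokeS(*buffer, text, -1, #PB_UTF8)
  Debug byteCount
  ; Length given as 3 characters
  Debug PeekS(*buffer, 3, #PB_UTF8)
  ; Length given as 6 bytes, which covers the same 3 characters
  Debug PeekS(*buffer, 6, #PB_UTF8 | #PB_ByteLength)
  FreeMemory(*buffer)
EndIf
```

Both PeekS() calls should return the same three characters, which shows that the byte count and the character count differ as soon as the text is not plain ASCII.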
Re: Get file content (UTF-8 and UTF-16)?
Posted: Fri Mar 25, 2022 10:15 pm
by jacky
Thanks to both of you!
@infratec
Are you sure that
Code: Select all
If bom = #PB_Unicode
  content = PeekS(*buffer + offset, (length - offset) / 2, #PB_Unicode)
Else
  content = PeekS(*buffer + offset, (length - offset), #PB_UTF8|#PB_ByteLength)
EndIf
is correct?
For a UTF-16 LE file with BOM and this content:
Code: Select all
English : hello world
Russian : Привет мир
Chinese : 你好世界
Greek : Γειά σου Κόσμε
Japanese: こんにちは世界
Korean : 안녕하세요 세계
German : hallo welt
It will cut off the first "E" in the first line...
and for the same file with UTF-8 BOM it will cut off "Eng"
I think it should be:
Code: Select all
If bom = #PB_Unicode
  content = PeekS(*buffer, (length - offset) / 2, #PB_Unicode)
Else
  content = PeekS(*buffer, (length - offset), #PB_UTF8|#PB_ByteLength)
EndIf
At least the output for both test files looks correct in these cases...
Re: Get file content (UTF-8 and UTF-16)?
Posted: Fri Mar 25, 2022 11:39 pm
by infratec
You were right, my code was buggy.
The German help text of ReadData() was misleading.
I understood it to mean that the whole file is loaded, but the data is actually read from the current file position onwards.
I corrected my code above and tested it.
I also placed a 'bug report' in the documentation section.
But that's one reason why 'working code' is needed.
I was too lazy to write test code, so the bug made it into the forum.
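The file position behaviour described above can be checked with a small sketch. RemainingAfterBOM() is a hypothetical helper name, not part of PureBasic and not from the posts above; it only uses ReadStringFormat(), Loc() and Lof() to show how many bytes are left after the BOM, which is the amount ReadData() can still deliver.

```purebasic
; Hypothetical helper: returns the number of bytes that follow the BOM (if any),
; or -1 if the file cannot be opened.
Procedure.i RemainingAfterBOM(file.s)
  Protected.i hFile, format, remaining = -1
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    format = ReadStringFormat(hFile)     ; moves the file pointer past the BOM when one is found
    remaining = Lof(hFile) - Loc(hFile)  ; bytes from the current position to the end of the file
    Debug "Format: " + Str(format) + ", BOM bytes: " + Str(Loc(hFile)) + ", remaining: " + Str(remaining)
    CloseFile(hFile)
  EndIf
  ProcedureReturn remaining
EndProcedure
```

Since the pointer already sits behind the BOM, the buffer filled by ReadData() starts with the first real character, so no offset has to be added to *buffer afterwards.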
Re: Get file content (UTF-8 and UTF-16)?
Posted: Fri Mar 25, 2022 11:52 pm
by infratec
Btw. this is much easier and more elegant:
Code: Select all
Procedure.s GetFileContent(file.s)
  Protected.i size
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  Protected.i hFile, bom
  Protected.s content
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    bom = ReadStringFormat(hFile) ; detects the BOM and skips it if present
    ; #PB_File_IgnoreEOL reads up to the end of the file instead of stopping at the first line break
    content = ReadString(hFile, bom|#PB_File_IgnoreEOL)
    CloseFile(hFile)
  EndIf
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sat Mar 26, 2022 12:45 am
by jacky
Thanks for correcting it. It now works as expected.
infratec wrote: Fri Mar 25, 2022 11:52 pm
Btw. this is much easier and more elegant
Absolutely! But don't dare to try that one on larger files...
E.g. with a 10 MB test file:
In-memory variant: 86 ms
"Easier and more elegant" variant: 4 s 463 ms
Compiled with the beta 5 C backend (x64).
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sat Mar 26, 2022 1:11 pm
by infratec
Ah...
then it looks like the flag #PB_File_IgnoreEOL only tells the procedure to read
all lines one after another and concatenate them, which is slow.
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sat Mar 26, 2022 2:08 pm
by jacky
Yeah, that seems to be how it works.
If PB internally used a fast concatenation method, that procedure wouldn't fall behind so much in execution time...
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sun Apr 10, 2022 9:22 am
by Rinzwind
jacky wrote: Sat Mar 26, 2022 2:08 pm
Yeah, that seems to be how it works.
If PB internally used a fast concatenation method, that procedure wouldn't fall behind so much in execution time...
That is why I reported it as a bug months ago, but Fred did not consider it one. This is yet another post (number x) from someone finding out by chance how slow the #PB_File_IgnoreEOL option is. Fred's argument was that it does not use much memory for larger files. However, in that case it is simply far too slow because of the poor performance of string concatenation. And in the end you still have the whole file text in one variable, so it seems like a nonsense argument to me. My advice would be to fix the library.
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sun Apr 10, 2022 10:44 am
by infratec
No, the argument is not nonsense.
If you first read the complete file into a buffer and then use PeekS(),
you need the buffer memory AND the string memory.
So if you read a 500 MB ASCII file, you need 500 MB for the buffer + 1000 MB for the string (PB strings use two bytes per character).
If you read only small chunks of 1 kB, you need 1 kB + 1000 MB for the string.
You save nearly 500 MB, which can be essential.
The argument is valid.
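The arithmetic above can be written down as a small sketch. PeakBytes() and #MB are hypothetical names introduced only for this illustration; the calculation assumes pure ASCII input (one byte per character in the file) and a string that stores SizeOf(Character) bytes per character, and it ignores any other overhead.

```purebasic
#MB = 1024 * 1024

; Hypothetical helper: estimated peak memory while the read buffer and the
; finished string exist at the same time.
Procedure.q PeakBytes(fileBytes.q, bufferBytes.q)
  ; ASCII input: one character per file byte, stored as SizeOf(Character) bytes in the string
  ProcedureReturn bufferBytes + fileBytes * SizeOf(Character)
EndProcedure

Define.q wholeFile = PeakBytes(500 * #MB, 500 * #MB) / #MB ; buffer as large as the file
Define.q chunked   = PeakBytes(500 * #MB, 1024) / #MB      ; 1 kB chunk buffer

Debug "Whole file in buffer: about " + Str(wholeFile) + " MB"
Debug "1 kB chunks: about " + Str(chunked) + " MB"
```

On a Unicode build this should reproduce the two figures from the post, roughly 1500 MB versus roughly 1000 MB.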
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sun Apr 10, 2022 11:26 am
by Joris
If anyone could post a few small files in those different formats here, it would be a good place to examine the existing problems and their solutions. Or is there a good link to such files?
Thanks
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sun Apr 10, 2022 1:24 pm
by infratec
Simply use a good text editor and save them with the different encodings (PSPad, Notepad2, ...).
This is faster than downloading files.
But the speed problem has nothing to do with the file format, only with the size.
Re: Get file content (UTF-8 and UTF-16)?
Posted: Sun Apr 10, 2022 5:11 pm
by Rinzwind
Well, the lack of performance makes it unusable either way when reading a huge file, so the built-in function should be improved. The simple string concatenation it supposedly uses is easy to improve upon. I'm fine with lower memory usage, but the implementation is severely lacking and makes some programs needlessly slow, while the programmer remains unaware because he assumes that a built-in function for reading the whole file would be the fastest option.
Re: Get file content (UTF-8 and UTF-16)?
Posted: Mon Apr 11, 2022 9:15 am
by Fred
I will take a look at improving the speed of this special case.