
Get file content (UTF-8 and UTF-16)?

Posted: Fri Mar 25, 2022 9:22 am
by jacky
Hi,

this procedure works fine (and fast even for larger files) but only for UTF-8.

Is there a way to change it so that it can read both encodings (UTF-8 with or without BOM AND UTF-16 LE with BOM)?

Code: Select all

    Procedure.s GetFileContent(file.s)
      Protected.i size

      size = FileSize(file)
      If size = -1
        ProcedureReturn ""
      EndIf

      Protected.i hFile
      Protected.s content

      hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
      If hFile
        Protected.i length, bytes
        Protected *buffer

        If size >= 1
          length  = Lof(hFile)
          *buffer = AllocateMemory(length)

          If *buffer
            bytes   = ReadData(hFile, *buffer, length)
            content = PeekS(*buffer, length, #PB_UTF8)
            FreeMemory(*buffer)
          EndIf
        EndIf

        CloseFile(hFile)
      EndIf

      ProcedureReturn content
    EndProcedure

Re: Get file content (UTF-8 and UTF-16)?

Posted: Fri Mar 25, 2022 11:54 am
by acreis

Re: Get file content (UTF-8 and UTF-16)?

Posted: Fri Mar 25, 2022 12:13 pm
by infratec
Try this:

Code: Select all

Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    Protected.i length, bytes, bom, offset
    Protected *buffer
    
    
    If size >= 1
      length  = Lof(hFile)

      bom = ReadStringFormat(hFile)
      offset = Loc(hFile)
      
      length - offset
      
      *buffer = AllocateMemory(length)
      If *buffer
        Debug MemorySize(*buffer) ; debug output only; kept inside the check so a failed allocation is never queried
        If ReadData(hFile, *buffer, length) = length
          If bom = #PB_Unicode
            content = PeekS(*buffer, length / 2, #PB_Unicode)
          Else
            content = PeekS(*buffer, length, #PB_UTF8|#PB_ByteLength)
          EndIf
        EndIf
        FreeMemory(*buffer)
      EndIf
    EndIf
    
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
Btw. you had a bug: the flag #PB_ByteLength is needed if you read #PB_UTF8 in this way, because without it PeekS() treats the length as a number of characters, not bytes.
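
To make the #PB_ByteLength point concrete, here is a minimal sketch. The test string and variable names are only assumptions chosen for illustration; the same length value is passed to PeekS() twice, once counted in bytes and once counted in characters.

```purebasic
; Sketch only: the literal below is an assumption chosen for illustration.
; "Привет" is 6 characters but 12 bytes in UTF-8 (2 bytes per Cyrillic letter).
Define text$ = "Привет мир"
Define *utf8 = UTF8(text$)    ; null-terminated UTF-8 copy of the string
Define firstWordBytes = StringByteLength("Привет", #PB_UTF8)

If *utf8
  ; Length interpreted as BYTES: reading stops after the first word.
  Debug PeekS(*utf8, firstWordBytes, #PB_UTF8 | #PB_ByteLength)

  ; Length interpreted as CHARACTERS: reading continues past the first word
  ; and only ends early here because this buffer is null-terminated.
  Debug PeekS(*utf8, firstWordBytes, #PB_UTF8)

  FreeMemory(*utf8)
EndIf
```

A buffer filled by ReadData() has no null terminator, so without #PB_ByteLength a text with multi-byte characters can make PeekS() read past the end of the buffer.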

Re: Get file content (UTF-8 and UTF-16)?

Posted: Fri Mar 25, 2022 10:15 pm
by jacky
Thanks to both of you!

@infratec

Are you sure that

Code: Select all

            If bom = #PB_Unicode
              content = PeekS(*buffer + offset, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer + offset, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf
is correct?

For a UTF-16 LE file with BOM and this content:

Code: Select all

English : hello world
Russian : Привет мир
Chinese : 你好世界
Greek   : Γειά σου Κόσμε
Japanese: こんにちは世界
Korean  : 안녕하세요 세계
German  : hallo welt
It will cut off the first "E" in the first line...

and for the same file with UTF-8 BOM it will cut off "Eng"

I think it should be:

Code: Select all

            If bom = #PB_Unicode
              content = PeekS(*buffer, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf
At least the output for both test files looks correct in these cases...

Re: Get file content (UTF-8 and UTF-16)?

Posted: Fri Mar 25, 2022 11:39 pm
by infratec
You were right, my code was buggy.

The German help text for ReadData() was misleading.
I understood it to mean that the whole file is loaded, but the data is actually read from the current file position onwards.

I corrected my code above and tested it.

I also placed a 'bug report' in the documentation section.

But that's one reason why 'working code' is needed.
I was too lazy to write test code, so the bug made it into the forum.
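
For anyone who wants to check the file position behaviour themselves, here is a small sketch. The scratch file name is hypothetical. It writes a UTF-8 file with BOM, then shows where the file pointer stands after ReadStringFormat() and how many bytes are left for ReadData().

```purebasic
; Sketch only: "bom_demo.txt" is a hypothetical scratch file name.
Define file$ = GetTemporaryDirectory() + "bom_demo.txt"
Define.i format, offset, remaining

If CreateFile(0, file$)
  WriteStringFormat(0, #PB_UTF8)        ; writes the UTF-8 BOM
  WriteString(0, "English", #PB_UTF8)   ; 7 bytes of text
  CloseFile(0)

  If ReadFile(0, file$)
    format    = ReadStringFormat(0)     ; detects the BOM and moves past it
    offset    = Loc(0)                  ; current position = size of the BOM
    remaining = Lof(0) - offset         ; bytes ReadData() can still deliver
    Debug "Position after ReadStringFormat(): " + Str(offset)
    Debug "Bytes left to read: " + Str(remaining)
    CloseFile(0)
  EndIf

  DeleteFile(file$)
EndIf
```

Because ReadData() continues from that position, the buffer already starts with the first character of the text and no extra offset has to be added when calling PeekS().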

Re: Get file content (UTF-8 and UTF-16)?

Posted: Fri Mar 25, 2022 11:52 pm
by infratec
Btw. this is much easier and more elegant:

Code: Select all

Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile, bom
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    bom = ReadStringFormat(hFile)
    content = ReadString(hFile, bom|#PB_File_IgnoreEOL)
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sat Mar 26, 2022 12:45 am
by jacky
Thanks for correcting it. It now works as expected :D
infratec wrote: Fri Mar 25, 2022 11:52 pm Btw. this is much easier and more elegant
Absolutely! But don't dare to try that one on larger files...

E.g. a 10 MB test file

In-memory variant: 86 ms
"easier and more elegant": 4 s 463 ms

Compiled via the beta 5 C backend (x64)

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sat Mar 26, 2022 1:11 pm
by infratec
Ah...

Then it looks like the flag #PB_File_IgnoreEOL only tells the procedure to read
all lines one after another and concatenate them, which is slow.

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sat Mar 26, 2022 2:08 pm
by jacky
Yeah, that seems to be how it works.

If PB used a fast concatenation method internally, that procedure wouldn't fall so far behind in execution time...
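
Just to illustrate what a "fast concatenation method" could look like (this is a sketch under my own assumptions, not how PB works internally; JoinFast() is a made-up helper name): measure the total length first, allocate the result string once, then copy each piece to its place instead of growing the string step by step.

```purebasic
; Sketch only: JoinFast() is a hypothetical helper, not a PureBasic function.
Procedure.s JoinFast(List parts.s())
  Protected total.i, bytes.i, *pos
  Protected result.s

  ; Pass 1: total length in characters.
  ForEach parts()
    total + Len(parts())
  Next

  ; Allocate the whole result once instead of re-growing it per piece.
  result = Space(total)
  *pos   = @result

  ; Pass 2: copy each piece to its final position.
  ForEach parts()
    bytes = StringByteLength(parts())   ; size in bytes, without the terminator
    If bytes > 0                        ; skip empty strings
      CopyMemory(@parts(), *pos, bytes)
      *pos + bytes
    EndIf
  Next

  ProcedureReturn result
EndProcedure

NewList lines.s()
AddElement(lines()) : lines() = "hello "
AddElement(lines()) : lines() = "world"
Debug JoinFast(lines())
```

The point of the design is that each character is copied once, whereas repeated `a$ + b$` may copy the already accumulated text again for every appended piece.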

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sun Apr 10, 2022 9:22 am
by Rinzwind
jacky wrote: Sat Mar 26, 2022 2:08 pm Yeah, that seems to be how it works.

If PB used a fast concatenation method internally, that procedure wouldn't fall so far behind in execution time...
Hence why I reported it as a bug months ago, but Fred did not consider it one. This is post number x about someone finding out by chance how slow the option #PB_File_IgnoreEOL is. Fred's argument was that it does not use much memory for larger files. However, in that case it is just utterly too slow because of the poor performance of string concatenation. And in the end you still have the whole file text in one variable, so it seems like a nonsense argument to me. Fixing the library would be my advice.

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sun Apr 10, 2022 10:44 am
by infratec
No, the argument is not nonsense.

If you first read the complete file into a buffer and then use PeekS(),
you need the buffer memory AND the string memory.

So if you read a 500 MB ASCII file, you need 500 MB for the buffer + 1000 MB for the string.
If you read only small chunks of 1 kB, you need 1 kB + 1000 MB for the string.
You save nearly 500 MB, which can be essential.

The argument is valid.
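
The numbers can be reproduced with a few lines. This is only a sketch and assumes a Unicode build, where SizeOf(Character) is 2 bytes, which is why the string needs twice the size of the ASCII file.

```purebasic
; Sketch only: sizes taken from the example above, string terminator ignored.
#MB = 1024 * 1024

Define.q fileSize   = 500 * #MB                      ; ASCII file: 1 byte per character
Define.q stringSize = fileSize * SizeOf(Character)   ; PB string: SizeOf(Character) bytes per character
Define.q chunkSize  = 1024                           ; 1 kB read buffer

Debug "Whole-file buffer + string: " + Str((fileSize + stringSize) / #MB) + " MB"
Debug "1 kB chunk + string: about " + Str((chunkSize + stringSize) / #MB) + " MB"
```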

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sun Apr 10, 2022 11:26 am
by Joris
If anyone can post a few small files in those different formats here, it would be a good place to examine the existing problems and their solutions. Or is there a good link to such files?

Thanks

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sun Apr 10, 2022 1:24 pm
by infratec
:?: :?: :?:

Simply use a good text editor and save them with the different encodings. (PSPad, Notepad2, ...)
This is faster than downloading files.

But the speed problem has nothing to do with the file format, only with the size.

Re: Get file content (UTF-8 and UTF-16)?

Posted: Sun Apr 10, 2022 5:11 pm
by Rinzwind
Well, the lack of performance makes it unusable either way when reading a huge file. So the built-in function should be improved. The simple string concatenation it supposedly uses is easy to improve upon. I'm fine with less memory usage, but its implementation is severely lacking and makes some programs needlessly slow, with the programmer being none the wiser because he thinks using a built-in function to read the whole file would be fastest.

Re: Get file content (UTF-8 and UTF-16)?

Posted: Mon Apr 11, 2022 9:15 am
by Fred
I will take a look at increasing the speed of this special case.