Get file content (UTF-8 and UTF-16)?

jacky · Post by **jacky** » Fri Mar 25, 2022 9:22 am

Hi,

This procedure works fine (and is fast even for larger files), but only for UTF-8.

Is there a way to change it so that it can read both encodings (UTF-8 with or without BOM, and UTF-16 LE with BOM)?

Code:

    Procedure.s GetFileContent(file.s)
      Protected.i size

      size = FileSize(file)
      If size = -1
        ProcedureReturn ""
      EndIf

      Protected.i hFile
      Protected.s content

      hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
      If hFile
        Protected.i length, bytes
        Protected *buffer

        If size >= 1
          length  = Lof(hFile)
          *buffer = AllocateMemory(length)

          If *buffer
            bytes   = ReadData(hFile, *buffer, length)
            content = PeekS(*buffer, length, #PB_UTF8)
            FreeMemory(*buffer)
          EndIf
        EndIf

        CloseFile(hFile)
      EndIf

      ProcedureReturn content
    EndProcedure

acreis · Post by **acreis** » Fri Mar 25, 2022 11:54 am

Please look at:

https://www.purebasic.fr/english/viewtopic.php?t=64385

infratec · Post by **infratec** » Fri Mar 25, 2022 12:13 pm

Try this:

Code:

Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
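    Protected.i length, bytes, bom, offset
    Protected *buffer
    
    If size >= 1
      length = Lof(hFile)

      ; ReadStringFormat() detects a BOM and leaves the file pointer right after it
      bom    = ReadStringFormat(hFile)
      offset = Loc(hFile)
      
      ; ReadData() reads from the current position, so only the bytes after the BOM remain
      length - offset
      
      *buffer = AllocateMemory(length)
      If *buffer
        If ReadData(hFile, *buffer, length) = length
          If bom = #PB_Unicode
            ; UTF-16: PeekS() takes the length in characters (2 bytes each)
            content = PeekS(*buffer, length / 2, #PB_Unicode)
          Else
            ; UTF-8: #PB_ByteLength makes PeekS() take the length in bytes
            content = PeekS(*buffer, length, #PB_UTF8|#PB_ByteLength)
          EndIf
        EndIf
        FreeMemory(*buffer)
      EndIf
    EndIf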
    
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf

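Btw. you had a bug: the flag #PB_ByteLength is needed if you read #PB_UTF8 in this way. Without it, PeekS() treats the length as a character count, so text containing multi-byte UTF-8 characters can be read past the end of the buffer.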

jacky · Post by **jacky** » Fri Mar 25, 2022 10:15 pm

Thanks to both of you!

@infratec

Are you sure that

Code:

            If bom = #PB_Unicode
              content = PeekS(*buffer + offset, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer + offset, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf

is correct?

For a UTF-16 LE file with BOM and this content:

Code:

English : hello world
Russian : Привет мир
Chinese : 你好世界
Greek   : Γειά σου Κόσμε
Japanese: こんにちは世界
Korean  : 안녕하세요 세계
German  : hallo welt

It will cut off the first "E" in the first line...

and for the same file with UTF-8 BOM it will cut off "Eng"

I think it should be:

Code:

            If bom = #PB_Unicode
              content = PeekS(*buffer, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf

At least the output for both test files looks correct in these cases...

infratec · Post by **infratec** » Fri Mar 25, 2022 11:39 pm

You were right, my code was buggy.

The German help text for ReadData() was misleading.
I understood it to mean that the whole file is loaded, but the data is actually read from the current file position onwards.
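
A small sketch can make the file position visible. It only uses ReadStringFormat(), Loc() and Lof(); the file name "test.txt" is a placeholder, not something from this thread. For a UTF-16 LE BOM the position should be 2 bytes and for a UTF-8 BOM 3 bytes, which would match the lost "E" and "Eng" described above.

```purebasic
; Sketch: where is the file pointer after ReadStringFormat()?
; "test.txt" is a hypothetical file name.
Define.i hFile, format

hFile = ReadFile(#PB_Any, "test.txt")
If hFile
  format = ReadStringFormat(hFile) ; skips the BOM if there is one

  Select format
    Case #PB_Unicode
      Debug "UTF-16 LE BOM found"
    Case #PB_UTF8
      Debug "UTF-8 BOM found"
    Case #PB_Ascii
      Debug "No BOM found"
    Default
      Debug "Other BOM found"
  EndSelect

  ; Loc() is now the size of the BOM; ReadData() continues from here
  Debug "Bytes skipped: " + Str(Loc(hFile))
  Debug "Bytes left for ReadData(): " + Str(Lof(hFile) - Loc(hFile))

  CloseFile(hFile)
EndIf
```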

I corrected my code above and tested it.

I also placed a 'bug report' in the documentation section.

But that's the reason why 'working code' is needed.
I was too lazy to write test code, so the bug made it into the forum.

infratec · Post by **infratec** » Fri Mar 25, 2022 11:52 pm

Btw. this is much easier and more elegant:

Code:

Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile, bom
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    bom = ReadStringFormat(hFile)
    content = ReadString(hFile, bom|#PB_File_IgnoreEOL)
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf

jacky · Post by **jacky** » Sat Mar 26, 2022 12:45 am

Thanks for correcting it. It now works as expected.

infratec wrote: Btw. this is much easier and more elegant

Absolutely! But don't dare to try that one on larger files...

E.g. with a 10 MB test file:

In-memory variant: 86 ms
"easier and more elegant": 4 s 463 ms

Compiled with the beta 5 C backend (x64).
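
For anyone who wants to repeat the comparison, here is a rough timing sketch. The procedure names ReadViaBuffer and ReadViaIgnoreEOL and the file name "big.txt" are made up for this example, and the numbers will depend on the PureBasic version, the backend and the machine. It should be compiled without the debugger, otherwise the timings are distorted.

```purebasic
; Timing sketch: buffer + PeekS() versus ReadString() with #PB_File_IgnoreEOL.
; "big.txt" and both procedure names are hypothetical.
EnableExplicit

Procedure.s ReadViaBuffer(file.s)
  Protected.i hFile, format, length
  Protected.s content
  Protected *buffer

  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    format = ReadStringFormat(hFile)
    length = Lof(hFile) - Loc(hFile) ; bytes after the BOM
    If length > 0
      *buffer = AllocateMemory(length)
      If *buffer
        If ReadData(hFile, *buffer, length) = length
          If format = #PB_Unicode
            content = PeekS(*buffer, length / 2, #PB_Unicode)
          Else
            content = PeekS(*buffer, length, #PB_UTF8 | #PB_ByteLength)
          EndIf
        EndIf
        FreeMemory(*buffer)
      EndIf
    EndIf
    CloseFile(hFile)
  EndIf

  ProcedureReturn content
EndProcedure

Procedure.s ReadViaIgnoreEOL(file.s)
  Protected.i hFile, format
  Protected.s content

  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    format  = ReadStringFormat(hFile)
    content = ReadString(hFile, format | #PB_File_IgnoreEOL)
    CloseFile(hFile)
  EndIf

  ProcedureReturn content
EndProcedure

Define.q start
Define.s text
Define.s file = "big.txt"

If OpenConsole()
  start = ElapsedMilliseconds()
  text  = ReadViaBuffer(file)
  PrintN("Buffer + PeekS   : " + Str(ElapsedMilliseconds() - start) + " ms, " + Str(Len(text)) + " characters")

  start = ElapsedMilliseconds()
  text  = ReadViaIgnoreEOL(file)
  PrintN("ReadString + flag: " + Str(ElapsedMilliseconds() - start) + " ms, " + Str(Len(text)) + " characters")

  Input() ; keep the console open until Return is pressed
  CloseConsole()
EndIf
```

Note that with a file without BOM, ReadStringFormat() returns #PB_Ascii, so the ReadString() variant would read a BOM-less UTF-8 file as ASCII, while the buffer variant treats it as UTF-8.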

infratec · Post by **infratec** » Sat Mar 26, 2022 1:11 pm

Ah...

Then it looks like the flag #PB_File_IgnoreEOL only tells the procedure to read
all lines one after another and concatenate them, which is slow.

jacky · Post by **jacky** » Sat Mar 26, 2022 2:08 pm

Yeah, that seems to be how it works.

If PB internally used a fast concatenation method, that procedure wouldn't fall behind in execution time that much...

Rinzwind · Post by **Rinzwind** » Sun Apr 10, 2022 9:22 am

jacky wrote: Sat Mar 26, 2022 2:08 pm Yeah, that seems to be how it works.

If PB internally used a fast concatenation method, that procedure wouldn't fall behind in execution time that much...

That is why I reported it as a bug months ago, but Fred did not consider it one. This is post number x about someone finding out by chance how slow the option #PB_File_IgnoreEOL is. Fred's argument was that it does not use much memory for larger files. However, in that case it is simply far too slow because of the poor performance of string concatenation. And in the end you still have the whole file text in one variable, so it seems like a nonsense argument to me. Fixing the lib would be my advice.

infratec · Post by **infratec** » Sun Apr 10, 2022 10:44 am

No, the argument is not nonsense.

If you first read the complete file into a buffer and then use PeekS(),
you need the buffer memory AND the string memory.

So if you read a 500 MB ASCII file, you need a 500 MB buffer + 1000 MB for the string (PureBasic strings are stored as UTF-16, i.e. 2 bytes per character).
If you read only small chunks of 1 KB, you need 1 KB + 1000 MB for the string.
You save nearly 500 MB, which can be essential.

The argument is valid.
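
The arithmetic can be written down as a tiny sketch. All variable names are made up for illustration, and it only counts the read buffer plus the final string; temporary copies made during concatenation are ignored.

```purebasic
; Rough peak-memory estimate for the two strategies.
; Assumes a Unicode build, where SizeOf(Character) is 2.
EnableExplicit

Define.q fileBytes   = 500 * 1024 * 1024              ; 500 MB ASCII file
Define.q chunkBytes  = 1024                           ; 1 KB read buffer
Define.q stringBytes = fileBytes * SizeOf(Character)  ; one character per ASCII byte

Define.q wholeFilePeak = fileBytes  + stringBytes     ; full buffer + string
Define.q chunkedPeak   = chunkBytes + stringBytes     ; small buffer + string

Debug "Whole-file buffer: " + Str(wholeFilePeak) + " bytes"
Debug "1 KB chunks      : " + Str(chunkedPeak) + " bytes"
Debug "Saved            : " + Str(wholeFilePeak - chunkedPeak) + " bytes"
```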

Joris · Post by **Joris** » Sun Apr 10, 2022 11:26 am

If anyone can post a few small files in those different formats here, it would be a good place to examine the existing problems and their solutions. Or is there a good link to such files?

Thanks

infratec · Post by **infratec** » Sun Apr 10, 2022 1:24 pm

Simply use a good text editor and save them with the different encodings. (PSPad, Notepad2, ...)
This is faster than downloading files.

But the speed problem has nothing to do with the file format, only with the size.

Rinzwind · Post by **Rinzwind** » Sun Apr 10, 2022 5:11 pm

Well, the lack of performance makes it unusable either way when reading a huge file. So the built-in function should be improved. The simple string concatenation it supposedly uses is easy to improve upon. I'm fine with less memory usage, but its implementation is severely lacking and makes some programs needlessly slow, with the programmer being unaware of it because he thinks using a built-in function to read the whole file would be fastest.

Post by **Fred** » Mon Apr 11, 2022 9:15 am

I will take a look at increasing the speed of this special case.
