Get file content (UTF-8 and UTF-16)?

jacky
User
Posts: 66
Joined: Mon Jan 21, 2019 1:41 pm

Hi,

This procedure works fine (and is fast even for larger files), but only for UTF-8.

Is there a way to change it so that it can read both encodings (UTF-8 with or without BOM AND UTF-16 LE with BOM)?

Code: Select all

    Procedure.s GetFileContent(file.s)
      Protected.i size

      size = FileSize(file)
      If size = -1
        ProcedureReturn ""
      EndIf

      Protected.i hFile
      Protected.s content

      hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
      If hFile
        Protected.i length, bytes
        Protected *buffer

        If size >= 1
          length  = Lof(hFile)
          *buffer = AllocateMemory(length)

          If *buffer
            bytes   = ReadData(hFile, *buffer, length)
            content = PeekS(*buffer, length, #PB_UTF8)
            FreeMemory(*buffer)
          EndIf
        EndIf

        CloseFile(hFile)
      EndIf

      ProcedureReturn content
    EndProcedure
acreis
Enthusiast
Posts: 204
Joined: Fri Jun 01, 2012 12:20 am

infratec
Always Here
Posts: 7619
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Try this:

Code: Select all

Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
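    Protected.i length, bom, offset
    Protected *buffer
    
    If size >= 1
      length = Lof(hFile)
      
      ; ReadStringFormat() detects a BOM and leaves the file pointer right after it
      bom    = ReadStringFormat(hFile)
      offset = Loc(hFile)
      
      ; ReadData() reads from the current position, so only the bytes after the BOM remain
      length - offset
      
      *buffer = AllocateMemory(length)
      If *buffer
        If ReadData(hFile, *buffer, length) = length
          If bom = #PB_Unicode
            ; UTF-16: PeekS() expects the length in characters (2 bytes each)
            content = PeekS(*buffer, length / 2, #PB_Unicode)
          Else
            ; UTF-8: #PB_ByteLength makes PeekS() treat the length as bytes
            content = PeekS(*buffer, length, #PB_UTF8|#PB_ByteLength)
          EndIf
        EndIf
        FreeMemory(*buffer)
      EndIf
    EndIf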
    
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
By the way, your code had a bug: if you pass a byte count as the length to PeekS() with #PB_UTF8, the flag #PB_ByteLength is needed.
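
To illustrate the point about #PB_ByteLength: without the flag, PeekS() treats the length as a character count, so passing a file's byte count for UTF-8 text with multi-byte characters can make it read past the end of a buffer that has no terminating null. A minimal sketch, assuming the source file is saved as UTF-8 (the sample string and variable names are only for illustration):

```purebasic
; Sketch only: the sample text and variable names are illustrative.
Define text$   = "Привет"                          ; 6 characters
Define bytes   = StringByteLength(text$, #PB_UTF8) ; 12 bytes, each Cyrillic letter takes 2 bytes in UTF-8
Define *buffer = AllocateMemory(bytes + 1)         ; +1 for the terminating null written by PokeS()

If *buffer
  PokeS(*buffer, text$, -1, #PB_UTF8)

  ; Length is a character count here: reads the first 3 characters
  Debug PeekS(*buffer, 3, #PB_UTF8)

  ; Length is a byte count here: 6 bytes are also the first 3 characters
  Debug PeekS(*buffer, 6, #PB_UTF8 | #PB_ByteLength)

  FreeMemory(*buffer)
EndIf
```

Both calls should show the same three characters, which is why a byte count only matches the text when #PB_ByteLength is set.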
Last edited by infratec on Fri Mar 25, 2022 11:38 pm, edited 1 time in total.
jacky
User
Posts: 66
Joined: Mon Jan 21, 2019 1:41 pm

Thanks to both of you!

@infratec

Are you sure that

Code: Select all

            If bom = #PB_Unicode
              content = PeekS(*buffer + offset, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer + offset, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf
is correct?

For a UTF-16 LE file with BOM and this content:

Code: Select all

English : hello world
Russian : Привет мир
Chinese : 你好世界
Greek   : Γειά σου Κόσμε
Japanese: こんにちは世界
Korean  : 안녕하세요 세계
German  : hallo welt
it cuts off the first "E" in the first line,

and for the same file saved as UTF-8 with BOM it cuts off "Eng".

I think it should be:

Code: Select all

            If bom = #PB_Unicode
              content = PeekS(*buffer, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf
At least the output for both test files looks correct in these cases...
infratec
Always Here
Posts: 7619
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

You were right, my code was buggy.

The German help text of ReadData() was misleading.
I understood it to mean that the whole file is loaded, but in fact the data is read from the current file position onwards.
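
A small sketch of that behaviour, assuming a hypothetical file name "test.txt": after ReadStringFormat() the file pointer sits right behind the BOM, so Loc() gives the number of BOM bytes and a following ReadData() starts at the first character of the text.

```purebasic
; Sketch only: "test.txt" is a hypothetical file name.
Define file = ReadFile(#PB_Any, "test.txt")

If file
  Define format    = ReadStringFormat(file) ; consumes the BOM if there is one
  Define offset    = Loc(file)              ; 0 without BOM, 3 for a UTF-8 BOM, 2 for a UTF-16 LE BOM
  Define remaining = Lof(file) - offset     ; bytes that ReadData() can still deliver

  Debug "BOM bytes skipped: " + Str(offset)
  Debug "Bytes after the BOM: " + Str(remaining)

  CloseFile(file)
EndIf
```

Adding the offset to the buffer pointer a second time skips 2 or 3 bytes of real text, which matches the missing "E" (UTF-16) and "Eng" (UTF-8) reported above.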

I corrected my code above and tested it.

I also placed a 'bug report' in the documentation section.

But that's the reason why 'working code' is needed.
I was too lazy to write test code, so the bug made it into the forum.
infratec
Always Here
Posts: 7619
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Btw. this is much easier and more elegant:

Code: Select all

Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile, bom
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    ; Detect the BOM, then read everything after it as one string
    bom = ReadStringFormat(hFile)
    content = ReadString(hFile, bom|#PB_File_IgnoreEOL)
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure

Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
jacky
User
Posts: 66
Joined: Mon Jan 21, 2019 1:41 pm

Thanks for correcting it. It now works as expected :D
infratec wrote: Btw. this is much easier and more elegant

Absolutely! But don't dare to try that one on larger files...

For example, with a 10 MB test file:

In-memory variant: 86 ms
"Easier and more elegant" variant: 4 s 463 ms

Compiled with the beta 5 C backend (x64).
infratec
Always Here
Posts: 7619
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Ah...

Then it looks like the flag #PB_File_IgnoreEOL only tells the function to read
all lines one after another and concatenate them, which is slow.
jacky
User
Posts: 66
Joined: Mon Jan 21, 2019 1:41 pm

Yeah, that seems to be how it works.

If PB used a fast concatenation method internally, that procedure wouldn't fall behind in execution time that much...
Rinzwind
Enthusiast
Posts: 690
Joined: Wed Mar 11, 2009 4:06 pm
Location: NL

jacky wrote: Sat Mar 26, 2022 2:08 pm Yeah, that seems to be how it works.

If PB used a fast concatenation method internally, that procedure wouldn't fall behind in execution time that much...

That's why I reported it as a bug months ago, but Fred did not consider it one. This is yet another post about finding out by chance how slow the option #PB_File_IgnoreEOL is. Fred's argument was that it does not use much memory for larger files. However, in that case it is just far too slow because of the poor performance of string concatenation. And in the end you still have the whole file text in one variable, so it seems like a nonsense argument to me. Fixing the library would be my advice.
infratec
Always Here
Posts: 7619
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

No, the argument is not nonsense.

If you first read the complete file into a buffer and then use PeekS(),
you need the buffer memory AND the string memory.

So if you read a 500 MB ASCII file, you need a 500 MB buffer + 1000 MB for the string (PB strings are stored as UTF-16, 2 bytes per character).
If you read only small chunks of 1 kB, you need 1 kB + 1000 MB for the string.
You save nearly 500 MB, which can be essential.

The argument is valid.
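
The chunk idea can be sketched as follows. This is only an illustration under stated assumptions, not how ReadString() is implemented: the procedure name ReadAsciiInChunks and the constant #ChunkSize are hypothetical, and the sketch handles single-byte (ASCII) files only, because a multi-byte UTF-8 sequence could be split across a chunk boundary. The final string is allocated once and each chunk is copied into place, so no string concatenation is involved.

```purebasic
#ChunkSize = 65536 ; hypothetical chunk size in bytes

; Sketch for single-byte (ASCII) files only.
Procedure.s ReadAsciiInChunks(file.s)
  Protected.i hFile, size, pos, toRead, got
  Protected.s content, part
  Protected *chunk

  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    size   = Lof(hFile)
    *chunk = AllocateMemory(#ChunkSize)

    If *chunk And size > 0
      content = Space(size) ; allocate the final string once, one character per byte

      While pos < size
        ; Never request more than the space left in the preallocated string
        toRead = size - pos
        If toRead > #ChunkSize
          toRead = #ChunkSize
        EndIf

        got = ReadData(hFile, *chunk, toRead)
        If got <= 0
          Break
        EndIf

        part = PeekS(*chunk, got, #PB_Ascii)
        ; Copy the decoded chunk into place without writing a terminating null
        PokeS(@content + pos * SizeOf(Character), part, got, #PB_Unicode | #PB_String_NoZero)
        pos + got
      Wend
    EndIf

    If *chunk
      FreeMemory(*chunk)
    EndIf
    CloseFile(hFile)
  EndIf

  ProcedureReturn content
EndProcedure
```

With this layout the peak memory is roughly the chunk buffer plus the final string, which is the saving described above. Files containing null bytes would leave gaps, since PeekS() stops at the first null.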
Joris
Addict
Posts: 890
Joined: Fri Oct 16, 2009 10:12 am
Location: BE

If anyone can post a few small files in those different formats here, it would be a good place to examine the existing problems and their solutions. Or is there a good link to such files?

Thanks
infratec
Always Here
Posts: 7619
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

:?: :?: :?:

Simply use a good text editor and save them with the different encodings (PSPad, Notepad2, ...).
This is faster than downloading files.

But the speed problem has nothing to do with the file format, only with the size.
Rinzwind
Enthusiast
Posts: 690
Joined: Wed Mar 11, 2009 4:06 pm
Location: NL

Well, the lack of performance makes it unusable either way when reading a huge file. So the built-in function should be improved. The simple string concatenation it supposedly uses is easy to improve upon. I'm fine with less memory usage, but its implementation is severely lacking and makes some programs needlessly slow, with the programmer being unaware because he assumes that using a built-in function to read the whole file would be fastest.
Fred
Administrator
Posts: 18237
Joined: Fri May 17, 2002 4:39 pm
Location: France

I will take a look at increasing the speed of this special case.