					
				Get file content (UTF-8 and UTF-16)?
				Posted: Fri Mar 25, 2022 9:22 am
				by jacky
				Hi,
this procedure works fine (and fast even for larger files) but only for UTF-8.
Is there a way to change it so that it can read both encodings (UTF-8 with or without BOM AND UTF-16 LE with BOM)?
Code: Select all
    Procedure.s GetFileContent(file.s)
      Protected.i size
      size = FileSize(file)
      If size = -1
        ProcedureReturn ""
      EndIf
      Protected.i hFile
      Protected.s content
      hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
      If hFile
        Protected.i length, bytes
        Protected *buffer
        If size >= 1
          length  = Lof(hFile)
          *buffer = AllocateMemory(length)
          If *buffer
            bytes   = ReadData(hFile, *buffer, length)
            content = PeekS(*buffer, length, #PB_UTF8)
            FreeMemory(*buffer)
          EndIf
        EndIf
        CloseFile(hFile)
      EndIf
      ProcedureReturn content
    EndProcedure
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Fri Mar 25, 2022 11:54 am
				by acreis
				
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Fri Mar 25, 2022 12:13 pm
				by infratec
				Try this:
Code: Select all
Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    Protected.i length, bytes, bom, offset
    Protected *buffer
    
    
    If size >= 1
      length  = Lof(hFile)
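      bom = ReadStringFormat(hFile)   ; detects the BOM and moves the file pointer past it
      offset = Loc(hFile)             ; size of the BOM in bytes (0 if there is none)
      
      length - offset                 ; shorthand for: length = length - offset
      
      *buffer = AllocateMemory(length)
      If *buffer
        ; ReadData() reads from the current file position, i.e. after the BOM
        If ReadData(hFile, *buffer, length) = length
          If bom = #PB_Unicode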
            content = PeekS(*buffer, length / 2, #PB_Unicode)
          Else
            content = PeekS(*buffer, length, #PB_UTF8|#PB_ByteLength)
          EndIf
        EndIf
        FreeMemory(*buffer)
      EndIf
    EndIf
    
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure
Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
Btw. you had a bug: the #PB_ByteLength flag is needed if you read #PB_UTF8 this way, because otherwise PeekS() treats the length as a character count, not a byte count.
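A minimal sketch of the byte/character difference mentioned above, assuming a Unicode build of PureBasic and a source file saved as UTF-8. The sample text is taken from the test file later in the thread; the variable names are only illustrative.

```purebasic
text$   = "Привет мир"                           ; 10 characters
byteLen = StringByteLength(text$, #PB_UTF8)      ; 19 bytes: 9 Cyrillic letters x 2 + 1 space

*buffer = AllocateMemory(byteLen + 1)            ; +1 for the terminating zero written by PokeS()
If *buffer
  PokeS(*buffer, text$, -1, #PB_UTF8)            ; store the text as UTF-8, like file data

  ; byteLen is a number of bytes, so #PB_ByteLength is required here.
  ; Without the flag, PeekS() would take 19 as a number of characters.
  decoded$ = PeekS(*buffer, byteLen, #PB_UTF8 | #PB_ByteLength)
  Debug decoded$

  FreeMemory(*buffer)
EndIf
```

With a buffer filled from a file there is no terminating zero, so passing a byte count as a character count can make PeekS() read past the end of the buffer.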
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Fri Mar 25, 2022 10:15 pm
				by jacky
				Thanks to both of you!
@infratec
Are you sure that
Code: Select all
            If bom = #PB_Unicode
              content = PeekS(*buffer + offset, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer + offset, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf
is correct?
For a UTF-16 LE file with BOM and this content:
Code: Select all
English : hello world
Russian : Привет мир
Chinese : 你好世界
Greek   : Γειά σου Κόσμε
Japanese: こんにちは世界
Korean  : 안녕하세요 세계
German  : hallo welt
It will cut off the first "E" in the first line...
and for the same file with UTF-8 BOM it will cut off "Eng"
I think it should be:
Code: Select all
            If bom = #PB_Unicode
              content = PeekS(*buffer, (length - offset) / 2, #PB_Unicode)
            Else
              content = PeekS(*buffer, (length - offset), #PB_UTF8|#PB_ByteLength)
            EndIf
At least the output for both test files looks correct in these cases...
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Fri Mar 25, 2022 11:39 pm
				by infratec
				You were right, my code was buggy.
The German help text of ReadData() was misleading.
I understood it to mean that the whole file is loaded, but the file is read from the current file position onwards.
I corrected my code above and tested it.
I also filed a 'bug report' in the documentation section.
This is exactly why 'working code' is needed.
I was too lazy to write test code, so the bug made it into the forum.
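The behaviour described above can be checked with a small sketch like the following. The scratch file name and the variable names are assumptions made for this illustration; it writes a UTF-8 file with BOM and then reads it back.

```purebasic
file$ = GetTemporaryDirectory() + "bom_demo.txt"  ; hypothetical scratch file

If CreateFile(0, file$)
  WriteStringFormat(0, #PB_UTF8)                  ; writes the 3-byte UTF-8 BOM
  WriteString(0, "abc", #PB_UTF8)                 ; 3 more bytes
  CloseFile(0)
EndIf

If ReadFile(0, file$)
  total     = Lof(0)                              ; whole file size, BOM included
  format    = ReadStringFormat(0)                 ; detects the BOM and skips it
  afterBom  = Loc(0)                              ; current position = size of the BOM
  remaining = total - afterBom

  *buffer = AllocateMemory(remaining)
  If *buffer
    ; ReadData() starts at the current position, so the buffer holds
    ; only the text after the BOM. No extra offset is needed in PeekS().
    If ReadData(0, *buffer, remaining) = remaining
      rest$ = PeekS(*buffer, remaining, #PB_UTF8 | #PB_ByteLength)
    EndIf
    FreeMemory(*buffer)
  EndIf
  CloseFile(0)
EndIf

DeleteFile(file$)
Debug rest$
```

Adding the BOM size to the buffer address a second time is what cut off the first characters in the earlier version.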
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Fri Mar 25, 2022 11:52 pm
				by infratec
				Btw. this is much easier and more elegant:
Code: Select all
Procedure.s GetFileContent(file.s)
  Protected.i size
  
  size = FileSize(file)
  If size = -1
    ProcedureReturn ""
  EndIf
  
  Protected.i hFile, bom
  Protected.s content
  
  hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead)
  If hFile
    bom = ReadStringFormat(hFile)
    content = ReadString(hFile, bom|#PB_File_IgnoreEOL)
    CloseFile(hFile)
  EndIf
  
  ProcedureReturn content
EndProcedure
Filename$ = OpenFileRequester("Choose a text file", "", "TXT|*.txt", 0)
If Filename$
  Debug GetFileContent(Filename$)
EndIf
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sat Mar 26, 2022 12:45 am
				by jacky
				Thanks for correcting it. It now works as expected.
infratec wrote: Fri Mar 25, 2022 11:52 pm
Btw. this is much easier and more elegant
 
Absolutely! But don't dare to try that one on larger files...
E.g. with a 10 MB test file:
In-memory variant: 86 ms
"Easier and more elegant" variant: 4 s 463 ms
Compiled via the beta 5 C backend (x64)
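For anyone who wants to repeat such a comparison, here is a rough timing sketch. It is not the benchmark used above: the scratch file, its size and the variable names are assumptions, and the numbers will differ by machine and compiler. It should be compiled without the debugger and run from a terminal.

```purebasic
file$ = GetTemporaryDirectory() + "timing_demo.txt"   ; hypothetical scratch file

; Build a test file of a bit under 1 MB (size chosen arbitrarily).
If CreateFile(0, file$)
  For i = 1 To 20000
    WriteStringN(0, "The quick brown fox jumps over the lazy dog", #PB_UTF8)
  Next
  CloseFile(0)
EndIf

; Variant 1: ReadString() with #PB_File_IgnoreEOL
t.q = ElapsedMilliseconds()
If ReadFile(0, file$)
  text1$ = ReadString(0, #PB_UTF8 | #PB_File_IgnoreEOL)
  CloseFile(0)
EndIf
t1.q = ElapsedMilliseconds() - t

; Variant 2: ReadData() into a buffer, then PeekS()
t = ElapsedMilliseconds()
If ReadFile(0, file$)
  length  = Lof(0)
  *buffer = AllocateMemory(length)
  If *buffer
    If ReadData(0, *buffer, length) = length
      text2$ = PeekS(*buffer, length, #PB_UTF8 | #PB_ByteLength)
    EndIf
    FreeMemory(*buffer)
  EndIf
  CloseFile(0)
EndIf
t2.q = ElapsedMilliseconds() - t

DeleteFile(file$)

If OpenConsole()
  PrintN("ReadString + #PB_File_IgnoreEOL: " + Str(t1) + " ms")
  PrintN("ReadData + PeekS:                " + Str(t2) + " ms")
EndIf
```

Raising the loop count makes the gap between the two variants easier to see.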
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sat Mar 26, 2022 1:11 pm
				by infratec
				Ah...
So it looks like the flag #PB_File_IgnoreEOL only tells the procedure to read
all lines one after another and concatenate them, which is slow.
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sat Mar 26, 2022 2:08 pm
				by jacky
				Yeah, that seems to be how it works.
If PB internally used a fast concatenation method, that procedure wouldn't fall behind in execution time that much...
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sun Apr 10, 2022 9:22 am
				by Rinzwind
				jacky wrote: Sat Mar 26, 2022 2:08 pm
Yeah, that seems to be how it works.
If PB internally used a fast concatenation method, that procedure wouldn't fall behind in execution time that much...
 
That is why I reported it as a bug months ago, but Fred did not consider it one. This is post number x about finding out by chance how slow the #PB_File_IgnoreEOL option is. Fred's argument was that it does not use much memory for larger files. However, in that case it is simply far too slow because of the poor performance of string concatenation. And in the end you still have the whole file text in one variable, so it seems like a nonsense argument to me. Fixing the library would be my advice.
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sun Apr 10, 2022 10:44 am
				by infratec
				No, the argument is not nonsense.
If you first read the complete file into a buffer and then use PeekS(),
you need the buffer memory AND the string memory.
So if you read a 500 MB ASCII file, you need a 500 MB buffer + 1000 MB for the string (PB strings use 2 bytes per character).
If you read only small chunks of 1 kB, you need 1 kB + 1000 MB for the string.
You save nearly 500 MB, which can be essential.
The argument is valid.
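The arithmetic behind these numbers, as a small sketch. It assumes a Unicode build of PureBasic, where SizeOf(Character) is 2; the constant and variable names are made up for this example.

```purebasic
#MB = 1024 * 1024

fileBytes.q   = 500 * #MB                        ; 500 MB ASCII file, 1 byte per character
stringBytes.q = fileBytes * SizeOf(Character)    ; each character takes 2 bytes in the string
chunkBytes.q  = 1024                             ; 1 kB read buffer

wholeBuffer.q = fileBytes  + stringBytes         ; 500 MB buffer + 1000 MB string
chunked.q     = chunkBytes + stringBytes         ; 1 kB buffer + 1000 MB string
saved.q       = wholeBuffer - chunked            ; 500 MB minus 1 kB

Debug "Whole-file buffer: " + Str(wholeBuffer) + " bytes"
Debug "Chunked reading:   " + Str(chunked) + " bytes"
Debug "Saved:             " + Str(saved) + " bytes"
```

The string itself costs the same either way; only the temporary buffer differs.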
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sun Apr 10, 2022 11:26 am
				by Joris
				If anyone can post a few small files in those different formats here, it would be a good place to examine the existing problems and their solutions. Or is there a good link to such files?
Thanks
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sun Apr 10, 2022 1:24 pm
				by infratec
				
  
  
 
Simply use a good text editor and save them with the different encodings (PSPad, Notepad2, ...).
This is faster than downloading files.
But the speed problem has nothing to do with the file format, only with the size.
 
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Sun Apr 10, 2022 5:11 pm
				by Rinzwind
				Well, the lack of performance makes it unusable either way when reading a huge file. So the built-in function should be improved. The simple string concatenation it supposedly uses is easy to improve upon. I'm fine with less memory usage, but its implementation is severely lacking and makes some programs needlessly slow, with the programmer unaware because he assumes that using a built-in function to read the whole file would be fastest.
			 
			
					
				Re: Get file content (UTF-8 and UTF-16)?
				Posted: Mon Apr 11, 2022 9:15 am
				by Fred
				I will take a look at speeding up this special case.