
Reading huge files

Posted: Tue Jun 23, 2020 5:30 pm
by olmak
Hi all! I have the following task: read very large files (up to 10 GB) for further processing.
An array element is created for each line of the file.
After reading the forums I put together this fragment:

Code: Select all

Procedure ReadFileIntoArray(file$, Array StringArray.s(1), Separator.s = " ")
  Define String.s ; The line to which we copy the memory area containing the entire read file
  Protected S.String, *S.Integer = @S
  Protected.i countFileString, i, pos_separ, slen
  
  file_handler=ReadFile(#PB_Any,file$)
  If file_handler
    ReadStringFormat(file_handler) ; skip the BOM, if present
    lengthFile.q=(Lof(file_handler)-Loc(file_handler)) -2 ; remaining bytes (-2 apparently trims a trailing CRLF)
    If lengthFile>0
      pointMemForReadFile=AllocateMemory(lengthFile) 
      If pointMemForReadFile 
        numberBytesReadingFromFile = ReadData(file_handler,pointMemForReadFile,lengthFile)    
        String=PeekS(pointMemForReadFile,MemorySize(pointMemForReadFile),#PB_UTF8) ; the whole file as one UTF-16 string
        countFileString = CountString(String, Separator) ; separators = lines - 1
        slen = Len(Separator)
        ReDim StringArray(countFileString)
        *S\i = @String ; alias the big string via a pointer
        While i < countFileString
          pos_separ = FindString(S\s, Separator) ; find the next "Separator"
          StringArray(i) = PeekS(*S\i, pos_separ - 1) ; copy one line
          *S\i + (pos_separ + slen - 1) << #PB_Compiler_Unicode ; advance past the separator (2 bytes per character in Unicode)
          i + 1
        Wend
        StringArray(i) = S\s ; the remainder after the last separator
        *S\i = 0 ; detach the alias so the real string is not freed twice
        FreeMemory(pointMemForReadFile)            
        String=""                                
      Else 
        Debug "Memory Allocation error"
      EndIf
    EndIf
    CloseFile(file_handler)
  EndIf
EndProcedure

Dim LogString$(0) ; An array in which each line of the file will be read
ReadFileIntoArray("e:\0YP\Purebasic\LogAnalyzer\Log\test.log", LogString$() , Chr(10))
CountLogString=ArraySize(LogString$())  ; The number of lines in the read file
Debug LogString$(0) ; Print the first line of the file
Debug LogString$(CountLogString) ; Print the last line of the file
The problem is that once the file size reaches somewhere around 242 MB or more, the program stops with an error:
[14:42:52] [ERROR] Invalid memory access. (write error at address 0)
The error mostly occurs on the line: String=PeekS(pointMemForReadFile,MemorySize(pointMemForReadFile),#PB_UTF8)
Sometimes on the line: StringArray(i) = S\s
The amount of available memory, checked at runtime with MemoryStatus(#PB_System_FreePhysical), is about 10 GB.
I have no experience or solid knowledge of working with memory, and plain line-by-line reading of the file is far too slow.
I need constructive advice on how best to solve this.

Re: Reading huge files

Posted: Tue Jun 23, 2020 6:00 pm
by skywalk
pos_separ = FindString(S\s, Separator) ;separator "Separator"
You should post working code.
As you found, this approach is not feasible for very large files.
You could loop through "block sizes" of your choosing.
Search the forum, this has been done sooooo many times.
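
For illustration, here is a minimal sketch of such a block-size loop, assuming LF-separated UTF-8 text, a hypothetical 16 MB block size, and a hypothetical ProcessFileInBlocks procedure name; StringField is used for readability, not speed:

Code: Select all

#Block = 16 * 1024 * 1024 ; hypothetical block size: 16 MB per read

Procedure ProcessFileInBlocks(file$)
  Protected file = ReadFile(#PB_Any, file$)
  Protected *buf, bytes, i, nsep
  Protected chunk.s, carry.s, line.s
  If file
    ReadStringFormat(file) ; skip the BOM, if present
    *buf = AllocateMemory(#Block)
    If *buf
      While Not Eof(file)
        bytes = ReadData(file, *buf, #Block)
        ; prepend the unfinished line carried over from the previous block
        chunk = carry + PeekS(*buf, bytes, #PB_UTF8 | #PB_ByteLength)
        nsep = CountString(chunk, Chr(10))
        For i = 1 To nsep
          line = StringField(chunk, i, Chr(10))
          ; ... process one complete line here ...
        Next
        carry = StringField(chunk, nsep + 1, Chr(10)) ; incomplete tail, if any
      Wend
      If carry <> ""
        ; ... process the final line here ...
      EndIf
      FreeMemory(*buf)
    EndIf
    CloseFile(file)
  EndIf
EndProcedure

The carry variable is the key detail: a block boundary usually falls in the middle of a line, so the unfinished tail has to be prepended to the next block (note that a multi-byte UTF-8 character can also straddle a boundary; for plain ASCII logs that doesn't matter). For real throughput one would scan the buffer bytes directly instead of going through StringField.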

Re: Reading huge files

Posted: Tue Jun 23, 2020 6:31 pm
by olmak
Sorry for the typo; I removed the comments and some leftovers remained. As I understand it, you advise processing the data in batches. I thought about that too, but I hoped there might be a more elegant solution. In my case the size of the String variable into which I copy the entire file from memory is the critical factor. That is strange, because the PureBasic manual says the string type has no size restrictions. Anyway, thanks for the answer.

Re: Reading huge files

Posted: Tue Jun 23, 2020 6:41 pm
by Kiffi
@olmak: Let me get this straight. First you load the large file (up to 10 GB) into memory, then you copy the memory contents into a string, and then from the string into a string array? How much memory does your computer have?

Re: Reading huge files

Posted: Tue Jun 23, 2020 7:20 pm
by olmak
Kiffi wrote:@olmak: Let me get this straight. First you load the large file (up to 10 GB) into memory, then you copy the memory contents into a string, and then from the string into a string array? How much memory does your computer have?
Yes, exactly. The computer has 16 GB of RAM. And right now I'm talking about files of at least 5 GB; 10 GB is for the future.

Re: Reading huge files

Posted: Tue Jun 23, 2020 7:21 pm
by Saki
That's strange code.
It will also take far too much time!
I think the approach itself is wrong.
It's not going to work with strings like that.

Best Regards Saki

Re: Reading huge files

Posted: Tue Jun 23, 2020 8:15 pm
by Marc56us
Yes, strings in PB have no fixed limit; they are ASCIIZ-style null-terminated strings, as in C (see "null-terminated string").

If the file is a text file, try a normal ReadString with #PB_File_IgnoreEOL:

Code: Select all

If Not OpenFile(0, GetTemporaryDirectory() + "File_10_MB.txt") ; 10 MB
  Debug "File not found"
  End
EndIf
 
Start = ElapsedMilliseconds()
Debug "Reading..."
While Not Eof(0)
  Txt$ = ReadString(0, #PB_Ascii | #PB_File_IgnoreEOL)
Wend
CloseFile(0)
Debug "Done."
Debug FormatNumber((ElapsedMilliseconds() - Start) / 1000, 2) + " secs"
Debug "Len string Txt$ : " + FormatNumber(Len(Txt$), 0)
On my i7 @3.2 GHz - Windows 10x64 - SSD

Code: Select all

Reading...
Done.
5.09 secs
Len string Txt$ : 10,577,903
:wink:

Re: Reading huge files

Posted: Tue Jun 23, 2020 8:21 pm
by Saki
FindString always has to scan to the end of the string first, again and again.

And 5 seconds for 10 MB is much, much too slow.
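
If one does stay with strings, FindString's optional start position at least lets each search resume where the previous one ended, instead of restarting at position 1. A tiny hedged illustration with made-up sample text:

Code: Select all

Text$ = "alpha" + Chr(10) + "beta" + Chr(10) + "gamma" ; hypothetical sample data
pos = 1
Repeat
  found = FindString(Text$, Chr(10), pos) ; resume the search at pos
  If found
    Debug Mid(Text$, pos, found - pos) ; one complete line
    pos = found + 1
  Else
    Debug Mid(Text$, pos) ; the tail after the last separator
  EndIf
Until found = 0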

Re: Reading huge files

Posted: Tue Jun 23, 2020 9:02 pm
by mk-soft
A 10 GB text file -> 20 GB of RAM as Unicode (PB strings are UTF-16, two bytes per character).

Re: Reading huge files

Posted: Tue Jun 23, 2020 9:23 pm
by Saki
You must read the file as binary, not as a string!
Then you must search for your separators in the binary data too, not as strings!
Put your data sets into a list on demand :wink:

But honestly, a 10 GB text file seems a bit strange to me.
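
A minimal sketch of that binary approach, assuming LF (byte 10) as the separator and a hypothetical 64 MB block size; it only counts the lines, but the same scan could record separator offsets for later extraction:

Code: Select all

#BinBlock = 64 * 1024 * 1024 ; hypothetical block size

Procedure.q CountLinesBinary(file$)
  Protected file = ReadFile(#PB_Any, file$)
  Protected *buf, bytes, i
  Protected lines.q
  If file
    *buf = AllocateMemory(#BinBlock)
    If *buf
      While Not Eof(file)
        bytes = ReadData(file, *buf, #BinBlock)
        For i = 0 To bytes - 1
          If PeekA(*buf + i) = 10 ; LF byte found: one more line
            lines + 1
          EndIf
        Next
      Wend
      FreeMemory(*buf)
    EndIf
    CloseFile(file)
  EndIf
  ProcedureReturn lines
EndProcedure

Debug CountLinesBinary("test.log") ; hypothetical file name

Because nothing is ever converted to a PB string here, the two-bytes-per-character Unicode blow-up mk-soft mentions above never occurs; individual lines can still be PeekS'd on demand from recorded offsets.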

Re: Reading huge files

Posted: Wed Jun 24, 2020 7:42 am
by Marc56us
Saki wrote:But honestly, a 10 GB text file seems a bit strange to me.
Log files and database dumps are often much larger than that. Server logging systems rotate files daily or by size, but the files are still large.

Big text files are the staple "food" of system administrators.
Most often this type of file is processed as a stream, with specialized tools (Perl, Grep, AWK, etc.; yes, these tools also exist under Windows).
But sometimes it is necessary to edit one in its entirety, even though tools such as Grep can display any number of lines before and after the searched text.

We've been doing this for years, even on machines with less RAM than the file size (some text editors, or the system, swap blocks of lines in and out of RAM).

In PB, one can use Scintilla, which has much greater capacity and is faster than the EditorGadget.

:wink:

Re: Reading huge files

Posted: Wed Jun 24, 2020 9:12 am
by Saki
Hi,
OK, in principle it works, but PB's string handling has to be taken into account.
PB's string handling also does not automatically release strings, so a lot of RAM is quickly lost.
The way it is done now, with the strings, will absolutely not work.

The EditorGadget should not be a problem; 20 to 30 MB can be handled easily.

Scintilla, yes, but I don't think it's that fast.

There is code in the forum that reads large CSV database files quickly.
If you search a little you will find a lot, and you don't have to write it yourself.

It seems to be a special module that was needed for Andre's GeoWorld V2.

Since Andre writes that it works fine, this should solve the import problem.

http://forums.purebasic.com/english/vie ... 12&t=70684

Re: Reading huge files

Posted: Wed Jun 24, 2020 9:44 am
by BarryG
Saki wrote:PB string handling does not automatically release strings, so a lot of RAM is quickly lost.
This was fixed with the 5.72 release -> viewtopic.php?p=518399#p518399

Re: Reading huge files

Posted: Wed Jun 24, 2020 10:11 am
by Saki
That seems to work better now.
But it also seems that about 2 GB are still being eaten.
That is not exactly little.

Just a little heads-up:
The maximum possible string length is about 1e9 characters, i.e. roughly 2 GB of bytes :wink:
So you cannot load strings larger than about 1 GB.
You also can't really work with strings that big; anything from 50 MB upwards is no fun anymore.

Re: Reading huge files

Posted: Wed Jun 24, 2020 1:22 pm
by NicTheQuick
What exactly do you want to achieve?