Reading huge files

HeX0R
Addict
Posts: 992
Joined: Mon Sep 20, 2004 7:12 am
Location: Hell

Re: Reading huge files

Post by HeX0R »

Are those text files from logging actions?
I had to build a similar tool once, here is how I did it:
1.) Read a few lines from the beginning of the file and determine the average line length (not needed if all your lines have the same length, which was not true in my case).
2.) If logging entries always occur at a fixed interval, also get the time between two entries from the few lines you read (this was true in my case).
3.) Define a span in which you want to look at the data, e.g. some date and time, then look 50 steps before and 50 steps after it.
4.) Use the average line length to jump to that "estimated" position in the file.
5.) Read up to the next line feed to be sure you have a complete line.
6.) Check the date of that entry and go back to 4.) until you've reached the correct entry.


When done correctly this is extremely fast, needs almost no memory, and I could even show a live graph while moving the position back and forth.
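An untested sketch of the idea (not my original code): it assumes a plain ASCII log whose lines start with a "yyyy-mm-dd hh:mm:ss" timestamp; the file name, the date mask, the interval and the ParseTimestamp() helper are only examples and have to be adapted to the real log format.

Code:

Procedure.q ParseTimestamp(Line$)
  ; assumes the line starts with "yyyy-mm-dd hh:mm:ss" - adapt the mask to your log
  ProcedureReturn ParseDate("%yyyy-%mm-%dd %hh:%ii:%ss", Left(Line$, 19))
EndProcedure

Procedure.s FindEntryNear(File$, Target.q, IntervalSecs = 60)
  Protected File = ReadFile(#PB_Any, File$)
  Protected Line$, AvgLen.q, Pos.q, Stamp.q, Tries, i
  If File
    ; 1.) average line length from the first 100 lines
    For i = 1 To 100
      AvgLen + Len(ReadString(File)) + 1   ; +1 for the line feed
      If Eof(File)
        Break
      EndIf
    Next
    AvgLen = AvgLen / 100
    ; 4.) - 6.) jump to an estimated position, read a complete line,
    ; compare its timestamp and correct the estimate until it matches
    Pos = Lof(File) / 2
    Repeat
      FileSeek(File, Pos)
      ReadString(File)                     ; skip the (probably cut) line
      Line$ = ReadString(File)             ; first complete line after the jump
      Stamp = ParseTimestamp(Line$)
      If Stamp <= 0
        Break                              ; could not parse the timestamp - give up
      EndIf
      Pos + ((Target - Stamp) / IntervalSecs) * AvgLen
      If Pos < 0
        Pos = 0
      EndIf
      Tries + 1
    Until Abs(Target - Stamp) < IntervalSecs Or Tries > 50
    CloseFile(File)
  EndIf
  ProcedureReturn Line$
EndProcedure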
skywalk
Addict
Posts: 4003
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Reading huge files

Post by skywalk »

Yes, there are many approaches better than reading an entire log file into memory. :shock:
Another approach is to continually add the log to a database.
Then, run SQL queries as needed.
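A rough sketch of that route with SQLite (untested; the table and column names are made up, and the inserts would of course come from parsing the real log lines):

Code:

UseSQLiteDatabase()

; SQLite treats an empty file as a new database
If FileSize("proxylog.db") <= 0
  If CreateFile(0, "proxylog.db")
    CloseFile(0)
  EndIf
EndIf

Define db = OpenDatabase(#PB_Any, "proxylog.db", "", "")
If db
  DatabaseUpdate(db, "CREATE TABLE IF NOT EXISTS log (ts TEXT, login TEXT, url TEXT, bytes INTEGER)")

  ; append entries as they come in - wrap bulk inserts in a transaction,
  ; otherwise every single insert waits for its own disk sync
  DatabaseUpdate(db, "BEGIN")
  SetDatabaseString(db, 0, "2021-03-01 12:00:00")
  SetDatabaseString(db, 1, "jdoe")
  SetDatabaseString(db, 2, "http://example.com/")
  SetDatabaseQuad(db, 3, 12345)
  DatabaseUpdate(db, "INSERT INTO log VALUES (?, ?, ?, ?)")
  DatabaseUpdate(db, "COMMIT")

  ; later: query instead of scanning the text files
  If DatabaseQuery(db, "SELECT login, SUM(bytes) FROM log GROUP BY login")
    While NextDatabaseRow(db)
      Debug GetDatabaseString(db, 0) + " -> " + GetDatabaseString(db, 1) + " bytes"
    Wend
    FinishDatabaseQuery(db)
  EndIf
  CloseDatabase(db)
EndIf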
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi
Of course it is better to load only a defined part of the file.
But that was not the question, and it was to be expected that he would want to load the file completely.
He said he wants it for post-processing, and that can mean many things!
Also, the structure of the file is still unknown, other than that it is line-oriented as usual.
If he wants to search the whole log for a certain entry as quickly as possible, there is no reason not to load the file completely.
That way every existing line is reliably captured and can easily be extracted in case of a hit.
Outputting the hits in the EditorGadget is simple.
It is the simplest solution - 15 to 30 minutes of programming with the information and code he has now!
32 GB of RAM and a fast processor should be available anyway for a 10 GB file.
From his first post it was already clear that this was not possible.
The approach was absolutely wrong; it would never have worked.
One should also be familiar with the pitfalls and problems of the implementation.
But it is certainly a good exercise to deal with this problem.
I don't think it will be a project with professional requirements.

Best Regards Saki
地球上の平和 (Peace on Earth)
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Reading huge files

Post by NicTheQuick »

If you want to work with large binary data and need to access it as if it were in memory, you can use memory-mapped files on Unix systems. On Windows you can use a similar approach (file mapping); see the linked pages for the details.
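On Windows it could look roughly like this - an untested sketch; PureBasic can call the Win32 file-mapping API directly, and the file name is just an example:

Code:

CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  Define File$ = "D:\Text1.txt"
  Define hFile = CreateFile_(File$, #GENERIC_READ, #FILE_SHARE_READ, 0, #OPEN_EXISTING, 0, 0)
  If hFile <> #INVALID_HANDLE_VALUE
    Define hMap = CreateFileMapping_(hFile, 0, #PAGE_READONLY, 0, 0, 0)
    If hMap
      Define *view = MapViewOfFile_(hMap, #FILE_MAP_READ, 0, 0, 0)   ; map the whole file
      If *view
        ; *view now behaves like a memory block the size of the file,
        ; without the file having been read into a buffer first.
        Debug PeekS(*view, 80, #PB_Ascii)   ; peek at the first bytes as ASCII
        UnmapViewOfFile_(*view)
      EndIf
      CloseHandle_(hMap)
    EndIf
    CloseHandle_(hFile)
  EndIf
CompilerEndIf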
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi,
thanks.
But I think this is absolutely easy to implement with PB features, and it will run flawlessly.
He just has to know how to do it and know the pitfalls.
A little bit of know-how is required, of course :wink:
地球上の平和 (Peace on Earth)
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Reading huge files

Post by NicTheQuick »

Of course it is doable in pure PureBasic. But I cannot see the speedup when you first have to copy the whole file to memory, convert it to Unicode (which costs twice the space) and finally split it up into lines. Just do it in one go: while reading the file chunk-wise, split it into lines and do your stuff with each line as you go.

But I need the whole specification to be able to help push this in the right direction.

For example, does it have to be an Array, or can it also be a LinkedList?
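The chunk-wise reading mentioned above could look roughly like this (an untested sketch; HandleLine() is just a placeholder for the per-line work, and the log is assumed to be plain ASCII/UTF-8 with LF or CRLF line endings):

Code:

#CHUNK_SIZE = 1024 * 1024   ; 1 MB per read, tune as needed

Procedure HandleLine(Line$)
  ; placeholder - parse the fields / write them to the database here
EndProcedure

Procedure ProcessFileChunked(File$)
  Protected File = ReadFile(#PB_Any, File$)
  Protected *buf = AllocateMemory(#CHUNK_SIZE)
  Protected Rest$, Chunk$, bytes, p
  If File And *buf
    While Not Eof(File)
      bytes = ReadData(File, *buf, #CHUNK_SIZE)
      Chunk$ = Rest$ + PeekS(*buf, bytes, #PB_Ascii)
      p = FindString(Chunk$, #LF$)
      While p
        HandleLine(RTrim(Left(Chunk$, p - 1), #CR$))   ; strip the CR of CRLF endings
        Chunk$ = Mid(Chunk$, p + 1)
        p = FindString(Chunk$, #LF$)
      Wend
      Rest$ = Chunk$   ; incomplete last line, will be completed by the next chunk
    Wend
    If Rest$ <> ""
      HandleLine(Rest$)   ; file did not end with a line feed
    EndIf
  EndIf
  If File
    CloseFile(File)
  EndIf
  If *buf
    FreeMemory(*buf)
  EndIf
EndProcedure

ProcessFileChunked("D:\Text1.txt")

The repeated Mid() is not the fastest way to split, but it shows the flow; a production version would scan the buffer in place.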
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi, yes, of course it can also be completely binary.
It's simply based on the idea of giving him the easiest possible solution, according to his question.
Reading the whole file beforehand is absolutely superfluous; you can just do everything on demand.
But let's wait for him to get back.
I myself have no further interest in the matter - these are basics. I just wanted to give him a little info.

Best Regards Saki
地球上の平和 (Peace on Earth)
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

I understand that it is necessary to detail the conditions of the task. I often have to get proxy log information from the servers I support. There are several hundred of them, and the daily logs range from several hundred megabytes to several gigabytes and several billion lines. Each record contains a client request: login, IP, time, URL, size of received/transmitted data, etc.
The information should be analyzed, with the possibility of producing reports on resource usage per user. At first I wanted to simply read the logs and build arrays of strings that could then be processed however needed. But it quickly became clear that the scheme should be: read the logs and copy the entries into a database (I plan to use MariaDB here), then run SQL queries through the DBMS. The task at the moment is only the fastest possible reading of the files. After a discussion on another forum, I have so far come up with this piece of code for reading files:

Code:

Global NewList LogString.s()

Procedure ReadFileByStringIntoArray(file$)   ; fills the global list LogString()
  Protected file_handler = ReadFile(#PB_Any, file$)
  If file_handler
    While Eof(file_handler) = 0
      AddElement(LogString())
      LogString() = ReadString(file_handler)
    Wend
    CloseFile(file_handler)
  EndIf
EndProcedure

DisableDebugger

fileName$ = "D:\Text1.txt"

fileSizeMb.q = FileSize(fileName$) / 1024 / 1024
StartTime = Date()
ReadFileByStringIntoArray(fileName$)
TotalTime = Date() - StartTime
If TotalTime = 0 : TotalTime = 1 : EndIf   ; avoid division by zero for small files
SpeedProc_Mb_s = fileSizeMb / TotalTime
CountLogString = ListSize(LogString())
SpeedProc_String_s = CountLogString / TotalTime
EnableDebugger
Debug "Total time = " + Str(TotalTime) + "sec"
Debug "Processing speed " + Str(SpeedProc_Mb_s) + "mb/s"
Debug "Processing speed " + Str(SpeedProc_String_s) + "line/sec"
FirstElement(LogString())
Debug LogString()   ; print the first line of the file
LastElement(LogString())
Debug LogString()   ; print the last line of the file
The processing speed seems acceptable. For a 3.07 GB file (the log is located on an SSD drive) the results are as follows:
Total time = 35sec
Processing Speed 89mb/s
Processing Speed 588698 line/sec

Although in real code, instead of assigning the read line to a list element, the fields would probably be parsed and written to the database.
Marc56us
Addict
Posts: 1479
Joined: Sat Feb 08, 2014 3:26 pm

Re: Reading huge files

Post by Marc56us »

olmak wrote: ...it quickly became clear that the scheme should be: read the logs and copy the entries into a database (I plan to use MariaDB here), then run SQL queries through the DBMS. [...] Although in real code, instead of assigning the read line to a list element, the fields would probably be parsed and written to the database.
In this case, it will be faster to import the data directly into the database and then drop the unwanted data:
LOAD DATA INFILE
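For example, something along these lines - an untested sketch; the table layout, field separator and file path are made up and must match the real log format, and the server must permit the import (FILE privilege resp. local_infile):

Code:

UseMySQLDatabase()

Define db = OpenDatabase(#PB_Any, "host=localhost port=3306 dbname=proxylogs", "user", "password")
If db
  Define sql$
  sql$ = "LOAD DATA INFILE '/var/log/proxy/access.log'"
  sql$ + " INTO TABLE access_log"
  sql$ + " FIELDS TERMINATED BY ' '"
  sql$ + " LINES TERMINATED BY '\n'"
  sql$ + " (ts, login, ip, url, bytes)"
  If DatabaseUpdate(db, sql$) = 0
    Debug DatabaseError()
  EndIf
  CloseDatabase(db)
EndIf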
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Reading huge files

Post by NicTheQuick »

I don't know why you want to reinvent the wheel. Is this just an exercise for you? There are plenty of well-known tools which do exactly what you want.
One major example would be Elasticsearch, and for your scenario there would be its Log Monitoring solution.
But there are also other major tools which can do what you want.

Or you could use inotifywatch to wait for changes in the log files, then grep the things you need and pull them over to your main monitoring system via ssh or a database connection.
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

Marc56us wrote: In this case, it will be faster to import the data directly into the database and then drop the unwanted data:
LOAD DATA INFILE
Thanks for that, I'll try
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

Marc56us wrote: In this case, it will be faster to import the data directly into the database and then drop the unwanted data:
LOAD DATA INFILE
The task has come down precisely to importing a large file into the database. For working with large data sets I chose MariaDB. I use the native mode of working with the MariaDB database (UseMySQLDatabase()). But I have run into a new problem - the LOAD DATA INFILE command does not work.
From PureBasic I can delete the database and its tables, create them, and modify the tables with INSERT commands, but I cannot import data into a table with the LOAD DATA INFILE command.