Reading huge files

HeX0R
Addict
Posts: 992
Joined: Mon Sep 20, 2004 7:12 am
Location: Hell

Re: Reading huge files

Post by HeX0R »

Are those text files from logging actions?
I had to build a similar tool once, here is how I did it:
1.) Read a few lines from the beginning of the file and determine the average line length (not needed if all your lines have the same length, which was not true in my case).
2.) If logging entries always occur at a fixed interval, also get the time between two entries from the few lines you read (this was true in my case).
3.) Define a span in which you want to look at the data, e.g. some date and time, then look 50 steps before and 50 steps after it.
4.) Use the average line length to jump to that "estimated" position in the file.
5.) Read up to the next line feed to be sure you have a complete line.
6.) Check the date of that entry and go back to 4.) until you've reached the correct entry.


When done correctly this is extremely fast, needs almost no memory, and I could even show a live graph while moving the position back and forth.
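An untested sketch of the idea (not my original code): it assumes a plain ASCII log whose lines start with a "yyyy-mm-dd hh:mm:ss" timestamp; the file name, the date mask, the interval and the ParseTimestamp() helper are only examples and have to be adapted to the real log format.

Code:

Procedure.q ParseTimestamp(Line$)
  ; assumes the line starts with "yyyy-mm-dd hh:mm:ss" - adapt the mask to your log
  ProcedureReturn ParseDate("%yyyy-%mm-%dd %hh:%ii:%ss", Left(Line$, 19))
EndProcedure

Procedure.s FindEntryNear(File$, Target.q, IntervalSecs = 60)
  Protected File = ReadFile(#PB_Any, File$)
  Protected Line$, AvgLen.q, Pos.q, Stamp.q, Tries, i
  If File
    ; 1.) average line length from the first 100 lines
    For i = 1 To 100
      AvgLen + Len(ReadString(File)) + 1   ; +1 for the line feed
      If Eof(File)
        Break
      EndIf
    Next
    AvgLen = AvgLen / 100
    ; 4.) - 6.) jump to an estimated position, read a complete line,
    ; compare its timestamp and correct the estimate until it matches
    Pos = Lof(File) / 2
    Repeat
      FileSeek(File, Pos)
      ReadString(File)                     ; skip the (probably cut) line
      Line$ = ReadString(File)             ; first complete line after the jump
      Stamp = ParseTimestamp(Line$)
      If Stamp <= 0
        Break                              ; could not parse the timestamp - give up
      EndIf
      Pos + ((Target - Stamp) / IntervalSecs) * AvgLen
      If Pos < 0
        Pos = 0
      EndIf
      Tries + 1
    Until Abs(Target - Stamp) < IntervalSecs Or Tries > 50
    CloseFile(File)
  EndIf
  ProcedureReturn Line$
EndProcedure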
skywalk
Addict
Posts: 4003
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Reading huge files

Post by skywalk »

Yes, there are many approaches better than reading an entire log file into memory. :shock:
Another approach is to continually add the log to a database.
Then, run SQL queries as needed.
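A rough sketch of that route with SQLite (untested; the table and column names are made up, and the inserts would of course come from parsing the real log lines):

Code:

UseSQLiteDatabase()

; SQLite treats an empty file as a new database
If FileSize("proxylog.db") <= 0
  If CreateFile(0, "proxylog.db")
    CloseFile(0)
  EndIf
EndIf

Define db = OpenDatabase(#PB_Any, "proxylog.db", "", "")
If db
  DatabaseUpdate(db, "CREATE TABLE IF NOT EXISTS log (ts TEXT, login TEXT, url TEXT, bytes INTEGER)")

  ; append entries as they come in - wrap bulk inserts in a transaction,
  ; otherwise every single insert waits for its own disk sync
  DatabaseUpdate(db, "BEGIN")
  SetDatabaseString(db, 0, "2021-03-01 12:00:00")
  SetDatabaseString(db, 1, "jdoe")
  SetDatabaseString(db, 2, "http://example.com/")
  SetDatabaseQuad(db, 3, 12345)
  DatabaseUpdate(db, "INSERT INTO log VALUES (?, ?, ?, ?)")
  DatabaseUpdate(db, "COMMIT")

  ; later: query instead of scanning the text files
  If DatabaseQuery(db, "SELECT login, SUM(bytes) FROM log GROUP BY login")
    While NextDatabaseRow(db)
      Debug GetDatabaseString(db, 0) + " -> " + GetDatabaseString(db, 1) + " bytes"
    Wend
    FinishDatabaseQuery(db)
  EndIf
  CloseDatabase(db)
EndIf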
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi
Of course it is better to load only a defined part of the file.
But that was not the question, and it was to be expected that he would want to load the file completely.
He said he wants it for post-processing, and that can mean many things!
Also, the structure of the file is still unknown, other than that it is line-oriented as usual.
If he wants to search the whole log for a certain entry as quickly as possible, there is no reason not to load the file completely.
That way every existing line is reliably captured and can easily be extracted in case of a hit.
Outputting the hits in the EditorGadget is simple.
It is the simplest solution - 15 to 30 minutes of programming with the information and code he has now!
32 GB of RAM and a fast processor should be available anyway for a 10 GB file.
From his first post it was already clear that this was not possible.
The approach was absolutely wrong; it would never have worked.
One should also be familiar with the pitfalls and problems of the implementation.
But it is certainly a good exercise to deal with this problem.
I don't think it will be a project with professional requirements.

Best Regards Saki
地球上の平和 (Peace on Earth)
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Reading huge files

Post by NicTheQuick »

If you want to work with large binary data and need to access it as if it were in memory, you can use memory-mapped files on Unix systems. On Windows you can use a similar approach (file mapping); see the linked pages for the details.
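On Windows it could look roughly like this - an untested sketch; PureBasic can call the Win32 file-mapping API directly, and the file name is just an example:

Code:

CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  Define File$ = "D:\Text1.txt"
  Define hFile = CreateFile_(File$, #GENERIC_READ, #FILE_SHARE_READ, 0, #OPEN_EXISTING, 0, 0)
  If hFile <> #INVALID_HANDLE_VALUE
    Define hMap = CreateFileMapping_(hFile, 0, #PAGE_READONLY, 0, 0, 0)
    If hMap
      Define *view = MapViewOfFile_(hMap, #FILE_MAP_READ, 0, 0, 0)   ; map the whole file
      If *view
        ; *view now behaves like a memory block the size of the file,
        ; without the file having been read into a buffer first.
        Debug PeekS(*view, 80, #PB_Ascii)   ; peek at the first bytes as ASCII
        UnmapViewOfFile_(*view)
      EndIf
      CloseHandle_(hMap)
    EndIf
    CloseHandle_(hFile)
  EndIf
CompilerEndIf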
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi,
thanks.
But I think this is absolutely easy to implement with PB features, and it will run flawlessly.
He just has to know how to do it and know the pitfalls.
A little bit of know-how is required, of course :wink:
地球上の平和 (Peace on Earth)
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Reading huge files

Post by NicTheQuick »

Of course it is doable in pure PureBasic. But I cannot see the speedup when you first have to copy the whole file to memory, convert it to Unicode (which costs twice the space) and finally split it up into lines. Just do it in one go: while reading the file chunk-wise, split it into lines and do your stuff with each line as you go.

But I need the whole specification to be able to help push this in the right direction.

For example, does it have to be an Array, or can it also be a LinkedList?
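The chunk-wise reading mentioned above could look roughly like this (an untested sketch; HandleLine() is just a placeholder for the per-line work, and the log is assumed to be plain ASCII/UTF-8 with LF or CRLF line endings):

Code:

#CHUNK_SIZE = 1024 * 1024   ; 1 MB per read, tune as needed

Procedure HandleLine(Line$)
  ; placeholder - parse the fields / write them to the database here
EndProcedure

Procedure ProcessFileChunked(File$)
  Protected File = ReadFile(#PB_Any, File$)
  Protected *buf = AllocateMemory(#CHUNK_SIZE)
  Protected Rest$, Chunk$, bytes, p
  If File And *buf
    While Not Eof(File)
      bytes = ReadData(File, *buf, #CHUNK_SIZE)
      Chunk$ = Rest$ + PeekS(*buf, bytes, #PB_Ascii)
      p = FindString(Chunk$, #LF$)
      While p
        HandleLine(RTrim(Left(Chunk$, p - 1), #CR$))   ; strip the CR of CRLF endings
        Chunk$ = Mid(Chunk$, p + 1)
        p = FindString(Chunk$, #LF$)
      Wend
      Rest$ = Chunk$   ; incomplete last line, will be completed by the next chunk
    Wend
    If Rest$ <> ""
      HandleLine(Rest$)   ; file did not end with a line feed
    EndIf
  EndIf
  If File
    CloseFile(File)
  EndIf
  If *buf
    FreeMemory(*buf)
  EndIf
EndProcedure

ProcessFileChunked("D:\Text1.txt")

The repeated Mid() is not the fastest way to split, but it shows the flow; a production version would scan the buffer in place.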
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi, yes, of course it can also be completely binary.
It's simply based on the idea of giving him the easiest possible solution, according to his question.
Reading the whole file beforehand is absolutely superfluous; you can just do everything on demand.
But let's wait for him to get back.
I myself have no further interest in the matter - these are basics. I just wanted to give him a little info.

Best Regards Saki
地球上の平和 (Peace on Earth)
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

I understand that it is necessary to detail the conditions of the task. I often have to get proxy log information from the servers I support. There are several hundred of them, and the daily logs range from several hundred megabytes to several gigabytes and several billion lines. Each record contains a client request: login, IP, time, URL, size of received/transmitted data, etc.
The information should be analyzed, with the possibility of producing reports on resource usage per user. At first I wanted to simply read the logs and build arrays of strings that could then be processed however needed. But it quickly became clear that the scheme should be: read the logs and copy the entries into a database (I plan to use MariaDB here), then run SQL queries through the DBMS. The task at the moment is only the fastest possible reading of the files. After a discussion on another forum, I have so far come up with this piece of code for reading files:

Code:

Global NewList LogString.s()

Procedure ReadFileByStringIntoArray(file$)   ; fills the global list LogString()
  Protected file_handler = ReadFile(#PB_Any, file$)
  If file_handler
    While Eof(file_handler) = 0
      AddElement(LogString())
      LogString() = ReadString(file_handler)
    Wend
    CloseFile(file_handler)
  EndIf
EndProcedure

DisableDebugger

fileName$ = "D:\Text1.txt"

fileSizeMb.q = FileSize(fileName$) / 1024 / 1024
StartTime = Date()
ReadFileByStringIntoArray(fileName$)
TotalTime = Date() - StartTime
If TotalTime = 0 : TotalTime = 1 : EndIf   ; avoid division by zero for small files
SpeedProc_Mb_s = fileSizeMb / TotalTime
CountLogString = ListSize(LogString())
SpeedProc_String_s = CountLogString / TotalTime
EnableDebugger
Debug "Total time = " + Str(TotalTime) + "sec"
Debug "Processing speed " + Str(SpeedProc_Mb_s) + "mb/s"
Debug "Processing speed " + Str(SpeedProc_String_s) + "line/sec"
FirstElement(LogString())
Debug LogString()   ; print the first line of the file
LastElement(LogString())
Debug LogString()   ; print the last line of the file
The processing speed seems acceptable. For a 3.07 GB file (the log is located on an SSD drive) the results are as follows:
Total time = 35sec
Processing Speed 89mb/s
Processing Speed 588698 line/sec

Although in real code, instead of assigning the read line to a list element, the fields would probably be parsed and written to the database.
Marc56us
Addict
Posts: 1479
Joined: Sat Feb 08, 2014 3:26 pm

Re: Reading huge files

Post by Marc56us »

olmak wrote: ...it quickly became clear that the scheme should be: read the logs and copy the entries into a database (I plan to use MariaDB here), then run SQL queries through the DBMS. [...] Although in real code, instead of assigning the read line to a list element, the fields would probably be parsed and written to the database.
In this case, it will be faster to import the data directly into the database and then drop the unwanted data:
LOAD DATA INFILE
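For example, something along these lines - an untested sketch; the table layout, field separator and file path are made up and must match the real log format, and the server must permit the import (FILE privilege resp. local_infile):

Code:

UseMySQLDatabase()

Define db = OpenDatabase(#PB_Any, "host=localhost port=3306 dbname=proxylogs", "user", "password")
If db
  Define sql$
  sql$ = "LOAD DATA INFILE '/var/log/proxy/access.log'"
  sql$ + " INTO TABLE access_log"
  sql$ + " FIELDS TERMINATED BY ' '"
  sql$ + " LINES TERMINATED BY '\n'"
  sql$ + " (ts, login, ip, url, bytes)"
  If DatabaseUpdate(db, sql$) = 0
    Debug DatabaseError()
  EndIf
  CloseDatabase(db)
EndIf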
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Reading huge files

Post by NicTheQuick »

I don't know why you want to reinvent the wheel. Is this just an exercise for you? There are plenty of well-known tools which do exactly what you want.
One major example would be Elasticsearch, and for your scenario there would be its Log Monitoring solution.
But there are also other major tools which can do what you want.

Or you could use inotifywatch to wait for changes in the log files, then grep the things you need and pull them over to your main monitoring system via ssh or a database connection.
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

Marc56us wrote: In this case, it will be faster to import the data directly into the database and then drop the unwanted data:
LOAD DATA INFILE
Thanks for that, I'll try
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

Marc56us wrote: In this case, it will be faster to import the data directly into the database and then drop the unwanted data:
LOAD DATA INFILE
The task has come down precisely to importing a large file into the database. For working with large data sets I chose MariaDB. I use the native mode of working with the MariaDB database (UseMySQLDatabase()). But I have run into a new problem - the LOAD DATA INFILE command does not work.
From PureBasic I can delete the database and its tables, create them, and modify the tables with INSERT commands, but I cannot import data into a table with the LOAD DATA INFILE command.