
quick reading a file of 15 GByte.... how?

Posted: Sun Mar 24, 2024 7:58 pm
by wimapon
Hi folks,
I have to read a very big file.
The first action is to select the lines containing a certain string,
then take other actions.

If I reduce it to just the reading loop, I have this:

Code: Select all

If ReadFile(0, file$)
  While Eof(0) = 0
    regel$ = ReadString(0)
    ; do something with the line
  Wend
  CloseFile(0)
Else
  MessageRequester("Information", "Cannot open " + file$)
EndIf


This takes a few minutes..... that is too long.

Is there a quicker method of reading?


The file looks like this (but then 15 GByte long; it is a .csv file):

7330159897,1710707880,DK5HH/3,JO43jb,-5,3.570107,PA0SLT,JO33kg,43,0,130,100,3,,1
7330159654,1710707880,DK6DV,JO31lk,-17,3.570112,PA0SLT,JO33kg,43,0,204,178,3,2.6.1,1
7330160649,1710707880,DK6UG,JN49cm,18,3.570108,PA0SLT,JO33kg,43,0,427,167,3,,1
7330164131,1710707880,DK8FT,JN58oe,2,3.570107,PA0SLT,JO33kg,43,0,642,150,3,,1
7330166195,1710707880,DL0HT,JO43jb,16,3.570107,PA0SLT,JO33kg,43,0,130,100,3,,1
7330163233,1710707880,DL0PF,JN68rn,-2,3.57012,PA0SLT,JO33kg,43,0,697,136,3,v1.2.51,1
7330165078,1710707880,DL2ZZ,JO31lo,-3,3.570114,PA0SLT,JO33kg,43,0,185,178,3,,1
7330163492,1710707880,DL3TU,JN48mm,6,3.570116,PA0SLT,JO33kg,43,0,550,163,3,,1
7330167099,1710707880,DL4RU,JN69cr,7,3.570107,PA0SLT,JO33kg,43,0,540,135,3,1.4A Kiwi,1
7330160053,1710707880,DL5ALW,JO51pd,7,3.570109,PA0SLT,JO33kg,43,0,383,126,3,2.6.1,1
7330161831,1710707880,DL7AUQ,JO62vt,-6,3.570108,PA0SLT,JO33kg,43,0,465,94,3,v1.2.49,1
7330162232,1710707880,DL8BBY/1,JO43jb,7,3.570109,PA0SLT,JO33kg,43,0,130,100,3,2.6.

Thank you very much for your help.

Wim

Re: quick reading a file of 15 GByte.... how?

Posted: Sun Mar 24, 2024 8:12 pm
by spikey
Have a look at this thread, it presents several possible alternatives: https://www.purebasic.fr/english/viewtopic.php?p=590388

Re: quick reading a file of 15 GByte.... how?

Posted: Sun Mar 24, 2024 9:22 pm
by AZJIO
For what purpose do you want to load a 15 GB file into memory? For example, a video player does not load the file into memory when watching a movie.

One difference: if you have a UTF-8 file, it will be converted to #PB_Unicode and occupy twice the size in memory. You can read the file in chunks and perform actions such as searching, with one condition: you need to look for #CRLF near the end of each chunk, move the file pointer back to it, and read the next chunk from the beginning of that line.

ReadString() is slow for large files because it reads the file line by line. It is better to use ReadData(), but then you don't have direct access to the data with string functions. And if you used PeekS() on the whole file, you would hold the data twice: the raw bytes from ReadData() plus the string from PeekS(), which is twice as large again; in total that would be 45 GB of RAM.

Code: Select all

#BufferSize = 200
Global *mem
Define File$, idFile, bytes, Format, length
Define *c.Character

File$ = OpenFileRequester("Choose File", "", "Text (.txt)|*.txt|PureBasic (.pb)|*.pb|All files (*.*)|*.*", 0)
If Asc(File$)
	idFile = ReadFile(#PB_Any, File$)
	If idFile
		Format = ReadStringFormat(idFile)
		length = Lof(idFile)
		*mem = AllocateMemory(#BufferSize * 2)
		If *mem
			While Loc(idFile) < length
				bytes = ReadData(idFile, *mem, #BufferSize) ; read the next chunk of raw bytes
				If Not bytes
					Break
				EndIf
				If MessageRequester("Next piece?", PeekS(*mem, bytes, Format), #PB_MessageRequester_YesNo) = #PB_MessageRequester_No
					Break
				EndIf
			Wend
			FreeMemory(*mem)
		EndIf
		CloseFile(idFile)
	EndIf
EndIf
It doesn't work perfectly, but I hope the meaning is clear.

Code: Select all

#BufferSize = 200
Global *mem, idFile, Format
Define File$, bytes, length, offset

Declare Find2(bytes)

File$ = OpenFileRequester("Choose File", "", "Text (.txt)|*.txt|PureBasic (.pb)|*.pb|All files (*.*)|*.*", 0)
If Asc(File$)
	idFile = ReadFile(#PB_Any, File$)
	If idFile
		Format = ReadStringFormat(idFile)
		length = Lof(idFile)
		*mem = AllocateMemory(#BufferSize + 2)
		If *mem
			While Loc(idFile) < length
				bytes = ReadData(idFile, *mem, #BufferSize)
				If Not bytes
					Break
				EndIf
				offset = Find2(bytes) ; negative offset back to the last line break
				Debug offset
				FileSeek(idFile, offset, #PB_Relative) ; re-read from the start of the broken line
			Wend
			FreeMemory(*mem)
		EndIf
		CloseFile(idFile)
	EndIf
EndIf


Procedure Find2(bytes)
	Protected i, offset, IsFound
	Protected *b.Byte

	*b = *mem + bytes - SizeOf(Byte)

	; scan backwards for the last line break in the chunk
	For i = bytes - 1 To 1 Step -1
; 		Debug *b\b
		If *b\b = #LF Or *b\b = #CR
			*b\b = 0
			IsFound = 1
			offset = i - bytes + 1
			Break
		EndIf
		*b - SizeOf(Byte)
	Next

; 	For i = bytes - 1 To 1 Step -1
; ; 		Debug *b\b
; 		If *b\b = #LF
; 			*b - SizeOf(Byte)
; 			If *b\b = #CR
; 				*b\b = 0
; 				IsFound = 1
; 				offset = i - bytes + 1
; 				Break
; 			EndIf
; 		EndIf
; 		*b - SizeOf(Byte)
; 	Next

	If IsFound
		If MessageRequester("Next piece?", "|" + PeekS(*mem, - 1, Format) + "|", #PB_MessageRequester_YesNo) = #PB_MessageRequester_No
			FreeMemory(*mem)
			CloseFile(idFile)
			End
		EndIf
	Else
		Debug "not found"
		MessageRequester("Not found or last", "|" + PeekS(*mem, - 1, Format) + "|")
	EndIf

	ProcedureReturn offset
EndProcedure

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 4:26 am
by idle
It's fine to use ReadString as you are for now, since you're not loading the whole file into memory, just scanning it. But do you need to do multiple queries at run time?
If speed is a concern while parsing the file, you can also divide the work across multiple threads, and then optimize the ReadString pass once you've got the required logic sorted.

If you need to do multiple queries at run time you will want to index the file, and that would still take up a lot of memory.
A simple way: use a map of terms into a list or an array of instances, e.g. for a line like

7330159897,1710707880,DK5HH/3,JO43jb,-5,3.570107,PA0SLT,JO33kg,43,0,130,100,3,,1

so if PA0SLT was the field of interest, you could use a map to a linked list and store the Loc(file) taken before reading each line,
then retrieve the line with FileSeek() at run time; see the sketch below.
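
A minimal sketch of that idea (the field position, the file name and the structure/variable names are illustrative assumptions, not from the original post): one indexing pass stores, per callsign in the 7th field, the byte offsets of the matching lines; afterwards any callsign can be fetched by seeking straight to those offsets.

Code: Select all

; index: callsign -> list of byte offsets of the lines that contain it
Structure Offsets
  List pos.q()
EndStructure

NewMap index.Offsets()

If ReadFile(0, "demo.csv")
  While Not Eof(0)
    linePos.q = Loc(0) ; offset of the line we are about to read
    regel$ = ReadString(0)
    call$ = StringField(regel$, 7, ",") ; 7th field, e.g. "PA0SLT"
    AddElement(index(call$)\pos())
    index(call$)\pos() = linePos
  Wend

  ; run-time query: jump straight to every stored line
  If FindMapElement(index(), "PA0SLT")
    ForEach index()\pos()
      FileSeek(0, index()\pos())
      Debug ReadString(0)
    Next
  EndIf
  CloseFile(0)
EndIf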

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 7:57 am
by wimapon
I am a simple programmer and my style comes from the 1980s;
I like to keep things simple.


let me be clear....

the huge file contains a lot of lines.

I'd like to extract only the lines containing "PA0SLT" and write them to another file.

So the result will be a small file containing only the lines with "PA0SLT" in it.
In that file I will do my things.

It looks like a simple problem, but it takes about 2 minutes of computing on my computer to get the lines,
and that is too long.

Wim

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 8:08 am
by idle
If you only have an HDD it will take a while to read and write the file as you propose, but if you have an SSD you should be able to do it in ~10 to 20 seconds.

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 8:19 am
by infratec
You should definitely import this file into an SQLite database.
Once this is done, you have very fast access to all the data you want.

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 8:25 am
by idle
infratec wrote: Mon Mar 25, 2024 8:19 am You should definitely import this file into an SQLite database.
Once this is done, you have very fast access to all the data you want.
It would certainly make sense if run-time queries are required.

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 8:29 am
by useful
wimapon wrote: Sun Mar 24, 2024 7:58 pm ...
The first action is to select the lines containing a certain string,
then take other actions.
...
If the time spent is critical for you, then this is not a one-time analysis but a periodic one. To make recommendations, I need an idea of the required queries. Most likely we will end up talking about a DBMS; these were created primarily to solve exactly such problems, i.e. information is accumulated while the important columns are indexed in parallel for later analysis.

p.s.
wimapon wrote: Mon Mar 25, 2024 7:57 am ... and my style is coming from the years 1980
My experience dates back to the late 70s :)

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 10:10 am
by Fred
You can also try to increase the internal file buffer to 1 MB to see if it makes any difference: https://www.purebasic.com/documentation ... ssize.html
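
For example, a minimal sketch (the 1 MB size, the file name and the search string are illustrative assumptions): enlarge the buffer right after opening the file, then read as before.

Code: Select all

; Sketch: same line-by-line loop, but with a 1 MB internal buffer.
If ReadFile(0, "demo.csv")
  FileBuffersSize(0, 1048576) ; 1 MB buffer for file 0
  While Eof(0) = 0
    regel$ = ReadString(0)
    If FindString(regel$, "PA0SLT")
      ; keep the line
    EndIf
  Wend
  CloseFile(0)
EndIf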

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 10:16 am
by infratec
A 'slow' example.
But once you have imported the CSV file, it should be very fast to find the call dates for someone.

Code: Select all

EnableExplicit

UseSQLiteDatabase()

Define.i DB, Records, File
Define Filename$, Tablename$, SQL$, Line$

Filename$ = OpenFileRequester("Choose a DB", "", "SQLite|*.db3;*.sqlite;*.sdb", 0)
If Filename$
  
  If GetExtensionPart(Filename$) = ""
    Filename$ + ".sqlite"
  EndIf
  
  If FileSize(Filename$) < 0
    DB = CreateFile(#PB_Any, Filename$)
    If DB
      CloseFile(DB)
    EndIf
  EndIf
  
  DB = OpenDatabase(#PB_Any, Filename$, "", "")
  If DB
    Tablename$ = GetFilePart(Filename$, #PB_FileSystem_NoExtension)
    
    SQL$ = "CREATE TABLE IF NOT EXISTS " + Tablename$ + " (F01 NUMERIC, Timestamp INTEGER, Sign1 TEXT, Sign2 TEXT, F05 INTEGER, Freq REAL, Sign3 TEXT, Sign4 Text, F09 NUMERIC, F10 NUMERIC, F11 NUMERIC, F12 NUMERIC, F13 NUMERIC, Version TEXT, F15 NUMERIC)"
    DatabaseUpdate(DB, SQL$)
    
    SQL$ = "CREATE INDEX IF NOT EXISTS Sign3 ON " + Tablename$ + " (Sign3)"
    DatabaseUpdate(DB, SQL$)
    
    SQL$ = "SELECT COUNT(*) FROM " + Tablename$
    If DatabaseQuery(DB, SQL$)
      If NextDatabaseRow(DB)
        Debug "There are " + GetDatabaseString(DB, 0) + " records in the table"
      EndIf
      FinishDatabaseQuery(DB)
      
      If MessageRequester("Choose", "Import a CSV file?", #PB_MessageRequester_YesNo) = #PB_MessageRequester_Yes
        Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
        If Filename$
          File = ReadFile(#PB_Any, Filename$)
          If File
            DatabaseUpdate(DB, "BEGIN TRANSACTION")
            While Not Eof(File)
              Line$ = ReadString(File)
              If Line$ <> ""
                SQL$ = "INSERT INTO " + Tablename$ + " VALUES ("
                SQL$ + StringField(Line$, 1, ",") + ","
                SQL$ + StringField(Line$, 2, ",") + ","
                SQL$ + "'" + StringField(Line$, 3, ",") + "',"
                SQL$ + "'" + StringField(Line$, 4, ",") + "',"
                SQL$ + StringField(Line$, 5, ",") + ","
                SQL$ + StringField(Line$, 6, ",") + ","
                SQL$ + "'" + StringField(Line$, 7, ",") + "',"
                SQL$ + "'" + StringField(Line$, 8, ",") + "',"
                SQL$ + StringField(Line$, 9, ",") + ","
                SQL$ + StringField(Line$, 10, ",") + ","
                SQL$ + StringField(Line$, 11, ",") + ","
                SQL$ + StringField(Line$, 12, ",") + ","
                SQL$ + StringField(Line$, 13, ",") + ","
                SQL$ + "'" + StringField(Line$, 14, ",") + "',"
                SQL$ + StringField(Line$, 15, ",")
                SQL$ + ")"
;                Debug SQL$
                If DatabaseUpdate(DB, SQL$) = 0
                  Debug DatabaseError()
                EndIf
              EndIf
            Wend
            DatabaseUpdate(DB, "COMMIT")
            CloseFile(File)
          EndIf
        EndIf
      EndIf
      
    Else
      Debug DatabaseError()
    EndIf
    
    SQL$ = "SELECT Timestamp, Sign1, Freq FROM " + Tablename$ + " WHERE Sign3 = 'PA0SLT'"
    If DatabaseQuery(DB, SQL$)
      While NextDatabaseRow(DB)
        Debug FormatDate("%yyyy.%mm.%dd %hh:%ii:%ss", GetDatabaseQuad(DB, 0)) + " " + GetDatabaseString(DB, 1) + " @ " + StrF(GetDatabaseFloat(DB, 2)) + "MHz"
      Wend
      FinishDatabaseQuery(DB)
    Else
      Debug DatabaseError()
    EndIf
    
    CloseDatabase(DB)
  EndIf
  
EndIf

I added an index on the field Sign3 to speed up the current select.

Btw. you can use better field names :wink:

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 10:51 am
by wimapon
Hi Infratec, long time no see!

To start again, here is the whole situation:

Here is the input file, the whole program and the output file.

input file: normally contains 300000000 lines
;=========================================
7330159897,1710707880,DK5HH/3,JO43jb,-5,3.570107,PA0AF,JO33kg,43,0,130,100,3,,1
7330159654,1710707880,DK6DV,JO31lk,-17,3.570112,K4AN,JO33kg,43,0,204,178,3,2.6.1,1
7330160649,1710707880,DK6UG,JN49cm,18,3.570108,L3R,JO33kg,43,0,427,167,3,,1
7330164131,1710707880,DK8FT,JN58oe,2,3.570107,OZ7IT,JO33kg,43,0,642,150,3,,1
7330166195,1710707880,DL0HT,JO43jb,16,3.570107,K9AN,JO33kg,43,0,130,100,3,,1
7330163233,1710707880,DL0PF,JN68rn,-2,3.57012,PA0SLT,JO33kg,43,0,697,136,3,v1.2.51,1
7330165078,1710707880,DL2ZZ,JO31lo,-3,3.570114,ON5KI,JO33kg,43,0,185,178,3,,1
7330163492,1710707880,DL3TU,JN48mm,6,3.570116,LA3JJ,JO33kg,43,0,550,163,3,,1
7330167099,1710707880,DL4RU,JN69cr,7,3.570107,PD0OHW,JO33kg,43,0,540,135,3,1.4A Kiwi,1
7330160053,1710707880,DL5ALW,JO51pd,7,3.570109,PA0SLT,JO33kg,43,0,383,126,3,2.6.1,1
7330161831,1710707880,DL7AUQ,JO62vt,-6,3.570108,PA0JEN,JO33kg,43,0,465,94,3,v1.2.49,1
7330162232,1710707880,DL8BBY/1,JO43jb,7,3.570109,W4R,JO33kg,43,0,130,100,3,2.6.1,1


Code: Select all

; the whole program
;======================
CreateFile(99, "output.txt")
inputfile$ = "demo.csv"

If ReadFile(0, inputfile$)
  While Eof(0) = 0
    regel$ = ReadString(0)
    If FindString(regel$, "PA0SLT") <> 0
      WriteStringN(99, regel$)
    EndIf
  Wend
  CloseFile(0)
Else
  MessageRequester("Information", "Cannot open " + inputfile$)
EndIf

CloseFile(99)


The output file: normally 2000 lines
===============================
7330163233,1710707880,DL0PF,JN68rn,-2,3.57012,PA0SLT,JO33kg,43,0,697,136,3,v1.2.51,1
7330160053,1710707880,DL5ALW,JO51pd,7,3.570109,PA0SLT,JO33kg,43,0,383,126,3,2.6.1,1


I use an SSD and a reasonably fast computer.
The whole process takes about 2 minutes.
I would like to speed this up.....
(I think: starting a database program, filling it up, and asking my question will take more time
than my program???? Am I wrong??)

All my work will be with the output file... that part is fast enough.


Sorry, maybe I was not clear enough.

Wim

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 11:07 am
by useful
wimapon wrote: Mon Mar 25, 2024 10:51 am ... maybe I was not clear enough.
...
You haven't answered the main question. If it is needed only once, then what difference does it make whether the process takes 2 minutes or 2 seconds? If it is needed many times, often, then the cost of a different data organization (a DBMS, for example) will definitely justify itself.

p.s. For example, FileBuffersSize(0, 1000000) gives an increase of about 10%, and you can probably speed things up by another couple of dozen percent, but the main question remains.

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 12:37 pm
by wimapon
It depends on the kind of processing I am doing.
Something between 2 and 10 times a day when I am processing.
And it is rather irritating to wait 2 minutes when you are busy thinking.

Re: quick reading a file of 15 GByte.... how?

Posted: Mon Mar 25, 2024 12:51 pm
by infratec
I edited my example above.

You can create the database file once and afterwards select your requested entries from it.
You can check how long it takes when you don't import the CSV file again.

But ... the Debug output needs more time than writing the data to a new file.
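
For example, a minimal sketch of such a timing check (it assumes the database file from the example above already exists; the example derives the table name from the file name, so an illustrative "demo.sqlite" gives a table "demo"):

Code: Select all

; Sketch: time only the query, skipping the CSV import.
UseSQLiteDatabase()

Define.i DB, t
DB = OpenDatabase(#PB_Any, "demo.sqlite", "", "")
If DB
  t = ElapsedMilliseconds()
  If DatabaseQuery(DB, "SELECT * FROM demo WHERE Sign3 = 'PA0SLT'")
    While NextDatabaseRow(DB) : Wend ; fetch all result rows
    FinishDatabaseQuery(DB)
  EndIf
  Debug "Query took " + Str(ElapsedMilliseconds() - t) + " ms"
  CloseDatabase(DB)
EndIf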