Greetings to all,
I have a large file, 150 GB+, that needs to be parsed. That probably rules out reading the whole file into memory, since it would require quite a machine.
Would it be better to read chunks of the file into memory and then let a thread do the work while the rest of the file is being read, repeating the procedure until the end of the file? Or to read a part of the file and then split the work among worker threads on that string list?
Could someone share a bit of code showing how to protect (and keep continuous) the reading of the file and the writing to the output file in such a case?
Thanks in advance,
Bruno
Multithreaded read/write/parsing of files
Re: Multithreaded read/write/parsing of files
Read it into an SQLite/MySQL database, then you can query it. Parsing entirely in memory will be temporary and will get slow as the data grows.
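A minimal sketch of that import step in PureBasic (the file name "bigfile.txt", the database name and the one-column table layout are assumptions for illustration, not from the post): read the file line by line and insert each line inside a single transaction, then run queries against the table afterwards instead of re-reading the 150 GB file.

Code:

EnableExplicit
UseSQLiteDatabase()

Define line$

If OpenDatabase(0, "parsed.db", "", "", #PB_Database_SQLite)
  DatabaseUpdate(0, "CREATE TABLE IF NOT EXISTS lines (content TEXT);")
  If ReadFile(1, "bigfile.txt")                  ; assumed input file
    DatabaseUpdate(0, "BEGIN;")                  ; one transaction keeps bulk inserts fast
    While Not Eof(1)
      line$ = ReadString(1)
      SetDatabaseString(0, 0, line$)             ; bind to the '?' placeholder
      DatabaseUpdate(0, "INSERT INTO lines (content) VALUES (?);")
    Wend
    DatabaseUpdate(0, "COMMIT;")
    CloseFile(1)
  EndIf
  CloseDatabase(0)
EndIf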
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
- NicTheQuick
- Addict
- Posts: 1227
- Joined: Sun Jun 22, 2003 7:43 pm
- Location: Germany, Saarbrücken
Re: Multithreaded read/write/parsing of files
As far as I know, the file functions are already cached, so it should not make much difference whether you read directly from the file or read the whole file into memory and parse it there. Especially with sequential reading, you should not need to cache big chunks of the file.
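A sketch of that plain sequential approach (the file name is an assumption): read line by line and optionally enlarge PureBasic's internal file buffer with FileBuffersSize() so fewer OS calls are made on a 150 GB file.

Code:

EnableExplicit

Define line$

FileBuffersSize(#PB_Default, 1024 * 1024)  ; 1 MB buffer for files opened from now on

If ReadFile(0, "bigfile.txt")              ; assumed input file
  While Not Eof(0)
    line$ = ReadString(0)
    ; ... parse line$ here ...
  Wend
  CloseFile(0)
EndIf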
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
- User
- Posts: 34
- Joined: Sun Nov 23, 2014 1:18 pm
Re: Multithreaded read/write/parsing of files
Using an SQLite database, as skywalk mentioned, should be especially useful when parsing through the file more than once.
Maybe it even works if you create an in-memory database (I never had such big amounts of data, so I can't say whether the SQLite engine swaps data from RAM to disk in that case).
When using a virtual table you could also use SQLite full-text search.
...just my 2 cents.
Code:
EnableExplicit
UseSQLiteDatabase()

Enumeration
  #db_handle
EndEnumeration

If OpenDatabase(#db_handle, ":memory:", "", "", #PB_Database_SQLite)
  If DatabaseUpdate(#db_handle, "create virtual table test using fts4(column1 varchar(100), column2 varchar(100), column3 varchar(100));")
    ; ...
    If DatabaseUpdate(#db_handle, "insert into test values('Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam', " +
                                  "'nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam', " +
                                  "'erat, sed diam voluptua. At vero eos et accusam et justo duo');")
      If DatabaseQuery(#db_handle, "select * from test where test match 'eos';")
        While NextDatabaseRow(#db_handle)        ; step through the matching rows
          Debug GetDatabaseString(#db_handle, 0)
        Wend
        FinishDatabaseQuery(#db_handle)          ; release the query when done
      Else
        Debug "error searching"
      EndIf
    Else
      Debug "error inserting data"
    EndIf
  Else
    Debug "error creating virtual table"
  EndIf
Else
  Debug "error opening database in memory"
EndIf

End
- NicTheQuick
- Addict
- Posts: 1227
- Joined: Sun Jun 22, 2003 7:43 pm
- Location: Germany, Saarbrücken
Re: Multithreaded read/write/parsing of files
Depending on how complex your parsing process is, there should be no benefit in splitting the file into parts and using more than one worker. Reading from and writing to the file will be the slowest part of your algorithm, so in my opinion it is perfectly sufficient to use one parser thread which reads the file line by line.
Independently of this, can you please explain your problem in more detail?
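The single-parser-thread idea could be sketched like this (the input and output file names are assumptions): one thread does the whole read-parse-write loop, started with CreateThread() so the main program stays free for other work. Note that the compiler's thread-safe option should be enabled when using threads.

Code:

EnableExplicit

Procedure ParseFile(*unused)
  Define line$
  If ReadFile(0, "input.txt") And CreateFile(1, "output.txt")  ; assumed file names
    While Not Eof(0)
      line$ = ReadString(0)
      ; ... parsing work on line$ ...
      WriteStringN(1, line$)
    Wend
    CloseFile(0)
    CloseFile(1)
  EndIf
EndProcedure

Define thread.i = CreateThread(@ParseFile(), 0)
WaitThread(thread)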