Multithreaded read/write/parsing of files

Post by bbanelli »

Greetings to all,

I have a large file, 150GB+, that needs to be parsed. That probably rules out reading the whole file into memory, since it would require quite a machine.

Would it be better to read chunks of the file into memory and then let a thread do the work while the rest of the file is being read, repeating the procedure until the end of the file? Or to read a part of the file and then split it among worker threads that work on that string list?

Could someone share a bit of code showing how to protect (and keep continuous) the reading of the file and the writing to the output file in such cases?

Thanks in advance,

Bruno
"If you lie to the compiler, it will get its revenge."
Henry Spencer
https://www.pci-z.com/

Post by skywalk »

Read it into an SQLite/MySQL database; then you can query it. Parsing entirely in memory is only temporary and gets slow as the data grows.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum

Post by NicTheQuick »

As far as I know, the file functions are already cached. So there should not be much difference between reading directly from the file and reading the whole file into memory and parsing it there. Especially if you read sequentially, you should not need to cache big chunks of the file.
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.

Post by fabulouspaul »

Using an SQLite database as skywalk mentioned should be especially useful when parsing through the file more than once.

Maybe it even works if you create an in-memory db (I have never had such big amounts of data, so I can't say whether the SQLite engine swaps data from RAM to disk in that case).
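A ":memory:" database is held entirely in RAM by SQLite, so for an input of this size a file-based database is probably the safer variant. Below is a minimal sketch of such an import, not a tested solution: the file names, the table name "lines" and the one-row-per-line layout are assumptions for illustration, and it assumes a PureBasic version that provides SetDatabaseString().

```purebasic
EnableExplicit

UseSQLiteDatabase()

; Hypothetical file names, for illustration only.
#SourceFile$ = "input.txt"
#DbFile$     = "parsed.sqlite"

Define hIn, hDb, db

; The SQLite plugin opens existing files, so create an empty one first.
; Note: CreateFile() overwrites an existing file of the same name.
hDb = CreateFile(#PB_Any, #DbFile$)
If hDb
  CloseFile(hDb)
EndIf

db  = OpenDatabase(#PB_Any, #DbFile$, "", "", #PB_Database_SQLite)
hIn = ReadFile(#PB_Any, #SourceFile$)

If db And hIn
  DatabaseUpdate(db, "CREATE TABLE lines (content TEXT);")
  ; Wrapping the inserts in a transaction avoids one disk sync per row.
  DatabaseUpdate(db, "BEGIN;")
  While Eof(hIn) = 0
    ; Bind the line to the "?" placeholder instead of building SQL strings.
    SetDatabaseString(db, 0, ReadString(hIn))
    If DatabaseUpdate(db, "INSERT INTO lines (content) VALUES (?);") = 0
      Debug "error inserting: " + DatabaseError()
      Break
    EndIf
  Wend
  DatabaseUpdate(db, "COMMIT;")
Else
  Debug "error opening database or source file"
EndIf

If hIn : CloseFile(hIn) : EndIf
If db  : CloseDatabase(db) : EndIf
```

For a 150GB+ input it would probably make sense to commit every few hundred thousand rows rather than once at the end; the right batch size is something to measure.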

When using a virtual table you could also use SQLite's full-text search.

Code: Select all

EnableExplicit

UseSQLiteDatabase()

Enumeration 
  #db_handle
EndEnumeration

If OpenDatabase(#db_handle, ":memory:", "", "", #PB_Database_SQLite)
  If DatabaseUpdate(#db_handle, "create virtual table test using fts4(column1 varchar(100), column2 varchar(100), column3 varchar(100));")
    ; ... 
    If DatabaseUpdate(#db_handle, "insert into test values('Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam', " +
                                  "'nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam', " +
                                  "'erat, sed diam voluptua. At vero eos et accusam et justo duo');")    
    
      If DatabaseQuery(#db_handle, "select * from test where test match 'eos';")
        ; ...
        FinishDatabaseQuery(#db_handle) ; release the query once the rows have been read
      Else
        Debug "error searching"
      EndIf
    Else
      Debug "error inserting data"
    EndIf   
  Else
    Debug "error creating virtual table"
  EndIf
Else
  Debug "error opening database in memory"
EndIf

End
...just my 2 cents.

Post by NicTheQuick »

Depending on how complex your parsing is, there may be no benefit in splitting the file into parts and using more than one worker. Reading from and writing to the file will be the slowest part of your algorithm. So in my opinion a single parser thread that reads the file line by line is perfectly sufficient.
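A minimal sketch of that single-parser-thread idea, under some assumptions: the file names, the ParseJob structure and the ParseLine() placeholder are hypothetical, and one output line is written per input line. Because only one thread touches the two files, no mutex is needed. Since the thread works with strings, the program should be compiled with the "Create threadsafe executable" option.

```purebasic
EnableExplicit

; Hypothetical job description passed to the parser thread.
Structure ParseJob
  InputFile$
  OutputFile$
  Lines.q      ; number of lines processed, filled in by the thread
EndStructure

; Placeholder for the real parsing step.
Procedure.s ParseLine(Line$)
  ProcedureReturn Trim(Line$)
EndProcedure

; The only thread that reads the input and writes the output.
Procedure ParserThread(*job.ParseJob)
  Protected hIn, hOut
  hIn = ReadFile(#PB_Any, *job\InputFile$)
  If hIn
    hOut = CreateFile(#PB_Any, *job\OutputFile$)
    If hOut
      ; Sequential read: one line in, one parsed line out.
      While Eof(hIn) = 0
        WriteStringN(hOut, ParseLine(ReadString(hIn)))
        *job\Lines + 1
      Wend
      CloseFile(hOut)
    EndIf
    CloseFile(hIn)
  EndIf
EndProcedure

Define job.ParseJob
Define thread

job\InputFile$  = "input.txt"   ; hypothetical names
job\OutputFile$ = "output.txt"

thread = CreateThread(@ParserThread(), @job)
If thread
  ; A GUI program would keep handling events here instead of blocking.
  WaitThread(thread)
  Debug "lines processed: " + Str(job\Lines)
Else
  Debug "error creating thread"
EndIf
```

If the parsing step turns out to be the bottleneck rather than the disk, this layout can later be extended with more workers, but that would need a mutex around any shared queue and around the output file.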

Apart from that, could you please explain your problem in more detail?