Large amount of data. Suggestions anyone?

MrCor
User
Posts: 21
Joined: Tue Dec 01, 2015 8:31 pm

Large amount of data. Suggestions anyone?

Post by MrCor »

Hi guys and dolls,

I am looking for coding suggestions concerning a certain case.

- a text file (UTF-8)
- file size is 426 MB
- 11,183,522 lines
- each line contains 10-35 characters
- the file is static. This means it does not change over time.

I want to find a certain string in this text file. Using ReadString() takes a long time to find one string, and finding more strings takes even longer in total. The number of strings to be found will grow in the future, and there are already over 2,000.
So far I have tried several approaches, including putting that large file into an array to shorten loop time. This seems to work, but when I access the array I get a memory error.
No code has worked so far. I get all sorts of memory errors (not at the same point) or, if I build an exe, crashing programs. The debugger also tells me the program quits unexpectedly (I didn't expect it either ;)). This also happens at different stages.
It seems as if the Windows memory manager cannot handle this large amount of data???

Any suggestions anyone? Don't ask me what the program is for, it doesn't matter. Thanks in advance for your trouble.

MrCor
ricardo_sdl
Enthusiast
Posts: 141
Joined: Sat Sep 21, 2019 4:24 pm

Re: Large amount of data. Suggestions anyone?

Post by ricardo_sdl »

You can try SQLite, as an in-memory database or a file database.
You can check my games at:
https://ricardo-sdl.itch.io/
miskox
Enthusiast
Posts: 107
Joined: Sun Aug 27, 2017 7:37 pm
Location: Slovenia

Re: Large amount of data. Suggestions anyone?

Post by miskox »

Can you use FIND or FINDSTR from DOS? Try them to see if they are quick enough.
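
If an external tool is acceptable, something like this could drive FINDSTR from PureBasic and capture its output (an untested sketch; "needle" and "big.txt" are placeholders, not from this thread):

Code: Select all

; Untested sketch: run Windows FINDSTR and read its output.
; "needle" and "big.txt" are placeholders, not from this thread.
Define prog = RunProgram("findstr", ~"/C:\"needle\" big.txt", "", #PB_Program_Open | #PB_Program_Read)
If prog
  While ProgramRunning(prog)
    If AvailableProgramOutput(prog)
      Debug ReadProgramString(prog) ; one matching line per read
    EndIf
  Wend
  CloseProgram(prog)
EndIf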

Saso
NicTheQuick
Addict
Posts: 1504
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Large amount of data. Suggestions anyone?

Post by NicTheQuick »

With a suffix tree you can find many search patterns in a big text quickly.
Another good search algorithm for smaller texts is the Knuth–Morris–Pratt algorithm.
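
For illustration, a minimal KMP sketch (FindKMP() is a hypothetical helper, not taken from anywhere in this thread):

Code: Select all

; Untested sketch of Knuth–Morris–Pratt substring search.
Procedure.i FindKMP(Text$, Pattern$)
  Protected m = Len(Pattern$), n = Len(Text$)
  Protected Dim fail(m)
  Protected i, k
  If m = 0
    ProcedureReturn 0
  EndIf
  ; Failure table: fail(i) = length of the longest proper prefix of
  ; Pattern$ that is also a suffix of its first i characters.
  For i = 2 To m
    While k > 0 And Mid(Pattern$, k + 1, 1) <> Mid(Pattern$, i, 1)
      k = fail(k)
    Wend
    If Mid(Pattern$, k + 1, 1) = Mid(Pattern$, i, 1)
      k + 1
    EndIf
    fail(i) = k
  Next
  ; Scan the text, reusing already-matched prefixes on a mismatch.
  k = 0
  For i = 1 To n
    While k > 0 And Mid(Pattern$, k + 1, 1) <> Mid(Text$, i, 1)
      k = fail(k)
    Wend
    If Mid(Pattern$, k + 1, 1) = Mid(Text$, i, 1)
      k + 1
    EndIf
    If k = m
      ProcedureReturn i - m + 1 ; 1-based position of the first match
    EndIf
  Next
  ProcedureReturn 0 ; not found
EndProcedure

Debug FindKMP("abcabcabd", "abcabd") ; -> 4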
The English grammar is freeware, you can use it freely - but it's not Open Source, i.e. you cannot change it or publish it in an altered way.
spikey
Enthusiast
Posts: 750
Joined: Wed Sep 22, 2010 1:17 pm
Location: United Kingdom

Re: Large amount of data. Suggestions anyone?

Post by spikey »

SQLite includes full-text search functions too, see https://www.sqlite.org/fts3.html.
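
A rough sketch of how that could look (untested; it assumes the SQLite build used by PureBasic has the FTS module compiled in, which is worth verifying):

Code: Select all

; Untested sketch: full-text search through an FTS virtual table.
UseSQLiteDatabase()

Define DB = OpenDatabase(#PB_Any, ":memory:", "", "")
If DB
  If DatabaseUpdate(DB, "CREATE VIRTUAL TABLE file USING fts4(line)")
    DatabaseUpdate(DB, "INSERT INTO file VALUES('rock bottom')")
    ; MATCH uses the full-text index instead of scanning every row
    If DatabaseQuery(DB, "SELECT rowid, line FROM file WHERE line MATCH 'rock'")
      While NextDatabaseRow(DB)
        Debug GetDatabaseString(DB, 0) + ": " + GetDatabaseString(DB, 1)
      Wend
      FinishDatabaseQuery(DB)
    EndIf
  EndIf
  CloseDatabase(DB)
EndIf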
infratec
Always Here
Posts: 7577
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Large amount of data. Suggestions anyone?

Post by infratec »

A short test:

Code: Select all

EnableExplicit

ImportC ""
  sqlite3_backup_init(pDest, zDestName.p-utf8, pSource, zSourceName.p-utf8)
  sqlite3_backup_step(sqlite3_backup, nPage)
  sqlite3_backup_finish(sqlite3_backup)
  sqlite3_errcode(db)
EndImport

;The following backs up one SQLite database to a second SQLite database. Can be used to dump a memory-based SQLite database to disk
;or to dump a disk-based SQLite database to a memory one etc.
;Any existing data in the destination database will be deleted.
;Returns #True if successful.
Procedure.i SQLite_BackupSqliteDatabase(sourceDB.i, destinationDB.i)
  
  Protected result, backUp
  
  
  If IsDatabase(sourceDB) And IsDatabase(destinationDB)
    backUp = sqlite3_backup_init(DatabaseID(destinationDB), "main", DatabaseID(sourceDB), "main")
    If backUp
      sqlite3_backup_step(backUp, -1)
      If sqlite3_backup_finish(backUp) = 0 ;#SQLITE_OK
        result = #True   
      EndIf
    EndIf
  EndIf
  
  ProcedureReturn result
  
EndProcedure




Define.i File, DBFile, DB
Define.q StartTime, EndTime
Define Filename$, Line$, Found$, Search$, DBFilename$


UseSQLiteDatabase()

Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;*.db3", 0)
If Filename$
  
  If GetExtensionPart(Filename$) = "db3"
    DBFile = OpenDatabase(#PB_Any, Filename$, "", "")
    If DBFile
      DB = OpenDatabase(#PB_Any, ":memory:", "", "")
      If DB
        SQLite_BackupSqliteDatabase(DBFile, DB)
      EndIf
      CloseDatabase(DBFile)
    EndIf
  Else
    DB = OpenDatabase(#PB_Any, ":memory:", "", "")
    If DB
      DatabaseUpdate(DB, "CREATE TABLE file (line text)")
      File = ReadFile(#PB_Any, Filename$)
      If File
        DatabaseUpdate(DB, "BEGIN")
        While Not Eof(File)
          Line$ = ReadString(File)
          Line$ = ReplaceString(Line$, "'", "''")
          DatabaseUpdate(DB, "INSERT INTO file VALUES('" + Line$ + "')")
        Wend
        DatabaseUpdate(DB, "COMMIT")
        ;DatabaseUpdate(DB, "CREATE INDEX idx_line ON file (line)")
        CloseFile(File)
        
        DBFilename$ = GetPathPart(Filename$) + GetFilePart(Filename$, #PB_FileSystem_NoExtension) + ".db3"
        File = CreateFile(#PB_Any, DBFilename$)
        If File
          CloseFile(File)
          DBFile = OpenDatabase(#PB_Any, DBFilename$, "", "")
          SQLite_BackupSqliteDatabase(DB, DBFile)
          CloseDatabase(DBFile)
        EndIf
      EndIf
    EndIf
  EndIf
  
  If IsDatabase(DB)
    Repeat
      Search$ = InputRequester("Textsearch", "Search for:", "")
      If Search$ <> ""
        Debug Search$
        StartTime = ElapsedMilliseconds()
        Search$ = ReplaceString(Search$, "'", "''") ; escape quotes, as the inserts above do
        If DatabaseQuery(DB, "SELECT rowid, line FROM file WHERE line LIKE '%" + Search$ + "%'")
          Found$ = ""
          While NextDatabaseRow(DB)
            Found$ + RSet(GetDatabaseString(DB, 0), 8) + ": " + GetDatabaseString(DB, 1) + #LF$ ; 8 digits covers 11 million rowids
          Wend
          FinishDatabaseQuery(DB)
          EndTime = ElapsedMilliseconds()
          Found$ = "Needed: " + Str(EndTime - StartTime) + "ms" + #LF$ + #LF$ + Found$
          MessageRequester("Locations", Found$)
        EndIf
      EndIf
    Until Search$ = ""
    CloseDatabase(DB)
  EndIf
  
EndIf
Once the text file has been imported, you can open the generated db3 file directly in the next run.
If too many locations are found, the MessageRequester() does not open.

The search is done in the in-memory database.
If your RAM is low, you can try how fast it is with the disk-based database.

At the moment the search is case sensitive.
If you don't want this, you need LCase() and LOWER(), but that costs speed.
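
For example, the case-insensitive variant of the query could look like this (untested sketch; note that SQLite's LOWER() only folds ASCII letters by default):

Code: Select all

; Untested sketch: lowercase both sides of the comparison.
; This forces a full scan with a LOWER() call per row, so it is slower.
DatabaseQuery(DB, "SELECT rowid, line FROM file WHERE LOWER(line) LIKE '%" + LCase(Search$) + "%'")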
idle
Always Here
Posts: 5836
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Large amount of data. Suggestions anyone?

Post by idle »

MrCor wrote: Tue Oct 18, 2022 2:15 pm Hi guys and dolls,

I am looking for coding suggestions concerning a certain case.

- a text file (UTF-8)
- file size is 426 MB
- 11,183,522 lines
- each line contains 10-35 characters
- the file is static. This means it does not change over time.

I want to find a certain string in this text file.
Really need more information on what you want to do.

Are you looking up many strings frequently? If no, use an SSE string find for strings < 16 bytes or Boyer-Moore for > 16.
viewtopic.php?t=62519

Do the lookup strings normally exist in the set? If yes, use a Map(); if no, a Trie(), as tries bail out early (see the Map() sketch below the link).
viewtopic.php?t=79453
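
A minimal sketch of the Map() route (untested; exact-line lookups only, and "big.txt" and "needle" are placeholders):

Code: Select all

; Untested sketch: load every line into a Map for fast exact lookups.
NewMap lineAt.i()
Define line$
Define file = ReadFile(#PB_Any, "big.txt") ; placeholder filename
If file
  While Not Eof(file)
    line$ = ReadString(file, #PB_UTF8)
    lineAt(line$) = Loc(file) ; remember where each line was found
  Wend
  CloseFile(file)
EndIf
If FindMapElement(lineAt(), "needle") ; placeholder search string
  Debug "found near byte offset " + Str(lineAt())
EndIf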

Or infratec's SQLite DB above.

If you need a high lookup rate, use a Bloom filter before going into slower memory structures or file searching. Lookup rates for 32-byte strings are around 50 million per second per thread; lookup times are ~10-20 nanoseconds.
viewtopic.php?p=573334

Code: Select all

XIncludeFile "bloom.pbi" 
EnableExplicit 

Global bloom.ibloom  
Global ers.d = 1.0 / 1024    ;max errors   
Global size = 1<<22               
Global path$,filename$,out$,search$,time$,len,st,et,bfound,a,ct

path$ = GetPathPart(ProgramFilename()) + "bloom.blm"
If FileSize(path$) > 0 
  bloom = Bloom_load(path$) 
Else 
  bloom = Bloom_New(size,ers) ;size must be power of 2 
  Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;", 0)
  If Filename$
    If ReadFile(0,Filename$) 
      Repeat 
        out$ = ReadString(0,#PB_UTF8) 
        bloom\Set(@out$,StringByteLength(out$))
        ct + 1 ; count items for the requester title below
      Until Eof(0)
      out$=""
    EndIf 
  EndIf 
  path$ = GetPathPart(ProgramFilename()) + "bloom.blm"
  len = bloom\Save(path$) 
  MessageRequester("bloom","created bloom filter: " + StrF(len / (1024*1024.0),2) + " mb") 
EndIf 

If bloom 
  Search$ = InputRequester("Textsearch", "Search for:", "")
  If Search$ <> ""
    st = ElapsedMilliseconds() 
    len = StringByteLength(Search$)
    For a = 0 To 1000000          ;loop 1M times: total ms equals average nanoseconds per lookup
      bfound = bloom\Get(@Search$,len) 
    Next  
    et = ElapsedMilliseconds() 
   
    If bfound 
      time$ + Str(et-st) + " nano seconds to find " + search$ + #CRLF$
    EndIf 
    SetClipboardText(time$) 
    MessageRequester("time to search through " + Str(ct) + " items", time$)
  EndIf 
  bloom\Free()   
EndIf 
Squint only takes nanoseconds to look up or enumerate an item from millions of items
viewtopic.php?p=586544
A map will achieve the same rates, but if items don't exist, Squint is faster.

The lookup is looped 1,000,000 times, so the time is equivalent to nanoseconds per lookup.
The enumeration is looped 1000 times, equivalent to microseconds per enumeration.
57 nano seconds to find rockbo not found
3 micro seconds to enum 3 items
rockboat.net 19262089
rockborn.net 19262102
rockbottomrx.com 1166911

Code: Select all

XIncludeFile "Squint3.pbi"

UseModule SQUINT 
EnableExplicit 

Structure item 
  key.s 
  pos.i
EndStructure   

Global sq.isquint = SquintNew()
Global NewList found.item() 
Global ct,et,et1,st,Search$,Out$,Filename$,a,time$,bfound   

Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;", 0)
If Filename$
  If ReadFile(0,Filename$) 
    Repeat 
      out$ = ReadString(0,#PB_UTF8) 
      sq\Set(0,@out$,Loc(0)) 
      ct+1 
    Until Eof(0)
    out$=""
  EndIf 
EndIf 

Procedure CBSquint(*key,value,*userData)
  AddElement(found()) 
  found()\key = PeekS(*key,-1,#PB_UTF8)
  found()\pos = value 
EndProcedure

Search$ = InputRequester("Textsearch", "Search for:", "")
 
If Search$ <> ""
  ClearList(found())
  
  st = ElapsedMilliseconds() 
  
  For a = 0 To 1000
    ClearList(found())  
    sq\Enum(@search$,@CBSquint()) 
    out$="" 
  Next 
  
  et = ElapsedMilliseconds() 
  
  For a = 0 To 1000000
    bfound = sq\get(0,@Search$) 
  Next  
  
  et1 = ElapsedMilliseconds() 
  
  ForEach found() 
    out$ + found()\key + " " + found()\pos + #CRLF$ 
  Next 
   
  time$ = Str(et-st) + " micro seconds to enum " + Str(ListSize(found())) + " items " + #CRLF$
  If bfound 
    time$ + Str(et1-et) + " nano seconds to find " + search$ + " found" + #CRLF$
  Else 
    time$ + Str(et1-et) + " nano seconds to find " + search$ + " not found" + #CRLF$
  EndIf
  EndIf   
    
  SetClipboardText(time$ + out$) 
  
  MessageRequester("time to search " + Str(ct) + " items", time$ + out$)
  
EndIf 

sq\Free()

GedB
Addict
Posts: 1313
Joined: Fri May 16, 2003 3:47 pm
Location: England

Re: Large amount of data. Suggestions anyone?

Post by GedB »

Have you tried using the Regular Expression library?

https://www.purebasic.com/documentation ... index.html
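
For example (an untested sketch; the pattern and filename are placeholders):

Code: Select all

; Untested sketch: scan the file line by line with a precompiled regex.
Define regex = CreateRegularExpression(#PB_Any, "rock\w*") ; placeholder pattern
Define file = ReadFile(#PB_Any, "big.txt")                 ; placeholder filename
Define line$, lineNo
If regex And file
  While Not Eof(file)
    lineNo + 1
    line$ = ReadString(file, #PB_UTF8)
    If MatchRegularExpression(regex, line$)
      Debug Str(lineNo) + ": " + line$
    EndIf
  Wend
  CloseFile(file)
  FreeRegularExpression(regex)
EndIf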