MrCor wrote: Tue Oct 18, 2022 2:15 pm
Hi guys and dolls,
I am looking for coding suggestions concerning a certain case.
- a textfile (UTF-8)
- filesize is 426MB
- 11.183.522 lines
- each line contains 10-35 characters
- the file is static. This means it is not changing over time.
I want to find a certain string in this textfile. 
 
Really need more information on what you want to do. 
Are you looking up many strings frequently? IF No, Use an SSE string find for < 16 byte strings or booyer-moore for > 16
viewtopic.php?t=62519    
Do the look up strings normally exist in the set? if yes map(), If no a Trie() as they bail out early      
viewtopic.php?t=79453   
Or Infratecs sqlite DB above.  
If you need a high look up rate, use a bloom filter before going into slower memory structures or file searching. lookup rates for 32 byte strings is around 50 million per second per thread. look up times are ~10 - 20 nano seconds.      
viewtopic.php?p=573334
Code: Select all
XIncludeFile "bloom.pbi" 
EnableExplicit 
Global bloom.ibloom  
Global ers.d = 1.0 / 1024    ;max errors   
Global size = 1<<22               
Global path$,filename$,out$,search$,time$,len,st,et,bfound,a,ct
path$ = GetPathPart(ProgramFilename()) + "bloom.blm"
If FileSize(path$) > 0 
  bloom = Bloom_load(path$) 
Else 
  bloom = Bloom_New(size,ers) ;size must be power of 2 
  Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;", 0)
  If Filename$
    If ReadFile(0,Filename$) 
      Repeat 
        out$ = ReadString(0,#PB_UTF8) 
        bloom\Set(@out$,StringByteLength(out$))
      Until Eof(0)
      out$=""
    EndIf 
  EndIf 
  path$ = GetPathPart(ProgramFilename()) + "bloom.blm"
  len = bloom\Save(path$) 
  MessageRequester("bloom","created bloom filter: " + StrF(len / (1024*1024.0),2) + " mb") 
EndIf 
If bloom 
  Search$ = InputRequester("Textsearch", "Search for:", "")
  If Search$ <> ""
    st = ElapsedMilliseconds() 
    len = StringByteLength(Search$)
    For a = 0 To 1000000          ;loop 1m times average time for one lookup is nano seconds 
      bfound = bloom\Get(@Search$,len) 
    Next  
    et = ElapsedMilliseconds() 
   
    If bfound 
      time$ + StrF(et-st) + " nano seconds to find " + search$  + #CRLF$  
    EndIf 
    SetClipboardText(time$) 
    MessageRequester("time to search through " + Str(ct) + " items", time$)
  EndIf 
  bloom\Free()   
EndIf 
Squint only takes nano seconds to look up or enumerate item from millions of items
viewtopic.php?p=586544
a map will achieve the same rates but if items don't exist squint is faster  
lookup is looped 1,000,000 times. time is equivalent to nanoseconds per look up
enumeration is looped 1000 times equivalent to micro seconds per enumeration. 
    
57 nano seconds to find rockbo not found
3 micro seconds to enum 3 items 
rockboat.net 19262089
rockborn.net 19262102
rockbottomrx.com 1166911
  
Code: Select all
XIncludeFile "Squint3.pbi"
UseModule SQUINT 
EnableExplicit 
Structure item 
  key.s 
  pos.i
EndStructure   
Global sq.isquint = SquintNew()
Global NewList found.item() 
Global ct,et,et1,st,Search$,Out$,Filename$,a,time$,bfound   
Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;", 0)
If Filename$
  If ReadFile(0,Filename$) 
    Repeat 
      out$ = ReadString(0,#PB_UTF8) 
      sq\Set(0,@out$,Loc(0)) 
      ct+1 
    Until Eof(0)
    out$=""
  EndIf 
EndIf 
Procedure CBSquint(*key,value,*userData)
  AddElement(found()) 
  found()\key = PeekS(*key,-1,#PB_UTF8)
  found()\pos = value 
EndProcedure
Search$ = InputRequester("Textsearch", "Search for:", "")
 
If Search$ <> ""
  ClearList(found())
  
  st = ElapsedMilliseconds() 
  
  For a = 0 To 1000
    ClearList(found())  
    sq\Enum(@search$,@CBSquint()) 
    out$="" 
  Next 
  
  et = ElapsedMilliseconds() 
  
  For a = 0 To 1000000
    bfound = sq\get(0,@Search$) 
  Next  
  
  et1 = ElapsedMilliseconds() 
  
  ForEach found() 
    out$ + found()\key + " " + found()\pos + #CRLF$ 
  Next 
   
  time$ = StrF(et-st) + " micro seconds to enum " + Str(ListSize(found())) + " items " + #CRLF$
  If bfound 
    time$ + StrF(et1-et) + " nano seconds to find " + search$ + " found"  + #CRLF$  
  Else 
    time$ + Str(et1-et) + " nano seconds to find " +  search$ + " not found"  + #CRLF$  
  EndIf   
    
  SetClipboardText(time$ + out$) 
  
  MessageRequester("time to search " + Str(ct) + " items", time$ + out$)
  
EndIf 
sq\Free()