MrCor wrote: Tue Oct 18, 2022 2:15 pm
Hi guys and dolls,
I am looking for coding suggestions concerning a certain case.
- a textfile (UTF-8)
- filesize is 426MB
- 11.183.522 lines
- each line contains 10-35 characters
- the file is static. This means it is not changing over time.
I want to find a certain string in this textfile.
Really need more information on what you want to do.
Are you looking up many strings frequently? IF No, Use an SSE string find for < 16 byte strings or booyer-moore for > 16
viewtopic.php?t=62519
Do the look up strings normally exist in the set? if yes map(), If no a Trie() as they bail out early
viewtopic.php?t=79453
Or Infratecs sqlite DB above.
If you need a high look up rate, use a bloom filter before going into slower memory structures or file searching. lookup rates for 32 byte strings is around 50 million per second per thread. look up times are ~10 - 20 nano seconds.
viewtopic.php?p=573334
Code: Select all
XIncludeFile "bloom.pbi"
EnableExplicit
Global bloom.ibloom
Global ers.d = 1.0 / 1024 ;max errors
Global size = 1<<22
Global path$,filename$,out$,search$,time$,len,st,et,bfound,a,ct
path$ = GetPathPart(ProgramFilename()) + "bloom.blm"
If FileSize(path$) > 0
bloom = Bloom_load(path$)
Else
bloom = Bloom_New(size,ers) ;size must be power of 2
Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;", 0)
If Filename$
If ReadFile(0,Filename$)
Repeat
out$ = ReadString(0,#PB_UTF8)
bloom\Set(@out$,StringByteLength(out$))
Until Eof(0)
out$=""
EndIf
EndIf
path$ = GetPathPart(ProgramFilename()) + "bloom.blm"
len = bloom\Save(path$)
MessageRequester("bloom","created bloom filter: " + StrF(len / (1024*1024.0),2) + " mb")
EndIf
If bloom
Search$ = InputRequester("Textsearch", "Search for:", "")
If Search$ <> ""
st = ElapsedMilliseconds()
len = StringByteLength(Search$)
For a = 0 To 1000000 ;loop 1m times average time for one lookup is nano seconds
bfound = bloom\Get(@Search$,len)
Next
et = ElapsedMilliseconds()
If bfound
time$ + StrF(et-st) + " nano seconds to find " + search$ + #CRLF$
EndIf
SetClipboardText(time$)
MessageRequester("time to search through " + Str(ct) + " items", time$)
EndIf
bloom\Free()
EndIf
Squint only takes nano seconds to look up or enumerate item from millions of items
viewtopic.php?p=586544
a map will achieve the same rates but if items don't exist squint is faster
lookup is looped 1,000,000 times. time is equivalent to nanoseconds per look up
enumeration is looped 1000 times equivalent to micro seconds per enumeration.
57 nano seconds to find rockbo not found
3 micro seconds to enum 3 items
rockboat.net 19262089
rockborn.net 19262102
rockbottomrx.com 1166911
Code: Select all
XIncludeFile "Squint3.pbi"
UseModule SQUINT
EnableExplicit
Structure item
key.s
pos.i
EndStructure
Global sq.isquint = SquintNew()
Global NewList found.item()
Global ct,et,et1,st,Search$,Out$,Filename$,a,time$,bfound
Filename$ = OpenFileRequester("Choose a file", "", "Searchfile|*.txt;", 0)
If Filename$
If ReadFile(0,Filename$)
Repeat
out$ = ReadString(0,#PB_UTF8)
sq\Set(0,@out$,Loc(0))
ct+1
Until Eof(0)
out$=""
EndIf
EndIf
Procedure CBSquint(*key,value,*userData)
AddElement(found())
found()\key = PeekS(*key,-1,#PB_UTF8)
found()\pos = value
EndProcedure
Search$ = InputRequester("Textsearch", "Search for:", "")
If Search$ <> ""
ClearList(found())
st = ElapsedMilliseconds()
For a = 0 To 1000
ClearList(found())
sq\Enum(@search$,@CBSquint())
out$=""
Next
et = ElapsedMilliseconds()
For a = 0 To 1000000
bfound = sq\get(0,@Search$)
Next
et1 = ElapsedMilliseconds()
ForEach found()
out$ + found()\key + " " + found()\pos + #CRLF$
Next
time$ = StrF(et-st) + " micro seconds to enum " + Str(ListSize(found())) + " items " + #CRLF$
If bfound
time$ + StrF(et1-et) + " nano seconds to find " + search$ + " found" + #CRLF$
Else
time$ + Str(et1-et) + " nano seconds to find " + search$ + " not found" + #CRLF$
EndIf
SetClipboardText(time$ + out$)
MessageRequester("time to search " + Str(ct) + " items", time$ + out$)
EndIf
sq\Free()