Hi,
is it possible to use a tree-based search on a very large (about 5 GB) .txt
file to find a line in it that begins with a hash (md5)?
The result that should be returned should be the full line (hash|file name...|size).
E.g.
data.txt (UTF-8 encoded, no BOM)
...
54edea5c805e0b99fee035bb508de1f8|file name f|9254
974b5a3280e449c99921c7e9fbbd97e9|file name a|3549
dfdf2ec41e40b34a484f7163201fbd53|file name b|34
e2f648ae40d234a3892e1455b4dbbe05|file name x|24552
...
Lines are separated by CRLF and the entries are sorted by
their hash (ascending).
Any hints how this could be done in the least amount of time
(probably by using some kind of binary / tree-like search
algorithm)?
Binary search a sorted file?
Re: Binary search a sorted file?
a tree or trie based method would probably be a bit much with a 5gb file as they're loading into memory
I'd instead look at wilberts string find module and try the SSE parallel search or boyer moore
if SSE isn't available.
viewtopic.php?f=12&t=62519
I'd instead look at wilberts string find module and try the SSE parallel search or boyer moore
if SSE isn't available.
viewtopic.php?f=12&t=62519
Windows 11, Manjaro, Raspberry Pi OS


Re: Binary search a sorted file?
Depending on the amount and frequency of searching I think it would be useful to create an index of the file.
The index can be in memory or on disk. The index would include the the offset for each line. Based on a binary search of 4 000 000 000 hashes each search for a hash would need at most 32 comparisons.
The index can be in memory or on disk. The index would include the the offset for each line. Based on a binary search of 4 000 000 000 hashes each search for a hash would need at most 32 comparisons.
Re: Binary search a sorted file?
missed that it was sorted. Go with Demivec's suggestion.
rebuild the file so it's numeric and the entries are fixed length with an index's to file name in another file
then you can easily step through it with a binary search without having to build an index into the file
rebuild the file so it's numeric and the entries are fixed length with an index's to file name in another file
then you can easily step through it with a binary search without having to build an index into the file
Windows 11, Manjaro, Raspberry Pi OS


Re: Binary search a sorted file?
Solved it by using the (iterative) example code from
https://rosettacode.org/wiki/Binary_search#PureBasic
Maybe this could be more streamlined but in general
I'm happy with the solution. Finding a hash in a 25 GB
test file takes about 3 to 20 milliseconds (seek times
aren't relevant, the file is stored on a nvme ssd)...
https://rosettacode.org/wiki/Binary_search#PureBasic
Maybe this could be more streamlined but in general
I'm happy with the solution. Finding a hash in a 25 GB
test file takes about 3 to 20 milliseconds (seek times
aren't relevant, the file is stored on a nvme ssd)...
Code: Select all
; Return values
; -2 = Hashes can't be compared (non-equal lengths)
; -1 = Hash 1 contains at least one smaller ascii value
; 0 = Hashes are equal
; 1 = Hash 1 contains at least one larger ascii value
; hash1 = Hash to find
; hash2 = Hash from the current line
Procedure.i CompareHashes(hash1.s, hash2.s)
Protected.i i, result, lenHash1, ascHash1, ascHash2
; Hashes are equal
If hash1 = hash2
result = 0
; Hashes are different
Else
lenHash1 = Len(hash1)
; Hashes can be compared
If lenHash1 = Len(hash2)
For i = 1 To lenHash1
ascHash1 = Asc(Mid(hash1, i, 1))
ascHash2 = Asc(Mid(hash2, i, 1))
If ascHash1 < ascHash2
result = -1
Break
ElseIf ascHash1 > ascHash2
result = 1
Break
EndIf
Next
Else
result = -2
EndIf
EndIf
ProcedureReturn result
EndProcedure
Procedure.i FindHash(hFile, hash.s, low.q, high.q)
Protected.i ascChar
Protected.q mid, midPrev ; midPrev = the seek position before going back to find the last previous line feed
Protected.s line, curHash
While low <= high
mid = (low + high) / 2
FileSeek(hFile, mid)
ascChar = ReadCharacter(hFile, #PB_UTF8)
; Find the previous line feed (Ascii: 10) character (to be able to read the "current" line afterwards) or the
; beginning of the file
midPrev = mid
Repeat
mid - 1
FileSeek(hFile, mid)
ascChar = ReadCharacter(hFile, #PB_UTF8)
Until ascChar = 10 Or mid = 0
; ReadCharacter() moves the pointer position one step forward so we need to make sure to reset it to 0 if necessary.
; Otherwise we would be missing the first char of the file, leading to not comparable hashes...
If Loc(hFile) = 1
FileSeek(hFile, 0)
EndIf
line = ReadString(hFile, #PB_UTF8)
curHash = StringField(line, 1, ":")
Select CompareHashes(hash, curHash)
Case -1
high = mid - 1
Case 0
ProcedureReturn Val(StringField(line, 3, ":"))
Case 1
low = midPrev + 1
Case -2
ProcedureReturn -2
EndSelect
Wend
ProcedureReturn -1
EndProcedure
Define.i hFile, result
Define.s file, hashToFind
file = "E:\data.txt"
hashToFind = "974b5a3280e449c99921c7e9fbbd97e9"
hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead|#PB_UTF8)
If hFile
result = FindHash(hFile, hashToFind, 0, Lof(hFile))
If result >= 1
Debug "Hash found!"
ElseIf result = -2
Debug "Hash comparison failed!"
Else
Debug "Hash NOT found!"
EndIf
CloseFile(hFile)
EndIf