Binary search a sorted file?

forumuser · Post by **forumuser** » Mon Mar 18, 2019 8:18 pm

Hi,

is it possible to use a tree-based search on a very large (about 5 GB) .txt
file to find a line in it that begins with a hash (md5)?
The result that should be returned should be the full line (hash|file name...|size).

E.g.
data.txt (UTF-8 encoded, no BOM)

...
54edea5c805e0b99fee035bb508de1f8|file name f|9254
974b5a3280e449c99921c7e9fbbd97e9|file name a|3549
dfdf2ec41e40b34a484f7163201fbd53|file name b|34
e2f648ae40d234a3892e1455b4dbbe05|file name x|24552
...

Lines are separated by CRLF and the entries are sorted by
their hash (ascending).

Any hints how this could be done in the least amount of time
(probably by using some kind of binary / tree-like search
algorithm)?

Post by **idle** » Mon Mar 18, 2019 10:06 pm

a tree or trie based method would probably be a bit much with a 5gb file as they're loading into memory
I'd instead look at wilberts string find module and try the SSE parallel search or boyer moore
if SSE isn't available.
viewtopic.php?f=12&t=62519

Demivec · Post by **Demivec** » Mon Mar 18, 2019 10:33 pm

Depending on the amount and frequency of searching I think it would be useful to create an index of the file.

The index can be in memory or on disk. The index would include the the offset for each line. Based on a binary search of 4 000 000 000 hashes each search for a hash would need at most 32 comparisons.

Post by **idle** » Mon Mar 18, 2019 11:28 pm

missed that it was sorted. Go with Demivec's suggestion.
rebuild the file so it's numeric and the entries are fixed length with an index's to file name in another file
then you can easily step through it with a binary search without having to build an index into the file

forumuser · Post by **forumuser** » Thu Mar 21, 2019 9:25 am

Solved it by using the (iterative) example code from
https://rosettacode.org/wiki/Binary_search#PureBasic

Maybe this could be more streamlined but in general
I'm happy with the solution. Finding a hash in a 25 GB
test file takes about 3 to 20 milliseconds (seek times
aren't relevant, the file is stored on a nvme ssd)...

Code: Select all


; Return values
; -2 = Hashes can't be compared (non-equal lengths)
; -1 = Hash 1 contains at least one smaller ascii value
;  0 = Hashes are equal
;  1 = Hash 1 contains at least one larger ascii value

; hash1 = Hash to find
; hash2 = Hash from the current line
Procedure.i CompareHashes(hash1.s, hash2.s)
  Protected.i i, result, lenHash1, ascHash1, ascHash2

  ; Hashes are equal
  If hash1 = hash2
    result = 0

  ; Hashes are different
  Else
    lenHash1 = Len(hash1)

    ; Hashes can be compared
    If lenHash1 = Len(hash2)
      For i = 1 To lenHash1
        ascHash1 = Asc(Mid(hash1, i, 1))
        ascHash2 = Asc(Mid(hash2, i, 1))

        If ascHash1 < ascHash2
          result = -1
          Break

        ElseIf ascHash1 > ascHash2
          result = 1
          Break
        EndIf
      Next

    Else
      result = -2
    EndIf
  EndIf

  ProcedureReturn result
EndProcedure


Procedure.i FindHash(hFile, hash.s, low.q, high.q)
  Protected.i ascChar
  Protected.q mid, midPrev ; midPrev = the seek position before going back to find the last previous line feed
  Protected.s line, curHash

  While low <= high
    mid = (low + high) / 2
    FileSeek(hFile, mid)
    ascChar = ReadCharacter(hFile, #PB_UTF8)

    ; Find the previous line feed (Ascii: 10) character (to be able to read the "current" line afterwards) or the
    ; beginning of the file
    midPrev = mid
    Repeat
      mid - 1
      FileSeek(hFile, mid)
      ascChar = ReadCharacter(hFile, #PB_UTF8)
    Until ascChar = 10 Or mid = 0

    ; ReadCharacter() moves the pointer position one step forward so we need to make sure to reset it to 0 if necessary.
    ; Otherwise we would be missing the first char of the file, leading to not comparable hashes...
    If Loc(hFile) = 1
      FileSeek(hFile, 0)
    EndIf

    line = ReadString(hFile, #PB_UTF8)
    curHash = StringField(line, 1, ":")

    Select CompareHashes(hash, curHash)
      Case -1
        high = mid - 1

      Case 0
        ProcedureReturn Val(StringField(line, 3, ":"))

      Case 1
        low = midPrev + 1

      Case -2
        ProcedureReturn -2
    EndSelect
  Wend

  ProcedureReturn -1
EndProcedure


Define.i hFile, result
Define.s file, hashToFind

file = "E:\data.txt"
hashToFind = "974b5a3280e449c99921c7e9fbbd97e9"

hFile = ReadFile(#PB_Any, file, #PB_File_SharedRead|#PB_UTF8)
If hFile
  result = FindHash(hFile, hashToFind, 0, Lof(hFile))
  If result >= 1
    Debug "Hash found!"
  ElseIf result = -2
    Debug "Hash comparison failed!"
  Else
    Debug "Hash NOT found!"
  EndIf
  CloseFile(hFile)
EndIf

PureBasic Forums - English

Binary search a sorted file?

Binary search a sorted file?

Re: Binary search a sorted file?

Re: Binary search a sorted file?

Re: Binary search a sorted file?

Re: Binary search a sorted file?