Binary search vs. Hash table

Trond · Post by **Trond** » Mon Aug 31, 2009 11:57 am

netmaestro wrote:
although not quite as fast as list insertion or sorted array search
I agree with you on the list insertion, but I can't on the sorted array search. ....

idle wrote:So for 10,000 elements a binary search is often faster but compared to 1,000,000 elements there's no comparison.

For 10 000 elements a binary search will definetely win. Of course, the hash table wins by a far amount when the number of items is absolutely enormous. But one million items is not enormous.

I tested my binary search against srod's hash functions.

Number of keys: 1 158 189 strings from a dictionary

Sorting and adding the keys and values to an array for binary search: 844 ms
Adding the keys and values to the hash table: 1625 ms

The lookup was performed 1 000 000 times with the same random keys for both algorithms.
Hash lookup: 1984 ms
Binary search lookup: 2250 ms

Over a trial of one million lookup, binary search loses by only 275 ms, or 0,000275 ms per lookup.

A "clear" win for the hash table, except that the hash table uses absolutely ridiculous amounts of memory. For making the hash table perform like this, I used a table size of two million (almost twice as large as the key set). This means 841 811 additional array elements compared to the binary search.
Also each hash item has a field for table id and the calculated hash. This is 8 bytes extra for each item.
And the hash items are stored in a liked list, again 8 bytes extra for each item.

So, if my calculations are correct, the hash table used an additional 841 811*4 + 8*1 158 189 + 8*1 158 189 bytes, or 20.8 megabytes compared to the binary search array. (Srod, please look over this.)

The size of the dictionary was only 11,5 mb.

If I lower the hash table size to 200 000 elemens, the hash table loses on the lookup, and still uses more memory than the sorted array.

Idle said there was no comparison with one million elements, I disagree. Both are fair contestants depending on your need.

As for PB's new map library. Insertion: 97 seconds. Lookup: 108 seconds. The performance seems horrific. Good thing it's a beta.

My benchmarking code:

Code: Select all


XIncludeFile "repository\hashtable.pb" ; srod's hashtable

Structure SPair
  Name.s
  Value.l
EndStructure

Procedure BS_GetItem(Array Arr.SPair(1), Max.l, Name.s)
  Protected Low, High, Mid
  High = Max
  While Low < High
    Mid = (Low + High) >> 1
    If Arr(Mid)\Name < Name
       Low = Mid + 1
    Else
       High = Mid
    EndIf
  Wend
  If Low < Max And Arr(Low)\Name = Name
    ProcedureReturn Low
  EndIf
  ProcedureReturn -1
EndProcedure

; Load data
Max = 50000
Global Dim Strings.s(Max)
ReadFile(0, "c:\dic-0294.txt")
While Not Eof(0)
  Strings(I) = ReadString(0)
  I + 1
  If I > Max
    Max + 50000
    ReDim Strings.s(Max)
  EndIf
Wend
Max = I - 1
ReDim Strings.s(Max)
MessageRequester("Data loaded", Str(Max) + " words loaded")
; 1 158 189 words loaded


; Add data to array
time = ElapsedMilliseconds()
SortArray(Strings(), #PB_Sort_Ascending)
Global Dim S2.SPair(I)
For I = 0 To Max
  S2(I)\Name = Strings(I)
  S2(I)\Value = I
Next
time = ElapsedMilliseconds()-time
MessageRequester("Data added", "Adding and sorting to array took " + Str(time) + "ms")

; Add data to srod hash table
time = ElapsedMilliseconds()
HTid = HT_CreateHashTable(2000000)
For i = 0 To Max
  HT_AddItem(HTid, Strings(I), I)
Next
time = ElapsedMilliseconds()-time
MessageRequester("Data added", "Data added to srod's hash table took " + Str(time) + "ms")

; Add Data To PB's map
compilerif 0
time = ElapsedMilliseconds()
NewMap Pbmap.i()
For I = 0 To Max
  Pbmap(Strings(I)) = I
Next
time = ElapsedMilliseconds()-time
MessageRequester("Data added", "Data added to map took " + Str(time) + "ms")
compilerendif

; All should of course lookup the same values!
#Tries = 1000000
Global Dim Lookups(#Tries)
For I = 0 To #Tries
  Lookups(I) = Random(Max)
Next

time = ElapsedMilliseconds()
For U = 0 To #Tries
  HT_GetItem(HTid, Strings(Lookups(U)))
Next
MessageRequester("Srod Hash lookup", Str(ElapsedMilliseconds()-time))

time = ElapsedMilliseconds()
For U = 0 To #Tries
  BS_GetItem(S2(), Max, Strings(Lookups(U)))
Next
MessageRequester("Binary search", Str(ElapsedMilliseconds()-time))

compilerif 0
time = ElapsedMilliseconds()
For U = 0 To #Tries
  Pbmap(Strings(Lookups(U)))
Next
MessageRequester("PB Map lookup", Str(ElapsedMilliseconds()-time))
compilerendif

srod · Post by **srod** » Mon Aug 31, 2009 12:05 pm

My hash table implementation will not be the best one to use Trond because it does spend a lot of time dealing with strings etc. It was always a case of convenience over speed with this particular library.

There are more streamlined implementations available which might better serve this kind of comparison.

**EDIT : ah you are using my old old old hash table which I no longer use! Right, ignore what I said above! Although, this old hash table code is probably even worse because it uses linked lists to deal with collisions.

Trond · Post by **Trond** » Mon Aug 31, 2009 12:08 pm

So, what do I use? Pb's map is a complete disaster in the benchmark.

srod · Post by **srod** » Mon Aug 31, 2009 12:25 pm

If you can make the text file available then I shall try my updated hash-table library if you like? It does use a binary search when dealing with collisions so the lookups might be quicker. Adding data is likely to be slower however.

cxAlex · Post by **cxAlex** » Mon Aug 31, 2009 12:27 pm

Her another HT to benchmark:

Test package: http://www.file-upload.net/download-185 ... t.zip.html
German Forum: http://www.purebasic.fr/german/viewtopic.php?t=18474

In your benchmark the table is 3 times faster than binaryserach at getting values and a little bit faster at adding values.

Basicaly the table is programmed from inc but i removed many features which i don't need and optimized it for speed.

Greets, Alex

Edit: I used this wordlist:

http://www.linux-pour-lesnuls.com/tradu ... c-0294.txt

Post by **Fred** » Mon Aug 31, 2009 12:41 pm

What's your hashtable internal lookup table size ? PB uses a 512 slot one for now, so for very big dictionnary, it will have a lot of collisions. This number could be changed in a later version.

srod · Post by **srod** » Mon Aug 31, 2009 12:42 pm

In the case of my implementations Fred, the user sets that when creating the hash table.

Post by **Fred** » Mon Aug 31, 2009 12:43 pm

So, it's '2000000' here ? No wonder it's way faster

.

srod · Post by **srod** » Mon Aug 31, 2009 12:45 pm

That would go some way towards explaining it yes!

Little John · Post by **Little John** » Mon Aug 31, 2009 12:49 pm

srod wrote:In the case of my implementations Fred, the user sets that when creating the hash table.

Maybe this can be added as an option for PB's maps as well:

Code: Select all

NewMap name() [, slots]

Regards, Little John

netmaestro · Post by **netmaestro** » Mon Aug 31, 2009 1:19 pm

Please let us specify a map size, Fred. Otherwise it's not good for much. To make a chess engine I need to allocate hundreds of mb for hashing.

Post by **Fred** » Mon Aug 31, 2009 1:25 pm

Yes, it's planned.

netmaestro · Post by **netmaestro** » Mon Aug 31, 2009 1:30 pm

@Fred: Perfect, that's good to hear.

@Trond: Thanks for the work you put into this analysis, it's useful.

Trond · Post by **Trond** » Mon Aug 31, 2009 1:38 pm

Here is the word list:
http://www.2shared.com/file/7492107/4d6116d1/list.html
The download link is at the bottom of the page.

I have to say, that I made this benchmark a long time ago, when the used hash table wasn't old, but new and shiny. Maybe some new hash tables can beat the old one?

And why not let PB's maps automatically expand?

srod · Post by **srod** » Mon Aug 31, 2009 1:42 pm

Automatically expanding the initial tables size after elements have been added can be a slow process, often requiring some kind of rehashing etc. I don't know of any PB implementation that allows that, although Freak has voiced the possibility of adding this to the map library.