Binary search vs. Hash table

Share your advanced PureBasic knowledge/code with the community.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Binary search vs. Hash table

Post by Trond »

netmaestro wrote:
although not quite as fast as list insertion or sorted array search
I agree with you on the list insertion, but I can't on the sorted array search. ....
idle wrote:So for 10,000 elements a binary search is often faster but compared to 1,000,000 elements there's no comparison.
For 10 000 elements a binary search will definetely win. Of course, the hash table wins by a far amount when the number of items is absolutely enormous. But one million items is not enormous.

I tested my binary search against srod's hash functions.

Number of keys: 1 158 189 strings from a dictionary

Sorting and adding the keys and values to an array for binary search: 844 ms
Adding the keys and values to the hash table: 1625 ms

The lookup was performed 1 000 000 times with the same random keys for both algorithms.
Hash lookup: 1984 ms
Binary search lookup: 2250 ms

Over a trial of one million lookup, binary search loses by only 275 ms, or 0,000275 ms per lookup.

A "clear" win for the hash table, except that the hash table uses absolutely ridiculous amounts of memory. For making the hash table perform like this, I used a table size of two million (almost twice as large as the key set). This means 841 811 additional array elements compared to the binary search.
Also each hash item has a field for table id and the calculated hash. This is 8 bytes extra for each item.
And the hash items are stored in a liked list, again 8 bytes extra for each item.

So, if my calculations are correct, the hash table used an additional 841 811*4 + 8*1 158 189 + 8*1 158 189 bytes, or 20.8 megabytes compared to the binary search array. (Srod, please look over this.)

The size of the dictionary was only 11,5 mb.

If I lower the hash table size to 200 000 elemens, the hash table loses on the lookup, and still uses more memory than the sorted array.

Idle said there was no comparison with one million elements, I disagree. Both are fair contestants depending on your need.

As for PB's new map library. Insertion: 97 seconds. Lookup: 108 seconds. The performance seems horrific. Good thing it's a beta.

My benchmarking code:

Code: Select all


XIncludeFile "repository\hashtable.pb" ; srod's hashtable

Structure SPair
  Name.s
  Value.l
EndStructure

Procedure BS_GetItem(Array Arr.SPair(1), Max.l, Name.s)
  Protected Low, High, Mid
  High = Max
  While Low < High
    Mid = (Low + High) >> 1
    If Arr(Mid)\Name < Name
       Low = Mid + 1
    Else
       High = Mid
    EndIf
  Wend
  If Low < Max And Arr(Low)\Name = Name
    ProcedureReturn Low
  EndIf
  ProcedureReturn -1
EndProcedure

; Load data
Max = 50000
Global Dim Strings.s(Max)
ReadFile(0, "c:\dic-0294.txt")
While Not Eof(0)
  Strings(I) = ReadString(0)
  I + 1
  If I > Max
    Max + 50000
    ReDim Strings.s(Max)
  EndIf
Wend
Max = I - 1
ReDim Strings.s(Max)
MessageRequester("Data loaded", Str(Max) + " words loaded")
; 1 158 189 words loaded


; Add data to array
time = ElapsedMilliseconds()
SortArray(Strings(), #PB_Sort_Ascending)
Global Dim S2.SPair(I)
For I = 0 To Max
  S2(I)\Name = Strings(I)
  S2(I)\Value = I
Next
time = ElapsedMilliseconds()-time
MessageRequester("Data added", "Adding and sorting to array took " + Str(time) + "ms")

; Add data to srod hash table
time = ElapsedMilliseconds()
HTid = HT_CreateHashTable(2000000)
For i = 0 To Max
  HT_AddItem(HTid, Strings(I), I)
Next
time = ElapsedMilliseconds()-time
MessageRequester("Data added", "Data added to srod's hash table took " + Str(time) + "ms")

; Add Data To PB's map
compilerif 0
time = ElapsedMilliseconds()
NewMap Pbmap.i()
For I = 0 To Max
  Pbmap(Strings(I)) = I
Next
time = ElapsedMilliseconds()-time
MessageRequester("Data added", "Data added to map took " + Str(time) + "ms")
compilerendif

; All should of course lookup the same values!
#Tries = 1000000
Global Dim Lookups(#Tries)
For I = 0 To #Tries
  Lookups(I) = Random(Max)
Next

time = ElapsedMilliseconds()
For U = 0 To #Tries
  HT_GetItem(HTid, Strings(Lookups(U)))
Next
MessageRequester("Srod Hash lookup", Str(ElapsedMilliseconds()-time))

time = ElapsedMilliseconds()
For U = 0 To #Tries
  BS_GetItem(S2(), Max, Strings(Lookups(U)))
Next
MessageRequester("Binary search", Str(ElapsedMilliseconds()-time))

compilerif 0
time = ElapsedMilliseconds()
For U = 0 To #Tries
  Pbmap(Strings(Lookups(U)))
Next
MessageRequester("PB Map lookup", Str(ElapsedMilliseconds()-time))
compilerendif
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

My hash table implementation will not be the best one to use Trond because it does spend a lot of time dealing with strings etc. It was always a case of convenience over speed with this particular library.

There are more streamlined implementations available which might better serve this kind of comparison.

**EDIT : ah you are using my old old old hash table which I no longer use! Right, ignore what I said above! Although, this old hash table code is probably even worse because it uses linked lists to deal with collisions.
I may look like a mule, but I'm not a complete ass.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

So, what do I use? Pb's map is a complete disaster in the benchmark.
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

If you can make the text file available then I shall try my updated hash-table library if you like? It does use a binary search when dealing with collisions so the lookups might be quicker. Adding data is likely to be slower however.
I may look like a mule, but I'm not a complete ass.
cxAlex
User
User
Posts: 88
Joined: Fri Oct 24, 2008 11:29 pm
Location: Austria
Contact:

Post by cxAlex »

Her another HT to benchmark:

Test package: http://www.file-upload.net/download-185 ... t.zip.html
German Forum: http://www.purebasic.fr/german/viewtopic.php?t=18474

In your benchmark the table is 3 times faster than binaryserach at getting values and a little bit faster at adding values.

Basicaly the table is programmed from inc but i removed many features which i don't need and optimized it for speed.

Greets, Alex

Edit: I used this wordlist:

http://www.linux-pour-lesnuls.com/tradu ... c-0294.txt
Fred
Administrator
Administrator
Posts: 18153
Joined: Fri May 17, 2002 4:39 pm
Location: France
Contact:

Post by Fred »

What's your hashtable internal lookup table size ? PB uses a 512 slot one for now, so for very big dictionnary, it will have a lot of collisions. This number could be changed in a later version.
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

In the case of my implementations Fred, the user sets that when creating the hash table.
I may look like a mule, but I'm not a complete ass.
Fred
Administrator
Administrator
Posts: 18153
Joined: Fri May 17, 2002 4:39 pm
Location: France
Contact:

Post by Fred »

So, it's '2000000' here ? No wonder it's way faster ;).
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

That would go some way towards explaining it yes! :)
I may look like a mule, but I'm not a complete ass.
Little John
Addict
Addict
Posts: 4775
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John »

srod wrote:In the case of my implementations Fred, the user sets that when creating the hash table.
Maybe this can be added as an option for PB's maps as well:

Code: Select all

NewMap name() [, slots]
Regards, Little John
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8451
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Post by netmaestro »

Please let us specify a map size, Fred. Otherwise it's not good for much. To make a chess engine I need to allocate hundreds of mb for hashing.
BERESHEIT
Fred
Administrator
Administrator
Posts: 18153
Joined: Fri May 17, 2002 4:39 pm
Location: France
Contact:

Post by Fred »

Yes, it's planned.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8451
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Post by netmaestro »

@Fred: Perfect, that's good to hear.

@Trond: Thanks for the work you put into this analysis, it's useful.
BERESHEIT
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

Here is the word list:
http://www.2shared.com/file/7492107/4d6116d1/list.html
The download link is at the bottom of the page.

I have to say, that I made this benchmark a long time ago, when the used hash table wasn't old, but new and shiny. Maybe some new hash tables can beat the old one?

And why not let PB's maps automatically expand?
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Automatically expanding the initial tables size after elements have been added can be a slow process, often requiring some kind of rehashing etc. I don't know of any PB implementation that allows that, although Freak has voiced the possibility of adding this to the map library.
I may look like a mule, but I'm not a complete ass.
Post Reply