Page 1 of 5
Hash tables.
Posted: Sun Nov 19, 2006 5:06 pm
by srod
Hi,
being in need of a dynamic means of storing data indexed on textual information, I turned to creating a simple (but hopefully effective

) library for creating and using 'hash tables'.
For those who've never heard of a hash table;
LOOK HERE (Hey Sparkie, that's not your website is it?

)
Basically then, a hash table acts like an array; but rather than indexing each entry of the array with an integer, a hash table allows you to use a string value.
E.g. I might wish to store peoples telephone numbers in an array. In order to then look-up Bob Smiths phone number, a hash table would allow something along the lines of
The great thing about hash tables, however, is that the speed of data access is far far greater than simply searching an array of strings etc. Indeed, if you've set aside 'sufficient' memory and the 'hashing algorithm' is a good 'un, then access can be almost immediate and requires a single 'lookup'. (Contrast this with a search through a string array etc!)
Of course there can be 'collisions', but a good library will have contingency plans for such cases where more than one item of data attempt to share the same 'index'. In the case of my library if, for example, 3 'keys' generate the same hash, then I shove the 3 accompanying values into a linked list etc.
Anyhow, allocating sufficient memory and a good hashing algorithm are the key.
**Continued below (I don't like long posts!)
Posted: Sun Nov 19, 2006 5:20 pm
by srod
**Continued from above**
My library currently uses the 'djb2' hashing algorithm which is very simple and rumoured to be very effective at generating hash indexes which are quite uniformly spread out (which is the ideal situation).
I've set the library up, however, so that you can very easily switch this algorithm for another one if you so wish.
The ideal situation is in knowing the maximum number of 'rows' of data which you wish to store and then allocating the correponding amount of memory. However, if you do not know the max, then simply allocate as much memory as you feel comfortable. The library will automatically handle all 'collisions'.
Commands
- HT_CreateHashTable(numberofindexes)
-the bigger the value of 'numberofindexes' the less chance of 'collision'.
Note that this value is not an upper-limit to the amount of data you can store in the hash table. There is no such limit (apart from the amount of physical memory!)
The return value is needed to identify the hash table with certain other functions etc.
- HT_AddItem(HTid, key$, userdata)
-'HTid' is the value returned from the HT_CreateHashTable() command.
- HT_IsKeyPresent(HTid, key$)
- HT_GetItem(HTid, key$)
- HT_RemoveItem(HTid, key$)
- HT_RemoveHashTable(HTid)
- HT_CreateHash(numelementsintable, key$)
Library + example in next post.
Posted: Sun Nov 19, 2006 5:23 pm
by srod
Example program:
Code: Select all
;HASH TABLES
;Stephen Rodriguez; November 2006.
;********************************************
;*Example. *
;********************************************
XIncludeFile "PB hashtables.pbi"
DisableExplicit
;Create a hash table of 100 indexes. (No real limit to how many tables you can create,
;heck you can have an array of them if you wish!)
;The 100 indexes is not a limit on the amount of entries we can add to the hash table.
;We could add 100000 'rows' of data if we wished. It is the number of unique hashes allowed
;in this table. The bigger the number of indexes, the less chance that two keys will produce
;the same index etc. This is known as a 'collision' which this library can handle without
;a problem.
;The more data added to a hash table, the more chance of a collision, which are inevitable!
HTid = HT_CreateHashTable(20)
;Let's add some data.
;Hash table entries are indexed by a string; here we use 5 basic 'key' values; 'entry1'
;'entry2' etc.
HT_AddItem(HTid, "entry1",100)
HT_AddItem(HTid, "entry2",200)
HT_AddItem(HTid, "entry3",300)
HT_AddItem(HTid, "entry4",400)
HT_AddItem(HTid, "entry5",500)
;Lets have a look at our data.
;Access is very fast indeed. Most access will be of order O(1). Worst case scenario is O(n),
;but this is very unlikely to ever happen.
For i = 1 To 5
Debug HT_GetItem(HTid, "entry"+Str(i))
Next
PB hashtables.pbi
Code: Select all
;===============================================
; Program: HASH TABLES
; Version: Beta 1.0.
; Date: November 19 2006
; Author: that genius srod!!! Ahem... Swap("Genius", "Mule")
; TargetOS: ALL.
; Compiler: PureBasic 4.0 for Windows.
; Threadsafe? On Windows, yes. Unsure about other OS's!!!
; Licence: Do as you wish with it! No warranties either implied
; or given... blah blah! :)
;===============================================
EnableExplicit
#HT_OVERWRITEDUPLICATES = #True ;Set to #FALSE if duplicates are to be ignored.
;-************************DECLARES
Declare.l HT_CreateHashTable(numberofindexes)
Declare HT_AddItem(HTid, key$, userdata)
Declare.l HT_IsKeyPresent(HTid, key$)
Declare.l HT_GetItem(HTid, key$)
Declare HT_RemoveItem(HTid, key$)
Declare HT_RemoveHashTable(HTid)
Declare.l HT_CreateHash(numelementsintable, key$)
;-************************STRUCTURES/GLOBALS
Structure _HTentry
tableid.l ;Used because all data for all hash tables is stored within the same linked list.
key.s ;Text from which hash is calculated. Stored because of possible 'collisions'.
hash.l ;Calculated hash code Required in order to deal with 'collisions' because all data is placed within the same linked list.
value.l
EndStructure
Global NewList _HTentries._HTentry()
Global _HTmutex=CreateMutex()
;-************************LIBRARY FUNCTIONS
;The following function creates a new hash table.
;If there is no error, it returns an identifier used in various other functions,
;else it returns zero.
Procedure.l HT_CreateHashTable(numberofindexes)
Protected returnvalue
If numberofindexes<=0 : ProcedureReturn 0 : EndIf
returnvalue=AllocateMemory((numberofindexes+1)*SizeOf(LONG))
If returnvalue
PokeL(returnvalue, numberofindexes) ;The 'zero' elements holds the number of elements in the hash table.
EndIf
ProcedureReturn returnvalue
EndProcedure
;The following function adds data to an existing hash table.
;If the key$ already exists, then we proceed as dictated by the value of #HT_OVERWRITEDUPLICATES.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure HT_AddItem(HTid, key$, value)
Protected numelements, hash, index, firstinlist, flag
numelements=PeekL(HTid)
If numelements
;Good to go! First job is to create the hash index.
hash=HT_CreateHash(numelements-1, key$)+1
;hash is now between 1 and 'numelements'.
;Now we need to add the data to the table.
;First check if there is already a list of elements (collision!)
index=HTid+hash*SizeOf(long)
firstinlist=PeekL(index)
LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
If firstinlist=0 ;This will be the first element with this hash.
ResetList(_HTentries())
AddElement(_HTentries())
PokeL(index, _HTentries())
_HTentries()\tableid=HTid
_HTentries()\key=key$
_HTentries()\hash=hash
_HTentries()\value=value
Else ;Collision!
;We need to check if the key already exists. If so, we proceed according to #HT_OVERWRITEDUPLICATES.
ChangeCurrentElement(_HTentries(),firstinlist)
Repeat
If _HTentries()\key=key$ And _HTentries()\tableid=HTid ;Key already exists!
flag=#True
;Check the value of #HT_OVERWRITEDUPLICATES.
If #HT_OVERWRITEDUPLICATES=#True
_HTentries()\value=value
EndIf
Break
ElseIf _HTentries()\hash<>hash Or _HTentries()\tableid<>HTid ;Gone too far through the list.
Break
EndIf
Until NextElement(_HTentries())=0
If flag=0 ;We need to create a new entry for the hash.
ChangeCurrentElement(_HTentries(),firstinlist)
InsertElement(_HTentries())
PokeL(index, _HTentries())
_HTentries()\tableid=HTid
_HTentries()\key=key$
_HTentries()\hash=hash
_HTentries()\value=value
EndIf
EndIf
UnlockMutex(_HTmutex)
EndIf
EndProcedure
;The following function checks a key value to see if it exists within the given hash table.
;A non-zero return indicates that the key is present within the hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure.l HT_IsKeyPresent(HTid, key$)
Protected returnvalue, numelements, hash, index, firstinlist
numelements=PeekL(HTid)
If numelements
;Good to go! First job is to create the hash index.
hash=HT_CreateHash(numelements-1, key$)+1
;hash is now between 1 and 'numelements'.
index=HTid+hash*SizeOf(long)
firstinlist=PeekL(index)
If firstinlist
LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
ChangeCurrentElement(_HTentries(),firstinlist)
Repeat
If _HTentries()\key=key$ And _HTentries()\tableid=HTid ;Key already exists!
returnvalue=_HTentries()
Break
ElseIf _HTentries()\hash<>hash Or _HTentries()\tableid<>HTid ;Gone too far through the list.
Break
EndIf
Until NextElement(_HTentries())=0
UnlockMutex(_HTmutex)
EndIf
EndIf
ProcedureReturn returnvalue
EndProcedure
;The following function retrieves a data value from an existing hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
;The return value does not indicate any error in the case that there is no appropriate
;entry in the underlying hash table.
;Instead, the developer should use the function HT_IsKeyPresent() function to test whether
;the key is present etc.
Procedure.l HT_GetItem(HTid, key$)
Protected returnvalue, numelements, listentry, flag
LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
;First identify the relevant entry in the hash table etc.
If HT_IsKeyPresent(HTid, key$)
returnvalue=_HTentries()\value
EndIf
UnlockMutex(_HTmutex)
ProcedureReturn returnvalue
EndProcedure
;The following function removes an entry from an existing hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure HT_RemoveItem(HTid, key$)
Protected numelements, hash, index, firstinlist, lastinlist, item
numelements=PeekL(HTid)
If numelements
;Good to go! First job is to create the hash index.
hash=HT_CreateHash(numelements-1, key$)+1
;hash is now between 1 and 'numelements'.
;First check if there is list of elements with the corresponding hash.
index=HTid+hash*SizeOf(long)
firstinlist=PeekL(index)
If firstinlist
LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
ChangeCurrentElement(_HTentries(),firstinlist)
Repeat
If _HTentries()\key=key$ And _HTentries()\tableid=HTid ;Key found!
item=_HTentries()
ElseIf _HTentries()\hash<>hash Or _HTentries()\tableid<>HTid ;Gone too far through the list.
Break
EndIf
Until NextElement(_HTentries())=0
lastinlist=_HTentries()
If item ;If there is an entry to remove.
;We now have 3 pointers to elements within the list; firstinlist, item, lastinlist.
;We need to use these to update the list accordingly.
If item=firstinlist ;Need to alter the hash table entry to point to next item.
PokeL(index, PeekL(item-8))
EndIf
If firstinlist=lastinlist ;Need to flag that there are no entries with this hash code.
PokeL(index, 0)
EndIf
ChangeCurrentElement(_HTentries(),item)
DeleteElement(_HTentries())
EndIf
UnlockMutex(_HTmutex)
EndIf
EndIf
EndProcedure
;The following function completely a hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure HT_RemoveHashTable(HTid)
Protected numelements
numelements=PeekL(HTid)
If numelements
LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
ResetList(_HTentries())
ForEach _HTentries()
If _HTentries()\tableid=HTid
DeleteElement(_HTentries())
EndIf
Next
UnlockMutex(_HTmutex)
FreeMemory(HTid)
EndIf
EndProcedure
;-************************HASH FUNCTION
;The following function creates the hash code.
;This can be altered to use a different alogorithm.
;The function MUST return a value between 0 and 'numelementsintable'.
;We currently use the 'djb2' algorithm.
;You do not need to worry about collisions, the library automatically takes care of this.
;NOTE, the following returns the same code when working in either ASCII or UNICODE.
;Probably best to ensure that any replacement algorithm does likewise, so avoid PeekB() etc.
;Either use PeekC() (and take care with pointers) or Asc().
Procedure.l HT_CreateHash(numelementsintable, key$)
Protected hash=5381, i
If key$
For i = 1 To Len(key$)
hash=(hash<<5)+hash + Asc(Mid(key$,i, 1)) ; hash * 33 + asc
Next
EndIf
If hash>=0
ProcedureReturn hash%(numelementsintable+1)
Else
ProcedureReturn -(hash%(numelementsintable+1))
EndIf
EndProcedure
I hope it's useful.
Regards.
Posted: Sun Nov 19, 2006 5:33 pm
by Dare
Very nice, very useful. Thanks.
Re: Hash tables.
Posted: Sun Nov 19, 2006 5:49 pm
by thefool
srod wrote:
The great thing about hash tables, however, is that the speed of data access is far far greater than simply searching an array of strings etc. Indeed, if you've set aside 'sufficient' memory and the 'hashing algorithm' is a good 'un, then access can be almost immediate and requires a single 'lookup'. (Contrast this with a search through a string array etc!)
I just wrote a very simple and very unoptimized array search routine, which allows to do just about the same stuff, and my code wins seriously with the debugger on.
So your hash algo must be a bad one
With debugger off, i can't test it! It simply crashes.
Sorry if i misunderstood something, like if you mentioned that your routine was slower than an actual array search or if it was not the point at all. Im pretty tired and i must admit i didn't read this very well.
Thanks for the interesting explanation anyway, i knew what hash tables were just never bothered to use them. But they do look interesting.
Posted: Sun Nov 19, 2006 5:52 pm
by srod
That is a weird crash.
What ever is going on probably explains the speed test!
I'll have a look.
Thanks.
**EDIT: got it. Will take a few minutes to fix.
Re: Hash tables.
Posted: Sun Nov 19, 2006 6:08 pm
by srod
thefool wrote:I just wrote a very simple and very unoptimized array search routine, which allows to do just about the same stuff, and my code wins seriously with the debugger on.
So your hash algo must be a bad one
With debugger off, i can't test it! It simply crashes.
The crash is fixed.
It was indeed the algo; I'd cocked it right up!
Any chance you can run your test again? Either that or pm me the test code and I'll have a tinker with it? The thing is that there probably needs to be a 'reasonable' amount of data before you'll see a noticeable difference in speed. My guess, about 20 or so entries.
I'll do some speed tests of my own.
Thanks again.
Posted: Sun Nov 19, 2006 6:34 pm
by thefool
Hi!
Now it works as it should
Currently with the 5 items, my code, at 99999 repeats of the for i=1 to 5...
is about 4-5 times faster.
I will pm you the code

Posted: Sun Nov 19, 2006 6:35 pm
by srod
@TheFool: I've just done some tests and the Hash Table completely blows away a straight foward string search if you give it enough work to do.
E.g.
Code: Select all
;HASH TABLES
;Stephen Rodriguez; November 2006.
;********************************************
;*Example. *
;********************************************
XIncludeFile "PB hashtables.pbi"
DisableExplicit
;Time a value lookup from a hash table with 500 elements.
HTid = HT_CreateHashTable(5000)
For i = 0 To 1000
HT_AddItem(HTid, "entry"+Str(i), i)
Next
StartTime = ElapsedMilliseconds()
For loop = 1 To 200 ;200 random searches
index=Random(1000)
x=HT_GetItem(HTid, "entry"+Str(index))
Next
EndTime=ElapsedMilliseconds()
Debug "Hash table time: "+Str(endtime-starttime)
;Now the corresponding array search.
Structure test
name.s
value.l
EndStructure
Dim test.test(1000)
For i = 0 To 1000
test(i)\name="entry"+Str(i)
test(i)\value=i
Next
StartTime = ElapsedMilliseconds()
For loop = 1 To 200 ;200 random searches
index=Random(1000)
i=0
While test(i)\name<>"entry"+Str(index)
i+1
Wend
Next
EndTime=ElapsedMilliseconds()
Debug "Array search: "+Str(endtime-starttime)
Posted: Sun Nov 19, 2006 6:36 pm
by srod
thefool wrote:Hi!
Now it works as it should
Currently with the 5 items, my code, at 99999 repeats of the for i=1 to 5...
is about 4-5 times faster.
I will pm you the code

But that doesn't qualify as a string search - if I've understood correctly what you are doing?

Posted: Sun Nov 19, 2006 6:40 pm
by thefool
What?
See the code. It sure works with a string search. Im not _that_ tired

And it also accepts variables for in & output, just like yours.
Posted: Sun Nov 19, 2006 6:50 pm
by thefool
Nice

The hashtable wins, as expected, when there are a lot of entries. Nice job
Now i gotta go make my own hashalgo hehe
Posted: Sun Nov 19, 2006 6:51 pm
by srod
Yes, but having seen your example (which is a good one) my answer would be that you wouldn't use a hash table for just 5 items of data. There is going to be no advantage there at all, and as you demonstrated quite nicely, a hash table would be a distinct hindrance in such cases. The time in calculating the hashes will outweigh the single lookups required etc.
However, for larger sets of data (as my example above shows) and especially for one off searches, the hash table, by virtue of it's underlying structure, is going to be faster, and, with the more data added, is often going to be considerably faster.
**EDIT** sorry, we posted at the same time!
Posted: Sun Nov 19, 2006 6:55 pm
by srod
thefool wrote:Nice

The hashtable wins, as expected, when there are a lot of entries. Nice job
Now i gotta go make my own hashalgo hehe
Thanks.
And to you Dare; nearly missed your post! Sorry.
Posted: Sun Nov 19, 2006 7:00 pm
by thefool
srod wrote: with the more data added, is often going to be considerably faster.
Absolutely. To people wondering; we talk more than 18 times with the random access of the 1000 elements repeated 999 times
