PureBasic Forums - English

Posted: **Sun Nov 19, 2006 5:06 pm**

Hi,

being in need of a dynamic means of storing data indexed on textual information, I turned to creating a simple (but hopefully effective

) library for creating and using 'hash tables'.

For those who've never heard of a hash table; LOOK HERE (Hey Sparkie, that's not your website is it?

)

Basically then, a hash table acts like an array; but rather than indexing each entry of the array with an integer, a hash table allows you to use a string value.

E.g. I might wish to store peoples telephone numbers in an array. In order to then look-up Bob Smiths phone number, a hash table would allow something along the lines of

Code: Select all

bobstel = TelArray("Bob Smith").

The great thing about hash tables, however, is that the speed of data access is far far greater than simply searching an array of strings etc. Indeed, if you've set aside 'sufficient' memory and the 'hashing algorithm' is a good 'un, then access can be almost immediate and requires a single 'lookup'. (Contrast this with a search through a string array etc!)

Of course there can be 'collisions', but a good library will have contingency plans for such cases where more than one item of data attempt to share the same 'index'. In the case of my library if, for example, 3 'keys' generate the same hash, then I shove the 3 accompanying values into a linked list etc.

Anyhow, allocating sufficient memory and a good hashing algorithm are the key.

**Continued below (I don't like long posts!)

Posted: **Sun Nov 19, 2006 5:20 pm**

**Continued from above**

My library currently uses the 'djb2' hashing algorithm which is very simple and rumoured to be very effective at generating hash indexes which are quite uniformly spread out (which is the ideal situation).

I've set the library up, however, so that you can very easily switch this algorithm for another one if you so wish.

The ideal situation is in knowing the maximum number of 'rows' of data which you wish to store and then allocating the correponding amount of memory. However, if you do not know the max, then simply allocate as much memory as you feel comfortable. The library will automatically handle all 'collisions'.

Commands

HT_CreateHashTable(numberofindexes)
-the bigger the value of 'numberofindexes' the less chance of 'collision'.
Note that this value is not an upper-limit to the amount of data you can store in the hash table. There is no such limit (apart from the amount of physical memory!)
The return value is needed to identify the hash table with certain other functions etc.
HT_AddItem(HTid, key$, userdata)
-'HTid' is the value returned from the HT_CreateHashTable() command.
HT_IsKeyPresent(HTid, key$)
HT_GetItem(HTid, key$)
HT_RemoveItem(HTid, key$)
HT_RemoveHashTable(HTid)
HT_CreateHash(numelementsintable, key$)

Library + example in next post.

Posted: **Sun Nov 19, 2006 5:23 pm**

Example program:

Code: Select all

;HASH TABLES
;Stephen Rodriguez; November 2006.

;********************************************
;*Example.                                  *
;********************************************


XIncludeFile "PB hashtables.pbi"
DisableExplicit

;Create a hash table of 100 indexes.  (No real limit to how many tables you can create,
;heck you can have an array of them if you wish!)

;The 100 indexes is not a limit on the amount of entries we can add to the hash table. 
;We could add 100000 'rows' of data if we wished. It is the number of unique hashes allowed
;in this table. The bigger the number of indexes, the less chance that two keys will produce
;the same index etc. This is known as a 'collision' which this library can handle without
;a problem.
;The more data added to a hash table, the more chance of a collision, which are inevitable!


  HTid = HT_CreateHashTable(20)

;Let's add some data.
;Hash table entries are indexed by a string; here we use 5 basic 'key' values; 'entry1'
;'entry2' etc.
  HT_AddItem(HTid, "entry1",100)
  HT_AddItem(HTid, "entry2",200)
  HT_AddItem(HTid, "entry3",300)
  HT_AddItem(HTid, "entry4",400)
  HT_AddItem(HTid, "entry5",500)

;Lets have a look at our data.
;Access is very fast indeed. Most access will be of order O(1). Worst case scenario is O(n),
;but this is very unlikely to ever happen.
For i = 1 To 5
  Debug HT_GetItem(HTid, "entry"+Str(i))
Next

PB hashtables.pbi

Code: Select all

;=============================================== 
; Program:     HASH TABLES
; Version:     Beta 1.0.
; Date:        November 19 2006 
; Author:      that genius srod!!! Ahem...  Swap("Genius", "Mule")
; TargetOS:    ALL.
; Compiler:    PureBasic 4.0 for Windows.
; Threadsafe?  On Windows, yes. Unsure about other OS's!!!
; Licence:     Do as you wish with it! No warranties either implied
;              or given... blah blah! :)
;=============================================== 


EnableExplicit


#HT_OVERWRITEDUPLICATES = #True ;Set to #FALSE if duplicates are to be ignored.


;-************************DECLARES
Declare.l HT_CreateHashTable(numberofindexes)
Declare HT_AddItem(HTid, key$, userdata)
Declare.l HT_IsKeyPresent(HTid, key$)
Declare.l HT_GetItem(HTid, key$)
Declare HT_RemoveItem(HTid, key$)
Declare HT_RemoveHashTable(HTid)
Declare.l HT_CreateHash(numelementsintable, key$)


;-************************STRUCTURES/GLOBALS
Structure _HTentry
  tableid.l ;Used because all data for all hash tables is stored within the same linked list.
  key.s ;Text from which hash is calculated. Stored because of possible 'collisions'.
  hash.l ;Calculated hash code Required in order to deal with 'collisions' because all data is placed within the same linked list.
  value.l
EndStructure

Global NewList _HTentries._HTentry()
Global _HTmutex=CreateMutex()


;-************************LIBRARY FUNCTIONS

;The following function creates a new hash table.
;If there is no error, it returns an identifier used in various other functions,
;else it returns zero.
Procedure.l HT_CreateHashTable(numberofindexes)
  Protected returnvalue
  If numberofindexes<=0 : ProcedureReturn 0 : EndIf
  returnvalue=AllocateMemory((numberofindexes+1)*SizeOf(LONG))
  If returnvalue  
    PokeL(returnvalue, numberofindexes) ;The 'zero' elements holds the number of elements in the hash table.
  EndIf
  ProcedureReturn returnvalue
EndProcedure


;The following function adds data to an existing hash table.
;If the key$ already exists, then we proceed as dictated by the value of #HT_OVERWRITEDUPLICATES.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure HT_AddItem(HTid, key$, value)
  Protected numelements, hash, index, firstinlist, flag
  numelements=PeekL(HTid)
  If numelements
    ;Good to go! First job is to create the hash index.
    hash=HT_CreateHash(numelements-1, key$)+1
    ;hash is now between 1 and 'numelements'.
    ;Now we need to add the data to the table.
    ;First check if there is already a list of elements (collision!)
    index=HTid+hash*SizeOf(long)
    firstinlist=PeekL(index)
    LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
      If firstinlist=0 ;This will be the first element with this hash.
        ResetList(_HTentries())
        AddElement(_HTentries())
        PokeL(index, _HTentries())
        _HTentries()\tableid=HTid
        _HTentries()\key=key$
        _HTentries()\hash=hash
        _HTentries()\value=value
      Else ;Collision!
      ;We need to check if the key already exists. If so, we proceed according to #HT_OVERWRITEDUPLICATES.
        ChangeCurrentElement(_HTentries(),firstinlist)
        Repeat
          If _HTentries()\key=key$ And _HTentries()\tableid=HTid ;Key already exists!
            flag=#True
            ;Check the value of #HT_OVERWRITEDUPLICATES.
            If #HT_OVERWRITEDUPLICATES=#True
              _HTentries()\value=value
            EndIf
            Break
          ElseIf _HTentries()\hash<>hash Or _HTentries()\tableid<>HTid ;Gone too far through the list.
            Break
          EndIf
        Until NextElement(_HTentries())=0
        If flag=0 ;We need to create a new entry for the hash.
          ChangeCurrentElement(_HTentries(),firstinlist)
          InsertElement(_HTentries())
          PokeL(index, _HTentries())
          _HTentries()\tableid=HTid
          _HTentries()\key=key$
          _HTentries()\hash=hash
          _HTentries()\value=value
        EndIf
      EndIf 
    UnlockMutex(_HTmutex)
  EndIf
EndProcedure


;The following function checks a key value to see if it exists within the given hash table.
;A non-zero return indicates that the key is present within the hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure.l HT_IsKeyPresent(HTid, key$)
  Protected returnvalue, numelements, hash, index, firstinlist
  numelements=PeekL(HTid)
  If numelements
    ;Good to go! First job is to create the hash index.
    hash=HT_CreateHash(numelements-1, key$)+1
    ;hash is now between 1 and 'numelements'.
    index=HTid+hash*SizeOf(long)
    firstinlist=PeekL(index)
    If firstinlist
      LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
        ChangeCurrentElement(_HTentries(),firstinlist)
        Repeat
          If _HTentries()\key=key$ And _HTentries()\tableid=HTid ;Key already exists!
            returnvalue=_HTentries()
            Break
          ElseIf _HTentries()\hash<>hash Or _HTentries()\tableid<>HTid ;Gone too far through the list.
            Break
          EndIf
        Until NextElement(_HTentries())=0
      UnlockMutex(_HTmutex)
    EndIf
  EndIf
  ProcedureReturn returnvalue
EndProcedure


;The following function retrieves a data value from an existing hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
;The return value does not indicate any error in the case that there is no appropriate
;entry in the underlying hash table.
;Instead, the developer should use the function HT_IsKeyPresent() function to test whether
;the key is present etc.
Procedure.l HT_GetItem(HTid, key$)
  Protected returnvalue, numelements, listentry, flag
  LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
    ;First identify the relevant entry in the hash table etc.
    If HT_IsKeyPresent(HTid, key$)
      returnvalue=_HTentries()\value
    EndIf
  UnlockMutex(_HTmutex)
  ProcedureReturn returnvalue
EndProcedure


;The following function removes an entry from an existing hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure HT_RemoveItem(HTid, key$)
  Protected numelements, hash, index, firstinlist, lastinlist, item
  numelements=PeekL(HTid)
  If numelements
    ;Good to go! First job is to create the hash index.
    hash=HT_CreateHash(numelements-1, key$)+1
    ;hash is now between 1 and 'numelements'.
    ;First check if there is list of elements with the corresponding hash.
    index=HTid+hash*SizeOf(long)
    firstinlist=PeekL(index)
    If firstinlist
      LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
        ChangeCurrentElement(_HTentries(),firstinlist)
        Repeat
          If _HTentries()\key=key$ And _HTentries()\tableid=HTid ;Key found!
            item=_HTentries()
          ElseIf _HTentries()\hash<>hash Or _HTentries()\tableid<>HTid ;Gone too far through the list.
            Break
          EndIf
        Until NextElement(_HTentries())=0
        lastinlist=_HTentries()
        If item ;If there is an entry to remove.
          ;We now have 3 pointers to elements within the list; firstinlist, item, lastinlist.
          ;We need to use these to update the list accordingly.
          If item=firstinlist ;Need to alter the hash table entry to point to next item.
            PokeL(index, PeekL(item-8))
          EndIf
          If firstinlist=lastinlist ;Need to flag that there are no entries with this hash code.
            PokeL(index, 0)
          EndIf
          ChangeCurrentElement(_HTentries(),item)
          DeleteElement(_HTentries())
        EndIf
      UnlockMutex(_HTmutex)
    EndIf
  EndIf
EndProcedure


;The following function completely a hash table.
;It is the programmer's responsibility to ensure that HTid references a valid hash table!
Procedure HT_RemoveHashTable(HTid)
  Protected numelements
  numelements=PeekL(HTid)
  If numelements
    LockMutex(_HTmutex) ;Ensure that only one thread has access to the linked list right now.
      ResetList(_HTentries())
      ForEach _HTentries()
        If _HTentries()\tableid=HTid
          DeleteElement(_HTentries())
        EndIf
      Next
    UnlockMutex(_HTmutex)
    FreeMemory(HTid)
  EndIf
EndProcedure




;-************************HASH FUNCTION
;The following function creates the hash code.
;This can be altered to use a different alogorithm.
;The function MUST return a value between 0 and 'numelementsintable'.
;We currently use the 'djb2' algorithm.
;You do not need to worry about collisions, the library automatically takes care of this.

;NOTE, the following returns the same code when working in either ASCII or UNICODE.
;Probably best to ensure that any replacement algorithm does likewise, so avoid PeekB() etc.
;Either use PeekC() (and take care with pointers) or Asc().
Procedure.l HT_CreateHash(numelementsintable, key$)
  Protected hash=5381, i
  If key$
    For i = 1 To Len(key$)
      hash=(hash<<5)+hash + Asc(Mid(key$,i, 1)) ; hash * 33 + asc
    Next
  EndIf
  If hash>=0
    ProcedureReturn hash%(numelementsintable+1)
  Else
    ProcedureReturn -(hash%(numelementsintable+1))
  EndIf
EndProcedure

I hope it's useful.

Regards.

Posted: **Sun Nov 19, 2006 5:33 pm**

Very nice, very useful. Thanks.

Posted: **Sun Nov 19, 2006 5:49 pm**

srod wrote: The great thing about hash tables, however, is that the speed of data access is far far greater than simply searching an array of strings etc. Indeed, if you've set aside 'sufficient' memory and the 'hashing algorithm' is a good 'un, then access can be almost immediate and requires a single 'lookup'. (Contrast this with a search through a string array etc!)

I just wrote a very simple and very unoptimized array search routine, which allows to do just about the same stuff, and my code wins seriously with the debugger on.
So your hash algo must be a bad one

With debugger off, i can't test it! It simply crashes.

Sorry if i misunderstood something, like if you mentioned that your routine was slower than an actual array search or if it was not the point at all. Im pretty tired and i must admit i didn't read this very well.

Thanks for the interesting explanation anyway, i knew what hash tables were just never bothered to use them. But they do look interesting.

Posted: **Sun Nov 19, 2006 5:52 pm**

That is a weird crash.

What ever is going on probably explains the speed test!

I'll have a look.

Thanks.

**EDIT: got it. Will take a few minutes to fix.

Posted: **Sun Nov 19, 2006 6:08 pm**

thefool wrote:I just wrote a very simple and very unoptimized array search routine, which allows to do just about the same stuff, and my code wins seriously with the debugger on.
So your hash algo must be a bad one

With debugger off, i can't test it! It simply crashes.

The crash is fixed.

It was indeed the algo; I'd cocked it right up!

Any chance you can run your test again? Either that or pm me the test code and I'll have a tinker with it? The thing is that there probably needs to be a 'reasonable' amount of data before you'll see a noticeable difference in speed. My guess, about 20 or so entries.

I'll do some speed tests of my own.

Thanks again.

Posted: **Sun Nov 19, 2006 6:34 pm**

Hi!
Now it works as it should

Currently with the 5 items, my code, at 99999 repeats of the for i=1 to 5...
is about 4-5 times faster.

I will pm you the code

Posted: **Sun Nov 19, 2006 6:35 pm**

@TheFool: I've just done some tests and the Hash Table completely blows away a straight foward string search if you give it enough work to do.

E.g.

Code: Select all

;HASH TABLES
;Stephen Rodriguez; November 2006.

;********************************************
;*Example.                                  *
;********************************************


XIncludeFile "PB hashtables.pbi"
DisableExplicit

;Time a value lookup from a hash table with 500 elements.
 HTid = HT_CreateHashTable(5000)
  For i = 0 To 1000
    HT_AddItem(HTid, "entry"+Str(i), i)
  Next

  StartTime = ElapsedMilliseconds()             
    For loop = 1 To 200 ;200 random searches
      index=Random(1000)
      x=HT_GetItem(HTid, "entry"+Str(index))
    Next
  EndTime=ElapsedMilliseconds()
  Debug "Hash table time: "+Str(endtime-starttime)

;Now the corresponding array search.

  Structure test
    name.s
    value.l
  EndStructure

  Dim test.test(1000)
  For i = 0 To 1000
    test(i)\name="entry"+Str(i)
    test(i)\value=i
  Next

  StartTime = ElapsedMilliseconds()             
    For loop = 1 To 200 ;200 random searches
      index=Random(1000)
      i=0
      While test(i)\name<>"entry"+Str(index)
        i+1
      Wend
    Next
  EndTime=ElapsedMilliseconds()
  Debug "Array search: "+Str(endtime-starttime)

Posted: **Sun Nov 19, 2006 6:36 pm**

thefool wrote:Hi!
Now it works as it should

Currently with the 5 items, my code, at 99999 repeats of the for i=1 to 5...
is about 4-5 times faster.

I will pm you the code

But that doesn't qualify as a string search - if I've understood correctly what you are doing?

Posted: **Sun Nov 19, 2006 6:40 pm**

What?
See the code. It sure works with a string search. Im not _that_ tired

And it also accepts variables for in & output, just like yours.

Posted: **Sun Nov 19, 2006 6:50 pm**

Nice

The hashtable wins, as expected, when there are a lot of entries. Nice job

Now i gotta go make my own hashalgo hehe

Posted: **Sun Nov 19, 2006 6:51 pm**

Yes, but having seen your example (which is a good one) my answer would be that you wouldn't use a hash table for just 5 items of data. There is going to be no advantage there at all, and as you demonstrated quite nicely, a hash table would be a distinct hindrance in such cases. The time in calculating the hashes will outweigh the single lookups required etc.

However, for larger sets of data (as my example above shows) and especially for one off searches, the hash table, by virtue of it's underlying structure, is going to be faster, and, with the more data added, is often going to be considerably faster.

**EDIT** sorry, we posted at the same time!

Posted: **Sun Nov 19, 2006 6:55 pm**

thefool wrote:Nice
The hashtable wins, as expected, when there are a lot of entries. Nice job

Now i gotta go make my own hashalgo hehe

Thanks.

And to you Dare; nearly missed your post! Sorry.

Posted: **Sun Nov 19, 2006 7:00 pm**

srod wrote: with the more data added, is often going to be considerably faster.

Absolutely. To people wondering; we talk more than 18 times with the random access of the 1000 elements repeated 999 times

PureBasic Forums - English

Hash tables.

Hash tables.

Re: Hash tables.

Re: Hash tables.