A OOP class for dynamically implementing hash tables!

superadnim · Post by **superadnim** » Mon Jan 14, 2008 9:28 am

Hi,

Is this "engine" limited to string hashes?, can I use long-int hashes instead?, because I think it defeats the purpose if we compare hash-strings instead of the strings themselves (unless the actual data was longer than the hash length).

I'm using a small array of crc32 hashes to compare instead of comparing my lists of words... is this lib for me?

Thanks

srod · Post by **srod** » Mon Jan 14, 2008 10:30 am

I'm not sure what you mean exactly?

This library takes a string 'key' and from that generates a 32-bit hash code (which lies within certain bounds). There are no string hashes.

This is how 'traditional' hash tables operate.

The only time strings are compared is when there are 'collisions' and even then a binary search is utilised in the case of my library.

superadnim · Post by **superadnim** » Mon Jan 14, 2008 1:00 pm

I'm not really implying anything, so I'm sorry if I wasn't too clear about my question to begin with

Right now I'm using a list of 32bit hashes to compare strings from one set to another. It all works nicely but I was thinking if your library would do the same trick and if it would also provide me with some new useful functionality I might be lacking currently?.

I also like OOP but I'm wondering if this is the right lib for my case or if I should just leave my code as is.

Thanks

srod · Post by **srod** » Mon Jan 14, 2008 1:15 pm

superadnim wrote:I'm not really implying anything, so I'm sorry if I wasn't too clear about my question to begin with

Right now I'm using a list of 32bit hashes to compare strings from one set to another. It all works nicely but I was thinking if your library would do the same trick and if it would also provide me with some new useful functionality I might be lacking currently?.

I also like OOP but I'm wondering if this is the right lib for my case or if I should just leave my code as is.

Thanks

If your situation is such that no two strings can generate the same 32-bit hash code, and your code works fine, then yes I would say that you probably shouldn't switch to my library.

A hash table is used to actually store data rather than just compare hashes etc. and so I am a little unsure if a hash table is what you need? When storing such data, there is always the possibility that two items will generate the same hash (memory is finite after all) and thus multiple items will attempt to occupy the same memory etc. This library of mine is what I would describe as a 'heavy duty' hash table in that it not only deals with 'collisions' in a very efficient way, but you can store absolutely any type of data within the hash tables etc. You are not limited to single 32-bit values.

I am just shaping up right now to create a hash table which will potentially be used to store millions of items. I'll probably break the library in the process!

superadnim · Post by **superadnim** » Mon Jan 14, 2008 1:34 pm

I'll probably break the library in the process!

Or make it even better!

Heres my situation: I have input (dynamic) data and comparison (static) data. on both cases a hash list is created for comparison of the items. I found this is way faster overall than comparing the raw string data in almost every case (other than 2 or 3 string lengths, in such cases the overhead of calculating the initial hash list was too big and not worth it for a small amount of comparisons overall).

The problem: my method is not very clean, code wise.. I don't like it at all, I'm using arrays, etc. I dislike it, period

That's why I was wondering if I should use your library or not but it appears that's not it's main purpose...

srod · Post by **srod** » Mon Jan 14, 2008 1:43 pm

No, if I understand you correctly, then a hashtable does not seem appropriate.

Thinking about what you are doing, did you say you are using CRC32 checksums as your hash? If so, just remember that two different strings could yield the same hash in this case? Thus, you'll need to supplement your hash comparisons in cases where you find two equal hashes etc.

superadnim · Post by **superadnim** » Mon Jan 14, 2008 5:39 pm

Thanks for the info.

Yes, I'm aware of CRC32 and collisions with this algorithm, but I believe there are not enough words in the English dictionary to cause one (correct me if I'm wrong!).

In all cases, the lists are quite tiny (from 50 to 200 unique words on the dynamic side) and not so many on the static side (the user builds this one).

Also, the data is quite small so I don't really see how a collision could happen in my case?, please enlighten me

(and sorry if I somehow hijacked your thread but I thought it was appropriate - at least at the beginning - to ask)

I am toying with another hashing algorithm which is faster and has similar collision rates... I just haven't encountered a single one so far but you are right that I shouldn't think it won't happen ( but if it did, it wouldn't be too "fatal" in my case since the whole process is over viewed by the user ).

srod · Post by **srod** » Mon Jan 14, 2008 5:48 pm

Well, if you get a collision, just compare the original strings!

srod · Post by **srod** » Fri Aug 15, 2008 10:24 am

Update : 15/08/08.
Have added a new method; ClearItems(), which removes all items from the underlying hash table but does not destroy the table. This is much faster than simply enumerating all keys and removing the items individually.

The download link is at the bottom of the first post.

Post by **idle** » Fri Aug 15, 2008 12:50 pm

Thanks for sharing, I'm sure it'll come in very handy.

milan1612 · Post by **milan1612** » Sun Sep 28, 2008 4:02 pm

Hi Stephen,

would it be possible to implement some kind of serialization of the whole hashtable?
So that I could save it to a file and later load it again with all the entries already inserted.
Dunno if you're allocating one block of memory at once or one for each entry when necessary,
so I really don't know how to do it efficiently.

Thanks

srod · Post by **srod** » Sun Sep 28, 2008 5:11 pm

The internal data structures are such that a 'memory dump' of a hash table is very difficult!

Your best bet is to simply enumerate the entire hash table and write the data out to disc that way. When loading back you must create a hash table with exactly the same number of indices and then add each element in turn as you load them from disc. Writing a generic routine to do this for any hash table is possible, but is not a trivial exercise. It is also not something I am in need of and I do not have the time at the moment to code such a routine.

Sorry.

milan1612 · Post by **milan1612** » Sun Sep 28, 2008 5:13 pm

No problem, I just hoped it'd be trivial to do. Okay, I'll do it the hard way

srod · Post by **srod** » Sun Sep 28, 2008 5:15 pm

Yes the main problem is that a hash table does not occupy a single contiguous chunk of memory - far from it!

It doesn't allocate more memory than it needs, but the memory which is allocated will not be contiguous at all!

netmaestro · Post by **netmaestro** » Fri Apr 17, 2009 9:33 pm

Nice work srod! And good choice of hash algorithm imho. May I just make one tiny observation?

This:

Code: Select all

Procedure.l CreateHash(numelementsintable, *key.CHARACTER)
  Protected hash=5381
  While *key\c <> 0
    hash=(hash<<5)+hash + *key\c ; hash * 33 + asc
    *key+1
  Wend
  If hash>=0
    ProcedureReturn hash%(numelementsintable+1)
  Else
    ProcedureReturn -(hash%(numelementsintable+1))
  EndIf
EndProcedure

would be approx. 15x faster than:

Code: Select all

Procedure.l HashTableClass_CreateHash(numelementsintable, key$)
  Protected hash=5381, i
  If key$
    For i = 1 To Len(key$)
      hash=(hash<<5)+hash + Asc(Mid(key$,i, 1)) ; hash * 33 + asc
    Next
  EndIf
  If hash>=0
    ProcedureReturn hash%(numelementsintable+1)
  Else
    ProcedureReturn -(hash%(numelementsintable+1))
  EndIf
EndProcedure

If one were using this for a chess engine there would be many millions of calls to this proc within a relatively short time period.

All in all though, excellent library. Thanks for sharing it.