A OOP class for dynamically implementing hash tables!

Share your advanced PureBasic knowledge/code with the community.
superadnim
Enthusiast
Enthusiast
Posts: 480
Joined: Thu Jul 27, 2006 4:06 am

Post by superadnim »

Hi,

Is this "engine" limited to string hashes?, can I use long-int hashes instead?, because I think it defeats the purpose if we compare hash-strings instead of the strings themselves (unless the actual data was longer than the hash length).

I'm using a small array of crc32 hashes to compare instead of comparing my lists of words... is this lib for me?

Thanks
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

I'm not sure what you mean exactly?

This library takes a string 'key' and from that generates a 32-bit hash code (which lies within certain bounds). There are no string hashes.

This is how 'traditional' hash tables operate.

The only time strings are compared is when there are 'collisions' and even then a binary search is utilised in the case of my library.
I may look like a mule, but I'm not a complete ass.
superadnim
Enthusiast
Enthusiast
Posts: 480
Joined: Thu Jul 27, 2006 4:06 am

Post by superadnim »

I'm not really implying anything, so I'm sorry if I wasn't too clear about my question to begin with :)

Right now I'm using a list of 32bit hashes to compare strings from one set to another. It all works nicely but I was thinking if your library would do the same trick and if it would also provide me with some new useful functionality I might be lacking currently?.

I also like OOP but I'm wondering if this is the right lib for my case or if I should just leave my code as is.

Thanks
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

superadnim wrote:I'm not really implying anything, so I'm sorry if I wasn't too clear about my question to begin with :)

Right now I'm using a list of 32bit hashes to compare strings from one set to another. It all works nicely but I was thinking if your library would do the same trick and if it would also provide me with some new useful functionality I might be lacking currently?.

I also like OOP but I'm wondering if this is the right lib for my case or if I should just leave my code as is.

Thanks
If your situation is such that no two strings can generate the same 32-bit hash code, and your code works fine, then yes I would say that you probably shouldn't switch to my library.

A hash table is used to actually store data rather than just compare hashes etc. and so I am a little unsure if a hash table is what you need? When storing such data, there is always the possibility that two items will generate the same hash (memory is finite after all) and thus multiple items will attempt to occupy the same memory etc. This library of mine is what I would describe as a 'heavy duty' hash table in that it not only deals with 'collisions' in a very efficient way, but you can store absolutely any type of data within the hash tables etc. You are not limited to single 32-bit values.

I am just shaping up right now to create a hash table which will potentially be used to store millions of items. I'll probably break the library in the process! :)
I may look like a mule, but I'm not a complete ass.
superadnim
Enthusiast
Enthusiast
Posts: 480
Joined: Thu Jul 27, 2006 4:06 am

Post by superadnim »

I'll probably break the library in the process!
Or make it even better!

Heres my situation: I have input (dynamic) data and comparison (static) data. on both cases a hash list is created for comparison of the items. I found this is way faster overall than comparing the raw string data in almost every case (other than 2 or 3 string lengths, in such cases the overhead of calculating the initial hash list was too big and not worth it for a small amount of comparisons overall).

The problem: my method is not very clean, code wise.. I don't like it at all, I'm using arrays, etc. I dislike it, period :P

That's why I was wondering if I should use your library or not but it appears that's not it's main purpose...
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

No, if I understand you correctly, then a hashtable does not seem appropriate.

Thinking about what you are doing, did you say you are using CRC32 checksums as your hash? If so, just remember that two different strings could yield the same hash in this case? Thus, you'll need to supplement your hash comparisons in cases where you find two equal hashes etc.
I may look like a mule, but I'm not a complete ass.
superadnim
Enthusiast
Enthusiast
Posts: 480
Joined: Thu Jul 27, 2006 4:06 am

Post by superadnim »

Thanks for the info.

Yes, I'm aware of CRC32 and collisions with this algorithm, but I believe there are not enough words in the English dictionary to cause one (correct me if I'm wrong!).

In all cases, the lists are quite tiny (from 50 to 200 unique words on the dynamic side) and not so many on the static side (the user builds this one).

Also, the data is quite small so I don't really see how a collision could happen in my case?, please enlighten me :) (and sorry if I somehow hijacked your thread but I thought it was appropriate - at least at the beginning - to ask)

I am toying with another hashing algorithm which is faster and has similar collision rates... I just haven't encountered a single one so far but you are right that I shouldn't think it won't happen ( but if it did, it wouldn't be too "fatal" in my case since the whole process is over viewed by the user ).
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Well, if you get a collision, just compare the original strings! :)
I may look like a mule, but I'm not a complete ass.
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Update : 15/08/08.
Have added a new method; ClearItems(), which removes all items from the underlying hash table but does not destroy the table. This is much faster than simply enumerating all keys and removing the items individually.

The download link is at the bottom of the first post.
I may look like a mule, but I'm not a complete ass.
User avatar
idle
Always Here
Always Here
Posts: 5839
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

Thanks for sharing, I'm sure it'll come in very handy. :D
milan1612
Addict
Addict
Posts: 894
Joined: Thu Apr 05, 2007 12:15 am
Location: Nuremberg, Germany
Contact:

Post by milan1612 »

Hi Stephen,

would it be possible to implement some kind of serialization of the whole hashtable?
So that I could save it to a file and later load it again with all the entries already inserted.
Dunno if you're allocating one block of memory at once or one for each entry when necessary,
so I really don't know how to do it efficiently.

Thanks
Windows 7 & PureBasic 4.4
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

The internal data structures are such that a 'memory dump' of a hash table is very difficult! :)

Your best bet is to simply enumerate the entire hash table and write the data out to disc that way. When loading back you must create a hash table with exactly the same number of indices and then add each element in turn as you load them from disc. Writing a generic routine to do this for any hash table is possible, but is not a trivial exercise. It is also not something I am in need of and I do not have the time at the moment to code such a routine.

Sorry. :)
I may look like a mule, but I'm not a complete ass.
milan1612
Addict
Addict
Posts: 894
Joined: Thu Apr 05, 2007 12:15 am
Location: Nuremberg, Germany
Contact:

Post by milan1612 »

No problem, I just hoped it'd be trivial to do. Okay, I'll do it the hard way :lol:
Windows 7 & PureBasic 4.4
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Yes the main problem is that a hash table does not occupy a single contiguous chunk of memory - far from it! :) It doesn't allocate more memory than it needs, but the memory which is allocated will not be contiguous at all!
I may look like a mule, but I'm not a complete ass.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8451
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Post by netmaestro »

Nice work srod! And good choice of hash algorithm imho. May I just make one tiny observation?

This:

Code: Select all

Procedure.l CreateHash(numelementsintable, *key.CHARACTER)
  Protected hash=5381
  While *key\c <> 0
    hash=(hash<<5)+hash + *key\c ; hash * 33 + asc
    *key+1
  Wend
  If hash>=0
    ProcedureReturn hash%(numelementsintable+1)
  Else
    ProcedureReturn -(hash%(numelementsintable+1))
  EndIf
EndProcedure
would be approx. 15x faster than:

Code: Select all

Procedure.l HashTableClass_CreateHash(numelementsintable, key$)
  Protected hash=5381, i
  If key$
    For i = 1 To Len(key$)
      hash=(hash<<5)+hash + Asc(Mid(key$,i, 1)) ; hash * 33 + asc
    Next
  EndIf
  If hash>=0
    ProcedureReturn hash%(numelementsintable+1)
  Else
    ProcedureReturn -(hash%(numelementsintable+1))
  EndIf
EndProcedure
If one were using this for a chess engine there would be many millions of calls to this proc within a relatively short time period.

All in all though, excellent library. Thanks for sharing it.
BERESHEIT
Post Reply