Hash question
Posted: Wed Apr 21, 2010 2:01 am
I've tried a few hashes trying to find one with a really uniform distribution. djb2 and sdbm seem at the top, but I'm still disappointed in the number of collisions I'm getting. Here is a test snippet (requires 4.50 Beta 3 to work) and the results I'm getting show more than one-third collisions. Is this normal? Or did I miss the boat somewhere in my hash routine? All discussion welcome.
ps. Damn, this language is turning into one supercharged cool coding engine!
pss. PB's native map is way faster than this. For this kind of task stick with the native one. I'm only experimenting with hashes here.
Code: Select all
Global collisions=0, tableentries=0, buckets=10000
Structure hashdata
id.s
value.l
EndStructure
Structure bucket
List buckets.hashdata()
EndStructure
Structure hashtemplate
*methodpointer.i
Array hash.bucket(10000)
EndStructure
Procedure.l Hash(tablesize, key$)
; sdbm
Protected hash=0, *p.Character
If key$
*p.Character = @key$
While *p\c
hash = *p\c + (hash << 6) + (hash << 16) - hash
*p+1
Wend
EndIf
qhash.q = hash&$FFFFFFFF
ProcedureReturn qhash%(tablesize+1)
EndProcedure
Procedure insert(*this.hashtemplate, key$, value)
hkey = hash(buckets, key$)
If ListSize(*this\hash(hkey)\buckets()) > 0
FirstElement(*this\hash(hkey)\buckets())
If *this\hash(hkey)\buckets()\id = key$
; duplicate, so quit
ProcedureReturn 0
Else
; collision
collisions+1
found=0
ForEach *this\hash(hkey)\buckets()
If *this\hash(hkey)\buckets()\id = key$
found=1
Break
EndIf
Next
If Not found
AddElement(*this\hash(hkey)\buckets())
*this\hash(hkey)\buckets()\id = key$
*this\hash(hkey)\buckets()\value = value
Else
ProcedureReturn 0
EndIf
EndIf
Else
tableentries + 1
AddElement(*this\hash(hkey)\buckets())
*this\hash(hkey)\buckets()\id = key$
*this\hash(hkey)\buckets()\value = value
EndIf
EndProcedure
Procedure destroy(*this.hashtemplate)
FreeMemory(*this)
EndProcedure
Procedure getvalue(*this.hashtemplate, key$)
hkey = hash(buckets, key$)
If *this\hash(hkey)\buckets()\id = key$
ProcedureReturn *this\hash(hkey)\buckets()\value
Else
ForEach *this\hash(hkey)\buckets()
If *this\hash(hkey)\buckets()\id = key$
ProcedureReturn *this\hash(hkey)\buckets()\value
EndIf
Next
ProcedureReturn 0
EndIf
EndProcedure
Procedure.i newhashtable(tablesize)
*object.hashtemplate = AllocateMemory(SizeOf(hashtemplate))
InitializeStructure(*object, hashtemplate)
If *object
*object\methodpointer = ?hashmethods
ProcedureReturn *object
Else
ProcedureReturn 0
EndIf
EndProcedure
Interface hashtable
destroy ()
insert (key.s, value)
getvalue (key.s)
EndInterface
DataSection
hashmethods:
Data.i @destroy(), @insert(), @getvalue()
EndDataSection
;////////////////////////////////////////////////////////////////////
; Test program(s)
hash.hashtable = newhashtable(buckets)
NewList stuff.hashdata()
For i=1 To buckets
*Key = AllocateMemory(4)
If OpenCryptRandom() And *Key
CryptRandomData(*Key, 4)
CloseCryptRandom()
EndIf
RandomSeed(Random(PeekL(*key)))
tmp$ = Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(26))
AddElement(stuff())
stuff()\id = tmp$
stuff()\value = i
Next
t1=ElapsedMilliseconds()
cc=0
ForEach stuff()
cc+1
hash\insert(stuff()\id, stuff()\value)
Next
t2 = ElapsedMilliseconds()
t3 = ElapsedMilliseconds()
ForEach stuff()
value = hash\getvalue(stuff()\id)
Next
t4 = ElapsedMilliseconds()
out$ = ""
out$ + "Total inserts: "+Str(cc) + #CRLF$
out$ + "Total collisions: "+Str(collisions) + #CRLF$
out$ + "Total table entries: "+Str(tableentries) + #CRLF$
out$ + "Total insert time: "+Str(t2-t1)+"milliseconds" + #CRLF$
out$ + "Total retrieval time: "+Str(t4-t3)+"milliseconds" + #CRLF$
MessageRequester("",out$)
pss. PB's native map is way faster than this. For this kind of task stick with the native one. I'm only experimenting with hashes here.