Page 1 of 1

Hash question

Posted: Wed Apr 21, 2010 2:01 am
by netmaestro
I've tried a few hashes trying to find one with a really uniform distribution. djb2 and sdbm seem at the top, but I'm still disappointed in the number of collisions I'm getting. Here is a test snippet (requires 4.50 Beta 3 to work) and the results I'm getting show more than one-third collisions. Is this normal? Or did I miss the boat somewhere in my hash routine? All discussion welcome.

Code: Select all

Global collisions=0, tableentries=0, buckets=10000

Structure hashdata
  id.s
  value.l
EndStructure

Structure bucket
  List buckets.hashdata()
EndStructure

Structure hashtemplate
  *methodpointer.i
  Array hash.bucket(10000)
EndStructure

Procedure.l Hash(tablesize, key$)
  ; sdbm 
  Protected hash=0, *p.Character
  If key$
    *p.Character = @key$
    While *p\c
      hash = *p\c + (hash << 6) + (hash << 16) - hash
      *p+1
    Wend
  EndIf
  qhash.q = hash&$FFFFFFFF
  ProcedureReturn qhash%(tablesize+1)
EndProcedure

Procedure insert(*this.hashtemplate, key$, value)
  hkey = hash(buckets, key$)
  If ListSize(*this\hash(hkey)\buckets()) > 0
    FirstElement(*this\hash(hkey)\buckets()) 
    If *this\hash(hkey)\buckets()\id = key$
      ; duplicate, so quit
      ProcedureReturn 0
    Else
      ; collision
      collisions+1
      found=0
      ForEach *this\hash(hkey)\buckets()
        If *this\hash(hkey)\buckets()\id = key$
          found=1
          Break
        EndIf
      Next
      If Not found
        AddElement(*this\hash(hkey)\buckets())
        *this\hash(hkey)\buckets()\id = key$
        *this\hash(hkey)\buckets()\value = value
      Else
        ProcedureReturn 0
      EndIf
    EndIf
  Else
    tableentries + 1
    AddElement(*this\hash(hkey)\buckets())
    *this\hash(hkey)\buckets()\id = key$
    *this\hash(hkey)\buckets()\value = value
  EndIf
EndProcedure

Procedure destroy(*this.hashtemplate)
  FreeMemory(*this)
EndProcedure

Procedure getvalue(*this.hashtemplate, key$)
  hkey = hash(buckets, key$)
  If *this\hash(hkey)\buckets()\id = key$
    ProcedureReturn *this\hash(hkey)\buckets()\value
  Else
    ForEach *this\hash(hkey)\buckets()
      If *this\hash(hkey)\buckets()\id = key$
        ProcedureReturn *this\hash(hkey)\buckets()\value
      EndIf
    Next
    ProcedureReturn 0
  EndIf
EndProcedure

Procedure.i newhashtable(tablesize)
  *object.hashtemplate = AllocateMemory(SizeOf(hashtemplate))
  InitializeStructure(*object, hashtemplate)
  If *object
    *object\methodpointer = ?hashmethods
    ProcedureReturn *object
  Else
    ProcedureReturn 0
  EndIf
EndProcedure

Interface hashtable
  destroy  ()    
  insert   (key.s, value)
  getvalue (key.s)
EndInterface

DataSection
  hashmethods:
  Data.i @destroy(), @insert(), @getvalue()
EndDataSection

;////////////////////////////////////////////////////////////////////

; Test program(s)

hash.hashtable = newhashtable(buckets)

NewList stuff.hashdata()

For i=1 To buckets
  *Key = AllocateMemory(4)
  If OpenCryptRandom() And *Key
    CryptRandomData(*Key, 4)
    CloseCryptRandom()
  EndIf
  RandomSeed(Random(PeekL(*key)))
  tmp$ = Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(26))
  AddElement(stuff())
  stuff()\id = tmp$
  stuff()\value = i
Next

t1=ElapsedMilliseconds()

cc=0
ForEach stuff()
  cc+1
  hash\insert(stuff()\id, stuff()\value)
Next

t2 = ElapsedMilliseconds()

t3 = ElapsedMilliseconds()

ForEach stuff()
  value = hash\getvalue(stuff()\id)
Next

t4 = ElapsedMilliseconds()

out$ = ""
out$ + "Total inserts:          "+Str(cc)                   + #CRLF$ 
out$ + "Total collisions:       "+Str(collisions)           + #CRLF$
out$ + "Total table entries:    "+Str(tableentries)         + #CRLF$
out$ + "Total insert time:      "+Str(t2-t1)+"milliseconds" + #CRLF$ 
out$ + "Total retrieval time:   "+Str(t4-t3)+"milliseconds" + #CRLF$ 

MessageRequester("",out$)
ps. Damn, this language is turning into one supercharged cool coding engine!

pss. PB's native map is way faster than this. For this kind of task stick with the native one. I'm only experimenting with hashes here.

Re: Hash question

Posted: Wed Apr 21, 2010 11:46 am
by Fred
Just a quick side note: code like "If OpenCryptRandom() And *Key" are not safe, because if *Key is null, you will have opened the cryptrandom but don't closed it. You need to write it in reverse form:
"If *Key And OpenCryptRandom()". Nice map experiment :).

Re: Hash question

Posted: Wed Apr 21, 2010 11:59 am
by Trond
The number of collisions depends on whether the hash algorithm is suitable to the data. Your data isn't random, so the algorithm will need to account for that.

Edit:
It seems normal with random numbers:

Code: Select all

Items   = 10000

Dim Table(Items)

For I = 0 To Items
  r = Random(Items)
  Table(r) + 1
Next

For i = 0 To items
  If Table(i) = 0
    Null + 1
  EndIf
Next

Debug "Collisions: " + Str(Null)
Debug "Elements: " + Str(Items-Null)

Re: Hash question

Posted: Wed Apr 21, 2010 10:45 pm
by idle
It's not easy choosing a hash function as the collision behavior depends on the data and also the table size generally speaking if you can load a closed table to more than 70% full then you're doing well.

Re: Hash question

Posted: Tue Apr 27, 2010 12:16 am
by Rescator
Well, PureBasic's map/hashing is one of the fastest and with the least number of colllisions in tests among the more popular ones. So Freak (or was it Fred?) picked a damn good choice there.

I just wish there was a .i version of it too. as it's kinda silly doing "125763" in string form if it could fit in 8 or 4 bytes, and less overhead too I assume. :)