Hash question

Everything else that doesn't fall into one of the other PB categories.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8453
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Hash question

Post by netmaestro »

I've tried a few hashes trying to find one with a really uniform distribution. djb2 and sdbm seem at the top, but I'm still disappointed in the number of collisions I'm getting. Here is a test snippet (requires 4.50 Beta 3 to work) and the results I'm getting show more than one-third collisions. Is this normal? Or did I miss the boat somewhere in my hash routine? All discussion welcome.

Code: Select all

Global collisions=0, tableentries=0, buckets=10000

Structure hashdata
  id.s
  value.l
EndStructure

Structure bucket
  List buckets.hashdata()
EndStructure

Structure hashtemplate
  *methodpointer.i
  Array hash.bucket(10000)
EndStructure

Procedure.l Hash(tablesize, key$)
  ; sdbm 
  Protected hash=0, *p.Character
  If key$
    *p.Character = @key$
    While *p\c
      hash = *p\c + (hash << 6) + (hash << 16) - hash
      *p+1
    Wend
  EndIf
  qhash.q = hash&$FFFFFFFF
  ProcedureReturn qhash%(tablesize+1)
EndProcedure

Procedure insert(*this.hashtemplate, key$, value)
  hkey = hash(buckets, key$)
  If ListSize(*this\hash(hkey)\buckets()) > 0
    FirstElement(*this\hash(hkey)\buckets()) 
    If *this\hash(hkey)\buckets()\id = key$
      ; duplicate, so quit
      ProcedureReturn 0
    Else
      ; collision
      collisions+1
      found=0
      ForEach *this\hash(hkey)\buckets()
        If *this\hash(hkey)\buckets()\id = key$
          found=1
          Break
        EndIf
      Next
      If Not found
        AddElement(*this\hash(hkey)\buckets())
        *this\hash(hkey)\buckets()\id = key$
        *this\hash(hkey)\buckets()\value = value
      Else
        ProcedureReturn 0
      EndIf
    EndIf
  Else
    tableentries + 1
    AddElement(*this\hash(hkey)\buckets())
    *this\hash(hkey)\buckets()\id = key$
    *this\hash(hkey)\buckets()\value = value
  EndIf
EndProcedure

Procedure destroy(*this.hashtemplate)
  FreeMemory(*this)
EndProcedure

Procedure getvalue(*this.hashtemplate, key$)
  hkey = hash(buckets, key$)
  If *this\hash(hkey)\buckets()\id = key$
    ProcedureReturn *this\hash(hkey)\buckets()\value
  Else
    ForEach *this\hash(hkey)\buckets()
      If *this\hash(hkey)\buckets()\id = key$
        ProcedureReturn *this\hash(hkey)\buckets()\value
      EndIf
    Next
    ProcedureReturn 0
  EndIf
EndProcedure

Procedure.i newhashtable(tablesize)
  *object.hashtemplate = AllocateMemory(SizeOf(hashtemplate))
  InitializeStructure(*object, hashtemplate)
  If *object
    *object\methodpointer = ?hashmethods
    ProcedureReturn *object
  Else
    ProcedureReturn 0
  EndIf
EndProcedure

Interface hashtable
  destroy  ()    
  insert   (key.s, value)
  getvalue (key.s)
EndInterface

DataSection
  hashmethods:
  Data.i @destroy(), @insert(), @getvalue()
EndDataSection

;////////////////////////////////////////////////////////////////////

; Test program(s)

hash.hashtable = newhashtable(buckets)

NewList stuff.hashdata()

For i=1 To buckets
  *Key = AllocateMemory(4)
  If OpenCryptRandom() And *Key
    CryptRandomData(*Key, 4)
    CloseCryptRandom()
  EndIf
  RandomSeed(Random(PeekL(*key)))
  tmp$ = Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(190))+Chr(65+Random(26))
  AddElement(stuff())
  stuff()\id = tmp$
  stuff()\value = i
Next

t1=ElapsedMilliseconds()

cc=0
ForEach stuff()
  cc+1
  hash\insert(stuff()\id, stuff()\value)
Next

t2 = ElapsedMilliseconds()

t3 = ElapsedMilliseconds()

ForEach stuff()
  value = hash\getvalue(stuff()\id)
Next

t4 = ElapsedMilliseconds()

out$ = ""
out$ + "Total inserts:          "+Str(cc)                   + #CRLF$ 
out$ + "Total collisions:       "+Str(collisions)           + #CRLF$
out$ + "Total table entries:    "+Str(tableentries)         + #CRLF$
out$ + "Total insert time:      "+Str(t2-t1)+"milliseconds" + #CRLF$ 
out$ + "Total retrieval time:   "+Str(t4-t3)+"milliseconds" + #CRLF$ 

MessageRequester("",out$)
ps. Damn, this language is turning into one supercharged cool coding engine!

pss. PB's native map is way faster than this. For this kind of task stick with the native one. I'm only experimenting with hashes here.
BERESHEIT
Fred
Administrator
Administrator
Posts: 18553
Joined: Fri May 17, 2002 4:39 pm
Location: France
Contact:

Re: Hash question

Post by Fred »

Just a quick side note: code like "If OpenCryptRandom() And *Key" are not safe, because if *Key is null, you will have opened the cryptrandom but don't closed it. You need to write it in reverse form:
"If *Key And OpenCryptRandom()". Nice map experiment :).
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Re: Hash question

Post by Trond »

The number of collisions depends on whether the hash algorithm is suitable to the data. Your data isn't random, so the algorithm will need to account for that.

Edit:
It seems normal with random numbers:

Code: Select all

Items   = 10000

Dim Table(Items)

For I = 0 To Items
  r = Random(Items)
  Table(r) + 1
Next

For i = 0 To items
  If Table(i) = 0
    Null + 1
  EndIf
Next

Debug "Collisions: " + Str(Null)
Debug "Elements: " + Str(Items-Null)
User avatar
idle
Always Here
Always Here
Posts: 6238
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Hash question

Post by idle »

It's not easy choosing a hash function as the collision behavior depends on the data and also the table size generally speaking if you can load a closed table to more than 70% full then you're doing well.
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: Hash question

Post by Rescator »

Well, PureBasic's map/hashing is one of the fastest and with the least number of colllisions in tests among the more popular ones. So Freak (or was it Fred?) picked a damn good choice there.

I just wish there was a .i version of it too. as it's kinda silly doing "125763" in string form if it could fit in 8 or 4 bytes, and less overhead too I assume. :)
Post Reply