Page 3 of 3

Posted: Sat Apr 18, 2009 4:31 am
by kandl
i don't know why no one commented on my thread but my hash function looks much cleaner (and more generic) than yours

http://www.purebasic.fr/english/viewtopic.php?t=37060

Code: Select all


Structure hashtable_char
  c.c
EndStructure 

Procedure.i hastable_hashingfunction(string$)

  *p.hashtable_char = @string$
  s = SizeOf(*p\c)

  ; calculate hash value using the shift-add algorithm
  For i = 1 To Len(string$)
    h << 4
    h + *p\c
    *p + s
  Next
 
  ; make sure h is positive
  If h < 0
    h = ~h + 1
  EndIf

  ProcedureReturn h
 
EndProcedure 


Posted: Sat Apr 18, 2009 4:52 am
by netmaestro
my hash function looks much cleaner (and more generic) than yours
Well, I think you have some work to do yet on it. Here's a comparison using this item list:

http://www.lloydsplace.com/sowpods2.zip

Code: Select all


Procedure.l djb2(tablesize, *key.CHARACTER)
  Protected hash=5381
  While *key\c <> 0
    hash=(hash<<5)+hash + *key\c ; hash * 33 + asc
    *key+1
  Wend
  If hash>=0
    ProcedureReturn hash%(tablesize+1)
  Else
    ProcedureReturn -(hash%(tablesize+1))
  EndIf
EndProcedure


Structure hashtable_char 
  c.c 
EndStructure 

Procedure.i hashtable_hashingfunction(tablesize, string$) 

  *p.hashtable_char = @string$ 
  s = SizeOf(*p\c) 

  ; calculate hash value using the shift-add algorithm 
  For i = 1 To Len(string$) 
    h << 4 
    h + *p\c 
    *p + s 
  Next 
  
  ; make sure h is positive 
  If h < 0 
    h = ~h + 1 
  EndIf 

  ProcedureReturn h%tablesize+1
  
EndProcedure 


Debug "Item count = 216555"
Debug "Bucket size = 10 million slots"
Debug ""

Dim a$(10000000)

collisioncount = 0
OpenFile(0,"sowpods2.txt")
While Not Eof(0) 
  txt$ = ReadString(0)
  hash = hashtable_hashingfunction(10000000, txt$)
  If a$(hash)=""
    a$(hash) = txt$
  Else
    collisioncount+1
  EndIf
Wend

Debug "This algorithm: " + Str(collisioncount) + " collisions"


Dim a$(10000000)

collisioncount = 0
OpenFile(0,"sowpods2.txt")
While Not Eof(0) 
  txt$ = ReadString(0)
  hash = djb2(10000000, @txt$)
  If a$(hash)=""
    a$(hash) = txt$
  Else
    collisioncount+1
  EndIf
Wend

Debug "djb2 algorithm: " + Str(collisioncount) + " collisions"
The code compares the collision frequency of your hash algorithm to that of the djb2 hash on the same 216555 items in a 10-million-slot array. You have 50k+ collisions vs 2.4k for djb2.

Also, the following test reveals a serious flaw in your hash:

Code: Select all

Procedure.l djb2(tablesize, *key.CHARACTER)
  Protected hash=5381
  While *key\c <> 0
    hash=(hash<<5)+hash + *key\c ; hash * 33 + asc
    *key+1
  Wend
  If hash>=0
    ProcedureReturn hash%(tablesize+1)
  Else
    ProcedureReturn -(hash%(tablesize+1))
  EndIf
EndProcedure

Structure hashtable_char 
  c.c 
EndStructure 

Procedure.i hashtable_hashingfunction(tablesize, string$) 

  *p.hashtable_char = @string$ 
  s = SizeOf(*p\c) 

  ; calculate hash value using the shift-add algorithm 
  For i = 1 To Len(string$) 
    h << 4 
    h + *p\c 
    *p + s 
  Next 
  
  ; make sure h is positive 
  If h < 0 
    h = ~h + 1 
  EndIf 

  ProcedureReturn h%tablesize+1
  
EndProcedure 

t$ = "abcdefghijklmnopqrstuvwxyz"
For i=0 To 16
  a$ = Right(t$,Len(t$)-i)
  Debug djb2(1000, @a$) ; well-distributed results
Next

For i=0 To 16
  a$ = Right(t$,Len(t$)-i)
  Debug hashtable_hashingfunction(1000, a$) ; all results = 31
Next

Posted: Sat Apr 18, 2009 12:54 pm
by srod
The djb2 algorithm is well understood and well tested - used for many hash-table type implementations. It is mathematically proven to produce relatively uniform results which is exactly what you want for this kind of thing.

kandl, the algorithm you are using is very similar in that both are shift-add algorithms. Easy to see though that the code you are using will not produce results which are as uniformally spread out.
don't know why no one commented on my thread but my hash function looks much cleaner (and more generic) than yours
No need to get personal. As I said, both are similar algorithms so bang goes the generic argument there!


@Netty : yes that is a good suggestion there. Just one tiny modification required...

Code: Select all

*key+SizeOf(CHARACTER)
:wink:

Thanks.