Page 2 of 2

Re: how to quickly find index of a word in middle of list

Posted: Sat Jan 02, 2016 11:52 am
by Michael Vogel
Didn't read the latest postings, so the following approach might be useless :wink: - I also was not able to find out, if you need to find the entry "hazelnut tree" when searching for a "nut"?

If not, the following start could help - because on a constant word table, also the pointer table also have after do be done only one time. To speed up everything, a fourth character could be included in the pointer list and a binary search in the find routine would to the rest...

Code: Select all


#Words=1234567
#Pointer=26*26*26
Global Dim abc.s(#Words)
Global Dim ptr(#Pointer)

Debug "Create sorted word list..."
RandomSeed(0)
Dim aaa(3)
init.s="aaa"

z=0
more=1
While z<#Words
	z+1
	abc(z)=init
	n=Random(15)
	For i=1 To n
		abc(z)+Chr('a'+Random(25))
	Next i

	If more And Random(50)=0
		n=0
		m=3
		Repeat
			If aaa(m)<25
				aaa(m)+1
				n=1
				;If m=1
				;	Debug Chr(aaa(1)+'a')
				;EndIf
			Else
				aaa(m)=0
				m-1
				If m=0
					n=1
					more=0
				EndIf
			EndIf
		Until n
		If more
			init=Chr('a'+aaa(1))+Chr('a'+aaa(2))+Chr('a'+aaa(3))
		EndIf
	EndIf
Wend

Debug "Create pointer..."
z=#Words
While z>0
	ptr((PeekA(@abc(z))-'a')*576+(PeekA(@abc(z)+SizeOf(Character))-'a')*26+(PeekA(@abc(z)+SizeOf(Character)<<1)-'a'))=z
	z-1
Wend

z=#Pointer
ptr(#Pointer)=#Words
While z>0
	z-1
	If ptr(z)=0
		ptr(z)=-Abs(ptr(z+1))
	EndIf
Wend

Procedure find(s.s)
	
	Protected pointer,n
	
	pointer=(PeekA(@s)-'a')*576+(PeekA(@s+SizeOf(Character))-'a')*26+(PeekA(@s+SizeOf(Character)<<1)-'a')
	
	If pointer>=0 And pointer<#Pointer
		n=Abs(ptr(pointer+1))
		pointer=ptr(pointer)
		If pointer>0
			While pointer<n
				If abc(pointer)=s
					Debug "Found '"+s+"' at position "+Str(pointer)
					ProcedureReturn pointer
				EndIf
				pointer+1
			Wend
		EndIf
	EndIf
	
	ProcedureReturn #Null
		
EndProcedure

	
Debug "Search..."
Debug find("ahedtymjfcszty")
Debug find("cvhnvwzssyrkfmpzsa")
Debug find("zzzdacknjkxghp")

Re: how to quickly find index of a word in middle of list

Posted: Sat Jan 02, 2016 2:03 pm
by Keya
wilbert wrote:You can reserve a special index value (for example $FFFF) to indicate a collision and use the PB Map as a fallback for the few collisions that do occur.
For all those items where there is no collision, you could instantly retrieve the index value this way after calculating the hash.
brilliant! I'll have to give it some more thought about implementation, but throwing all the collisions in the one bucket sounds like it could be a great little efficiency boost

Because I'll be able to preprocess the ~10k items im wondering how feasible it is to create a perfect hash function with no collisions - it seems possible but at this stage i dont personally understand perfect hashes well enough so i'll probably have to put perfect hashing aside and accommodate the few collisions that do occur, and i dont want to clog up my code with kilobytes of perfect hash code just to deal with a small handful of collisions, although I am tempted to try several different multipliers (31, 33 etc) just to see if that results in a cheap way to get perfect hash, storing the multiplier value alongside of course. Testing should be interesting!

Michael,
Michael Vogel wrote:Didn't read the latest postings, so the following approach might be useless :wink: - I also was not able to find out, if you need to find the entry "hazelnut tree" when searching for a "nut"?
good question heehee - I only need to find "nut" in "hazelnut" IF "nut" is itself listed as a single word (so no, i don't need to search for substrings)
Thanks for the sample, checking it out now!

Re: how to quickly find index of a word in middle of list

Posted: Sat Jan 02, 2016 2:27 pm
by Erich
If the list of 10,000 words is fixed in advance, you could create what's called a perfect hash function. This has no collisions at all. Then it all boils down to finding the string search algorithm that is optimal for 1kb strings.

Re: how to quickly find index of a word in middle of list

Posted: Sat Jan 02, 2016 3:39 pm
by wilbert
Keya wrote:Because I'll be able to preprocess the ~10k items im wondering how feasible it is to create a perfect hash function with no collisions
It's worth to invest a bit of time to try to find such a hash function. If you do find one for the list of words you have, it will makes things both faster and easier :)

Re: how to quickly find index of a word in middle of list

Posted: Sat Jan 02, 2016 8:23 pm
by normeus
for the first post KISS pureBasics NewMap is really fast

Code: Select all

NewMap millionRecords.q(1024)

millionRecords("aardvark")=0
millionRecords("abel")=1
;.......... add a few thousand
millionRecords("zoo")=999999
millionRecords("Zzyzx, California")=1000000


Debug millionRecords("abel")
this is fast enough for a few million records.
on your last post however, you have a different scenario. Sounds like you are searching a small list of words to match a larger list of words.
take the smallest list and sort it.
you will have to split millionRecord Map alphabetically so that you only have to search a's once b's once etc..
someone probably mentioned this before but I just wanted to make sure NewMap was shown as a real fast solution for this kind of problem.

Thank you.
Norm.