how to quickly find index of a word in middle of list

Keya · Post by **Keya** » Fri Jan 01, 2016 7:31 pm

I have a long dictionary list, each word is guaranteed unique, where the index for aardvark is 0 and the index for zoo is 999999

What's an efficient way to quickly find the index of a word in the middle?!? as opposed to starting at 0 and exhaustively checking the entire list one string at a time.

The only thing i can think of is have an index to the start word of each letter, so if im trying to find peach i only need to search the P words, but that's still ~1/26th of the original problem. Does anyone know a better way? I was wondering if "hash tables" might play a role here (seeing as each word is unique) but i dont understand them enough to make sense of how they might work for this if at all, although im guessing they would at least reduce the problem from searching strings to searching say 32bit hashes? (would i still need to search entire list?) sorry for my confusion

wilbert · Post by **wilbert** » Fri Jan 01, 2016 8:12 pm

A hash table is an option, so is a trie.
Simply recursively dividing the sorted list (don't know how that would be called as an algorithm) is also an option.
Compare with the mid item of the list. If it's above, continue with the upper halve, if it's not the lower half and keep repeating.
What's best might also depend on the dictionary. Does it change, how long is the longest word, are the first letters equally spread, do you only use a-z characters etc.
If you need to look up a lot of english words in a case insensitive way in a large dictionary, I think the trie might be fastest.

RASHAD · Post by **RASHAD** » Fri Jan 01, 2016 8:27 pm

Hi Keya
I feel that you are in the middle of a project
But if I will enforced to do such thing ,I will do it like this way where I can add, remove ,edit and get what I want in no time

Code: Select all

  Structure Dict
    List A$()
    List B$()
    List C$()
    List D$()
  EndStructure

  dc.Dict
  AddElement(dc\A$())
  dc\A$() = "Adam"

  AddElement(dc\A$())
  dc\A$() = "Alph"
  
  AddElement(dc\B$())
  dc\B$() = "Boy"

  AddElement(dc\B$())
  dc\B$() = "Bear"
  
  AddElement(dc\C$())
  dc\C$() = "Circle"

  AddElement(dc\C$())
  dc\C$() = "Circus"
  
  ForEach dc\A$()
    Debug dc\A$()
  Next
  
  Debug "  "
  
  ForEach dc\B$()
    Debug dc\B$()
  Next
  
  Debug "  "
  
  ForEach dc\C$()
    Debug dc\C$()
  Next

Little John · Post by **Little John** » Fri Jan 01, 2016 8:37 pm

wilbert wrote:Simply recursively dividing the sorted list (don't know how that would be called as an algorithm) is also an option.

That's binary search.

@Keya:
Luis' AA tree implementation might also be of interest for you.
In the same thread, I compared AA trees with PB's built-in maps (hash tables).

Post by **idle** » Fri Jan 01, 2016 10:56 pm

A Trie is probably the best option.
http://www.purebasic.fr/english/viewtop ... 12&t=43652
it's an old implementation and is ascii only but will be easy to update and make it unicode compatible

Keya · Post by **Keya** » Fri Jan 01, 2016 11:22 pm

Thankyou everyone so far for your input!!!!!

I probably should mention that the list never changes, so i never need to do modifications such as delete/insert, i simply need the index of string "peach" in the array, and yes it's fine to do a preprocess to for example calculate a list of hashes.

idle I only just saw your post then so i haven't had a good chance to take it all in yet, but looking at its dictionary demo it certainly does look in the ballpark for what i need!

however in my further reading on hash tables I just came across this bit, which clarified for me what i was confused about: http://algs4.cs.princeton.edu/34hash/

We reference key-value pairs using arrays by doing arithmetic operations to transform keys into array indices.

Search algorithms that use hashing consist of two separate parts. The first step is to compute a hash function that transforms the search key into an array index. Ideally, different keys would map to different indices. This ideal is generally beyond our reach, so we have to face the possibility that two or more different keys may hash to the same array index. Thus, the second part of a hashing search is a collision-resolution process that deals with this situation.

Hash functions: If we have an array that can hold M key-value pairs, then we need a function that can transform any given key into an index into that array: an integer in the range [0, M-1]. We seek a hash function that is both easy to compute and uniformly distributes the keys.

Strings: Modular hashing works for long keys such as strings, too: we simply treat them as huge integers. For example, the code below computes a modular hash function for a String s, where R is a small prime integer (Java uses 31).
Code: Select all
    int hash = 0;
    int M = 1000;   ;array size
    int R = 31;     ;prime
    for (int i = 0; i < s.length(); i++)
        hash = (R * hash + s.charAt(i)) % M;

Which i converted to PB:

Code: Select all

sTxt.s = "Hello, world!"
hash.l = 0
M.i = 1000   ;array size
R.i = 31     ;prime
For *pchar.Ascii = @sTxt To @sTxt + Len(sTxt)-1
   hash = (R * hash + *pchar\a) % M
Next
Debug Hex(hash)

That's the part I was confused about... the "how to get from a seemingly random 32bit checksum/hash like 0xE9F3ABE2, which is obviously outside array bounds, to a normal array index like 317", but that prime-based modular hash function makes it easy, whew! what a great little addition to the toolbelt

So I think I understand now how we can use it to jump DIRECTLY to the item, which seems even more efficient than a Trie (for this)??? (but im learning as I type here! lol) i wonder if this is the sort of fun Fred was having when implementing the Map functions!

There is still collision resolution to handle, but it seems the main problem is now dealt with!? hmmm lots more reading to do heehee

Collisions. There are chances that certain strings might cause the same key to be generated. In such cases, the individual has storage is turned into a link list or some other type of storage that can store all the duplicate keys. This is the reason why the individual hash storage is called as a bucket.

ok, that's easy enough ... so instead of expecting each array item to hold one string we expect it to hold a List of strings, and assuming no collisions each List will simply be 1 string in size. I think i get it now! lol

Q. Why does Java use 31 in the hashCode() for String?
A. It's prime, so that when the user mods out by another number, they have no common factors (unless it's a multiple of 31). 31 is also a Mersenne prime (like 127 or 8191) which is a prime number that is one less than a power of 2. This means that the mod can be done with one shift and one subtract if the machine's multiply instruction is slow.

hmm that doesn't explain though why they chose 31 over other Mersenne primes like the one either side of 31... 7 and 127, but this page addresses that.

another interesting example of "why a prime":

Code: Select all

http://stackoverflow.com/questions/3613102/why-use-a-prime-number-in-hashcode
For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:
Input       Mod8        Mod7 (prime)
0           0           0
4           4           4
8           0           1
12          4           5
16          0           2
20          4           6
24          0           3
28          4           0

Dude · Post by **Dude** » Sat Jan 02, 2016 2:57 am

Keya wrote:I have a long dictionary list, each word is guaranteed unique, where the index for aardvark is 0 and the index for zoo is 999999

What's an efficient way to quickly find the index of a word in the middle?!? as opposed to starting at 0 and exhaustively checking the entire list one string at a time.

Just note the start index of each letter, and then start searching from that letter's index. It's easy and almost instant, especially for only a tiny 999999 entries. Like this:

Code: Select all

size=999999

b_entries=size/2

Dim dic$(size)

For n=1 To b_entries-1 ; All "A" entries
  dic$(n)="a"+Str(n)
Next

For n=b_entries To size ; All "B" entries
  dic$(n)="b"+Str(n)
Next

Debug dic$(b_entries-1) ; Show end of "A" entries
Debug dic$(b_entries)   ; Show start of "B" entries
Debug ""

; Find "b499999" by starting search from beginning of list (slow)

s=GetTickCount_()
For n=1 To size
  If dic$(n)="b499999"
    Debug "Found at "+Str(n)+" in "+Str(GetTickCount_()-s)+" ms"
    Break
  EndIf
Next

; Find "b499999" by starting search from the "B" entries (fast)

s=GetTickCount_()
For n=b_entries To size
  If dic$(n)="b499999"
    Debug "Found at "+Str(n)+" in "+Str(GetTickCount_()-s)+" ms"
    Break
  EndIf
Next

Dude · Post by **Dude** » Sat Jan 02, 2016 3:05 am

You could also have 26 list arrays (a,b,c,etc) and store each word in it's letter array. Then you only need to search the specific letter array when needed, instead of all arrays.

Keya · Post by **Keya** » Sat Jan 02, 2016 3:33 am

Dude, perhaps a combination might be the way to go

btw I just had a look at PB's NewMap function, it's the same multiplier algorithm as K&R, Java, Larson, DJB etc - it seems Fred has chosen the multiplier of 65599/0x1003F

A quick google for that magic number reveals it's referred to as sdbm
http://www.cse.yorku.ca/~oz/hash.html#sdbm

this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library. it was found to do well in scrambling bits, causing better distribution of the keys and fewer splits. it also happens to be a good general hashing function with good distribution. the actual function is hash(i) = hash(i - 1) * 65599 + str; what is included below is the faster version used in gawk. [there is even a faster, duff-device version] the magic constant 65599 was picked out of thin air while experimenting with different constants, and turns out to be a prime. this is one of the algorithms used in berkeley db (see sleepycat) and elsewhere.

Lunasole · Post by **Lunasole** » Sat Jan 02, 2016 7:36 am

Keya wrote: I probably should mention that the list never changes, so i never need to do modifications such as delete/insert, i simply need the index of string "peach" in the array, and yes it's fine to do a preprocess to for example calculate a list of hashes.

You can also take a look at this simplified tree based on array indexes, maybe it can help someway

http://www.purebasic.fr/english/viewtop ... 12&t=63853

wilbert · Post by **wilbert** » Sat Jan 02, 2016 7:58 am

The advantage of a trie is that you are jumping from node to node and don't have to compare the string anymore when you reach the end node.
That is very fast but uses quite a lot of memory for a list with a million words.

The advantage of a hash table is that you are dividing all items over a number of buckets so you don't have to compare as much items but you still need to compare all items that are inside a bucket.
So the larger the hash table, the faster it will be. For text strings, the Larson hash is a good choice http://www.purebasic.fr/english/viewtop ... 85#p463985

The advantage of a binary search is that you need very little memory and it still is pretty fast.

I don't know if you have a copy of the dictionary as a zip file so we can study its characteristics but so far I still think the trie will beat the other options when it comes to speed.
The question is, do you really need the speed. If you need to look up thousands of words automatically, it really makes a difference but if the user needs to type the word and look it up, it makes not much of a difference if is takes 50 or 100 ms to look it up.

Keya · Post by **Keya** » Sat Jan 02, 2016 8:29 am

Lunasole thanks, checking it out now

wilbert the task in a nutshell is to take a smallish ~1kb string, and scan through that to look for known substrings of varying length from a dictionary of ~10,000 words. And then i need to do that for about a million strings, so yes I need to find a very fast method because the exhaustive approaches i imagine will be excruciatingly slow!

Ive spent all day learning about hash tables as they've finally clicked in my brain, fascinating things! (now i can use checksums for more than just error detection!)
but I'm still very open-minded to all approaches at this stage, and Tries i would like to spend some time learning more about anyway

wilbert · Post by **wilbert** » Sat Jan 02, 2016 8:38 am

Keya wrote:scan through that to look for known substrings

In that case, do you really need the index inside the dictionary or do you simply need to know if a word exists inside the dictionary.
It may sound similar but are very different tasks.

Keya · Post by **Keya** » Sat Jan 02, 2016 9:09 am

yes their existence alone is not enough, i must get the unique index of the found dictionary words. It's a tricky one! One of those ones that sounds very simple to begin with, but then ...

(like from a few months ago my loaded-unbeknownst-to-me "i just need to find a needle string in a haystack" question, lol)

wilbert · Post by **wilbert** » Sat Jan 02, 2016 9:42 am

Keya wrote:yes their existence alone is not enough, i must get the unique index of the found dictionary words. It's a tricky one!

Still, your task changed a lot from what you initially mentioned if the dictionary only contains about 10.000 words.
The main problem with hashes are the collisions. If the dictionary only consists of 10.000 words, you can do with two bytes for each index value.

If you reserve 16 megabyte of ram, you can store 8388608 16-bit index values.
If you truncate the hash to 23 bits (0 - 8388607) and have only about 10.000 items, the chance of collisions is small.
You can reserve a special index value (for example $FFFF) to indicate a collision and use the PB Map as a fallback for the few collisions that do occur.
For all those items where there is no collision, you could instantly retrieve the index value this way after calculating the hash.

PureBasic Forums - English

how to quickly find index of a word in middle of list

how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list

Re: how to quickly find index of a word in middle of list