how to quickly find index of a word in middle of list

Just starting out? Need help? Post your questions and find answers here.
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

how to quickly find index of a word in middle of list

Post by Keya »

I have a long dictionary list, each word is guaranteed unique, where the index for aardvark is 0 and the index for zoo is 999999

What's an efficient way to quickly find the index of a word in the middle?!? as opposed to starting at 0 and exhaustively checking the entire list one string at a time.

The only thing i can think of is have an index to the start word of each letter, so if im trying to find peach i only need to search the P words, but that's still ~1/26th of the original problem. Does anyone know a better way? I was wondering if "hash tables" might play a role here (seeing as each word is unique) but i dont understand them enough to make sense of how they might work for this if at all, although im guessing they would at least reduce the problem from searching strings to searching say 32bit hashes? (would i still need to search entire list?) sorry for my confusion :)
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: how to quickly find index of a word in middle of list

Post by wilbert »

A hash table is an option, so is a trie.
Simply recursively dividing the sorted list (don't know how that would be called as an algorithm) is also an option.
Compare with the mid item of the list. If it's above, continue with the upper halve, if it's not the lower half and keep repeating.
What's best might also depend on the dictionary. Does it change, how long is the longest word, are the first letters equally spread, do you only use a-z characters etc.
If you need to look up a lot of english words in a case insensitive way in a large dictionary, I think the trie might be fastest.
Windows (x64)
Raspberry Pi OS (Arm64)
RASHAD
PureBasic Expert
PureBasic Expert
Posts: 4946
Joined: Sun Apr 12, 2009 6:27 am

Re: how to quickly find index of a word in middle of list

Post by RASHAD »

Hi Keya
I feel that you are in the middle of a project
But if I will enforced to do such thing ,I will do it like this way where I can add, remove ,edit and get what I want in no time

Code: Select all

  Structure Dict
    List A$()
    List B$()
    List C$()
    List D$()
  EndStructure

  dc.Dict
  AddElement(dc\A$())
  dc\A$() = "Adam"

  AddElement(dc\A$())
  dc\A$() = "Alph"
  
  AddElement(dc\B$())
  dc\B$() = "Boy"

  AddElement(dc\B$())
  dc\B$() = "Bear"
  
  AddElement(dc\C$())
  dc\C$() = "Circle"

  AddElement(dc\C$())
  dc\C$() = "Circus"
  
  ForEach dc\A$()
    Debug dc\A$()
  Next
  
  Debug "  "
  
  ForEach dc\B$()
    Debug dc\B$()
  Next
  
  Debug "  "
  
  ForEach dc\C$()
    Debug dc\C$()
  Next
Egypt my love
Little John
Addict
Addict
Posts: 4778
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: how to quickly find index of a word in middle of list

Post by Little John »

wilbert wrote:Simply recursively dividing the sorted list (don't know how that would be called as an algorithm) is also an option.
That's binary search. :-)

@Keya:
Luis' AA tree implementation might also be of interest for you.
In the same thread, I compared AA trees with PB's built-in maps (hash tables).
User avatar
idle
Always Here
Always Here
Posts: 5839
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: how to quickly find index of a word in middle of list

Post by idle »

A Trie is probably the best option.
http://www.purebasic.fr/english/viewtop ... 12&t=43652
it's an old implementation and is ascii only but will be easy to update and make it unicode compatible
Windows 11, Manjaro, Raspberry Pi OS
Image
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: how to quickly find index of a word in middle of list

Post by Keya »

Thankyou everyone so far for your input!!!!! :)

I probably should mention that the list never changes, so i never need to do modifications such as delete/insert, i simply need the index of string "peach" in the array, and yes it's fine to do a preprocess to for example calculate a list of hashes.

idle I only just saw your post then so i haven't had a good chance to take it all in yet, but looking at its dictionary demo it certainly does look in the ballpark for what i need!

however in my further reading on hash tables I just came across this bit, which clarified for me what i was confused about: http://algs4.cs.princeton.edu/34hash/
We reference key-value pairs using arrays by doing arithmetic operations to transform keys into array indices.

Search algorithms that use hashing consist of two separate parts. The first step is to compute a hash function that transforms the search key into an array index. Ideally, different keys would map to different indices. This ideal is generally beyond our reach, so we have to face the possibility that two or more different keys may hash to the same array index. Thus, the second part of a hashing search is a collision-resolution process that deals with this situation.

Hash functions: If we have an array that can hold M key-value pairs, then we need a function that can transform any given key into an index into that array: an integer in the range [0, M-1]. We seek a hash function that is both easy to compute and uniformly distributes the keys.

Strings: Modular hashing works for long keys such as strings, too: we simply treat them as huge integers. For example, the code below computes a modular hash function for a String s, where R is a small prime integer (Java uses 31).

Code: Select all

    int hash = 0;
    int M = 1000;   ;array size
    int R = 31;     ;prime
    for (int i = 0; i < s.length(); i++)
        hash = (R * hash + s.charAt(i)) % M;
Which i converted to PB:

Code: Select all

sTxt.s = "Hello, world!"
hash.l = 0
M.i = 1000   ;array size
R.i = 31     ;prime
For *pchar.Ascii = @sTxt To @sTxt + Len(sTxt)-1
   hash = (R * hash + *pchar\a) % M
Next
Debug Hex(hash)
That's the part I was confused about... the "how to get from a seemingly random 32bit checksum/hash like 0xE9F3ABE2, which is obviously outside array bounds, to a normal array index like 317", but that prime-based modular hash function makes it easy, whew! what a great little addition to the toolbelt

So I think I understand now how we can use it to jump DIRECTLY to the item, which seems even more efficient than a Trie (for this)??? (but im learning as I type here! lol) i wonder if this is the sort of fun Fred was having when implementing the Map functions!

There is still collision resolution to handle, but it seems the main problem is now dealt with!? hmmm lots more reading to do heehee
Collisions. There are chances that certain strings might cause the same key to be generated. In such cases, the individual has storage is turned into a link list or some other type of storage that can store all the duplicate keys. This is the reason why the individual hash storage is called as a bucket.
ok, that's easy enough ... so instead of expecting each array item to hold one string we expect it to hold a List of strings, and assuming no collisions each List will simply be 1 string in size. I think i get it now! lol
Q. Why does Java use 31 in the hashCode() for String?
A. It's prime, so that when the user mods out by another number, they have no common factors (unless it's a multiple of 31). 31 is also a Mersenne prime (like 127 or 8191) which is a prime number that is one less than a power of 2. This means that the mod can be done with one shift and one subtract if the machine's multiply instruction is slow.
hmm that doesn't explain though why they chose 31 over other Mersenne primes like the one either side of 31... 7 and 127, but this page addresses that.

another interesting example of "why a prime":

Code: Select all

http://stackoverflow.com/questions/3613102/why-use-a-prime-number-in-hashcode
For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:
Input       Mod8        Mod7 (prime)
0           0           0
4           4           4
8           0           1
12          4           5
16          0           2
20          4           6
24          0           3
28          4           0
Dude
Addict
Addict
Posts: 1907
Joined: Mon Feb 16, 2015 2:49 pm

Re: how to quickly find index of a word in middle of list

Post by Dude »

Keya wrote:I have a long dictionary list, each word is guaranteed unique, where the index for aardvark is 0 and the index for zoo is 999999

What's an efficient way to quickly find the index of a word in the middle?!? as opposed to starting at 0 and exhaustively checking the entire list one string at a time.
Just note the start index of each letter, and then start searching from that letter's index. It's easy and almost instant, especially for only a tiny 999999 entries. Like this:

Code: Select all

size=999999

b_entries=size/2

Dim dic$(size)

For n=1 To b_entries-1 ; All "A" entries
  dic$(n)="a"+Str(n)
Next

For n=b_entries To size ; All "B" entries
  dic$(n)="b"+Str(n)
Next

Debug dic$(b_entries-1) ; Show end of "A" entries
Debug dic$(b_entries)   ; Show start of "B" entries
Debug ""

; Find "b499999" by starting search from beginning of list (slow)

s=GetTickCount_()
For n=1 To size
  If dic$(n)="b499999"
    Debug "Found at "+Str(n)+" in "+Str(GetTickCount_()-s)+" ms"
    Break
  EndIf
Next

; Find "b499999" by starting search from the "B" entries (fast)

s=GetTickCount_()
For n=b_entries To size
  If dic$(n)="b499999"
    Debug "Found at "+Str(n)+" in "+Str(GetTickCount_()-s)+" ms"
    Break
  EndIf
Next
Dude
Addict
Addict
Posts: 1907
Joined: Mon Feb 16, 2015 2:49 pm

Re: how to quickly find index of a word in middle of list

Post by Dude »

You could also have 26 list arrays (a,b,c,etc) and store each word in it's letter array. Then you only need to search the specific letter array when needed, instead of all arrays.
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: how to quickly find index of a word in middle of list

Post by Keya »

Dude, perhaps a combination might be the way to go :)

btw I just had a look at PB's NewMap function, it's the same multiplier algorithm as K&R, Java, Larson, DJB etc - it seems Fred has chosen the multiplier of 65599/0x1003F :)
A quick google for that magic number reveals it's referred to as sdbm
http://www.cse.yorku.ca/~oz/hash.html#sdbm
this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library. it was found to do well in scrambling bits, causing better distribution of the keys and fewer splits. it also happens to be a good general hashing function with good distribution. the actual function is hash(i) = hash(i - 1) * 65599 + str; what is included below is the faster version used in gawk. [there is even a faster, duff-device version] the magic constant 65599 was picked out of thin air while experimenting with different constants, and turns out to be a prime. this is one of the algorithms used in berkeley db (see sleepycat) and elsewhere.
User avatar
Lunasole
Addict
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA
Contact:

Re: how to quickly find index of a word in middle of list

Post by Lunasole »

Keya wrote: I probably should mention that the list never changes, so i never need to do modifications such as delete/insert, i simply need the index of string "peach" in the array, and yes it's fine to do a preprocess to for example calculate a list of hashes.
You can also take a look at this simplified tree based on array indexes, maybe it can help someway

http://www.purebasic.fr/english/viewtop ... 12&t=63853
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: how to quickly find index of a word in middle of list

Post by wilbert »

The advantage of a trie is that you are jumping from node to node and don't have to compare the string anymore when you reach the end node.
That is very fast but uses quite a lot of memory for a list with a million words.

The advantage of a hash table is that you are dividing all items over a number of buckets so you don't have to compare as much items but you still need to compare all items that are inside a bucket.
So the larger the hash table, the faster it will be. For text strings, the Larson hash is a good choice http://www.purebasic.fr/english/viewtop ... 85#p463985

The advantage of a binary search is that you need very little memory and it still is pretty fast.

I don't know if you have a copy of the dictionary as a zip file so we can study its characteristics but so far I still think the trie will beat the other options when it comes to speed.
The question is, do you really need the speed. If you need to look up thousands of words automatically, it really makes a difference but if the user needs to type the word and look it up, it makes not much of a difference if is takes 50 or 100 ms to look it up.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: how to quickly find index of a word in middle of list

Post by Keya »

Lunasole thanks, checking it out now :)

wilbert the task in a nutshell is to take a smallish ~1kb string, and scan through that to look for known substrings of varying length from a dictionary of ~10,000 words. And then i need to do that for about a million strings, so yes I need to find a very fast method because the exhaustive approaches i imagine will be excruciatingly slow! :)
Ive spent all day learning about hash tables as they've finally clicked in my brain, fascinating things! (now i can use checksums for more than just error detection!)
but I'm still very open-minded to all approaches at this stage, and Tries i would like to spend some time learning more about anyway
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: how to quickly find index of a word in middle of list

Post by wilbert »

Keya wrote:scan through that to look for known substrings
In that case, do you really need the index inside the dictionary or do you simply need to know if a word exists inside the dictionary.
It may sound similar but are very different tasks.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: how to quickly find index of a word in middle of list

Post by Keya »

yes their existence alone is not enough, i must get the unique index of the found dictionary words. It's a tricky one! One of those ones that sounds very simple to begin with, but then ... :) (like from a few months ago my loaded-unbeknownst-to-me "i just need to find a needle string in a haystack" question, lol)
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: how to quickly find index of a word in middle of list

Post by wilbert »

Keya wrote:yes their existence alone is not enough, i must get the unique index of the found dictionary words. It's a tricky one!
Still, your task changed a lot from what you initially mentioned if the dictionary only contains about 10.000 words.
The main problem with hashes are the collisions. If the dictionary only consists of 10.000 words, you can do with two bytes for each index value.

If you reserve 16 megabyte of ram, you can store 8388608 16-bit index values.
If you truncate the hash to 23 bits (0 - 8388607) and have only about 10.000 items, the chance of collisions is small.
You can reserve a special index value (for example $FFFF) to indicate a collision and use the PB Map as a fallback for the few collisions that do occur.
For all those items where there is no collision, you could instantly retrieve the index value this way after calculating the hash.
Windows (x64)
Raspberry Pi OS (Arm64)
Post Reply