Amazing Anagrams
- netmaestro
- PureBasic Bullfrog
- Posts: 8451
- Joined: Wed Jul 06, 2005 5:42 am
- Location: Fort Nelson, BC, Canada
I just realized - The extra diskspace can be reduced by a lot. Any entry in the pointer database that doesn't have a duplicate represents a word with no anagram. So it doesn't need to be there. We need a sixth step - scan the pointer file for orphans and remove them before building the database. It'll probably cut the size by more than half.
BERESHEIT
- netmaestro
- PureBasic Bullfrog
- Posts: 8451
- Joined: Wed Jul 06, 2005 5:42 am
- Location: Fort Nelson, BC, Canada
Ok, removing orphans from the final list before creating the database reduced the db size from 4.4mb to 0.8 mb. Also, my sowpods.txt had trailing spaces in it, so removing those dropped it from 3.5mb to 2.3mb. The total diskspace usage goes from 8mb down to just over 3. Also, optimizing the index database speeds the searches up dramatically - Average 60 ms -> now 12-15ms. The original posting of the method is updated with these sizes and times.
One consideration as we move on to phrases is that the word list needs to be expanded to include things like names, slang words, contractions such as I'm, they've, we're, etc. and probably a lot more. Still thinking about an approach.
One consideration as we move on to phrases is that the word list needs to be expanded to include things like names, slang words, contractions such as I'm, they've, we're, etc. and probably a lot more. Still thinking about an approach.
BERESHEIT
For phrases you would get the sorted letters for the first phrase (Clint Eastwood = acdeilnoosttw).
You can then build a list of sorted letter combinations that would match a portion of the first phrase's list (i.e. dlo, einos, estw, adel, no, doow, acinot, etc.) .
You would then pair these matches up to get a winning combination by matching length first, then letters. Using our sample phrase as a reference point, this means you would look for patterns that add up to 13 (i.e. 3 + 4 + 6), trying each one in sequence (keeping in mind the length of the patterns already found as a limit on the possibilities).
Next you would match letters for the patterns of a given length (dlo estw acinot).
You would then look up the words for these patterns (old west action) or (cation stew lod) or even (wild sea cotton). Word order would then be chosen manually to apply meaning to a collection that was chosen.
Whew. I think it is almost as easy to do this manually. I'm not up to speed on using databases yet so I haven't been able to test this method out yet.
You can then build a list of sorted letter combinations that would match a portion of the first phrase's list (i.e. dlo, einos, estw, adel, no, doow, acinot, etc.) .
You would then pair these matches up to get a winning combination by matching length first, then letters. Using our sample phrase as a reference point, this means you would look for patterns that add up to 13 (i.e. 3 + 4 + 6), trying each one in sequence (keeping in mind the length of the patterns already found as a limit on the possibilities).
Next you would match letters for the patterns of a given length (dlo estw acinot).
You would then look up the words for these patterns (old west action) or (cation stew lod) or even (wild sea cotton). Word order would then be chosen manually to apply meaning to a collection that was chosen.
Whew. I think it is almost as easy to do this manually. I'm not up to speed on using databases yet so I haven't been able to test this method out yet.

- netmaestro
- PureBasic Bullfrog
- Posts: 8451
- Joined: Wed Jul 06, 2005 5:42 am
- Location: Fort Nelson, BC, Canada
Talk about cooking with gas. I looked at the code I had utilizing a SQLite database for the indexes and I thought something else might be faster. So I replaced it with a binary search procedure and it cut the search time by a factor of ten. An average search now takes from 0 - 0.6 milliseconds! I had to resort to QueryPerformanceCounter_() to measure it. No matter how long the search word or how many anagrams are found, I can't make a search take longer than 0.6 milliseconds. Updated code is in the download.
Now, truly, on to phrases. I'm never going to beat these times.
Now, truly, on to phrases. I'm never going to beat these times.
BERESHEIT
@netmaestro: I have another surprise for you, I've cut your times in half!netmaestro wrote:Talk about cooking with gas. I looked at the code I had utilizing a SQLite database for the indexes and I thought something else might be faster. So I replaced it with a binary search procedure and it cut the search time by a factor of ten. An average search now takes from 0 - 0.6 milliseconds! I had to resort to QueryPerformanceCounter_() to measure it. No matter how long the search word or how many anagrams are found, I can't make a search take longer than 0.6 milliseconds. Updated code is in the download.
Now, truly, on to phrases. I'm never going to beat these times.

I'll post the code here, instead of a link, for easier viewing of the methods. A casual note on the coding, I've added more structure to the code as it's length has increased. To make this version I took the procedures Search() and TicksHQ() from your code. I also rewrote my sortLetters() routine and build the index array anew each time the program runs. This way you only have to supply a dictionary file and you are good to go. It also means you can add to your dictionary file and you'll automatically get a working index.
That brings me to another point, I discovered your index file has 2 inaccuracies in it. To discover one of them, find the anagrams for sawblade using your program. The other is that you are missing indexes for the words AHA and AAH so you won't be able to find them in your program either even though they are in your dictionary file. Discovering this was a real aha moment.

To compare the speed results with yours I had to increase the reported time resolution from tenths to hundreths of a second. Yours reported 0.17 milliseconds for a search on "easter" and mine reported a time of 0.07 milliseconds. So to paraphrase you, now you are cooking with more gas.

Here's the code:
Code: Select all
;-constants
#DictionaryFile = "sowpods2.txt"
Enumeration
;windows
#mainWindow
;gadgets
#Word_str = 0
#List_btn
#AnagramList_lvw
EndEnumeration
;-structures
Structure pattern
letters.s
index.i
EndStructure
;-global variables
Global Dim words.s(0) ;resized later in init procedure
Global Dim index.pattern(0) ;resized later in init procedure
;-procedures
Procedure handleError(p_result.l, p_text.s) ;display error message if result= 0 and then END program
;sample use HandleError(InitSprite(), "Could not initialize sprites.")
;Procedure borrowed from Kale in PureBasic forum
If p_result = 0
MessageRequester("Error", p_text, #PB_MessageRequester_Ok)
End
EndIf
EndProcedure
Procedure.i TicksHQ()
;Procedure borrowed from netmaestro in PureBasic forum
Static maxfreq.q
Protected T.q
If maxfreq=0
QueryPerformanceFrequency_(@maxfreq)
maxfreq=maxfreq/100000
EndIf
QueryPerformanceCounter_(@T)
ProcedureReturn T/maxfreq ;Result is in hundreths-of-a-millisecond
EndProcedure
Procedure.s sortLetters(*word.Character, length)
Protected Dim letters.c(length)
CopyMemory(*word, @letters(), length * SizeOf(Character))
SortArray(letters(), #PB_Sort_Ascending, 0, length - 1)
ProcedureReturn PeekS(@letters(), length)
EndProcedure
Procedure initDictionaryPatterns() ;returns #True if successful
Protected i, j, isFirst, numWords, fileNumber
Protected NewList patterns.pattern()
numWords = 200000
Redim words.s(numWords)
fileNumber = ReadFile(#PB_Any, #DictionaryFile) ; Read all legal words into array words()
If Not fileNumber
ProcedureReturn #False ;failed to initialize dictionary and patterns
EndIf
i = 1
While Not Eof(fileNumber)
words(i) = UCase(Trim(ReadString(fileNumber)))
i + 1
If i > numWords
numWords + 20000
Redim words.s(numWords)
EndIf
Wend
numWords = i - 1
CloseFile(fileNumber)
Redim words.s(numWords)
For i = 1 To numWords
AddElement(patterns())
patterns()\letters = sortLetters(@words(i), Len(words(i)))
patterns()\index = i
Next
SortStructuredList(patterns(), #PB_Sort_Ascending, OffsetOf(pattern\letters), #PB_Sort_String)
;build array of pattern index for only duplicate patterns (no singles)
Redim index.pattern(numWords)
i = 0
isFirst = #True
ForEach patterns()
If isFirst
index(i)\letters = patterns()\letters
index(i)\index = patterns()\index
j = i
isFirst = #False
ElseIf patterns()\letters = index(i)\letters
j + 1
index(j)\letters = patterns()\letters
index(j)\index = patterns()\index
ElseIf j = i ;test for a single
isFirst = #True
PreviousElement(patterns())
Else
i = j + 1
isFirst = #True
PreviousElement(patterns())
EndIf
Next
If j = i
i = i - 1
EndIf
Redim index.pattern(i)
ProcedureReturn #True ;success completion
EndProcedure
Procedure Search(Array patterns.pattern(1), key$)
Protected lo = 1
Protected hi = ArraySize(patterns())
Protected mid
While lo <= hi
mid = lo + (hi-lo) / 2
If key$ = patterns(mid)\letters ; match
While patterns(mid - 1)\letters = key$
mid - 1
Wend
ProcedureReturn mid
ElseIf key$ < patterns(mid)\letters
hi = mid - 1 ; search low end of array
Else
lo = mid + 1 ; search high end of array
EndIf
Wend
ProcedureReturn 0
EndProcedure
Procedure findAnagram(word.s, List wordList.s())
Protected matchLength = Len(word)
If matchLength = 0
ProcedureReturn 0
EndIf
word = UCase(word)
Protected matchLetters.s = sortLetters(@word, matchLength)
Protected patternIndex, wordIndex
ClearList(wordList())
patternIndex = Search(index(), matchLetters)
If patternIndex
While index(patternIndex)\letters = matchLetters
wordIndex = index(patternIndex)\index
If words(wordIndex) <> word
AddElement(wordList())
wordList() = words(wordIndex)
EndIf
patternIndex + 1
Wend
EndIf
ProcedureReturn ListSize(wordList())
EndProcedure
;-main
handleError(initDictionaryPatterns(), "Unable To initialize word List")
handleError(OpenWindow(#mainWindow, 0, 0, 290, 500, "Anagram Finder", #PB_Window_ScreenCentered | #PB_Window_SystemMenu), "Unable to open Window")
StringGadget(#Word_str, 10, 10, 160, 20, "", #PB_String_LowerCase)
ButtonGadget(#List_btn, 180, 10, 100, 20, "List")
ListViewGadget(#AnagramList_lvw, 10, 40, 270, 455)
Define event, beginTime, totalTime
NewList wordList.s()
Repeat
event = WaitWindowEvent(10)
Select event
Case #PB_Event_Gadget
Select EventGadget()
Case #Word_str
If EventType() = #PB_EventType_Focus
SendMessage_(GadgetID(#Word_str), #EM_SETSEL, 0, -1)
EndIf
Case #List_btn
ClearGadgetItems(#AnagramList_lvw)
Define StartTime = TicksHQ()
If findAnagram(GetGadgetText(#Word_str), wordList())
ForEach wordList()
AddGadgetItem(#AnagramList_lvw, -1, wordList())
Next
Define totalTime = TicksHQ() - StartTime
Else
AddGadgetItem(#AnagramList_lvw, -1, "< NONE FOUND >")
EndIf
AddGadgetItem(#AnagramList_lvw, -1, "")
AddGadgetItem(#AnagramList_lvw, -1, "Search time: " + Str(totalTime) + " hundreths of a millisecond")
EndSelect
EndSelect
Until event = #PB_Event_CloseWindow