Amazing Anagrams

netmaestro · Post by **netmaestro** » Sat Apr 11, 2009 1:30 pm

I just realized - The extra diskspace can be reduced by a lot. Any entry in the pointer database that doesn't have a duplicate represents a word with no anagram. So it doesn't need to be there. We need a sixth step - scan the pointer file for orphans and remove them before building the database. It'll probably cut the size by more than half.

netmaestro · Post by **netmaestro** » Sat Apr 11, 2009 3:58 pm

Ok, removing orphans from the final list before creating the database reduced the db size from 4.4mb to 0.8 mb. Also, my sowpods.txt had trailing spaces in it, so removing those dropped it from 3.5mb to 2.3mb. The total diskspace usage goes from 8mb down to just over 3. Also, optimizing the index database speeds the searches up dramatically - Average 60 ms -> now 12-15ms. The original posting of the method is updated with these sizes and times.

One consideration as we move on to phrases is that the word list needs to be expanded to include things like names, slang words, contractions such as I'm, they've, we're, etc. and probably a lot more. Still thinking about an approach.

Demivec · Post by **Demivec** » Sat Apr 11, 2009 4:49 pm

For phrases you would get the sorted letters for the first phrase (Clint Eastwood = acdeilnoosttw).

You can then build a list of sorted letter combinations that would match a portion of the first phrase's list (i.e. dlo, einos, estw, adel, no, doow, acinot, etc.) .

You would then pair these matches up to get a winning combination by matching length first, then letters. Using our sample phrase as a reference point, this means you would look for patterns that add up to 13 (i.e. 3 + 4 + 6), trying each one in sequence (keeping in mind the length of the patterns already found as a limit on the possibilities).

Next you would match letters for the patterns of a given length (dlo estw acinot).

You would then look up the words for these patterns (old west action) or (cation stew lod) or even (wild sea cotton). Word order would then be chosen manually to apply meaning to a collection that was chosen.

Whew. I think it is almost as easy to do this manually. I'm not up to speed on using databases yet so I haven't been able to test this method out yet.

netmaestro · Post by **netmaestro** » Sat Apr 11, 2009 9:26 pm

Talk about cooking with gas. I looked at the code I had utilizing a SQLite database for the indexes and I thought something else might be faster. So I replaced it with a binary search procedure and it cut the search time by a factor of ten. An average search now takes from 0 - 0.6 milliseconds! I had to resort to QueryPerformanceCounter_() to measure it. No matter how long the search word or how many anagrams are found, I can't make a search take longer than 0.6 milliseconds. Updated code is in the download.

Now, truly, on to phrases. I'm never going to beat these times.

Demivec · Post by **Demivec** » Mon Apr 13, 2009 1:27 pm

netmaestro wrote:Talk about cooking with gas. I looked at the code I had utilizing a SQLite database for the indexes and I thought something else might be faster. So I replaced it with a binary search procedure and it cut the search time by a factor of ten. An average search now takes from 0 - 0.6 milliseconds! I had to resort to QueryPerformanceCounter_() to measure it. No matter how long the search word or how many anagrams are found, I can't make a search take longer than 0.6 milliseconds. Updated code is in the download.

Now, truly, on to phrases. I'm never going to beat these times.

@netmaestro: I have another surprise for you, I've cut your times in half!

I'll post the code here, instead of a link, for easier viewing of the methods. A casual note on the coding, I've added more structure to the code as it's length has increased. To make this version I took the procedures Search() and TicksHQ() from your code. I also rewrote my sortLetters() routine and build the index array anew each time the program runs. This way you only have to supply a dictionary file and you are good to go. It also means you can add to your dictionary file and you'll automatically get a working index.

That brings me to another point, I discovered your index file has 2 inaccuracies in it. To discover one of them, find the anagrams for sawblade using your program. The other is that you are missing indexes for the words AHA and AAH so you won't be able to find them in your program either even though they are in your dictionary file. Discovering this was a real aha moment.

To compare the speed results with yours I had to increase the reported time resolution from tenths to hundreths of a second. Yours reported 0.17 milliseconds for a search on "easter" and mine reported a time of 0.07 milliseconds. So to paraphrase you, now you are cooking with more gas.

I think the speed improvements that are involved mainly have to do with eliminating more of the string functions, such as Left() and Mid(). I did this by using a structured index array instead of the combined method you had used, the storage used by the array is only 44% of it's former size (for both memory and disk space).

Here's the code:

Code: Select all

;-constants
#DictionaryFile = "sowpods2.txt" 

Enumeration
  ;windows
  #mainWindow
  ;gadgets
  #Word_str = 0
  #List_btn
  #AnagramList_lvw
EndEnumeration

;-structures
Structure pattern
  letters.s
  index.i 
EndStructure

;-global variables
Global Dim words.s(0) ;resized later in init procedure
Global Dim index.pattern(0) ;resized later in init procedure

;-procedures
Procedure handleError(p_result.l, p_text.s) ;display error message if result= 0 and then END program
  ;sample use HandleError(InitSprite(), "Could not initialize sprites.")
  ;Procedure borrowed from Kale in PureBasic forum
  If p_result = 0
    MessageRequester("Error", p_text, #PB_MessageRequester_Ok)
    End
  EndIf
EndProcedure

Procedure.i TicksHQ()
  ;Procedure borrowed from netmaestro in PureBasic forum
  Static maxfreq.q 
  Protected T.q 
  If maxfreq=0 
    QueryPerformanceFrequency_(@maxfreq) 
    maxfreq=maxfreq/100000
  EndIf 
  QueryPerformanceCounter_(@T) 
  ProcedureReturn T/maxfreq ;Result is in hundreths-of-a-millisecond
EndProcedure 

Procedure.s sortLetters(*word.Character, length)
  Protected Dim letters.c(length)
  
  CopyMemory(*word, @letters(), length * SizeOf(Character))
  SortArray(letters(), #PB_Sort_Ascending, 0, length - 1)
  ProcedureReturn PeekS(@letters(), length)
EndProcedure

Procedure initDictionaryPatterns() ;returns #True if successful
  Protected i, j, isFirst, numWords, fileNumber
  Protected NewList patterns.pattern()
  
  numWords = 200000
  Redim words.s(numWords)
  
  fileNumber = ReadFile(#PB_Any, #DictionaryFile) ; Read all legal words into array words()
  If Not fileNumber
    ProcedureReturn #False ;failed to initialize dictionary and patterns
  EndIf 
  
  i = 1
  While Not Eof(fileNumber)
    words(i) = UCase(Trim(ReadString(fileNumber)))
    i + 1
    If i > numWords
      numWords + 20000
      Redim words.s(numWords)
    EndIf 
  Wend
  numWords = i - 1
  CloseFile(fileNumber)
  Redim words.s(numWords)
  
  For i = 1 To numWords
    AddElement(patterns())
    patterns()\letters = sortLetters(@words(i), Len(words(i)))
    patterns()\index = i
  Next
   
  SortStructuredList(patterns(), #PB_Sort_Ascending, OffsetOf(pattern\letters), #PB_Sort_String)
  
  ;build array of pattern index for only duplicate patterns (no singles)
  Redim index.pattern(numWords)
  i = 0
  isFirst = #True
  ForEach patterns()
    If isFirst 
      index(i)\letters = patterns()\letters
      index(i)\index = patterns()\index
      j = i
      isFirst = #False
    ElseIf patterns()\letters = index(i)\letters
      j + 1
      index(j)\letters = patterns()\letters
      index(j)\index = patterns()\index
    ElseIf j = i         ;test for a single
      isFirst = #True
      PreviousElement(patterns())
    Else
      i = j + 1
      isFirst = #True
      PreviousElement(patterns())
    EndIf 
  Next 
  
  If j = i
    i = i - 1
  EndIf
  Redim index.pattern(i)
  
  ProcedureReturn #True ;success completion
EndProcedure

Procedure Search(Array patterns.pattern(1), key$)
  Protected lo = 1
  Protected hi = ArraySize(patterns())
  Protected mid
  
  While lo <= hi
    mid = lo + (hi-lo) / 2
    If key$ = patterns(mid)\letters ; match
      While patterns(mid - 1)\letters = key$
        mid - 1
      Wend
      ProcedureReturn mid
    ElseIf key$ < patterns(mid)\letters
      hi = mid - 1 ; search low end of array
    Else
      lo = mid + 1 ; search high end of array
    EndIf
  Wend
  
  ProcedureReturn 0
  
EndProcedure

Procedure findAnagram(word.s, List wordList.s())
  Protected matchLength = Len(word)
  If matchLength = 0
    ProcedureReturn 0
  EndIf 
  
  word = UCase(word)
  Protected matchLetters.s = sortLetters(@word, matchLength)
   
  Protected patternIndex, wordIndex
  ClearList(wordList())
  patternIndex = Search(index(), matchLetters)
  If patternIndex
    While index(patternIndex)\letters = matchLetters
      wordIndex = index(patternIndex)\index
      If words(wordIndex) <> word
        AddElement(wordList())
        wordList() = words(wordIndex)
      EndIf
      patternIndex + 1
    Wend
  EndIf

  ProcedureReturn ListSize(wordList())

EndProcedure

;-main
handleError(initDictionaryPatterns(), "Unable To initialize word List")

handleError(OpenWindow(#mainWindow, 0, 0, 290, 500, "Anagram Finder", #PB_Window_ScreenCentered | #PB_Window_SystemMenu), "Unable to open Window")
StringGadget(#Word_str, 10, 10, 160, 20, "", #PB_String_LowerCase)
ButtonGadget(#List_btn, 180, 10, 100, 20, "List")
ListViewGadget(#AnagramList_lvw, 10, 40, 270, 455)

Define event, beginTime, totalTime
NewList wordList.s()

Repeat
  event = WaitWindowEvent(10)
  Select event
    Case #PB_Event_Gadget
      Select EventGadget()
        Case #Word_str
          If EventType() = #PB_EventType_Focus 
            SendMessage_(GadgetID(#Word_str), #EM_SETSEL, 0, -1)
          EndIf
        Case #List_btn
          ClearGadgetItems(#AnagramList_lvw)
          Define StartTime = TicksHQ()
          If findAnagram(GetGadgetText(#Word_str), wordList())
            ForEach wordList()
              AddGadgetItem(#AnagramList_lvw, -1, wordList())
            Next 
            Define totalTime = TicksHQ() - StartTime
          Else
            AddGadgetItem(#AnagramList_lvw, -1, "< NONE FOUND >")
          EndIf 
          AddGadgetItem(#AnagramList_lvw, -1, "")
          AddGadgetItem(#AnagramList_lvw, -1, "Search time: " + Str(totalTime) + " hundreths of a millisecond")
      EndSelect
  EndSelect
Until event = #PB_Event_CloseWindow

I'm going to see what kind of results come back for phrases (see my previous post). It might be ready in a day or so. Refining it will be the more difficult part.