Amazing Anagrams

Share your advanced PureBasic knowledge/code with the community.
michaeled314
Enthusiast
Enthusiast
Posts: 340
Joined: Tue Apr 24, 2007 11:14 pm

Amazing Anagrams

Post by michaeled314 »

run this:

Code: Select all

If OpenFile(0,"dictionary.txt")
 While Not Eof(0)
  Current = Loc(0)
  Word$ = ReadString(0)
  While Not Eof(0)
   a=1
   Anagram$ = ReadString(0)
   If Len(Word$) = Len(Anagram$)
   For i = 1 To Len(Word$)
    If CountString(Word$,Mid(Word$,i,1)) = CountString(Anagram$,Mid(Word$,i,1))
    Else
     a = 0
    EndIf
   Next
   If a = 1
    Debug Word$+":"+Anagram$
   EndIf
   EndIf
  Wend
  a=1
  FileSeek(0,Current)
  junk$ = ReadString(0)
 Wend
EndIf
with http://individual.utoronto.ca/rishiyur/dictionary.txt as dictionary.txt in the same directory
michaeled314
Enthusiast
Enthusiast
Posts: 340
Joined: Tue Apr 24, 2007 11:14 pm

Post by michaeled314 »

User avatar
Demivec
Addict
Addict
Posts: 4270
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Post by Demivec »

Try this on for size:

Code: Select all

Structure DictWord
  word.s
  length.i
  letters.s
EndStructure

Structure anagram
  words.s
  count.i
EndStructure

NewList wordList.DictWord()
NewList letters.c()
NewList anagrams.anagram()


If OpenFile(0,"wlist1.txt")
  While Not Eof(0)
    AddElement(wordList())
    wordList()\word = ReadString(0)
    wordList()\length = Len(wordList()\word)
    
    *word.Character = @wordList()\word
    For i = 1 To wordList()\length
      AddElement(letters())
      letters() = *word\c
      *word + sizeof(Character)
    Next 
    
    SortList(letters(),#PB_Sort_Ascending)
    ForEach letters()
      wordList()\letters + Chr(letters())
    Next
    ClearList(letters())
  Wend
  CloseFile(0)
  
  SortStructuredList(wordList(),#PB_Sort_Ascending,OffsetOf(DictWord\word),#PB_Sort_String)
  SortStructuredList(wordList(),#PB_Sort_Ascending,OffsetOf(DictWord\letters),#PB_Sort_String)
  SortStructuredList(wordList(),#PB_Sort_Ascending,OffsetOf(DictWord\length),#PB_Sort_Integer)
  
  If CountList(wordList()) <> 0
    FirstElement(wordList())
    *first.DictWord = @wordList()
    AddElement(anagrams())
    anagrams()\words = wordList()\word
    count = 0
    While NextElement(wordList())
      If wordList()\letters <> *first\letters
        If count <> 0
          anagrams()\count = count
          AddElement(anagrams())
        EndIf 
        *first = @wordList()
        anagrams()\words = wordList()\word
        count = 0
      Else
        anagrams()\words + ", " + wordList()\word
        count + 1
      EndIf 
    Wend
    If count <> 0
      anagrams()\count = count
    Else
      DeleteElement(anagrams())
    EndIf 
    
    SortStructuredList(anagrams(),#PB_Sort_Ascending,OffsetOf(anagram\words),#PB_Sort_String)
    ForEach anagrams()
      Debug anagrams()\words
    Next 
  EndIf 
EndIf
This improves the speed a few million times. :wink:

@Edit: small change to make it unicode compatible, hopefully.
Last edited by Demivec on Thu Apr 09, 2009 3:29 pm, edited 1 time in total.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8451
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Post by netmaestro »

I gave this a try, you can download it here:

http://www.lloydsplace.com/anagrams.zip

For anything up to 8 letters, it returns right away, 9 is a couple of seconds and 10 is several seconds (here)

I'll post the code soon.
BERESHEIT
Ollivier
Enthusiast
Enthusiast
Posts: 281
Joined: Mon Jul 23, 2007 8:30 pm
Location: FR

Post by Ollivier »

I am curious to see your code. I think several seconds, it's... a lot! But, sometimes I have a wrong translation...

Brossden gave me an idea but it was with short words:

If in a first way, we sort all the characters of each word, like this example:
banana becomes aaabnn (because "a"<"b" and "b"<"n", etc...) and if we use aaabnn to get a fingerprint, I think it's possible to use a hash map.

Now, to research the anagram of a word, for example... "banana"! We sort it to get aaabnn, we calculate its fingerprint and find the address of all its anagrams.

I don't wrote such a code about this subject. But if it's necessary, I try to do it.

Ollivier
User avatar
dobro
Enthusiast
Enthusiast
Posts: 766
Joined: Sun Oct 31, 2004 10:54 am
Location: France
Contact:

Post by dobro »

the french spaggetti code :
base for retrieve anagram :)

Code: Select all


; Code Dobro
; PureBasic 4.00
Resultat = OpenConsole()

Print("Entrez un mot et appuyez sur 'Return': ")
m$=Input()
n= Len(m$)
Dim mo$(n)
Dim p(n)
mo$(n)=m$
z=n
rt:
p(z)=1
dt:
mo$(z-1)=Right(mo$(z),z-1)
z=z-1
If z>1
    Goto rt
EndIf
m$=""
For w=1 To n
    m$=Left(mo$(w),1)+m$
Next w

Print(m$+" ")

gt:
mo$(z+1)=mo$(z)+Left(mo$(z+1),1)
z=z+1
p(z)=p(z)+1
If p(z)<=z
    Goto dt
EndIf
If z<n
    Goto gt
EndIf


Print("FINI !!!!")

k$=Input()

Resultat = CloseConsole() 
Image
Windows 98/7/10 - PB 5.42
■ sites : http://michel.dobro.free.fr/
Ollivier
Enthusiast
Enthusiast
Posts: 281
Joined: Mon Jul 23, 2007 8:30 pm
Location: FR

Post by Ollivier »

:shock:

@Dobro

I'm a nuts in translation but there, I'm not the worst! There's above a link into a big text file. It's a word list. The goal, I think is to find all the anagrams of a specified word as quick as possible in this word list.

(Je suis une burne en traduction mais là, je ne suis pas le plus nul! Il y a plus haut un lien vers un gros fichier texte. C'est une liste de mots. Le but, je pense est de trouver tous les anagrammes d'un mot spécifié aussi vite que possible dans cette liste de mots)

Ollivier
User avatar
dobro
Enthusiast
Enthusiast
Posts: 766
Joined: Sun Oct 31, 2004 10:54 am
Location: France
Contact:

Post by dobro »

Ollivier wrote::shock:

@Dobro
(Je suis une burne en traduction mais là, je ne suis pas le plus nul! Il y a plus haut un lien vers un gros fichier texte. C'est une liste de mots. Le but, je pense est de trouver tous les anagrammes d'un mot spécifié aussi vite que possible dans cette liste de mots)

Ollivier
oui mais je donne un petit prg qui permet la recherche de TOUT les anagrams d'un mots, puisqu'il sort toute les possibilités possibles...

il suffit ensuite de comparer chaque mot sorti par mon prg avec la liste d'un dico , pour avoir TOUT les anagrammes possible , et pas seulement ceux d'une liste :)
(bien que le dico de reference puisse etre ausi une liste)

je suis conscient que cela ne correspond pas tout a fait au chalenge, mais c'est une voie :)

je te laisse le soin de traduire ça en chinois, ce que l'anglais reste pour moi :lol:
Image
Windows 98/7/10 - PB 5.42
■ sites : http://michel.dobro.free.fr/
Ollivier
Enthusiast
Enthusiast
Posts: 281
Joined: Mon Jul 23, 2007 8:30 pm
Location: FR

Post by Ollivier »

Dobro wrote:yes but I give a little prog which allows you to research ALL the anagrams of a word, because this code give all the possible possibilities...

Then, you should just compare each given word with the dictionary list, in order to have ALL the possible anagrams, and not only the anagrams in a list (however the reference dictionary could be a list)

I am aware it doesn't match exactly to the challenge, but it's a way :)
User avatar
Demivec
Addict
Addict
Posts: 4270
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Post by Demivec »

Ollivier wrote:I am curious to see your code. I think several seconds, it's... a lot! But, sometimes I have a wrong translation...

Brossden gave me an idea but it was with short words:

If in a first way, we sort all the characters of each word, like this example:
banana becomes aaabnn (because "a"<"b" and "b"<"n", etc...) and if we use aaabnn to get a fingerprint, I think it's possible to use a hash map.

Now, to research the anagram of a word, for example... "banana"! We sort it to get aaabnn, we calculate its fingerprint and find the address of all its anagrams.
@Ollivier: the method I posted sorts the dictionary list in the way you mentioned, and it's very fast, it however doesn't use a hash for searching. The purpose of my code was mainly to demonstrate the sorting method that uses the characters of each word.
Ollivier
Enthusiast
Enthusiast
Posts: 281
Joined: Mon Jul 23, 2007 8:30 pm
Location: FR

Post by Ollivier »

Demivec wrote:@Ollivier: the method I posted sorts the dictionary list in the way you mentioned, and it's very fast, it however doesn't use a hash for searching. The purpose of my code was mainly to demonstrate the sorting method that uses the characters of each word.
@Demivec

Ups! I didn't see your code sorted the letters of each word in the wordlist... Knowing this, I suppose the problem is the duration to prepare the arrays.

@Netmaestro

Excuse me. Now with all the explaination, I understand why you said many seconds are necessary to get the good memory environnement for anagrams.

But I keep the "challenge" however I am going to loose it! For others reasons, I am again curious to see the code of you program!

Ollivier
User avatar
Demivec
Addict
Addict
Posts: 4270
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Post by Demivec »

Ollivier wrote:
Demivec wrote:@Ollivier: the method I posted sorts the dictionary list in the way you mentioned, and it's very fast, it however doesn't use a hash for searching. The purpose of my code was mainly to demonstrate the sorting method that uses the characters of each word.
@Demivec

Ups! I didn't see your code sorted the letters of each word in the wordlist... Knowing this, I suppose the problem is the duration to prepare the arrays.
In the code I posted above, I timed it from the beginning until it was ready to display the results (after the last sort) and it only took 700 milliseconds. That seems to be fast enough. It is worth noting that it is finding every anagram in the entire dictionary list and preparing them to be displayed in groups. It does this by keeping the entire list in memory.

It would take even less time if it was simply an inquiry similar to netmaestro's example. The response is instantaneous (less than 0.5 seconds). The response time is affected by the dictionary size. This version reads the dictionary file to answer each inquiry and keeps only the matching anagrams in memory.

Here's a modified version of the former code, now it resembles netmaestro's example.

Code: Select all

#DictionaryFile = "wlist1.txt" 

Procedure.s sortLetters(*word.Character,length)
  Protected result.s
  Protected NewList letters.c()
  
  For i = 1 To length
    AddElement(letters())
    letters() = *word\C
    *word + SizeOf(Character)
  Next 
  
  SortList(letters(),#PB_Sort_Ascending)
  ForEach letters()
    result + chr(letters())
  Next
  
  ProcedureReturn result
EndProcedure

Procedure findAnagram(word.s,List wordList.s())
  Protected matchLength = Len(word)
  If matchLength = 0
    ProcedureReturn 0
  EndIf 
  
  Protected matchletters.s = sortLetters(@word,matchLength)
   
  Protected nextWord.s
  ClearList(wordList())
  If OpenFile(0,#DictionaryFile)
    While Not Eof(0)
      nextWord = LCase(ReadString(0))
      If matchLength = Len(nextWord)
        If matchletters = sortLetters(@nextWord,matchLength)
          If nextWord <> word
            AddElement(wordList())
            wordList() = nextWord
          EndIf 
        EndIf
      EndIf 
    Wend
    CloseFile(0)
    
    ProcedureReturn CountList(wordList())
  EndIf

EndProcedure



If OpenWindow(0,0,0,290,500,"Anagram Finder",#PB_Window_ScreenCentered | #PB_Window_SystemMenu)
  StringGadget(0,12,10,160,20,"",#PB_String_LowerCase)
  ButtonGadget(1,180,10,100,20,"List")
  ListViewGadget(2,12,40,270,455)

  Define event,gadget
  NewList wordList.s()
  
  Repeat
    event = WaitWindowEvent(10)
    gadget = EventGadget()
    If event = #PB_Event_Gadget And gadget = 1
      ClearGadgetItems(2)
      If findAnagram(GetGadgetText(0),wordList())
        ForEach wordList()
          AddGadgetItem(2,-1,UCase(wordList()))
        Next 
      Else
        AddGadgetItem(2,-1,"< NONE FOUND >")
      EndIf 
    EndIf 
  Until event = #PB_Event_CloseWindow
EndIf 
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8451
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Post by netmaestro »

Demivec shows the right way here.

Building on his approach, I found a way to find all anagrams of a given word, regardless of length (up to 15 letters) in around 12-15 milliseconds. You have to use a bit more diskspace, but nothing too major. Here is the method:

1) Start with a list of legal words, I use sowpods.txt with 216555 entries up to 15 letters long.

2) Create a copy of the list with each word accompanied by its position in the file.
i.e. " DANGER003244"

3) Create a copy of the second list with the entries all in the same order but sort the letters of the words.
i.e. " ADEGNR003244" (the numbers are left alone)

4) Sort the third list by words only. This way all anagrams of a given word will be grouped together
and each duplicate will have its own pointer to the word it represents. Here is an example of the
word "easter" in this file:

AEERST180615
AEERST167428
AEERST157997
AEERST163296
AEERST152860
AEERST009349
AEERST056677
AEERST056710
AEERST189775

5) Any entry in this index file which doesn't have at least one duplicate represents a word with no anagrams. It doesn't need to be there. So run through the file and remove any single entries.

6) Create a SQLITE3 database from this fourth list to use as an index into the original word list. Now you're ready to cook with gas.

Method for retrieving anagrams:

1) Read sowpods2.txt into an array of 216555 strings

2) Open the index database you created

3) when a word is supplied to check for anagrams, sort its letters and search as follows:

Code: Select all

          s$ = Prepare_Input_String_For_Search( GetGadgetText(0) ) ; sort letters and convert to UCase
          sql$="SELECT * FROM Table1 WHERE wordid = '" + s$ + "'"
          DatabaseQuery(0, sql$)
            While NextDatabaseRow(0)
              AddGadgetItem(2,-1,words(GetDatabaseLong(0,1)))
            Wend
          FinishDatabaseQuery(0)
The diskspace footprint of this approach is around 3mb,2.3 for sowpods2.txt and 0.8 for the index database. A bit more than the diskspace of the original word list, but the payoff is that you can find all anagrams for any length word up to 15 letters in 12 ms or so. That's basically less than one timeslice on most Windows OS's. Put TENTATIVENESS in and you get ATTENTIVENESS back immediately.

Here is a zip file containing sowpods.txt, wordindex.db and anagrams.pb:

http://www.lloydsplace.com/anagrams2.zip

Now that single-word anagrams are attainable quickly, the next challenge is to come up with a method that will generate multi-word anagrams, such as TOM CRUISE = SO I'M CUTER or GEORGE BUSH = HE BUGS GORE etc. This is where it gets a lot more interesting with the potential for plenty of amazing results.

So - Any ideas on how to go forward from here?
Last edited by netmaestro on Sat Apr 11, 2009 3:49 pm, edited 3 times in total.
BERESHEIT
michaeled314
Enthusiast
Enthusiast
Posts: 340
Joined: Tue Apr 24, 2007 11:14 pm

Post by michaeled314 »

Clint Eastwood = Old West Action

You can test the program with that
Ollivier
Enthusiast
Enthusiast
Posts: 281
Joined: Mon Jul 23, 2007 8:30 pm
Location: FR

Post by Ollivier »

NetMaestro wrote:Now that single-word anagrams are attainable quickly, the next challenge is to come up with a method that will generate multi-word anagrams, such as TOM CRUISE = SO I'M CUTER or GEORGE BUSH = HE BUGS GORE etc. This is where it gets a lot more interesting with the potential for plenty of amazing results.

So - Any ideas on how to go forward from here?
_________________
Veni, vidi, vici.
The french looser will wait a little bit before posting a code! :D
(He is preparing beautiful farts to get a good cook...)
Post Reply