Hello,
I need to analyze articles and find the word density: that is, which words are repeated in an article and how many times. I don't know in advance which words to look for; the analysis should discover which words repeat and count them.
I remember once seeing examples on this forum that ran very quickly in PureBasic; they even analyzed the Bible, for example, and managed to do it very fast.
But I don't know if anyone has that type of code that works in version 5.70 or later.
I only need to analyze articles of a single page.
Thank you
Analyze a piece of writing and find word density
ARGENTINA WORLD CHAMPION
Re: Analyze a piece of writing and find word density
Fun task.
One way:
Create a map, mapWords().
Read the text into an array, delimited by spaces.
Put each word from the array into mapWords().
Then cycle through mapWords(), counting how many times each entry appears in the array.
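A minimal sketch of that idea, counting at insert time rather than in a second pass, and using StringField() in place of an explicit array (the sample text below is made up):
Code:
NewMap mapWords()
Text$ = "the quick fox and the lazy dog and the cat"
; StringField() fields run from 1 to CountString()+1
For k = 1 To CountString(Text$, " ") + 1
  wrd$ = LCase(StringField(Text$, k, " "))
  mapWords(wrd$) = mapWords(wrd$) + 1 ; map element is auto-created at 0 on first access
Next
; cycle through the map and report each count
ForEach mapWords()
  Debug MapKey(mapWords()) + " = " + mapWords()
Next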
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Re: Analyze a piece of writing and find word density
A rough draft of what you would need to do:
clean the input text
split the text into a map, using each word as a key
Code:
NewMap words()

Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation." +
        "you probably would be better off using regex to delete anything that is not a word but, since " +
        "this is just a proof of concept we should be ok with only getting rid of period and comma"

; delete . (in-place works here because "." and " " have the same length)
ReplaceString(Text$, ".", " ", #PB_String_InPlace)
; delete ,
ReplaceString(Text$, ",", " ", #PB_String_InPlace)
; delete extra spaces (you should probably use regex to clean the input text)
Text$ = ReplaceString(Text$, "  ", " ")
Trim(Text$)

; after input has been "cleaned"
Debug Text$

i = CountString(Text$, " ") + 1 ; fields run from 1 to separators+1
For k = 1 To i
  wrd$ = LCase(StringField(Text$, k, " "))
  words(wrd$) = words(wrd$) + 1
Next

; display a count for each word
ForEach words()
  Debug MapKey(words()) + " = " + words()
Next
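A hypothetical follow-up, not part of the original post: to see the density ranked highest-first, the map can be copied into a structured list and sorted. This assumes the words() map above has already been filled:
Code:
Structure WordCount
  word.s
  count.i
EndStructure

NewList ranked.WordCount()
ForEach words()
  AddElement(ranked())
  ranked()\word = MapKey(words())
  ranked()\count = words()
Next

; sort descending on the count member
SortStructuredList(ranked(), #PB_Sort_Descending, OffsetOf(WordCount\count), TypeOf(WordCount\count))

ForEach ranked()
  Debug ranked()\word + " = " + Str(ranked()\count)
Next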
Norm.
google Translate;Makes my jokes fall flat- Fait mes blagues tombent à plat- Machte meine Witze verpuffen- Eh cumpari ci vo sunari
Re: Analyze a piece of writing and find word density
Don't forget that Trim(), LTrim(), and RTrim() return their result; they're not in-place commands, so you need to assign the return value:
Also, to remove duplicate spaces, you need to While/Wend the duplicates:
Code:
Text$ = "  word  "
Trim(Text$)
Debug ">" + Text$ + "<" ; >  word  <

Text$ = "  word  "
Text$ = Trim(Text$)
Debug ">" + Text$ + "<" ; >word<
Code:
Text$ = "    word    "
Text$ = ReplaceString(Text$, "  ", " ")
Debug ">" + Text$ + "<" ; >  word  <

Text$ = "    word    "
While FindString(Text$, "  ")
  Text$ = ReplaceString(Text$, "  ", " ")
Wend
Debug ">" + Text$ + "<" ; > word <
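Putting both corrections back into the earlier proof-of-concept, a minimal sketch of the cleaned-up counting loop (the sample text is made up):
Code:
NewMap counts()
Text$ = "  This text,  with   extra spaces,  repeats. text  "
; proof-of-concept cleanup: strip period and comma
Text$ = ReplaceString(Text$, ".", " ")
Text$ = ReplaceString(Text$, ",", " ")
; collapse runs of spaces, then trim (note the assignments)
While FindString(Text$, "  ")
  Text$ = ReplaceString(Text$, "  ", " ")
Wend
Text$ = Trim(Text$)
; count the space-delimited words
For k = 1 To CountString(Text$, " ") + 1
  wrd$ = LCase(StringField(Text$, k, " "))
  counts(wrd$) = counts(wrd$) + 1
Next
ForEach counts()
  Debug MapKey(counts()) + " = " + counts()
Next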
Re: Analyze a piece of writing and find word density
Hi,
The RegEx version, with a few exceptions (to be completed) to exclude punctuation characters and to split words that are stuck together:
Code:
EnableExplicit

NewMap words()

Define Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation." +
               "you probably would be better off using regex ! ? to delete anything that is not a word but, since " +
               "this is just a proof of concept we should be ok with only getting rid of period and comma"

; match runs of characters that are not space or punctuation separators
If Not CreateRegularExpression(0, "[^ ,;.!?]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$) : Debug "No match" : End : EndIf

Define i
Define wrd$

While NextRegularExpressionMatch(0)
  i + 1
  wrd$ = RegularExpressionMatchString(0)
  words(wrd$) = words(wrd$) + 1
Wend

Debug "Words: " + i + #CRLF$

FreeRegularExpression(0)

ForEach words()
  Debug RSet(MapKey(words()), 20, " ") + " : " + words()
Next

End
Here's the version that reads a text file, with additional separators:
Code:
EnableExplicit

NewMap words()

If ReadFile(0, "C:\Program Files\PureBasic\Compilers\APIFunctionListing.txt")
  Define Text$ = ReadString(0, #PB_File_IgnoreEOL) ; read the whole file in one go
  CloseFile(0)
Else
  Debug "Error" : End
EndIf

; additional separators: parentheses, asterisk, colon, tab and line breaks
If Not CreateRegularExpression(0, "[^ ,;.!?()*:\t\r\n]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$) : Debug "No match" : End : EndIf

Define i
Define wrd$
Define Start = ElapsedMilliseconds()

While NextRegularExpressionMatch(0)
  i + 1
  wrd$ = RegularExpressionMatchString(0)
  words(wrd$) = words(wrd$) + 1
Wend

Define Time = ElapsedMilliseconds() - Start

FreeRegularExpression(0)

ForEach words()
  Debug RSet(MapKey(words()), 50, " ") + " : " + words()
Next

Debug "Words : " + i + #CRLF$
Debug "Search time : " + Time + " ms"

End

Re: Analyze a piece of writing and find word density
@marc56us
Great example