Analyze a piece of writing and find word density
Posted: Wed Sep 20, 2023 5:50 pm
by ricardo
Hello,
I need to analyze articles and find the word density. That is, which words are repeated and how many times in a given article.
I don't know in advance which words to look for; I want the analysis itself to find which words are repeated and how often.
I remember once seeing examples in this forum that ran very quickly in PureBasic; they even analyzed the Bible, for example, and managed to do it very fast.
But I don't know if anyone has that type of code that works in version 5.70 or later.
I only need to analyze single-page articles.
Thank you
Re: Analyze a piece of writing and find word density
Posted: Wed Sep 20, 2023 6:19 pm
by skywalk
Fun task.
One way:
Create a mapWords()
Read the text into an array, delimited by spaces.
Put each word from the array into mapWords().
Then cycle through mapWords(), counting each entry found in the array.
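The steps above can be sketched in a few lines (a rough, untested sketch with made-up sample text; counting is done while inserting, which avoids a second pass over the array):
Code: Select all
; Sketch of the map approach: each word becomes a map key,
; and its element holds the running count.
NewMap mapWords()
Text$ = "the quick brown fox jumps over the lazy dog the fox"
n = CountString(Text$, " ") + 1 ; words = separators + 1
For k = 1 To n
  wrd$ = LCase(StringField(Text$, k, " "))
  mapWords(wrd$) = mapWords(wrd$) + 1
Next
ForEach mapWords()
  Debug MapKey(mapWords()) + " = " + mapWords()
Next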
Re: Analyze a piece of writing and find word density
Posted: Wed Sep 20, 2023 10:56 pm
by normeus
A rough draft of what you would need to do:
Code: Select all
NewMap words()
Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation." +
        "you probably would be better off using regex to delete anything that is not a word but, since " +
        "this is just a proof of concept we should be ok with only getting rid of period and comma"
; delete .
ReplaceString(Text$, ".", " ", #PB_String_InPlace)
; delete ,
ReplaceString(Text$, ",", " ", #PB_String_InPlace)
; delete extra spaces - you should probably use regex to clean the input text
Text$ = ReplaceString(Text$, "  ", " ")
Trim(Text$)
; after input has been "cleaned"
Debug Text$
i = CountString(Text$, " ")
For k = 1 To i + 1 ; words = separators + 1, so include the last word
  wrd$ = LCase(StringField(Text$, k, " "))
  words(wrd$) = words(wrd$) + 1
Next
; display a count for each word
ForEach words()
  Debug MapKey(words()) + " = " + words()
Next
In short: clean the input text, then split it into a map, using each word as a key.
Norm.
Re: Analyze a piece of writing and find word density
Posted: Thu Sep 21, 2023 8:41 am
by BarryG
Don't forget that Trim(), LTrim(), and RTrim() need a return value; they're not in-place commands:
Code: Select all
Text$ = "  word  "
Trim(Text$)
Debug ">" + Text$ + "<" ; >  word  <
Text$ = "  word  "
Text$ = Trim(Text$)
Debug ">" + Text$ + "<" ; >word<
Also, to remove duplicate spaces, you need to While/Wend the duplicates:
Code: Select all
Text$ = "    word    "
Text$ = ReplaceString(Text$, "  ", " ")
Debug ">" + Text$ + "<" ; >  word  <
Text$ = "    word    "
While FindString(Text$, "  ")
  Text$ = ReplaceString(Text$, "  ", " ")
Wend
Debug ">" + Text$ + "<" ; > word <
Re: Analyze a piece of writing and find word density
Posted: Thu Sep 21, 2023 9:49 am
by Marc56us
Hi,
The RegEx version, with a few exceptions (to be completed) to exclude punctuation characters and to separate words that have been run together:
Code: Select all
EnableExplicit

NewMap words()

Define Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation." +
               "you probably would be better off using regex ! ? to delete anything that is not a word but, since " +
               "this is just a proof of concept we should be ok with only getting rid of period and comma"

If Not CreateRegularExpression(0, "[^ ,;.!?]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$) : Debug "No match" : End : EndIf

Define i
Define wrd$

While NextRegularExpressionMatch(0)
  i + 1
  wrd$ = RegularExpressionMatchString(0)
  words(wrd$) = words(wrd$) + 1
Wend

Debug "Words: " + i + #CRLF$
FreeRegularExpression(0)

ForEach words()
  Debug RSet(MapKey(words()), 20, " ") + " : " + words()
Next

End
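Since the original question was about word density rather than raw counts, each count can also be divided by the total and shown as a percentage. A hedged, untested follow-up sketch with made-up sample counts (in the code above, i already holds the total):
Code: Select all
; Density = occurrences / total words, shown as a percentage
NewMap words()
words("the") = 3 : words("fox") = 2 : words("dog") = 1
total = 6
ForEach words()
  Debug MapKey(words()) + " : " + StrD(words() * 100.0 / total, 1) + " %"
Next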
Here's a version that reads a text file, with additional separators:
Code: Select all
EnableExplicit

NewMap words()

If ReadFile(0, "C:\Program Files\PureBasic\Compilers\APIFunctionListing.txt")
  Define Text$ = ReadString(0, #PB_File_IgnoreEOL)
  CloseFile(0)
Else
  Debug "Error" : End
EndIf

If Not CreateRegularExpression(0, "[^ ,;.!?()*:\t\r\n]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$) : Debug "No match" : End : EndIf

Define i
Define wrd$
Define Start = ElapsedMilliseconds()

While NextRegularExpressionMatch(0)
  i + 1
  wrd$ = RegularExpressionMatchString(0)
  words(wrd$) = words(wrd$) + 1
Wend

Define Time = ElapsedMilliseconds() - Start
FreeRegularExpression(0)

ForEach words()
  Debug RSet(MapKey(words()), 50, " ") + " : " + words()
Next

Debug "Words : " + i + #CRLF$
Debug "Search time : " + Time + " ms"

End

Re: Analyze a piece of writing and find word density
Posted: Thu Sep 21, 2023 8:28 pm
by idle
@Marc56us
Great example!