Hello,
I need to analyze articles and find the word density: that is, which words are repeated in an article and how many times. I don't know in advance which words to look for; the analysis should discover which words repeat and count them.
I remember once seeing examples on this forum that ran very quickly in PureBasic; they even analyzed the Bible, for example, and managed to do it very fast.
But I don't know if anyone has that type of code that works in version 5.70 or later.
I only need to analyze articles of a single page.
Thank you
Analyze a piece of writing and find word density
ARGENTINA WORLD CHAMPION
Re: Analyze a piece of writing and find word density
Fun task.
One way:
Create a map, mapWords().
Read the text into an array, delimited by spaces.
Put each word from the array into mapWords().
Then cycle through mapWords(), counting how many times each entry appears in the array.
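A minimal sketch of that idea, counting at insert time rather than in a second pass, and using StringField() in place of an explicit array (the sample text below is made up):
Code:
NewMap mapWords()
Text$ = "the quick fox and the lazy dog and the cat"
; StringField() fields run from 1 to CountString()+1
For k = 1 To CountString(Text$, " ") + 1
  wrd$ = LCase(StringField(Text$, k, " "))
  mapWords(wrd$) = mapWords(wrd$) + 1 ; map element is auto-created at 0 on first access
Next
; cycle through the map and report each count
ForEach mapWords()
  Debug MapKey(mapWords()) + " = " + mapWords()
Next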
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Re: Analyze a piece of writing and find word density
A rough draft of what you would need to do:
clean the input text
split the text into a map, using each word as a key
Code:
NewMap words()

Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation." +
        "you probably would be better off using regex to delete anything that is not a word but, since " +
        "this is just a proof of concept we should be ok with only getting rid of period and comma"

; delete . (in-place works here because "." and " " have the same length)
ReplaceString(Text$, ".", " ", #PB_String_InPlace)
; delete ,
ReplaceString(Text$, ",", " ", #PB_String_InPlace)
; delete extra spaces (you should probably use regex to clean the input text)
Text$ = ReplaceString(Text$, "  ", " ")
Trim(Text$)

; after input has been "cleaned"
Debug Text$

i = CountString(Text$, " ") + 1 ; fields run from 1 to separators+1
For k = 1 To i
  wrd$ = LCase(StringField(Text$, k, " "))
  words(wrd$) = words(wrd$) + 1
Next

; display a count for each word
ForEach words()
  Debug MapKey(words()) + " = " + words()
Next
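A hypothetical follow-up, not part of the original post: to see the density ranked highest-first, the map can be copied into a structured list and sorted. This assumes the words() map above has already been filled:
Code:
Structure WordCount
  word.s
  count.i
EndStructure

NewList ranked.WordCount()
ForEach words()
  AddElement(ranked())
  ranked()\word = MapKey(words())
  ranked()\count = words()
Next

; sort descending on the count member
SortStructuredList(ranked(), #PB_Sort_Descending, OffsetOf(WordCount\count), TypeOf(WordCount\count))

ForEach ranked()
  Debug ranked()\word + " = " + Str(ranked()\count)
Next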
Norm.
google Translate;Makes my jokes fall flat- Fait mes blagues tombent à plat- Machte meine Witze verpuffen- Eh cumpari ci vo sunari
Re: Analyze a piece of writing and find word density
Don't forget that Trim(), LTrim(), and RTrim() return their result; they're not in-place commands, so you need to assign the return value:
Also, to remove duplicate spaces, you need to While/Wend the duplicates:
Code:
Text$ = "  word  "
Trim(Text$)
Debug ">" + Text$ + "<" ; >  word  <

Text$ = "  word  "
Text$ = Trim(Text$)
Debug ">" + Text$ + "<" ; >word<
Code:
Text$ = "    word    "
Text$ = ReplaceString(Text$, "  ", " ")
Debug ">" + Text$ + "<" ; >  word  <

Text$ = "    word    "
While FindString(Text$, "  ")
  Text$ = ReplaceString(Text$, "  ", " ")
Wend
Debug ">" + Text$ + "<" ; > word <
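Putting both corrections back into the earlier proof-of-concept, a minimal sketch of the cleaned-up counting loop (the sample text is made up):
Code:
NewMap counts()
Text$ = "  This text,  with   extra spaces,  repeats. text  "
; proof-of-concept cleanup: strip period and comma
Text$ = ReplaceString(Text$, ".", " ")
Text$ = ReplaceString(Text$, ",", " ")
; collapse runs of spaces, then trim (note the assignments)
While FindString(Text$, "  ")
  Text$ = ReplaceString(Text$, "  ", " ")
Wend
Text$ = Trim(Text$)
; count the space-delimited words
For k = 1 To CountString(Text$, " ") + 1
  wrd$ = LCase(StringField(Text$, k, " "))
  counts(wrd$) = counts(wrd$) + 1
Next
ForEach counts()
  Debug MapKey(counts()) + " = " + counts()
Next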
Re: Analyze a piece of writing and find word density
Hi,
The RegEx version, with a few exceptions (to be completed) to exclude punctuation characters and to split words that are stuck together:
Code:
EnableExplicit

NewMap words()

Define Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation." +
               "you probably would be better off using regex ! ? to delete anything that is not a word but, since " +
               "this is just a proof of concept we should be ok with only getting rid of period and comma"

; match runs of characters that are not space or punctuation separators
If Not CreateRegularExpression(0, "[^ ,;.!?]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$) : Debug "No match" : End : EndIf

Define i
Define wrd$

While NextRegularExpressionMatch(0)
  i + 1
  wrd$ = RegularExpressionMatchString(0)
  words(wrd$) = words(wrd$) + 1
Wend

Debug "Words: " + i + #CRLF$

FreeRegularExpression(0)

ForEach words()
  Debug RSet(MapKey(words()), 20, " ") + " : " + words()
Next

End
Here's the version that reads a text file, with additional separators:
Code:
EnableExplicit

NewMap words()

If ReadFile(0, "C:\Program Files\PureBasic\Compilers\APIFunctionListing.txt")
  Define Text$ = ReadString(0, #PB_File_IgnoreEOL) ; read the whole file in one go
  CloseFile(0)
Else
  Debug "Error" : End
EndIf

; additional separators: parentheses, asterisk, colon, tab and line breaks
If Not CreateRegularExpression(0, "[^ ,;.!?()*:\t\r\n]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$) : Debug "No match" : End : EndIf

Define i
Define wrd$
Define Start = ElapsedMilliseconds()

While NextRegularExpressionMatch(0)
  i + 1
  wrd$ = RegularExpressionMatchString(0)
  words(wrd$) = words(wrd$) + 1
Wend

Define Time = ElapsedMilliseconds() - Start

FreeRegularExpression(0)

ForEach words()
  Debug RSet(MapKey(words()), 50, " ") + " : " + words()
Next

Debug "Words : " + i + #CRLF$
Debug "Search time : " + Time + " ms"

End

Re: Analyze a piece of writing and find word density
@marc56us
Great example