Analyze a piece of writing and find word density

ricardo
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Analyze a piece of writing and find word density

Post by ricardo »

Hello,

I need to analyze articles and find the word density: which words are repeated, and how many times each appears in the article.
I don't have a fixed list of words to look for; the analysis itself should find which words repeat and count them.

I remember once seeing examples on this forum that did this very quickly in PureBasic; they even analyzed the whole Bible, for example, and still finished fast.

Does anyone have that type of code that works in version 5.70 or later?

I only need to analyze single-page articles.

Thank you
ARGENTINA WORLD CHAMPION
skywalk
Addict
Posts: 4224
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Analyze a piece of writing and find word density

Post by skywalk »

Fun task.
One way:
Create a mapWords().
Split the text into an array, delimited by spaces.
Use each word from the array as a key into mapWords().
Then cycle through the array, incrementing the count stored in mapWords() for each word.
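
A minimal sketch of those steps (using StringField() in place of an explicit array, and letting the map itself hold the counts so no second pass is needed; the sample text is made up):

Code: Select all

NewMap mapWords()

; hypothetical sample input; a real article would be loaded instead
Text$ = "the quick fox and the lazy dog and the cat"

; number of fields is the number of separators + 1
fields = CountString(Text$, " ") + 1
For k = 1 To fields
  word$ = LCase(StringField(Text$, k, " "))
  mapWords(word$) + 1   ; the map key is the word, the value is its count
Next

ForEach mapWords()
  Debug MapKey(mapWords()) + " = " + Str(mapWords())
Next

Each distinct word ends up as one map key, so looking up any word's count afterwards is a single map access.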
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
normeus
Enthusiast
Posts: 472
Joined: Fri Apr 20, 2012 8:09 pm

Re: Analyze a piece of writing and find word density

Post by normeus »

A rough draft of what you would need to do:

Code: Select all

NewMap words()
Text$="This is a list of words. some of the words repeat, others do not repeat careful with punctuation."+
      "you probably would be better off using regex to delete anything that is not a word but, since "+
      "this is just a proof of concept we should be ok with only getting rid of period and comma"
; delete .
ReplaceString(Text$,"."," ",#PB_String_InPlace)
;delete ,
ReplaceString(Text$,","," ",#PB_String_InPlace)
; delete extra spaces you should probably use regex to clean input text
Text$=ReplaceString(Text$,"  "," ")
Trim(Text$)
; after input has been "cleaned"
Debug Text$
i=CountString(Text$," ") + 1 ; number of fields is separators + 1, so the last word is included
 For k = 1 To i
   wrd$ = LCase(StringField(Text$, k, " "))
   If wrd$ <> "" ; skip empty fields left by stray spaces
     words(wrd$) = words(wrd$) + 1
   EndIf
 Next
 
 ; display a count for each word
 ForEach words()
   Debug MapKey(words()) +" = "+words()
 Next
 
 
In short: clean the input text, then split it into a map, using each word as a key.

Norm.
google Translate;Makes my jokes fall flat- Fait mes blagues tombent à plat- Machte meine Witze verpuffen- Eh cumpari ci vo sunari
BarryG
Addict
Posts: 4186
Joined: Thu Apr 18, 2019 8:17 am

Re: Analyze a piece of writing and find word density

Post by BarryG »

Don't forget that Trim(), LTrim(), and RTrim() need a return value; they're not in-place commands:

Code: Select all

Text$=" word "
Trim(Text$)
Debug ">"+Text$+"<" ; > word <

Text$=" word "
Text$=Trim(Text$)
Debug ">"+Text$+"<" ; >word<
Also, to remove duplicate spaces, you need to While/Wend the duplicates:

Code: Select all

Text$="          word    "
Text$=ReplaceString(Text$,"  "," ")
Debug ">"+Text$+"<" ; >     word  <

Text$="          word    "
While FindString(Text$,"  ")
  Text$=ReplaceString(Text$,"  "," ")
Wend
Debug ">"+Text$+"<" ; > word <
Marc56us
Addict
Posts: 1600
Joined: Sat Feb 08, 2014 3:26 pm

Re: Analyze a piece of writing and find word density

Post by Marc56us »

Hi,

Here's the RegEx version, with a few exceptions (list still to be completed) to exclude punctuation characters and separate words that were pasted together:

Code: Select all

EnableExplicit

NewMap words()

Define Text$ = "This is a list of words. some of the words repeat, others do not repeat careful with punctuation."  +
               "you probably would be better off using regex ! ? to delete anything that is not a word but, since " +
               "this is just a proof of concept we should be ok with only getting rid of period and comma"

If Not CreateRegularExpression(0, "[^ ,;.!?]+") : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$)       : Debug "No match"  : End : EndIf

Define i
Define wrd$
While NextRegularExpressionMatch(0)
    i + 1
    wrd$ = RegularExpressionMatchString(0)
    words(wrd$) = words(wrd$) + 1
Wend

Debug "Words: " + i + #CRLF$

FreeRegularExpression(0)

ForEach words()
    Debug RSet(MapKey(words()), 20, " ") + " : "+ words()
Next

End
Here's a version that reads a text file, with additional separators:

Code: Select all

EnableExplicit

NewMap words()

If ReadFile(0, "C:\Program Files\PureBasic\Compilers\APIFunctionListing.txt")
    Define Text$ = ReadString(0, #PB_File_IgnoreEOL)
    CloseFile(0)
Else
    Debug "Error" : End
EndIf

If Not CreateRegularExpression(0, "[^ ,;.!?()*:\t\r\n]+")   : Debug "Bad Regex" : End : EndIf
If Not ExamineRegularExpression(0, Text$)                   : Debug "No match"  : End : EndIf

Define i
Define wrd$

Define Start = ElapsedMilliseconds()
While NextRegularExpressionMatch(0)
    i + 1
    wrd$ = RegularExpressionMatchString(0)
    words(wrd$) = words(wrd$) + 1
Wend
Define Time = ElapsedMilliseconds() - Start

FreeRegularExpression(0)

ForEach words()
    Debug RSet(MapKey(words()), 50, " ") + " : "+ words()
Next

Debug "Words       : " + i + #CRLF$
Debug "Search time : " + Time + " ms"

End
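
One thing the examples above don't do is rank the output: ForEach walks the map in internal hash order. A sketch of sorting by frequency and showing the density as a percentage, copying the map into a structured list first (the structure and names here are illustrative, and the counts are stand-ins for a filled words() map):

Code: Select all

Structure WordCount
  word.s
  count.i
EndStructure

NewMap words()
NewList ranked.WordCount()

; stand-in counts; in practice words() is filled as in the examples above
words("the") = 3 : words("word") = 2 : words("a") = 1

; total word count, needed for the density percentage
ForEach words()
  total + words()
Next

; copy the map into a structured list so it can be sorted
ForEach words()
  AddElement(ranked())
  ranked()\word  = MapKey(words())
  ranked()\count = words()
Next

; most frequent first
SortStructuredList(ranked(), #PB_Sort_Descending, OffsetOf(WordCount\count), TypeOf(WordCount\count))

ForEach ranked()
  Debug ranked()\word + " : " + Str(ranked()\count) + " (" + StrF(ranked()\count * 100.0 / total, 1) + " %)"
Next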
:wink:
idle
Always Here
Posts: 5929
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Analyze a piece of writing and find word density

Post by idle »

@ marc56us
Great example