Comparing text in RTF or TXT files

Just starting out? Need help? Post your questions and find answers here.
Blurryan
User
User
Posts: 32
Joined: Sat Oct 13, 2007 2:08 pm
Location: Kazakhstan

Comparing text in RTF or TXT files

Post by Blurryan »

I have two texts one is the original material (ORG) and the other is the proof read material (CORR). I wanted to compare the two texts and have an output of the following:
Changed words: Those words that have changed with the addition or deletion of a character.
Added words: Those words that have been added into the CORR and are not in the ORG
Removed words: Those that are in the ORG but are not in the CORR.

My algorithm is somewhat primitive as below:

Code: Select all

Global.i i, j, e, r, ad, m
Global.s a, b
Dim T1.s(10)
Dim T2.s(10)

Restore Org
Read.s a
For i = 1 To 10
  T1(i) = StringField(a, i, " ")
Next

Restore Corr
Read.s b
For j = 1 To 10
  T2(j) = StringField(b, j, " ")
Next

Debug a
Debug b

  i = 0
  j = 0
e = 0
r = 0
ad = 0


While quit = 0
  i = i + 1
  j = j + 1
    If T1(i) = T2(j)
      Debug T1(i) + " - " + T2(j) + "  MATCH"
      e = e + 0
    Else
      Debug T1(i) + " - " + T2(j) + "  DO NOT MATCH"
      m = j
      While m < 10
        m = m + 1
        If T1(i) = T2(m)
          Debug T1(i) + " - " + T2(m) + "  MATCH"
          ad = ad + 1
          e = e + 1
          j = m
          Break
        EndIf
        If m = 10
          r = r + 1
          e = e + 1
        EndIf
      Wend
    EndIf
    If i = 10 Or j = 10
      quit = 1
    EndIf
Wend

Debug i
Debug j
Debug k
Debug e
Debug r 
Debug ad


DataSection
  Org:
  Data.s "The quick fox jump overs the dog"
  Corr:
  Data.s "The quick brown fox jumps over the lazy dog"
EndDataSection
However this is not proper as they do not take into account the positional differences.

Can anyone help me with the algorithm?

Thanks
Blurryan
Amilcar Matos
User
User
Posts: 43
Joined: Thu Nov 27, 2014 3:10 pm
Location: San Juan, Puerto Rico

Re: Comparing text in RTF or TXT files

Post by Amilcar Matos »

Just playing with this code! :)

Code: Select all

;{- Program header
;==Code Header Comment==============================
;        Name/title: TextCompare.pb
;   Executable name: TextCompare.exe
;           Version: 1.00
;            Author: Blurryan initial version.
;      Collaborator: Amílcar Matos Pérez (San Juan, Puerto Rico)
; Release date/hour: 27/Jul/2015
;  Operating system: Windows 7
;  Compiler version: PureBasic 5.30 (x86)
;       Explanation: To compare two text lines and identify changes present.
; ==================================================
;.......10........20........30........40........50........60........70........80
;}
Declare ShowResultsProcedure()
Declare ReadOriginalText(Array OrigText.s(1))
Declare ReadCorrectedText(Array CorrText.s(1))
Declare CompareBothTexts()

Global.i WordCounter        , OrigTextWordCounter, CorrTextWordCounter , MatchWordCounter 
Global.i DeletedWordsCounter, AddedWordsCounter  , CorrTextWordPosition                    
Global.l True               , Quit               , False               , MaxNbOfWords     
Global.l MaxNbOfOriginalWords, MaxNbOfCorrectedWords
Global.s AllOriginalWords   , AllCorrectedWords                                             

MaxNbOfWords = 10
Global Dim OrigText.s    (MaxNbOfWords)  
Global Dim CorrText.s    (MaxNbOfWords)  
Global Dim MatchedWords.s(MaxNbOfWords)  
Global Dim AddedWords.s  (MaxNbOfWords)  
Global Dim DeletedWords.s(MaxNbOfWords)  

ReadOriginalText (OrigText.s())  
ReadCorrectedText(CorrText.s())  

Debug AllOriginalWords
Debug AllCorrectedWords
Debug " "

CompareBothTexts()

ShowResultsProcedure()

End


Procedure CompareBothTexts()
  OrigTextWordCounter = 0 
  CorrTextWordCounter = 0 
  MatchWordCounter    = 0 
  DeletedWordsCounter = 0 
  AddedWordsCounter   = 0 
  
  Quit  = 0 
  True  = 1 
  False = 0 
  
While Quit = False
  OrigTextWordCounter = OrigTextWordCounter + 1
  CorrTextWordCounter = CorrTextWordCounter + 1
  
  ;find word match
    If OrigText(OrigTextWordCounter) = CorrText(CorrTextWordCounter)
      Debug OrigText(OrigTextWordCounter) + " - " + CorrText(CorrTextWordCounter) + "  MATCH"
      MatchWordCounter = MatchWordCounter + 1
      MatchedWords(MatchWordCounter) = CorrText(CorrTextWordCounter)
    Else
      ;mismatch found.
      Debug ""
      Debug OrigText(OrigTextWordCounter) + " - " + CorrText(CorrTextWordCounter) + "  DO NOT MATCH"
      CorrTextWordPosition = CorrTextWordCounter
      While CorrTextWordPosition < MaxNbOfCorrectedWords
        ;forward looking for the original word. 
        CorrTextWordPosition = CorrTextWordPosition + 1
        If OrigText(OrigTextWordCounter) = CorrText(CorrTextWordPosition)
          ; found the original word later in the corrected text, so 
          ; the corrected word is an added word.
          Debug OrigText(OrigTextWordCounter) + " - " + CorrText(CorrTextWordPosition) + "  MATCH"
                         AddedWordsCounter = AddedWordsCounter + 1
                          MatchWordCounter = MatchWordCounter  + 1
             AddedWords(AddedWordsCounter) = CorrText(CorrTextWordCounter)
            MatchedWords(MatchWordCounter) = CorrText(CorrTextWordCounter)
                       CorrTextWordCounter = CorrTextWordPosition   ; reposition the corrected word counter.
          Break
        EndIf
        If CorrTextWordPosition = MaxNbOfCorrectedWords
          ; didn't find the original word later in the text, so the original word has been deleted.
          Debug "No match found."
          Debug ""
                        DeletedWordsCounter = DeletedWordsCounter + 1
          DeletedWords(DeletedWordsCounter) = OrigText(OrigTextWordCounter)
                          AddedWordsCounter = AddedWordsCounter + 1
              AddedWords(AddedWordsCounter) = CorrText(CorrTextWordCounter)
        EndIf
      Wend
    EndIf
    
    ;End of text test.
    If OrigTextWordCounter = MaxNbOfOriginalWords Or 
       CorrTextWordCounter = MaxNbOfCorrectedWords
      Quit = True
    EndIf
    
Wend

EndProcedure

Procedure ShowResultsProcedure()
  Debug " "
  Debug "Stats:"
  Debug " Original text word count  " + Right("   " + Str(MaxNbOfOriginalWords ), 3)  
  Debug " Corrected text word count " + Right("   " + Str(MaxNbOfCorrectedWords), 3)  
  Debug ""
  Debug " Number of Matched words "   + Right("   " + Str(MatchWordCounter     ), 3)   
  For WordCounter = 1 To MatchWordCounter
    Debug Space(5) + MatchedWords(WordCounter) 
  Next WordCounter
  
  Debug " "
  Debug " Number of Deleted words " + Right("   " + Str(DeletedWordsCounter), 3)  
  For WordCounter = 1 To DeletedWordsCounter
    Debug Space(5) + DeletedWords(WordCounter) 
  Next WordCounter
  
  Debug " "
  Debug " Number of Added words   " + Right("   " + Str(AddedWordsCounter  ), 3)
  For WordCounter = 1 To AddedWordsCounter
    Debug Space(5) + AddedWords(WordCounter) 
  Next WordCounter
  
EndProcedure

Procedure ReadOriginalText(Array OrigText.s(1))
Restore Org
Read.s AllOriginalWords

MaxNbOfOriginalWords = CountString(AllOriginalWords, " ") + 1
If MaxNbOfWords > MaxNbOfOriginalWords
  For WordCounter = 1 To MaxNbOfOriginalWords
    OrigText(WordCounter) = StringField(AllOriginalWords, WordCounter, " ")
  Next WordCounter
EndIf

EndProcedure

Procedure ReadCorrectedText(Array CorrText.s(1))
Restore Corr
Read.s AllCorrectedWords

MaxNbOfCorrectedWords = CountString(AllCorrectedWords, " ") + 1
If MaxNbOfWords > MaxNbOfCorrectedWords
  For WordCounter = 1 To MaxNbOfCorrectedWords
    CorrText(WordCounter) = StringField(AllCorrectedWords, WordCounter, " ")
  Next WordCounter
EndIf

EndProcedure

DataSection
  Org:
  Data.s "The quick fox jump overs the dog"
  Corr:
  Data.s "The quick brown fox jumps over the lazy dog"
EndDataSection
User avatar
Vera
Addict
Addict
Posts: 858
Joined: Tue Aug 11, 2009 1:56 pm
Location: Essen (Germany)

Re: Comparing text in RTF or TXT files

Post by Vera »

Hi Amilcar,
I too tried to solve this challenge the other week, but didn't reach a suffcient solution yet. Especially as the specifications / rules are unknown.

E.g.: as for the current code the altered/missing words have to follow behind the positions of the corresponding words of the original text, otherwise they are not well detected. ... How to differ altered words against new words?

This won't work: Corr: Data.s "The quick brown fox jumps over dog the lazy"

As for your code - it gives wrong results:
Number of Matched words 5
The
quick
brown
the
lazy
'brown' doesn't match and 'dog' is missing.

Try this debug in both places and you'll see:

Code: Select all

MatchedWords(MatchWordCounter) = CorrText(CorrTextWordCounter) : Debug "match: "+MatchedWords(MatchWordCounter)
I went a different way and also included the requested word position.
But still I wouldn't know how to decide which words had only been altered (if they are close together by position?) and which were added.

Code: Select all

Global.s origin, control

Structure Text
  word.s
  pos.i 
EndStructure

NewList ORG.Text()
NewList CORR.Text()

NewList orgRest.Text()
NewList corrRest.Text()
NewList matchWords.Text()

Restore Org
Read.s origin
For i = 1 To 10
  AddElement(ORG())
  ORG()\word = StringField(origin, i, " ")
  ORG()\pos = i
Next

Restore Corr
Read.s control
For j = 1 To 10
  AddElement(CORR())
  CORR()\word = StringField(control, j, " ")
  CORR()\pos = j
Next

Debug origin
Debug control
;Debug "-------"

CopyList(ORG(), orgRest())
CopyList(CORR(), corrRest())

 ResetList(orgRest())
 While NextElement(orgRest())
   ResetList(CORR())
   While NextElement(CORR())
     If orgRest()\word = CORR()\word
       DeleteElement(orgRest())
       Break 
     EndIf
   Wend
 Wend
 
 Debug "--++++---"

 ResetList(corrRest())
 While NextElement(corrRest())
   ResetList(ORG())
   While NextElement(ORG())
     If corrRest()\word = ORG()\word
       
       If corrRest()\word <> ""
         AddElement(matchWords())
         matchWords()\word = ORG()\word
         matchWords()\pos = ORG()\pos
       EndIf
       
       DeleteElement(corrRest())
       Break 
     EndIf
   Wend
 Wend
 

 Debug "--- only appear in ORG:"
 ForEach orgRest()
   Debug Str(orgRest()\pos) +" _ "+ orgRest()\word 
 Next

 Debug "--- only appear in CORR:"
 ForEach corrRest()
   Debug Str(corrRest()\pos) +" _ "+ corrRest()\word 
 Next
 
 Debug "--++++---"
 Debug "--- matching words (from ORG):"
 ForEach matchWords()
   Debug Str(matchWords()\pos) +" _ "+ matchWords()\word 
 Next

DataSection
  Org:
  Data.s "The quick fox jump overs the dog"
  Corr:
  Data.s "The quick brown fox jumps over the lazy dog"
EndDataSection 
greets ~ Vera
Dude
Addict
Addict
Posts: 1907
Joined: Mon Feb 16, 2015 2:49 pm

Re: Comparing text in RTF or TXT files

Post by Dude »

Blurryan, do you really need an actual list of changes, or just a way to check how close the new text is to the original? Because perhaps just a Levenshtein check is all you need.

With Levenshtein, you get a number that represents how close the new text is to the old text: the lower the number, the closer the similarity and thus more likely that the proofread text is correct. ;)

Observe (not my code originally):

Code: Select all

; Calculates edited diffs between two strings.
; 0 = No diffs, 1 = 1 edit needed to diff, etc.

; http://en.wikipedia.org/wiki/Levenshtein_distance

Procedure Levenshtein(text1$,text2$)
  n=Len(text1$) : m=Len(text2$) : Dim d(n,m)
  If n=0 : ld=m : ProcedureReturn ld : EndIf
  If m=0 : ld=n : ProcedureReturn ld : EndIf
  For i=0 To n : d(i,0)=i : Next
  For j=0 To m : d(0,j)=j : Next
  For i=1 To n
    si$=Mid(text1$,i,1)
    For j=1 To m
      tj$=Mid(text2$,j,1)
      If si$=tj$ : cost=0 : Else : cost=1 : EndIf
      a=d(i,j-1)+1 : b=d(i-1,j-1)+cost : min=d(i-1,j)+1
      If a<min : min=a : EndIf
      If b<min : min=b : EndIf
      d(i,j)=min
    Next
  Next
  ProcedureReturn d(n,m)
EndProcedure

orig$="The quick brown fox jumps over the lazy dog"

Debug Levenshtein(orig$,orig$) ; 0 edits.
Debug Levenshtein(orig$,ReplaceString(orig$,"dog","cat")) ; 3 edits.
Debug Levenshtein(orig$,RemoveString(orig$,"brown")) ; 5 edits.
Debug Levenshtein(orig$,orig$+" new words") ; 10 edits.
You can compare different texts to see who was closest to the original:

Code: Select all

; Calculates edited diffs between two strings.
; 0 = No diffs, 1 = 1 edit needed to diff, etc.

; http://en.wikipedia.org/wiki/Levenshtein_distance

Procedure Levenshtein(text1$,text2$)
  n=Len(text1$) : m=Len(text2$) : Dim d(n,m)
  If n=0 : ld=m : ProcedureReturn ld : EndIf
  If m=0 : ld=n : ProcedureReturn ld : EndIf
  For i=0 To n : d(i,0)=i : Next
  For j=0 To m : d(0,j)=j : Next
  For i=1 To n
    si$=Mid(text1$,i,1)
    For j=1 To m
      tj$=Mid(text2$,j,1)
      If si$=tj$ : cost=0 : Else : cost=1 : EndIf
      a=d(i,j-1)+1 : b=d(i-1,j-1)+cost : min=d(i-1,j)+1
      If a<min : min=a : EndIf
      If b<min : min=b : EndIf
      d(i,j)=min
    Next
  Next
  ProcedureReturn d(n,m)
EndProcedure

orig$="The quick brown fox jumps over the lazy dog"

alice=Levenshtein(orig$,"The kwick fox jumped over the lazzy dog") ; 11
bob=Levenshtein(orig$,"The fast brown fox jumps over the lazy dawg") ; 7
colin=Levenshtein(orig$,"Teh quick brown fox jumps over the lazy dog") ; 2

Debug alice
Debug bob
Debug colin
Debug "Colin's version was closest to the orig"
Amilcar Matos
User
User
Posts: 43
Joined: Thu Nov 27, 2014 3:10 pm
Location: San Juan, Puerto Rico

Re: Comparing text in RTF or TXT files

Post by Amilcar Matos »

Code improved.
Thank you Vera.

Code: Select all

;{- Program header
;==Code Header Comment==============================
;        Name/title: TextCompare.pb
;   Executable name: TextCompare.exe
;           Version: 1.01
;            Author: Blurryan initial version.
;     Collaborators: Amílcar Matos Pérez, Vera
; Release date/hour: August 3, 2015
;  Operating system: Windows 7
;  Compiler version: PureBasic 5.30 (x86)
;       Explanation: To compare two text lines and identify changes present.
;                    This code identifies: Matched words in both texts;
;                                          Deleted words in the original text;
;                                          Added words in the corrected text;
;                                          Words placed in different order in the corrected text.
;                    This code fails when the words are placed too far apart.
; play with this statement: CorrTextWordCounter = CorrTextWordPosition   ; reposition the corrected word counter.
; ==================================================
;.......10........20........30........40........50........60........70........80
;}
Declare ShowResultsProcedure()
Declare ReadOriginalText(Array OrigText.s(1))
Declare ReadCorrectedText(Array CorrText.s(1))
Declare.l CompareBothTexts()
Declare.l IdentifyChangedOrderWords()
  
Global.i WordCounter        , OrigTextWordCounter, CorrTextWordCounter , MatchWordCounter 
Global.i DeletedWordsCounter, AddedWordsCounter  , CorrTextWordPosition, ChangedWordCounter
Global.l True               , Quit               , False               , MaxNbOfWords     
Global.l MaxNbOfOriginalWords, MaxNbOfCorrectedWords
Global.s AllOriginalWords   , AllCorrectedWords  , OriginalTextLine    , CorrectedTextLine 
Global.s BlankLine

   BlankLine = " "
MaxNbOfWords = 10
Global Dim OrigText.s         (MaxNbOfWords)  
Global Dim CorrText.s         (MaxNbOfWords)  
Global Dim MatchedWords.s     (MaxNbOfWords)  
Global Dim AddedWords.s       (MaxNbOfWords)  
Global Dim DeletedWords.s     (MaxNbOfWords)  
Global Dim ChangedOrderWords.s(MaxNbOfWords)  

ReadOriginalText (OrigText.s())  
ReadCorrectedText(CorrText.s())  

Debug AllOriginalWords
Debug AllCorrectedWords
Debug BlankLine

 OriginalTextLine = AllOriginalWords
CorrectedTextLine = AllCorrectedWords

CompareBothTexts()

IdentifyChangedOrderWords()

ShowResultsProcedure()

End

Procedure.l IdentifyChangedOrderWords()
  ;{- Procedure explanation
  ; Identifies words placed in a different order in the corrected text.
  ; Rule: Words that are identified as deleted from the original text but also are identified
  ;       as added words in the corrected text are to be shown as changed order.
  ; Fills a ChangedOrderWords array.
  ;}

  ;{- Protected variables
  Protected ReturnValue.l
  Protected AddedWordRow.l
  Protected DeletedWordRow.l 
  Protected FirstAddedWord.l
  Protected FirstDeletedWord.l
  Protected LastAddedWord.l
  Protected LastDeletedWord.l
  ;}

  ;{- Pre-assigned return value
  ReturnValue = #True
  ;}

  ; for each deleted word in the original text
  ; look for the same word in the added words list.
  ; If present then its an order changed word.
 ChangedWordCounter = 0
   FirstDeletedWord = 1
    LastDeletedWord = DeletedWordsCounter
     FirstAddedWord = 1
      LastAddedWord = AddedWordsCounter

  For DeletedWordRow = FirstDeletedWord To LastDeletedWord
    
    For AddedWordRow = FirstAddedWord To LastAddedWord
      
      If DeletedWords(DeletedWordRow) = AddedWords(AddedWordRow)
        
                           ChangedWordCounter = ChangedWordCounter + 1
        ChangedOrderWords(ChangedWordCounter) = AddedWords(AddedWordRow)
        
      EndIf
      
    Next AddedWordRow
    
  Next DeletedWordRow
  
  ProcedureReturn ReturnValue
EndProcedure     ; IdentifyChangedOrderWords


Procedure.l CompareBothTexts()
  ;{- Procedure explanation
  ; Compares each word from the corrected text to the original text.
  ; Fills three arrays: MatchedWords(), AddedWords() and DeletedWords().
  ;}
  ;{- Protected variables
  Protected Quit
  ;}
  
  ;{- Pre-assigned return value
  ReturnValue = #True
  ;}
  
  OrigTextWordCounter = 0 
  CorrTextWordCounter = 0 
  MatchWordCounter    = 0 
  DeletedWordsCounter = 0 
  AddedWordsCounter   = 0 
  
  Quit  = 0 
  
While Quit = #False
  OrigTextWordCounter = OrigTextWordCounter + 1
  CorrTextWordCounter = CorrTextWordCounter + 1
  
  ;find word match
    If OrigText(OrigTextWordCounter) = CorrText(CorrTextWordCounter)
      Debug OrigText(OrigTextWordCounter) + " - " + CorrText(CorrTextWordCounter) + "  MATCH"
      MatchWordCounter = MatchWordCounter + 1
      MatchedWords(MatchWordCounter) = CorrText(CorrTextWordCounter)
    Else
      ;mismatch found.
      Debug ""
      Debug OrigText(OrigTextWordCounter) + " - " + CorrText(CorrTextWordCounter) + "  DO NOT MATCH"
      CorrTextWordPosition = CorrTextWordCounter
      While CorrTextWordPosition <= MaxNbOfCorrectedWords
        ;forward looking for the original word. 
           CorrTextWordPosition = CorrTextWordPosition + 1
        If CorrTextWordPosition <= MaxNbOfCorrectedWords        
          
          If OrigText(OrigTextWordCounter) = CorrText(CorrTextWordPosition)
            ; found the original word later in the corrected text, so 
            ; the corrected word is an added word.
            Debug OrigText(OrigTextWordCounter) + " - " + CorrText(CorrTextWordPosition) + "  MATCH"
                           AddedWordsCounter = AddedWordsCounter + 1
                            MatchWordCounter = MatchWordCounter  + 1
               AddedWords(AddedWordsCounter) = CorrText(CorrTextWordCounter)
              MatchedWords(MatchWordCounter) = OrigText(OrigTextWordCounter)
                         CorrTextWordCounter = CorrTextWordPosition   ; reposition the corrected word counter.
            Break
          EndIf
          
        EndIf
        
        If CorrTextWordPosition > MaxNbOfCorrectedWords
          ; didn't find the original word later in the text, so the original word has been deleted.
          Debug "No match found."
          Debug ""
                        DeletedWordsCounter = DeletedWordsCounter + 1
          DeletedWords(DeletedWordsCounter) = OrigText(OrigTextWordCounter)
                          AddedWordsCounter = AddedWordsCounter + 1
              AddedWords(AddedWordsCounter) = CorrText(CorrTextWordCounter)
        EndIf
        
      Wend
      
    EndIf
    
    ;End of text test.
    If OrigTextWordCounter = MaxNbOfOriginalWords 
      Quit = #True
    EndIf
    
Wend

ProcedureReturn ReturnValue
EndProcedure ;Compare both texts

Procedure ShowResultsProcedure()
  Debug BlankLine
  Debug "Stats:"
  Debug " Original text word count  " + Right("   " + Str(MaxNbOfOriginalWords ), 3)  
  Debug " Corrected text word count " + Right("   " + Str(MaxNbOfCorrectedWords), 3)  
  
  Debug BlankLine
  Debug " Number of Matched words "   + Right("   " + Str(MatchWordCounter), 3)   
  For WordCounter = 1 To MatchWordCounter
    Debug Space(5) + MatchedWords(WordCounter) 
  Next WordCounter
  
  Debug BlankLine
  Debug " Number of Deleted words " + Right("   " + Str(DeletedWordsCounter), 3)  
  For WordCounter = 1 To DeletedWordsCounter
    Debug Space(5) + DeletedWords(WordCounter) 
  Next WordCounter
  
  Debug BlankLine
  Debug " Number of Added words   " + Right("   " + Str(AddedWordsCounter), 3)
  For WordCounter = 1 To AddedWordsCounter
    Debug Space(5) + AddedWords(WordCounter) 
  Next WordCounter
  
  Debug BlankLine
  Debug OriginalTextLine 
  Debug CorrectedTextLine 
  
  Debug BlankLine
  Debug " Number of Changed Order words   " + Right("   " + Str(ChangedWordCounter), 3)
  For WordCounter = 1 To ChangedWordCounter
    Debug Space(5) + ChangedOrderWords(WordCounter) 
  Next WordCounter
  
EndProcedure

Procedure ReadOriginalText(Array OrigText.s(1))
Restore Org
Read.s AllOriginalWords

MaxNbOfOriginalWords = CountString(AllOriginalWords, " ") + 1
If MaxNbOfWords > MaxNbOfOriginalWords
  For WordCounter = 1 To MaxNbOfOriginalWords
    OrigText(WordCounter) = StringField(AllOriginalWords, WordCounter, " ")
  Next WordCounter
EndIf

EndProcedure

Procedure ReadCorrectedText(Array CorrText.s(1))
Restore Corr
Read.s AllCorrectedWords

MaxNbOfCorrectedWords = CountString(AllCorrectedWords, " ") + 1
If MaxNbOfWords > MaxNbOfCorrectedWords
  For WordCounter = 1 To MaxNbOfCorrectedWords
    CorrText(WordCounter) = StringField(AllCorrectedWords, WordCounter, " ")
  Next WordCounter
EndIf

EndProcedure

DataSection
  Org:
  Data.s "The quick fox jump overs the dog"
 ; Data.s "ok, you say; think what you say"
  Corr:
  Data.s "The quick brown fox jumps over dog the lazy"   ; for testing the word order change case.
 ; Data.s "The quick brown fox jumps over the lazy dog"  ; for testing the non change case.
 ; Data.s "ok, you say; say what you think"              ; for testing the corrected word counter reposition.
 EndDataSection
Blurryan
User
User
Posts: 32
Joined: Sat Oct 13, 2007 2:08 pm
Location: Kazakhstan

Re: Comparing text in RTF or TXT files

Post by Blurryan »

Sorry everybody, was travelling and today am back on my seat. Shall go through your posts and get back to you by EOD.

Thanks for sharing your ideas.
Blurryan
User
User
Posts: 32
Joined: Sat Oct 13, 2007 2:08 pm
Location: Kazakhstan

Re: Comparing text in RTF or TXT files

Post by Blurryan »

Hi Dude,

Was experimenting with the Levenshtein system you had suggested. Made a slight change to check words instead of characters and found that in one example the result was incorrect where the sequencing was similar but duplicated. The redone code is below for you to have a look. Please see the orig$ as compared with ee$.

I shall go through Vera's program over the weekend and see how it goes.

Dude's code redone as below:

Code: Select all

orig$="The quick brown fox jumps over the lazy dog"
alice$="The kwick fox jumped over the lazzy dog"

aa$="The kwick fox jumped over the lazzy dog"
bb$="The fast brown fox jumps over the lazy dawg"
cc$="Teh quick brown fox jumps over the lazy dog"
dd$="It pains me that - this is going no-where"
ee$="The quick fox jumps over fox jumps over the lazy dog"

Declare Levenshtein(text1$,text2$)

Debug Levenshtein(orig$,aa$)
Debug "----------"
Debug Levenshtein(orig$,bb$)
Debug "----------"
Debug Levenshtein(orig$,cc$)
Debug "----------"
Debug Levenshtein(orig$,dd$)
Debug "----------"
Debug Levenshtein(orig$,ee$)
Debug "----------"

Procedure Levenshtein(text1$,text2$)
  Debug text1$
  Debug text2$
  i = 1 : j = 1
  x = Len(text1$) : y = Len(text2$)
  While i < x
    p = FindString(text1$, " ", i)
    If p > 0 : pp = pp + 1 : i = p + 1
    Else : Wordstext1 = pp + 1 : Break
    EndIf
  Wend
  While j < y
    q = FindString(text2$, " ", j)
    If q > 0 : qq = qq + 1 : j = q + 1
    Else : Wordstext2 = qq + 1 : Break
    EndIf
  Wend
  Dim Text1.s(Wordstext1) : Dim Text2.s(Wordstext2)
  For i = 1 To Wordstext1 : Text1(i) = StringField(text1$, i, " ") : Next i
  For j = 1 To Wordstext2 : Text2(j) = StringField(text2$, j, " ") : Next j
  
  If Wordstext1=0 : ld=Wordstext2 : ProcedureReturn ld : EndIf
  If Wordstext2=0 : ld=Wordstext1 : ProcedureReturn ld : EndIf
  
  Dim d(Wordstext1,Wordstext2)
  
  For i=0 To Wordstext1 : d(i,0)=i : Next
  For j=0 To Wordstext2 : d(0,j)=j : Next
  For i=1 To Wordstext1
    si$=Text1(i)
    For j=1 To Wordstext2
      tj$=Text2(j)
      If si$=tj$ : cost=0 : Else : cost=1 : EndIf
      a=d(i,j-1)+1 : b=d(i-1,j-1)+cost : min=d(i-1,j)+1
      If a<min : min=a : EndIf
      If b<min : min=b : EndIf
      d(i,j)=min
    Next
  Next
  ProcedureReturn d(Wordstext1,Wordstext2)
EndProcedure
 
Once again thanks a ton for your efforts and time.

Regards
Blurryan
Blurryan
User
User
Posts: 32
Joined: Sat Oct 13, 2007 2:08 pm
Location: Kazakhstan

Re: Comparing text in RTF or TXT files

Post by Blurryan »

Many thanks to Amílcar Matos Pérez, Vera and Dude, for taking interest in the algorithm.

I have worked on the algorithm to find the best solution and my program is given below.

My request to all is to see if the program can be written better and / or if there are any logical errors.
The purpose of the program is to identify an original text and compare the text to a "Proof Read" text along with all possible corrections in it. These I have named as "Original" and "Corrected" text in the program.

My requirement is to understand what is the typing error percentage on the original texts hence I need to know
a.) Which words were removed from the original text and so constitute an error,
b.) Which words were added to the original text and hence constitute an error
c.) Ensure that if the words typed in the original exist in the corrected text but may not be in the same position then these words are to be taken OK and
d.) In case a word has been repeated in the original text without any counterpart in the Corrected text, then it shall be an error.

After getting the word count for errors, to calculate the error percentage of the original typed text.

Thanks to all in advance.

Code: Select all

;{- Program header
;==Code Header Comment==============================
;        Name/title: ProofTest.pb
;   Executable name: ProofTest.exe
;           Version: 2.00
;            Author: Blurryan initial version.
;      Collaborator: Amílcar Matos Pérez (San Juan, Puerto Rico)
; Release date/hour: 03/Sep/2015
;  Operating system: Windows 10
;  Compiler version: PureBasic 5.31 (x64)
;       Explanation: To compare two text lines and identify changes present to calculate errors and percentages.
; ==================================================
;.......10........20........30........40........50........60........70........80........90.......100.......110.......120.......130.......140
;}

Structure Rev   ; Structure made for trapping the words array along with their position in the text
  item.s        ; Word in a string
  orgno.i       ; place in the string
EndStructure

; variables initialised
Global.s ORIG, CORR
Global.i perf, imperf, rept, rem, addi, err
Global.i i, ii, j, jj, m = 0, Pj, Flag = 0
Global.i Quit = 0, LenORIG, LenCORR, LenORIGStat, LenCORRStat, Correxistence, CountO = 0, CountC = 0
Global.f Errorpercent

; arrays - O for Original text, C for Corrected text, P and Q are to trap whether the matches are Perfect/ Imperfect / Repeat / Not existent
Global Dim O.s(100) : Global Dim C.s(100) : Global Dim P.s(100) : Global Dim Q.s(100)
Global Dim OX.Rev(100) : Global Dim OY.Rev(100) ; Structured array for working

Font1 = LoadFont(0, "Consolas", 8) ; main font used

; Declarations of the procedures
Declare ConverttoOXOY()                 ; Converts the texts given to the stucture formats
Declare InitiateOutput()                ; Initial printing of headers
Declare ProcessPerfectMatches()         ; processes for perfect matches i.e. matching of text AND position in text
Declare ProcessImperfectMatches()       ; processes for imperfect matches i.e. matching of text but not of exact position in text
Declare ProcessRepeats()                ; processes for repeats i.e. whether a word has been repeated in the texts - not significant as this is an error
Declare ProcessNotInCorr()              ; processes for words that are in Original text but NOT in the Corrected text
Declare ProcessNotInOrig()              ; processes for words that are in Corrected text but NOT in the Original text
Declare EndProcess()                    ; end process and final printing of the summary
Declare Initialize()                    ; initialises the variables

; Main window is opened along with respective gadgets
If OpenWindow(0, 300, 10, 1000, 620, "Words test", #PB_Window_SystemMenu)
    TextGadget(1, 10, 10, 90, 20, "Original text:") : SetGadgetFont(1, Font1)
    StringGadget(2, 110, 10, 880, 20, "The quick fox jumps over the fox jumps over the fox jumps over the lazzy dog.") : SetGadgetFont(2, Font1)
    TextGadget(3, 10, 40, 90, 20, "Corrected text:") : SetGadgetFont(3, Font1)
    StringGadget(4, 110, 40, 880, 20, "The quick brown fox jumps over the lazy dog.") : SetGadgetFont(4, Font1)
    EditorGadget(5, 10, 70, 980, 500) : SetGadgetFont(5, Font1)
    ButtonGadget(8, 20, 590, 100, 20, "Clear") : SetGadgetFont(8, Font1)      ; This clears all fields in the window and then one can input afresh
    ButtonGadget(9, 150, 590, 100, 20, "Initiate") : SetGadgetFont(9, Font1)  ; clears the editor but keeps the original and corrected text which can be mdified for a subsequent run
    ButtonGadget(10, 740, 590, 100, 20, "GO") : SetGadgetFont(10, Font1)      ; to process the texts and send output to editor
    ButtonGadget(11, 880, 590, 100, 20, "Quit") : SetGadgetFont(11, Font1)    ; quits program

    Repeat
      Select WaitWindowEvent()
        Case #PB_Event_CloseWindow
          Quit = 1
        Case #PB_Event_Gadget
          Select EventGadget()
            Case 8
              SetGadgetText(2, "")
              SetGadgetText(4, "")
              ClearGadgetItems(5)
              Initialize()
            Case 9
              Initialize()
            Case 10
              ORIG = GetGadgetText(2) : LenORIG = Len(ORIG) : CORR = GetGadgetText(4) : LenCORR = Len(CORR)
              ConverttoOXOY()
              InitiateOutput()
              ProcessPerfectMatches()
              ProcessImperfectMatches()
              ProcessRepeats()
              ProcessNotInCorr()
              ProcessNotInOrig()
              EndProcess()
            Case 11
              Quit = 1
          EndSelect
      EndSelect
    Until Quit = 1
  EndIf
  
Procedure ConverttoOXOY() ; Converts the texts given to the stucture formats
  i = 0
  j = 0
  m = 0
  LenORIGStat = 0: LenCORRStat = 0
  ; processes length of string in characters and based on "word = characters within spaces" processes the words and word counts
  While LenORIGStat <= LenORIG
    i = i + 1
    ii = i
    O(i) = StringField(ORIG, i, " ")
    LenORIGStat = LenORIGStat + Len(StringField(ORIG, i, " ")) + 1
  Wend
  While LenCORRStat <= LenCORR
    j = j + 1
    jj = j
    C(j) = StringField(CORR, j, " ")
    LenCORRStat = LenCORRStat + Len(StringField(CORR, j, " ")) + 1
  Wend
  ; Redim the arrays for efficient use of storage and memory and create the structured files for original and corrected texts
  ReDim O(ii) : ReDim C(jj) : ReDim OX.Rev(ii): ReDim OY.Rev(jj): ReDim P(jj): ReDim Q(ii)
  For i = 1 To ii
    OX(i)\item = O(i)
    OX(i)\orgno = i
  Next i
  For j = 1 To jj
    OY(j)\item = C(j)
    OY(j)\orgno = j
  Next j
EndProcedure
  
Procedure InitiateOutput()    ; Initial printing of headers
  AddGadgetItem(5, -1, "Original text:  " + PeekS(@ORIG) + "   Words: " + Str(ii))
  AddGadgetItem(5, -1, "Corrected text: " + PeekS(@CORR) + "   Words: " + Str(jj))
  AddGadgetItem(5, -1, "")
  AddGadgetItem(5, -1, "Original" + Chr(9) + "Original" + Chr(9) + "Corrected" + Chr(9) + "Corrected" + Chr(9) + "COMMENTS")
  AddGadgetItem(5, -1, "  Word  " + Chr(9) + "  Srl # " + Chr(9) + "   Word  " + Chr(9) + "  Srl #  ")
  AddGadgetItem(5, -1, "--------" + Chr(9) + "--------" + Chr(9) + "---------" + Chr(9) + "---------")
  AddGadgetItem(5, -1, "")  
EndProcedure
  
Procedure ProcessPerfectMatches() ; processes for perfect matches i.e. matching of text AND position in text
  For i = 1 To ii
    For j = 1 To jj
      If OX(i)\item = OY(j)\item And OX(i)\orgno = OY(j)\orgno
          AddGadgetItem(5, -1, OX(i)\item + Chr(9) +  Chr(9) +  Chr(9) + Str(OX(i)\orgno) + Chr(9) + OY(j)\item + Chr(9) + Chr(9) + Chr(9) + Str(OY(j)\orgno) + Chr(9) + "PERFECT MATCH !!")
          perf = perf + 1
          P(j) = "P": Q(i) = "P"  ; puts "P" in counter of P and Q array to identify that word is already considered
          CountO = CountO + 1: CountC = CountC + 1   ; a counter count of words to ensure all words taken and to tally with ii and jj
      EndIf
    Next j
  Next i
  AddGadgetItem(5, -1, "---------------------------------------------------------")  
  AddGadgetItem(5, -1, "")  
EndProcedure
  
Procedure ProcessImperfectMatches()   ; processes for imperfect matches i.e. matching of text but not of exact position in text
  Pj = 0
  For i = 1 To ii
    For j = 1 To jj
      If OX(i)\item = OY(j)\item And OX(i)\orgno <> OY(j)\orgno
          If P(j) = "P" Or P(j) = "I" Or Q(i) = "P" Or Q(i) = "I"
            ; do not print or take as IMPERFECT WORD
          Else
            ; print and take as IMPERFECT WORD
            AddGadgetItem(5, -1, OX(i)\item + Chr(9) + Chr(9) +  Chr(9) + Str(OX(i)\orgno) + Chr(9) + OY(j)\item + Chr(9) + Chr(9) +  Chr(9) + Str(OY(j)\orgno) + Chr(9) + "IMPERFECT MATCH !")
            imperf = imperf + 1
            If P(j) = "": P(j) = "I": EndIf: If Q(i) = "": Q(i) = "I": EndIf  ; puts "I" in counter of P and Q array to identify that word is already considered
            CountO = CountO + 1: CountC = CountC + 1 ; a counter count of words to ensure all words taken and to tally with ii and jj
          EndIf
      EndIf
    Next j
  Next i
  AddGadgetItem(5, -1, "---------------------------------------------------------")  
  AddGadgetItem(5, -1, "")  
EndProcedure
  
Procedure ProcessRepeats()    ; processes for repeats i.e. whether a word has been repeated in the texts - not significant as this is an error
  Pj = 0
  For i = 1 To ii
    For j = 1 To jj
      If OX(i)\item = OY(j)\item And OX(i)\orgno <> OY(j)\orgno
          If P(j) = "" And Q(i) = ""
            ; print and take as REPEAT WORD
            AddGadgetItem(5, -1, OX(i)\item + Chr(9) + Chr(9) +  Chr(9) + Str(OX(i)\orgno) + Chr(9) + OY(j)\item + Chr(9) + Chr(9) +  Chr(9) + Str(OY(j)\orgno) + Chr(9) + "REPEAT WORD")
            rept = rept + 1
            err = err + 1
            P(j) = "R": Q(i) = "R"
          Else
            ; do not print and take as REPEAT WORD
          EndIf
      EndIf
    Next j
  Next i
  AddGadgetItem(5, -1, "---------------------------------------------------------")  
  AddGadgetItem(5, -1, "")
EndProcedure
  
Procedure ProcessNotInCorr()  ; processes for words that are in Original text but NOT in the Corrected text
; This routine brings out the words that are in ORIGINAL Text but NOT in CORRECTED Text
  For i = 1 To ii
    If Q(i) = ""
      AddGadgetItem(5, -1, OX(i)\item + Chr(9) + Chr(9) +  Chr(9) + Str(OX(i)\orgno) +  Chr(9) + Chr(9) + Chr(9) + Chr(9) + Chr(9) + "Not in Corrected Text")
      rem = rem + 1
      err = err + 1
      CountO = CountO + 1
    EndIf
  Next i
  AddGadgetItem(5, -1, "---------------------------------------------------------")  
  AddGadgetItem(5, -1, "")
EndProcedure
  
Procedure ProcessNotInOrig()    ; processes for words that are in Corrected text but NOT in the Original text
; This routine brings out the words that are in CORRECTED Text but NOT in ORIGINAL Text
  For j = 1 To jj
    If P(j) = ""
      AddGadgetItem(5, -1, Chr(9) + Chr(9) + Chr(9) +  Chr(9) + OY(j)\item + Chr(9) + Chr(9) +  Chr(9) + Str(OY(j)\orgno) + Chr(9) + "Not in Original Text")
      addi = addi + 1
      err = err + 1
      CountC = CountC + 1
    EndIf
  Next j
  AddGadgetItem(5, -1, "---------------------------------------------------------")  
  AddGadgetItem(5, -1, "")
EndProcedure
  
Procedure EndProcess()    ; end process and final printing of the summary
  AddGadgetItem(5, -1, "Words in Original: " + Chr(9) + Str(CountO) +  Chr(9) + "Words in Corrected: " + Chr(9) + Str(CountC))
  AddGadgetItem(5, -1, "")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Perfect Matches   : " + Str(perf) + " words")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Imperfect Matches : " + Str(imperf) + " words")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Repeat Matches    : " + Str(rept) + " words")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Removals          : " + Str(rem) + " words")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Additions         : " + Str(addi) + " words")
  AddGadgetItem(5, -1, "")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Total Errors      : " + Str(err) + " words")
  Errorpercent.f = err*100/ii
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "Error Percentage  : " + StrF(Errorpercent, 2) + " %")
  AddGadgetItem(5, -1, "")
  AddGadgetItem(5, -1, Chr(9) + Chr(9) + "--- End ---")
EndProcedure
  
Procedure Initialize()    ; initialises the variables on "Initiate" as well as on "Clear"
  ORIG = "": CORR = ""
  perf=0: imperf=0: rept=0: rem=0: addi=0: err=0: Pj=0: Flag = 0
  LenORIG=0: LenCORR=0: LenORIGStat=0: LenCORRStat=0: Correxistence=0
  Errorpercent=0: CountO = 0: CountC = 0
  ReDim O(100) : ReDim C(100): ReDim P(100): ReDim Q(100)
  For i = 1 To 100
    P(i) = "": Q(i) = ""
  Next i
  ClearGadgetItems(5)
EndProcedure
Post Reply