Page 1 of 3

Removed

Posted: Mon Feb 18, 2008 12:29 am
by Dr_Wildrick
Removed

Posted: Mon Feb 18, 2008 12:39 am
by GeoTrail
Very cool. works like a charm.

Robert and robert gives 1, Robert and roberT gives 2 :)

Here's the code in formatted fashion for those who find it hard to read the unformatted kind ;)

Code: Select all

Procedure.l Minimum(a.l,b.l,c.l) 
  ;******************************* 
  ;*** Get minimum of three values 
  ;******************************* 
  mi.l=0 
  mi = a 
  If b < mi 
  mi = b 
  EndIf 
  If c < mi 
  mi = c 
  EndIf 
  ProcedureReturn mi 
EndProcedure 

Procedure.l Levenshtein(s$,t$) 
  ;******************************** 
  ;*** Compute Levenshtein Distance 
  ;******************************** 
  m.l = 0; length of t 
  n.l = 0;length of s 
  i.l=0 ; iterates through s 
  j.l=0 ; iterates through t 
  s_i.s ="" ; ith character of s 
  t_j.s = ""; jth character of t 
  cost.l=0 ; cost 
  
  ; Step 1 
  n = Len(s$) 
  m = Len(t$) 
  Dim d(n,m); matrix 
  
  If n = 0 
    LD = m 
    ProcedureReturn ld 
  EndIf 
  
  If m = 0 
    LD = n 
    ProcedureReturn ld 
  EndIf 
  
  ; Step 2 
  For i = 0 To n 
    d(i, 0) = i 
  Next 
  
  For j = 0 To m 
    d(0, j) = j 
  Next 
  
  ;Step 3 
  For i = 1 To n 
    s_i = Mid(s$, i, 1) 
    
    ; Step 4 
    For j = 1 To m 
      t_j = Mid(t$, j, 1) 
      
      ; Step 5 
      If s_i = t_j 
        cost = 0 
      Else 
        cost = 1 
      EndIf 
      
      ; Step 6 
      d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost) 
    Next 
  
  Next 
  
  ; Step 7 
  LD = d(n, m) 
  ProcedureReturn ld 
EndProcedure

Debug Levenshtein("Robert","roberT")

Removed

Posted: Mon Feb 18, 2008 12:48 am
by Dr_Wildrick
Removed

Posted: Mon Feb 18, 2008 2:21 am
by Dare
Nice.

What would be good a way to assess similarities? Average size / cost?
And when would similarities raise a flag for, say, plagerism?

(I guess this is a "how long is a piece of string" question).

Removed

Posted: Mon Feb 18, 2008 3:34 am
by Dr_Wildrick
Removed

Re: How long is a string

Posted: Mon Feb 18, 2008 3:52 am
by Dare
Dr_Wildrick wrote:I have an ex-wife and two teenage daughters who makes sure I am aware of that fact on an a regular basis.)
:D

Thanks for info. :)

Posted: Tue Feb 19, 2008 10:30 am
by dell_jockey
Aahhhh, this takes me back to 1991, when I published an article in the C/C++ Users Journal (May issue, Vol.9/Nr.5) called "Inexact Alphanumeric Comparisons". The code I presented in that article used the Levenstein distance as well.

Perhaps it's time to dust off a more recent iteration of that old C-code and port it to PureBasic...

Posted: Tue Feb 19, 2008 12:38 pm
by DoubleDutch
This is the distance routine I wrote for Elementary Reports's spell checker...

Code: Select all

Procedure Distance(s$,t$)
	m=Len(s$)
	n=Len(t$)
	If m And n
		If m>#MaxWordLength
			m=#maxWordLength
		EndIf
		If n>#MaxWordLength
			n=#maxWordLength
		EndIf
		For i=0 To m
			DistanceMatrix(i,0)=i
		Next
		For j=0 To n
			DistanceMatrix(0,j)=j
		Next
		For i=1 To m
			For j=1 To n
				If Mid(s$,i,1)=Mid(t$,j,1)
					cost=#False
				Else
					cost=#True
				EndIf
				a=DistanceMatrix(i-1,j)+1	; deletion
				b=DistanceMatrix(i,j-1)+1	; insertion
				c=DistanceMatrix(i-1,j-1)+cost ; substitution
				If b<a
					a=b
				EndIf
				If c<a
					a=c
				EndIf
				DistanceMatrix(i,j)=a
			Next
		Next
		result=DistanceMatrix(m,n)
	Else
		result=9999
	EndIf
	ProcedureReturn result
EndProcedure
I got the algorithm for the routine from Wikipedia:
http://en.wikipedia.org/wiki/Levenshtein_distance

Please note: The 9999 value is there because it works well with the spell checker, it should really be the greater length of the two strings (which is exactly how many chages it would take for a null string compared with a non-null string.

Also note: DistanceMatrix[MaxWordLength,MaxWordLength] should be set up as a global array where MaxWordLength is the max word length! ;)

Posted: Tue Feb 19, 2008 1:05 pm
by pdwyer
Playing with the code here I was expected to be disapointed for some reason but I was pleasently surprised :)

I modified the original code to be faster using memory buffers, and it lets it handle pointers to file bytes. (and made the vars easier to recognise ;) )

Even so, doing a lot of comparisons could be time consuming. If you have a lot of students with a large report it's going to take a while to number crunch. Anyway, here's what I have, thanks for introducing me to the algorithm (I'll look at doubledutch's one next :)

Code: Select all


Procedure.s MidMem(*MainMem, StartPos.l, GetLen.l)  

    If *MainMem = 0 Or GetLen = 0
        ProcedureReturn ""
    EndIf
    
    *RetMem = AllocateMemory(GetLen)
    CopyMemory(*MainMem + StartPos -1, *RetMem, GetLen)
    ReturnString.s = PeekS(*RetMem,GetLen)
    FreeMemory(*RetMem)
    
    ProcedureReturn ReturnString

EndProcedure


Procedure.l Minimum(a.l,b.l,c.l) 
    ;******************************* 
    ;*** Get minimum of three values 
    ;******************************* 
    mi.l=0 
    mi = a 
    
    If b < mi 
        mi = b 
    EndIf 
    
    If c < mi 
        mi = c 
    EndIf 
    
    ProcedureReturn mi 
EndProcedure 

Procedure.l Levenshtein(pStr1.l, StrLen1.l, pStr2.l, StrLen2.l) 
    ;******************************** 
    ;*** Compute Levenshtein Distance 
    ;******************************** 
 
    S1_i.s = "" ; ith character of s 
    S2_j.s = "" ; jth character of t 
    cost.l=0 ; cost 
    
    ; Step 1 
    Dim d.l(StrLen1,StrLen2); matrix 
    
    If StrLen1 = 0 
        ProcedureReturn StrLen2
    EndIf 
    
    If StrLen2 = 0 
        ProcedureReturn StrLen1 
    EndIf 
    
    ; Step 2 
    For i = 0 To StrLen1 
        d(i, 0) = i 
    Next 
    
    For j = 0 To StrLen2 
        d(0, j) = j 
    Next 
  
  ;Step 3 
    For i = 1 To StrLen1 
        S1_i = MidMem(pStr1, i, 1) 
        
        ; Step 4 
        For j = 1 To StrLen2 
            S2_j = MidMem(pStr2, j, 1) 
             
            ; Step 5 
            If S1_i = S2_j 
                cost = 0 
            Else 
                cost = 1 
            EndIf 
            
            ; Step 6 
            d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost) 
        Next 
    
    Next 
  
    ; Step 7  
    ProcedureReturn d(StrLen1, StrLen2)

EndProcedure 

DisableDebugger


Str1.s = "Heello"
Str2.s = "hellddo"

Ans1 =  Levenshtein(@Str1,Len(Str1),@Str2,Len(Str2))

File1.s = OpenFileRequester("File 1","","",0)
File2.s = OpenFileRequester("File 2","","",0)

If Len(File1) > 1 And Len(file2) > 1 

    OpenFile(1,File1)
        FileLen1 = Lof(1)
        pFile1Data = AllocateMemory(FileLen1)
        ReadData(1, pFile1Data, FileLen1)
    CloseFile(1)
    
    OpenFile(2,File2)
        FileLen2 = Lof(2)
        pFile2Data = AllocateMemory(FileLen2)
        ReadData(2, pFile1Data, FileLen2)
    CloseFile(2)

    Ans2 =  Levenshtein(pFile1Data,FileLen1 ,pFile2Data,FileLen2)

EndIf
 
EnableDebugger

Debug ans1
Debug ans2
 


Posted: Tue Feb 19, 2008 1:13 pm
by pdwyer
Doubledutch, you are missing

Dim DistanceMatrix(m,n)
and the #MaxWordLength

I added them and they worked, did you use a global array? Any reason?

I think these are the same, I like yours without the function call, I think that's unnecessary overheard considering the number of times it can get called (millions for a file)

Posted: Tue Feb 19, 2008 1:31 pm
by DoubleDutch
Doubledutch, you are missing

Dim DistanceMatrix(m,n)
and the #MaxWordLength
Please see...
doubledutch wrote:Also note: DistanceMatrix[MaxWordLength,MaxWordLength] should be set up as a global array where MaxWordLength is the max word length!
:)

Edit: Your right, both routines are near identical. I just wanted to give an example of where the routine would have some use (like in a spell checker) and to show how you didn't need the expensive procdure call from the original algorithm.

Posted: Tue Feb 19, 2008 2:10 pm
by pdwyer
:lol:

Oops, did it again.

(kind of off topic)
DistanceMatrix has no need of being global, why would you make it global? Or is it just because you were doing a test? Is there some perf benefit in PB I'm not aware of?

I put it in as:

Code: Select all

#MaxWordLength = 100
 
Procedure Distance(s$,t$) 
   m=Len(s$) 
   n=Len(t$) 
   
   Dim DistanceMatrix(m,n)
   
   If m And n 
      If m>#MaxWordLength 
         m=#maxWordLength 
      EndIf 
      If n>#MaxWordLength 
         n=#maxWordLength 
      EndIf 
      For i=0 To m 
         DistanceMatrix(i,0)=i 
      Next 
      For j=0 To n 
         DistanceMatrix(0,j)=j 
      Next 
      For i=1 To m 
         For j=1 To n 
            If Mid(s$,i,1)=Mid(t$,j,1) 
               cost=#False 
            Else 
               cost=#True 
            EndIf 
            a=DistanceMatrix(i-1,j)+1   ; deletion 
            b=DistanceMatrix(i,j-1)+1   ; insertion 
            c=DistanceMatrix(i-1,j-1)+cost ; substitution 
            If b<a 
               a=b 
            EndIf 
            If c<a 
               a=c 
            EndIf 
            DistanceMatrix(i,j)=a 
         Next 
      Next 
      result=DistanceMatrix(m,n) 
   Else 
      result=9999 
   EndIf 
   ProcedureReturn result 
EndProcedure

Debug distance("Paul"," Paul") 
I think I might make a pointer version of yours to handle files as nulls, I thing the extra function call is unnecessary, but get rid of the "MaxWordLength" limitation

Posted: Tue Feb 19, 2008 4:29 pm
by DoubleDutch
It's because the memory allocation is small and it only needs to be allocated once. I don't know internally how PureBasic does it, but I presume that it uses a Windows routine - so possibly (if other programs are also allocating memory) if I did loads of small memory allocations then the allocations *may* be interleaved (with those of other programs) and this *may* result in memory fragmentation and a general slow down of the system.

This is just a guess though, so I may be wrong?

Removed

Posted: Thu Feb 21, 2008 4:02 am
by Dr_Wildrick
Removed

Posted: Thu Feb 21, 2008 8:20 am
by DoubleDutch
Maybe you should post it as a new tip and a new feature suggestion?