LCS - Longest Common Substring of 2 (byte-)strings

Share your advanced PureBasic knowledge/code with the community.
Froggerprogger
Enthusiast
Enthusiast
Posts: 423
Joined: Fri Apr 25, 2003 5:22 pm
Contact:

LCS - Longest Common Substring of 2 (byte-)strings

Post by Froggerprogger »

This might be quite useful:

Code: Select all

; LCS_Get:
; Calculates the Longest Common Subsequence contained in *p_x and *p_y
; Works with any bytestring, not just for strings !
; So you might feed this function with e.g. binary filedata !
;
; Returns a pointer to resulting bytestring (+terminating 0, so IF you work with
; strings, you might just call PeekS(LCS_Get(...))
; The length is stored in the referenced *p_resLen, or set this parameter to 0,
; if you don't need to get the length.
;
; by Froggerprogger 04/08/12

ProcedureDLL.l LCS_Get(*p_x.l, p_xLen.l, *p_y.l, p_yLen.l, *p_resLen.l)
  ; LCS = Longest Common Subsequence
  ; based on pseudocode from Cormen's 'Introduction to Algorithms SE' p.353f
  ; (combination of LCS_LENGTH and (iterative) PRINT_LCS)
  ; by Froggerprogger 04/08/12
  
  Protected i.l, j.l
  Protected *mem.l
  Protected resLen.l
  Protected actPos.l
  
  ; define the arrays used for calculation (I wish we had local ones...)
  Dim LCS_c.l(p_xLen, p_yLen)
  Dim LCS_d.b(p_xLen, p_yLen) ; 1=upleft, 2=up, 3=left
  
  ; compute the main algorithm
  For i=1 To p_xLen
    For j = 1 To p_yLen
      If PeekB(*p_x + (i-1)) = PeekB(*p_y + (j-1))
        LCS_c(i, j) = LCS_c(i-1, j-1) + 1
        LCS_d(i, j) = 1 ; upleft
      Else 
        If LCS_c(i-1, j) >= LCS_c(i, j-1)
          LCS_c(i,j) = LCS_c(i-1, j)
          LCS_d(i,j) = 2 ; up
        Else
          LCS_c(i,j) = LCS_c(i, j-1)
          LCS_d(i,j) = 3 ; left
        EndIf
      EndIf
    Next j
  Next i
  
  ; the length is stored in bottom right field
  resLen = LCS_c(p_xLen, p_yLen)

  ; now we might free LCS_c
  Dim LCS_c.l(0,0)

  ; allocate memory for our result (plus a terminating zero for comfortable use with PeekS)
  *mem = AllocateMemory(resLen + 1)
  If *mem = 0
    ProcedureReturn -1
  EndIf
  
  i = p_xLen
  j = p_yLen
  actPos = resLen -1
  
  ; write out our lcs
  While i > 0 And j > 0
    If LCS_d(i,j) = 1
      PokeB(*mem + actPos, PeekB(*p_x + (i-1)))
      actPos - 1
      i-1
      j-1
    ElseIf LCS_d(i,j) = 2
      i-1
    Else ; LCS_d(i,j) = 3
      j-1
    EndIf
  Wend
  
  ; free LCS_d
  Dim LCS_d.b(0,0)
  
  ; write the length of lcs at position *p_resLen
  If *p_resLen
    PokeL(*p_resLen, resLen)
  EndIf

  ; return pointer to result
  ProcedureReturn *mem
EndProcedure

a.s = "This is a string that contains some stupid stuff just to type something...!"
b.s = "This another one, that really (!) has nothing to do with spaghetti !"
len.l

Debug ".............................."
Debug "Longest Common Substring of the following 2 bytestrings:"
Debug ".............................."
Debug a
Debug b
Debug ".............................."
Debug PeekS(LCS_Get(@a, Len(a), @b, Len(b), @len))
Debug ".............................."
Debug "length: "  + Str(len)
Last edited by Froggerprogger on Fri Aug 13, 2004 12:39 pm, edited 1 time in total.
%1>>1+1*1/1-1!1|1&1<<$1=1
User avatar
blueznl
PureBasic Expert
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am
Contact:

Post by blueznl »

this is the first step to a smart patcher, you know :-)

1. take two files, old and new one
2. find largest identical part
3. store pointers / info
4. keep on doing that until the remaining stuff is too small to bother
5. add all data that is different
6. pack the result

and voila, an update tool :-)

(and who cares about the speed? it's preparing a patch that takes some time, not applying it :-))
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
Max.²
Enthusiast
Enthusiast
Posts: 175
Joined: Wed Jul 28, 2004 8:38 am

Post by Max.² »

blueznl wrote:this is the first step to a smart patcher, you know :-)

1. take two files, old and new one
2. find largest identical part
3. store pointers / info
4. keep on doing that until the remaining stuff is too small to bother
5. add all data that is different
6. pack the result

and voila, an update tool :-)

(and who cares about the speed? it's preparing a patch that takes some time, not applying it :-))
It is a very nice code for a lot of purposes!
Froggerprogger
Enthusiast
Enthusiast
Posts: 423
Joined: Fri Apr 25, 2003 5:22 pm
Contact:

Post by Froggerprogger »

...but for a super-smart patcher it should search for similar parts independant of their relative position, too, so that e.g. AAAACCCCBBBB and AAAABBBBCCCC results not just in LCS AAAACCCC with extra-data BBBB, but in same LCS with extra-data 'take the part BBBB from orignal and place it there'

I'm working on it ... :)
%1>>1+1*1/1-1!1|1&1<<$1=1
User avatar
blueznl
PureBasic Expert
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am
Contact:

Post by blueznl »

i wonder if a packer would not cover that to a point, anyway, the approach might be something like:

1. compare new file with old file
2. find largest match
3. store info on largest match
4. take that largest match from the new file (ie. mark it such that it will no longer be tested for a match)
5. go back to 2 until there's not enough left to work on

i wonder if it should work of files, filelists, or directories... probably filelists...
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
Post Reply