similar string comparison
similar string comparison
dear all, i need help, i have an array of about 3000 phrases, i need to find phrases similarities and show in debug both phrases and % of similarities, i don't have any idea of i can code this, anyone can help me ?
example
this is a book
this is book
those two phrases i need to show with % of similarities
example
this is a book
this is book
those two phrases i need to show with % of similarities
-
- Always Here
- Posts: 6426
- Joined: Fri Oct 23, 2009 2:33 am
- Location: Wales, UK
- Contact:
Re: similar string comparison
using a checksum?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
If it sounds simple, you have not grasped the complexity.
Re: similar string comparison
Perhaps the Levenshtein Distance might serve your needs. It doesn't express the similarities in *percentages, but rather in ascending values:ALAN-MHz wrote:...i have an array of about 3000 phrases, i need to find phrases similarities and show in debug both phrases and % of similarities
example
this is a book
this is book
Code: Select all
;============================================
;
; a simple Levenshtein string comparator
; the ReceiveHTTPFile() function to
; download a sample text file from DropBox
;
; requires PureBasic v5.40 minimum for
; ReceiveHTTPFile() to work with DropBox
;
;============================================
#maxStrLen = 30
Structure levSort
score.i
phrases.s
EndStructure
Procedure LevenshteinDistance(s1.s, s2.s)
Protected s1Len = Len(s1)
Protected s2Len = Len(s2)
Dim levArray(s1Len, s2Len)
If s1Len > s2Len
percentile = s1Len
Else
percentile = s2Len
EndIf
If s1Len And s2Len
If s1Len > #maxStrLen
s1Len = #maxStrLen
EndIf
If s2Len > #maxStrLen
s2Len = #maxStrLen
EndIf
For i = 0 To s1Len
levArray(i, 0) = i
Next
For j = 0 To s2Len
levArray(0, j) = j
Next
For i = 1 To s1Len
For j = 1 To s2Len
If Mid(s1, i, 1) = Mid(s2, j, 1)
cost = #False
Else
cost = #True
EndIf
a = levArray(i - 1, j) + 1
b = levArray(i, j - 1) + 1
c = levArray(i - 1, j - 1) + cost
If b < a
a = b
EndIf
If c < a
a = c
EndIf
levArray(i, j) = a
Next
Next
result = (percentile - levArray(s1Len, s2Len)) / (percentile * 0.01)
Else
result = 9999
EndIf
ProcedureReturn result
EndProcedure
InitNetwork()
fileName$ = GetTemporaryDirectory() + "similarStrings.txt"
dropBoxLink$ = "https://www.dropbox.com/s/hrdxobhye8eks8d/similarStrings.txt?dl=1"
If FileSize(fileName$) = -1
ReceiveHTTPFile(dropBoxLink$, fileName$)
EndIf
If ReadFile(0, fileName$)
Dim s$(90)
Dim c$(90)
Dim results.levSort(8099)
While Not Eof(0)
s$(i) = ReadString(0)
i + 1
Wend
CloseFile(0)
CopyArray(s$(), c$())
For i = 0 To 89
For ii = 0 To 89
result = LevenshteinDistance(s$(i), c$(ii))
results(sno)\score = result
results(sno)\phrases = s$(i) + " >> " + c$(ii) + " == " + Str(result) + "%"
sno + 1
Next ii
Next i
SortStructuredArray(results(), #PB_Sort_Descending, OffsetOf(levSort\score), #PB_Integer)
wFlags = #PB_Window_SystemMenu | #PB_Window_ScreenCentered
OpenWindow(#PB_Any, #PB_Ignore, #PB_Ignore, 600, 600, "Levenshtein Comparator", wFlags)
listDisplay = ListViewGadget(#PB_Any, 10, 10, 580, 580)
resultsFile.s = SaveFileRequester("Save results to:", "", "Text (*.txt)|*.txt", 0)
If resultsFile <> ""
If GetExtensionPart(resultsFile) = ""
resultsFile + ".txt"
EndIf
CreateFile(0, resultsFile)
EndIf
For i = 0 To ArraySize(results())
AddGadgetItem(listDisplay, -1, Str(i + 1) + ". " + results(i)\phrases)
If IsFile(0)
WriteStringN(0, Str(i + 1) + ". " + results(i)\phrases)
EndIf
Next i
If IsFile(0)
CloseFile(0)
EndIf
While WaitWindowEvent() ! #PB_Event_CloseWindow : Wend
EndIf
Hope it might help in some way.

EDITS wrote:4th September 2016: integrated Lunasole's solution to return percentage-based results
18th February 2019: updated download links
Last edited by TI-994A on Mon Feb 18, 2019 8:18 am, edited 3 times in total.
Texas Instruments TI-99/4A Home Computer: the first home computer with a 16bit processor, crammed into an 8bit architecture. Great hardware - Poor design - Wonderful BASIC engine. And it could talk too! Please visit my YouTube Channel 

Re: similar string comparison
@TI-994A, thank for interesting example.
Btw, it can be easily turned to return %
Btw, it can be easily turned to return %
Code: Select all
;============================================
;
; a simple Levenshtein string comparator
; algorithm code by DoubleDutch
;
;============================================
#maxStrLen = 30
Procedure.f LevenshteinDistance(s1.s, s2.s)
Protected s1Len = Len(s1)
Protected s2Len = Len(s2)
Protected.f result, MaxPc
If s1Len > s2Len ; or switch here ">" to "<" ^^
MaxPc = s1Len
Else
MaxPc = s2Len
EndIf
Dim levArray(s1Len, s2Len)
If s1Len And s2Len
If s1Len > #maxStrLen
s1Len = #maxStrLen
EndIf
If s2Len > #maxStrLen
s2Len = #maxStrLen
EndIf
For i = 0 To s1Len
levArray(i, 0) = i
Next
For j = 0 To s2Len
levArray(0, j) = j
Next
For i = 1 To s1Len
For j = 1 To s2Len
If Mid(s1, i, 1) = Mid(s2, j, 1)
cost = #False
Else
cost = #True
EndIf
a = levArray(i - 1, j) + 1
b = levArray(i, j - 1) + 1
c = levArray(i - 1, j - 1) + cost
If b < a
a = b
EndIf
If c < a
a = c
EndIf
levArray(i, j) = a
Next
Next
result = levArray(s1Len, s2Len)
result = (MaxPc - result) / (MaxPc * 0.01)
Else
result = 0
EndIf
ProcedureReturn result
EndProcedure
; result is in %
Debug LevenshteinDistance("this is a book", "this is book")
Debug LevenshteinDistance("this is phrase 1", "this is 2 phrase")
Debug LevenshteinDistance("this is phrase 1", "this is phrase 2")
Debug LevenshteinDistance("this is phrase 1", "this is phrase 1")
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
Re: similar string comparison
Nice Lunasole! Seems to work as expected; didn't look at it that way.Lunasole wrote:Btw, it can be easily turned to return %...
I've integrated it into my example above.

Texas Instruments TI-99/4A Home Computer: the first home computer with a 16bit processor, crammed into an 8bit architecture. Great hardware - Poor design - Wonderful BASIC engine. And it could talk too! Please visit my YouTube Channel 

Re: similar string comparison
@TI-994A
@Lunasole
Very nice examples.
I was wondering how it might be done. Thanks
@Lunasole
Very nice examples.
I was wondering how it might be done. Thanks

DE AA EB
Re: similar string comparison
thanks to all for effort!
- Michael Vogel
- Addict
- Posts: 2807
- Joined: Thu Feb 09, 2006 11:27 pm
- Contact:
Re: similar string comparison
I'm using the (slightly) tuned code below because I want to check for duplicate files on my harddisk - but the routine is quite slow. Defining the matrix as a global array just saves microseconds, but I would need a real speed boost? W...?
Code: Select all
Procedure Distance1(s1.s,s2.s)
Protected a,b,c
Protected i,j,m,n
Protected m1,m2
Protected z
m=Len(s1)
n=Len(s2)
#MaxWordLength = 100
If m And n
If m>#MaxWordLength
m=#maxWordLength
EndIf
If n>#MaxWordLength
n=#maxWordLength
EndIf
Dim DistanceMatrix(m,n)
For i=0 To m
DistanceMatrix(i,0)=i
Next
For j=0 To n
DistanceMatrix(0,j)=j
Next
m1=@s1-1
m2=@s2-1
For i=1 To m
For j=1 To n
If PeekA(m1+i)=PeekA(m2+j)
;If Mid(s1,i,1)=Mid(s2,j,1)
z=#False
Else
z=#True
EndIf
a=DistanceMatrix(i-1,j)+1; deletion
b=DistanceMatrix(i,j-1)+1; insertion
c=DistanceMatrix(i-1,j-1)+z; substitution
If b<a
a=b
EndIf
If c<a
a=c
EndIf
DistanceMatrix(i,j)=a
Next
Next
z=DistanceMatrix(m,n); absolute
If m>n
n=m
EndIf
z=(n-z)*100/n; relative
ElseIf m=n
z=#Null
Else
z=100; or #MaxWordLength for absolute
EndIf
ProcedureReturn z
EndProcedure
Re: similar string comparison
@Michael Vogel, the algorithm is slow enough itself. Also as for me this one badly goes with searching duplicates., but probably you can make it noticeably faster if do it multi-threaded (also rewriting in ASM surely will bring some speed).
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
Re: similar string comparison
It helps a bit to use pointers instead of peek.
Code: Select all
Procedure Distance1a(s1.s,s2.s)
Protected a,b,c
Protected i,j,m,n
Protected *m1.Ascii,*m2.Ascii,m1a.a
Protected z
m=Len(s1)
n=Len(s2)
#MaxWordLength = 100
If m And n
If m>#MaxWordLength
m=#maxWordLength
EndIf
If n>#MaxWordLength
n=#maxWordLength
EndIf
Protected Dim DistanceMatrix(m,n)
For i=0 To m
DistanceMatrix(i,0)=i
Next
For j=0 To n
DistanceMatrix(0,j)=j
Next
*m1=@s1
For i=1 To m
m1a = *m1\a
*m2=@s2
For j=1 To n
If m1a=*m2\a
;If Mid(s1,i,1)=Mid(s2,j,1)
z=#False
Else
z=#True
EndIf
a=DistanceMatrix(i-1,j)+1; deletion
b=DistanceMatrix(i,j-1)+1; insertion
c=DistanceMatrix(i-1,j-1)+z; substitution
If b<a
a=b
EndIf
If c<a
a=c
EndIf
DistanceMatrix(i,j)=a
*m2+SizeOf(Character)
Next
*m1+SizeOf(Character)
Next
z=DistanceMatrix(m,n); absolute
If m>n
n=m
EndIf
z=(n-z)*100/n; relative
ElseIf m=n
z=#Null
Else
z=100; or #MaxWordLength for absolute
EndIf
ProcedureReturn z
EndProcedure
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
- Andre
- PureBasic Team
- Posts: 2139
- Joined: Fri Apr 25, 2003 6:14 pm
- Location: Germany (Saxony, Deutscheinsiedel)
- Contact:
Re: similar string comparison
Looks interesting.
Seems to have the potential for using it in a searchfunction too, when similar searchresults should be noticed.
Or do you know other algorithms, which are used for such cases?

Seems to have the potential for using it in a searchfunction too, when similar searchresults should be noticed.
Or do you know other algorithms, which are used for such cases?
Re: similar string comparison
A little compare procedure. Not sure of the speed as compared to wilbert's or any of the others above. 
EDIT: Updated code.

EDIT: Updated code.
Code: Select all
;PureBasic 5.43
Procedure.s PhraseMatch(Phrase.s, ComparePhrase.s)
If Phrase <> ComparePhrase
CleanPhrase.s = RemoveString(Phrase, ",")
CleanPhrase.s = RemoveString(CleanPhrase, ".")
CleanPhrase.s = RemoveString(CleanPhrase, "?")
CleanPhrase.s = RemoveString(CleanPhrase, "!")
Word.s = ""
WordMatch.i = 0
PhraseLength.i = CountString(CleanPhrase, " ") + 1
StringCheck.i = 0
MatchingWords.s = ""
For SplitPhrase = 1 To PhraseLength
Word = StringField(CleanPhrase, SplitPhrase, " ")
If SplitPhrase = PhraseLength
StringCheck = FindString(ComparePhrase, Word)
Else
StringCheck = FindString(ComparePhrase, Word + " ")
EndIf
If StringCheck
MatchingWords = MatchingWords + Word + Chr(44) + Chr(32)
ComparePhrase = RemoveString(ComparePhrase, Word)
WordMatch + 1
EndIf
Next
Percentage.f = 100 / PhraseLength
ProcedureReturn Str(Percentage * WordMatch) + Chr(37) + " - " + RTrim(RTrim(MatchingWords, " "), ",")
Else
ProcedureReturn "100" + Chr(37) + " - " + Phrase
EndIf
EndProcedure
PhraseA.s = "Sally sells sea shells at the sea shore."
PhraseB.s = "Sally paints sea shells and sells them."
Debug PhraseMatch(PhraseA, PhraseB)
www.posemotion.com
PureBasic Tools for OS X: PureMonitor, plist Tool, Data Maker & App Chef
Even the vine knows it surroundings but the man with eyes does not.
PureBasic Tools for OS X: PureMonitor, plist Tool, Data Maker & App Chef
Even the vine knows it surroundings but the man with eyes does not.
Re: similar string comparison
Hi Wilbert!

Sorry, but your code yields 100% in this example:wilbert wrote:It helps a bit to use pointers instead of peek.Code: Select all
Procedure Distance1a(s1.s,s2.s) Protected a,b,c Protected i,j,m,n Protected *m1.Ascii,*m2.Ascii,m1a.a Protected z m=Len(s1) n=Len(s2) #MaxWordLength = 100 If m And n If m>#MaxWordLength m=#maxWordLength EndIf If n>#MaxWordLength n=#maxWordLength EndIf Protected Dim DistanceMatrix(m,n) For i=0 To m DistanceMatrix(i,0)=i Next For j=0 To n DistanceMatrix(0,j)=j Next *m1=@s1 For i=1 To m m1a = *m1\a *m2=@s2 For j=1 To n If m1a=*m2\a ;If Mid(s1,i,1)=Mid(s2,j,1) z=#False Else z=#True EndIf a=DistanceMatrix(i-1,j)+1; deletion b=DistanceMatrix(i,j-1)+1; insertion c=DistanceMatrix(i-1,j-1)+z; substitution If b<a a=b EndIf If c<a a=c EndIf DistanceMatrix(i,j)=a *m2+SizeOf(Character) Next *m1+SizeOf(Character) Next z=DistanceMatrix(m,n); absolute If m>n n=m EndIf z=(n-z)*100/n; relative ElseIf m=n z=#Null Else z=100; or #MaxWordLength for absolute EndIf ProcedureReturn z EndProcedure
So each empty string gives 100%.Debug Distance1a("Test", "")


- Michael Vogel
- Addict
- Posts: 2807
- Joined: Thu Feb 09, 2006 11:27 pm
- Contact:
Re: similar string comparison
My fault,Lord wrote:Sorry, but your code yields 100% in this example:So each empty string gives 100%.Debug Distance1a("Test", "")
you must change m=n to m-n or swap the z=0 with the z=100 line at the end of the code.
Re: similar string comparison
Andre, may be for a searchfunction it's interesting to have code with small footprint and fast execution for big data apps:
Here our performance study for Damerau-Levenshtein in ASM:
Here our performance study for Damerau-Levenshtein in ASM:
Code: Select all
CompilerIf #PB_Compiler_Processor <> #PB_Processor_x86 And #PB_Compiler_Processor <> #PB_Processor_x64 : CompilerError "This works on x86 or x64 processors" : CompilerEndIf
CompilerSelect #PB_Compiler_Processor : CompilerCase #PB_Processor_x64 : CompilerEndSelect : OpenConsole() : EnableExplicit
Define s1.s, s2.s, s1l.l, s2l.l, t, tableL.l, time1.l, time2.l, *s1p, *s2p, *s1l, *s2l, levi.i
s1 = "meilenstein23456" ; max length = 255
s2 = "levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein23456levenshtein2345" ; s1 ist r9 und muss als ZeilenIncrement der kürzere String sein. geht aber auch bis Länge 255
;s2 = "levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien23456levenshtien2345"; s2 könnte zwar "endlos" werden (ecx), aber stept über ecx auch über cl hinweg, also max len = 255
s1l = Len(s1) : s2l = Len(s2) : tableL = s1l * s2l : *s1p = @s1 : *s2p = @s2 : *s1l = @s1l : *s2l = @s2l
t = AllocateMemory(tableL) : levi = AllocateMemory(8) ;: Delay(15000)
; The cost of Damerau-Levenshtein distance is O(mn). Cost is not reducable using SSE instructions but through utilization of x86es 3-4 ALUs per core, instruction fusion,
; cache awareness and other CPU-specifics in superscalar processors. Hyperthreading is of no advantage since there are no more ALUs and should be switched off.)
PrintN("lenght string s1: " + Str(s1l) + #CRLF$ + "lenght string s2: " + Str(s2l)) : PrintN("DamLev table size: " + Str(tableL)) : Time1 = ElapsedMilliseconds()
! align 8
! db 0fh, 1fh, 00h ;, 0fh, 1fh, 00h, 0fh, 1fh, 00h, 90h, 90h ; align der Kernroutine bringt seeehhhr viel! mit Damerau jo 359 ms
! push rbx rbp rsi rdi
! mov rbp, [v_t]
! xor rcx, rcx ; das ist der Hammer: danach erinnert sich die ALU, dass dar High-Part leer ist!!!
! xor rax, rax
! xor r14, r14 ; reset for unten iteration counter
! mov ecx, [v_s1l]
!MOV r11d, [v_s2l]
!MOV r12d, [v_s1l]
! @@: ; Spaltennummern (horizontal) vorbesetzen
! mov [rbp + rcx], cl ; das ist richtig, da die Zeile mit 1-byte Zahlen vorbesetzt wird, geht geht dann jedoch nur bis zu einer s1-Größe von 256
! sub cl, 1
! jns @b
! mov eax, [v_s1l]
! add eax, 1 ; add 1 for mulipl.
! mov r9d, eax ; [v_s1l]
! mov ecx, [v_s2l] ; s2l into ecx next line: ! MOVsx r9, word[v_s2l]
! imul eax, ecx ; al wird größer als 255, also muss ein größeres Register her = eax
! ne:
! mov [rbp + rax], cl ; Zeilennummern (senkrecht) vorbesetzten
! sub eax, [v_s1l]
! sub eax, 1
! sub ecx, 1 ; hier sub genommen anstatt dec wegen 2 zu 3 byte instruction und flag stabilty
! jnz ne ; lea ist noch so eine align instruction und 3 byte nop
! start:
! MOV rsi, [p_s1p] ; adresse von String1
! MOV rdi, [p_s2p] ; adresse von String2
! xor rcx, rcx ; ist wichtig
! xor rax, rax
; --------------------------------------------------------------- outer loop für Levi und Damerau -------------------------------------------
;! db 0fh, 1fh, 00h ; 90h, 90h ; rp1 besser aligned (z.B 8) als rp2 ist schneller
! rp1: ; Macro fusion wird in 64 bit mode supportet Nehalem hat 3 simple + 1 complex instr. decoder
! xor r10, r10 ; r10 macht rdx frei. ---> rdx, r11, r12, r13, (r14) und r15 sind noch frei !
! lea r8, [rbp + rcx] ; r8 ist tableoffset. rcx ist position step der linken vertikalen Zahlenreihe
; --------------------------------------------------------------- "leviCost aus den 2 aktuellen Charposis in den Strings ermitteln" -----------------------------------------
; db 0fh, 1fh, 00h, 0fh, 1fh, 00h, 90h,90h ; rp1 ist 8 und rp2 16-aligned
! rp2: ; rp2 ist eine lange und starke dependency chain, 90 Bytes lang, nicht mehr zu optimieren
! mov al, [rsi + rcx] ; Char aus s1
! cmp al, [rdi + r10] ; Check s1 gegen Char aus s2 ; cmp does not break dependency on Intel core!
; ! setnz al kann bei Damerau hier weg, macht dependency
; al <- cost aus dem Vergleich der beiden Character aus s1 und s2 mit -i und -j
! jmp DamLevCost ; jmp ist viel schneller ! ; jmp and jz are same instruction lenght ; jmp = no Damerau Function ; jz = if LeviCost = Null, jump over Damerau
! MOV bl, [rsi + rcx] ; Char aus s1
! cmp bl, [rdi + r10 - 1] ; Check s1 gegen Char aus s2 - 1
! jnz DamLevCost ; CMP is able to macrofuse mit jcc
! mov bl, [rsi + rcx - 1] ; Char aus s2 - 1
! cmp bl, [rdi + r10] ; Check s2 gegen Char aus s1 - 1
! db 0fh, 1fh, 00h, 90h, 90h ; 16 aligned, 64 align bringt nichts
! DamLevCost:
! setnz al ; zero flag aus S1 und s2 oder s2 - 1 und s1 - 1
; --------------------------------------------------------------- 1. cmp in Table ------------------------
! add al, [r8] ; oben links. add rax ist extrem langsam
! CMP AL, [r8 + 1] ; oben
! JBE no ; cmp & jbe are macro fused rule 19!
! mov al, [r8 + 1] ; mov byte r10b, [r9 + 1 * rbx] ; geht auch!
! ADD AL, 1
! no:
; --------------------------------------------------------------- 2. cmp in Table ------------------------
! CMP AL, [r8 + r9] ; links r9 ist [v_s1l] incremented by 1
! JBE @f
! MOV AL, [r8 + r9]
! add AL, 1
! @@:
! mov byte[r8 + r9 + 1], al ; rechts unten write deswegen brauchen wir r8
; --------------------------------------------------------------- Ende DamLev per Cell, jump to next Cell -----------------------------------------------------------------------
! add r10d, 1 ; r10 ist nächste Stelle innerhalb der Zeile nur zum cost match j++
! add r8d, r9d ; step zur nächsten Zeile, the offset of the table
! cmp r10d, r11d ;[v_s2l] ; j < n ; gegenüber [v_s2l] bringt das 10 Sek. von 11 Sek.
! jnz rp2 ; branch to inner loop
! add ecx, 1 ; nächste Spalte i++
! cmp ecx, r12d ;[v_s1l] ; value of v_s1l ;[v_s1l] bringt noch mal 0,3 Sek von 10 Sek.
! jnz rp1 ; diese loops zu unrollen bringt nichts
; =============================================================== Ende Outer Loop =============================================================================
! add r14, 1 ; add, cmp und jnz mit r14 werden im echten code nicht gebraucht
! cmp r14, 1000000 ; just iteration counter for throughput measurement. r15 ging hier nicht !!!!!!
! jnz start
! mov r14, [v_levi] ; adresse von levi --> r15 ; *levi oder p_levi
! mov [r14], rax ; schreibe rax an adresse von v_levi, weil ich extra keine variable definiert habe, sondern einen MemoryBereich
! pop rdi rsi rbp rbx
Time2 = ElapsedMilliseconds() - Time1
PrintN("DamLev-distance: " + Str(PeekL(levi)))
PrintN("Duration in ms at 1 Mio. iterations in r14: " + Str(Time2))
; Means approx. 135.000 Levs per second with distance 240 on a 3,5 Ghz i5-CPU. If distance is realisticly lower (f.e. 4) 1.250.000 Levs / s per core.)
; Means approx. 135.000 Levs per second with distance 240 on a 4,0 Ghz Sandy-CPU. If distance is realisticly lower (f.e. 4) 2.272.000 Levs / s per core.)
Input()
End