Text-matching function

boddhi · Post by **boddhi** » Sat Jun 08, 2024 12:06 am

Hello,

A long time ago, I have created a tool that allows me to store some informations (artists, album names, track titles and positions, durations, covers, etc.) about my CD collection in a database fed over time.
This represents now not far from 50,000 recordings.

Among the titles, I inevitably have duplicates due to remakes, titles present on several albums or in compilations, and so on.

But this has sometimes led to the database containing duplicates of titles with slightly different names.
Example:

  Sittin' on the dock of the bay
  Sittin' on the Dock of the bay
  Sitting on the dock of the bay
  (Sittin' on) The dock of the bay
  The dock of the bay
  ...

Well, you get the idea.

So, I'd like to write a routine that gives me a match probability rate that two or more titles could potentially be the same. A text-matching rate.

I've started to read docs on the subject but have come across some rather complex algorithmic functions. Too complex for me!

To tell you the truth, I don't really know how to go about doing something that's relatively simple but fast enough given the amount of data. RegEx? Lists? ???

What would be your approach?
Thanks.

BarryG · Post by **BarryG** » Sat Jun 08, 2024 1:07 am

https://www.purebasic.fr/english/viewto ... 37#p494937

AZJIO · Post by **AZJIO** » Sat Jun 08, 2024 4:13 am

https://github.com/acmeism/RosettaCodeD ... .purebasic
https://www.purebasic.fr/english/search ... zzy+search

SPH · Post by **SPH** » Sat Jun 08, 2024 11:28 am

Maybe look at the first 100 bytes of each MP3 (if you supervise mp3 files) and compare them with each other...

spikey · Post by **spikey** » Sat Jun 08, 2024 11:56 am

I'd be looking at 'Soundex' type phonetic standardisation. Jassing recently posted one version at https://www.purebasic.fr/english/viewtopic.php?p=621344. Expanding his example to also remove parenthesis and apostrophes…

Code: Select all

; NYSIIS.PBI
; in the 1990's there was a huge study performed by New York state
;   to come up with a better name matching technique.  
; the result of that was a document "Name Search Technique" by Robert L Taft.
; While I was employed at Social Services in California, I used this algorithm
;   for matching names in databases (ie: Someone says "Smith" but was it "Smythe"?
; We also temporarily normalised names "Chuck" to "Charles" despite the fact someone
;   may actually be named "Chuck" someone may think it's short for "Charles" - this 
;   algorithm does not do that, but it is something to consider when trying to match names.
;   but it should match misspellings such as "tomas" "thomas"
; The alogithm is "New York State Identification & Intellegence System", or NYSIIS for short.
;   And is covered in pages #88-90
; we had a lot of Portugese clients so we had some additoinal tweaks, but tweaks are beyond the original scope.
;   I have included only 2 tweaks to demonstrate the ease of changing it, experiment ....
;
; The document is 126 pages long, and includes indepth informatin, as well as detailed statistical analysis.
;     if you want a copy (of my copy), get in touch with me on gmail (josh.assing) and I'll make copies for costs+postage.
;     OR better (Thanks PBJim) https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
;
; My original C code was ascii only,not sure how relevant that is. (NYSIIS didn't take into account non ascii
;    this doesn't either ...)
EnableExplicit

; not specified... this was'the old days'; "ú" was entered as "u"; for NYSIIS() we need to convert these.
; YOU SHOULD convert accented to non (ie:"ú" TO "u")

; XIncludeFile "changeAccentedMOD.pbi"

Macro replaceStr(str,find,new,position):ReplaceString(str,find,new,#PB_String_CaseSensitive|#PB_String_InPlace,position,1):EndMacro
Macro removeChar(str,find) : ReplaceString(str,find,"",#PB_String_NoCase) : EndMacro
#includeJoshExtension=#False

Procedure.s NYSIIS( original.s, bForIndexing=#False ) 
  Protected nysiis.s, p,l, nIndexLength
  
  nIndexLength = Len(original)
  
  ; this is not specifically specified, but implied.
  original = UCase(Trim(original)) ; only deal with upper case.
  
  CompilerIf #includeJoshExtension
    ; not specified, optional...
    CompilerIf Defined(changeAccented,#PB_Module)
      changeAccented::changeAll(@original)
    CompilerEndIf
    
    ; this is mine
    ;  ReplaceString(original,"Y","I",#PB_String_NoCase|#PB_String_InPlace,2)
  CompilerEndIf 
  
  ; this is not specificaly specified, but it makes sense
  original = removeChar(original," ")
  original = removeChar(original,".")
  original = removeChar(original,",")
  original = removeChar(original,"'")
  original = removeChar(original,"(")
  original = removeChar(original,")")
  
  ; Step 1.
  If     Left(original,3)="MAC" : replaceStr(original,"MAC","MCC",1)
  ElseIf Left(original,3)="SCH" : replaceStr(original,"SCH","SSS",1)
  ElseIf Left(original,2)="KN"  : replaceStr(original,"KN", "NN", 1)
  ElseIf Left(original,2)="PH"  : replaceStr(original,"PH", "FF", 1)
  ElseIf Left(original,2)="PF"  : replaceStr(original,"PF", "FF", 1)
  ElseIf Left(original,1)="K"   : replaceStr(original,"K"  ,"C",  1)
  EndIf
  
  ; Step 2.
  original=ReverseString(original) ; not technically part of it; but makes the next bit easier.
  Select Left(original,2)
    Case "EE","EI"
      original = " Y"+Mid(original,3)
    Case "TD","TR","DR","TN","DN"
      original = " D"+Mid(original,3)
  EndSelect
  CompilerIf #includeJoshExtension
    ; mine,again
    If Left(original,1)="E" 
      ;   original=Mid(original,2)
    EndIf
  CompilerEndIf 
  original=ReverseString(original)
  
  ; step 3
  nysiis = Left(original,1)
  
  ; step 4
  p=2 : l = Len(original)
  
  ; step 5 (parts are not officially labeled as such, but are broken down to paragraphs)
  ;        (this loop can (and should) be optimised, I just wanted it to closely match the description,
  ;         but this should help you undestand what's going on & follow the document)
  While p <= l
    Select Mid(original,p,1)
      Case "A","E","I","O","U" ; part A 
        If Mid(original,p,2)="EV"
          replaceStr(original,"EV","AF",p)
        Else
          replaceStr(original,Mid(original,p,1),"A",p)
        EndIf
      Case "Q","Z","M" ; part B
        replaceStr(original,"Q","G",p)
        replaceStr(original,"Z","S",p)
        replaceStr(original,"Z","N",p)
      Case "K" ; part C
        If Mid(original,p+1,1)="N"
          replaceStr(original,"K","N",p)
        Else
          replaceStr(original,"K","C",p)
        EndIf
      Case "H" ; part E (we'll get to part D)
        If Not FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) Or
           Not FindString("AEIOU",Mid(original,p+1),1,#PB_String_CaseSensitive) 
          replaceStr(original,"H",Mid(original,p-1,1),p)
        EndIf
      Case "W" ; part F
        If FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) 
          replaceStr(original,"H",Mid(original,p-1,1),p)
        EndIf
        CompilerIf #includeJoshExtension
        Case "Y" ; mine
          replaceStr(original,"Y","I",p)
        CompilerEndIf
      Default ; part D
        If Mid(original,p,3)="SCH" : replaceStr(original,"SCH","SSS",p)
        ElseIf Mid(original,p,2)="PH" : replaceStr(original,"PH","FF",p)
        EndIf
        ; part G
        ; no match, do nothing
    EndSelect
    
    ; step 6
    If Mid(original,p,1) <> Right(nysiis,1)
      nysiis+Mid(original,p,1)
    EndIf 
    p+1
  Wend
  
  ; step 7
  If Right(nysiis,1)="S" And ( #includeJoshExtension =#False Or Len(NYSIIS)>4) ; after the 'And' is mine.
    nysiis=Left(nysiis,Len(nysiis)-1)
  EndIf
  
  ; step 8
  If Right(nysiis,2)="AY"
    nysiis=Left(nysiis,Len(nysiis)-2)+"Y"
  ElseIf Right(nysiis,1)="A" ; step 9
    nysiis=Left(nysiis,Len(nysiis)-1)
  EndIf
  
  ; done!!
  
  If bForIndexing ; indexes should keep original field length.
    nysiis=LSet(nysiis,nIndexLength," ")
  EndIf 
  ProcedureReturn nysiis
EndProcedure

  Debug nysiis("Sittin' on the dock of the bay")
  Debug NYSIIS("Sittin' on the Dock of the bay")
  Debug nysiis("Sitting on the dock of the bay")
  Debug NYSIIS("(Sittin' on) The dock of the bay")
  Debug NYSIIS(" The dock of the bay")

… I get the following results:

Code: Select all

SATANANTADACAFTABY
SATANANTADACAFTABY
SATANGANTADACAFTABY
SATANANTADACAFTABY
TADACAFTABY

As Azjio suggests you can then use something like the Levenshtein distance to determine proximity to a specific target.

boddhi · Post by **boddhi** » Sun Jun 09, 2024 11:02 pm

Hi,

spikey wrote: I'd be looking at 'Soundex' type phonetic standardisation. [...]

I had already seen this code and the associated thread.
If the majority of my titles are in English, another large part are in French, and a minority in Spanish.
For this reason, I had put it aside because I had a lot of adaptations to add, but perhaps it's worth examining it more in detail...

SPH wrote: Sat Jun 08, 2024 11:28 am Maybe look at the first 100 bytes of each MP3 (if you supervise mp3 files) and compare them with each other...

There are several reasons why this is not the right way to go:
  • The main one is that, as codecs have evolved and physical media have grown in size,I've encoded over time in different formats (MP3, M4A then FLAC) and with different audio bitrates (re-encoding older files would be too tedious). So I can have identical track names in several formats.
  • Secondly, each file contains metadata, often present at the beginning of the file, so even for an identically encoded track, the metadata, if different, could skew the results with this technique.
  • The third is that a track may be performed by different artists (e.g. Satisfaction by The Stones or Otis Redding). As the interpretation is different, the encoded data will also be different, and yet it's the same title name.

So physical comparison is absolutely the wrong approach.

@BarryG & AZJIO,
Thanks, I'll look into your suggestions.

PureBasic Forums - English

Text-matching function

Text-matching function

Re: Text-matching function

Re: Text-matching function

Re: Text-matching function

Re: Text-matching function

Re: Text-matching function