Text-matching function

Everything else that doesn't fall into one of the other PB categories.
boddhi
Enthusiast
Enthusiast
Posts: 524
Joined: Mon Nov 15, 2010 9:53 pm

Text-matching function

Post by boddhi »

Hello,

A long time ago, I have created a tool that allows me to store some informations (artists, album names, track titles and positions, durations, covers, etc.) about my CD collection in a database fed over time.
This represents now not far from 50,000 recordings.

Among the titles, I inevitably have duplicates due to remakes, titles present on several albums or in compilations, and so on.

But this has sometimes led to the database containing duplicates of titles with slightly different names.
Example:
  Sittin' on the dock of the bay
  Sittin' on the Dock of the bay
  Sitting on the dock of the bay
  (Sittin' on) The dock of the bay
  The dock of the bay
  ...
Well, you get the idea. :wink:

So, I'd like to write a routine that gives me a match probability rate that two or more titles could potentially be the same. A text-matching rate.

I've started to read docs on the subject but have come across some rather complex algorithmic functions. Too complex for me! :mrgreen:

To tell you the truth, I don't really know how to go about doing something that's relatively simple but fast enough given the amount of data. RegEx? Lists? ???

What would be your approach?
Thanks.
If my English syntax and lexicon are incorrect, please bear with Google translate and DeepL. They rarely agree with each other!
Except on this sentence...
BarryG
Addict
Addict
Posts: 4173
Joined: Thu Apr 18, 2019 8:17 am

Re: Text-matching function

Post by BarryG »

User avatar
SPH
Enthusiast
Enthusiast
Posts: 571
Joined: Tue Jan 04, 2011 6:21 pm

Re: Text-matching function

Post by SPH »

Maybe look at the first 100 bytes of each MP3 (if you supervise mp3 files) and compare them with each other...

!i!i!i!i!i!i!i!i!i!
!i!i!i!i!i!i!
!i!i!i!
//// Informations ////
Portable LENOVO ideapad 110-17ACL 64 bits
Version de PB : 6.12LTS - 64 bits
User avatar
spikey
Enthusiast
Enthusiast
Posts: 769
Joined: Wed Sep 22, 2010 1:17 pm
Location: United Kingdom

Re: Text-matching function

Post by spikey »

I'd be looking at 'Soundex' type phonetic standardisation. Jassing recently posted one version at https://www.purebasic.fr/english/viewtopic.php?p=621344. Expanding his example to also remove parenthesis and apostrophes…

Code: Select all

; NYSIIS.PBI
; in the 1990's there was a huge study performed by New York state
;   to come up with a better name matching technique.  
; the result of that was a document "Name Search Technique" by Robert L Taft.
; While I was employed at Social Services in California, I used this algorithm
;   for matching names in databases (ie: Someone says "Smith" but was it "Smythe"?
; We also temporarily normalised names "Chuck" to "Charles" despite the fact someone
;   may actually be named "Chuck" someone may think it's short for "Charles" - this 
;   algorithm does not do that, but it is something to consider when trying to match names.
;   but it should match misspellings such as "tomas" "thomas"
; The alogithm is "New York State Identification & Intellegence System", or NYSIIS for short.
;   And is covered in pages #88-90
; we had a lot of Portugese clients so we had some additoinal tweaks, but tweaks are beyond the original scope.
;   I have included only 2 tweaks to demonstrate the ease of changing it, experiment ....
;
; The document is 126 pages long, and includes indepth informatin, as well as detailed statistical analysis.
;     if you want a copy (of my copy), get in touch with me on gmail (josh.assing) and I'll make copies for costs+postage.
;     OR better (Thanks PBJim) https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
;
; My original C code was ascii only,not sure how relevant that is. (NYSIIS didn't take into account non ascii
;    this doesn't either ...)
EnableExplicit

; not specified... this was'the old days'; "ú" was entered as "u"; for NYSIIS() we need to convert these.
; YOU SHOULD convert accented to non (ie:"ú" TO "u")

; XIncludeFile "changeAccentedMOD.pbi"

Macro replaceStr(str,find,new,position):ReplaceString(str,find,new,#PB_String_CaseSensitive|#PB_String_InPlace,position,1):EndMacro
Macro removeChar(str,find) : ReplaceString(str,find,"",#PB_String_NoCase) : EndMacro
#includeJoshExtension=#False

Procedure.s NYSIIS( original.s, bForIndexing=#False ) 
  Protected nysiis.s, p,l, nIndexLength
  
  nIndexLength = Len(original)
  
  ; this is not specifically specified, but implied.
  original = UCase(Trim(original)) ; only deal with upper case.
  
  CompilerIf #includeJoshExtension
    ; not specified, optional...
    CompilerIf Defined(changeAccented,#PB_Module)
      changeAccented::changeAll(@original)
    CompilerEndIf
    
    ; this is mine
    ;  ReplaceString(original,"Y","I",#PB_String_NoCase|#PB_String_InPlace,2)
  CompilerEndIf 
  
  ; this is not specificaly specified, but it makes sense
  original = removeChar(original," ")
  original = removeChar(original,".")
  original = removeChar(original,",")
  original = removeChar(original,"'")
  original = removeChar(original,"(")
  original = removeChar(original,")")
  
  ; Step 1.
  If     Left(original,3)="MAC" : replaceStr(original,"MAC","MCC",1)
  ElseIf Left(original,3)="SCH" : replaceStr(original,"SCH","SSS",1)
  ElseIf Left(original,2)="KN"  : replaceStr(original,"KN", "NN", 1)
  ElseIf Left(original,2)="PH"  : replaceStr(original,"PH", "FF", 1)
  ElseIf Left(original,2)="PF"  : replaceStr(original,"PF", "FF", 1)
  ElseIf Left(original,1)="K"   : replaceStr(original,"K"  ,"C",  1)
  EndIf
  
  ; Step 2.
  original=ReverseString(original) ; not technically part of it; but makes the next bit easier.
  Select Left(original,2)
    Case "EE","EI"
      original = " Y"+Mid(original,3)
    Case "TD","TR","DR","TN","DN"
      original = " D"+Mid(original,3)
  EndSelect
  CompilerIf #includeJoshExtension
    ; mine,again
    If Left(original,1)="E" 
      ;   original=Mid(original,2)
    EndIf
  CompilerEndIf 
  original=ReverseString(original)
  
  ; step 3
  nysiis = Left(original,1)
  
  ; step 4
  p=2 : l = Len(original)
  
  ; step 5 (parts are not officially labeled as such, but are broken down to paragraphs)
  ;        (this loop can (and should) be optimised, I just wanted it to closely match the description,
  ;         but this should help you undestand what's going on & follow the document)
  While p <= l
    Select Mid(original,p,1)
      Case "A","E","I","O","U" ; part A 
        If Mid(original,p,2)="EV"
          replaceStr(original,"EV","AF",p)
        Else
          replaceStr(original,Mid(original,p,1),"A",p)
        EndIf
      Case "Q","Z","M" ; part B
        replaceStr(original,"Q","G",p)
        replaceStr(original,"Z","S",p)
        replaceStr(original,"Z","N",p)
      Case "K" ; part C
        If Mid(original,p+1,1)="N"
          replaceStr(original,"K","N",p)
        Else
          replaceStr(original,"K","C",p)
        EndIf
      Case "H" ; part E (we'll get to part D)
        If Not FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) Or
           Not FindString("AEIOU",Mid(original,p+1),1,#PB_String_CaseSensitive) 
          replaceStr(original,"H",Mid(original,p-1,1),p)
        EndIf
      Case "W" ; part F
        If FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) 
          replaceStr(original,"H",Mid(original,p-1,1),p)
        EndIf
        CompilerIf #includeJoshExtension
        Case "Y" ; mine
          replaceStr(original,"Y","I",p)
        CompilerEndIf
      Default ; part D
        If Mid(original,p,3)="SCH" : replaceStr(original,"SCH","SSS",p)
        ElseIf Mid(original,p,2)="PH" : replaceStr(original,"PH","FF",p)
        EndIf
        ; part G
        ; no match, do nothing
    EndSelect
    
    ; step 6
    If Mid(original,p,1) <> Right(nysiis,1)
      nysiis+Mid(original,p,1)
    EndIf 
    p+1
  Wend
  
  ; step 7
  If Right(nysiis,1)="S" And ( #includeJoshExtension =#False Or Len(NYSIIS)>4) ; after the 'And' is mine.
    nysiis=Left(nysiis,Len(nysiis)-1)
  EndIf
  
  ; step 8
  If Right(nysiis,2)="AY"
    nysiis=Left(nysiis,Len(nysiis)-2)+"Y"
  ElseIf Right(nysiis,1)="A" ; step 9
    nysiis=Left(nysiis,Len(nysiis)-1)
  EndIf
  
  ; done!!
  
  If bForIndexing ; indexes should keep original field length.
    nysiis=LSet(nysiis,nIndexLength," ")
  EndIf 
  ProcedureReturn nysiis
EndProcedure

  Debug nysiis("Sittin' on the dock of the bay")
  Debug NYSIIS("Sittin' on the Dock of the bay")
  Debug nysiis("Sitting on the dock of the bay")
  Debug NYSIIS("(Sittin' on) The dock of the bay")
  Debug NYSIIS(" The dock of the bay")
 
… I get the following results:

Code: Select all

SATANANTADACAFTABY
SATANANTADACAFTABY
SATANGANTADACAFTABY
SATANANTADACAFTABY
TADACAFTABY
As Azjio suggests you can then use something like the Levenshtein distance to determine proximity to a specific target.
boddhi
Enthusiast
Enthusiast
Posts: 524
Joined: Mon Nov 15, 2010 9:53 pm

Re: Text-matching function

Post by boddhi »

Hi,
spikey wrote: I'd be looking at 'Soundex' type phonetic standardisation. [...]
I had already seen this code and the associated thread.
If the majority of my titles are in English, another large part are in French, and a minority in Spanish.
For this reason, I had put it aside because I had a lot of adaptations to add, but perhaps it's worth examining it more in detail... :wink:
 
SPH wrote: Sat Jun 08, 2024 11:28 am Maybe look at the first 100 bytes of each MP3 (if you supervise mp3 files) and compare them with each other...
There are several reasons why this is not the right way to go:
  • The main one is that, as codecs have evolved and physical media have grown in size,I've encoded over time in different formats (MP3, M4A then FLAC) and with different audio bitrates (re-encoding older files would be too tedious). So I can have identical track names in several formats.
  • Secondly, each file contains metadata, often present at the beginning of the file, so even for an identically encoded track, the metadata, if different, could skew the results with this technique.
  • The third is that a track may be performed by different artists (e.g. Satisfaction by The Stones or Otis Redding). As the interpretation is different, the encoded data will also be different, and yet it's the same title name.

So physical comparison is absolutely the wrong approach.

@BarryG & AZJIO,
Thanks, I'll look into your suggestions. :wink:
If my English syntax and lexicon are incorrect, please bear with Google translate and DeepL. They rarely agree with each other!
Except on this sentence...
Post Reply