Name matching, NYIIS

jassing · Post by **jassing** » Wed May 15, 2024 2:31 am

I'll postpone a short rant until later. I was contacted to produce some x64 dll's. The contract ultimately failed.
Since I wasn't paid to produce this, here ya go -- details in the comments.
It's a replacement of sorts for a sounded() (See NYSIIS @ Wiki (thanks @PBJim)
I will be posting more code later.

Code: Select all

; NYSIIS.PBI
; in the 1990's there was a huge study performed by New York state
;   to come up with a better name matching technique.  
; the result of that was a document "Name Search Technique" by Robert L Taft.
; While I was employed at Social Services in California, I used this algorithm
;   for matching names in databases (ie: Someone says "Smith" but was it "Smythe"?
; We also temporarily normalised names "Chuck" to "Charles" despite the fact someone
;   may actually be named "Chuck" someone may think it's short for "Charles" - this 
;   algorithm does not do that, but it is something to consider when trying to match names.
;   but it should match misspellings such as "tomas" "thomas"
; The alogithm is "New York State Identification & Intellegence System", or NYSIIS for short.
;   And is covered in pages #88-90
; we had a lot of Portugese clients so we had some additoinal tweaks, but tweaks are beyond the original scope.
;   I have included only 2 tweaks to demonstrate the ease of changing it, experiment ....
;
; The document is 126 pages long, and includes indepth informatin, as well as detailed statistical analysis.
;     if you want a copy (of my copy), get in touch with me on gmail (josh.assing) and I'll make copies for costs+postage.
;     OR better (Thanks PBJim) https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
;
; My original C code was ascii only,not sure how relevant that is. (NYSIIS didn't take into account non ascii
;    this doesn't either ...)
EnableExplicit

; not specified... this was'the old days'; "ú" was entered as "u"; for NYSIIS() we need to convert these.
; YOU SHOULD convert accented to non (ie:"ú" TO "u")

XIncludeFile "changeAccentedMOD.pbi"

Macro replaceStr(str,find,new,position):ReplaceString(str,find,new,#PB_String_CaseSensitive|#PB_String_InPlace,position,1):EndMacro
Macro removeChar(str,find) : ReplaceString(str,find,"",#PB_String_NoCase) : EndMacro
#includeJoshExtension=#False

Procedure.s NYSIIS( original.s, bForIndexing=#False ) 
  Protected nysiis.s, p,l, nIndexLength
  
  nIndexLength = Len(original)
  
  ; this is not specifically specified, but implied.
  original = UCase(Trim(original)) ; only deal with upper case.
  
  CompilerIf #includeJoshExtension
    ; not specified, optional...
    CompilerIf Defined(changeAccented,#PB_Module)
      changeAccented::changeAll(@original)
    CompilerEndIf
    
    ; this is mine
    ;  ReplaceString(original,"Y","I",#PB_String_NoCase|#PB_String_InPlace,2)
  CompilerEndIf 
  
  ; this is not specificaly specified, but it makes sense
  original = removeChar(original," ")
  original = removeChar(original,".")
  original = removeChar(original,",")
  
  ; Step 1.
  If     Left(original,3)="MAC" : replaceStr(original,"MAC","MCC",1)
  ElseIf Left(original,3)="SCH" : replaceStr(original,"SCH","SSS",1)
  ElseIf Left(original,2)="KN"  : replaceStr(original,"KN", "NN", 1)
  ElseIf Left(original,2)="PH"  : replaceStr(original,"PH", "FF", 1)
  ElseIf Left(original,2)="PF"  : replaceStr(original,"PF", "FF", 1)
  ElseIf Left(original,1)="K"   : replaceStr(original,"K"  ,"C",  1)
  EndIf
  
  ; Step 2.
  original=ReverseString(original) ; not technically part of it; but makes the next bit easier.
  Select Left(original,2)
    Case "EE","EI"
      original = " Y"+Mid(original,3)
    Case "TD","TR","DR","TN","DN"
      original = " D"+Mid(original,3)
  EndSelect
  CompilerIf #includeJoshExtension
    ; mine,again
    If Left(original,1)="E" 
      ;   original=Mid(original,2)
    EndIf
  CompilerEndIf 
  original=ReverseString(original)
  
  ; step 3
  nysiis = Left(original,1)
  
  ; step 4
  p=2 : l = Len(original)
  
  ; step 5 (parts are not officially labeled as such, but are broken down to paragraphs)
  ;        (this loop can (and should) be optimised, I just wanted it to closely match the description,
  ;         but this should help you undestand what's going on & follow the document)
  While p <= l
    Select Mid(original,p,1)
      Case "A","E","I","O","U" ; part A 
        If Mid(original,p,2)="EV"
          replaceStr(original,"EV","AF",p)
        Else
          replaceStr(original,Mid(original,p,1),"A",p)
        EndIf
      Case "Q","Z","M" ; part B
        replaceStr(original,"Q","G",p)
        replaceStr(original,"Z","S",p)
        replaceStr(original,"Z","N",p)
      Case "K" ; part C
        If Mid(original,p+1,1)="N"
          replaceStr(original,"K","N",p)
        Else
          replaceStr(original,"K","C",p)
        EndIf
      Case "H" ; part E (we'll get to part D)
        If Not FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) Or
           Not FindString("AEIOU",Mid(original,p+1),1,#PB_String_CaseSensitive) 
          replaceStr(original,"H",Mid(original,p-1,1),p)
        EndIf
      Case "W" ; part F
        If FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) 
          replaceStr(original,"H",Mid(original,p-1,1),p)
        EndIf
        CompilerIf #includeJoshExtension
        Case "Y" ; mine
          replaceStr(original,"Y","I",p)
        CompilerEndIf
      Default ; part D
        If Mid(original,p,3)="SCH" : replaceStr(original,"SCH","SSS",p)
        ElseIf Mid(original,p,2)="PH" : replaceStr(original,"PH","FF",p)
        EndIf
        ; part G
        ; no match, do nothing
    EndSelect
    
    ; step 6
    If Mid(original,p,1) <> Right(nysiis,1)
      nysiis+Mid(original,p,1)
    EndIf 
    p+1
  Wend
  
  ; step 7
  If Right(nysiis,1)="S" And ( #includeJoshExtension =#False Or Len(NYSIIS)>4) ; after the 'And' is mine.
    nysiis=Left(nysiis,Len(nysiis)-1)
  EndIf
  
  ; step 8
  If Right(nysiis,2)="AY"
    nysiis=Left(nysiis,Len(nysiis)-2)+"Y"
  ElseIf Right(nysiis,1)="A" ; step 9
    nysiis=Left(nysiis,Len(nysiis)-1)
  EndIf
  
  ; done!!
  
  If bForIndexing ; indexes should keep original field length.
    nysiis=LSet(nysiis,nIndexLength," ")
  EndIf 
  ProcedureReturn nysiis
EndProcedure
CompilerIf #PB_Compiler_IsMainFile
  Debug nysiis("Mac dugal")
  Debug NYSIIS("Grossman")
  Debug nysiis("großman")
  Debug NYSIIS("smith")
  Debug NYSIIS("smithe")
  Debug NYSIIS("smyth")
  Debug NYSIIS("smythe")
  Debug NYSIIS("rosenblatt")
  Debug NYSIIS("rozenblat")
  Debug nysiis("josh")
  Debug nysiis("joßh")
  Debug nysiis("joshua")
  Debug nysiis("joßhua")
  Debug nysiis("jeff")
  Debug nysiis("jeffry")
  CompilerIf Defined(changeaccented,#PB_Module)
    
    ; if you want to customize...
    changeaccented::changeMap('ß','B')
    Debug NYSIIS("Grossman")
    Debug nysiis("großman")
    Debug nysiis("josh")
    Debug nysiis("joßh")
  CompilerEndIf
CompilerEndIf

it uses this, optionally:

Code: Select all

; changeAccentedMOD.pbi
DeclareModule changeAccented
  Declare changeSingle( *asciibyte.ascii )
  Declare changeAll( *string.string )
  Declare changeMap( old.a, new.a )
EndDeclareModule

Module changeAccented
  Threaded NewMap accentedChar.a()
  Global Dim nonAccentedChar(0)
  Global maxCharacters
  
  Procedure loadmap()
    ; easy way of avoiding mutexes for threads...
    Protected i, tmp.a, max.b
    
    If MapSize( accentedChar() ) <> maxCharacters
      ClearMap(accentedChar())    
      Restore accented
      For i = 1 To 32
        Read.a tmp
        accentedChar(Str(tmp))=nonAccentedChar(i)
      Next
    EndIf
  EndProcedure
  
  Procedure changeMap( old.a, new.a)
    If MapSize(accentedChar())=0:loadmap():EndIf
    
    accentedChar(Str(old))=new
  EndProcedure
  
  Procedure changeSingle( *asciiByte.Ascii )
    If MapSize(accentedChar())=0:loadmap():EndIf
    
    If FindMapElement(accentedChar(),Str(*asciiByte\a))
      *asciiByte\a=accentedChar(Str(*asciiByte\a))
    EndIf
  EndProcedure
  
  Procedure changeAll( *string )
    Protected *p
    *p = *string+StringByteLength(PeekS(*string))-1
    While *p > *string
      changeSingle( *p )
      *p-1;SizeOf(Character)
    Wend
  EndProcedure
  
  Procedure init()
    Restore characterCount
    Read.b maxCharacters
    If ArraySize( nonAccentedChar()) <> maxCharacters
      Restore nonAccented
      ReDim nonAccentedChar(maxCharacters)
      For i = 1 To maxCharacters
        Read.a nonAccentedChar(i)
      Next
    EndIf
  EndProcedure
  ; NB: ß sounds like an "S", but looks like a "B" - up to you how to translate it.
  DataSection
    characterCount:
    Data.b 63
    accented: 
    Data.a 'À','Á','Â','Ã','Ä','Å','Æ','Ç',  'È','É','Ê','Ë','Ì','Í','Î','Ï'
    Data.a 'Ð','Ñ','Ò','Ó','Ô','Õ','Ö','×',  'Ø','Ù','Ú','Û','Ü','Ý','Þ','ß' ;
    Data.a 'à','á','â','ã','ä','å','æ','ç',  'è','é','ê','ë','ì','í','î','ï'
    Data.a 'ð','ñ','ò','ó','ô','õ','ö','ø',  'ù','ú','û','ü','ý','þ','ÿ'
    nonaccented:
    Data.a 'A','A','A','A','A','A','A','C',  'E','E','E','E','I','I','I','I'
    Data.a 'D','N','O','O','O','O','O','x',  'O','U','U','U','U','Y','b','S' 
    Data.a 'a','a','a','a','a','a','a','c',  'e','e','e','e','i','i','i','i'
    Data.a 'o','n','o','o','o','o','o','o',  'u','u','u','u','y','b','y'
  EndDataSection
EndModule

CompilerIf #PB_Compiler_IsMainFile
  v.s= "JOßH"
  
  For i = 0 To StringByteLength(v)-1
    changeAccented::changeSingle(@v+i)
  Next
  Debug v
  v.s= "JOßH"
  changeAccented::changeAll(@v)
  Debug v
  changeAccented::changeMap('ß','B')
  v.s= "JOßH"
  changeAccented::changeAll(@v)
  Debug v
CompilerEndIf

Mildly related, however, this compares one name to another (higher value, more alike), whereas NYSIIS() gives a value you can use in a compare.

Code: Select all

EnableExplicit

; made as a module so i1 could xinclude it in another.
; Levenshtein Distance, there are several versions
;   of the agorithm, this uses the "full matrix" version.

DeclareModule Levenshtein
  Declare.f calculate( S1.s, S2.s )
EndDeclareModule

Module Levenshtein
  EnableExplicit
  
  Procedure min3(a, b, c) 
    If b < a : a = b : EndIf
    If c < a : a = c : EndIf
    ProcedureReturn a
  EndProcedure
  
  Procedure.f calculate(S1.s, S2.s) 
    Protected   i1, i2, L1, L2, cost
    Protected.s mS1, mS2
    Protected.f distance = 0
    
    L1 = Len(S1) : L2 = Len(S2) 
    If S1 = S2   : distance=100
    ElseIf L1 <> 0 And L2 <> 0
      Protected Dim matrix(L1, L2)
      For i1 = 0 To L1 : matrix(i1, 0) = i1 : Next
      For i2 = 0 To L2 : matrix(0, i2) = i2 : Next
      
      For i1   = 1 To L1   : mS1 = Mid(S1, i1, 1)
        For i2 = 1 To L2   : mS2 = Mid(S2, i2, 1)
          cost = Bool( Not mS1 = mS2 )
          matrix(i1, i2) = min3( matrix( i1-1, i2   ) + 1    , 
                                 matrix( i1,   i2-1 ) + 1    , 
                                 matrix( i1-1, i2-1 ) + cost )
        Next 
      Next 
      distance = ( ( L2 - matrix(L1, L2) ) / L2 ) * 100
    EndIf
    ProcedureReturn distance
  EndProcedure
EndModule

CompilerIf #PB_Compiler_IsMainFile
  Debug Levenshtein::calculate("smith", "smithe")
  Debug Levenshtein::calculate("smith", "smythe")
  Debug Levenshtein::calculate("smyth", "smythe")
  Debug Levenshtein::calculate("Paul", "Pawl")
  Debug Levenshtein::calculate("Trade", "Blade") 
  Debug Levenshtein::calculate("abcde", "cdeab")
  Debug Levenshtein::calculate("The same","The same")
  Debug Levenshtein::calculate("NOT The same","NOT the same")
CompilerEndIf

ebs · Post by **ebs** » Wed May 15, 2024 3:50 pm

Wow - I thought I was the only one to use this!

I used the NYSIIS encoding to match names when comparing ID information in a system I developed for a currency exchange company. I agree that it does a better job than the SOUNDEX algorithm.

I noticed a couple of what look like errors in your code:

Code: Select all

Macro replaceStr(str,find,new,position):ReplaceString(str,find,new,#PB_String_CaseSensitive|#PB_String_InPlace,p,1):EndMacro

should use 'Position', not 'p' as the fifth parameter

Code: Select all

Case "Q","Z","M" ; part B
        replaceStr(original,"Q","G",p)
        replaceStr(original,"Z","S",p)
        replaceStr(original,"Z","N",p)

The last line should be

Code: Select all

replaceStr(original,"M","N",p)

Also, your code and mine seem to use slightly different implementations. For example, your code translates 'Joshua' to 'JAS', whereas mine converts it to 'JASA'.

jassing · Post by **jassing** » Sat May 18, 2024 3:59 am

ebs wrote: Wed May 15, 2024 3:50 pm Wow - I thought I was the only one to use this!

Not many people know about it. I was almost sent back to NY (from California), but go the document instead... After reading it; I saw no reason to travel.

I noticed a couple of what look like errors in your code:
thanks the 'p' 'position' was a stupid programmer trick at the last minute for posting. Thanks.

ebs wrote: Wed May 15, 2024 3:50 pm Also, your code and mine seem to use slightly different implementations. For example, your code translates 'Joshua' to 'JAS', whereas mine converts it to 'JASA'.

post your version? It is very conceivable that when I converted to PB I made an error...

ebs · Post by **ebs** » Tue May 21, 2024 2:39 pm

Here's the code I used.
When translating "Joshua", it converts the 'U' to an 'A', then discards the trailing 'A' since it is the same as the previous letter, yielding 'JASA'.

Code: Select all

;- New York State Identification and Intelligence System Phonetic Code
;- Devised in 1970 as part of NYSIIS to improve on accuracy of Soundex
Procedure.s NYSIIS(Name.s)
  ; uppercase string
  upName.s = UCase(Name)
  
  ; remove non-alpha characters
  For Pos.l = 1 To Len(upName)
    Char.s = Mid(upName, Pos, 1)
    If Char >= "A" And Char <= "Z"
      uName.s + Char
    EndIf
  Next
  
  ; step 1 - first characters
  If Left(uName, 3) = "MAC"
    uName = "MC" + Mid(uName, 4)  ; "MCC"
  ElseIf Left(uName, 2) = "KN"
    uName = "N" + Mid(uName, 3)
  ElseIf Left(uName, 1) = "K"
    uName = "C" + Mid(uName, 2)
  ElseIf Left(uName, 2) = "PH"
    uName = "FF" + Mid(uName, 3)
  ElseIf Left(uName, 2) = "PF"
    uName = "F" + Mid(uName, 3)   ; "FF"
  ElseIf Left(uName, 3) = "SCH"
    uName = "S" + Mid(uName, 4)   ; "SSS"
  EndIf
  
  ; step 2 - last characters
  End2.s = Right(uName, 2)
  Select End2
    Case "EE", "IE"
      uName = Left(uName, Len(uName) - 2) + "Y"
    Case "DT", "RT", "RD", "NT", "ND"
      uName = Left(uName, Len(uName) - 2) + "D"
  EndSelect
  
  ; step 3 - code starts with first letter of name
  code.s = Left(uName, 1)
  prevKey.s = code
  
  ; steps 4 - 8 - translate letters
  #VOWELS = "AEIOU"
  ; start with second letter
  Pos.l = 2
  len.l = Len(uName)
  While Pos <= len
    Letter.s = Mid(uName, Pos, 1)
    key.s = Letter
    Skip.l = 1
    If FindString(#VOWELS, Letter, 1)
      If Mid(uName, Pos, 2) = "EV"
        key = "AF"
        Skip = 2
      ElseIf Letter = "A"
        If Pos = Len(uName) - 1 And Right(uName, 1) = "Y"
          key = "Y"
          Skip = 2
        ElseIf Pos = Len(uName)
          key = #Null$
        EndIf
      Else
        key = "A"
      EndIf
    ElseIf Letter = "Q"
      key = "G"
    ElseIf Letter = "Z"
      key = "S"
    ElseIf Letter = "M"
      key = "N"
    ElseIf Letter = "K"
      If Mid(uName, Pos, 2) = "KN"
        key = "N"
        Skip = 2
      Else
        key = "C"
      EndIf
    ElseIf Letter = "H"
      prevLetter.s = Mid(uName, Pos - 1, 1)
      nextLetter.s = Mid(uName, Pos + 1, 1)
      If (FindString(#VOWELS, prevLetter, 1) = 0 Or FindString(#VOWELS, nextLetter, 1) = 0)
        key = prevLetter
      EndIf
    ElseIf Letter = "W"
      prevLetter.s = Mid(uName, Pos - 1, 1)
      If FindString(#VOWELS, prevLetter, 1)
        key = "A"
      EndIf
    ElseIf Letter = "S" And Pos = Len(uName)
      key = #Null$
    ElseIf Mid(uName, Pos, 3) = "SCH"
      key = "S" ; "SSS"
      Skip = 3
    ElseIf Mid(uName, Pos, 2) = "PH"
      key = "F" ; "FF"
      Skip = 2
    EndIf
    
    ; add key to code if not duplicate
    If key <> prevKey
      code + key
      prevKey = key
    EndIf
    
    ; skip ahead in name
    Pos + Skip
  Wend
  
  ProcedureReturn code
EndProcedure

PBJim · Post by **PBJim** » Tue May 21, 2024 8:02 pm

Interesting to see these routines, as I have tried Soundex a few times over the years (since 1980s when it was incorporated into a database that I was working on). Trying a few different names with the two examples here, I found that "matthew" results in MATA, but Jassing's version gives MATAW.

I see that the Wikipedia article https://en.wikipedia.org/wiki/New_York_ ... nce_System rule 9 says if the last letter of the NYSIIS result is "A", then remove it, but I suppose the difference in logic is not necessarily that — it's that your result Ebs, is actually removing the "W".

Glad to see this though, it's an expected function in databases, just not that well known perhaps.

jassing · Post by **jassing** » Tue May 21, 2024 8:11 pm

PBJim wrote: Tue May 21, 2024 8:02 pm I see that the Wikipedia article https://en.wikipedia.org/wiki/New_York_ ... nce_System rule 9 says if the last

Thanks for the link, I should have known there was a wiki. I just assumed it was obscure enough there wouldn't be one.

PBJim wrote: Tue May 21, 2024 8:02 pm letter of the NYSIIS result is "A", then remove it, but I suppose the difference in logic is not necessarily that — it's that your result Ebs, is actually removing the "W".

Well; thanks! I was just getting to comparing my code & the actual document...

PBJim wrote: Tue May 21, 2024 8:02 pmGlad to see this though, it's an expected function in databases, just not that well known perhaps.

NYSIIS is centric to western European names. Not so good with Asian names. I think this sort of name-matching is too specific for "wide speed" adoption. But for some it's a valuable tool to have in your kit.

ebs · Post by **ebs** » Tue May 21, 2024 8:13 pm

it's that your result Ebs, is actually removing the "W"

I think that's due to rule 7 that says "If the current position is the letter 'W' and the preceding letter is a vowel then replace the current position with the preceding position."

Unless I misunderstood the spec, the 'W' is replaced by the previous letter 'A', and since consecutive duplicates are ignored, the second 'A' is dropped.

EDIT: Now I'm really confused!

Rule 9 says that a trailing 'A' should be dropped - does that mean that the code should wind up as 'MAT'?

PBJim · Post by **PBJim** » Tue May 21, 2024 8:47 pm

jassing wrote: Tue May 21, 2024 8:11 pm NYSIIS is centric to western European names. Not so good with Asian names. I think this sort of name-matching is too specific for "wide speed" adoption. But for some it's a valuable tool to have in your kit.

Yes, quite true. The Soundex wikipedia page mentions past attempts to make it work better. An example is Eastern European names not working well, such as Moskowitz vs. Moskovitz or Levine vs. Lewin.

PBJim · Post by **PBJim** » Tue May 21, 2024 8:52 pm

ebs wrote: Tue May 21, 2024 8:13 pm
it's that your result Ebs, is actually removing the "W"
I think that's due to rule 7 that says "If the current position is the letter 'W' and the preceding letter is a vowel then replace the current position with the preceding position."

Unless I misunderstood the spec, the 'W' is replaced by the previous letter 'A', and since consecutive duplicates are ignored, the second 'A' is dropped.

EDIT: Now I'm really confused!
Rule 9 says that a trailing 'A' should be dropped - does that mean that the code should wind up as 'MAT'?

Yes, the way I interpreted that, is that when reaching the final step, if the last letter is "A", then drop it, but don't keep dropping it, because the rule is only applied once.

I suppose one advantage with all this though, is that whichever way you implement this, it's always going to function in the same way, because you generate the code consistently. In other words, you're not normally exchanging data with other systems that might have implemented the logic differently.

jassing · Post by **jassing** » Tue May 21, 2024 9:17 pm

PBJim wrote: Tue May 21, 2024 8:52 pm I suppose one advantage with all this though, is that whichever way you implement this, it's always going to function in the same way, because you generate the code consistently. In other words, you're not normally exchanging data with other systems that might have implemented the logic differently.

this is a good point, however, like everything "it depends"
I worked for social services, we had to share data with other county agencies (ie: sheriff). if our algorithms differ, we get mixed results.
so consistency across cohesive environments is important (that sounded horribly like some guru selling goop to gullibles, if only I had tossed synergistic into it...)

PureBasic Forums - English

Name matching, NYIIS

Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS

Re: Name matching, NYIIS