Since I wasn't paid to produce this, here ya go -- details in the comments.
It's a replacement of sorts for a sounded() (See NYSIIS @ Wiki (thanks @PBJim)
I will be posting more code later.
Code: Select all
; NYSIIS.PBI
; in the 1990's there was a huge study performed by New York state
; to come up with a better name matching technique.
; the result of that was a document "Name Search Technique" by Robert L Taft.
; While I was employed at Social Services in California, I used this algorithm
; for matching names in databases (ie: Someone says "Smith" but was it "Smythe"?
; We also temporarily normalised names "Chuck" to "Charles" despite the fact someone
; may actually be named "Chuck" someone may think it's short for "Charles" - this
; algorithm does not do that, but it is something to consider when trying to match names.
; but it should match misspellings such as "tomas" "thomas"
; The alogithm is "New York State Identification & Intellegence System", or NYSIIS for short.
; And is covered in pages #88-90
; we had a lot of Portugese clients so we had some additoinal tweaks, but tweaks are beyond the original scope.
; I have included only 2 tweaks to demonstrate the ease of changing it, experiment ....
;
; The document is 126 pages long, and includes indepth informatin, as well as detailed statistical analysis.
; if you want a copy (of my copy), get in touch with me on gmail (josh.assing) and I'll make copies for costs+postage.
; OR better (Thanks PBJim) https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
;
; My original C code was ascii only,not sure how relevant that is. (NYSIIS didn't take into account non ascii
; this doesn't either ...)
EnableExplicit
; not specified... this was'the old days'; "ú" was entered as "u"; for NYSIIS() we need to convert these.
; YOU SHOULD convert accented to non (ie:"ú" TO "u")
XIncludeFile "changeAccentedMOD.pbi"
Macro replaceStr(str,find,new,position):ReplaceString(str,find,new,#PB_String_CaseSensitive|#PB_String_InPlace,position,1):EndMacro
Macro removeChar(str,find) : ReplaceString(str,find,"",#PB_String_NoCase) : EndMacro
#includeJoshExtension=#False
Procedure.s NYSIIS( original.s, bForIndexing=#False )
Protected nysiis.s, p,l, nIndexLength
nIndexLength = Len(original)
; this is not specifically specified, but implied.
original = UCase(Trim(original)) ; only deal with upper case.
CompilerIf #includeJoshExtension
; not specified, optional...
CompilerIf Defined(changeAccented,#PB_Module)
changeAccented::changeAll(@original)
CompilerEndIf
; this is mine
; ReplaceString(original,"Y","I",#PB_String_NoCase|#PB_String_InPlace,2)
CompilerEndIf
; this is not specificaly specified, but it makes sense
original = removeChar(original," ")
original = removeChar(original,".")
original = removeChar(original,",")
; Step 1.
If Left(original,3)="MAC" : replaceStr(original,"MAC","MCC",1)
ElseIf Left(original,3)="SCH" : replaceStr(original,"SCH","SSS",1)
ElseIf Left(original,2)="KN" : replaceStr(original,"KN", "NN", 1)
ElseIf Left(original,2)="PH" : replaceStr(original,"PH", "FF", 1)
ElseIf Left(original,2)="PF" : replaceStr(original,"PF", "FF", 1)
ElseIf Left(original,1)="K" : replaceStr(original,"K" ,"C", 1)
EndIf
; Step 2.
original=ReverseString(original) ; not technically part of it; but makes the next bit easier.
Select Left(original,2)
Case "EE","EI"
original = " Y"+Mid(original,3)
Case "TD","TR","DR","TN","DN"
original = " D"+Mid(original,3)
EndSelect
CompilerIf #includeJoshExtension
; mine,again
If Left(original,1)="E"
; original=Mid(original,2)
EndIf
CompilerEndIf
original=ReverseString(original)
; step 3
nysiis = Left(original,1)
; step 4
p=2 : l = Len(original)
; step 5 (parts are not officially labeled as such, but are broken down to paragraphs)
; (this loop can (and should) be optimised, I just wanted it to closely match the description,
; but this should help you undestand what's going on & follow the document)
While p <= l
Select Mid(original,p,1)
Case "A","E","I","O","U" ; part A
If Mid(original,p,2)="EV"
replaceStr(original,"EV","AF",p)
Else
replaceStr(original,Mid(original,p,1),"A",p)
EndIf
Case "Q","Z","M" ; part B
replaceStr(original,"Q","G",p)
replaceStr(original,"Z","S",p)
replaceStr(original,"Z","N",p)
Case "K" ; part C
If Mid(original,p+1,1)="N"
replaceStr(original,"K","N",p)
Else
replaceStr(original,"K","C",p)
EndIf
Case "H" ; part E (we'll get to part D)
If Not FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive) Or
Not FindString("AEIOU",Mid(original,p+1),1,#PB_String_CaseSensitive)
replaceStr(original,"H",Mid(original,p-1,1),p)
EndIf
Case "W" ; part F
If FindString("AEIOU",Mid(original,p-1),1,#PB_String_CaseSensitive)
replaceStr(original,"H",Mid(original,p-1,1),p)
EndIf
CompilerIf #includeJoshExtension
Case "Y" ; mine
replaceStr(original,"Y","I",p)
CompilerEndIf
Default ; part D
If Mid(original,p,3)="SCH" : replaceStr(original,"SCH","SSS",p)
ElseIf Mid(original,p,2)="PH" : replaceStr(original,"PH","FF",p)
EndIf
; part G
; no match, do nothing
EndSelect
; step 6
If Mid(original,p,1) <> Right(nysiis,1)
nysiis+Mid(original,p,1)
EndIf
p+1
Wend
; step 7
If Right(nysiis,1)="S" And ( #includeJoshExtension =#False Or Len(NYSIIS)>4) ; after the 'And' is mine.
nysiis=Left(nysiis,Len(nysiis)-1)
EndIf
; step 8
If Right(nysiis,2)="AY"
nysiis=Left(nysiis,Len(nysiis)-2)+"Y"
ElseIf Right(nysiis,1)="A" ; step 9
nysiis=Left(nysiis,Len(nysiis)-1)
EndIf
; done!!
If bForIndexing ; indexes should keep original field length.
nysiis=LSet(nysiis,nIndexLength," ")
EndIf
ProcedureReturn nysiis
EndProcedure
CompilerIf #PB_Compiler_IsMainFile
Debug nysiis("Mac dugal")
Debug NYSIIS("Grossman")
Debug nysiis("großman")
Debug NYSIIS("smith")
Debug NYSIIS("smithe")
Debug NYSIIS("smyth")
Debug NYSIIS("smythe")
Debug NYSIIS("rosenblatt")
Debug NYSIIS("rozenblat")
Debug nysiis("josh")
Debug nysiis("joßh")
Debug nysiis("joshua")
Debug nysiis("joßhua")
Debug nysiis("jeff")
Debug nysiis("jeffry")
CompilerIf Defined(changeaccented,#PB_Module)
; if you want to customize...
changeaccented::changeMap('ß','B')
Debug NYSIIS("Grossman")
Debug nysiis("großman")
Debug nysiis("josh")
Debug nysiis("joßh")
CompilerEndIf
CompilerEndIf
Code: Select all
; changeAccentedMOD.pbi
DeclareModule changeAccented
Declare changeSingle( *asciibyte.ascii )
Declare changeAll( *string.string )
Declare changeMap( old.a, new.a )
EndDeclareModule
Module changeAccented
Threaded NewMap accentedChar.a()
Global Dim nonAccentedChar(0)
Global maxCharacters
Procedure loadmap()
; easy way of avoiding mutexes for threads...
Protected i, tmp.a, max.b
If MapSize( accentedChar() ) <> maxCharacters
ClearMap(accentedChar())
Restore accented
For i = 1 To 32
Read.a tmp
accentedChar(Str(tmp))=nonAccentedChar(i)
Next
EndIf
EndProcedure
Procedure changeMap( old.a, new.a)
If MapSize(accentedChar())=0:loadmap():EndIf
accentedChar(Str(old))=new
EndProcedure
Procedure changeSingle( *asciiByte.Ascii )
If MapSize(accentedChar())=0:loadmap():EndIf
If FindMapElement(accentedChar(),Str(*asciiByte\a))
*asciiByte\a=accentedChar(Str(*asciiByte\a))
EndIf
EndProcedure
Procedure changeAll( *string )
Protected *p
*p = *string+StringByteLength(PeekS(*string))-1
While *p > *string
changeSingle( *p )
*p-1;SizeOf(Character)
Wend
EndProcedure
Procedure init()
Restore characterCount
Read.b maxCharacters
If ArraySize( nonAccentedChar()) <> maxCharacters
Restore nonAccented
ReDim nonAccentedChar(maxCharacters)
For i = 1 To maxCharacters
Read.a nonAccentedChar(i)
Next
EndIf
EndProcedure
; NB: ß sounds like an "S", but looks like a "B" - up to you how to translate it.
DataSection
characterCount:
Data.b 63
accented:
Data.a 'À','Á','Â','Ã','Ä','Å','Æ','Ç', 'È','É','Ê','Ë','Ì','Í','Î','Ï'
Data.a 'Ð','Ñ','Ò','Ó','Ô','Õ','Ö','×', 'Ø','Ù','Ú','Û','Ü','Ý','Þ','ß' ;
Data.a 'à','á','â','ã','ä','å','æ','ç', 'è','é','ê','ë','ì','í','î','ï'
Data.a 'ð','ñ','ò','ó','ô','õ','ö','ø', 'ù','ú','û','ü','ý','þ','ÿ'
nonaccented:
Data.a 'A','A','A','A','A','A','A','C', 'E','E','E','E','I','I','I','I'
Data.a 'D','N','O','O','O','O','O','x', 'O','U','U','U','U','Y','b','S'
Data.a 'a','a','a','a','a','a','a','c', 'e','e','e','e','i','i','i','i'
Data.a 'o','n','o','o','o','o','o','o', 'u','u','u','u','y','b','y'
EndDataSection
EndModule
CompilerIf #PB_Compiler_IsMainFile
v.s= "JOßH"
For i = 0 To StringByteLength(v)-1
changeAccented::changeSingle(@v+i)
Next
Debug v
v.s= "JOßH"
changeAccented::changeAll(@v)
Debug v
changeAccented::changeMap('ß','B')
v.s= "JOßH"
changeAccented::changeAll(@v)
Debug v
CompilerEndIf
Code: Select all
EnableExplicit
; made as a module so i1 could xinclude it in another.
; Levenshtein Distance, there are several versions
; of the agorithm, this uses the "full matrix" version.
DeclareModule Levenshtein
Declare.f calculate( S1.s, S2.s )
EndDeclareModule
Module Levenshtein
EnableExplicit
Procedure min3(a, b, c)
If b < a : a = b : EndIf
If c < a : a = c : EndIf
ProcedureReturn a
EndProcedure
Procedure.f calculate(S1.s, S2.s)
Protected i1, i2, L1, L2, cost
Protected.s mS1, mS2
Protected.f distance = 0
L1 = Len(S1) : L2 = Len(S2)
If S1 = S2 : distance=100
ElseIf L1 <> 0 And L2 <> 0
Protected Dim matrix(L1, L2)
For i1 = 0 To L1 : matrix(i1, 0) = i1 : Next
For i2 = 0 To L2 : matrix(0, i2) = i2 : Next
For i1 = 1 To L1 : mS1 = Mid(S1, i1, 1)
For i2 = 1 To L2 : mS2 = Mid(S2, i2, 1)
cost = Bool( Not mS1 = mS2 )
matrix(i1, i2) = min3( matrix( i1-1, i2 ) + 1 ,
matrix( i1, i2-1 ) + 1 ,
matrix( i1-1, i2-1 ) + cost )
Next
Next
distance = ( ( L2 - matrix(L1, L2) ) / L2 ) * 100
EndIf
ProcedureReturn distance
EndProcedure
EndModule
CompilerIf #PB_Compiler_IsMainFile
Debug Levenshtein::calculate("smith", "smithe")
Debug Levenshtein::calculate("smith", "smythe")
Debug Levenshtein::calculate("smyth", "smythe")
Debug Levenshtein::calculate("Paul", "Pawl")
Debug Levenshtein::calculate("Trade", "Blade")
Debug Levenshtein::calculate("abcde", "cdeab")
Debug Levenshtein::calculate("The same","The same")
Debug Levenshtein::calculate("NOT The same","NOT the same")
CompilerEndIf