Unicode normalization (Vista / Win 7)
Posted: Fri Jan 08, 2010 11:06 pm
Hi,
I thought this could be of interest to anyone working with Unicode.
I encountered a problem earlier related to the fact that different sequences of Unicode code points can, linguistically, represent the same character(s).
For example, the strings Chr($00C4) and Chr($0041) + Chr($0308) (in Unicode), when rendered on screen, produce the same character (Ä) but are clearly encoded differently. A string comparison of these two strings will yield a #False result which, in some circumstances, would be undesirable.
To see this, run the comparison listing below (enable the Unicode compiler option):
Now, under Vista / Win 7 there is a solution in the form of the "Normaliz" library (Normaliz.dll) and its NormalizeString() function. This function takes a Unicode string and 'normalizes' it according to a specified 'normalization form' (C, D, KC or KD). In the case of our two mismatched strings above, we can normalize them so that they both share the same Unicode encoding, and a string comparison then returns #True. The process of "normalization" produces one binary representation for any of the equivalent binary representations of a character. Once normalized to the same form, two strings are equivalent if and only if they have identical binary representations.
Try the NormalizeString() listing below to see this in action (enable the Unicode compiler option; Vista / Win 7 only):
As I say, I thought this might be useful. 
Code:
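; Precomposed form: the single code point U+00C4.
a1$ = Chr($00C4)
; Decomposed form: U+0041 (A) followed by U+0308 (combining diaeresis).
a2$ = Chr($0041) + Chr($0308)
; The = operator compares the stored characters, not what is rendered.
If a1$ = a2$
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #True!")
Else
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #False!")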
EndIf
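To make the mismatch visible, here is a minimal sketch that lists the character values stored in each string. CodeUnits() is a hypothetical helper written for this illustration, not a PureBasic built-in, and the sketch assumes the Unicode compiler option is enabled so that Asc() returns the full 16-bit value.

```purebasic
; Hypothetical helper (not a PureBasic built-in): returns the stored
; character values of text$ as space-separated hex, e.g. "U+0041 U+0308".
Procedure.s CodeUnits(text$)
  Protected result$, i
  For i = 1 To Len(text$)
    ; Asc() gives the numeric value of the character at position i.
    result$ + "U+" + RSet(Hex(Asc(Mid(text$, i, 1))), 4, "0") + " "
  Next
  ProcedureReturn Trim(result$)
EndProcedure

a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

; Assumes a Unicode build: a1$ holds one character, a2$ holds two.
Debug "a1$: " + CodeUnits(a1$) + " (Len = " + Str(Len(a1$)) + ")"
Debug "a2$: " + CodeUnits(a2$) + " (Len = " + Str(Len(a2$)) + ")"
```

Run with the debugger enabled, this should show one value for a1$ and two for a2$, which is why the plain comparison above fails even though both strings render as Ä.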
Code:
; Values of the Windows NORM_FORM enumeration.
Enumeration
  #NormalizationOther = 0
  #NormalizationC = 1
  #NormalizationD = 2
  #NormalizationKC = 5
  #NormalizationKD = 6
EndEnumeration
;Need to load the NormalizeString() function.
Prototype.i protNormalizeString(NormForm, lpSrcString, cwSrcLength, lpDstString, cwDstLength)
Global NormalizeString.protNormalizeString
If OpenLibrary(1, "Normaliz.dll") = 0
  MessageRequester("Unicode normalization...", "Could not load the NormalizeString() function.")
  End
EndIf
; Check the address before using it, so a missing export cannot crash the program.
functionAddress = GetFunction(1, "NormalizeString")
If functionAddress = 0
  MessageRequester("Unicode normalization...", "Could not load the NormalizeString() function.")
  CloseLibrary(1)
  End
EndIf
NormalizeString = functionAddress
a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)
; A first call with no destination buffer returns the estimated length needed (<= 0 on failure).
estimatedLength = NormalizeString(#NormalizationC, @a2$, -1, 0, 0)
If estimatedLength > 0
  newa2$ = Space(estimatedLength)
  ; The second call writes the normalized (form C), null-terminated string into newa2$.
  NormalizeString(#NormalizationC, @a2$, -1, @newa2$, estimatedLength)
  If a1$ = newa2$
    MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #True!")
  Else
    MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #False!")
  EndIf
EndIf
CloseLibrary(1)
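If you need this in more than one place, the two-call pattern above could be wrapped in a small procedure. The following is only a sketch, meant to be run as a separate program: NormalizeText() is a hypothetical name of my own, it returns an empty string on failure, and it assumes a non-empty input string and a Unicode build. It also shows the opposite direction, using form D to decompose a1$ so that it matches a2$ as stored.

```purebasic
#NormalizationC = 1
#NormalizationD = 2

Prototype.i protNormalizeString(NormForm, lpSrcString, cwSrcLength, lpDstString, cwDstLength)
Global NormalizeString.protNormalizeString

; Hypothetical helper (not part of PureBasic or Normaliz.dll): returns text$
; normalized to the given form, or "" if NormalizeString() fails.
Procedure.s NormalizeText(text$, form)
  Protected length, result$
  ; First call: ask for the estimated buffer length only.
  length = NormalizeString(form, @text$, -1, 0, 0)
  If length > 0
    result$ = Space(length)
    ; Second call: write the normalized, null-terminated string.
    If NormalizeString(form, @text$, -1, @result$, length) <= 0
      result$ = ""
    EndIf
  EndIf
  ProcedureReturn result$
EndProcedure

If OpenLibrary(0, "Normaliz.dll")
  functionAddress = GetFunction(0, "NormalizeString")
  If functionAddress
    NormalizeString = functionAddress
    a1$ = Chr($00C4)
    a2$ = Chr($0041) + Chr($0308)
    ; Form C composes both strings to the same sequence.
    If NormalizeText(a1$, #NormalizationC) = NormalizeText(a2$, #NormalizationC)
      Debug "Form C: strings match"
    EndIf
    ; Form D decomposes a1$, so it should equal a2$ unchanged.
    If NormalizeText(a1$, #NormalizationD) = a2$
      Debug "Form D: strings match"
    EndIf
  EndIf
  CloseLibrary(0)
EndIf
```

Normalizing both sides to the same form before comparing is the safer habit, since you usually do not know which form incoming text is in.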
