Unicode normalization (Vista / Win 7)

Posted: Fri Jan 08, 2010 11:06 pm
by srod
Hi,

I thought this could be of interest to anyone working with Unicode.

I ran into a problem earlier caused by the fact that different sequences of Unicode code points can, linguistically, represent the same character(s).

For example, the strings Chr($00C4) (the precomposed character) and Chr($0041) + Chr($0308) (the letter A followed by a combining diaeresis) render on screen as the same character (Ä) but are encoded differently. Comparing the two strings yields #False which, in some circumstances, is undesirable.

To see this, run the following code (with the Unicode compiler option enabled):

Code:

a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

If a1$ = a2$
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #True!")
Else
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #False!")
EndIf
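
To make the difference between the two encodings visible, the code units of each string can be listed. This is only a minimal sketch: CodeUnits() is a hypothetical helper name (not a PureBasic built-in), and it assumes the Unicode compiler option is enabled.

```purebasic
; Hypothetical helper: lists the code units of a string as hex values.
; Assumes the Unicode compiler option is enabled.
Procedure.s CodeUnits(text$)
  Protected i, result$
  For i = 1 To Len(text$)
    ; Asc() returns the numeric value of the character at position i.
    result$ + "$" + RSet(Hex(Asc(Mid(text$, i, 1))), 4, "0") + " "
  Next
  ProcedureReturn RTrim(result$)
EndProcedure

a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

Debug "a1$: " + CodeUnits(a1$) + " (length " + Str(Len(a1$)) + ")"
Debug "a2$: " + CodeUnits(a2$) + " (length " + Str(Len(a2$)) + ")"
```

The debug output should show a single code unit for a1$ and two for a2$, which is why the plain comparison fails even though both strings render as Ä.
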

Now, under Vista / Win 7 there is a solution in the form of the NormalizeString() function in the "Normaliz" library (Normaliz.dll). This function takes a Unicode string and normalizes it according to a specified normalization form. In the case of our two mismatched strings above, we can quickly arrange for them to be normalized so that they share the same encoding (and thus a string comparison returns #True).

Normalization produces one binary representation for any of the equivalent binary representations of a character. Once normalized, two strings are equivalent if and only if they have identical binary representations.

Try the following to see this in action (with the Unicode compiler option enabled; Vista / Win 7 only):

Code:

Enumeration 
  #NormalizationOther   = 0
  #NormalizationC       = 1
  #NormalizationD       = 2
  #NormalizationKC      = 5
  #NormalizationKD      = 6 
EndEnumeration

; Load the NormalizeString() function from Normaliz.dll.
Prototype.i protNormalizeString(NormForm, lpSrcString, cwSrcLength, lpDstString, cwDstLength)
Global NormalizeString.protNormalizeString

If OpenLibrary(1, "Normaliz.dll")
  NormalizeString = GetFunction(1, "NormalizeString")
EndIf
; Covers both a missing library and a missing function.
If NormalizeString = 0
  MessageRequester("Unicode normalization...", "Could not load the NormalizeString() function.")
  End
EndIf
  
a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

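; With no destination buffer, NormalizeString() returns an estimate of the required
; buffer length in characters. Because the source length is -1 (null-terminated),
; the estimate includes the terminating null.
estimatedLength = NormalizeString(#NormalizationC, @a2$, -1, 0, 0)
If estimatedLength > 0
  newa2$ = Space(estimatedLength)
  ; A return value of zero or less signals failure.
  If NormalizeString(#NormalizationC, @a2$, -1, @newa2$, estimatedLength) > 0
    If a1$ = newa2$
      MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #True!")
    Else
      MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #False!")
    EndIf
  Else
    MessageRequester("Unicode normalization...", "NormalizeString() failed.")
  EndIf
EndIf
CloseLibrary(1)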
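
The example above normalizes only a2$, which works because a1$ already happens to be in form C. For arbitrary input it is probably safer to normalize both strings to the same form before comparing. The following is a rough sketch of that idea, not a definitive implementation: Normalize() is a hypothetical wrapper name, it falls back to the unmodified string on any failure, and it assumes the Unicode compiler option and Vista / Win 7.

```purebasic
; Sketch only: Normalize() is a hypothetical helper, not part of PureBasic or the Windows API.
#NormalizationC = 1
#NormalizationD = 2

Prototype.i protNormalizeString(NormForm, lpSrcString, cwSrcLength, lpDstString, cwDstLength)
Global NormalizeString.protNormalizeString

Procedure.s Normalize(text$, form)
  Protected estimatedLength, result$
  ; Nothing to do if the function was not loaded or the string is empty.
  If NormalizeString = 0 Or text$ = ""
    ProcedureReturn text$
  EndIf
  estimatedLength = NormalizeString(form, @text$, -1, 0, 0)
  If estimatedLength > 0
    result$ = Space(estimatedLength)
    If NormalizeString(form, @text$, -1, @result$, estimatedLength) > 0
      ProcedureReturn result$
    EndIf
  EndIf
  ProcedureReturn text$ ; assumption: fall back to the unmodified string on failure
EndProcedure

If OpenLibrary(1, "Normaliz.dll")
  NormalizeString = GetFunction(1, "NormalizeString")
  a1$ = Chr($00C4)
  a2$ = Chr($0041) + Chr($0308)
  ; Both sides go through the same normalization form before the comparison.
  If Normalize(a1$, #NormalizationD) = Normalize(a2$, #NormalizationD)
    Debug "Equal after normalizing both strings to form D"
  Else
    Debug "Still different after normalization"
  EndIf
  CloseLibrary(1)
EndIf
```

Which form is used matters less than using the same one on both sides; form C gives the shorter, precomposed result and form D the decomposed one.
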
As I say, I thought this might be useful. :)

Re: Unicode normalization (Vista / Win 7)

Posted: Fri Jan 08, 2010 11:17 pm
by Arctic Fox
Very interesting, srod! Thanks! :D
For other interested people, see this MSDN article http://msdn.microsoft.com/en-us/library ... 85%29.aspx

Re: Unicode normalization (Vista / Win 7)

Posted: Fri Jan 08, 2010 11:34 pm
by srod
Arctic Fox wrote: Very interesting, srod! Thanks! :D
For other interested people, see this MSDN article http://msdn.microsoft.com/en-us/library ... 85%29.aspx
Thanks Arctic.

I thought I had left that link in with the code - must have removed it whilst chopping and changing. :)

Re: Unicode normalization (Vista / Win 7)

Posted: Sat Jan 09, 2010 1:21 am
by luis
Interesting, I had no idea. Thanks.

Re: Unicode normalization (Vista / Win 7)

Posted: Sat Jan 09, 2010 12:46 pm
by srod
luis wrote: Interesting, I had no idea. Thanks.
Nor did I until I encountered a string comparison problem. :) Google to the rescue once again! ;)

Re: Unicode normalization (Vista / Win 7)

Posted: Sun Feb 07, 2010 12:58 am
by Mistrel
I haven't started programming with Unicode yet but I've been doing a lot of research on it. The most complete solution I found is the ICU (International Components for Unicode) library (formerly IBM Classes for Unicode).

There is a really great description on Wikipedia on its origin and development:

http://en.wikipedia.org/wiki/Internatio ... or_Unicode

Re: Unicode normalization (Vista / Win 7)

Posted: Sun Feb 07, 2010 6:02 am
by Rescator
Very interesting, but if I were to do a string comparison I would actually want to differentiate Chr($00C4) and Chr($0041) + Chr($0308), for example, as one is the actual character and the other is a combining character sequence. I would instead lay the burden on the Unicode encoder to create Chr($00C4) whether Chr($00C4) or Chr($0041) + Chr($0308) is entered. I believe Unicode specifies that the shortest/simplest representation should be used? (At least it is so for UTF-8.)

I wonder if converting to UTF-8 first would simplify string comparisons?
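
On the UTF-8 question: UTF-8 encodes each code point separately, so converting does not merge A + combining diaeresis into Ä. The precomposed form becomes the bytes C3 84, while the decomposed form becomes 41 CC 88. A small sketch to check this, where Utf8Bytes() is a hypothetical helper name and the Unicode compiler option is assumed:

```purebasic
; Hypothetical helper: returns the UTF-8 bytes of a string as hex values.
Procedure.s Utf8Bytes(text$)
  Protected size = StringByteLength(text$, #PB_UTF8) ; byte count without the terminating null
  Protected *buffer = AllocateMemory(size + 1)       ; one extra byte for the null PokeS() writes
  Protected i, result$
  If *buffer
    PokeS(*buffer, text$, -1, #PB_UTF8)
    For i = 0 To size - 1
      result$ + RSet(Hex(PeekA(*buffer + i)), 2, "0") + " "
    Next
    FreeMemory(*buffer)
  EndIf
  ProcedureReturn RTrim(result$)
EndProcedure

Debug Utf8Bytes(Chr($00C4))
Debug Utf8Bytes(Chr($0041) + Chr($0308))
```

So a byte-wise comparison of the UTF-8 data would still report the two strings as different; normalization is needed whichever encoding is used.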

Re: Unicode normalization (Vista / Win 7)

Posted: Sun Feb 07, 2010 2:07 pm
by djes
Wow, this is awful :x
Thank you for sharing, srod.