Unicode normalization (Vista / Win 7)

srod · Post by **srod** » Fri Jan 08, 2010 11:06 pm

Hi,

I thought this could be of interest to anyone working with Unicode.

I encountered a problem earlier related to the fact that different sequences of Unicode code points can, linguistically, represent the same character.

For example, the strings Chr($00C4) and Chr($0041) + Chr($0308) (in Unicode) both render on screen as the same character (Ä), but they are encoded differently: the first is a single precomposed code point, the second is the letter A followed by a combining diaeresis. A comparison of these two strings yields #False which, in some circumstances, is undesirable.

To see this, run the following code (with the Unicode compiler option enabled):

Code:

a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

If a1$ = a2$
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #True!")
Else
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #False!")
EndIf

Now, under Vista / Win 7 there is a solution in the form of the "Normaliz" library (Normaliz.dll) and its NormalizeString() function. This function takes a Unicode string and normalizes it according to a specified normalization form (C, D, KC or KD). In the case of our two mismatched strings above, we can normalize them so that they share the same Unicode encoding, and a string comparison then returns #True. Normalization produces a single binary representation for all of the equivalent binary representations of a character. Once normalized to the same form, two strings are equivalent if and only if they have identical binary representations.

Try the following to see this in action (with the Unicode compiler option enabled; Vista / Win 7 only):

Code:

Enumeration 
  #NormalizationOther   = 0
  #NormalizationC       = 1
  #NormalizationD       = 2
  #NormalizationKC      = 5
  #NormalizationKD      = 6 
EndEnumeration

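; Load the NormalizeString() function from Normaliz.dll.
; The API returns a 32-bit int, so the prototype is declared as .l;
; with .i a negative error value may be misread on x64.
Prototype.l protNormalizeString(NormForm, lpSrcString, cwSrcLength, lpDstString, cwDstLength)
Global NormalizeString.protNormalizeString

If OpenLibrary(1, "Normaliz.dll") = 0
  MessageRequester("Unicode normalization...", "Could not load Normaliz.dll.")
  End
EndIf
NormalizeString = GetFunction(1, "NormalizeString")
If NormalizeString = 0
  MessageRequester("Unicode normalization...", "Could not load the NormalizeString() function.")
  CloseLibrary(1)
  End
EndIf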
  
a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

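; First call: a zero-length destination asks for an estimate of the required buffer size (in characters).
estimatedLength = NormalizeString(#NormalizationC, @a2$, -1, 0, 0)
If estimatedLength > 0
  newa2$ = Space(estimatedLength)
  ; Second call: write the normalized string into the buffer. A result of 0 or less means failure.
  If NormalizeString(#NormalizationC, @a2$, -1, @newa2$, estimatedLength) > 0
    If a1$ = newa2$
      MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #True!")
    Else
      MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #False!")
    EndIf
  Else
    MessageRequester("Unicode normalization...", "NormalizeString() failed.")
  EndIf
EndIf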
CloseLibrary(1)
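
For repeated use, the two-call pattern above could be wrapped in a procedure. This is a minimal sketch only: the Normalize() procedure is a hypothetical helper (not part of PureBasic or the Windows API), and it simply returns the input unchanged when NormalizeString() reports failure (a return value of 0 or less). It assumes the Unicode compiler option and Vista or later.

```purebasic
; Sketch only: Normalize() is a hypothetical helper written for this example.
EnableExplicit

#NormalizationC = 1
#NormalizationD = 2

; NormalizeString() returns a 32-bit int, hence Prototype.l.
Prototype.l protNormalizeString(NormForm, lpSrcString, cwSrcLength, lpDstString, cwDstLength)
Global NormalizeString.protNormalizeString

; Returns a normalized copy of text$, or text$ unchanged if the API call fails.
; Must only be called after NormalizeString has been loaded successfully.
Procedure.s Normalize(text$, form = #NormalizationC)
  Protected needed, result$
  ; First call: ask for an estimate of the buffer size (in characters).
  needed = NormalizeString(form, @text$, -1, 0, 0)
  If needed <= 0
    ProcedureReturn text$
  EndIf
  result$ = Space(needed)
  ; Second call: write the normalized, null-terminated string into the buffer.
  If NormalizeString(form, @text$, -1, @result$, needed) <= 0
    ProcedureReturn text$
  EndIf
  ProcedureReturn result$
EndProcedure

Define lib = OpenLibrary(#PB_Any, "Normaliz.dll")
If lib
  NormalizeString = GetFunction(lib, "NormalizeString")
  If NormalizeString
    Define a1$ = Chr($00C4)
    Define a2$ = Chr($0041) + Chr($0308)
    If Normalize(a1$) = Normalize(a2$)
      Debug "Equal after normalization to form C"
    Else
      Debug "Still different after normalization"
    EndIf
  EndIf
  CloseLibrary(lib)
Else
  Debug "Normaliz.dll is not available"
EndIf
```

Normalizing both sides of the comparison (rather than only a2$) means the caller does not need to know which string, if either, is already in form C.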

As I say, I thought this might be useful.

Arctic Fox · Post by **Arctic Fox** » Fri Jan 08, 2010 11:17 pm

Very interesting, srod! Thanks!

For other interested people, see this MSDN article http://msdn.microsoft.com/en-us/library ... 85%29.aspx

srod · Post by **srod** » Fri Jan 08, 2010 11:34 pm

Arctic Fox wrote: Very interesting, srod! Thanks!
For other interested people, see this MSDN article http://msdn.microsoft.com/en-us/library ... 85%29.aspx

Thanks Arctic.

I thought I had left that link in with the code - must have removed it whilst chopping and changing.

luis · Post by **luis** » Sat Jan 09, 2010 1:21 am

Interesting, I had no idea. Thanks.

srod · Post by **srod** » Sat Jan 09, 2010 12:46 pm

luis wrote: Interesting, I had no idea. Thanks.

Nor did I until I encountered a string comparison problem.

Google to the rescue once again!

Mistrel · Post by **Mistrel** » Sun Feb 07, 2010 12:58 am

I haven't started programming with Unicode yet but I've been doing a lot of research on it. The most complete solution I found is the ICU (International Components for Unicode) library (formerly IBM Classes for Unicode).

There is a really great description on Wikipedia on its origin and development:

http://en.wikipedia.org/wiki/Internatio ... or_Unicode

Rescator · Post by **Rescator** » Sun Feb 07, 2010 6:02 am

Very interesting, but if I were to do a string comparison I would actually want to differentiate Chr($00C4) and Chr($0041) + Chr($0308), for example, as one is the actual character and the other is a base letter plus a combining character. So I would lay the burden on the Unicode encoder instead, to create Chr($00C4) whenever Chr($00C4) or Chr($0041) + Chr($0308) is entered. I believe Unicode specifies that the shortest/simplest representation should be used? (At least it is so for UTF-8.)

I wonder if converting to UTF-8 first would simplify string comparisons?
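
One way to check the UTF-8 idea is to dump the UTF-8 bytes of both strings. The sketch below assumes the Unicode compiler option; Utf8Hex() is a hypothetical helper written for this example. UTF-8 encodes each code point separately, so the two spellings of Ä should still come out as different byte sequences, and normalization is still needed before comparing.

```purebasic
; Sketch only: Utf8Hex() is a hypothetical helper written for this example.
EnableExplicit

; Returns the UTF-8 bytes of text$ as space-separated hex pairs.
Procedure.s Utf8Hex(text$)
  Protected size = StringByteLength(text$, #PB_UTF8)  ; byte count without the terminating null
  Protected *buffer = AllocateMemory(size + 1)        ; +1 for the terminating null
  Protected i, out$
  If *buffer
    PokeS(*buffer, text$, -1, #PB_UTF8)
    For i = 0 To size - 1
      out$ + RSet(Hex(PeekA(*buffer + i)), 2, "0") + " "
    Next
    FreeMemory(*buffer)
  EndIf
  ProcedureReturn Trim(out$)
EndProcedure

Define a1$ = Chr($00C4)
Define a2$ = Chr($0041) + Chr($0308)

Debug Utf8Hex(a1$)   ; C3 84     (U+00C4)
Debug Utf8Hex(a2$)   ; 41 CC 88  (U+0041 followed by U+0308)
```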

djes · Post by **djes** » Sun Feb 07, 2010 2:07 pm

Wow, this is awful.

Thank you for sharing, srod.
