Unicode normalization (Vista / Win 7)

srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Hi,

I thought this could be of interest to anyone working with Unicode.

I ran into a problem earlier caused by the fact that different sequences of Unicode code points can represent what is, linguistically, the same character.

For example, the strings Chr($00C4) (the precomposed character U+00C4) and Chr($0041) + Chr($0308) (the letter A followed by the combining diaeresis U+0308) both render on screen as Ä, yet they are encoded differently. A direct string comparison of the two therefore yields #False, which in some circumstances is undesirable.

To see this, run the following code (enable the Unicode compiler option):

Code:

a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

If a1$ = a2$
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #True!")
Else
  MessageRequester("", "a1$ = " + a1$ + ", a2$ = " + a2$ + #LF$ + #LF$ + "a1$ = a2$ returns #False!")
EndIf

Now, under Vista / Win 7 there is a solution in the form of the Normaliz.dll library and its NormalizeString() function. This function takes a Unicode string and normalizes it according to a specified normalization form. For our two mismatched strings above, normalizing to a common form gives them the same encoding, so a string comparison returns #True. Normalization produces one binary representation for all of the equivalent binary representations of a character; once normalized, two strings are equivalent if and only if they have identical binary representations.

Try the following to see this in action (enable the Unicode compiler option; Vista / Win 7 only):

Code:

; Normalization forms (NORM_FORM values from the Windows API).
Enumeration
  #NormalizationOther   = 0
  #NormalizationC       = 1
  #NormalizationD       = 2
  #NormalizationKC      = 5
  #NormalizationKD      = 6
EndEnumeration

; NormalizeString() returns a 32-bit int. Declaring the prototype as .l keeps
; negative error results negative in 64-bit builds.
Prototype.l protNormalizeString(NormForm.l, *lpSrcString, cwSrcLength.l, *lpDstString, cwDstLength.l)
Global NormalizeString.protNormalizeString

; Load the NormalizeString() function from Normaliz.dll.
If OpenLibrary(1, "Normaliz.dll") = 0
  MessageRequester("Unicode normalization...", "Could not load Normaliz.dll.")
  End
EndIf
NormalizeString = GetFunction(1, "NormalizeString")
If NormalizeString = 0
  MessageRequester("Unicode normalization...", "Could not load the NormalizeString() function.")
  CloseLibrary(1)
  End
EndIf

a1$ = Chr($00C4)
a2$ = Chr($0041) + Chr($0308)

; First call: no destination buffer, so the function returns the estimated buffer length.
estimatedLength = NormalizeString(#NormalizationC, @a2$, -1, 0, 0)
If estimatedLength > 0
  newa2$ = Space(estimatedLength)
  ; Second call: normalize into the buffer. A result <= 0 indicates an error.
  If NormalizeString(#NormalizationC, @a2$, -1, @newa2$, estimatedLength) > 0
    If a1$ = newa2$
      MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #True!")
    Else
      MessageRequester("Unicode normalization...", "a1$ = " + a1$ + ", newa2$ (normalized version of a2$) = " + newa2$ + #LF$ + #LF$ + "a1$ = newa2$ returns #False!")
    EndIf
  EndIf
EndIf
CloseLibrary(1)

As I say, I thought this might be useful. :)
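
For repeated use, the two-call pattern (ask for the estimated length, then normalize into a buffer) could be wrapped in a small helper. What follows is only a standalone sketch under a few assumptions: NormalizedCopy() is a hypothetical name that does not appear in the code above, Normaliz.dll is assumed to be available (Vista or later), and the helper simply falls back to the unmodified input if anything fails.

```purebasic
; Sketch only: NormalizedCopy() is a hypothetical helper, not an existing PureBasic or Windows function.
#NormalizationC = 1   ; NORM_FORM value for normalization form C

Prototype.l protNormalizeString(NormForm.l, *lpSrcString, cwSrcLength.l, *lpDstString, cwDstLength.l)
Global NormalizeString.protNormalizeString

; Returns a normalized copy of text$, or text$ unchanged if normalization is unavailable or fails.
Procedure.s NormalizedCopy(text$, form.l = #NormalizationC)
  Protected result$ = text$
  Protected buffer$, length.l

  If NormalizeString
    ; First call: no destination buffer, so the estimated length is returned.
    length = NormalizeString(form, @text$, -1, 0, 0)
    If length > 0
      buffer$ = Space(length)
      ; Second call: write the normalized text into the buffer.
      If NormalizeString(form, @text$, -1, @buffer$, length) > 0
        result$ = buffer$
      EndIf
    EndIf
  EndIf
  ProcedureReturn result$
EndProcedure

If OpenLibrary(1, "Normaliz.dll")
  NormalizeString = GetFunction(1, "NormalizeString")
  ; Normalize both sides so the comparison does not depend on how either string was entered.
  If NormalizedCopy(Chr($00C4)) = NormalizedCopy(Chr($0041) + Chr($0308))
    Debug "Equal after normalization"
  Else
    Debug "Still different"
  EndIf
  CloseLibrary(1)
EndIf
```

Normalizing both operands, rather than only the one known to be decomposed, is a design choice for the general case where neither string's form is known in advance.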
I may look like a mule, but I'm not a complete ass.
Arctic Fox
Enthusiast
Posts: 609
Joined: Sun Dec 21, 2008 5:02 pm
Location: Aarhus, Denmark

Re: Unicode normalization (Vista / Win 7)

Post by Arctic Fox »

Very interesting, srod! Thanks! :D
For other interested people, see this MSDN article http://msdn.microsoft.com/en-us/library ... 85%29.aspx

Re: Unicode normalization (Vista / Win 7)

Post by srod »

Arctic Fox wrote: Very interesting, srod! Thanks! :D
For other interested people, see this MSDN article http://msdn.microsoft.com/en-us/library ... 85%29.aspx

Thanks Arctic.

I thought I had left that link in with the code - must have removed it whilst chopping and changing. :)
luis
Addict
Posts: 3895
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Re: Unicode normalization (Vista / Win 7)

Post by luis »

Interesting, I had no idea. Thanks.
"Have you tried turning it off and on again ?"
A little PureBasic review

Re: Unicode normalization (Vista / Win 7)

Post by srod »

luis wrote: Interesting, I had no idea. Thanks.

Nor did I until I encountered a string comparison problem. :) Google to the rescue once again! :wink:
Mistrel
Addict
Posts: 3415
Joined: Sat Jun 30, 2007 8:04 pm

Re: Unicode normalization (Vista / Win 7)

Post by Mistrel »

I haven't started programming with Unicode yet but I've been doing a lot of research on it. The most complete solution I found is the ICU (International Components for Unicode) library (formerly IBM Classes for Unicode).

There is a really great description on Wikipedia of its origin and development:

http://en.wikipedia.org/wiki/Internatio ... or_Unicode
Rescator
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: Unicode normalization (Vista / Win 7)

Post by Rescator »

Very interesting, but if I were to do a string comparison I would actually want to differentiate Chr($00C4) from Chr($0041) + Chr($0308), as one is the actual character and the other is a combining sequence. I would instead lay the burden on the Unicode encoder to always produce Chr($00C4), whether Chr($00C4) or Chr($0041) + Chr($0308) is entered. I believe Unicode specifies that the shortest/simplest representation should be used? (At least it is so for UTF-8.)

I wonder if converting to UTF-8 first would simplify string comparisons?
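
On the UTF-8 question, a quick check is possible. UTF-8 is just another encoding of the same code points, and its shortest-form rule applies to the bytes of each individual code point, not to the choice between a precomposed character and a combining sequence. The two strings should therefore still differ after conversion. A minimal sketch to confirm this, assuming a Unicode build (Utf8Hex() is a hypothetical helper name, not a built-in function):

```purebasic
; Sketch only: Utf8Hex() is a hypothetical helper that returns the UTF-8 bytes of a string as hex.
Procedure.s Utf8Hex(text$)
  Protected *buffer, size, i, out$

  size = StringByteLength(text$, #PB_UTF8)     ; byte count without the terminating null
  If size > 0
    *buffer = AllocateMemory(size + 1)         ; +1 for the null that PokeS() writes
    If *buffer
      PokeS(*buffer, text$, -1, #PB_UTF8)
      For i = 0 To size - 1
        out$ = out$ + RSet(Hex(PeekA(*buffer + i)), 2, "0") + " "
      Next
      FreeMemory(*buffer)
    EndIf
  EndIf
  ProcedureReturn Trim(out$)
EndProcedure

Debug Utf8Hex(Chr($00C4))                ; C3 84
Debug Utf8Hex(Chr($0041) + Chr($0308))   ; 41 CC 88
```

Since the byte sequences differ, a comparison of the UTF-8 forms would still fail unless the strings are normalized first.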
djes
Addict
Posts: 1806
Joined: Sat Feb 19, 2005 2:46 pm
Location: Pas-de-Calais, France

Re: Unicode normalization (Vista / Win 7)

Post by djes »

Wow, this is awful :x
Thank you for sharing, srod.