Unicode normalization

Posted: Thu Mar 05, 2015 3:08 am
by Little John
I think PB needs built-in functions for Unicode normalization.
Without such functions, Unicode string searches, comparisons and sorting can yield wrong results.
"Unicode equivalence" in Wikipedia wrote: For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

Code:

a$ = Chr($006E) + Chr($0303)  ; "n" followed by the combining tilde (decomposed form)
b$ = Chr($00F1)               ; precomposed "ñ"

Debug a$
Debug b$

If a$ = b$
   Debug "equal"
Else
   Debug "not equal"
EndIf
Output wrote:
ñ
ñ
not equal
If you don't get this output, see here.

If I understand the above mentioned Wikipedia article correctly, then both strings should be considered equal.
If we had a NormalizeString() function, then we could write this:

Code:

a$ = Chr($006E) + Chr($0303)  ; "n" followed by the combining tilde (decomposed form)
b$ = Chr($00F1)               ; precomposed "ñ"

Debug a$
Debug b$

If NormalizeString(a$) = NormalizeString(b$)
   Debug "equal"
Else
   Debug "not equal"
EndIf
and this should show "equal".
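
For illustration only, here is roughly what such a comparison would do, sketched in Python because its standard library already ships `unicodedata.normalize`. The helper name `normalized_equal` is made up for this example, and picking NFC as the default form is an assumption; NFD would serve equally well for a pure equality test.

```python
import unicodedata

def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both into the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

a = "\u006E\u0303"   # "n" followed by the combining tilde
b = "\u00F1"         # precomposed "ñ"

print(a == b)                   # False: the raw code point sequences differ
print(normalized_equal(a, b))   # True: both normalize to the same sequence
```

A built-in PB function would presumably need a parameter for the normalization form as well, since NFC and NFD (and the compatibility forms NFKC and NFKD) give different results.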

srod has provided some code here for Windows.
However, we need this on all platforms, and IMHO it is so important that it should be built into PB.

Maybe, in Unicode mode, PB's built-in sorting functions should also (optionally?) normalize strings internally before comparing them.
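
To show why this matters for sorting, here is a small Python sketch (again only an illustration, not PB code). The key function `nfc_key` and the sample words are invented for the example. Note that this is still a plain code point sort, not language-aware collation; it only makes canonically equivalent strings compare as equal.

```python
import unicodedata

def nfc_key(s: str) -> str:
    """Sort key: the NFC-normalized form of the string."""
    return unicodedata.normalize("NFC", s)

names = [
    "pen\u0303a",   # "peña" spelled with n + combining tilde
    "pe\u00F1a",    # "peña" spelled with the precomposed letter
    "pez",
    "pena",
]

raw = sorted(names)                      # compares raw code points
normalized = sorted(names, key=nfc_key)  # compares normalized forms

# In the raw order, "pez" ends up between the two spellings of "peña".
print([ascii(s) for s in raw])
# With the normalized key, the two spellings sort next to each other.
print([ascii(s) for s in normalized])
```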

Re: Unicode normalization

Posted: Thu Mar 05, 2015 9:46 pm
by idle
Maybe Fred could take a look at ICU: http://site.icu-project.org/home

Re: Unicode normalization

Posted: Thu Mar 05, 2015 10:21 pm
by IdeasVacuum
Little John wrote: IMHO it is so important that it should be built into PB
+1

Re: Unicode normalization

Posted: Thu Mar 05, 2015 11:24 pm
by davido
+1

Re: Unicode normalization

Posted: Fri Mar 06, 2015 12:05 am
by juror
Especially since they're dropping ASCII support.

If you say you support (only) Unicode, you should (support Unicode).

Re: Unicode normalization

Posted: Wed Aug 12, 2015 3:57 pm
by Little John