Unicode normalization
Posted: Thu Mar 05, 2015 3:08 am
I think PB needs built-in functions for Unicode normalization.
Without such functions, Unicode string searches, comparisons and sorting can yield wrong results.
"Unicode equivalence" in Wikipedia wrote:
For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
Code:
a$ = Chr($006E) + Chr($0303) ; "n" followed by the combining tilde
b$ = Chr($00F1)              ; precomposed "ñ"
Debug a$
Debug b$
If a$ = b$                   ; compares code points, not canonical equivalence
  Debug "equal"
Else
  Debug "not equal"
EndIf
Output:
ñ
ñ
not equal
If you don't get this output, see here.
If I understand the above-mentioned Wikipedia article correctly, then both strings should be considered equal.
If we had a NormalizeString() function, then we could write this, and it should show "equal":
Code:
a$ = Chr($006E) + Chr($0303)
b$ = Chr($00F1)
Debug a$
Debug b$
; NormalizeString() is the proposed function, not an existing PB command
If NormalizeString(a$) = NormalizeString(b$)
  Debug "equal"
Else
  Debug "not equal"
EndIf
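For comparison, here is a minimal sketch of the same check in Python, whose standard library already ships such a function (unicodedata.normalize). It is only meant to illustrate the behaviour I would expect from a PB NormalizeString(). The helper name canonically_equal is made up for this example, and picking NFC as the normalization form is an assumption (NFD would serve equally well for a comparison).

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC (composed form).

    Hypothetical helper for illustration; NFC is an arbitrary choice here.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

a = "\u006E\u0303"  # "n" followed by the combining tilde
b = "\u00F1"        # precomposed "ñ"

print(a == b)                   # False: the code point sequences differ
print(canonically_equal(a, b))  # True: both normalize to U+00F1
```

The plain comparison fails for the same reason as in the PB snippet above: it looks at code points only. After normalization both strings consist of the same single code point.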
srod has provided some code here for Windows.
However, we need this on all platforms, and IMHO it is so important that it should be built into PB.
Perhaps PB's built-in sorting functions should also, in Unicode mode, (optionally?) normalize strings internally before comparing them.
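To illustrate the sorting suggestion, here is a hedged sketch (again in Python, since it has normalization built in) of what "normalize internally before comparing" could look like. The function name sort_normalized and the sample words are made up for this example. It normalizes only the sort key, so the strings themselves are returned unchanged, and it still orders by code point rather than by any language-specific collation rules.

```python
import unicodedata

def sort_normalized(words):
    """Sort by NFC-normalized key; the original strings are left untouched.

    Hypothetical helper for illustration only.
    """
    return sorted(words, key=lambda w: unicodedata.normalize("NFC", w))

# "año" spelled precomposed and decomposed, plus one unrelated word
words = ["a\u00F1o", "apio", "an\u0303o"]

plain = sorted(words)               # compares raw code points
normalized = sort_normalized(words)

print([w.encode("unicode_escape").decode() for w in plain])
print([w.encode("unicode_escape").decode() for w in normalized])
```

With the plain sort, "apio" ends up between the two spellings of "año", because "n" (U+006E) sorts before "p" while "ñ" (U+00F1) sorts after it. With the normalized key, both spellings compare as equal and stay next to each other.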