Unicode normalization

Little John
Addict
Posts: 4779
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Unicode normalization

Post by Little John »

I think PB needs built-in functions for Unicode normalization.
Without such functions, Unicode string searches, comparisons and sorting can yield wrong results.
"Unicode equivalence" in Wikipedia wrote:
For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

Code:

a$ = Chr($006E) + Chr($0303)
b$ = Chr($00F1)

Debug a$
Debug b$

If a$ = b$
   Debug "equal"
Else
   Debug "not equal"
EndIf
Output:
ñ
ñ
not equal
If you don't get this output, see here.

If I understand the above-mentioned Wikipedia article correctly, both strings should be considered equal.
If we had a NormalizeString() function, then we could write this:

Code:

a$ = Chr($006E) + Chr($0303)
b$ = Chr($00F1)

Debug a$
Debug b$

If NormalizeString(a$) = NormalizeString(b$)
   Debug "equal"
Else
   Debug "not equal"
EndIf
and this should show "equal".
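
Until something like that exists, the idea can at least be illustrated with a minimal sketch. ComposeKnownPairs() below is a hypothetical helper, not a PB function and not real normalization: it only composes the two base-plus-combining-mark pairs hard-coded in it, and it assumes the program is compiled in Unicode mode. A real implementation would need the full Unicode composition data.

```purebasic
; Hypothetical stand-in for the requested NormalizeString().
; Assumption: Unicode mode. Handles ONLY the two pairs below, not full NFC.
Procedure.s ComposeKnownPairs(s$)
  ; n + combining tilde (U+0303) -> precomposed U+00F1
  s$ = ReplaceString(s$, Chr($006E) + Chr($0303), Chr($00F1))
  ; e + combining acute (U+0301) -> precomposed U+00E9
  s$ = ReplaceString(s$, Chr($0065) + Chr($0301), Chr($00E9))
  ProcedureReturn s$
EndProcedure

a$ = Chr($006E) + Chr($0303)   ; decomposed form
b$ = Chr($00F1)                ; precomposed form

; Compare the composed forms instead of the raw strings
If ComposeKnownPairs(a$) = ComposeKnownPairs(b$)
   Debug "equal"
Else
   Debug "not equal"
EndIf
```

Both sides are passed through the same helper so that the comparison does not depend on which form either string arrived in. A hand-made table like this obviously does not scale, which is why a built-in function would be preferable.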

srod has provided some code here for Windows.
However, we need this on all platforms, and IMHO it is so important that it should be built into PB.

Perhaps, in Unicode mode, PB's built-in sorting functions should also (optionally?) normalize strings internally before comparing them.
idle
Always Here
Posts: 5844
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Unicode normalization

Post by idle »

Maybe Fred could take a look at ICU: http://site.icu-project.org/home
Windows 11, Manjaro, Raspberry Pi OS
IdeasVacuum
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Unicode normalization

Post by IdeasVacuum »

IMHO it is so important that it should be built into PB
+1
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Unicode normalization

Post by davido »

+1
DE AA EB
juror
Enthusiast
Posts: 228
Joined: Mon Jul 09, 2007 4:47 pm
Location: Courthouse

Re: Unicode normalization

Post by juror »

Especially since they're dropping ASCII support.

If you say you support (only) Unicode, you should (support Unicode).
Little John
Addict
Posts: 4779
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Unicode normalization

Post by Little John »
