Upper and Lower Case Mapping for Unicode

Sicro
Enthusiast
Posts: 538
Joined: Wed Jun 25, 2014 5:25 pm
Location: Germany

Re: Upper and Lower Case Mapping for Unicode

Post by Sicro »

idle wrote: Mon Nov 14, 2022 3:57 am I would need to implement
https://www.unicode.org/Public/UCD/late ... olding.txt
Yes. Maybe it should be a separate module, because the code posted in this thread completely fulfills the thread topic, at least for the simple variant of case mapping. Comparing strings in a case-insensitive manner is a different topic.
idle wrote: Mon Nov 14, 2022 3:57 am the goal for example will say that "MASSE" and "Maße" are equal.
This is full case folding, where a mapping can change the number of characters. In simple case folding, "ss/Ss/sS/SS" and "ß" are different, because simple case folding only supports mappings that keep the number of characters the same.
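
As a small illustration of the "MASSE"/"Maße" example, here is a sketch in Python rather than PureBasic, chosen only because Python's built-in `str.casefold()` already applies full case folding. The variable names are made up for this example, and `str.lower()` is used merely as a stand-in for a one-to-one mapping; it is a case mapping, not simple case folding.

```python
# Full case folding can change the string length: "ß" folds to "ss".
a = "MASSE"
b = "Maße"

full_a = a.casefold()  # str.casefold() applies full case folding
full_b = b.casefold()

print(full_a == full_b)     # True: both fold to "masse"
print(len(b), len(full_b))  # 4 5: the folded string is one character longer

# A one-to-one mapping cannot turn "ß" into "ss". str.lower() is used here
# only as a stand-in for such a mapping and leaves "ß" unchanged.
print(a.lower() == b.lower())  # False: "masse" vs. "maße"
```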

With full case folding and full case mapping, the topic becomes even more complicated, because some Unicode characters can be represented by several different code point sequences (for example, a precomposed letter versus a base letter plus combining marks). These sequences have to be brought into a normalized form before case mapping or case folding is applied. There are several normalization forms for this, and normalization sometimes even has to be run multiple times.
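
To make the normalization problem concrete, here is a minimal Python sketch (again Python only because `unicodedata` ships with it). It assumes the "normalize, fold, normalize again" order that the Unicode Standard describes for canonical caseless matching; the helper name `canonical_caseless_key` is hypothetical.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Hypothetical helper: NFD, then full case folding, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    folded = decomposed.casefold()
    return unicodedata.normalize("NFD", folded)


precomposed = "\u00C9t\u00E9"  # "Été" written with precomposed letters
combining = "E\u0301TE\u0301"  # "ÉTÉ" written as base letters + U+0301

# Case folding alone does not unify the two spellings.
print(precomposed.casefold() == combining.casefold())  # False

# Normalizing before and after folding does.
print(canonical_caseless_key(precomposed) == canonical_caseless_key(combining))  # True
```

The second normalization pass matters because folding can itself produce sequences that are no longer in normalized form.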

To better understand the complexity, I recommend this documentation:
https://www.w3.org/TR/charmod-norm/
Why OpenSource should have a license :: PB-CodeArchiv-Rebirth :: Pleasant-Dark (syntax color scheme) :: RegEx-Engine (compiles RegExes to NFA/DFA)
Manjaro Xfce x64 (Main system) :: Windows 10 Home (VirtualBox) :: Newest PureBasic version
idle
Always Here
Posts: 5040
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Upper and Lower Case Mapping for Unicode

Post by idle »

Sicro wrote: Mon Nov 14, 2022 11:29 pm
idle wrote: Mon Nov 14, 2022 3:57 am I would need to implement
https://www.unicode.org/Public/UCD/late ... olding.txt
Yes. Maybe it should be a separate module, because the code posted in this thread completely fulfills the thread topic, at least for the simple variant of case mapping. Comparing strings in a case-insensitive manner is a different topic.
idle wrote: Mon Nov 14, 2022 3:57 am the goal for example will say that "MASSE" and "Maße" are equal.
This is full case folding, where a mapping can change the number of characters. In simple case folding, "ss/Ss/sS/SS" and "ß" are different, because simple case folding only supports mappings that keep the number of characters the same.

With full case folding and full case mapping, the topic becomes even more complicated, because some Unicode characters can be represented by several different code point sequences (for example, a precomposed letter versus a base letter plus combining marks). These sequences have to be brought into a normalized form before case mapping or case folding is applied. There are several normalization forms for this, and normalization sometimes even has to be run multiple times.

To better understand the complexity, I recommend this documentation:
https://www.w3.org/TR/charmod-norm/
I think this works as intended to perform a full case-folding string comparison; it's case-insensitive.
https://www.unicode.org/Public/UCD/late ... olding.txt
I will leave it here for picking apart, and if it's correct I will post it in its own thread. I don't have any need for it, but it might be useful to those following this topic.
https://dnscope.io/idlefiles/casefold.pb
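
For readers who want to see the idea without opening the file, here is a rough Python sketch of a comparison driven by CaseFolding.txt. It is not the code from casefold.pb. The sample lines follow the `code; status; mapping; # name` layout of CaseFolding.txt, where status C and F together give full folding and C and S give simple folding. All function and variable names are hypothetical, and no normalization is done.

```python
# A few lines in the CaseFolding.txt layout: code; status; mapping; # name
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0045; C; 0065; # LATIN CAPITAL LETTER E
004D; C; 006D; # LATIN CAPITAL LETTER M
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def load_folding(data, statuses):
    """Build a {character: folded string} table for the given status codes."""
    table = {}
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


FULL = load_folding(SAMPLE, {"C", "F"})    # full folding: C + F entries
SIMPLE = load_folding(SAMPLE, {"C", "S"})  # simple folding: C + S entries


def fold(text, table):
    """Fold each character; characters without an entry stay unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def compare_folded(a, b, table):
    """Return -1, 0 or 1, like a classic string compare, after folding."""
    fa, fb = fold(a, table), fold(b, table)
    return (fa > fb) - (fa < fb)


print(compare_folded("MASSE", "Maße", FULL))    # 0: equal under full folding
print(compare_folded("MASSE", "Maße", SIMPLE))  # -1: "masse" sorts before "maße"
```

A real implementation would load the complete CaseFolding.txt instead of the handful of sample lines used here.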

Re: Upper and Lower Case Mapping for Unicode

Post by Sicro »

idle wrote: Tue Nov 15, 2022 12:38 am I think this works as intended to perform a full case-folding string comparison; it's case-insensitive.
https://www.unicode.org/Public/UCD/late ... olding.txt
I will leave it here for picking apart, and if it's correct I will post it in its own thread. I don't have any need for it, but it might be useful to those following this topic.
https://dnscope.io/idlefiles/casefold.pb
Yes, case folding is usually applied when comparing two strings. In this thread, you only included a simple variant in your code, and that's totally OK, because this thread is actually about case mapping, and case folding is a different topic. So it's all good. I just wanted to briefly mention how it's usually done. I've sent you a PM so it doesn't get too off-topic here.