Unicode string sort

collectordave
Addict
Posts: 1309
Joined: Fri Aug 28, 2015 6:10 pm
Location: Portugal

Unicode string sort

Post by collectordave »

Hi All

Just (hopefully) a small problem.

I have a database (SQLite) that returns several lists of words in UTF-8. SQLite, it seems, will not sort Unicode text correctly. Is there a Unicode sort routine anywhere on the forum? If not, can someone point me in the right direction?
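
For context, here is a minimal sketch of the behaviour described, written in Python since the thread contains no code of its own. SQLite's default BINARY collation compares the raw bytes of the text, so upper-case and accented letters land in unexpected places, but an application can register its own comparison function. The table name, the `folded` collation name and the `fold` helper are made up for illustration, and stripping accents is only a rough stand-in for a real language-aware comparison.

```python
import sqlite3
import unicodedata


def fold(text):
    """Illustrative helper: drop accents and case so 'é' compares like 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def compare_folded(a, b):
    """Collation callback: negative, zero or positive, like strcmp."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("folded", compare_folded)  # hypothetical collation name
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("zebra",), ("\u00e9cole",), ("apple",), ("Zoo",)],  # \u00e9cole is 'école'
)

# Default ordering: plain byte comparison of the UTF-8 text.
binary_order = [row[0] for row in conn.execute(
    "SELECT word FROM words ORDER BY word")]

# Same query using the custom collation registered above.
folded_order = [row[0] for row in conn.execute(
    "SELECT word FROM words ORDER BY word COLLATE folded")]

print(binary_order)  # Zoo, apple, zebra, école
print(folded_order)  # apple, école, zebra, Zoo
```

Whether the host language exposes the same hook depends on its SQLite binding, so treat this as a demonstration of the idea rather than a drop-in fix.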

Kind Regards

collectordave
Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage to move in the opposite direction.

Re: Unicode string sort

Post by collectordave »

Just playing around and thought of one way of doing it.

1. Read all the UTF-8 characters into a database table, then add an ASCII replacement as a second field (in my case, all lower-case letters).

2. When running my programme, load a two-dimensional array with all the characters and their replacements.

3. Load my actual strings from the database into a structured array with an index number.

4. Replace all Unicode characters in my strings with the defined ASCII replacements.

5. Sort the structured array and renumber the index numbers.

6. Use this renumbered index as the sort order for that particular language when loading the list of words.

Working on it now, so if anyone sees any problems please post here.
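
The six steps above could be sketched roughly as follows. This is Python rather than the poster's own language, and the replacement table, function names and word list are all invented for illustration; in the scheme described, the table and the words would come from the database.

```python
# Hypothetical per-language replacement table (steps 1 and 2).
SPANISH_REPLACEMENTS = {
    "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ü": "u", "ñ": "n",
}


def ascii_key(word, table):
    """Step 4: swap each character for its ASCII replacement."""
    return "".join(table.get(ch, ch) for ch in word.lower())


def build_sort_order(words, table):
    """Steps 3, 5 and 6: sort on the replaced text and number the result."""
    ordered = sorted(words, key=lambda w: ascii_key(w, table))
    return {word: index for index, word in enumerate(ordered)}


words = ["zorro", "árbol", "nube", "casa", "ñandú"]
order = build_sort_order(words, SPANISH_REPLACEMENTS)

# The stored index can now drive the display order for this language.
for word in sorted(words, key=order.get):
    print(order[word], word)
```

One thing the sketch makes visible: with ñ replaced by n, "ñandú" sorts ahead of "nube", whereas Spanish treats ñ as a separate letter after n. There is no ASCII character between n and o, so a single replacement character cannot put ñ in that position.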

Regards

CD
spikey
Enthusiast
Posts: 586
Joined: Wed Sep 22, 2010 1:17 pm
Location: United Kingdom

Re: Unicode string sort

Post by spikey »

It depends what you are trying to achieve.

If you are looking for a sort which offers more granular control than the options of the built-in sort functions, then it's OK.

If you are trying to achieve an output which would be lexicographically correct for languages which aren't English, then it will be wrong somewhere along the line for all languages other than English! Sorry!

This article explains the problem: https://en.wikipedia.org/wiki/Alphabeti ... onventions
JHPJHP
Addict
Posts: 2129
Joined: Sat Oct 09, 2010 3:47 am

Re: Unicode string sort

Post by JHPJHP »

Hi collectordave,

See the following explanation/solution: Collation (search based on hints provided by spikey)
- Unicode Collation Algorithm (UCA) comes with a default weight table

Q: My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary.
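
To make the "multi-level weights" idea concrete, here is a toy sketch in Python. It is not the Unicode Collation Algorithm and does not use its default weight table: the three levels (base letter, then accent, then case) are a simplification, and `toy_collation_key` is a made-up name. It only shows why comparing level by level gives a different answer from comparing code points.

```python
import unicodedata


def toy_collation_key(word):
    """Build a three-level key: base letters, then accents, then case."""
    primary = []    # base letters with case removed
    secondary = []  # combining accents attached to each base letter
    tertiary = []   # 0 for lower case, 1 for upper case
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += ch  # attach the accent to the previous letter
            continue
        primary.append(ch.casefold())
        secondary.append("")
        tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare element by element, so accents only matter when the
    # base letters are equal, and case only when the accents are equal too.
    return (primary, secondary, tertiary)


words = ["roles", "rôle", "Role", "role"]

by_code_point = sorted(words)
by_levels = sorted(words, key=toy_collation_key)

print(by_code_point)  # Role, role, roles, rôle
print(by_levels)      # role, Role, rôle, roles
```

A real implementation would take its weights from the UCA default table plus any language tailoring, typically through a library such as ICU, rather than deriving them from decomposition as this sketch does.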

Re: Unicode string sort

Post by collectordave »

Thanks for the replies. I am looking for something that is as close as possible to an alphabetic sort for each language, which is why I came up with the idea above. Once all the Unicode characters are in the table, each one can be assigned a replacement, which can be any ASCII character, and each language can have its own set. So, for example, when looking at Spanish, the cross-reference for a character can be completely different from that of the same Unicode character in German or any other language. The mapping only exists at the time it is needed, to sort the words of that language into close to alphabetical order. I can tweak each character as much as I like.

I am going to look at the weightings, though. It may be more correct to assign each character a numeric value for each language and then sort based on those numbers. Will have to see.
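
A rough sketch of that numeric-weight idea, again in Python. The alphabets and equivalences below are small hand-written examples, not official collation data, and every name (`make_weights`, `weight_key`, `sort_words`) is invented for illustration. The point is only that the same character can carry a different weight in each language's table.

```python
def make_weights(alphabet, same_as=None):
    """Number the letters in alphabet order; same_as maps a character to
    the letter whose weight it should share."""
    weights = {ch: index for index, ch in enumerate(alphabet)}
    for ch, base in (same_as or {}).items():
        weights[ch] = weights[base]
    return weights


LATIN = "abcdefghijklmnopqrstuvwxyz"

# Illustrative tables only: ñ is its own letter after n in Spanish, å ä ö
# follow z in Swedish, and ö is treated like o in the German-style table.
SPANISH = make_weights("abcdefghijklmnñopqrstuvwxyz",
                       {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"})
SWEDISH = make_weights(LATIN + "åäö")
GERMAN = make_weights(LATIN, {"ä": "a", "ö": "o", "ü": "u"})


def weight_key(word, weights):
    """Turn a word into a list of numbers; unknown characters sort last."""
    unknown = len(weights)
    return [weights.get(ch, unknown) for ch in word.lower()]


def sort_words(words, weights):
    return sorted(words, key=lambda w: weight_key(w, weights))


spanish_words = ["oso", "ñandú", "nube", "nadar"]
shared_words = ["zon", "öra", "ost"]

spanish_sorted = sort_words(spanish_words, SPANISH)
swedish_sorted = sort_words(shared_words, SWEDISH)
german_sorted = sort_words(shared_words, GERMAN)

print(spanish_sorted)  # nadar, nube, ñandú, oso
print(swedish_sorted)  # ost, zon, öra
print(german_sorted)   # öra, ost, zon
```

Because the key is a list of numbers rather than a string of ASCII characters, there is room to slot a letter such as ñ between n and o. It is still a single-level scheme, so words that differ only by an accent or by case compare as equal.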

Kind regards

CD