All times are UTC + 1 hour




 Post subject: Unicode string sort
Posted: Fri Sep 08, 2017 11:42 am
Addict

Joined: Fri Aug 28, 2015 6:10 pm
Posts: 1087
Location: Portugal
Hi All

Just (hopefully) a small problem.

I have a database (SQLite) that returns several lists of words in UTF-8. SQLite, it seems, will not sort Unicode text into alphabetical order. Is there a Unicode sort routine anywhere on the forum? If not, can someone point me in the right direction?
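
For anyone wanting to reproduce the symptom, here is a minimal sketch in Python (standard `sqlite3` module). The table name and sample words are made up for illustration. SQLite's default BINARY collation compares the stored bytes, so upper-case ASCII sorts before lower-case and accented letters land after "z":

```python
import sqlite3

# In-memory database with a hypothetical "words" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [("zebra",), ("Zebra",), ("apple",), ("éclair",), ("Ärger",)],
)

# ORDER BY with no COLLATE clause uses the default BINARY collation,
# which compares the UTF-8 bytes rather than alphabetical position.
rows = conn.execute("SELECT word FROM words ORDER BY word").fetchall()
binary_order = [row[0] for row in rows]
print(binary_order)
# → ['Zebra', 'apple', 'zebra', 'Ärger', 'éclair']
```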

Kind Regards

collectordave

_________________
Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage to move in the opposite direction.


 Post subject: Re: Unicode string sort
Posted: Fri Sep 08, 2017 2:23 pm
Addict

Joined: Fri Aug 28, 2015 6:10 pm
Posts: 1087
Location: Portugal
Just playing around and thought of one way of doing it.

1. Read all the UTF-8 characters into a database table, then add an ASCII replacement as a second field (in my case, all lower-case letters).

2. When running my programme, load a two-dimensional array with all the characters and their replacements.

3. Load my actual strings from the database into a structured array with an index number.

4. Replace all Unicode characters in my strings with the defined ASCII replacement.

5. Sort the structured array and renumber the index number.

6. Use this renumbered index as the sort order for that particular language when loading the list of words.
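
The steps above can be sketched roughly as follows. This is a Python illustration only, not the code from the post: the replacement table, the function names and the sample words are all assumptions, and a dictionary stands in for the database table and the two-dimensional array.

```python
# Steps 1-2: hypothetical replacement table for one language.
# The mappings are illustrative, not a complete or authoritative list.
REPLACEMENTS_DE = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}

def ascii_key(word, replacements):
    """Step 4: swap each mapped character for its ASCII replacement."""
    return "".join(replacements.get(ch, ch) for ch in word.lower())

def sort_order(words, replacements):
    """Steps 3, 5 and 6: sort on the ASCII key and return {word: index}."""
    ordered = sorted(words, key=lambda w: ascii_key(w, replacements))
    return {word: index for index, word in enumerate(ordered)}

words = ["zebra", "öl", "apfel", "ärger", "ofen"]
order = sort_order(words, REPLACEMENTS_DE)

# The stored index can then be used as the sort order for this language.
sorted_words = sorted(words, key=order.get)
print(sorted_words)
# → ['apfel', 'ärger', 'ofen', 'öl', 'zebra']
```

One thing to watch: two different words can collapse to the same ASCII key (for example "ol" and "öl"), in which case their relative order simply follows the input order.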

Working on it now so if anyone sees any problems please post here.

Regards

CD



 Post subject: Re: Unicode string sort
Posted: Fri Sep 08, 2017 4:46 pm
Enthusiast

Joined: Wed Sep 22, 2010 1:17 pm
Posts: 327
Location: United Kingdom
It depends what you are trying to achieve.

If you are looking for a sort which offers more granular control than the options of the built-in sort functions, then it's OK.

If you are trying to achieve an output which would be lexicographically correct for languages which aren't English, then it will be wrong somewhere along the line for all languages other than English! Sorry!

This article explains the problem: https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions
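
A small Python sketch of the kind of conflict the article describes. The two weight tables are simplified illustrations, not complete collation rules: German dictionary order files "ä" with "a", whereas Swedish treats "ä" as a separate letter after "z", so the same strings come out in a different order depending on the language.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Illustrative German table: "ä" gets the same weight as "a".
GERMAN = {ch: i for i, ch in enumerate(ALPHABET)}
GERMAN["ä"] = GERMAN["a"]

# Illustrative Swedish table: "å", "ä", "ö" are extra letters after "z".
SWEDISH = {ch: i for i, ch in enumerate(ALPHABET + "åäö")}

def key(word, table):
    """Turn a word into a list of per-character weights."""
    return [table[ch] for ch in word]

words = ["zon", "ära", "arm"]  # sample strings only
german = sorted(words, key=lambda w: key(w, GERMAN))
swedish = sorted(words, key=lambda w: key(w, SWEDISH))
print(german)   # → ['ära', 'arm', 'zon']
print(swedish)  # → ['arm', 'zon', 'ära']
```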


 Post subject: Re: Unicode string sort
Posted: Fri Sep 08, 2017 4:56 pm
Addict

Joined: Sat Oct 09, 2010 3:47 am
Posts: 1617
Hi collectordave,

See the following explanation/solution: Collation (search based on hints provided by spikey)
- Unicode Collation Algorithm (UCA) comes with a default weight table
Quote:
Q: My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary.
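
To make "multi-level weights" a little more concrete, here is a heavily simplified Python sketch. The weight numbers are invented for illustration and are NOT taken from the UCA default table, and the real algorithm does considerably more than this. The idea shown is only the level-by-level comparison: base letters are compared across the whole word first, accents break ties next, and case breaks ties last.

```python
# Toy table: (primary, secondary, tertiary) = (base letter, accent, case).
# Invented numbers, for illustration only.
WEIGHTS = {
    "a": (1, 0, 0), "A": (1, 0, 1),
    "á": (1, 1, 0), "Á": (1, 1, 1),
    "b": (2, 0, 0), "B": (2, 0, 1),
}

def collation_key(word):
    """Build a key that compares all primary weights, then all secondary
    weights, then all tertiary weights."""
    levels = [WEIGHTS[ch] for ch in word]
    primary = tuple(w[0] for w in levels)
    secondary = tuple(w[1] for w in levels)
    tertiary = tuple(w[2] for w in levels)
    return (primary, secondary, tertiary)

words = ["ba", "Ab", "áb", "ab", "aB"]
by_code_point = sorted(words)
by_weights = sorted(words, key=collation_key)
print(by_code_point)  # → ['Ab', 'aB', 'ab', 'ba', 'áb']
print(by_weights)     # → ['ab', 'aB', 'Ab', 'áb', 'ba']
```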

_________________

STATUS: Permanently Unavailable :: Downloads moved to My PureBasic Stuff
_________________


 Post subject: Re: Unicode string sort
Posted: Fri Sep 08, 2017 5:06 pm
Addict

Joined: Fri Aug 28, 2015 6:10 pm
Posts: 1087
Location: Portugal
Thanks for the replies. I am looking for something which is as close as possible to an alphabetic sort for each language, which is why I came up with the idea above. Once all Unicode characters are in the table, each can be assigned a replacement ASCII character. The replacement can be any ASCII character and each language can have its own, so for example the cross-reference for a character in Spanish can be completely different from that of the same Unicode character in German or any other language. Each mapping is only in use at one time, to sort the words of that language in close to alphabetical order. I can tweak each character as much as I like.

I am going to look at the weightings, though. It may be more correct to assign each character a numeric value for each language and then sort based on those numbers. Will have to see.
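
A rough Python sketch of the "numeric value per character, per language" idea, plugged into SQLite as a custom collation so that ORDER BY does the sorting. The collation name, the weight table and the sample words are assumptions made up for this example. `create_collation` is part of Python's standard `sqlite3` module; other languages would need the equivalent hook in their own SQLite binding.

```python
import sqlite3

# Hypothetical Spanish weights: "ñ" is its own letter between "n" and "o".
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
SPANISH_WEIGHTS = {ch: i for i, ch in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Map each character to its weight; unknown characters sort last."""
    fallback = len(SPANISH_WEIGHTS)
    return [SPANISH_WEIGHTS.get(ch, fallback + ord(ch)) for ch in word.lower()]

def spanish_collation(a, b):
    """Comparison callback for SQLite: returns -1, 0 or 1."""
    ka, kb = spanish_key(a), spanish_key(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("spanish_sketch", spanish_collation)
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [("nube",), ("ñandú",), ("oso",), ("nota",)],
)

rows = conn.execute(
    "SELECT word FROM words ORDER BY word COLLATE spanish_sketch"
).fetchall()
spanish_order = [row[0] for row in rows]
print(spanish_order)
# → ['nota', 'nube', 'ñandú', 'oso']
```

With one table and one registered collation per language, the query can pick the ordering by name and no renumbered index column has to be stored.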

Kind regards

CD


