Re: utf16 string module StrCmp full case folding

Posted: Mon May 08, 2023 11:45 am
by idle
Updated: added left_, right_, mid_, len_, isutf16, pUpCase and plowCase.

See the first post.

Re: utf16 string module StrCmp full case folding

Posted: Mon May 08, 2023 7:59 pm
by Sicro
The CaseFolding.txt file is not suitable for converting lowercase letters to uppercase or vice versa. This file is for normalizing two strings (reducing character variants; called case-folding) so that they can then be compared.

For converting lowercase to uppercase or vice versa, the CaseMapping.txt file must be used.

Here is an example of a problem when you use the CaseFolding.txt file for conversion from upper case to lower case or vice versa:
CaseFolding.txt wrote:
01C4; C; 01C6; # LATIN CAPITAL LETTER DZ WITH CARON
01C5; C; 01C6; # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
https://www.compart.com/en/unicode/U+01C6 wrote:
01C6 - Latin Small Letter Dz with Caron
Now suppose you want to do toUpperCase($01C6) with the CaseFolding.txt file. Which uppercase letter should the function now return from these two? You don't have this problem with the CaseMapping.txt file, because the Unicode Standard has defined only one target letter in it for mapping.

Edit: I meant UnicodeData.txt, not CaseMapping.txt (which does not exist). Sorry, I was too tired.

Re: utf16 string module StrCmp full case folding

Posted: Mon May 08, 2023 9:25 pm
by idle
It uses the first instance, so if it sees a repeated key it ignores it. I have yet to test it against the case mapping file; that's another two hours of work.
The mapping in the case folding data should result in
01c4 01c6 - 01c6 01c4
01c5 01c6 - 01c6 01c4
Whether that is correct I'm not sure; that's today's task.
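
A minimal sketch of the "first instance wins" behaviour described in this post, assuming the folding data is loaded into a reverse map keyed by the folded code point. The names AddFolding and FirstSource are hypothetical and are not taken from the module.

```purebasic
EnableExplicit

; Hypothetical helper: remember only the first source seen for a folded value.
Procedure AddFolding(Map table.i(), source.i, folded.i)
  If FindMapElement(table(), Hex(folded)) = 0
    table(Hex(folded)) = source   ; first instance: keep it
  EndIf                           ; repeated key: ignore it
EndProcedure

NewMap FirstSource.i()            ; key: folded code point as hex text

; The two CaseFolding.txt entries quoted earlier in the thread
AddFolding(FirstSource(), $01C4, $01C6)
AddFolding(FirstSource(), $01C5, $01C6)

; The reverse lookup for $01C6 yields $01C4 only; $01C5 was dropped.
Debug Hex(FirstSource(Hex($01C6)))
```

Which source survives depends only on the order of the entries, so the reverse direction of a folding table carries no information about which character is the intended upper case.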

Re: utf16 string module StrCmp full case folding

Posted: Mon May 08, 2023 10:19 pm
by Sicro
I mistakenly wrote CaseMapping.txt in my previous post (it does not exist); the correct file is UnicodeData.txt.
idle wrote: Mon May 08, 2023 9:25 pm It uses the 1st instance so if it sees a repeat key it ignores it.
Yes, that's what I thought, that's how your code works. But you can't be sure that the 1st instance is always the right one.
idle wrote: Mon May 08, 2023 9:25 pm 01c4 01c6 - 01c6 01c4
01c5 01c6 - 01c6 01c4.
The value I marked in red can be different when mapping with UnicodeData.txt:

01c5 => 01c4 (UpperCase) or 01c6 (LowerCase)

Re: utf16 string module StrCmp full case folding

Posted: Tue May 09, 2023 2:42 am
by idle
Yes, you're right: the issue will appear if a character is encoded as titlecase 01C5.
The respective mappings according to UnicodeData.txt, as upper | lower | titlecase, are:

01C4 = 01C4 | 01C6 | 01C5
01C5 = 01C4 | 01C6 | 01C5
01C6 = 01C4 | 01C6 | 01C5

whereas my table currently gives the following, which erroneously returns the titlecase character when 01C5 is uppercased:
01C4 = 01C4 | 01C6 | x
01C5 = 01C5 | 01C6 | x
01C6 = 01C4 | 01C6 | x
I will have to reassess how I do the data section; I just wanted to minimize it.
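
A rough sketch of how the upper | lower | titlecase table can be read out of UnicodeData.txt, assuming the standard semicolon-separated layout in which the 13th, 14th and 15th fields hold the simple uppercase, lowercase and titlecase mappings. The helper name SimpleMappings is hypothetical, and the sample lines are abridged: every field other than the code point, name, category and mappings is left blank here.

```purebasic
EnableExplicit

; Hypothetical helper: format "cp = upper | lower | title" for one UnicodeData.txt line.
Procedure.s SimpleMappings(line.s)
  Protected cp.s    = StringField(line, 1, ";")
  Protected upper.s = StringField(line, 13, ";")  ; Simple_Uppercase_Mapping
  Protected lower.s = StringField(line, 14, ";")  ; Simple_Lowercase_Mapping
  Protected title.s = StringField(line, 15, ";")  ; Simple_Titlecase_Mapping
  If upper = "" : upper = cp : EndIf      ; empty field: the character maps to itself
  If lower = "" : lower = cp : EndIf
  If title = "" : title = upper : EndIf   ; empty title field: same as the uppercase mapping
  ProcedureReturn cp + " = " + upper + " | " + lower + " | " + title
EndProcedure

; Abridged sample lines (fields 4 to 12 are blanked out for brevity)
Debug SimpleMappings("01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;;;;;;;;;;;01C6;01C5")
Debug SimpleMappings("01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;;;;;;;;;;01C4;01C6;01C5")
Debug SimpleMappings("01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;;;;;;;;;;01C4;;01C5")
```

Because every code point carries its own three targets, no reverse lookup is needed and the titlecase form 01C5 no longer collides with 01C4.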

Re: utf16 string module StrCmp full case folding

Posted: Tue May 09, 2023 5:42 am
by idle
I have redone it:
surrogate pairs equality a𐐀abd = A𐐨ABD
case mapping equality aῼabd = aῳABD
full case folding equality aßEaİdssf = aSSEai̇dßf
simple case folding equality SomeMixedCaseStringWithNothingSpecialOtherThanBeingLong = sOMEmIXEDcASEsTRINGwithnOTHINGsPECIALoTHERtHANbEINGlONG
Normal case equality Normal cmp = Normal cmp
Tolower somemixedcasestringwithnothingspecialotherthanbeinglong
To upper SOMEMIXEDCASESTRINGWITHNOTHINGSPECIALOTHERTHANBEINGLONG
ꭰ AB70



to upper ABCDEF 0123456789, ÄÖÜ, ÄÖÜ, ÁÓÚ FEDCBA DŽ
to lower abcdef 0123456789, äöü, äöü, áóú fedcba dž
to Title Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
🅐A🅚🅐K🅝
left 2 🅐A
right 2 K🅝
mid 1,4 A🅚🅐K
chr_Asc_((Left_example,1))) 🅐
and the result for the previous post's example:
abcdef 0123456789, äöü, äöü, áóú FEDCBA Dž
1C4
1C6
1C4
Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
$1C4 = 1C4 | 1C6 | 1C5
$1C5 = 1C4 | 1C6 | 1C5
$1C6 = 1C4 | 1C6 | 1C5
Appears to check out now.
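
The left, right and mid results in this post count a surrogate pair as one character. A minimal sketch of that idea, assuming a null-terminated UTF-16 buffer; the helper name CodePointLen is hypothetical and is not the module's own len function.

```purebasic
EnableExplicit

; Hypothetical helper: count code points in a null-terminated UTF-16 buffer.
Procedure.i CodePointLen(*s.Unicode)
  Protected count.i = 0
  While *s\u <> 0
    ; A low (trailing) surrogate $DC00-$DFFF completes the previous unit,
    ; so only the other units start a new code point.
    If *s\u < $DC00 Or *s\u > $DFFF
      count + 1
    EndIf
    *s + SizeOf(Unicode)
  Wend
  ProcedureReturn count
EndProcedure

; U+1F150 as its surrogate pair $D83C $DD50, followed by "A" and a terminator.
; The units are stored directly so that Chr() is never called with a surrogate.
Dim units.u(3)
units(0) = $D83C
units(1) = $DD50
units(2) = 'A'
units(3) = 0

Debug CodePointLen(@units())   ; 2 code points stored in 3 code units
```

The same skip-the-trailing-surrogate test is enough to find safe cut positions for code-point-aware left, right and mid operations.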

Re: utf16 string module StrCmp full case folding

Posted: Wed May 10, 2023 3:28 am
by idle
Updated to v2.0.0 and renamed:

UTF16 Utility Module
Provides UTF-16 support to PB:
full case folding compare, similar to CompareMemoryString
in-place string case mappings for uppercase, lowercase and titlecase
string replacements for left, mid, right, len, ucase, lcase, asc, chr
additional tcase (title case)

https://github.com/idle-PB/UTF16

Re: utf16 string module StrCmp full case folding

Posted: Sun Jun 11, 2023 1:31 am
by idle
Redid strLCase / strUCase, added speed tests

UTF16 Strcmp(s3,s4) 68 ms for 1,000,000
CompareMemoryString(@s3,@s4) 62 ms for 1,000,000

UTF16 strLCase / strUcase 41 ms for 1,000,000
LCase / UCase 486 ms for 1,000,000

https://github.com/idle-PB/UTF16
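
For anyone who wants comparable numbers on their own machine, here is a rough timing harness for the built-in side of the comparison only. It is a sketch under assumptions: the post does not say what s3 and s4 contain, so the two strings below are borrowed from the simple case folding test earlier in the thread, and the module calls are left out because their exact signatures are not shown here.

```purebasic
EnableExplicit

#Iterations = 1000000

; Assumed test strings, taken from the simple case folding example in the thread
Define s3.s = "SomeMixedCaseStringWithNothingSpecialOtherThanBeingLong"
Define s4.s = "sOMEmIXEDcASEsTRINGwithnOTHINGsPECIALoTHERtHANbEINGlONG"
Define i.i, result.i, converted.s
Define start.q, elapsed.q

OpenConsole()

; Case-insensitive compare with the built-in function
start = ElapsedMilliseconds()
For i = 1 To #Iterations
  result = CompareMemoryString(@s3, @s4, #PB_String_NoCase)
Next
elapsed = ElapsedMilliseconds() - start
PrintN("CompareMemoryString: " + Str(elapsed) + " ms for " + Str(#Iterations))

; Built-in lower and upper case conversion
start = ElapsedMilliseconds()
For i = 1 To #Iterations
  converted = LCase(s3)
  converted = UCase(s3)
Next
elapsed = ElapsedMilliseconds() - start
PrintN("LCase / UCase: " + Str(elapsed) + " ms for " + Str(#Iterations))

PrintN("Press Enter to quit")
Input()
CloseConsole()
```

Run it with the debugger disabled, otherwise the timings mostly measure debugger overhead.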

Re: utf16 string module StrCmp full case folding

Posted: Mon Oct 16, 2023 5:37 pm
by StarBootics
Hello Idle,

Apparently your code is no longer working with PB 6.03 LTS. I got an error on line 4749 (Function : StrChr(v.i))
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
StarBootics

Re: utf16 string module StrCmp full case folding

Posted: Mon Oct 16, 2023 7:51 pm
by STARGÅTE
This should work:

Code: Select all

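Procedure.s StrChr(v.i) ; return a proper surrogate pair for Unicode values outside the BMP (Basic Multilingual Plane)
	Protected buffer.q
	If v < $10000
		; BMP code point: a single UTF-16 code unit
		ProcedureReturn Chr(v)
	Else
		; Low word:  high surrogate = $D800 | top 10 bits of (v - $10000)
		; High word: low surrogate  = $DC00 | bottom 10 bits of v
		buffer = (v & $3FF) << 16 | (v - $10000) >> 10 | $DC00D800
		; Read the two code units back as a two-character string
		ProcedureReturn PeekS(@buffer, 2, #PB_Unicode)
	EndIf
EndProcedure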
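
The packed expression in the Else branch can be easier to follow when the two halves are computed separately. The sketch below shows the same arithmetic step by step; the helper name SurrogateUnits is hypothetical and only returns the code units as hex text.

```purebasic
EnableExplicit

; Hypothetical helper: show the two UTF-16 code units for a code point >= $10000.
Procedure.s SurrogateUnits(v.i)
  Protected offset.i = v - $10000               ; 20-bit offset into the supplementary planes
  Protected high.i   = $D800 | (offset >> 10)   ; top 10 bits    -> high (leading) surrogate
  Protected low.i    = $DC00 | (offset & $3FF)  ; bottom 10 bits -> low (trailing) surrogate
  ProcedureReturn Hex(high) + " " + Hex(low)
EndProcedure

Debug SurrogateUnits($10400)  ; D801 DC00
Debug SurrogateUnits($1F150)  ; D83C DD50
```

Since $10000 has no bits set in its lowest ten positions, v & $3FF and (v - $10000) & $3FF are equal, which is why the one-line version can skip the subtraction for the low surrogate. Packing both units into one quad and reading them back with PeekS relies on little-endian byte order.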

Re: utf16 string module StrCmp full case folding

Posted: Mon Oct 16, 2023 9:33 pm
by idle
Thanks, STARGÅTE. A range check was added in the IDE; the code still compiled from the command line.

Re: utf16 string module StrCmp full case folding

Posted: Mon Oct 16, 2023 9:37 pm
by idle
StarBootics wrote: Mon Oct 16, 2023 5:37 pm Hello Idle,

Apparently your code is no longer working with PB 6.03 LTS. I got an error on line 4749 (Function : StrChr(v.i))
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
StarBootics
The issue was caused by a range check that was added to the IDE. I've asked for it to be removed, as it really doesn't make much sense.

See STARGÅTE's fix above :D

Re: utf16 string module StrCmp full case folding

Posted: Wed Oct 18, 2023 10:37 pm
by idle
Fixed; the examples are back up. Line 5077 should contain the :D emoji.

Re: utf16 string module StrCmp full case folding

Posted: Mon Nov 20, 2023 2:44 am
by idle
Added function to strip accents in UTF16a.pb
https://github.com/idle-PB/UTF16
"@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „…†‡ ‰Š‹ŚŤŽŹ ‘’“”•–— ™š›śťžź ˇ˘Ł¤Ą¦§¨©Ş«¬­®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙"

becomes

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „.†‡ ‰S‹STZZ ‘’“”•–— Ts›stzz ˇ ٤A¦§ ©S«¬­®Z°± ł μ¶· as»L lzRAAAALCCCEEEEIIDĐNNOOOO×RUUUUYTßraaaalccceeeeiidđnnoooo÷ruuuuyt
performance for 1,000,000
UTF16::Strcmp(s3,s4) 81 ms
CompareMemoryString(@s3,@s4) 782 ms
UTF16::strLCase / UTF16::strUcase 48 ms
LCase / UCase 461 ms