Re: utf16 string module StrCmp full case folding
Posted: Mon May 08, 2023 11:45 am
Updated
Added left_ right_ mid_ len_ isutf16 pUpCase plowCase
See 1st post
CaseFolding.txt wrote: 01C4; C; 01C6; # LATIN CAPITAL LETTER DZ WITH CARON
01C5; C; 01C6; # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
Now suppose you want to do toUpperCase($01C6) with the CaseFolding.txt file. Which of these two uppercase letters should the function return? You don't have this problem with the CaseMapping.txt file, because the Unicode Standard defines only one target letter per mapping in it.
https://www.compart.com/en/unicode/U+01C6 wrote:
01C6 - Latin Small Letter Dz with Caron
idle wrote: Mon May 08, 2023 9:25 pm
It uses the 1st instance so if it sees a repeat key it ignores it.
Yes, that's what I thought, that's how your code works. But you can't be sure that the 1st instance is always the right one.
The value I marked in red can be different when mapping with UnicodeData.txt:
Sicro wrote: Mon May 08, 2023 10:19 pm
I mistakenly wrote CaseMapping.txt in the previous post (does not exist), correct is UnicodeData.txt.
The value I marked in red can be different when mapping with UnicodeData.txt:
01C4 = 01C4 | 01C6 | x
01C5 = 01C5 | 01C6 | x
01C6 = 01C4 | 01C6 | x
Yes, you're right. The issue will appear if a character is encoded as the titlecase letter 01C5:
01c5 => 01c4 (UpperCase) or 01c6 (LowerCase)
I will have to reassess how I do the data section; I just wanted to minimize it.
and previous posts result
surrogate pairs equality a𐐀abd = A𐐨ABD
case mapping equality aῼabd = aῳABD
full case folding equaility aßEaİdssf = aSSEai̇dßf
simple case folding equality SomeMixedCaseStringWithNothingSpecialOtherThanBeingLong = sOMEmIXEDcASEsTRINGwithnOTHINGsPECIALoTHERtHANbEINGlONG
Nomal case equality Normal cmp = Normal cmp
Tolower somemixedcasestringwithnothingspecialotherthanbeinglong
To upper SOMEMIXEDCASESTRINGWITHNOTHINGSPECIALOTHERTHANBEINGLONG
ꭰ AB70
Ꭰ
Ꭰ
ꭰ
to upper ABCDEF 0123456789, ÄÖÜ, ÄÖÜ, ÁÓÚ FEDCBA DŽ
to lower abcdef 0123456789, äöü, äöü, áóú fedcba dž
to Title Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
🅐A🅚🅐K🅝
left 2 🅐A
right 2 K🅝
mid 1,4 A🅚🅐K
chr_Asc_((Left_example,1))) 🅐
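The left/right/mid results above only come out right if a surrogate pair is treated as one character. A minimal sketch of that idea follows; CodePointCount is a hypothetical helper written for illustration, not the module's len_, and it assumes the source file is saved as UTF-8 so the string literal survives:

```purebasic
; Hypothetical helper (not part of the UTF16 module): counts code points
; in a UTF-16 string, treating a high+low surrogate pair as one character.
Procedure.i CodePointCount(text.s)
  Protected *c.Unicode = @text
  Protected count.i = 0
  If *c = 0
    ProcedureReturn 0 ; empty string
  EndIf
  While *c\u <> 0
    If *c\u >= $D800 And *c\u <= $DBFF
      ; high surrogate: step over it, then over the low surrogate if present
      *c + SizeOf(Unicode)
      If *c\u >= $DC00 And *c\u <= $DFFF
        *c + SizeOf(Unicode)
      EndIf
    Else
      *c + SizeOf(Unicode)
    EndIf
    count + 1
  Wend
  ProcedureReturn count
EndProcedure

Debug Len("🅐A🅚🅐K🅝")            ; 10 UTF-16 code units
Debug CodePointCount("🅐A🅚🅐K🅝") ; 6 code points
```

The built-in Len() counts code units, which is why plain Left()/Right()/Mid() can cut a pair in half.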
Appears to check out now.
abcdef 0123456789, äöü, äöü, áóú FEDCBA Dž
1C4
1C6
1C4
Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
$1C4 = 1C4 | 1C6 | 1C5
$1C5 = 1C4 | 1C6 | 1C5
$1C6 = 1C4 | 1C6 | 1C5
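Reading the three columns above as upper | lower | title (an assumption, but it matches the simple case mappings in UnicodeData.txt), the ambiguity goes away once every code point carries all three targets instead of a single folded value. A rough sketch of such a table; CaseEntry, CaseTable and AddCase are made-up names for illustration, not the module's data section:

```purebasic
; Hypothetical layout: one entry per code point holding all three simple mappings.
Structure CaseEntry
  upper.l
  lower.l
  title.l
EndStructure

NewMap CaseTable.CaseEntry()

; Store the three mappings under the hex string of the code point.
Procedure AddCase(Map table.CaseEntry(), cp.l, upper.l, lower.l, title.l)
  Protected key.s = Hex(cp)
  table(key)\upper = upper
  table(key)\lower = lower
  table(key)\title = title
EndProcedure

; DZ / Dz / dz with caron: all three share the same targets.
AddCase(CaseTable(), $01C4, $01C4, $01C6, $01C5)
AddCase(CaseTable(), $01C5, $01C4, $01C6, $01C5)
AddCase(CaseTable(), $01C6, $01C4, $01C6, $01C5)

Debug Hex(CaseTable("1C6")\upper) ; 1C4
Debug Hex(CaseTable("1C5")\lower) ; 1C6
Debug Hex(CaseTable("1C4")\title) ; 1C5
```

This costs more data than a single fold value per key, which is the trade-off against minimizing the data section.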
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
Code: Select all
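Procedure.s StrChr(v.i) ; return a proper surrogate pair for Unicode values outside the BMP (Basic Multilingual Plane)
  Protected buffer.q
  If v < $10000
    ; BMP code point: a single UTF-16 code unit is enough
    ProcedureReturn Chr(v)
  Else
    ; low word = high surrogate ($D800 + top 10 bits), high word = low surrogate ($DC00 + bottom 10 bits);
    ; this memory layout assumes a little-endian machine
    buffer = (v & $3FF) << 16 | (v - $10000) >> 10 | $DC00D800
    ProcedureReturn PeekS(@buffer, 2, #PB_Unicode)
  EndIf
EndProcedure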
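For anyone checking the bit arithmetic in StrChr by hand, here is a small sketch of the same split written as separate steps, plus the reverse direction. The three procedure names are made up for this example:

```purebasic
; Hypothetical helpers that spell out the surrogate arithmetic step by step.
Procedure.i HighSurrogate(cp.i)
  ; top 10 bits of (cp - $10000), offset into $D800..$DBFF
  ProcedureReturn $D800 + ((cp - $10000) >> 10)
EndProcedure

Procedure.i LowSurrogate(cp.i)
  ; bottom 10 bits, offset into $DC00..$DFFF
  ProcedureReturn $DC00 + (cp & $3FF)
EndProcedure

Procedure.i PairToCodePoint(high.i, low.i)
  ; inverse of the two procedures above
  ProcedureReturn $10000 + ((high - $D800) << 10) + (low - $DC00)
EndProcedure

Debug Hex(HighSurrogate($1F150))         ; D83C
Debug Hex(LowSurrogate($1F150))          ; DD50
Debug Hex(PairToCodePoint($D83C, $DD50)) ; 1F150
```

$1F150 is the 🅐 used in the left/right/mid test earlier in the thread.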
STARGÅTE wrote: Mon Oct 16, 2023 7:51 pm
This should work:
Code: Select all
Procedure.s StrChr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected buffer.q
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    Buffer = (v&$3FF)<<16 | (v-$10000)>>10 | $DC00D800
    ProcedureReturn PeekS(@Buffer, 2, #PB_Unicode)
  EndIf
EndProcedure
Thanks STARGÅTE, there was a range check added in the IDE. It was still compiling from the command line.
StarBootics wrote: Mon Oct 16, 2023 5:37 pm
Hello Idle,
Apparently your code is no longer working with PB 6.03 LTS. I got an error on line 4749 (Function : StrChr(v.i))
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
StarBootics
The issue was caused by a range check that was added to the IDE. I've asked for it to be removed, as it really doesn't make much sense.
"@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „…†‡ ‰Š‹ŚŤŽŹ ‘’“”•–— ™š›śťžź ˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙"
becomes
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „.†‡ ‰S‹STZZ ‘’“”•–— Ts›stzz ˇ ٤A¦§ ©S«¬®Z°± ł μ¶· as»L lzRAAAALCCCEEEEIIDĐNNOOOO×RUUUUYTßraaaalccceeeeiidđnnoooo÷ruuuuyt
performance for 1,000,000
UTF16::Strcmp(s3,s4) 81 ms
CompareMemoryString(@s3,@s4) 782 ms
UTF16::strLCase / UTF16::strUcase 48 ms
LCase / UCase 461 ms