utf16 string module StrCmp full case folding

Posted: Wed Dec 14, 2022 1:10 am
by idle
The UTF16 utility module adds UTF-16 support to PB:
full case folding compare, similar to CompareMemoryString
in-place string case mappings for uppercase, lowercase and titlecase
string replacements for left, mid, right, len, ucase, lcase, asc and chr
an additional tcase (title case)
accent stripping

The data supports both implementations that require simple case foldings
(where string lengths don't change), and implementations that allow full case folding
(where string lengths may grow). Note that where they can be supported, the
full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
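
To make the difference between the compare modes concrete, here is a rough sketch in Python rather than PB. It is not the module's code: `fold_simple` and `equal` are hypothetical names, the mode strings only mirror the #CaseNormal / #CaseSimple / #CaseFull flags, and simple folding is approximated with Python's built-ins instead of being read from CaseFolding.txt.

```python
def fold_simple(s):
    """Approximate simple case folding: each character maps to exactly one character."""
    out = []
    for ch in s:
        full = ch.casefold()          # full folding, may expand (e.g. ß -> ss)
        if len(full) == 1:
            out.append(full)
        else:
            low = ch.lower()          # fall back to a one-to-one lowercase mapping
            out.append(low if len(low) == 1 else ch)
    return "".join(out)


def equal(a, b, mode="full"):
    """Compare two strings under one of the three modes (hypothetical helper)."""
    if mode == "normal":              # s <> S
        return a == b
    if mode == "simple":              # s = S, lengths never change
        return fold_simple(a) == fold_simple(b)
    if mode == "full":                # ss = ß, lengths may grow
        return a.casefold() == b.casefold()
    raise ValueError("unknown mode: " + mode)


print(equal("s", "S", "normal"))          # case-sensitive compare
print(equal("s", "S", "simple"))          # one-to-one folding
print(equal("MASSE", "Maße", "simple"))   # ß cannot expand under simple folding
print(equal("MASSE", "Maße", "full"))     # ß folds to ss under full folding
```

Only the full mode lets "MASSE" and "Maße" match, because ß has to expand to two characters.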
;UTF16 v 2.0.0
;Authors idle mk-soft 15/11/22 - 10/5/23
;license MIT
;
;full case folding is required when strings differ in length but are deemed equivalent
;see https://www.unicode.org/Public/UCD/late ... olding.txt
;for example "mASSE" and "Maße" are equal.
;this provides a fast scalable string compare
;casemappings
;see "http://www.unicode.org/Public/UCD/lates ... deData.txt"
;

;#CaseNormal s <> S
;#CaseSimple s = S
;#CaseFull ss = ß

;History
;v1.2.1
;redone in table, needs improvement
;added normal strcmp for completeness
;v1.2.2
;fixed stride bug if size of mapped char > $FFFFFF + 3 to other string
;v1.2.3
;swapped around mapping was in reverse order
;v1.2.4
;returns 1 if strings are equal
;v1.2.5 19/12/22
;changed flag #CASEWITHCASE to #CASENORMAL
;v1.2.6 Changed to support surrogate pairs for UTF16 support
; added chr_() asc_() functions for surrogate pairs
;v1.2.7 fixed bug in _asc function
;v1.2.8 fixed short string bug
;v1.2.9 fixed bug in same case mapping 1st char
;v1.2.10 fixed start of table
;v1.2.11 Added Left_, Right_, Len_, Is_UTF16 : mk-soft
;v1.2.12a Added Mid_. pUpCase, pLowCase : idle
;v1.2.13a Redid Casemapping data added pTitleCase : idle

;v2.0.0 Renamed module and its functions as it has grown beyond case folding
;v2.0.1 Redid strLcase strUcase, removed redundant ifs, redid arrays for better cache locality. Added speed test for strLcase strUcase : idle


Implementations v2
https://github.com/idle-PB/UTF16

Performance with the C backend


UTF16 Strcmp(s3,s4) 68 ms for 1,000,000
PB CompareMemoryString(@s3,@s4) 62 ms for 1,000,000

UTF16 strLCase / strUcase 41 ms for 1,000,000
PB LCase / UCase 486 ms for 1,000,000


Note: If you need to support Turkish with full case folding use the StrcmpTK function.
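
For anyone wondering why Turkish needs its own function: CaseFolding.txt has two extra "T" mappings (0049 -> 0131 and 0130 -> 0069) that replace the default handling of I and İ. The sketch below is Python, not PB, and `fold_turkic` is a hypothetical helper. It only illustrates the rule and says nothing about how StrcmpTK is implemented.

```python
def fold_turkic(s):
    """Full case folding with the Turkic 'T' mappings applied first (illustrative only)."""
    # T mappings from CaseFolding.txt:
    #   0049 (I) -> 0131 (dotless ı)
    #   0130 (İ) -> 0069 (i)
    s = s.replace("I", "\u0131").replace("\u0130", "i")
    return s.casefold()


# Default folding maps I to i, so the dotless ı in "dış" no longer matches.
print("DIŞ".casefold() == "dış".casefold())
# With the Turkic mappings applied, I folds to ı and the words match.
print(fold_turkic("DIŞ") == fold_turkic("dış"))
```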

Re: strcmp string compare for simple case and full case folding

Posted: Sun Jan 15, 2023 3:28 am
by idle
Updated: added support for surrogate pairs, so this now matches:

Code:

sa = "a" + _Chr($10400) + "abd"
sb = "A" + _Chr($10428) + "ABD"
If StrCmp(sa, sb)
  Debug "surrogate pairs " + sa + " = " + sb
EndIf
surrogate pairs a𐐀abd = A𐐨ABD
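
As background on what the surrogate support has to deal with: code points above $FFFF, such as $10400 and $10428, take two 16-bit units each in UTF-16. Below is a minimal Python sketch of the standard split and join. The helper names `to_surrogates` and `from_surrogates` are made up for illustration and are not the module's functions.

```python
def to_surrogates(cp):
    """Split a code point above $FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi, lo):
    """Join a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


hi, lo = to_surrogates(0x10400)
print(hex(hi), hex(lo))                      # 0xd801 0xdc00
print(hex(from_surrogates(hi, lo)))          # 0x10400

# $10400 case-folds to $10428, which is why the two test strings compare equal.
sa = "a" + chr(0x10400) + "abd"
sb = "A" + chr(0x10428) + "ABD"
print(sa.casefold() == sb.casefold())
```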

Re: strcmp string compare for simple case and full case folding

Posted: Sun Jan 15, 2023 10:18 pm
by idle
v1.2.7: fixed a bug in the asc function

Re: strcmp string compare for simple case and full case folding

Posted: Sat Jan 21, 2023 12:47 pm
by Sicro
I've now had a bit of time to test it again. Sorry, but something is still wrong:

Code:

Debug CaseFolding::StrCmp("ß", "ss") ; returns `0`
Debug CaseFolding::StrCmp("ßz", "ssz") ; returns `1`
Debug CaseFolding::StrCmp("zß", "zss") ; returns `0`

Re: strcmp string compare for simple case and full case folding

Posted: Sat Jan 21, 2023 8:11 pm
by idle
I broke it when adding the surrogate pairs; it's fixed now.
The problem was a missing boundary check: the loop has to test that it doesn't run past the end of the string.
It's a complicated bit of code, so I hope it's all correct now.

Code:

 While (((aa & $ffff) = Casemapping(mode,*b\a[cb])) And *b\a[cb] <> 0) 
Debug StrCmp("ß", "ss") ; returns `1`
Debug StrCmp("ßz", "ssz") ; returns `1`
Debug StrCmp("zß", "zss") ; returns `1`
I've also added it to GitHub; see the OP for links.

Re: strcmp string compare for simple case and full case folding

Posted: Sun Jan 22, 2023 11:40 am
by Sicro
Now everything works correctly. I have checked it with all characters of the `CaseFolding.txt` file. Well done :)
idle wrote: Sat Jan 21, 2023 8:11 pm I've also add it to github see OP for links.
Nice, I gave it a star.

Re: strcmp string compare for simple case and full case folding

Posted: Sun Jan 22, 2023 12:31 pm
by Sicro
Unfortunately, I have now found something after all:

Code:

; 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
; 1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
; 1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
CaseFolding::StrCmp(Chr($00DF), Chr($1E9E))
Produces an infinite loop.

Re: strcmp string compare for simple case and full case folding

Posted: Sun Jan 22, 2023 9:47 pm
by idle
I think I've caught it. When both chars mapped to the same expanded mapping, the compare stalled on the same character. Testing with sharp S isn't really ideal, as both forms evaluate to the same expanded sequence 0073 0073.
I've replaced the tests with GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI = 03C9 03B9:

Code:

    sa = "a" + _Chr($1FFC) + "abd" 
    sb = "a" + _Chr($1FF3) + "ABD" 
    
    ;1FFC; F; 03C9 03B9; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI   
    ;1FF3; F; 03C9 03B9; # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
  
    If StrCmp(sa,sb) 
        Debug "casemapping " + sa + " = " + sb  
    EndIf   

aῼabd = aῳABD
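
The two CaseFolding.txt entries quoted in the code can be cross-checked outside PB. This is only a sanity check using Python's built-in `str.casefold()`, which applies full case folding. It does not touch the module itself.

```python
CAPITAL = "\u1FFC"          # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
SMALL = "\u1FF3"            # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
EXPANDED = "\u03C9\u03B9"   # the shared full folding: 03C9 03B9

# Both letters expand to the same two-character sequence.
print(CAPITAL.casefold() == EXPANDED)
print(SMALL.casefold() == EXPANDED)

# The same test strings as in the PB snippet.
sa = "a" + CAPITAL + "abd"
sb = "a" + SMALL + "ABD"
print(sa.casefold() == sb.casefold())

# The earlier sharp S case: 00DF and 1E9E both fold to "ss".
print("\u00DF".casefold() == "\u1E9E".casefold() == "ss")
```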
Fingers crossed it's working properly now and after compiling with c backend 6.01b the performance is smoking hot 8)

Most languages would pass over the strings three times:

1) copy the strings
2) convert to the full case mapped strings
3) compare the strings

You only need one pass, though, and that's why the code is butt ugly!
Smokin! wrote: Strcmp(s3,s4) 67 ms for 1,000,000
CompareMemoryString(@s3,@s4) 64 ms for 1,000,000
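
The single-pass idea can be sketched in a few lines of Python. This is not how the module does it (the PB code works on raw UTF-16 units and a mapping table), and `folded_chars` and `strcmp_one_pass` are hypothetical names. The point is only that each character is folded lazily and compared straight away, so no folded copy of either string is built.

```python
from itertools import zip_longest


def folded_chars(s):
    """Yield the case-folded characters of s one at a time, expanding where needed."""
    for ch in s:
        # A single character may fold to several (ß -> s, s).
        yield from ch.casefold()


def strcmp_one_pass(a, b):
    """Return True if a and b are equal under full case folding, in a single pass."""
    # zip_longest pads the shorter stream with None, which never equals a character,
    # so a length mismatch after folding is reported as "not equal".
    return all(x == y for x, y in zip_longest(folded_chars(a), folded_chars(b)))


# The cases reported earlier in the thread.
print(strcmp_one_pass("ß", "ss"))
print(strcmp_one_pass("ßz", "ssz"))
print(strcmp_one_pass("zß", "zss"))
print(strcmp_one_pass("\u00DF", "\u1E9E"))   # both sides expand to the same sequence
```

Because both sides advance through their own expansion independently, the case where both characters expand to the same sequence needs no special handling here.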

Re: strcmp string compare for simple case and full case folding

Posted: Mon Jan 23, 2023 11:10 am
by Fred
Glad to see the improvements in C optimization are working as expected! What was the timing for the ASM backend?

Re: strcmp string compare for simple case and full case folding

Posted: Mon Jan 23, 2023 7:04 pm
by idle
Fred wrote: Mon Jan 23, 2023 11:10 am Glad to see the improvement in C optimization are working as expected ! What was the timing for ASM backend ?
Around 390 ms; it was 160 ms before my bug fixes. I will take a look at the assembly when I get time tomorrow.
Really cool result and it was a good bit of code for the optimization. I will try it on the elliptic curve module too.

Re: strcmp string compare for simple case and full case folding

Posted: Fri Feb 24, 2023 10:28 pm
by RichAlgeni
[Image]
Google is a pain!

Re: strcmp string compare for simple case and full case folding

Posted: Fri Feb 24, 2023 11:22 pm
by idle
Thanks for the heads up. Google's embargo is specifically about Dnscope.exe and its installer. It's laughable that Google considers me an existential threat. You can still get casefold.pb from GitHub.

Re: strcmp string compare for simple case and full case folding

Posted: Sat Feb 25, 2023 4:01 am
by Rinzwind
Probably helps if you put your executables in zip files instead of directly executable exes. Google rules the internet...

Re: strcmp string compare for simple case and full case folding

Posted: Sat Feb 25, 2023 5:01 pm
by RichAlgeni
idle wrote: Fri Feb 24, 2023 11:22 pm It's laughable that I'm considered to be an existential threat by google.
Always have to watch out for those New Zealanders!!! Funny accents, and all!

Re: strcmp string compare for simple case and full case folding

Posted: Sat Feb 25, 2023 8:16 pm
by idle
RichAlgeni wrote: Sat Feb 25, 2023 5:01 pm
idle wrote: Fri Feb 24, 2023 11:22 pm It's laughable that I'm considered to be an existential threat by google.
Always have to watch out for those New Zealanders!!! Funny accents, and all!
The Google block has been removed for now; maybe it was something I said.