Character Count of Unicode String

Posted: Mon Aug 19, 2024 3:31 pm
by tkaltschmidt
Hi there,

I'm evaluating PureBasic and have to say: very impressed so far!

Little question: Is PureBasic capable of counting the perceived characters of a Unicode string? The following code returns 2 characters, but there is only one. Am I missing the correct function, or is this a known limitation?

Code:

EnableExplicit

Define MyString1$ = "😍"

Define MyLength.i = Len(MyString1$)
Debug MyLength ; shows 2, as described above

Best, Thomas

Re: Character Count of Unicode String

Posted: Mon Aug 19, 2024 3:40 pm
by DarkDragon
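PureBasic uses wide characters (UCS-2, 16 bits each) as its internal string representation, so Unicode support is limited to the Basic Multilingual Plane.

Your emoji is U+1F60D, which lies outside that plane. It is stored as two 16-bit units (a surrogate pair), and that is why Len() returns 2.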
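
To see this for yourself, here is a minimal sketch that dumps the 16-bit units PureBasic stores for that emoji. It assumes the source file is saved as UTF-8 so the literal survives, and the variable names are only for illustration.

```purebasic
EnableExplicit

Define s$ = "😍"  ; U+1F60D, assumes the source file is saved as UTF-8
Define i.i

Debug Len(s$)  ; 2, because Len() counts 16-bit units

; Print each 16-bit unit of the string in hex
For i = 0 To Len(s$) - 1
  Debug Hex(PeekU(@s$ + i * SizeOf(Unicode)))  ; D83D, then DE0D
Next
```

The two values form a surrogate pair: $D83D is in the high range ($D800-$DBFF) and $DE0D is in the low range ($DC00-$DFFF).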

Re: Character Count of Unicode String

Posted: Mon Aug 19, 2024 3:51 pm
by miskox

Re: Character Count of Unicode String

Posted: Mon Aug 19, 2024 4:11 pm
by tkaltschmidt
Thank you, Daniel and Saso!

Re: Character Count of Unicode String

Posted: Mon Aug 19, 2024 4:59 pm
by Fred
This module should do exactly what you want: https://www.purebasic.fr/english/viewto ... ilit=Utf16

Re: Character Count of Unicode String

Posted: Mon Aug 19, 2024 6:45 pm
by tkaltschmidt
Looks good, thank you, Fred!

Re: Character Count of Unicode String

Posted: Mon Aug 19, 2024 11:10 pm
by idle
I just updated UTF16.pb to expose strLen(string$), but if all you need is the length, this standalone procedure is enough:

Code:

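; Counts Unicode code points in a UTF-16 string; a surrogate pair counts as one.
Procedure StrLen_(str.s)
  Protected *Char.Unicode = @str
  Protected *Low.Unicode
  Protected cnt
  If *Char
    While *Char\u
      *Low = *Char + 2 ; next unit, at worst the null terminator
      ; High surrogate followed by a low surrogate: skip both units.
      ; A lone surrogate advances by one unit only, so the terminator is never skipped.
      If *Char\u >= $D800 And *Char\u <= $DBFF And *Low\u >= $DC00 And *Low\u <= $DFFF
        *Char + 4
      Else
        *Char + 2
      EndIf
      cnt + 1
    Wend
  EndIf
  ProcedureReturn cnt
EndProcedure

Define example$ = "😁A😁😁K😁"
Debug StrLen_(example$) ; 6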
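
If you also need the actual code point and not just the count, the two halves of a pair can be combined arithmetically. This is only a minimal sketch: CodePointFromPair is a hypothetical helper made up for illustration, not something taken from UTF16.pb, and it assumes the source file is saved as UTF-8.

```purebasic
EnableExplicit

; Hypothetical helper: combines a high and a low surrogate into one code point.
; Assumes high is in $D800-$DBFF and low is in $DC00-$DFFF (no validation here).
Procedure.i CodePointFromPair(high.i, low.i)
  ProcedureReturn $10000 + ((high - $D800) << 10) + (low - $DC00)
EndProcedure

Define s$ = "😍"
Define high.i = PeekU(@s$)                   ; first 16-bit unit
Define low.i  = PeekU(@s$ + SizeOf(Unicode)) ; second 16-bit unit

Debug Hex(CodePointFromPair(high, low))      ; 1F60D
```

Note that this, like StrLen_, works on code points. A character built from several code points (for example a base letter plus a combining accent) would still count as more than one.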

Re: Character Count of Unicode String

Posted: Tue Aug 20, 2024 2:56 pm
by tkaltschmidt
That's awesome, thanks, idle!

May I ask: What is the difference between UTF16.pb and UTF16a.pb?

Re: Character Count of Unicode String

Posted: Tue Aug 20, 2024 8:22 pm
by idle
UTF16a.pb additionally includes a mapping to strip accents.