Ucase() fails with UTF

Torf · Post by **Torf** » Thu Jul 13, 2023 9:31 pm

Hi.
Perhaps it's my fault, but when I try to convert a string to upper case with UCase(), it does not work with the letter 'ñ'.

infratec · Post by **infratec** » Thu Jul 13, 2023 10:24 pm

Code: Select all

Debug UCase("ñ")

works.

PB works with Unicode, not UTF.
So you need to convert UTF to Unicode first (PeekS()).
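
A minimal sketch of that conversion, assuming the UTF-8 text sits in memory as null-terminated bytes. The label Utf8Text is made up for this illustration; PeekS() with #PB_UTF8 decodes the bytes into a PB string, which can then be passed to UCase():

```purebasic
; Decode null-terminated UTF-8 bytes into a PB string before changing case.
; Utf8Text is a hypothetical label used only for this illustration.
Text$ = PeekS(?Utf8Text, -1, #PB_UTF8)

Debug Len(Text$)    ; 1, the two bytes form a single character
Debug UCase(Text$)

DataSection
  Utf8Text:
  Data.a $C3, $B1, $00 ; "ñ" (U+00F1) in UTF-8, plus the terminating zero
EndDataSection
```

If the same buffer were read without the #PB_UTF8 flag, the two bytes would not be combined into one character, which may be what makes UCase() appear to fail.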

jassing · Post by **jassing** » Thu Jul 13, 2023 11:08 pm

Torf wrote: Thu Jul 13, 2023 9:31 pm Hi.
Perhaps it's my fault, but when I try to convert a string to upper case with UCase(), it does not work with the letter 'ñ'.

Try idle's case-folding code... it's robust & fast.

infratec wrote: Thu Jul 13, 2023 10:24 pm PB works with Unicode, not UTF.

UTF is Unicode (the U in UTF stands for "Unicode").

mk-soft · Post by **mk-soft** » Thu Jul 13, 2023 11:19 pm

jassing wrote: Thu Jul 13, 2023 11:08 pm
infratec wrote: Thu Jul 13, 2023 10:24 pm PB works with Unicode, not UTF.
UTF is Unicode (the U in UTF stands for "Unicode").

Unicode (UC16): One char, one word (16 bits).
UTF8: Dynamic byte length per char.
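
That difference can be checked with StringByteLength(). A small sketch, assuming a Unicode executable and using Chr() so the result does not depend on how the source file is saved:

```purebasic
; Byte count of single characters in both formats (terminating null not counted).
Debug StringByteLength("n", #PB_UTF8)            ; 1
Debug StringByteLength(Chr($F1), #PB_UTF8)       ; 2  (ñ)
Debug StringByteLength(Chr($20AC), #PB_UTF8)     ; 3  (€)
Debug StringByteLength("n", #PB_Unicode)         ; 2
Debug StringByteLength(Chr($F1), #PB_Unicode)    ; 2
Debug StringByteLength(Chr($20AC), #PB_Unicode)  ; 2
```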

jassing · Post by **jassing** » Thu Jul 13, 2023 11:24 pm

mk-soft wrote: Thu Jul 13, 2023 11:19 pm
Unicode (UC16): One char, one word (16 bits).
UTF8: Dynamic byte length per char.

That doesn't mean "UTF" is not Unicode. Unicode is the broader term.

UTF-7, UTF-8, UTF-16 and UTF-32 are all Unicode encodings.

Olli · Post by **Olli** » Fri Jul 14, 2023 8:07 am

Oh... a debate! I love this.

In the term 'U.T.F.' there is a 'T'

T = Transformation

So, unless I am mistaken (surely!), UTF is not the Unicode format but a format to transform small-sized character codes (ASCII) into larger-sized character codes (UNICODE). This is needed because the set of international characters is too big to fit in a simple set of 256 characters.

"Historically", ASCII was a set of 128 characters:
33 control characters and 95 printable characters (from #32 to #126)

Then 128 extra characters were added from index #128 to #255, named graphic characters. To cover the diversity of local characters in the world, these graphic characters were made available through country pages.

To allow storing characters from any country page in a single character string, a convention format was created: UTF.

And, years later, a new fixed-size format was created, like the 7-bit character format: the 16-bit character format (Unicode format).

Note that one UTF character can take up an unlimited number of bytes.

Sicro · Post by **Sicro** » Fri Jul 14, 2023 1:32 pm

I think this link will clear up the misconceptions here about Unicode:

General questions, relating to UTF or Encoding Form

When we talk about Unicode in PureBasic, we generally mean UCS-2. This is how Unicode strings are officially encoded in PureBasic.

Olli · Post by **Olli** » Sat Jul 15, 2023 10:43 am

Hello Sicro,

thank you for this additional data. But I am not sure the first link is perfect (although unicode.org is a very strong reference). Please note, for example, that in the paragraph "Q: Is a BOM used only in 16-bit Unicode text?" there is an illustrative table which gives two descriptions for the same code prefix:

Code: Select all

FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian

I think it is an adapted page, but not a perfect reference page.

I also assure you: I do not know the whole process of decoding a UTF character code. It is complex. But I studied it personally when I wanted to know why the rule is so distorted.

What we can conclude is:

The ASCII and UNICODE conventions map one constant-sized code to each character. This allows working without a page change. But there are many pages for ASCII, just as there are many pages for UNICODE. We normally consider only one page for each of these two formats.

The UTF conventions map one variable-sized code to each character. The size depends on the prefix of the code. And I consider it a bug that a prefix can tell the program that a big (sometimes very big) chunk of bytes means a single character.

UTF-8 (endianness is ignored in my message)
(p= prefix ; c= character code content)
[p+c] : string byte length = 1 byte
[p+c][c][p+c] : string byte length = 3 bytes
[p+c][c][p+c][c][p+c] : string byte length = 5 bytes
etc ...

UTF-16
[P+C] : string byte length = 2 bytes
[P+C][C][P+C] : string byte length = 6 bytes
[P+C][C][P+C][C][P+C] : string byte length = 10 bytes
etc...

UTF was created to withstand hardware and software changes. But, as we can see, it withstands next to nothing if we use specific characters, and it offers an awkward way to insert x86-based code in string data.

Sicro · Post by **Sicro** » Sat Jul 15, 2023 2:57 pm

Olli wrote: Sat Jul 15, 2023 10:43 am But I am not sure the first link is perfect (although unicode.org is a very strong reference).
[...]
I think it is an adapted page, but not a perfect reference page.

It is the official site of the Unicode Consortium. If you know a better reference site, please name it.

I think for the misconceptions in this thread - and that's what I was referring to in my post - it's good.

Olli wrote: Sat Jul 15, 2023 10:43 am Please note, for example, that in the paragraph "Q: Is a BOM used only in 16-bit Unicode text?" there is an illustrative table which gives two descriptions for the same code prefix:
Code: Select all
FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian

The prefixes "FF FE 00 00" and "FF FE" are not the same to me.
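
A rough sketch of how a reader can tell the two marks apart, assuming the file start is already in memory. GuessBom() is a made-up helper name, not a PureBasic function, and only the two little-endian marks from the quoted table are handled:

```purebasic
; GuessBom() is a hypothetical helper, not part of PureBasic.
Procedure.s GuessBom(*Buffer, Size)
  ; Test the 4-byte mark first: it begins with the same two bytes as the 2-byte mark.
  If Size >= 4 And PeekA(*Buffer) = $FF And PeekA(*Buffer + 1) = $FE And PeekA(*Buffer + 2) = 0 And PeekA(*Buffer + 3) = 0
    ProcedureReturn "UTF-32 little endian"
  EndIf
  If Size >= 2 And PeekA(*Buffer) = $FF And PeekA(*Buffer + 1) = $FE
    ProcedureReturn "UTF-16 little endian"
  EndIf
  ProcedureReturn "no little-endian BOM"
EndProcedure

Debug GuessBom(?Bom32, 4) ; UTF-32 little endian
Debug GuessBom(?Bom16, 4) ; UTF-16 little endian

DataSection
  Bom32:
  Data.a $FF, $FE, $00, $00
  Bom16:
  Data.a $FF, $FE, $41, $00 ; BOM followed by "A" in UTF-16 LE
EndDataSection
```

The only ambiguous case is UTF-16 LE text whose first character after the BOM is U+0000, which this sketch would report as UTF-32.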

Olli wrote: Sat Jul 15, 2023 10:43 am The ASCII and UNICODE conventions map one constant-sized code to each character. This allows working without a page change. But there are many pages for ASCII, just as there are many pages for UNICODE. We normally consider only one page for each of these two formats.

What do you mean by "pages"? Do you mean Code Pages, which cause an identical character code in (extended) ASCII to result in different characters depending on the set code page? There are no different code pages in Unicode; Unicode was invented to eliminate the chaos of different code pages.

Olli wrote: Sat Jul 15, 2023 10:43 am The UTF conventions map one variable-sized code to each character. The size depends on the prefix of the code. And I consider it a bug that a prefix can tell the program that a big (sometimes very big) chunk of bytes means a single character.

Yes, that's a problem with UTF encodings, that the character codes that are far back in the Unicode table need a lot of bytes. Therefore, for texts written with these characters, these encodings create a wasteful amount of data. For all other characters, however, UTF encodings reduce the amount of data required.

Olli wrote: Sat Jul 15, 2023 10:43 am UTF was created to withstand hardware and software changes. But, as we can see, it withstands next to nothing if we use specific characters, and it offers an awkward way to insert x86-based code in string data.

What do you mean by "insert x86-based code in string data"? If a UTF decoder processes non-compliant byte sequences and then returns a broken string instead of reporting an error, it is a bug in the UTF decoder, not in the corresponding UTF algorithm itself.

Little John · Post by **Little John** » Sat Jul 15, 2023 6:28 pm

Sicro wrote: I think for the misconceptions in this thread - and that's what I was referring to in my post - it's good.

Yes, it is. The information that you provided is correct and sufficient for the purpose of this thread. There is no need to have a fundamental discussion about UTF here.

Olli · Post by **Olli** » Sun Jul 16, 2023 12:20 am

@Sicro

Code: Select all

FFFE0000
FFFExxxx (with xxxx <> 0)

These two codes are not the same for me. In this way, I understand.

You refer to the ASCII code pages, and that is exactly what I was talking about. I remember the old ASCII page was 437. Then 850... Then 1252... I am stuck with #437 to read 8086 code.

Also,
ASCII has pages
UNICODE has planes (17 in a universe of 65536 planes)

As for the UTF convention, I consider it a bug to have invented this:
( [16 bits] )
[prefix1+codeA][codeB][prefix2+codeC][why?][why?]...[why again?]

Why a second prefix? Why that, considering that 6-byte UTF-16 characters have 4000 times more combinations than everything Unicode has imagined in its whole history since 1991?

Such a second prefix will be useful in 160 000 years.
And there is an infinite set of prefixes... I think they imagine we are not alone in the universe.

Thank you

idle · Post by **idle** » Sun Jul 16, 2023 12:51 am

If you want to go down the rabbit hole you can begin here
https://www.unicode.org/versions/Unicode15.0.0/

Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters

PB uses UCS-2, which is limited to $FFFF and doesn't support surrogate pairs. For this I have written a UTF-16 support module with feedback and help from Sicro and mk-soft. It's not pretty but it's quick. The utility functions strUcase and strLcase work in place and are 10 times faster than PB's.
viewtopic.php?t=80275
or DDL
https://dnscope.io/idlefiles/UTF16.pb
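
For readers unfamiliar with surrogate pairs: UTF-16 stores a code point above $FFFF as two 16-bit units. A rough sketch of the arithmetic, using plain integer math and made-up variable names (this is not code from the UTF16 module linked above):

```purebasic
; Split a code point above $FFFF into a UTF-16 surrogate pair.
CodePoint = $1F600                 ; example code point outside the UCS-2 range
Offset    = CodePoint - $10000     ; 20-bit value to spread over two units
High      = $D800 + (Offset >> 10) ; upper 10 bits go into the high surrogate
Low       = $DC00 + (Offset & $3FF); lower 10 bits go into the low surrogate

Debug Hex(High) ; D83D
Debug Hex(Low)  ; DE00
```

A string routine that treats every 16-bit unit as a complete character sees these as two separate values, which is the limitation described above.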

Torf · Post by **Torf** » Sun Jul 16, 2023 8:24 am

I tested again, this time on several platforms, and I'm sure that on Windows it runs OK, but on macOS with an Intel chip it fails.

idle · Post by **idle** » Sun Jul 16, 2023 9:55 am

I suggest you provide the exact code that fails.
The reply from infratec shows that it works. So are you saying that it doesn't? If that is the case, what are your system specs?

Olli · Post by **Olli** » Sun Jul 16, 2023 10:18 am

But... who adopted the black horse in the chess set?

Anyway, good luck with your work. Even in the CP437 of the '80s, IBM forgot the upper and lower cases through bit 5!

This means letter case will be a problem, again and again...
