Ucase() fails with UTF

Torf
User
Posts: 13
Joined: Thu Apr 27, 2023 8:03 pm

Ucase() fails with UTF

Post by Torf »

Hi.
Perhaps it's my fault, but when I try to convert a string to upper case with UCase(), it does not work with the letter 'ñ'.
infratec
Always Here
Posts: 7588
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Ucase() fails with UTF

Post by infratec »


Debug UCase("ñ")
works.

PB works with unicode, not utf.
So you need to convert UTF to Unicode first (PeekS()).
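
As a rough illustration of that point (written in Python rather than PureBasic, only to show the principle): case conversion applied to raw UTF-8 bytes leaves 'ñ' untouched, while decoding to a Unicode string first gives the expected result. The variable names are illustrative, and this is a sketch, not a description of how PB's UCase() works internally.

```python
# Sketch: case conversion needs decoded text, not raw UTF-8 bytes.
raw = "\u00f1".encode("utf-8")     # 'ñ' as UTF-8: the two bytes C3 B1

# bytes.upper() only changes ASCII letters, so these bytes stay as they are.
upper_bytes = raw.upper()

# Decode first, then convert: this works on the code point U+00F1.
upper_text = raw.decode("utf-8").upper()

print(upper_bytes)   # b'\xc3\xb1'
print(upper_text)    # Ñ
```

In PureBasic the decoding step would be the PeekS() call mentioned above.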
jassing
Addict
Posts: 1885
Joined: Wed Feb 17, 2010 12:00 am

Re: Ucase() fails with UTF

Post by jassing »

Torf wrote: Thu Jul 13, 2023 9:31 pm Hi.
Perhaps it's my fault, but when I try to convert a string to upper case with UCase(), it does not work with the letter 'ñ'.
Try idle's case-folding code... it's robust & fast.

infratec wrote: Thu Jul 13, 2023 10:24 pm PB works with unicode, not utf.
UTF is Unicode (the U in UTF stands for "Unicode").
mk-soft
Always Here
Posts: 6209
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Ucase() fails with UTF

Post by mk-soft »

jassing wrote: Thu Jul 13, 2023 11:08 pm
infratec wrote: Thu Jul 13, 2023 10:24 pm PB works with unicode, not utf.
UTF is Unicode (the U in UTF stands for "Unicode").
:?

Unicode (UC16): one char, one word.
UTF8: dynamic byte length per char
jassing
Addict
Posts: 1885
Joined: Wed Feb 17, 2010 12:00 am

Re: Ucase() fails with UTF

Post by jassing »

mk-soft wrote: Thu Jul 13, 2023 11:19 pm
Unicode (UC16): one char, one word.
UTF8: dynamic byte length per char
That doesn't mean "UTF" is not Unicode. Unicode is the broader term.

UTF-7, UTF-8, UTF-16 and UTF-32 are all Unicode encodings.
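
A small sketch of that statement, in Python for illustration: the same code point (U+00F1, 'ñ') is stored as different byte sequences depending on the encoding form, and each form decodes back to the same character. The names `encodings` and `encoded` are just illustrative.

```python
# Sketch: one code point, several Unicode encoding forms.
ch = "\u00f1"                      # 'ñ', code point U+00F1

encodings = ["utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    # Every form decodes back to the same character.
    assert data.decode(name) == ch
    print(f"{name:10} {data.hex(' ')}  ({len(data)} bytes)")
```

The bytes differ, but the character they stand for does not; that is the sense in which all of them are "Unicode".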
Olli
Addict
Posts: 1202
Joined: Wed May 27, 2020 12:26 pm

Re: Ucase() fails with UTF

Post by Olli »

Ow... A debate ! I love this :mrgreen:

In the term 'U.T.F.' there is a 'T'

T = Transformation

So, unless I am mistaken (surely! :mrgreen: ), UTF is not the Unicode format itself but a format to transform small-sized character codes (ASCII) into larger-sized character codes (Unicode). This is needed because the set of international characters is too big to fit in a simple set of 256 characters.

Historically, ASCII was a set of 128 characters:
33 control characters and 95 printable characters (from #32 to #126)

Then 128 extra characters, named graphic characters, were added from index #128 to #255. To cope with the diversity of local characters around the world, these graphic characters were made available through country-specific pages.

To allow characters from any country page to be stored in a single character string, a conventional format was created: UTF.

And, years later, a new fixed-size format was created along the lines of the 7-bit character format: the 16-bit character format (the Unicode format).

Note that one UTF character can occupy an unlimited number of bytes.
Sicro
Enthusiast
Posts: 559
Joined: Wed Jun 25, 2014 5:25 pm
Location: Germany

Re: Ucase() fails with UTF

Post by Sicro »

I think this link will clear up the misconceptions here about Unicode:

General questions, relating to UTF or Encoding Form

When we talk about Unicode in PureBasic, we generally mean UCS-2. This is how Unicode strings are officially encoded in PureBasic.
Olli
Addict
Posts: 1202
Joined: Wed May 27, 2020 12:26 pm

Re: Ucase() fails with UTF

Post by Olli »

Hello Sicro,

Thank you for this additional data. But I am not sure the first link is perfect (although unicode.org is a very strong reference). Please note, for example, that in the paragraph "Q: Is a BOM used only in 16-bit Unicode text?" there is a table which gives two descriptions for the same code prefix:


FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian
I think it is an adapted page, but not a perfect reference page.

I also assure you: I do not know the whole process of decoding a UTF character code. It is complex. But I studied it personally when I wanted to know why the rule is so distorted.

What we can conclude is:

The ASCII and Unicode conventions map one constant-sized code to each character. This makes it possible to work without a page change. But there are many pages for ASCII, just as there are many pages for Unicode. We normally consider only one page for each of these two formats.

The UTF conventions map one variable-sized code to each character. The size depends on the prefix of the code. And we must regard as a bug the idea of prefixing the code to tell the program that a big (sometimes very big) chunk of bytes means a single character.

UTF-8 (endianness is ignored in my message)
(p= prefix ; c= character code content)
[p+c] : string byte length = 1 byte
[p+c][c][p+c] : string byte length = 3 bytes
[p+c][c][p+c][c][p+c] : string byte length = 5 bytes
etc ...

UTF-16
[P+C] : string byte length = 2 bytes
[P+C][C][P+C] : string byte length = 6 bytes
[P+C][C][P+C][C][P+C] : string byte length = 10 bytes
etc...

UTF was created to withstand hardware and software changes. But, as we can see, it withstands next to nothing if we use specific characters, and it gives an awkward way to insert x86-based code in string data.
Sicro
Enthusiast
Posts: 559
Joined: Wed Jun 25, 2014 5:25 pm
Location: Germany

Re: Ucase() fails with UTF

Post by Sicro »

Olli wrote: Sat Jul 15, 2023 10:43 am But I am not sure the first link is perfect (although unicode.org is a very strong reference).
[...]
I think it is an adapted page, but not a perfect reference page.
It is the official site of the Unicode Consortium. If you know a better reference site, please name it.

I think for the misconceptions in this thread - and that's what I was referring to in my post - it's good.
Olli wrote: Sat Jul 15, 2023 10:43 am Please note, for example, that in the paragraph "Q: Is a BOM used only in 16-bit Unicode text?" there is a table which gives two descriptions for the same code prefix:


FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian
The prefixes "FF FE 00 00" and "FF FE" are not the same to me.
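
A minimal sketch of how a reader can tell the two prefixes apart, in Python for illustration. The helper name `sniff_bom` is hypothetical; the BOM constants come from Python's standard `codecs` module. The only trick is to test the longer BOM first, because the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Return the encoding suggested by a leading BOM, or 'unknown'."""
    # Longer BOMs come first: FF FE 00 00 must be checked before FF FE.
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return "unknown"

utf32_sample = bytes.fromhex("fffe0000") + "\u00f1".encode("utf-32-le")
utf16_sample = bytes.fromhex("fffe") + "\u00f1".encode("utf-16-le")

print(sniff_bom(utf32_sample))   # utf-32-le
print(sniff_bom(utf16_sample))   # utf-16-le
```

One caveat with this simple approach: UTF-16 LE text whose first character is U+0000 would start with the same four bytes as the UTF-32 LE mark, so a BOM check like this is a heuristic, not a proof.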
Olli wrote: Sat Jul 15, 2023 10:43 am The ASCII and Unicode conventions map one constant-sized code to each character. This makes it possible to work without a page change. But there are many pages for ASCII, just as there are many pages for Unicode. We normally consider only one page for each of these two formats.
What do you mean by "pages"? Do you mean Code Pages, which cause an identical character code in (extended) ASCII to result in different characters depending on the set code page? There are no different code pages in Unicode; Unicode was invented to eliminate the chaos of different code pages.
Olli wrote: Sat Jul 15, 2023 10:43 am The UTF conventions map one variable-sized code to each character. The size depends on the prefix of the code. And we must regard as a bug the idea of prefixing the code to tell the program that a big (sometimes very big) chunk of bytes means a single character.
Yes, that's a problem with UTF encodings, that the character codes that are far back in the Unicode table need a lot of bytes. Therefore, for texts written with these characters, these encodings create a wasteful amount of data. For all other characters, however, UTF encodings reduce the amount of data required.
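
To put rough numbers on this, here is a small Python sketch that prints how many bytes a few sample characters take in UTF-8 and UTF-16. The sample characters are an arbitrary choice for illustration. Note that UTF-8 as currently specified never needs more than four bytes for one code point.

```python
# Sketch: encoded size grows with the position of the code point.
samples = ["A", "\u00f1", "\u20ac", "\U0001F600"]   # A, ñ, €, and an emoji above U+FFFF

sizes = {}
for ch in samples:
    sizes[ch] = (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    print(f"U+{ord(ch):04X}: UTF-8 {sizes[ch][0]} bytes, UTF-16 {sizes[ch][1]} bytes")
```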
Olli wrote: Sat Jul 15, 2023 10:43 am UTF was created to withstand hardware and software changes. But, as we can see, it withstands next to nothing if we use specific characters, and it gives an awkward way to insert x86-based code in string data.
What do you mean by "insert x86-based code in string data"? If a UTF decoder processes non-compliant byte sequences and then returns a broken string instead of reporting an error, it is a bug in the UTF decoder, not in the corresponding UTF algorithm itself.
Little John
Addict
Posts: 4779
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Ucase() fails with UTF

Post by Little John »

Sicro wrote: I think for the misconceptions in this thread - and that's what I was referring to in my post - it's good.
Yes, it is. The information that you provided is correct and sufficient for the purpose of this thread. There is no need to have a fundamental discussion about UTF here.
Olli
Addict
Posts: 1202
Joined: Wed May 27, 2020 12:26 pm

Re: Ucase() fails with UTF

Post by Olli »

@Sicro


FFFE0000
FFFExxxx (with xxxx <> 0)
These two codes are not the same for me. In this way, I understand.

You refer to the ASCII code pages, and that is exactly what I was talking about. I remember the old ASCII page was 437. Then 850... Then 1252... I am stuck with #437 to read 8086 code.

Also,
ASCII has pages
UNICODE has planes (17 in a universe of 65536 planes)

As for the UTF convention, having invented this is a bug:
( [16 bits] )
[prefix1+codeA][codeB][prefix2+codeC][why?][why?]...[why again?]

Why a second prefix? Why, considering that a 6-byte UTF-16 character has 4000 times more combinations than everything Unicode has come up with since 1991?

Such a second prefix will be useful in 160 000 years.
And there is an infinite set of prefixes... I think they imagine we are not alone in the universe. :mrgreen:

Thank you
idle
Always Here
Posts: 5840
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Ucase() fails with UTF

Post by idle »

If you want to go down the rabbit hole you can begin here
https://www.unicode.org/versions/Unicode15.0.0/
Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters
PB uses UCS-2, which is limited to $FFFF and doesn't support surrogate pairs. For this I have written a UTF-16 support module with feedback and help from Sicro and mk-soft. It's not pretty, but it's quick. The utility functions strUcase and strLcase work in place and are 10 times faster than PB's.
viewtopic.php?t=80275
or DDL
https://dnscope.io/idlefiles/UTF16.pb
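
For readers unfamiliar with surrogate pairs, here is a short Python sketch of the standard UTF-16 arithmetic for a code point above $FFFF. It is not taken from the module linked above; the function name `to_surrogates` is hypothetical and the sample code point U+1F600 is an arbitrary choice.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")             # D83D DE00

# Cross-check against Python's own UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))   # True
```

A UCS-2 string routine that treats each 16-bit word as one character would see two separate units here, which is why such characters need extra handling.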
Torf
User
Posts: 13
Joined: Thu Apr 27, 2023 8:03 pm

Re: Ucase() fails with UTF

Post by Torf »

I tested again, this time on several platforms, and I'm sure that on Windows it runs OK, but on macOS with an Intel chip it fails.
idle
Always Here
Posts: 5840
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Ucase() fails with UTF

Post by idle »

I suggest you provide the exact code that fails.
The reply from infratec shows that it works. So are you saying that it doesn't? If that is the case, what are your system specs?
Olli
Addict
Posts: 1202
Joined: Wed May 27, 2020 12:26 pm

Re: Ucase() fails with UTF

Post by Olli »

But... Who has adopted the black horse in the chess set ? :lol:

Anyway, good luck with your work. Even in the CP437 of the '80s, IBM forgot to handle upper and lower case through bit 5!

This means letter case will be a problem, again and again...
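
The bit 5 remark can be checked with a few lines of Python (a sketch for illustration; `CASE_BIT` is just a name chosen here). In ASCII, flipping bit 5 switches between 'n' and 'N'. In CP437, 'ñ' and 'Ñ' sit on adjacent codes, so the same trick does not apply.

```python
CASE_BIT = 0x20   # bit 5

# ASCII: upper and lower case letters differ only in bit 5.
ascii_flip = chr(ord("n") ^ CASE_BIT)

# CP437: look up the single-byte codes of 'ñ' and 'Ñ'.
lower_code = "\u00f1".encode("cp437")[0]
upper_code = "\u00d1".encode("cp437")[0]

print(ascii_flip)                          # N
print(hex(lower_code), hex(upper_code))    # 0xa4 0xa5
print(hex(lower_code ^ upper_code))        # 0x1
```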