Ucase() fails with UTF
Hi.
Perhaps it's my mistake, but when I try to convert a string to upper case with UCase(), it does not work with the letter 'ñ'.
Re: Ucase() fails with UTF
Code: Select all
Debug UCase("ñ")
PB works with Unicode, not UTF.
So you need to convert UTF to Unicode first (PeekS()).
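The "convert first" advice can be illustrated outside PureBasic. The following is a minimal sketch in Python (not PB code; the helper name `upper_utf8` is made up for this example), assuming the text arrives as raw UTF-8 bytes. Upper-casing the bytes directly leaves 'ñ' untouched, while decoding to a Unicode string first gives the expected result.

```python
def upper_utf8(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, upper-case the text, and encode it again."""
    return raw.decode("utf-8").upper().encode("utf-8")


raw = "año".encode("utf-8")  # b'a\xc3\xb1o': 'ñ' occupies two bytes

# Byte-wise upper-casing only touches ASCII letters; the two bytes of 'ñ' stay as they are.
bytewise = raw.upper()

# Decoding first lets the case mapping see the real character.
decoded_first = upper_utf8(raw)

print(bytewise.decode("utf-8"))       # AñO
print(decoded_first.decode("utf-8"))  # AÑO
```

In PB terms this corresponds to reading the buffer into a normal string before calling UCase(), instead of treating UTF-8 bytes as if they were characters.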
Re: Ucase() fails with UTF
Torf wrote: Thu Jul 13, 2023 9:31 pm
Hi.
Perhaps it's my mistake, but when I try to convert a string to upper case with UCase(), it does not work with the letter 'ñ'.

Try idle's case-folding code... it's robust & fast.
UTF is Unicode. (The U in UTF is "Unicode".)
Re: Ucase() fails with UTF

Unicode (UC16): one char, one word.
UTF-8: dynamic byte length per char.
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
Re: Ucase() fails with UTF
mk-soft wrote: Thu Jul 13, 2023 11:19 pm
Unicode (UC16): one char, one word.
UTF-8: dynamic byte length per char.

It doesn't mean "UTF" is not Unicode. Unicode is a broader term.
UTF-7, UTF-8, UTF-16 and UTF-32 are all types of Unicode encoding.
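To make the "same characters, different encodings" point concrete, here is a small Python sketch (not PB code; the function name `encoded_sizes` is only illustrative). It counts how many bytes one character needs in UTF-8, UTF-16 and UTF-32, using the little-endian variants so that no BOM is added.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# One ASCII letter, the 'ñ' from this thread, the euro sign, and an emoji above U+FFFF.
for ch in ("A", "ñ", "€", "😀"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The character itself (its code point) is the same in every case; only the byte representation changes.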
Re: Ucase() fails with UTF
Ow... A debate! I love this.
In the term 'U.T.F.' there is a 'T'.
T = Transformation
So, unless I am mistaken (surely!), UTF is not the Unicode format itself but a format to transform small-sized character codes (ASCII) into larger-sized character codes (UNICODE). This is needed because the set of international characters is too big to be contained in a simple set of 256 characters.
"Historically", ASCII was a set of 128 characters:
33 system characters and 95 typing characters (from #32 to #126).
Then 128 extra characters were added from index #128 to #255, named graphic characters. To respond to the diversity of local characters in the world, these graphic characters were available through country pages.
To allow storing any characters of any country pages in a single character string, a convention format was created: UTF.
And, years later, a new fixed-size format was created, like the 7-bit character format: the 16-bit character format (Unicode format).
Note that one UTF character can allocate an infinity of bytes.
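For reference, the size of a single encoded character can be checked against the upper end of the code space. This is a hedged Python sketch (not PB code; `MAX_CODE_POINT` is just a local name), assuming the current Unicode limit of U+10FFFF. It suggests that one code point never needs more than 4 bytes in UTF-8, UTF-16 or UTF-32, even though a visible "character" made of several combined code points can of course be longer.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space

last = chr(MAX_CODE_POINT)
utf8_len = len(last.encode("utf-8"))
utf16_len = len(last.encode("utf-16-le"))
utf32_len = len(last.encode("utf-32-le"))
print(utf8_len, utf16_len, utf32_len)  # 4 4 4

# Anything beyond that limit is not accepted as a character at all.
try:
    chr(MAX_CODE_POINT + 1)
    rejected = False
except ValueError:
    rejected = True
print("beyond the limit rejected:", rejected)
```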

Re: Ucase() fails with UTF
I think this link will clear up the misconceptions here about Unicode:
General questions, relating to UTF or Encoding Form
When we talk about Unicode in PureBasic, we generally mean UCS-2. This is how Unicode strings are officially encoded in PureBasic.

Why OpenSource should have a license :: PB-CodeArchiv-Rebirth :: Pleasant-Dark (syntax color scheme) :: RegEx-Engine (compiles RegExes to NFA/DFA)
Manjaro Xfce x64 (Main system) :: Windows 10 Home (VirtualBox) :: Newest PureBasic version
Re: Ucase() fails with UTF
Hello Sicro,
thank you for this additional data. But I am not sure the first link is perfect (even though unicode.org is a very strong reference). Please note, for example, that in the paragraph "Q: Is a BOM used only in 16-bit Unicode text?" there is an illustrating table which gives two descriptions for the same code prefix:
Code: Select all
FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian
I think it is an adapted page, but not a perfect reference page.
I also assure you: I do not know the whole process to decode a UTF character code. It is complex. But I studied it personally when I wanted to know why there is such a distorted rule.
What we can conclude is:
ASCII and UNICODE conventions map one constant-sized code to a character. This is used to work without a page change. But there are several pages for ASCII, as there are several pages for UNICODE. We normally consider only one page for each of these two formats.
UTF conventions map one variable-sized code to a character. The size depends on the prefix of the code. And we must consider it a bug that a prefix can tell the task to treat a big (sometimes very big) chunk of bytes as a single character.
UTF-8 (endianness is ignored in my message)
(p = prefix; c = character code content)
[p+c] : string byte length = 1 byte
[p+c][c][p+c] : string byte length = 3 bytes
[p+c][c][p+c][c][p+c] : string byte length = 5 bytes
etc...
UTF-16
[P+C] : string byte length = 2 bytes
[P+C][C][P+C] : string byte length = 6 bytes
[P+C][C][P+C][C][P+C] : string byte length = 10 bytes
etc...
UTF was created to withstand hardware and software changes forever. But, as we can well see, it withstands next to nothing if we use specific characters, and it offers an awkward way to insert x86-based code into string data.
Re: Ucase() fails with UTF
Olli wrote: Sat Jul 15, 2023 10:43 am
But I am not sure the first link is perfect (even though unicode.org is a very strong reference).
[...]
I think it is an adapted page, but not a perfect reference page.

It is the official site of the Unicode Consortium. If you know a better reference site, please name it.
I think for the misconceptions in this thread - and that's what I was referring to in my post - it's good.

Olli wrote: Sat Jul 15, 2023 10:43 am
Please note, for example, that in the paragraph "Q: Is a BOM used only in 16-bit Unicode text?" there is an illustrating table which gives two descriptions for the same code prefix:
Code: Select all
FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian

The prefixes "FF FE 00 00" and "FF FE" are not the same for me.

Olli wrote: Sat Jul 15, 2023 10:43 am
ASCII and UNICODE conventions map one constant-sized code to a character. This is used to work without a page change. But there are several pages for ASCII, as there are several pages for UNICODE. We normally consider only one page for each of these two formats.

What do you mean by "pages"? Do you mean code pages, which cause an identical character code in (extended) ASCII to result in different characters depending on the code page that is set? There are no different code pages in Unicode; Unicode was invented to eliminate the chaos of different code pages.

Olli wrote: Sat Jul 15, 2023 10:43 am
UTF conventions map one variable-sized code to a character. The size depends on the prefix of the code. And we must consider it a bug that a prefix can tell the task to treat a big (sometimes very big) chunk of bytes as a single character.

Yes, that's a problem with UTF encodings: the character codes that are far back in the Unicode table need a lot of bytes. Therefore, for texts written with these characters, these encodings create a wasteful amount of data. For all other characters, however, UTF encodings reduce the amount of data required.

Olli wrote: Sat Jul 15, 2023 10:43 am
UTF was created to withstand hardware and software changes forever. But, as we can well see, it withstands next to nothing if we use specific characters, and it offers an awkward way to insert x86-based code into string data.

What do you mean by "insert x86-based code into string data"? If a UTF decoder processes non-compliant byte sequences and then returns a broken string instead of reporting an error, it is a bug in the UTF decoder, not in the corresponding UTF algorithm itself.
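The two BOM prefixes discussed above can be told apart as long as the longer signature is tested first. Below is a minimal Python sketch (not PB code; `BOMS` and `detect_bom` are hypothetical names for this example), assuming the data starts with a BOM if it has one.

```python
# Longer signatures come first: FF FE 00 00 (UTF-32 LE) begins with FF FE (UTF-16 LE).
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32 little endian"),
    (b"\x00\x00\xfe\xff", "UTF-32 big endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16 little endian"),
    (b"\xfe\xff", "UTF-16 big endian"),
]


def detect_bom(data: bytes):
    """Return the encoding name suggested by a leading BOM, or None if there is none."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    return None


print(detect_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))  # UTF-32 little endian
print(detect_bom(b"\xff\xfeA\x00"))                  # UTF-16 little endian
```

One caveat of this simple approach: UTF-16 little endian text whose first character after the BOM is U+0000 would be reported as UTF-32, so a real reader may want additional checks.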

Re: Ucase() fails with UTF
Sicro wrote:
I think for the misconceptions in this thread - and that's what I was referring to in my post - it's good.

Yes, it is. The information that you provided is correct and sufficient for the purpose of this thread. There is no need to have a fundamental discussion about UTF here.
Re: Ucase() fails with UTF
@Sicro
Code: Select all
FFFE0000
FFFExxxx (with xxxx <> 0)
These two codes are not the same for me. Seen this way, I understand.
You refer to the ASCII code pages, and this is exactly what I was talking about. I remember the old ASCII page was 437. Then 850... Then 1252... I am stuck with #437 to read 8086 code.
Also,
ASCII has pages
UNICODE has planes (17 in a universe of 65536 planes)
As for the UTF convention, it is a bug to have invented this:
( [16 bits] )
[prefix1+codeA][codeB][prefix2+codeC][why?][why?]...[why again?]
Why a second prefix? Why that, considering that a 6-byte UTF-16 character has 4000 times more combinations than the whole history of Unicode has imagined since 1991?
Such a second prefix will be useful in 160 000 years.
And there is an infinite set of prefixes... I think they imagine we are not alone in the universe.
Thank you
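The plane arithmetic mentioned above can be checked quickly. This is a small Python sketch (not PB code; `PLANE_SIZE`, `PLANE_COUNT` and `plane_of` are names chosen for this example), assuming the 17 planes of 65,536 code points each that Unicode currently defines.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane
PLANE_COUNT = 17      # planes 0 to 16


def plane_of(ch: str) -> int:
    """Return the plane number a character's code point belongs to."""
    return ord(ch) >> 16


total = PLANE_SIZE * PLANE_COUNT
print(total)                          # 1114112 code points, i.e. U+0000 to U+10FFFF
print(plane_of("ñ"), plane_of("😀"))  # 0 1
```

Plane 0 is the part that fits into a single 16-bit unit; everything in planes 1 to 16 needs two 16-bit units in UTF-16.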
Re: Ucase() fails with UTF
If you want to go down the rabbit hole you can begin here:
https://www.unicode.org/versions/Unicode15.0.0/
Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters.
PB uses UCS-2, which is limited to $FFFF; it doesn't support surrogate pairs. For this I have written a UTF-16 support module with feedback and help from Sicro and mk-soft. It's not pretty but it's quick. The utility functions strUcase and strLcase work in place and are 10 times faster than PB's.
viewtopic.php?t=80275
or DDL
https://dnscope.io/idlefiles/UTF16.pb
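For readers wondering what a surrogate pair is: the sketch below shows the standard UTF-16 arithmetic in Python. It is not idle's module and not PB code, and the function name `to_surrogates` is made up for this example. A code point above $FFFF is split into two 16-bit units, which is exactly the case a pure UCS-2 view cannot represent as one character.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000  # 20 bits remain to be distributed
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)  # U+1F600, the "grinning face" emoji
print(hex(high), hex(low))          # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (little endian, no BOM).
encoded = "\U0001F600".encode("utf-16-le")
print(encoded == high.to_bytes(2, "little") + low.to_bytes(2, "little"))  # True
```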
Re: Ucase() fails with UTF
I tested again, this time on several platforms, and I'm sure that on Windows it runs OK, but on macOS with an Intel chip it fails.
Re: Ucase() fails with UTF
I suggest you provide the exact code that fails.
The reply from infratec shows that it works. So are you saying that it doesn't? If that is the case, what are your system specs?
Re: Ucase() fails with UTF
But... who adopted the black horse in the chess set?
Anyway, good luck with your work. Even in the CP437 of the 80's, IBM forgot about switching between upper and lower case through bit 5!
This means letter case will be a problem, again and again...
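The bit 5 remark can be illustrated with a short Python sketch (not PB code; `CASE_BIT` and `toggle_case_bit` are illustrative names), relying on Python's built-in cp437 codec. Flipping bit 5 switches the case of an ASCII letter, but in CP437 'ñ' and 'Ñ' sit next to each other and differ in bit 0 instead, so the same trick produces a different letter.

```python
CASE_BIT = 0x20  # bit 5


def toggle_case_bit(byte: int) -> int:
    """Flip bit 5 of a character code."""
    return byte ^ CASE_BIT


# ASCII: 'a' (0x61) and 'A' (0x41) differ only in bit 5.
print(chr(toggle_case_bit(ord("a"))))  # A

# CP437: 'ñ' is 0xA4 and 'Ñ' is 0xA5.
lower = "ñ".encode("cp437")[0]
upper = "Ñ".encode("cp437")[0]
print(hex(lower), hex(upper), hex(lower ^ upper))  # 0xa4 0xa5 0x1

# Flipping bit 5 of 'ñ' lands on an unrelated character.
wrong = bytes([toggle_case_bit(lower)]).decode("cp437")
print(wrong)  # ä
```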
