Hello
Sicro,
thank you for this additional data. But I am not sure the first link is perfect (even though unicode.org is a very strong reference). Please note, for example, that in the paragraph
"Q: Is a BOM used only in 16-bit Unicode text?" there is an illustrating table which gives two descriptions for the same byte prefix:
Code: Select all
FF FE 00 00 = UTF-32 little endian
FF FE = UTF-16 little endian
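To make the ambiguity concrete: a file beginning with FF FE 00 00 can be read as one UTF-32 little-endian BOM, or as a UTF-16 little-endian BOM followed by the character U+0000. A minimal Python sketch of my own (not from the FAQ page):
Code: Select all
# The same four bytes are valid under both conventions.
data = b'\xff\xfe\x00\x00'

# Read as UTF-32: all four bytes are one little-endian BOM,
# so the decoded text is empty.
print(repr(data.decode('utf-32')))   # ''

# Read as UTF-16: the first two bytes are the little-endian BOM,
# and the remaining two bytes decode to the character U+0000.
print(repr(data.decode('utf-16')))   # '\x00'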
I think it is an adapted page, but not a perfect reference page.
I also assure you: I do not know the whole process of decoding a UTF character code. It is complex. But I studied it personally when I wanted to understand why such a distorted rule exists.
What we can conclude is:
The ASCII and Unicode conventions map one fixed-size code to one character. This makes it possible to work without a code page change. But there are many code pages for ASCII, just as there are many "pages" (planes) for Unicode; we normally consider only one page for each of these two formats. (See the small sketch below.)
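As a small illustration (a Python sketch of my own), a fixed-size convention such as UTF-32 spends exactly the same number of bytes for every character:
Code: Select all
# Each character maps to one fixed-size code: always 4 bytes in UTF-32,
# whether the character is ASCII or not.
for ch in ['A', 'é', '€']:
    encoded = ch.encode('utf-32-le')       # the -le variant adds no BOM
    print(ch, hex(ord(ch)), len(encoded), 'bytes')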
The UTF conventions map one variable-size code to one character; the size depends on the prefix of the code. And we must consider a flaw: the prefix tells the decoding task that a big (sometimes very big) chunk of bytes means a single character.
UTF-8 (endianness is ignored in my message)
(p = prefix; c = character code content; a continuation byte [c] carries its own small prefix 10 as well)
[p+c] : one 1-byte character, string byte length = 1 byte
[p+c][c] [p+c] : one 2-byte character + one 1-byte character, string byte length = 3 bytes
[p+c][c] [p+c][c] [p+c] : two 2-byte characters + one 1-byte character, string byte length = 5 bytes
etc. (a single UTF-8 character can take up to 4 bytes)
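These lengths are easy to check (again a Python sketch of my own; the musical character at the end is just an example of a 4-byte code):
Code: Select all
# UTF-8: the lead byte's prefix announces the total length,
# and every continuation byte starts with the bits 10.
for ch in ['A', 'é', '€', '\U0001D11E']:   # the last one is a 4-byte character
    b = ch.encode('utf-8')
    print(ch, len(b), 'byte(s):', ' '.join(f'{x:08b}' for x in b))
# A -> 1 byte  : 0xxxxxxx
# é -> 2 bytes : 110xxxxx 10xxxxxx
# € -> 3 bytes : 1110xxxx 10xxxxxx 10xxxxxx
# (G clef) -> 4 bytes : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx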
UTF-16
(P = prefix; C = character code content; each bracket is one 16-bit unit, so 2 bytes)
[P+C] : one 2-byte character, string byte length = 2 bytes
[P+C][C] [P+C] : one 4-byte character (a surrogate pair) + one 2-byte character, string byte length = 6 bytes
[P+C][C] [P+C][C] [P+C] : two surrogate pairs + one 2-byte character, string byte length = 10 bytes
etc.
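And the same check for UTF-16 (my own sketch; a character above U+FFFF needs a surrogate pair, which is the [P+C][C] case above):
Code: Select all
# UTF-16: BMP characters take one 16-bit unit (2 bytes); characters
# above U+FFFF take a surrogate pair (two units, 4 bytes).
for ch in ['A', '€', '\U0001D11E']:
    b = ch.encode('utf-16-le')             # the -le variant adds no BOM
    units = [int.from_bytes(b[i:i+2], 'little') for i in range(0, len(b), 2)]
    print(ch, len(b), 'bytes:', [hex(u) for u in units])
# A -> 2 bytes: ['0x41']
# € -> 2 bytes: ['0x20ac']
# (G clef) -> 4 bytes: ['0xd834', '0xdd1e']  (unit prefixes 110110 / 110111)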
UTF was created precisely to resist hardware and software changes. But, as we can see, it resists almost nothing once specific characters are used, and it even leaves an awkward opening to insert x86-based code into string data.