Any experience with OpenXMLDialog and Chinese language?

Kukulkan
Addict
Posts: 1415
Joined: Mon Jun 06, 2005 2:35 pm
Location: germany

Post by Kukulkan »

Hi,

I have been asked to translate some of my PB programs into Chinese. All of them use PB's XML dialog feature, and I wonder whether anyone has had good or bad experiences with Chinese. Any hints or tips? Any gadgets that cause trouble? Any known limitations?

BTW, yes, I use PB in Unicode mode anyway :wink:

Thanks,

Kukulkan
IdeasVacuum
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK

Post by IdeasVacuum »

It should not be a problem. With classic form GUIs, I have more issues with German than with Chinese (Mandarin Simplified Chinese), since German sometimes needs long words or long sentences, whereas Chinese is very condensed. :wink:
IdeasVacuum
If it sounds simple, you have not grasped the complexity.

Post by Kukulkan »

Hello IdeasVacuum,

Thanks for your reply. So you did Chinese and had no problems at all, especially with PB's XML layout library? This is great news.

Any tips or hints for me regarding Chinese? We will also have to translate into Mandarin Simplified Chinese, so it is the same task. Maybe I can drop you a PM in case we get stuck somewhere?

Best,

Volker
Shield
Addict
Posts: 1021
Joined: Fri Jan 21, 2011 8:25 am
Location: 'stralia!

Post by Shield »

The main problem you'll encounter when dealing with languages other than English is that,
depending on the characters, PB will report wrong string lengths and some functions
such as Mid(), UCase() etc. will not work correctly.

Generally, displaying these characters should work as it is more of an OS thing. :)
Blog: Why Does It Suck? (http://whydoesitsuck.com/)
"You can disagree with me as much as you want, but during this talk, by definition, anybody who disagrees is stupid and ugly."
- Linus Torvalds

Post by Kukulkan »

Hi Shield,

Thanks for your hints. We already deal with French and German umlauts and accents and have found no problems so far. Since PB is Unicode, there should be no problem, right? Mid(), UCase() and even RegExp should be capable of dealing with all Unicode characters. Otherwise PB would not be Unicode, and I thought it was...

Do you know about a specific case where it fails? So we can prepare for it...

Best Kukulkan

Post by Shield »

PB has no notion of Unicode characters* and does not deal with encoding at all.

(* The term "character" is extremely ambiguous in Unicode.)

PB stores strings internally in UCS-2, meaning two bytes per character. This covers only a
subset of the Unicode space (the Basic Multilingual Plane, U+0000 to U+FFFF), which is not
sufficient for everything, and in particular not for all Chinese characters.

Look at the following example:

Code: Select all

Define a$ = "XXX"   ; Check bottom of this post.
Debug Len(a$)       ; Returns 2.
Debug Mid(a$, 1, 1) ; Returns garbage.

So even though "XXX" is logically one character, it requires 4 bytes in UTF-8 and
two 16-bit units (a surrogate pair) in UTF-16. Because PB is not aware of this encoding,
it treats the character as two separate two-byte slots, which is why Len() returns 2
instead of 1 and why Mid() splits the character at the 16-bit boundary. You will
encounter the same problems (or even worse ones) with things like emoji.
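
The slot counting can be reproduced outside PB. What follows is only a sketch in Python (not PureBasic), chosen because Python strings count code points, which makes the contrast with 16-bit slots visible. The helper name `utf16_units` is made up for this illustration, and U+20000 is just one supplementary character standing in for the "XXX" placeholder:

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit slots the text occupies in UTF-16,
    i.e. what a length function that counts code units would report."""
    return len(text.encode("utf-16-le")) // 2

han_bmp = "\u4e2d"        # common Chinese character inside the 16-bit range
han_supp = "\U00020000"   # CJK Extension B character outside the 16-bit range

# Code points versus 16-bit slots for each character.
print(len(han_bmp), utf16_units(han_bmp))
print(len(han_supp), utf16_units(han_supp))

# The two slots of the supplementary character form a surrogate pair.
raw = han_supp.encode("utf-16-le")
high = int.from_bytes(raw[0:2], "little")
low = int.from_bytes(raw[2:4], "little")
print(hex(high), hex(low))
```

Cutting such a string between the two halves of the pair is what produces the garbage that Mid() returns in the PB example.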

If you are doing any such operations, you have to be extremely careful. Even comparison
for equality may not succeed because PB also does not do normalization.
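
To make the normalization point concrete, here is a small sketch in Python (again not PB; PB has no built-in equivalent that I know of). `nfc_equal` is a hypothetical helper name; `unicodedata.normalize` is the standard library call:

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# Both render identically, but a raw comparison sees different data.
print(composed == decomposed)

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both into NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc_equal(composed, decomposed))
```

Without a normalization step like this, two strings that look the same on screen can still compare as different.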

In PB's defense, lots of other languages suffer from the same problems, as this is
extremely difficult to get right. However, it's important to be aware of these things.


Edit: Now that's what I call irony! :mrgreen:
Please go to this website and copy one of the characters listed there and insert it where I wrote "XXX":
http://www.i18nguy.com/unicode/supplementary-test.html

As it turns out, the PB forum crashes when I include such a character in this post. :lol:

Edit 2: Posted a bug report:
http://www.purebasic.fr/english/viewtop ... =7&t=68286

Post by Kukulkan »

Hello Shield,

Thank you for the explanation. I was afraid of such problems :| We already found that, on the web server side, the MySQL database needs to use utf8mb4 as its charset, because MySQL's plain utf8 stores at most 3 bytes per character, which is not sufficient for all characters. Now the problem seems even worse on the PureBasic side: there is no choice at all, and UCS-2 has to fit.
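
For reference, the 3-byte versus 4-byte boundary can be checked with a few lines of Python (a sketch, not PB code). The sample characters are chosen arbitrarily, and `utf8_len` is a made-up helper name:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character needs in UTF-8."""
    return len(ch.encode("utf-8"))

samples = [
    ("Latin A", "A"),
    ("a-umlaut", "\u00e4"),
    ("CJK U+4E2D", "\u4e2d"),
    ("CJK Ext. B U+20000", "\U00020000"),
]

for label, ch in samples:
    n = utf8_len(ch)
    verdict = "fits MySQL utf8 (3 bytes max)" if n <= 3 else "needs utf8mb4"
    print(label, n, verdict)
```

Characters that need 4 bytes in UTF-8 are the same ones that need two 16-bit slots in a PB string.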

As you have already used Mandarin Simplified Chinese, do you have any experience with the translation? Will the translator be able to use only characters from the 16-bit range? Is it possible to write Mandarin Simplified Chinese within this limitation? Was your translator happy with it, or did he have to resort to (language) workarounds?

Thanks,

Kukulkan

Post by Shield »

Hi

Yes, utf8/utf8mb4 got me as well...hated MySQL for that.

I have been dealing with Chinese characters for quite some time now,
both in coding and for learning the language, so here is my experience:

Most languages don't have support for proper Unicode, mostly due to speed reasons
but also because Unicode is just extremely complex. You may have heard of ICU (http://site.icu-project.org/),
which is a huge library for dealing with these issues.

The languages I have used (Java, C#, JavaScript, PHP) all have the same issues, though their standard libraries are better equipped than PB's.

You have to ask yourself what exactly it is that you need to do in your application.
If you just need to translate messages in your applications to Chinese, you won't run into too many problems.
So if you ask your translator to translate lines in a UTF-8 text file which are then handed to functions like
SetGadgetText(), it should work "as is". The problems start when you need to parse or modify user input.

The way I do it is I treat strings as if they were raw memory buffers, preferably in UTF-8. UTF-8 has the advantage that several
string operations just work (except for normalization issues). For example, concatenation, comparing for equality,
and searching within a string will work and so will functions such as Trim() (if using UTF-8).
-> This should also work with PB's UCS-2, so as long as you stick to these operations you should be fine.

If you need to do operations on the actual text, such as UCase(), Mid(), or if string character length is important,
you will inevitably run into trouble. Not only because of Unicode but also because of cultural differences (e.g. UCase("i")
may work for English/German etc. but won't work for Turkish).
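
The casing problem is easy to show in Python (a sketch only; PB's UCase() may behave differently in detail). Python's `str.upper()` follows the default Unicode mapping and ignores locale, so it shows both the Turkish mismatch and a case where the string length changes:

```python
# Default Unicode uppercasing of "i" gives the dotless capital I ...
default_upper = "i".upper()

# ... but Turkish expects LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130).
turkish_upper = "\u0130"
print(default_upper, default_upper == turkish_upper)

# Case conversion can also change the number of characters:
# German sharp s (U+00DF) uppercases to two letters.
sharp_s = "\u00df"
print(sharp_s.upper(), len(sharp_s), len(sharp_s.upper()))
```

So code that uppercases text and then relies on character positions or lengths needs care even without any Chinese involved.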

So to give you better advice: what exactly is it your application does and what exactly are the translations used for?

Post by Kukulkan »

Thanks for your detailed information. We mostly deal with user-interface things, but our users also use the software to send and read messages. These are composed in Outlook and transferred as UTF-8 to the appliance (using an Outlook add-in). I believe Outlook has the same problem (2-byte Unicode), so I hope they have solved it somehow. We also deal with JSON, but escaping and UTF-8 should do the trick there, too. We use HTML a lot as well, and that content is always UTF-8 encoded. So far we have had no problems with German and French, but Chinese is different...

As we use UTF-8 and Unicode all the time, I keep my fingers crossed that it works for me. Do you know whether the 2-byte Unicode range covers the Simplified Chinese characters, or will we get into trouble with this?

Post by Shield »

"Simplified Chinese" is a misnomer here. Only roughly 2200 characters have been simplified,
all the rest remain in their traditional form so your application will have to support all of them.

So to answer your question, yes, it is very likely that you will run into trouble if you use any of the "unsafe" functions.

However, it seems that you will get away with it as your application seems to only be doing basic input/output.
If you are transferring messages in UTF-8, read them correctly into PB strings and then don't touch them, it should work correctly.
The same goes for the other direction. Keep in mind that PB strings can still hold all characters, it just happens that some
of them occupy more than one slot.
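
As a rough illustration of "read, don't touch, write back", here is a Python sketch (not PB) that pushes a UTF-8 message through a UTF-16 buffer and back, and lists characters that would occupy two slots. `non_bmp_chars` is a hypothetical helper, and the sample message is made up:

```python
def non_bmp_chars(text: str) -> list:
    """Characters that need two 16-bit slots (code points above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

# Four common characters, a space, and one supplementary character.
message = "\u4e0d\u5ba2\u6c14\u3002 \U00020000"

wire = message.encode("utf-8")                      # as received
units = wire.decode("utf-8").encode("utf-16-le")    # as held in 16-bit slots
back = units.decode("utf-16-le").encode("utf-8")    # as sent out again

print(back == wire)                    # untouched data survives the round trip
print(len(message), len(units) // 2)   # code points versus 16-bit slots
print([hex(ord(ch)) for ch in non_bmp_chars(message)])
```

A check like `non_bmp_chars` could also be run over the translator's text file to see whether any string leaves the 16-bit range at all.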

Post by Kukulkan »

Thank you Shield!

Post by Shield »

不客气。 :)

Post by IdeasVacuum »

... I would add that you need a font with the broadest possible Unicode coverage for your app. A font that does not fully support at least all of the characters of Simplified Chinese will display ??? instead. The Arial font delivered with Windows 7 (Professional?) should be good. However, in mainland China Windows XP is still King (or should that be Emperor?). Windows XP is delivered with only one font that covers Simplified Chinese well: Microsoft Sans Serif (absolutely not the same as MS Sans Serif). If you have customers in other Chinese regions such as Taiwan, those fonts should be OK for their Traditional Chinese requirement.

Once you have found a font that suits your requirements, it is best to distribute it with your app so that the customers see what you see. You can't rely on the customers already having the exact same font, and Windows' automatic font substitution is nowhere near as smart as Microsoft think it is...

Post by Shield »

This is a good point. However, standard controls on Windows should be able to deal
with it by automatically rendering characters in a fallback font. Not sure if this is also the case for XP.

Post by Kukulkan »

Does either of you know a good font that may be redistributed and covers most European as well as Chinese characters?

The list found here (https://en.wikipedia.org/wiki/Unicode_f ... code_fonts) is incredibly long and complex. Is there any font you can recommend that also looks good for European languages?

Also, the font will become the biggest part of the software setup, blowing it up from the current 6MB to somewhere around 25MB (if I see this right). Is it really necessary to ship the fonts?

Kukulkan