Any experience with OpenXMLDialog and Chinese language?

Kukulkan · Post by **Kukulkan** » Tue Apr 04, 2017 7:22 am

Hi,

I was asked to translate some of my PB programs into Chinese. All of them use the XMLDialog feature of PB, and I wonder whether anyone has had good or bad experiences with Chinese. Any hints or tips? Any gadgets causing trouble? Any known limitations?

BTW, yes, I use PB in Unicode mode anyway.

Thanks,

Kukulkan

IdeasVacuum · Post by **IdeasVacuum** » Sun Apr 09, 2017 1:41 am

Should not be a problem. With classic form GUIs, I have more issues with German than with Chinese (Mandarin, Simplified Chinese), since German sometimes needs long words or long sentences, whereas Chinese is very condensed.

Kukulkan · Post by **Kukulkan** » Mon Apr 10, 2017 11:35 am

Hello IdeasVacuum,

thanks for your reply. So you did Chinese and had no problems at all? Especially with the XML Layout library of PB? This is great news.

Any tips or hints for me regarding Chinese? We will also have to translate into Mandarin Simplified Chinese, so it is the same task. Maybe I can drop you a PM in case we get stuck somewhere?

Best,

Volker

Shield · Post by **Shield** » Mon Apr 10, 2017 12:48 pm

The main problem you'll encounter when dealing with languages other than English is that,
depending on the characters, PB will report wrong string lengths and some functions
such as Mid(), UCase() etc. will not work correctly.

Generally, displaying these characters should work as it is more of an OS thing.

Kukulkan · Post by **Kukulkan** » Mon Apr 10, 2017 1:49 pm

Hi Shield,

thanks for your hints. We already deal with French and German umlauts and accents and found no problem so far. As PB is unicode, there should be no problem, right? Mid(), UCase() and even RegExp should be capable of dealing with all the unicode characters. Otherwise, PB would not be unicode and I thought it is...

Do you know about a specific case where it fails? So we can prepare for it...

Best Kukulkan

Shield · Post by **Shield** » Mon Apr 10, 2017 2:04 pm

PB has no notion of Unicode characters* and does not deal with encoding at all.

(* The term "character" is extremely ambiguous in Unicode.)

PB stores strings internally in UCS-2, meaning two bytes per character. This only works for
a subset of the Unicode space but is nowhere near sufficient, especially not for Chinese.

Look at the following example:

```purebasic
Define a$ = "XXX"   ; Replace XXX with one supplementary character (see the edit at the bottom of this post).
Debug Len(a$)       ; Returns 2.
Debug Mid(a$, 1, 1) ; Returns garbage.
```

So even though "XXX" is logically one character, it requires 4 bytes in UTF-8 and does not fit
into a single 16-bit slot. Because PB does not deal with encoding, it stores the character in
two two-byte slots, which is why Len() returns 2 instead of 1 and why Mid() splits the character
at the 16-bit boundary. You will encounter the same problems (or even worse ones)
with things like emoji.

If you are doing any such operations, you have to be extremely careful. Even comparison
for equality may not succeed because PB also does not do normalization.

In PB's defense, lots of other languages suffer from the same problems, as this is
extremely difficult to get right. However, it's important to be aware of these things.
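
The Len() behavior described above can be worked around by counting code points instead of 16-bit slots. The following is only a minimal sketch, assuming PB runs in Unicode mode (two bytes per slot); `CodePointCount()` is a hypothetical helper, not a PB built-in. The test character U+20000 is assembled from its two surrogate values so that no supplementary character has to appear in the source (or in this post).

```purebasic
; Hypothetical helper: counts code points rather than 16-bit slots.
; Assumes Unicode mode, i.e. SizeOf(Character) = 2.
Procedure.i CodePointCount(Text$)
  Protected i, Unit, Count
  Protected Length = Len(Text$)
  For i = 0 To Length - 1
    Unit = PeekU(@Text$ + i * SizeOf(Character))
    ; A trailing (low) surrogate $DC00-$DFFF belongs to the slot before it,
    ; so it must not be counted as a character of its own.
    If Unit < $DC00 Or Unit > $DFFF
      Count + 1
    EndIf
  Next
  ProcedureReturn Count
EndProcedure

; U+20000 written as its surrogate pair.
Define a$ = Chr($D840) + Chr($DC00)
Debug Len(a$)            ; number of 16-bit slots
Debug CodePointCount(a$) ; number of code points
```

Note that this still counts code points, not what a user perceives as one character; combining marks and emoji sequences would need more work.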

Edit: Now that's what I call irony!

Please go to this website and copy one of the characters listed there and insert it where I wrote "XXX":
http://www.i18nguy.com/unicode/supplementary-test.html

As it turns out, the PB forum crashes when I include such a character in this post.

Edit 2: Posted a bug report:
http://www.purebasic.fr/english/viewtop ... =7&t=68286

Kukulkan · Post by **Kukulkan** » Mon Apr 10, 2017 3:55 pm

Hello Shield,

thank you for the explanation. I was afraid of such problems.

We already found that, on the webserver side, the MySQL database needs to use utf8mb4 as its charset, because plain utf8 only stores up to 3 bytes per character, which is not sufficient for all characters. Now the problem seems even worse on the PureBasic side. There is no choice at all and UCS-2 has to fit.
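
To find out whether a given text would actually need the 4-byte range, one can scan the PB string for surrogate values. This is a minimal sketch, assuming Unicode mode; `NeedsFourByteUtf8()` is a hypothetical helper name and not part of PB.

```purebasic
; Hypothetical helper: returns #True if Text$ contains a surrogate,
; i.e. a character outside the 16-bit range that takes 4 bytes in UTF-8.
Procedure.i NeedsFourByteUtf8(Text$)
  Protected i, Unit
  Protected Length = Len(Text$)
  For i = 0 To Length - 1
    Unit = PeekU(@Text$ + i * SizeOf(Character))
    If Unit >= $D800 And Unit <= $DFFF
      ProcedureReturn #True
    EndIf
  Next
  ProcedureReturn #False
EndProcedure

; U+4E2D is a common Chinese character inside the 16-bit range.
Debug NeedsFourByteUtf8("abc" + Chr($4E2D))
; U+20000 lies outside it and is stored as a surrogate pair.
Debug NeedsFourByteUtf8(Chr($D840) + Chr($DC00))
```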

As you already used Mandarin Simplified Chinese, do you have any experience with the translation? Will the translator be able to use only characters from the 16-bit charset? Is it possible to use Mandarin Simplified Chinese with this limitation? Was your translator happy with this limitation, or did he have to use workarounds (language workarounds)?

Thanks,

Kukulkan

Shield · Post by **Shield** » Mon Apr 10, 2017 4:26 pm

Hi

Yes, utf8/utf8mb4 got me as well...hated MySQL for that.

I have been dealing with Chinese characters for quite some time now,
both in coding and for learning the language, so here is my experience:

Most languages don't have support for proper Unicode, mostly due to speed reasons
but also because Unicode is just extremely complex. You may have heard of ICU (http://site.icu-project.org/),
which is a huge library for dealing with these issues.

Languages I used (Java, C#, JavaScript, PHP) all have the same issues, though the standard libraries are better equipped than PB's.

You have to ask yourself what exactly it is that you need to do in your application.
If you just need to translate messages in your applications to Chinese, you won't run into too many problems.
So if you ask your translator to translate lines in a UTF-8 text file which are then handed to functions like
SetGadgetText(), it should work "as is". The problems start when you need to parse or modify user input.
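
As a rough illustration of that workflow, here is a sketch that loads translated lines from a UTF-8 text file into a map. The file name `lang_zh.txt`, the `key=value` line format, the key `ok_button` and the procedure name are all assumptions made for this example; none of them come from the thread.

```purebasic
NewMap Translation$()

; Hypothetical loader: reads "key=value" lines from a UTF-8 text file.
Procedure.i LoadTranslations(FileName$, Map Result$())
  Protected File, Line$, Pos
  File = ReadFile(#PB_Any, FileName$)
  If File = 0
    ProcedureReturn #False
  EndIf
  ReadStringFormat(File) ; skips a BOM if the file starts with one
  While Not Eof(File)
    Line$ = ReadString(File, #PB_UTF8)
    Pos = FindString(Line$, "=")
    If Pos > 1
      ; The line is cut only at the ASCII "=", never inside the translated text.
      Result$(Left(Line$, Pos - 1)) = Mid(Line$, Pos + 1)
    EndIf
  Wend
  CloseFile(File)
  ProcedureReturn #True
EndProcedure

If LoadTranslations("lang_zh.txt", Translation$())
  ; The value would be passed on unchanged, e.g. to SetGadgetText().
  Debug Translation$("ok_button")
EndIf
```

Splitting at "=" is safe here because that character always occupies exactly one slot and can never be half of a surrogate pair.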

The way I do it is I treat strings as if they were raw memory buffers, preferably in UTF-8. UTF-8 has the advantage that several
string operations just work (except for normalization issues). For example, concatenation, comparing for equality,
and searching within a string will work and so will functions such as Trim() (if using UTF-8).
-> This should also work with PB's UCS-2, so as long as you stick to these operations you should be fine.
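
A small sketch of these "safe" operations on PB strings, assuming Unicode mode. The supplementary character U+20000 is again built from its surrogate pair. Whether the UTF-8 round trip at the end preserves such characters on every platform is something I would test rather than assume, which is why the result is only printed.

```purebasic
; U+20000 built from its surrogate pair so the source stays plain ASCII.
Define Pair$ = Chr($D840) + Chr($DC00)
Define Text$ = "abc" + Pair$ + "def"      ; concatenation keeps both slots together

Debug FindString(Text$, Pair$)            ; position is counted in 16-bit slots
Debug Bool(Text$ = "abc" + Pair$ + "def") ; equality is checked slot by slot

; Round trip through a UTF-8 buffer without modifying the content.
Define *Buffer = UTF8(Text$)
If *Buffer
  Debug Bool(PeekS(*Buffer, -1, #PB_UTF8) = Text$)
  FreeMemory(*Buffer)
EndIf
```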

If you need to do operations on the actual text, such as UCase(), Mid(), or if string character length is important,
you will inevitably run into trouble. Not only because of Unicode but also because of cultural differences (e.g. UCase("i")
may work for English/German etc. but won't work for Turkish).

So to give you better advice: what exactly is it your application does and what exactly are the translations used for?

Kukulkan · Post by **Kukulkan** » Mon Apr 10, 2017 4:53 pm

Thanks for your detailed information. We mostly deal with user-interface things, but our users also use the software to send and read messages. These are composed in Outlook and transferred as UTF-8 to the appliance (using an Outlook add-in). I believe Outlook has the same problem (2-byte Unicode), so I hope they solved it somehow. We also deal with JSON, but escaping and UTF-8 should do the trick there as well. We use HTML a lot too, and that content is always UTF-8 encoded. So far there have been no problems with German and French, but Chinese is different...

As we use UTF-8 and Unicode all the time, I keep my fingers crossed that it works for me. Do you know whether the 2-byte Unicode range covers the Simplified Chinese characters, or will we get into trouble with this?

Shield · Post by **Shield** » Mon Apr 10, 2017 4:59 pm

"Simplified Chinese" is a misnomer here. Only roughly 2200 characters have been simplified,
all the rest remain in their traditional form so your application will have to support all of them.

So to answer your question, yes, it is very likely that you will run into trouble if you use any of the "unsafe" functions.

However, it seems that you will get away with it as your application seems to only be doing basic input/output.
If you are transferring messages in UTF-8, read them correctly into PB strings and then don't touch them, it should work correctly.
The same goes for the other direction. Keep in mind that PB strings can still hold all characters, it just happens that some
of them occupy more than one slot.

Kukulkan · Post by **Kukulkan** » Tue Apr 11, 2017 6:58 am

Thank you Shield!

Shield · Post by **Shield** » Tue Apr 11, 2017 9:22 am

不客气。

IdeasVacuum · Post by **IdeasVacuum** » Tue Apr 11, 2017 11:16 am

... I would add that you need a font with the best possible Unicode coverage for your app. A font that does not fully support at least all of the characters of Simplified Chinese will display ??? instead. Delivered with Windows 7 (Professional?), the Arial font should be good. However, in mainland China Windows XP is still King (or should that be Emperor?). Windows XP is delivered with only one font that covers Simplified Chinese well: Microsoft Sans Serif (absolutely not the same as MS Sans Serif). If you have customers in other Chinese regions such as Taiwan, those fonts should be OK for their Traditional Chinese requirement.

Once you have found a font which suits your requirements, it is best to distribute it with your app so the customers see what you see. You can't rely on the customers already having the exact same font and Windows automatic font substitution is nowhere near as smart as Microsoft think it is....
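
If you do ship a font, a minimal sketch of loading it at run time could look like this. The file path `fonts/MyCjkFont.ttf` and the family name "My CJK Font" are placeholders, not fonts recommended in this thread, and you would have to check the font's license before redistributing it.

```purebasic
; Placeholder path and family name; replace them with the font you actually ship.
If RegisterFontFile("fonts/MyCjkFont.ttf")
  If LoadFont(0, "My CJK Font", 10)
    ; Make it the default font for all gadgets created after this call.
    SetGadgetFont(#PB_Default, FontID(0))
  EndIf
EndIf
```

RegisterFontFile() makes the font available to the running program only, so nothing has to be installed system-wide on the customer's machine.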

Shield · Post by **Shield** » Tue Apr 11, 2017 12:13 pm

This is a good point. However, standard controls on Windows should be able to deal
with it by automatically rendering characters in a fallback font. Not sure if this is also the case for XP.

Kukulkan · Post by **Kukulkan** » Tue Apr 11, 2017 1:44 pm

Does either of you know a good font that may be redistributed and covers most European and also Chinese characters?

The list found here (https://en.wikipedia.org/wiki/Unicode_f ... code_fonts) is incredibly long and complex. Is there any font you can recommend that also looks good in Europe?

Also, the font will become the biggest part of the software setup, blowing up the current setup from 6MB to somewhere around 25MB (if I see this right). Is it really necessary to ship the fonts?

Kukulkan
