Removing 'ASCII' switch from PureBasic

Post by **Fred** » Thu Aug 07, 2014 9:33 pm

Thanks for trying anyway

juror · Post by **juror** » Thu Aug 07, 2014 10:01 pm

Maybe an Ascii to unicode guide or tips.

Some sort of forum section for those who need help and/or can offer help/tips/suggestions, so we have one place to look and where maybe even some "experts" (Fred/Freak) could visit occasionally?

IdeasVacuum · Post by **IdeasVacuum** » Thu Aug 07, 2014 11:05 pm

Is it the case that the most prominent use of ASCII mode is for embedded hardware and older network protocols? Meaning that kind of work does not need Unicode (indeed, sometimes may not even need a GUI).

So, why not 'freeze' a PB5.23LTS or PB5.40 version as a critical-maintenance-only Ascii version - PureBasic Ascii (still including the Unicode switch, no extra effort or changes required). Thereafter, PB5.xxLTS, PB5.50 etc. are Unicode only.

Danilo · Post by **Danilo** » Thu Aug 07, 2014 11:15 pm

Yep. All current PureBasic Versions < 5.4x would still work for years to come. No problem at all.

Tenaja · Post by **Tenaja** » Thu Aug 07, 2014 11:56 pm

freak wrote:I want to clear up some points:
...
About the speed:
A unicode program is definitely not slower than an ascii one. The reason is that the entire OS layer is Unicode (at least on Windows), so in an ascii program, every call to an API function must be converted from ascii->unicode and back for the result. Even if the program uses only minimum OS interaction, the difference from the longer strings is pretty small, so for the average program, unicode mode is a gain in performance.

This is making big assumptions about how heavily the string functions are used. Every benchmark I have seen and/or tried shows PB's Unicode 60-80 percent slower. When strings are the primary function of the project, that is enormous. (BTW, I chose PB over C primarily because of the native strings.)

Have you benchmarked the PB compiler itself? How long does it take to compile the IDE when the compiler is in ascii mode, vs unicode? I would be very interested in that real-world comparison. Here is one that does concatenation as well as search, with timings of two consecutive tests on my Core i5:

Code: Select all

; ASCII vs. Unicode speed test
; Ascii:	18977, 20155
; Unicode:	35464, 33615
s.s = "My start string"
#Times = 100000
T = ElapsedMilliseconds()
For I = 0 To #Times
	s = s + Chr(Random(122, 32))
	o = FindString(s,"add", 1)
Next
T1 = ElapsedMilliseconds()-T		;564 at 100,000, 18288 at 500,000
MessageRequester("", Str(T1))

freak wrote:About the size:
A unicode program will need space for the longer strings, but this too is not really an issue. To check a real life case, the following are the sizes of the PureBasic IDE (a 100k lines PB program) compiled in both modes:

ASCII: 2.979 KB
Unicode: 3.117 KB

So it's about 5% more size. Yes, this is a difference, but it is really not an issue in a time where hard drive sizes are measured in TB. That is just my personal opinion though.

Again, you are making big assumptions about how the string functions are used. Maybe for something as simple as an IDE (where 90% of the time is waiting for user input rather than actually processing strings, and where the files are relatively small), the application size is similar, but when you are handling data that is significantly larger than the actual application (i.e. "small" files are measured in MB), then the string library efficiency and speed begins to become significant.

Rescator · Post by **Rescator** » Fri Aug 08, 2014 1:38 am

Never thought I'd see so many programmers say such dumb things or speak out with such ignorance about something, I feel almost ashamed to be on the same forum as some of you here now.

FACT: Windows 2000, XP, Vista, 7, 8, and all the server variants and other variants of the Windows NT core are Unicode, all OS API functions are natively Unicode: using ASCII calls on a Unicode system incurs a small overhead as the OS needs to convert to/from ASCII each time you use a call.

FACT: Fred is not removing the ability to read/write Ascii strings, you will still be able to do that just like you can with UTF8 strings (the first 128 characters of which are ASCII also, BTW).
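
For anyone who wants to check the ASCII/UTF-8 overlap rather than take it on faith, here is a minimal sketch. It is written in Python rather than PureBasic purely so it can be run anywhere; the variable names are made up for illustration and nothing here is PureBasic API.

```python
# Every 7-bit ASCII character (code points 0..127) encodes to the same
# single byte in ASCII and in UTF-8.
ascii_text = "".join(chr(cp) for cp in range(128))

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

same_bytes = as_ascii == as_utf8      # identical byte sequences
one_byte_each = len(as_utf8) == 128   # no character needed more than one byte

# The first code point past ASCII already needs two bytes in UTF-8.
first_non_ascii_len = len(chr(128).encode("utf-8"))

print(same_bytes, one_byte_each, first_non_ascii_len)
```

So a pure 7-bit ASCII file is already a valid UTF-8 file, which is why reading and writing such data does not change.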

FACT: Fred wants to make PureBasic store and process strings as UCS-2 internally (what you know as the Unicode mode); this is actually a variant of UTF-16, and the native Microsoft implementation.

FACT: Unicode does not change anything with sound/audio quality; if it does, it's because somebody did something really stupid and used string processing with audio. Windows uses neither Ascii nor Unicode for audio, and for those that use BASS Audio Library or similar, BASS also supports Unicode (get the latest version, read the manual, there are Unicode flags for all things related to filepaths and strings there).
NOTE: trust me, my own stream player has been Unicode pretty much since the start and uses BASS and has no issue with 16bit nor 32bit (floating point) audio.

FACT: You will still be able to read/write Ascii and UTF8 to and from files (and XML is UTF-8 by default as per the XML standard itself anyway).

FACT: If you have your own routines for dealing with strings and you treat them as .b type instead of .c then you need to recode all that. If you already use .c or the Character structure then you probably do not need to change a thing. (Assuming you properly tested both the Ascii and Unicode executables; you did that, right?)
NOTE: When Fred added Unicode and UTF8 to PureBasic is when I felt that PureBasic had finally caught up. My stream player works happily on Windows 2000 compiled as Unicode with PB5.30, Ascii was obsolete since Windows NT 4.

FACT: Testing string performance with the debugger is very slow; if you want to test string performance then do it without the debugger.

FACT: Fred's mistake is perhaps that he did not do this for v5.30 instead of 5.40.
NOTE: Fred should probably have dumped Ascii compile mode many years ago.

FACT: Unicode filesize increase is minimal, unless you plan to write a novel in PureBasic. Help text and a manual can be stored in separate file and you can use deflate based compression (like zlib or zip) if you want to minimize size.
NOTE: Or store the text for the help and manual as UTF8 text instead, this is what I do, all text from/into the program is UTF8 as that is not endian influenced (as UTF16 is), and Unicode in the program itself and I interface only with Unicode OS API calls.
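
The endianness point can be seen with a few lines. This is a small sketch in Python (not PureBasic), using the standard codecs only; the sample text and variable names are illustrative assumptions.

```python
text = "Hågensen"

utf16_le = text.encode("utf-16-le")  # little-endian 16-bit units, no BOM
utf16_be = text.encode("utf-16-be")  # big-endian 16-bit units, no BOM
utf8 = text.encode("utf-8")          # only one byte order exists

# 'H' is U+0048: its two bytes swap places between the UTF-16 flavours.
print(utf16_le[:2].hex(), utf16_be[:2].hex())

# Reading UTF-16 with the wrong byte order silently yields different text.
wrong = utf16_le.decode("utf-16-be")
round_trip_ok = utf8.decode("utf-8") == text

print(wrong == text, round_trip_ok)
```

With UTF-16 on disk or on the wire, both sides have to agree on the byte order (or use a BOM); with UTF-8 there is nothing to agree on.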

FACT: Unicode exes will not run/work properly on anything older than Windows 2000, but considering newer PureBasic versions no longer support the old Windows 9x line this is not an issue.
NOTE: Why people are making such a big fuss over Fred wanting to use 16bit characters internally in PureBasic I don't understand. There was less noise and panic when Fred stopped supporting the old Windows 9x line.

FACT: People complain that Fred doesn't improve PureBasic or doesn't add new features or change this or that, but when he wishes to get rid of something that is no longer needed and that should no longer be used (Ascii mode is part of the Windows 3.1 and Windows 95 legacy after all) by anyone and people start complaining about that?
NOTE: You people are really weird and backwards, besides you still have PB 5.20 for legacy support right?

FACT: Ascii will never "vanish": as mentioned before, commands to handle it will still be there. Certain OS API calls and some 3rd party libraries still use/need Ascii; this is where PokeS and PeekS come in, this is where the pseudo types come in.

Let me repeat this in a way so all of you can understand this, YOU WANT ALL CAPS? AND BOLD TEXT?

PUREBASIC WILL ONLY STORE/PROCESS TEXT AS UNICODE INTERNALLY FROM NOW ON AS IF UNICODE MODE WAS ALWAYS ON, AND API CALLS THAT HAVE "A" and "W" VARIANTS WILL HAVE THE "A" VARIANT DROPPED AS THEY ARE NO LONGER NEEDED/USED. PUREBASIC COMMANDS TO CONVERT TO/FROM ASCII WILL STILL REMAIN BECAUSE THERE WILL ALWAYS BE ASCII/ISO8859-1/LATIN-1/WHATEVER TEXT OUT THERE.

*sigh* and I thought "programmers" were supposed to be smart.

I dropped the use of Ascii in my software years ago. The only thing I needed to do was look over and fix some of my text/string handling, as I did a few stupid things like treating text as if it was a string of bytes. That went away once I started to treat text as a string of character units (a character may use more than one unit to represent unusual letters, so even with 16-bit units a character may take up more than 2 bytes; 2 or 4 per code point in UTF-16, if I recall correctly).
If your code is really old then the least of your worries is Ascii; you probably have a ton of other old crappy code you need to fix anyway.
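
The "bytes versus character units" distinction is easy to see in numbers. The following is a rough sketch in Python (not PureBasic); the sample strings and variable names are chosen only for illustration.

```python
name = "Roger Hågensen"  # 14 characters, one of them outside ASCII

chars = len(name)                             # number of code points
utf8_bytes = len(name.encode("utf-8"))        # 'å' takes 2 bytes, the rest 1
utf16_bytes = len(name.encode("utf-16-le"))   # each of these takes one 16-bit unit

# A character outside the Basic Multilingual Plane needs a surrogate pair,
# i.e. two 16-bit units (4 bytes) in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
clef_units = len(clef.encode("utf-16-le")) // 2

print(chars, utf8_bytes, utf16_bytes, clef_units)
```

Code that assumes "one byte per character" breaks on the second and third numbers; code that assumes "one 16-bit unit per character" only breaks on the last one.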

FACT: If your software already supports Ascii and Unicode, well now you can just get rid of the Ascii CompilerIf stuff (or make the Unicode CompilerIf the default if vice versa).
FACT: If you already only use Unicode you don't have to change a thing.

FACT: And if you only use/support Ascii....The 1980s called and they....don't want you back!

(that last one was a joke).

Oh, and the all caps and bold text was me being ironic/sarcastic, but judging from the low intelligence/panic behavior around here I'll assume nobody noticed that either.

juror · Post by **juror** » Fri Aug 08, 2014 1:59 am

And if you don't understand Rescator you need to understand Dunning-Kruger.

skywalk · Post by **skywalk** » Fri Aug 08, 2014 2:24 am

@rescator - well said. But, (there's always but's), why can't we have our cake and eat it too? I mean, there are double and float and quad and long datatypes and they have their place. I'm all for a more robust compiler and the benefits of a less stressed development team, but it is unclear to me from a user point of view, why Ascii$ cannot coexist with Unicode$$? This way, there is no Unicode switch. All api/gadgets must use wide character calls but allow the user to choose how to represent and manipulate strings according to their specific needs.
By the way, all my code works in Unicode or Ascii compile. But, Unicode requires more steps to debug and has subtle error conditions when sorting or compressing in and out of ascii data.
Nothing is impossible. We are programmers. But, it is painful to see a useful datatype obsoleted.

Rescator · Post by **Rescator** » Fri Aug 08, 2014 2:49 am

Tenaja wrote:This is making big assumptions about how heavily the string functions are used. Every benchmark I have seen and/or tried shows PB's Unicode 60-80 percent slower. When strings are the primary function of the project, that is enormous. (BTW, I chose PB over C primarily because of the native strings.)

Have you benchmarked the PB compiler itself? How long does it take to compile the IDE when the compiler is in ascii mode, vs unicode? I would be very interested in that real-world comparison. Here is one that does concatenation as well as search, with timings of two consecutive tests on my Core i5:
(code)
...

Again, you are making big assumptions about how the string functions are used. Maybe for something as simple as an IDE (where 90% of the time is waiting for user input rather than actually processing strings, and where the files are relatively small), the application size is similar, but when you are handling data that is significantly larger than the actual application

If you are handling that much text then you do not store that in the program, you use binary files or databases (which are designed for such massive data handling).
That loop code uses ElapsedMilliseconds; you might want to use timeGetTime_() instead (older versions of PB had a less accurate ElapsedMilliseconds) and timeBeginPeriod to increase the timer resolution. Remember to use timeEndPeriod when done, though.

Also, it would be interesting to see the actual generated code for that loop. Is it really Unicode all the way through?
Moving a pointer two bytes at a time and doing a 16-bit character comparison should be faster than moving 1 byte at a time and doing an 8-bit character comparison.
If the text being compared goes from Unicode to Ascii to Unicode (comparison) to Ascii to Unicode then obviously there would be an overhead.
(That is assuming the Unicode text is fixed 16-bit and not multibyte. Linux and wxWidgets seem to use fixed 4-byte (32-bit) characters for that reason.)

Your example is also horribly flawed.
If that was actual code then the random characters would be pre-generated and pulled from a table instead. You are also just generating random values below 255; Unicode has characters above that, so to be fair you would need to generate Unicode characters too (and do a conversion to Ascii in the process, which would negate most speed differences).
Also, you keep increasing the string length, which means a new string is created, the old one is copied, the new text is concatenated to it, and then the old one is freed.
A proper program would use a buffer setup of some sort, and instead of FindString it would be faster to use a pointer to a Character structure and write your own code, possibly even in assembler.

"It's just an example" you say, but that's the issue, isn't it? That code is not a real program; no program in the wild does that. You claim Freak makes assumptions and ask about the IDE and real-world numbers, then you go making your own assumptions with non-real test code? What hypocritical double standard is that?

I'm sure that your example is of interest to the PB team and I'm sure they will look closely at various string handling code from now on. Why? Because now they do not have to tiptoe around and make sure the code works with both Ascii and Unicode; now they can actually do some string processing optimization they could not do before.

Another thing you should check (I did) is to make a table to keep Chr and Random from cluttering things up.
And this is interesting: if you use #PB_String_NoCase on FindString, the time it takes increases a lot for both Ascii and Unicode.
The issue is not the Unicode string but the FindString function. A case-sensitive search can technically do a simple binary comparison (more or less; it's tricky if wide-char or multibyte is involved), while NoCase has to fold case for every character it compares. I don't know what FindString does internally, so I cannot explain the size of the difference. The issue could even be the CRT lib function used (or not used).

BTW! Ascii is limited to 7 bits. The "Ascii" in PureBasic is mislabeled: it's actually an 8-bit codepage, and if you are to compare "Ascii" text with "Ascii" text then you may need to do codepage conversions (especially if your program is written with a different codepage than the codepage on the user's system). With Unicode you no longer have to worry about codepages.
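
To make the codepage problem concrete, here is a short sketch in Python (not PureBasic) that decodes one and the same byte under a few common 8-bit codepages. The byte value and the choice of codepages are just an illustration.

```python
raw = bytes([0xE5])  # a single "Ascii" byte, as found in an 8-bit text file

as_latin1 = raw.decode("latin-1")  # ISO 8859-1
as_cp1252 = raw.decode("cp1252")   # Windows Western European
as_cp437 = raw.decode("cp437")     # original IBM PC / DOS codepage
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic

# Real 7-bit ASCII has no meaning for this byte at all.
try:
    raw.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

print(as_latin1, as_cp1252, as_cp437, as_cp1251, is_ascii)
```

The byte is identical in every case; only the assumed codepage changes which letter comes out, which is exactly the information an "Ascii" file does not carry.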

Also note that "Ascii" can only represent 256 characters, while Unicode (or in this case UCS-2 or UTF-16) can represent about 65,000 directly in single 16-bit units and up to about 1.1 million different characters (the full potential of the Unicode range).
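
The sizes involved can be worked out directly from the bit widths and the defined Unicode range (U+0000 to U+10FFFF). A small arithmetic sketch in Python, with illustrative variable names:

```python
eight_bit_values = 2 ** 8             # values an 8-bit codepage can hold
sixteen_bit_values = 2 ** 16          # values a single 16-bit unit can hold
code_space = 0x10FFFF + 1             # all Unicode code points
surrogates = 0xDFFF - 0xD800 + 1      # 16-bit values reserved for surrogate pairs
single_unit_chars = sixteen_bit_values - surrogates

# Python's chr() enforces the same upper limit as the Unicode code space.
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

print(eight_bit_values, sixteen_bit_values, single_unit_chars, code_space)
```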

That example of yours would lend itself better to binary stream manipulation functions (add one byte, compare memory range, rinse and repeat) rather than character/string manipulation.

RichAlgeni · Post by **RichAlgeni** » Fri Aug 08, 2014 2:53 am

I have the same issue as Skywalk. My code interfaces with quite a few external processes and machines. A couple of them seem to have been written in the 70's. Not one of them supports Unicode right now.

If it helps create a more advanced and stable product, then I am for it as well, and will adapt as needed.

A mentor of mine maintained that 'once you have the patient on the table, make all the cuts you need to make.' If we could get SSE42 based string functions as Wilbert suggested, maybe this would be a little less painful?

On a side note, and if I'm not mistaken, all of Microsoft's functions are now Unicode.

Rescator · Post by **Rescator** » Fri Aug 08, 2014 3:06 am

skywalk wrote:why can't we have our cake and eat it too? I mean, there are double and float and quad and long datatypes and they have their place. I'm all for a more robust compiler and the benefits of a less stressed development team, but it is unclear to me from a user point of view, why Ascii$ cannot coexist with Unicode$$?

There is no need for ascii (internally), it is no different than a row of bytes with a 0 at the end. From a user point of view you will be oblivious to the changes, except that suddenly a lot of PureBasic programs will be able to display my name "Roger Hågensen" correctly, or that "foreign" characters will look and work properly in filenames.

skywalk wrote:All api/gadgets must use wide character calls but allow the user to choose how to represent and manipulate strings according to their specific needs.

Actually, all API/Gadgets/String gadgets etc. are Unicode today, you just don't know it. Silently behind the scenes the OS translates strings to/from Ascii and Unicode (or wide/multibyte as it may be called); the only Windows OS where this does not happen is the old Windows 9x line when you use Ascii. On the NT line of Windows the conversion is always done if the exe is in Ascii mode. (I thought I explained that earlier?)

skywalk wrote:By the way, all my code works in Unicode or Ascii compile.

Then you will probably have 0 issues due to this change.

skywalk wrote:Unicode requires more steps to debug and has subtle error conditions when sorting or compressing in and out of ascii data.

Unicode does not require more steps, if you are talking about reading from a text file then that is not a issue of Unicode.
In my case I read and write to and from text files as UTF8 to preserve all characters. If you read to/from text files in Ascii, how do you know what codepage the text is stored as? You can't.
As to sorting and compression, sorting should not be an issue. While Unicode does allow more than one way to represent a character, the shortest/simplest is usually used, and Unicode strings are/should be normalized for comparison (which means that things like Å and å and A and A are taken into consideration).
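
A minimal illustration of what normalizing before comparison means, sketched in Python with the standard `unicodedata` module (this is not PureBasic code, and the variable names are only for illustration):

```python
import unicodedata

composed = "\u00C5"      # Å stored as one code point
decomposed = "A\u030A"   # A followed by COMBINING RING ABOVE; renders the same

raw_equal = composed == decomposed  # a plain binary comparison sees two different strings

# After normalization to the same form (NFC here) they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Case is a separate step: fold case after normalizing for a caseless match.
caseless_equal = (unicodedata.normalize("NFC", "\u00E5").casefold()
                  == unicodedata.normalize("NFC", decomposed).casefold())

print(raw_equal, nfc_equal, caseless_equal)
```

Sorting or de-duplicating without this step is one plausible source of the "subtle error conditions" mentioned in the quote.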

skywalk wrote:But, it is painful to see a useful datatype obsoleted.

You do realize that the "datatype" .a is basically just .b, right?
How often do you manually manipulate string characters? If you support both Unicode and ascii then you are using .c and SizeOf(Character) instead, right?
As to strings themselves a string$ type is automatic and in that case the datatype is never directly used so not an issue at all.

You will still be able to manipulate ascii binary data as it is identical to binary data; it's just the internal string representation that will no longer be ascii, so doing nasty tricks and hacks (that worked in the past in Ascii mode) will break your program.
But if your program supports both Ascii and Unicode then you aren't doing any nasty hacks anyway, right?

Rescator · Post by **Rescator** » Fri Aug 08, 2014 3:30 am

RichAlgeni wrote:I have the same issue as Skywalk. My code interfaces with quite a few external processes and machines. A couple of them seem to have been written in the 70's. Not one of them supports Unicode right now.

That part is pretty simple. You use PeekS or PokeS or you use pseudo types for Import and dlls etc. I have to mess with some of that today in some of my projects because the code in 3rd party stuff is archaic. (luckily BASS started supporting Unicode in most areas)
A lot of network stuff uses Ascii-7 still, but I just ignore that and use Ascii-8 (or rather treat it as a binary stream instead).
Network stuff also uses Big Endian values, so strings are not the only issue there.
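
As a quick illustration of the byte-order point, here is a sketch in Python (not PureBasic) using the standard `struct` module; the value and variable names are arbitrary.

```python
import struct

length = 0x12345678  # e.g. a 32-bit length field in some protocol header

network_order = struct.pack(">I", length)  # big endian, as most protocols send it
intel_order = struct.pack("<I", length)    # little endian, as x86 stores it

decoded = struct.unpack(">I", network_order)[0]  # read with the right byte order
misread = struct.unpack("<I", network_order)[0]  # read as if it were native x86

print(network_order.hex(), intel_order.hex(), hex(decoded), hex(misread))
```

Treating the wire data as a binary stream means doing this kind of explicit conversion for numbers, in the same way text needs an explicit encoding.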

RichAlgeni wrote:Not one of them supports Unicode right now.

What codepage do they expect and do you convert to/from your and their codepage? If not then you can never guarantee the right characters being preserved.
They may not be "Ascii" (I assume you mean ISO 8859-1 or a variant, or the Windows equivalent, which is 12xx something; then you've got 8859-2 and, well, you get the point: which Ascii are you talking about?!)

RichAlgeni wrote:If it helps create a more advanced and stable product, then I am for it as well, and will adapt as needed.

I'm guessing more optimization where possible will be done to the Unicode stuff as things can be implemented a little different internally I suspect.
The end users will love PB programs though, as the programs will suddenly no longer mess up their names etc.
Ever tried to enter your name in a game and it refuses to let you type it? I have had that happen, heck I've had that happen on websites even, just because they always assume Ascii-7 or similar.

RichAlgeni wrote:If we could get SSE42 based string functions as Wilbert suggested, maybe this would be a little less painful?

I'd suggest against it as SSE42 is mostly Intel only, the majority of AMD and Via CPUs etc will be left out.
SSE3 is "safe" for x64 as 99.99% of all x64 CPUs out there support it so I'm hoping the PB team will take advantage of the "new" CPU features that are guaranteed to be there with x64. With x86 it's all very messy as to what can or can not be used.

RichAlgeni wrote:I'm not mistaken, all of Microsoft's functions are now Unicode.

AFAIK they have been "Unicode" since Windows NT 4 in um 1996? After that all "Ascii" API stuff has been transparently converted to/from "Unicode" in the OS.
Think about it: Unicode is over 17 years old, the NT Win32 API is over 17 years old. Microsoft has not "natively" used "Ascii" for over 17 years.

Again I must stress that this is mostly internal to PB; you will still be able to talk to (and curse and swear at) stupid old external programs and libraries and systems and servers on the net that use ascii-7 etc. And that will remain an issue as long as those standards and protocols are not replaced or updated.
Though luckily a few use UTF-8, like XML does (which dates back to, um, 1996), so that is not exactly new either, is it?!

skywalk · Post by **skywalk** » Fri Aug 08, 2014 3:40 am

Yes yes, I already agreed with you. And I already stated I am not Unicode illiterate. But, you cannot say to me having a simple ascii$ datatype in a Unicode compiler is a disadvantage. Did C/C++ drop them from their string libs? Maybe that is an unfair comparison due to scale, but I think it is important. I guess the point of my replies is:
Yes to Unicode only compiles.
Yes to Ascii$ and Unicode$$ datatypes.
With Ascii$ being a new feature request when Ascii$ are dropped.

Rescator · Post by **Rescator** » Fri Aug 08, 2014 3:59 am

skywalk wrote:With Ascii$ being a new feature request

Why? That is exactly the way it is now. You are asking them to remove it then re-add it?
Ascii$ would require all the same double code as today.

Show an example of a problem your software would run into.
Normally you call an API or DLL function and pass along a pointer to a memory buffer, then that memory gets filled (with, let's say, ascii-7 text).
What do you do next? Well, you either use PeekS() or you use an ascii pseudotype if you use prototypes to call DLL functions etc. In that case PureBasic does a PeekS for you. (I'm sure Fred or Freak will correct me if I'm wrong on that assumption.)

The other direction however is more painful.
But Fred stated (in the first post) that he will add
*AsciiBuffer = ToAscii(String$)
*UTF8Buffer = ToUTF8(String$)

(I assume that FreeMemory() can be used to free the buffer!?)

ToAscii() will save me a lot of extra code as in at least one of my current programs I have to mess with allocating memory then using PokeS with the Ascii flag.
ToUTF8() is also very welcome as a few networking related things will be needing that.
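
The two directions described above (buffer to string, string to buffer) can be sketched roughly as follows. This is Python, not PureBasic, and `peek_ascii` and `to_ascii_buffer` are hypothetical helper names made up for this illustration; it is not a description of how PeekS() or the planned ToAscii() are implemented.

```python
def peek_ascii(buffer: bytes) -> str:
    """Read a NUL-terminated ASCII string out of a raw memory buffer."""
    end = buffer.find(b"\x00")
    if end == -1:          # no terminator found: use the whole buffer
        end = len(buffer)
    return buffer[:end].decode("ascii")


def to_ascii_buffer(text: str) -> bytes:
    """Encode text as ASCII plus a terminating NUL.

    Characters that ASCII cannot represent are replaced with '?',
    which is an assumption of this sketch, not documented behaviour.
    """
    return text.encode("ascii", errors="replace") + b"\x00"


reply = peek_ascii(b"OK 200\x00leftover bytes")  # what an old library handed back
request = to_ascii_buffer("HELO example")        # what gets handed to it
lossy = to_ascii_buffer("Hågensen")              # 'å' cannot survive the trip

print(reply, request, lossy)
```

Inside the program everything stays a normal string; the conversion happens only at the boundary with the external library or device.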

You only need to convert to or from Ascii; once inside PureBasic it does not matter (to you) what it is, just use the PureBasic string functions. If you use your own string code then you are on your own anyway and you'd better know what you are doing.
All OS API functions (with a few exceptions) are of the Wide variant so no conversion will/should be needed.

juror · Post by **juror** » Fri Aug 08, 2014 4:01 am

The issue for us is not whether we can convert to Unicode. Yes, we can.

The issue is one of opportunity costs. Unlike some, we provide programs (systems) for real (paying) customers. In the over 6 years we have been selling systems, we have not had 1 request for Unicode. Everything our customers request and need has been satisfied with ASCII.

These systems, which were begun over 7 years ago, consist of dozens of modules and close to 100,000 lines of code. Ascii code.

We have a major release scheduled for January 2015 which has just entered alpha testing. Specifications are locked. We will not be modifying it for unicode. We have also mentioned features which will be added subsequent to that release as part of the next feature release cycle.

PB 5.22 LTS expires in mid 2015. In order to remain on a supported release of PB we will need to convert to Unicode prior to the end of support. So we'll be spending the time following our major release on a conversion which, from a customer perspective, adds little to no value compared with the features we have committed to and they have requested. I'm sure that announcing unicode compliance will really increase sales.

Actually, we will not announce it since if we did, we would likely be expected to explain some real "customer" benefit to it. And for our customers, there is no tangible benefit.

So instead of a feature and major release in 2015, we'll be lucky to provide "unicode" and bug-fix releases. But our systems will be "modern". Will that get us any additional sales/revenue? Hardly.

It would be nice to have a 5.3 or thereabouts LTS with 2 years of support which supports ASCII, so our conversion didn't have to be quite so disruptive to our planned release cycles.
