Re: Removing 'ASCII' switch from PureBasic
Posted: Thu Aug 07, 2014 9:33 pm
Thanks for trying anyway 

http://www.purebasic.com
https://www.purebasic.fr/english/
freak wrote:
> I want to clear up some points:
> ...
> About the speed:
> A unicode program is definitely not slower than an ascii one. The reason is that the entire OS layer is Unicode (at least on Windows), so in an ascii program, every call to an API function must be converted from ascii->unicode and back for the result. Even if the program uses only minimum OS interaction, the difference from the longer strings is pretty small, so for the average program, unicode mode is a gain in performance.

This is making big assumptions about how heavily the string functions are used. Every benchmark I have seen and/or tried shows PB's Unicode 60-80 percent slower. When strings are the primary function of the project, that is enormous. (BTW, I chose PB over C primarily because of the native strings.)
Code:
; ASCII vs. Unicode speed test
; Ascii: 18977, 20155
; Unicode: 35464, 33615
s.s = "My start string"
#Times = 100000
T = ElapsedMilliseconds()
For I = 0 To #Times
  s = s + Chr(Random(122, 32))
  o = FindString(s, "add", 1)
Next
T1 = ElapsedMilliseconds()-T ;564 at 100,000, 18288 at 500,000
MessageRequester("", Str(T1))
freak wrote:
> About the size:
> A unicode program will need space for the longer strings, but this too is not really an issue. To check a real life case, the following are the sizes of the PureBasic IDE (a 100k lines PB program) compiled in both modes:
> ASCII: 2.979 KB
> Unicode: 3.117 KB
> So it's about 5% more size. Yes, this is a difference, but is really not an issue in a time where hard drive sizes are measured in TB. That is just my personal opinion though.

Again, you are making big assumptions about how the string functions are used. Maybe for something as simple as an IDE (where 90% of the time is spent waiting for user input rather than actually processing strings, and where the files are relatively small), the application size is similar, but when you are handling data that is significantly larger than the actual application (i.e. "small" files are measured in MB), then the string library's efficiency and speed become significant.
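To put a number on the data side of the size question (as opposed to the exe size), here is a minimal sketch that asks PB how many bytes the same text needs in each format. Text$ is just the test string from the benchmark above; nothing here measures speed, it only shows the storage difference for plain text held in memory.

```purebasic
; Sketch only: byte size of one string in the different string formats.
Text$ = "My start string"

Debug Len(Text$)                            ; number of characters
Debug StringByteLength(Text$, #PB_Ascii)    ; bytes as Ascii (terminating null not counted)
Debug StringByteLength(Text$, #PB_Unicode)  ; bytes as Unicode (terminating null not counted)
Debug StringByteLength(Text$, #PB_UTF8)     ; bytes as UTF-8 (terminating null not counted)
```

For text that stays within the basic Latin range, the Unicode figure should come out at twice the Ascii one, which is the part that matters when the data is much bigger than the program.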
Tenaja wrote:
> This is making big assumptions about how heavily the string functions are used. Every benchmark I have seen and/or tried shows PB's Unicode 60-80 percent slower. When strings are the primary function of the project, that is enormous. (BTW, I chose PB over C primarily because of the native strings.)
> Have you benchmarked the PB compiler itself? How long does it take to compile the IDE when the compiler is in ascii mode, vs unicode? I would be very interested in that real-world comparison. Here is one that does concatenation as well as search, with timings of two consecutive tests on my Core i5:
> (code)
> ...
> Again, you are making big assumptions about how the string functions are used. Maybe for something as simple as an IDE (where 90% of the time is waiting for user input rather than actually processing strings, and where the files are relatively small), the application size is similar, but when you are handling data that is significantly larger than the actual application

If you are handling that much text then you do not store that in the program; you use binary files or databases (which are designed for such massive data handling).
skywalk wrote:
> why can't we have our cake and eat it too? I mean, there are double and float and quad and long datatypes and they have their place. I'm all for a more robust compiler and the benefits of a less stressed development team, but it is unclear to me from a user point of view, why Ascii$ cannot coexist with Unicode$$?

There is no need for ascii (internally), it is no different than a row of bytes with a 0 at the end. From a user point of view you will be oblivious to the changes, except that suddenly a lot of PureBasic programs will be able to display my name "Roger Hågensen" correctly, or that "foreign" filenames and characters will look and work properly in filenames.

skywalk wrote:
> All api/gadgets must use wide character calls but allow the user to choose how to represent and manipulate strings according to their specific needs.

Actually, all API/Gadgets/String gadgets etc are Unicode today, you just don't know it. Silently behind the scenes the OS translates strings between Ascii and Unicode (or wide/multibyte as it may be called). The only Windows versions where this does not happen are the old Windows 9x line when you use Ascii. On the NT line of Windows the conversion is always done if the exe is in Ascii mode. (I thought I explained that above earlier?)
skywalk wrote:
> By the way, all my code works in Unicode or Ascii compile.

Then you will probably have 0 issues due to this change.

skywalk wrote:
> Unicode requires more steps to debug and has subtle error conditions when sorting or compressing in and out of ascii data.

Unicode does not require more steps. If you are talking about reading from a text file, then that is not an issue of Unicode.

skywalk wrote:
> But, it is painful to see a useful datatype obsoleted.

You do realize that the "datatype" .a is basically just .b, right?
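For anyone unsure what the .a versus .b remark means in practice, a tiny sketch. Both types are a single byte; the only difference is whether the value is read as unsigned or signed. The variable names are just for illustration.

```purebasic
; Sketch only: the same byte value stored in .a and .b
a.a = 200   ; .a is an unsigned byte, range 0 to 255
b.b = 200   ; .b is a signed byte, range -128 to 127, so 200 wraps around

Debug a     ; shows 200
Debug b     ; shows -56
```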
RichAlgeni wrote:
> I have the same issue as Skywalk. My code interfaces with quite a few external processes and machines. A couple of them seem to have been written in the 70's. Not one of them supports Unicode right now.

That part is pretty simple. You use PeekS or PokeS, or you use pseudo types for Import and dlls etc. I have to mess with some of that today in some of my projects because the code in 3rd party stuff is archaic. (Luckily BASS started supporting Unicode in most areas.)
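A rough sketch of the PokeS/PeekS route mentioned above, assuming the external side wants a null-terminated Ascii buffer. Text$, *Buffer, Size and Reply$ are made-up names, and the hand-off to the legacy DLL or device is only a placeholder comment, since that part depends entirely on the interface in question.

```purebasic
; Sketch only: pass Ascii data to/from a legacy interface from a Unicode-mode program.
Text$ = "HELLO"

; Room for the Ascii bytes plus the terminating null.
Size = StringByteLength(Text$, #PB_Ascii) + 1
*Buffer = AllocateMemory(Size)

If *Buffer
  PokeS(*Buffer, Text$, -1, #PB_Ascii)     ; write the string as 1-byte characters
  ; ... hand *Buffer to the legacy DLL or device here (hypothetical step) ...
  Reply$ = PeekS(*Buffer, -1, #PB_Ascii)   ; read a null-terminated Ascii string back
  Debug Reply$
  FreeMemory(*Buffer)
EndIf
```

For imported functions, the p-ascii pseudo type does the same conversion at the call boundary, so the rest of the program can stay in Unicode.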
RichAlgeni wrote:
> Not one of them supports Unicode right now.

What codepage do they expect, and do you convert to/from your and their codepage? If not, then you can never guarantee the right characters being preserved.

RichAlgeni wrote:
> If it helps create a more advanced and stable product, then I am for it as well, and will adapt as needed.

I'm guessing more optimization where possible will be done to the Unicode stuff, as things can be implemented a little differently internally, I suspect.

RichAlgeni wrote:
> If we could get SSE42 based string functions as Wilbert suggested, maybe this would be a little less painful?

I'd suggest against it as SSE42 is mostly Intel only, the majority of AMD and Via CPUs etc will be left out.

RichAlgeni wrote:
> I'm not mistaken, all of Microsoft's functions are now Unicode.

AFAIK they have been "Unicode" since Windows NT 4 in um 1996? After that all "Ascii" API stuff has been transparently converted to/from "Unicode" in the OS.

skywalk wrote:
> With Ascii$ being a new feature request

Why? That is exactly the way it is now. You are asking them to remove it then re-add it?