Removing 'ASCII' switch from PureBasic

Post by **Fred** » Thu Aug 07, 2014 3:30 pm

wilbert wrote:@Marroh,
Mac is so modern that one single 64 bit unicode version would probably be good enough for most Mac users

Actually you're right, as the bare minimum for PB is OS X 10.5, and this one already supports 64-bit.

Post by **Tenaja** » Thu Aug 07, 2014 3:55 pm

Fred wrote:It's not decided for now

As I said, I use ascii almost exclusively. Removing ascii will create much work for me. However, I do see the merits of it.

Here is my suggestion for a compromise:
Make Unicode the default for new installations, and add warnings to the Help file that ascii will be deprecated. Then schedule 2016 LTS as the final release with ascii strings. That gives us time to wean ourselves off ascii, and another two years with a solid version for old code.

Post by **freak** » Thu Aug 07, 2014 4:13 pm

I want to clear up some points:

About the speed:
A unicode program is definitely not slower than an ascii one. The reason is that the entire OS layer is Unicode (at least on Windows), so in an ascii program, every call to an API function must be converted from ascii->unicode and back for the result. Even if the program uses only minimum OS interaction, the difference from the longer strings is pretty small, so for the average program, unicode mode is a gain in performance.
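
A minimal C sketch of the per-call conversion described above. The names `os_call_wide` and `os_call_ascii` are hypothetical stand-ins, not real Windows or PureBasic functions, and the zero-extension used here is only correct for 7-bit ASCII; a real ANSI wrapper would go through a code page table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a native OS call that only accepts UTF-16 text.
   Here it simply reports how many code units it received. */
static size_t os_call_wide(const uint16_t *text)
{
    size_t n = 0;

    while (text[n] != 0)
        n++;
    return n;
}

/* Hypothetical "ASCII" wrapper: every call first widens the 8-bit string
   into a temporary UTF-16 buffer, then forwards it to the wide function. */
static size_t os_call_ascii(const char *text)
{
    uint16_t wide[256];
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != '\0' && i < 255) {
        wide[i] = (unsigned char)text[i];   /* extra copy made on each call */
        i++;
    }
    wide[i] = 0;
    return os_call_wide(wide);
}
```

A program compiled in Unicode mode would call the wide function directly and skip the copy loop; string results coming back from the OS would need the reverse conversion as well.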

About the size:
A unicode program will need space for the longer strings, but this too is not really an issue. To check a real-life case, the following are the sizes of the PureBasic IDE (a 100k-line PB program) compiled in both modes:

ASCII: 2.979 KB
Unicode: 3.117 KB

So it's about 5% larger. Yes, this is a difference, but it is really not an issue at a time when hard drive sizes are measured in TB. That is just my personal opinion though.

About other features such as 32bit mode:
Different features produce a different amount of work in terms of maintenance. Speaking mostly about the PB libraries, the maintenance costs are roughly in this order (by my estimation):
1) Support for the 3 OS
2) Support for ascii/unicode
3) Support for quirks of specific OS versions within the same OS type (largely fixes for glitches in specific Windows versions)
4) Support for threaded programs
5) Support for 32bit/64bit

It all comes down to a cost/benefit analysis: 1) produces the highest costs, but giving up this feature is out of the question as it is one of the main points of PB. We are able to get rid of quite a few hacks from group 3), now that we have set the minimum Windows version to XP. Items 4) and 5) have a surprisingly low impact on library maintenance. The story is different in the compiler, but for the libraries, once these features were solid, there was not much more work to do for them. The Ascii support has quite a high impact on the library code (even multiplied, because much of it is in OS-specific code too), and at least in our opinion, the benefits of maintaining this feature are slowly decreasing. Hence the thought of removing it.

Post by **wilbert** » Thu Aug 07, 2014 4:23 pm

As for 32bit/64bit:
Are the 64-bit libraries already compiled with SSE2 optimizations, since all x86-64 processors support this?

Post by **luis** » Thu Aug 07, 2014 4:33 pm

freak wrote: The reason is that the entire OS layer is Unicode (at least on Windows), so in an ascii program, every call to an API function must be converted from ascii->unicode and back for the result.

True, and that's a valid point for the average program with a GUI, user input, a lot of idle time, and normal use of strings.
You won't notice a difference.

freak wrote: A unicode program is definitely not slower than an ascii one.

Definitely not true if you use the string library on substantial strings.

Code: Select all

ntimes = 100000

time1 = ElapsedMilliseconds()

For k = 1 To ntimes
 a$ = a$ + "*"
 b$ = Mid(a$, Len(a$) / 2, 1)
Next

time2 = ElapsedMilliseconds() - time1
MessageRequester("Run 1", "Time = " + Str(time2))

20 seconds in unicode, 12 in ascii (unsurprisingly).

Post by **User_Russian** » Thu Aug 07, 2014 4:36 pm

freak wrote:About the speed:

Unicode strings in PB are much slower than ASCII ones. And this problem is not solved! http://www.purebasic.fr/english/viewtop ... =3&t=58892
For example, run this code in ASCII and Unicode, and compare the execution time.

Code: Select all

DisableDebugger

Str.s
#Text = "1234567890"

Time = ElapsedMilliseconds()

For i=1 To 10000
  Str + #Text
Next i

MessageRequester("", StrF((ElapsedMilliseconds()-Time)/1000, 3))

Post by **Lebostein** » Thu Aug 07, 2014 4:52 pm

User_Russian wrote:
freak wrote:About the speed:
Unicode strings in PB are much slower than ASCII ones.

Here on Mac:
Ascii: Time = 4934
Unicode: Time = 11244
Factor 2.3
This is not acceptable!

PS: Is there a shell compiler option for unicode? For Windows there is the /UNICODE flag, but for Linux and Mac I found nothing in the docs.

Post by **Shield** » Thu Aug 07, 2014 6:42 pm

It's not acceptable because you simply don't code that way. It's a terrible example because any sane person would use a string builder or other mechanisms for this.

Of course Unicode is slower due to the overhead, but it's negligible if you're doing normal string operations. For high performance processing you need other algorithms anyway.
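
As an illustration of the "string builder" idea, here is a minimal sketch in C. `StrBuilder`, `sb_append` and `sb_free` are hypothetical names chosen for this example, not part of PureBasic or any library mentioned in the thread. The buffer grows by doubling, so a long run of appends does not recopy the whole string each time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal growable string buffer (illustrative only). */
typedef struct {
    char  *data;   /* NUL-terminated contents, or NULL before the first append */
    size_t len;    /* characters stored, excluding the terminator */
    size_t cap;    /* bytes currently allocated */
} StrBuilder;

/* Appends text; returns 1 on success, 0 if memory could not be allocated. */
static int sb_append(StrBuilder *sb, const char *text)
{
    size_t add;

    assert(sb != NULL && text != NULL);
    add = strlen(text);
    if (sb->len + add + 1 > sb->cap) {
        size_t new_cap = sb->cap ? sb->cap : 16;
        char *p;

        while (new_cap < sb->len + add + 1)
            new_cap *= 2;                 /* doubling keeps appends amortized O(1) */
        p = realloc(sb->data, new_cap);
        if (p == NULL)
            return 0;
        sb->data = p;
        sb->cap = new_cap;
    }
    memcpy(sb->data + sb->len, text, add + 1);   /* copies the terminator too */
    sb->len += add;
    return 1;
}

/* Releases the buffer and resets the builder to its empty state. */
static void sb_free(StrBuilder *sb)
{
    free(sb->data);
    sb->data = NULL;
    sb->len = 0;
    sb->cap = 0;
}
```

Starting from `StrBuilder sb = {0};`, the 10000 appends of "1234567890" from the benchmark above become 10000 calls to `sb_append(&sb, "1234567890")`, followed by one `sb_free(&sb)`.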

Post by **Danilo** » Thu Aug 07, 2014 6:49 pm

Really funny guys. Coming too late to the party and complaining now... LOL

PureBasic V4.00 (May 2006)

- Added: Native unicode support

Fred simply can't add OOP support to PB because it adds overhead and may be slower in some circumstances, and User_Russian would only complain.

Many people here have used Unicode for years without problems, so stop exaggerating.

Post by **luis** » Thu Aug 07, 2014 6:53 pm

Shield wrote: It's a terrible example because any sane person would use a string builder or other mechanisms for this.

The point is that there is a great speed difference when compiling the same code in the two different modes, so saying "A unicode program is definitely not slower than an ascii one" is false. I'm interested only in that statement.

You can use different algorithms, but as long as the data you move is twice as large and you use the PB string library, that difference will always be present.
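
A rough way to put numbers on "twice as large" is sketched below in C. It assumes, hypothetically, that every append copies the whole string built so far; the thread does not say how the PB string library actually does it. `bytes_copied` is a name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes copied when a string is built by n appends of m characters each,
   under the assumption that each append copies the entire result so far.
   char_size is 1 for the 8-bit mode and 2 for the 16-bit mode. */
static uint64_t bytes_copied(uint64_t n, uint64_t m, uint64_t char_size)
{
    assert(char_size == 1 || char_size == 2);
    /* After append k the string holds k*m characters: m * (1 + 2 + ... + n). */
    return m * (n * (n + 1) / 2) * char_size;
}
```

For the 10000 appends of a 10-character constant in the benchmark above, this model gives 500,050,000 bytes with 1-byte characters and 1,000,100,000 bytes with 2-byte characters, so the amount of memory traffic doubles regardless of the algorithm around it.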

Post by **langinagel** » Thu Aug 07, 2014 7:08 pm

Another question to those who tell us that they definitely need ASCII support to write programs for older/weaker PCs:

Would it be acceptable to keep ASCII in the 32-bit versions only?
Specifically:
Would it be feasible to keep ASCII support in the Windows/Linux x86 versions only?

I doubt that Apple users need ASCII very much, and the same goes for the 64-bit versions of Windows and Linux.

Nevertheless I would like to keep the chance to build programs for simple (maybe also embedded) x86 platforms.

Post by **skywalk** » Thu Aug 07, 2014 7:28 pm

I agree it is pointless to complain of native string concatenation speeds either in Ascii or Unicode. This is not a valid argument against dropping Ascii compile.
For me the problem is the extra hoops I will need to jump through while debugging/handling ascii data from network and instruments in a Unicode-only IDE/debugger. I will be forced to convert a lot of this to memory transfers and that is not improving my efficiency. This decision means extra work for the users or continued maintenance work for the developers.
I ask again if we could have simultaneous Ascii$ and Unicode$$ datatypes?
Then force all api handling to always use Unicode$$.
But at least allow the users to define and use plain old Ascii$.
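
The kind of boundary conversion this would require can be sketched in C as follows. `ascii_to_utf16` and `utf16_to_ascii` are hypothetical helpers written for this example, not PureBasic functions, and they only accept 7-bit ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widens n raw ASCII bytes (e.g. read from a socket or an instrument) into
   UTF-16 code units. dst must have room for n + 1 units.
   Returns 1 on success, 0 if a byte is outside the 7-bit ASCII range. */
static int ascii_to_utf16(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        if (src[i] > 0x7F)
            return 0;
        dst[i] = src[i];
    }
    dst[n] = 0;
    return 1;
}

/* Narrows NUL-terminated UTF-16 text back to ASCII bytes for sending.
   Returns the number of bytes written (terminator excluded), or (size_t)-1
   if a character is not ASCII or dst is too small. */
static size_t utf16_to_ascii(const uint16_t *src, unsigned char *dst, size_t dst_cap)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    if (dst_cap == 0)
        return (size_t)-1;
    for (i = 0; src[i] != 0; i++) {
        if (src[i] > 0x7F || i + 1 >= dst_cap)
            return (size_t)-1;
        dst[i] = (unsigned char)src[i];
    }
    dst[i] = '\0';
    return i;
}
```

Every buffer that crosses the device boundary would need one of these calls, which is the extra work described above.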

Post by **Danilo** » Thu Aug 07, 2014 9:08 pm

@Fred / freak:
Would it be possible for you to change the build- and library-system? Besides what has been said already,
I still think you make your life harder than it needs to be.

Code: Select all

Function()                  (Ascii)
Function_THREAD()           (Ascii + Threaded)
Function_UNICODE()          (Unicode)
Function_UNICODE_THREAD()   (Unicode + Threaded)
Function_Debug()            (Runtime Debug)

When creating PB-Libs, we usually create 5 files for these 5 functions. It is an optimization thing, so every function
goes into an extra .obj within a .lib, and unnecessary functions are not linked into the final executable.
(I'm not sure all the newest linkers address this issue using the flag for link-time optimizations)

When modifying the build-system, you could only have 1 file, 'function.c'. Then, recompile it
using different flags (ascii/unicode, threaded ascii/threaded unicode).

I don't know what build times we are talking about here. It could be 16 hours for a complete build,
and without ASCII support it could be 10 or 12 hours only. If it is in that range, it is a significant improvement.
I could be totally wrong and it could be 32 hours vs. 20 hours for a complete build for 3 platforms and ascii/unicode/threaded/32bit/64bit/MMX/SSE/SSE2/SSE3/SSE4/etc...

I see the bigger problem in maintenance, when you have 5 different functions in 5 files, instead of 1 function and 1 file.

Using some smart C macros, in combination with compiler flags and some more #define, I think it should be
possible to re-compile 1 source file to many different targets.
I mean, in C/C++ I can also re-compile the same file for ASCII and UNICODE and SSE and SSE4 by using just different compiler flags
and putting all text strings into macros like _T("Hello") or TEXT("World"). ( "Hello" vs. L"Hello", you know )

Function names like MyFunction / MyFunction_Unicode / MyFunction_MMX / MyFunction_Threaded_Unicode_SSE4
can easily be generated automatically using macros and compiler flags/defines.
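
A minimal sketch of that idea in C: one source file whose character type and exported name depend on a compiler flag. `DEMO_UNICODE`, `DEMO_NAME`, `demo_char` and `DemoLength` are hypothetical names invented for this example; they are not taken from the PureBasic sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build switch: compile with -DDEMO_UNICODE for the wide variant. */
#ifdef DEMO_UNICODE
  typedef uint16_t demo_char;
  #define DEMO_NAME(base) base##_UNICODE
#else
  typedef char demo_char;
  #define DEMO_NAME(base) base
#endif

/* Written once. The resulting symbol is DemoLength in the default build and
   DemoLength_UNICODE when DEMO_UNICODE is defined. */
size_t DEMO_NAME(DemoLength)(const demo_char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

Compiling the same file twice, once without and once with `-DDEMO_UNICODE`, would produce two object files with distinct function names that can sit side by side in one library.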

Of course this wouldn't affect the build-times in a positive way, if you still compile one file 5 times, just with different compiler flags.

But it would affect the maintenance time significantly, because you only manage 1 file with 1 function and just re-compile it for:

Code: Select all

- Ascii
- Ascii MMX
- Ascii SSE
- Ascii SSE2
- Ascii SSE3
- Ascii SSE4
- Threaded Ascii
- Threaded Ascii MMX
- Threaded Ascii SSE
- Threaded Ascii SSE2
- Threaded Ascii SSE3
- Threaded Ascii SSE4
- Unicode
- Unicode MMX
- Unicode SSE
- Unicode SSE2
- Unicode SSE3
- Unicode SSE4
- Threaded Unicode
- Threaded Unicode MMX
- Threaded Unicode SSE
- Threaded Unicode SSE2
- Threaded Unicode SSE3
- Threaded Unicode SSE4
- Debug

I think the PureBasic library system, and how you write PB library functions, could also be improved to make your life, and the maintenance part of PB, easier.

The big question is, what disturbs you more? Is it the build-times, or the maintenance of the many different function versions and files?

If the bottleneck is the build-times, you can't do much (assuming only changed files are re-compiled and tools like ccache are used,
in addition to compiling 16 sources simultaneously on a build-machine with 8 cores and HyperThreading, enough RAM, SSD disks, etc.)

If the main problem is maintenance, I think you could improve some things within the PB library system, without dropping support for ASCII mode.
Living without ASCII mode is not a big problem in my opinion (after getting used to it), but of course it would be generally better to continue to support both modes,
if easily possible and maintainable.

Post by **Fred** » Thu Aug 07, 2014 9:19 pm

We already have only one file per function, with macros. The main issue is testing all the variants and handling ASCII quirks here and there due to locale specifics. Example from the ValF() file:

Code: Select all

/* === Copyright Notice ===
 *
 *
 *                  PureBasic source code file
 *
 *
 * This file is part of the PureBasic Software package. It may not
 * be distributed or published in source code or binary form without
 * the expressed permission by Fantaisie Software.
 *
 * By contributing modifications or additions to this file, you grant
 * Fantaisie Software the rights to use, modify and distribute your
 * work in the PureBasic package.
 *
 *
 * Copyright (C) 2000-2010 Fantaisie Software - all rights reserved
 *
 */

#include "String.h"

#ifdef WINDOWS
   // see http://blogs.msdn.com/oldnewthing/archive/2010/03/05/9973225.aspx
   #define INFINITY ((float)(1e308 * 10))
   #define NAN      ((float)((1e308 * 10)*0.))
   #include <locale.h>
#else
  #include <math.h>
#endif

M_PBFUNCTION(double) PB_ValD(const TCHAR *String)
{
  if (String == NULL)
    return 0;
    
  if (stricmp(String, TEXT("+Infinity")) == 0)
  {
    return (double)INFINITY;
  }
  else if (stricmp(String, TEXT("-Infinity")) == 0)
  {
    return -(double)(INFINITY);
  }
  else if (stricmp(String, TEXT("NaN")) == 0)
  {
    return (double)NAN;
  }
  
  #ifdef WINDOWS
    setlocale(LC_NUMERIC, "English"); // on Windows, the locale can be changed, so update it (3x slower now) ! http://www.purebasic.fr/english/viewtopic.php?f=4&t=47944&start=30
  #endif

  #ifdef UNICODE
    #ifdef WINDOWS
    {
      double Result = 0;

      sscanf(String, TEXT("%lf"), &Result);
      return Result;
    }
    #else
      char Buffer[1024];
      int Length;

      Length = SYS_WideCharToMultiByte(CP_ACP, 0, String, -1, Buffer, 1023, 0, 0);
      Buffer[Length] = 0;

      #if defined(LINUX) && !defined(PB_MACOS)
        PB_String_ReplaceDecimalCharacter(Buffer);
      #endif

      return atof(Buffer); // returns a double on linux
    #endif

  #else

    #if defined(LINUX) && !defined(PB_MACOS)
      double Result = 0;
      char *ConvertedString;
      
      if (ConvertedString = strdup(String))
      {
        PB_String_ReplaceDecimalCharacter(ConvertedString);
        Result = atof(ConvertedString);
        free(ConvertedString);
      }

      return Result;
    #else
      return atof(String);
    #endif
  #endif
}

A full build takes about one hour per version, so building the six versions can take a long time (we have an incremental build system to reduce this time). Our goal is to have less code, which means fewer opportunities to introduce bugs.
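
The source of `PB_String_ReplaceDecimalCharacter` is not shown in the listing above. The following C sketch is only a guess at the kind of locale workaround it stands for, using the standard `localeconv()` call; `replace_decimal_char` and `parse_double_dot` are hypothetical names.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: rewrites the first '.' in buf to the decimal character
   of the current LC_NUMERIC locale, so that atof() accepts input that always
   uses a dot. Only single-byte decimal characters are handled. */
static void replace_decimal_char(char *buf)
{
    const struct lconv *lc = localeconv();
    char *dot;

    assert(buf != NULL);
    dot = strchr(buf, '.');
    if (dot != NULL && lc->decimal_point != NULL && lc->decimal_point[0] != '\0')
        *dot = lc->decimal_point[0];
}

/* Parses text written with a dot as decimal separator, whatever the locale.
   Input longer than 63 characters is truncated in this sketch. */
static double parse_double_dot(const char *text)
{
    char buf[64];

    assert(text != NULL);
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    replace_decimal_char(buf);
    return atof(buf);
}
```

Each platform and each string mode needs its own variation of this kind of workaround, which is where the extra branches in the listing come from.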

Post by **Danilo** » Thu Aug 07, 2014 9:28 pm

I see. Can't help then...
