Removing 'ASCII' switch from PureBasic

Post by **flaith** » Sat Aug 09, 2014 6:08 am

This is how I had to change an old program to make it work with Unicode enabled.
The program reads the line char by char to build a token.

In Unicode mode this cannot work as-is, so I added this:

Code:

Global.i FORMAT_BYTE = StringByteLength("a", #PB_Ascii)
CompilerIf #PB_Compiler_Unicode
  FORMAT_BYTE = StringByteLength("a", #PB_Unicode)
CompilerEndIf

The GetToken() function:

Code:

;-Tokenizer
Procedure.s GetCurrentChar()
  Protected car.s
  
  ;car = Chr(PeekC(@LINE+CurrentPos)) ;ORIGINAL
  car = Chr(PeekA(@LINE+CurrentPos)) ;VERSION 1 (reads a single byte: fine for codes 0-255 only)
  ;car = PeekS(@LINE+CurrentPos,1,#PB_Ascii) ;VERSION 2
  ProcedureReturn car
EndProcedure

Procedure.i SkipSpace()
  Protected nbspace.i = 0
  
  While GetCurrentChar() = " " Or GetCurrentChar() = #TAB$
    CurrentPos + FORMAT_BYTE
    nbspace + 1
  Wend
  
  ProcedureReturn nbspace  
EndProcedure

Procedure.s GetToken()
  Protected sTok.s = "", c.s
  Repeat
    c = GetCurrentChar()
    ; Car = '"' and Not inside a string ?     //String definition
    If c = #DBL_QUOTE And _QUOTE = #False
      sTok+c
      CurrentPos + FORMAT_BYTE
      _QUOTE = #True
      c = GetCurrentChar()
    EndIf
    ; Car = '"' and inside a string ?         //String definition
    If c = #DBL_QUOTE And _QUOTE = #True
      _QUOTE = #False
    EndIf
    ; Car = TAB and Not inside a string ?     //Tabulation
    If c = #CHAR_TAB And _QUOTE = #False
      CurrentPos + FORMAT_BYTE
      Break
    EndIf
    ; Car = ';' or Car = '*' in the beginning 
    ; of the line and Not inside a string ?   //Remark
    If c = ";" And _QUOTE = #False
      CurrentPos = LenLine
      Break
    EndIf
    If c = "*" And _QUOTE = #False And CurrentPos = 0
      CurrentPos = LenLine : sTok = ""
    EndIf
    ; Current car position >= current Line length ?
    If CurrentPos >= LenLine
      Break
    EndIf
    ; if it's a space outside a quoted string
    If c = " " And _QUOTE = #False
      CurrentPos + FORMAT_BYTE
      Break
    EndIf
    ; Make the Token
    CurrentPos + FORMAT_BYTE
    sTok + c
  ForEver  
  ProcedureReturn sTok
EndProcedure

And the Init section:

Code:

Global.s LINE = "   BAS2H	EQU $2B" ; TAB inside
CurrentPos = 0:PosTok = 1

;*** IMPORTANT TO MULTIPLY HERE ***
LenLine = Len(LINE)*FORMAT_BYTE           ;For ASCII/UNICODE

Debug "Line to tokenize: "+LINE+" - (len:"+Str(LenLine)+")"

While CurrentPos < LenLine
  nbspace = SkipSpace()
  If nbspace > 0                          ;going to the next token
    PosTok + 1
  EndIf
  a$=GetToken()
  If a$ <> ""
    Debug a$+" ["+Str(PosTok)+"]"
    PosTok + 1
  EndIf
  SkipSpace()
Wend

You can see that each time I need to go to the next char I have to add 1 (ASCII) or 2 (Unicode), and I also have to multiply the length of the line by 1 or 2.

I hope you can find a way to handle that more easily than my messy way

Post by **wilbert** » Sat Aug 09, 2014 6:24 am

flaith wrote:I hope you can find a way to handle that more easily than my messy way

I hope my answer doesn't pollute this thread too much but maybe an approach like this is less messy.
It works both in ASCII and Unicode mode.

Code:

#CharSize = SizeOf(Character)

Structure CharStructure
  StructureUnion
    c.c
    s.s{1}
  EndStructureUnion
EndStructure

S.s = "String"

*CharPtr.CharStructure = @S

While *CharPtr\c
  
  Debug *CharPtr\s + " value : " + *CharPtr\c
  
  *CharPtr + #CharSize
  
Wend

or (a bit slower compared to the code above)

Code:

Structure CharStructure
  StructureUnion
    c.c
    s.s{1}
  EndStructureUnion
EndStructure

Structure CharArray
  p.CharStructure[0]
EndStructure

S.s = "String"

*CharArray.CharArray = @S

CurrentPos = 0

While *CharArray\p[CurrentPos]\c
  
  Debug *CharArray\p[CurrentPos]\s + " value : " + *CharArray\p[CurrentPos]\c
  
  CurrentPos + 1
  
Wend
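
Applied to the tokenizer from the first post, the same pointer-walking idea might look like the sketch below. It is a simplified illustration, not flaith's actual code: it only splits on spaces and tabs and leaves out the quote and remark handling, and the names text$, tok$ and *p are made up for the example.

```purebasic
; Sketch: split a line on spaces and tabs by walking it one character at a time.
; text$, tok$ and *p are illustrative names, not taken from the original program.
EnableExplicit

Define text$ = "   BAS2H" + #TAB$ + "EQU $2B"
Define *p.Character = @text$
Define tok$ = ""

While *p\c
  If *p\c = ' ' Or *p\c = #TAB
    ; A separator ends the current token, if there is one
    If tok$ <> ""
      Debug tok$
      tok$ = ""
    EndIf
  Else
    tok$ + Chr(*p\c)
  EndIf
  *p + SizeOf(Character)   ; 1 byte in an ASCII build, 2 bytes in a Unicode build
Wend

; Emit the last token when the line does not end with a separator
If tok$ <> ""
  Debug tok$
EndIf
```

With the sample line this should print BAS2H, EQU and $2B in either mode, since the pointer advances by SizeOf(Character) and no byte length has to be multiplied by hand.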

Post by **chris319** » Sat Aug 09, 2014 7:59 am

Quote: What are your thoughts about it? Is it a deal breaker for you?

It seems you've already made the decision so why are you soliciting input from the user base after the fact?

It's going to muck up something I'm working on which requires me to pass ASCII strings to and from an API.

Quote: ASCII is an old tech and is condemned to disappear sooner or later, as unicode can handle it as well.

So you're legislating obsolescence. All legacy technology is going to be dropped from PureBasic like a hot potato and tough luck to anybody who still uses it, is that the idea?

Is it feasible to ask that mystring$ or mystring.s be a legacy ASCII string and mystring.x* a unicode string, or is the die cast and it's too late to ask for this?

I'm glad to hear it will make things easier for the PB team, to say nothing of the code rewriting the user base will have to do when their code has been broken.

Yes, I'll be looking around for something which isn't built on ever-shifting sands.

*The letter "x" is an arbitrary choice and could be any unused character or symbol deemed appropriate.
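
For the API case mentioned above, one option that does not depend on the ASCII compiler switch is to convert explicitly at the boundary. The sketch below assumes the API wants a null-terminated ASCII buffer; text$, back$ and *buf are illustrative names, and the API call itself is left out.

```purebasic
; Sketch: build and read an ASCII (1 byte per character) buffer from any build mode.
; text$, back$ and *buf are illustrative names; the real API call is omitted.
EnableExplicit

Define text$ = "Hello API"
Define size.i = StringByteLength(text$, #PB_Ascii) + 1   ; +1 for the terminating null byte
Define *buf = AllocateMemory(size)
Define back$

If *buf
  PokeS(*buf, text$, -1, #PB_Ascii)     ; write the string as ASCII, null-terminated
  ; ... *buf could now be handed to a function that expects an ASCII string ...
  back$ = PeekS(*buf, -1, #PB_Ascii)    ; read an ASCII string back into a native string
  Debug back$
  FreeMemory(*buf)
EndIf
```

For imported functions, the p-ascii pseudotype in Import or Prototype declarations may be another way to get the same conversion without managing the buffer by hand.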

Post by **wilbert** » Sat Aug 09, 2014 8:27 am

chris319 wrote:I'm glad to hear it will make things easier for the PB team, to say nothing of the code rewriting the user base will have to do when their code has been broken.
Yes, I'll be looking around for something which isn't built on ever-shifting sands.

PureBasic has always evolved this way: adding things and removing things.
You always know that the next version isn't guaranteed to be fully backwards compatible. Of course most of it is, but there's always the chance that you have to change existing code to make it work in a new version, or keep using an older compiler next to the newer one.
I'm not saying this is good or bad, but to me it is understandable to do it this way if you have so few people working on it and want to keep things manageable.

Post by **flaith** » Sat Aug 09, 2014 8:43 am

Thanks Wilbert

Post by **davido** » Sat Aug 09, 2014 9:23 am

I wonder; will this mean the demise of the variable-type .a ?

Post by **Shield** » Sat Aug 09, 2014 10:06 am

davido wrote:I wonder; will this mean the demise of the variable-type .a ?

No.

Post by **NikitaOdnorob98** » Sat Aug 09, 2014 10:20 am

Fred, please make a poll. It will be better.

P.S. I think it's a bad idea.

Post by **Danilo** » Sat Aug 09, 2014 10:29 am

NikitaOdnorob98 wrote: P.S. I think it's a bad idea.

Don't you think it could make your life easier once you get used to it?

You work with the English/Latin and Cyrillic alphabets every day (Cyrillic in strings, comments and user interfaces; English/Latin for PB keywords).
That's what Unicode is about, supporting Cyrillic, Chinese, Latin, Thai, Korean, ... character sets... all at the same time.
You can have your apps in English, Russian, Chinese, ... just by loading/using a different catalog/database with the strings. No more codepage conversions.

Especially for the Russian guys here I had expected they would welcome the change. Now I see it's the opposite, and it makes me wonder.

The following problem, mentioned by User_Russian, exists only when an application is NOT fully Unicode:

User_Russian wrote:
Code:
DataSection
  IncludeBinary "C:\Программы\Prog.exe"
EndDataSection
Description of error.
[COMPILER] Line 2: Included file Not found: C:\?????????\Prog.exe.
The same error with other Include-commands (IncludeFile, XIncludeFile and IncludePath).

Post by **luis** » Sat Aug 09, 2014 10:49 am

Danilo wrote: Especially for the Russian guys here I had expected they would welcome the change. Now I see it's the opposite, and it makes me wonder.

Welcome the change ?

They can use Unicode already. The proposal is not to add Unicode in 5.40; it is to remove the ability to make ASCII builds.

Post by **Danilo** » Sat Aug 09, 2014 11:08 am

luis wrote: Welcome the change ?

Let me rephrase: Especially for them, Ascii is quite useless, in my opinion. As you can see with the PB compiler,
Ascii applications only create extra trouble (see the "C:\Программы\" example). With full Unicode support, you don't
have these problems, and that's what makes Ascii applications pretty obsolete. It is a shame that many 3rd party DLLs/libs
are still compiled in Ascii mode, especially in our globalized world.

Post by **useful** » Sat Aug 09, 2014 11:14 am

For those who value PB's cross-platform support, the answer is obvious: in Linux GUIs there is de facto no alternative to UTF. It is hard to imagine anyone writing system software in PB. Unfortunately, few people write anything for Linux with it, and the reason is that PB is proprietary. So for Cyrillic developers it is not so obvious. Personally, I am in favour of dropping ASCII support if that makes things easier for the team.

However, I want consistency, i.e. full Unicode support in the compiler, including the names of variables, procedures and so on.

Post by **Little John** » Sat Aug 09, 2014 11:18 am

Shield wrote:
davido wrote:I wonder; will this mean the demise of the variable-type .a ?
No.

To give some explanation:

A variable of type .a can hold one whole number in the range from 0 to +255. So the correct name of this variable type is "unsigned byte".
This has nothing to do with strings in the first place. This variable type was just misnamed in the PB documentation by calling it "ASCII".
The same goes for the so-called "Unicode" data type, the correct name of which is "unsigned word".
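
A small sketch of the point, treating both types as plain numbers; the variable names are only for illustration:

```purebasic
; Sketch: .a and .u are plain unsigned integer types, independent of string handling.
EnableExplicit

Define a.a = 200          ; "Ascii" type: unsigned byte, range 0 to 255
Define u.u = 50000        ; "Unicode" type: unsigned word, range 0 to 65535

Debug SizeOf(a)           ; storage size of the .a variable in bytes
Debug SizeOf(u)           ; storage size of the .u variable in bytes
Debug a + u               ; ordinary integer arithmetic, no strings involved
```

Neither variable depends on whether the program is compiled in ASCII or Unicode mode, which is why removing the ASCII switch would not affect the .a type.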

Post by **davido** » Sat Aug 09, 2014 11:24 am

@Shield,
Thank you.

@Little John,
Thank you for the detailed explanation.
I didn't realise that the variable type was misnamed.

That is why I asked the question.

Post by **Little John** » Sat Aug 09, 2014 11:34 am

Danilo wrote:[...] in our globalized world.

Danilo, I absolutely agree with what you wrote.
However, many people still do not think globally, and this is no surprise to me anymore.
We can see this "phenomenon" even on this forum, which is supposed to be international.
