PureBasic Forums - English

Posted: **Fri Feb 15, 2008 4:14 pm**

Hi,

Here is a small utility which I have used to rip apart, amongst other things, FASM assembly source code, breaking each line down into a sequence of 'symbols' which can then be analysed in whatever way you see fit.

Whilst not a formal tokeniser as such, this could well form the basis of one by the addition of extra logic to incorporate some kind of token 'dictionary' etc. I have not needed to go that far however.
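To give a rough idea of what such a 'dictionary' step might look like, here is a minimal sketch. It is not part of the utility: the token IDs and the map are invented for illustration, it assumes Tokeniser.pbi (listed further down in this post) is saved next to the file, and it needs a PureBasic version that has maps (NewMap), which is later than the 4.2 beta the utility was written with.

```purebasic
; Sketch only: assumes Tokeniser.pbi from this post sits next to this file,
; and a PureBasic version that supports maps (NewMap).
XIncludeFile "Tokeniser.pbi"

; Hypothetical token IDs, invented for this illustration.
Enumeration 1
  #Token_Global
  #Token_Dim
  #Token_NewList
EndEnumeration

; The 'dictionary': lower-cased keyword -> token ID.
NewMap dictionary.i()
dictionary("global")  = #Token_Global
dictionary("dim")     = #Token_Dim
dictionary("newlist") = #Token_NewList

test$ = "Global a, b, Dim c(100)"

If PARSING_TokeniseCommand(test$, " ,;():")
  For i = 0 To gParse\numberOfTokens - 1
    symbol$ = gParse\tokens$[i]
    ; PB keywords are case-insensitive, so look up the lower-cased symbol.
    If FindMapElement(dictionary(), LCase(symbol$))
      Debug symbol$ + " -> keyword token " + Str(dictionary())
    Else
      Debug symbol$ + " -> not in dictionary"
    EndIf
  Next
EndIf
```

Anything that is not found in the map (identifiers, numbers, punctuation) would still need its own classification rules; the sketch only covers the keyword lookup.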

The small example given (taken from an application I am currently working on) shows part of how I am analysing PB source code in which I am only interested in the 'Global' keyword. My 'separator' string is constructed with just the Global keyword in mind.

Example :

Code: Select all

;///////////////////////////////////////////////////////////////////////////////////////////
;Example of using the 'tokeniser'.
;
;This very very contrived example shows how you might break down a line of PB source code
;built around the 'Global' keyword.
;///////////////////////////////////////////////////////////////////////////////////////////

  IncludePath #PB_Compiler_Home+"\Includes"
    XIncludeFile "Tokeniser.pbi"


;My test string which could have been ripped from a PB source file.
  test$ = "Global a, b, Dim c(100), NewList x() : Global a, b"

;The separator string which basically describes how to split lines of text.
;NOTE, if your text lines can contain quoted strings then you would add Chr(34) to this
;separator string. Note also how I have included a colon and a semi-colon in this string.
;These act as delimiters.
  #MY_SEPARATORS = " ,;():"

;Now parse the line of text.
  If PARSING_TokeniseCommand(test$, #MY_SEPARATORS)
    ;Let us have a look at the symbols.
    For i = 0 To gParse\numberOfTokens - 1
      Debug gParse\tokens$[i]
    Next  
  EndIf

;Voila - we are doneth! The symbol array is ready for analysis etc.

Tokeniser.pbi :

Code: Select all

;///////////////////////////////////////////////////////////////////////////////////////////
;'Tokeniser'.
;
;February 2008.

;Developed with Purebasic 4.2 beta 2.
;Fully cross-platform.
;///////////////////////////////////////////////////////////////////////////////////////////

;///////////////////////////////////////////////////////////////////////////////////////////
;-NOTES.
; ======
; This small utility will parse a string of text and separate it into individual 'symbols' based
; upon a given 'separator' string (whose characters act as delimiters).
; The symbols are placed into the global variable 'gParse' as well as a count of the number of
; symbols etc.
;
; Separator strings can be constructed to parse individual constructs (as in the attached
; example) or to parse entire languages (I used essentially this routine to parse FASM assembly
; code files without a problem).
;
; Of course this does not a 'formal' tokeniser make as the resulting symbols haven't been tokenised.
; This would be a relatively simple addition, however, but would require a 'dictionary' of tokens.
;///////////////////////////////////////////////////////////////////////////////////////////


;-CONSTANTS and STRUCTURES.
  #Tokeniser_MAXNUMSYMBOLSINALINE = 200
  
  Structure _tokeniserParseGlobals
    numberOfTokens.l
    tokens$[#Tokeniser_MAXNUMSYMBOLSINALINE]
  EndStructure

;-GLOBALS.
    Global gParse._tokeniserParseGlobals
  

;///////////////////////////////////////////////////////////////////////////////////////////
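;The following function tokenises the given command.
;Separators other than spaces are returned as one-character symbols; spaces are dropped.
;Returns zero if the line cannot be parsed (no closing quote, or too many symbols for gParse).
Procedure.l PARSING_TokeniseCommand(line$, separator$)
  Protected left, right, length, char$
  Protected result = #True
  gParse\numberOfTokens = 0
  length = Len(line$)
  If length
    left = 1 : right = 1
    Repeat
      char$ = Mid(line$, right, 1)
      If FindString(separator$, char$, 1)
        If left < right ;A separator ends the pending symbol; the separator itself is handled next time round.
          gParse\tokens$[gParse\numberOfTokens] = Mid(line$, left, right-left)
          gParse\numberOfTokens + 1
          left = right
        ElseIf char$ = Chr(34) ;Opening quote (left=right here). Take everything up to the closing quote.
          right = FindString(line$, char$, left+1)
          If right = 0 ;No end quote.
            result = #False
            Break
          EndIf
          gParse\tokens$[gParse\numberOfTokens] = Mid(line$, left, right-left+1)
          gParse\numberOfTokens + 1
          right + 1
          left = right
        ElseIf char$ <> " " ;Any other separator (left=right here) becomes a symbol in its own right.
          gParse\tokens$[gParse\numberOfTokens] = Mid(line$, left, 1)
          gParse\numberOfTokens + 1
          left + 1 : right + 1
        Else ;Spaces are skipped.
          left + 1 : right + 1
        EndIf
      ElseIf right = length ;Last character of the line: store the pending symbol.
        right + 1
        gParse\tokens$[gParse\numberOfTokens] = Mid(line$, left, right-left)
        gParse\numberOfTokens + 1
      Else
        right + 1
      EndIf
    Until right > length Or gParse\numberOfTokens = #Tokeniser_MAXNUMSYMBOLSINALINE
    If right <= length And result ;The symbol array filled up before the end of the line.
      result = #False
    EndIf
  EndIf
  ProcedureReturn result
EndProcedure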
;///////////////////////////////////////////////////////////////////////////////////////////

Take care with your separator strings. In particular, if your text can contain quoted strings, add Chr(34) to your separator string: a quoted string is only kept together as a single symbol when the double quote is itself one of the separators. NOTE also how, in the example above, I have included a colon and a semi-colon in this string.
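The effect of Chr(34) is easiest to see side by side. The following is only a sketch: it assumes Tokeniser.pbi is saved in the same folder, and the ShowSymbols helper is made up for this illustration.

```purebasic
; Sketch only: assumes Tokeniser.pbi (from this post) is saved next to this file.
XIncludeFile "Tokeniser.pbi"

; Hypothetical helper, not part of the include: lists the symbols found so far.
Procedure ShowSymbols()
  Protected i
  For i = 0 To gParse\numberOfTokens - 1
    Debug gParse\tokens$[i]
  Next
EndProcedure

; A line whose quoted string contains a comma and a space (both separators).
line$ = "text$ = " + Chr(34) + "a, b" + Chr(34)

; Without Chr(34) in the separator string the quoted text is split at the comma and the space.
If PARSING_TokeniseCommand(line$, " ,;():")
  ShowSymbols()
EndIf

Debug "----"

; With Chr(34) included, the quoted string comes back as one symbol, quotes included.
If PARSING_TokeniseCommand(line$, " ,;():" + Chr(34))
  ShowSymbols()
EndIf
```

In the first run the quoted text ends up spread over several symbols; in the second it is kept whole, because the quote branch in the procedure is only reached for characters found in the separator string.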

Also, before anyone asks, I haven't used regular expressions because I am not as yet convinced that they are anything but slow and cumbersome!

Posted: **Fri Feb 15, 2008 4:20 pm**

Regular expressions are like APL: powerful but almost impossible to understand.

Posted: **Fri Feb 15, 2008 4:24 pm**

What's APL?

Posted: **Fri Feb 15, 2008 4:30 pm**

Sorry, I thought everyone knew: http://en.wikipedia.org/wiki/APL_%28pro ... anguage%29

Posted: **Fri Feb 15, 2008 4:55 pm**

Trim, taut, terrific!

Thanks for sharing.

Posted: **Fri Feb 15, 2008 5:33 pm**

<shudder>

APL reminds me of the code of sysadmin Perl scripters who pride themselves on writing a program that looks like 6 lines of ASCII garbage.

Posted: **Fri Feb 15, 2008 8:46 pm**

The following expression sorts a word list stored in matrix X according to word length:
Code: Select all
X[↑X+.≠' ';]

What the *&&*&*@: hell is that?

I think I'll stick with Purebasic!

@Dare : thanks.

Posted: **Fri Feb 15, 2008 10:41 pm**

Thank you srod.

Posted: **Fri Feb 15, 2008 11:28 pm**

You're welcome.

Posted: **Sat Feb 16, 2008 10:53 pm**

I immediately opened this topic because I thought it was called 'womanizer' but it turned out to be some sort of male bonding...

Posted: **Sun Feb 17, 2008 12:30 am**

blueznl wrote:I immediately opened this topic because I thought it was called 'womanizer' but it turned out to be some sort of male bonding and I prefer male bondage.

(uh oh, I think that srod fella may have gone too far this time!)

Posted: **Thu Feb 21, 2008 4:30 am**

srod wrote:
The following expression sorts a word list stored in matrix X according to word length:
Code: Select all
X[↑X+.≠' ';]

In my country if you presented this to me as "code" you would get a 10 day lockdown and a Thorazine drip.

Posted: **Thu Feb 21, 2008 10:13 am**

Dr_Wildrick wrote:
srod wrote:The following expression sorts a word list stored in matrix X according to word length:
Code: Select all
X[↑X+.≠' ';]
In my country if you presented this to me as "code" you would get a 10 day lockdown and a Thorazine drip.

Posted: **Thu Feb 21, 2008 10:49 am**

Dr_Wildrick wrote:
srod wrote:
The following expression sorts a word list stored in matrix X according to word length:
Code: Select all
X[↑X+.≠' ';]
In my country if you presented this to me as "code" you would get a 10 day lockdown and a Thorazine drip.

Aye, around this neighbourhood you'd probably be clubbed to death with a huge salami.

Who are those guys kidding with that kind of notation and syntax?

Posted: **Mon Aug 30, 2010 9:55 am**

Hi srod,

thanks for sharing!

I modified the code a little, so that no global variable is needed, but the tokens are returned in a linked list. And in case of a missing end quote, this function also returns the corrupt token.

Regards, Little John

Code: Select all

Procedure.i Tokenise (line$, separator$, List token$())
   ; in : line$     : Text to be tokenised
   ;      separator$: List of characters that act as delimiters;
   ;                  If 'line$' can contain quoted strings, then
   ;                  'separator$' should contain #DQUOTE$.
   ; out: token$()    : List of tokens
   ;      return value: #False if 'line$' cannot be parsed, and
   ;                    #True otherwise
   Protected char$, left, right, length, result=#True

   left  = 1
   right = 1
   length = Len(line$)
   ClearList(token$())
   
   While right <= length   
      char$ = Mid(line$, right, 1)
      If FindString(separator$, char$, 1)
         If left < right
            AddElement(token$())
            token$() = Mid(line$, left, right-left)
            left = right
         ElseIf char$ = #DQUOTE$     ; Open quote. left=right
            right = FindString(line$, char$, left+1)
            If right = 0             ; No end quote.
               right = length
               result = #False
            EndIf
            AddElement(token$())
            token$() = Mid(line$, left, right-left+1)
            right + 1
            left = right
         ElseIf char$ <> " "         ; left=right
            AddElement(token$())
            token$() = Mid(line$, left, 1)
            left  + 1
            right + 1
         Else         
            left  + 1
            right + 1
         EndIf
      ElseIf right = length
         right + 1
         AddElement(token$())
         token$() = Mid(line$, left, right-left)
      Else         
         right + 1
      EndIf
   Wend

   ProcedureReturn result
EndProcedure


#MY_SEPARATORS = " ,;():" + #DQUOTE$

NewList token$()

test$ = "Global a, b, Dim c(100), NewList x() : Global a, b"

; Parse the line of text:
If Tokenise(test$, #MY_SEPARATORS, token$()) = 0
   Debug "Missing end quote in last element."
EndIf

; Let us have a look at the symbols:
ForEach token$()
   Debug token$()
Next
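As a quick check of the missing end quote case, something along these lines could be used. It is a sketch under assumptions: the Tokenise() procedure above (without the demo lines that follow it) is taken to be saved on its own as "TokeniseList.pbi", a file name made up for this example.

```purebasic
; Sketch only: assumes the Tokenise() procedure above has been saved on its own
; as "TokeniseList.pbi" (a made-up file name) next to this file.
XIncludeFile "TokeniseList.pbi"

NewList token$()

; The quoted string is never closed.
line$ = "Debug " + #DQUOTE$ + "unfinished"

If Tokenise(line$, " ,;():" + #DQUOTE$, token$()) = #False
  ; The damaged token was added last, so it is the final list element.
  If LastElement(token$())
    Debug "Missing end quote in: " + token$()
  EndIf
EndIf
```

The list should then hold the symbol Debug followed by the unterminated string, which runs from the opening quote to the end of the line.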

Tokeniser!