Tokeniser!

srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Hi,

here is a small utility which I have used to rip apart FASM assembly source code (amongst other things), breaking each line down into a sequence of 'symbols' which can then be analysed in whatever way you see fit.

Whilst not a formal tokeniser as such, this could well form the basis of one with the addition of extra logic to incorporate some kind of token 'dictionary'. I have not needed to go that far, however.

The small example below (taken from an application I am currently working on) shows part of how I analyse PB source code when I am only interested in the 'Global' keyword. My 'separator' string is constructed with just the Global keyword in mind.

Example :

;///////////////////////////////////////////////////////////////////////////////////////////
;Example of using the 'tokeniser'.
;
;This very very contrived example shows how you might break down a line of PB source code
;built around the 'Global' keyword.
;///////////////////////////////////////////////////////////////////////////////////////////

  IncludePath #PB_Compiler_Home+"\Includes"
    XIncludeFile "Tokeniser.pbi"


;My test string which could have been ripped from a PB source file.
  test$ = "Global a, b, Dim c(100), NewList x() : Global a, b"

;The separator string which basically describes how to split lines of text.
;NOTE, if your text lines can contain quoted strings then you would add Chr(34) to this
;separator string. Note also how I have included a colon and a semi-colon in this string.
;These act as delimiters.
  #MY_SEPARATORS = " ,;():"

;Now parse the line of text.
  If PARSING_TokeniseCommand(test$, #MY_SEPARATORS)
    ;Let us have a look at the symbols.
    For i = 0 To gParse\numberOfTokens - 1
      Debug gParse\tokens$[i]
    Next  
  EndIf

;Voila - we are doneth! The symbol array is ready for analysis etc.

Tokeniser.pbi :

;///////////////////////////////////////////////////////////////////////////////////////////
;'Tokeniser'.
;
;February 2008.

;Developed with PureBasic 4.2 beta 2.
;Fully cross-platform.
;///////////////////////////////////////////////////////////////////////////////////////////

;///////////////////////////////////////////////////////////////////////////////////////////
;-NOTES.
; ======
; This small utility will parse a string of text and separate it into individual 'symbols' based
; upon a given 'separator' string (whose characters act as delimiters).
; The symbols are placed into the global variable 'gParse', together with a count of the number
; of symbols.
;
; Separator strings can be constructed to parse individual constructs (as in the attached
; example) or to parse entire languages (I used essentially this routine to parse FASM assembly
; code files without a problem).
;
; Of course this does not a 'formal' tokeniser make as the resulting symbols haven't been tokenised.
; This would be a relatively simple addition, however, but would require a 'dictionary' of tokens.
;///////////////////////////////////////////////////////////////////////////////////////////


;-CONSTANTS and STRUCTURES.
  #Tokeniser_MAXNUMSYMBOLSINALINE = 200
  
  Structure _tokeniserParseGlobals
    numberOfTokens.l
    tokens$[#Tokeniser_MAXNUMSYMBOLSINALINE]
  EndStructure

;-GLOBALS.
    Global gParse._tokeniserParseGlobals
  

;///////////////////////////////////////////////////////////////////////////////////////////
;The following function tokenises the given command.
;Returns zero if the line cannot be parsed (a quoted string has no closing quote).
;Parsing stops quietly once #Tokeniser_MAXNUMSYMBOLSINALINE symbols have been stored.
Procedure.l PARSING_TokeniseCommand(line$, separator$)
  Protected left, right, length, char$
  Protected result = #True
  gParse\numberOfTokens=0
  length=Len(line$)
  If length
    ;'left' marks the start of the current symbol, 'right' the character being examined.
    left=1 : right=1
    Repeat 
      char$=Mid(line$,right,1)
      If FindString(separator$, char$,1)
        If left<right ;A symbol ends just before this separator.
          gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,right-left)
          gParse\numberOfTokens+1
          left=right
        ElseIf char$=Chr(34) ;Open quote (here left=right).
          right = FindString(line$, char$,left+1)
          If right = 0 ;No end quote.
            result = #False
            Break
          EndIf
          ;Store the whole quoted string, quotes included, as a single symbol.
          gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,right-left+1)
          gParse\numberOfTokens+1
          right+1
          left = right
        ElseIf char$<>" " ;Any other separator is a symbol in its own right (here left=right).
          gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,1)
          gParse\numberOfTokens+1
          left+1 : right+1
        Else ;Spaces are skipped.
          left+1 : right+1
        EndIf
      ElseIf right=length ;Last character of the line; store the final symbol.
        right+1
        gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,right-left)
        gParse\numberOfTokens+1
      Else
        right+1
      EndIf
    Until right>length Or gParse\numberOfTokens = #Tokeniser_MAXNUMSYMBOLSINALINE
  EndIf
  ProcedureReturn result
EndProcedure
;///////////////////////////////////////////////////////////////////////////////////////////
Take care with your separator strings. In particular, if your text can contain quoted strings then add Chr(34) to your separator string. NOTE also how, in the example above, I have included a colon and a semi-colon in this string.
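
To illustrate the point about quoted strings, here is a minimal sketch. It assumes Tokeniser.pbi (as listed above) sits in the same folder as the test file; the test line and the two separator strings are made up purely for the demonstration.

```purebasic
XIncludeFile "Tokeniser.pbi" ;Assumed to be in the same folder as this file.

;A made-up test line:  Debug "a, b"
line$ = "Debug " + Chr(34) + "a, b" + Chr(34)

;Without Chr(34) in the separator string, the quoted text is split like any other text.
If PARSING_TokeniseCommand(line$, " ,")
  Debug "Without Chr(34): " + Str(gParse\numberOfTokens) + " symbols"
  For i = 0 To gParse\numberOfTokens - 1
    Debug gParse\tokens$[i]
  Next
EndIf

;With Chr(34) added, the quoted string comes back as a single symbol, quotes included.
If PARSING_TokeniseCommand(line$, " ," + Chr(34))
  Debug "With Chr(34): " + Str(gParse\numberOfTokens) + " symbols"
  For i = 0 To gParse\numberOfTokens - 1
    Debug gParse\tokens$[i]
  Next
EndIf
```

The first pass should break the quoted text apart at the comma and the space, whereas the second should report just two symbols: Debug and the complete quoted string.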

Also, before anyone asks :wink: , I haven't used regular expressions because I am not as yet convinced that they are anything but slow and cumbersome! :)
Last edited by srod on Tue May 22, 2012 11:47 am, edited 1 time in total.
I may look like a mule, but I'm not a complete ass.
jack
Addict
Posts: 1358
Joined: Fri Apr 25, 2003 11:10 pm

Post by jack »

Regular expressions are like APL: powerful, but almost impossible to understand.

Post by srod »

What's APL?

Post by jack »

Dare
Addict
Posts: 1965
Joined: Mon May 29, 2006 1:01 am
Location: Outback

Post by Dare »

Trim, taut, terrific!

Thanks for sharing.
Dare2 cut down to size
pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

<shudder>

APL reminds me of the code of sysadmin Perl scripters who pride themselves on writing a program that looks like 6 lines of ASCII garbage.
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein

Post by srod »

The following expression sorts a word list stored in matrix X according to word length:

X[↑X+.≠' ';]
What the *&&*&*@: hell is that? :shock: I think I'll stick with PureBasic! :wink:

@Dare : thanks. :)
SFSxOI
Addict
Posts: 2970
Joined: Sat Dec 31, 2005 5:24 pm
Location: Where ya would never look.....

Post by SFSxOI »

Thank you srod.

Post by srod »

You're welcome. :)
blueznl
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am

Post by blueznl »

I immediately opened this topic because I thought it was called 'womanizer' but it turned out to be some sort of male bonding...

:twisted:
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )

Post by srod »

blueznl wrote:I immediately opened this topic because I thought it was called 'womanizer' but it turned out to be some sort of male bonding and I prefer male bondage.

:twisted:
:twisted:

(uh oh, I think that srod fella may have gone too far this time! :) )
Dr_Wildrick
User
Posts: 36
Joined: Fri Feb 23, 2007 8:00 pm
Location: New York

That's code????

Post by Dr_Wildrick »

srod wrote:
The following expression sorts a word list stored in matrix X according to word length:

X[↑X+.≠' ';]
In my country, if you presented this to me as "code" you would get a 10-day lockdown and a Thorazine drip.
:lol:

Re: That's code????

Post by Dare »

Dr_Wildrick wrote:
srod wrote:The following expression sorts a word list stored in matrix X according to word length:

X[↑X+.≠' ';]
In my country, if you presented this to me as "code" you would get a 10-day lockdown and a Thorazine drip.
:lol:
:lol:

Re: That's code????

Post by srod »

Dr_Wildrick wrote:
srod wrote:
The following expression sorts a word list stored in matrix X according to word length:

X[↑X+.≠' ';]
In my country, if you presented this to me as "code" you would get a 10-day lockdown and a Thorazine drip.
:lol:
:lol:

Aye, around this neighbourhood you'd probably be clubbed to death with a huge salami.

Who are those guys kidding with that kind of notation and syntax?
Little John
Addict
Posts: 4777
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Tokeniser!

Post by Little John »

Hi srod,

thanks for sharing!

I modified the code a little, so that no global variable is needed, but the tokens are returned in a linked list. And in case of a missing end quote, this function also returns the corrupt token.

Regards, Little John

Procedure.i Tokenise (line$, separator$, List token$())
   ; in : line$     : Text to be tokenised
   ;      separator$: List of characters that act as delimiters;
   ;                  If 'line$' can contain quoted strings, then
   ;                  'separator$' should contain #DQUOTE$.
   ; out: token$()    : List of tokens
   ;      return value: #False if 'line$' cannot be parsed, and
   ;                    #True otherwise
   Protected char$, left, right, length, result=#True

   left  = 1
   right = 1
   length = Len(line$)
   ClearList(token$())
   
   While right <= length   
      char$ = Mid(line$, right, 1)
      If FindString(separator$, char$, 1)
         If left < right
            AddElement(token$())
            token$() = Mid(line$, left, right-left)
            left = right
         ElseIf char$ = #DQUOTE$     ; Open quote. left=right
            right = FindString(line$, char$, left+1)
            If right = 0             ; No end quote.
               right = length
               result = #False
            EndIf
            AddElement(token$())
            token$() = Mid(line$, left, right-left+1)
            right + 1
            left = right
         ElseIf char$ <> " "         ; left=right
            AddElement(token$())
            token$() = Mid(line$, left, 1)
            left  + 1
            right + 1
         Else         
            left  + 1
            right + 1
         EndIf
      ElseIf right = length
         right + 1
         AddElement(token$())
         token$() = Mid(line$, left, right-left)
      Else         
         right + 1
      EndIf
   Wend

   ProcedureReturn result
EndProcedure


#MY_SEPARATORS = " ,;():" + #DQUOTE$

NewList token$()

test$ = "Global a, b, Dim c(100), NewList x() : Global a, b"

; Parse the line of text:
If Tokenise(test$, #MY_SEPARATORS, token$()) = 0
   Debug "Missing end quote in last element."
EndIf

; Let us have a look at the symbols:
ForEach token$()
   Debug token$()
Next
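
srod's first post mentions that a token 'dictionary' would be needed to turn the symbols into proper tokens. Below is a rough sketch of how that lookup step might be done with a map. It assumes a PureBasic version that supports maps, and the constants, the `ClassifySymbol` procedure and the dictionary contents are all hypothetical names invented for this illustration; none of them belong to either tokeniser above.

```purebasic
; Hypothetical token classes (not part of the original utility).
Enumeration
  #TOKEN_Identifier   ; anything not found in the dictionary
  #TOKEN_Keyword
  #TOKEN_Punctuation
EndEnumeration

; Returns the class recorded for symbol$, or #TOKEN_Identifier if it is unknown.
Procedure.i ClassifySymbol(Map dictionary.i(), symbol$)
  ; Keys are stored in lower case because PB keywords are case-insensitive.
  If FindMapElement(dictionary(), LCase(symbol$))
    ProcedureReturn dictionary()
  EndIf
  ProcedureReturn #TOKEN_Identifier
EndProcedure

; A tiny example dictionary; a real one would list every keyword of interest.
NewMap dictionary.i()
dictionary("global")  = #TOKEN_Keyword
dictionary("dim")     = #TOKEN_Keyword
dictionary("newlist") = #TOKEN_Keyword
dictionary(",")       = #TOKEN_Punctuation
dictionary("(")       = #TOKEN_Punctuation
dictionary(")")       = #TOKEN_Punctuation
dictionary(":")       = #TOKEN_Punctuation

; A few symbols, entered by hand here in place of real tokeniser output.
NewList symbol$()
AddElement(symbol$()) : symbol$() = "Global"
AddElement(symbol$()) : symbol$() = "a"
AddElement(symbol$()) : symbol$() = ","
AddElement(symbol$()) : symbol$() = "Dim"

ForEach symbol$()
  Select ClassifySymbol(dictionary(), symbol$())
    Case #TOKEN_Keyword     : Debug symbol$() + " -> keyword"
    Case #TOKEN_Punctuation : Debug symbol$() + " -> punctuation"
    Default                 : Debug symbol$() + " -> identifier"
  EndSelect
Next
```

In practice the hand-filled `symbol$()` list would be replaced by the `token$()` list returned from `Tokenise()`.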