Tokeniser!
Posted: Fri Feb 15, 2008 4:14 pm
Hi,
here is a small utility which I have used to rip apart, amongst other things, FASM assembly source code, breaking each line down into a sequence of 'symbols' which can then be analysed in whatever way you see fit.
Whilst not a formal tokeniser as such, this could well form the basis of one by the addition of extra logic to incorporate some kind of token 'dictionary' etc. I have not needed to go that far however.
The small example given (taken from an application I am currently working on) shows part of how I am analysing PB source code in which I am only interested in the 'Global' keyword. My 'separator' string is constructed with just the Global keyword in mind.
Take care with your separator strings. In particular, if your text can contain quoted strings, add Chr(34) to your separator string. Note also how, in the example, I have included a colon and a semicolon in this string.
Also, before anyone asks, I haven't used regular expressions because I am not as yet convinced that they are anything but slow and cumbersome!
Example :
;///////////////////////////////////////////////////////////////////////////////////////////
;Example of using the 'tokeniser'.
;
;This very very contrived example shows how you might break down a line of PB source code
;built around the 'Global' keyword.
;///////////////////////////////////////////////////////////////////////////////////////////
IncludePath #PB_Compiler_Home+"\Includes"
XIncludeFile "Tokeniser.pbi"
;My test string which could have been ripped from a PB source file.
test$ = "Global a, b, Dim c(100), NewList x() : Global a, b"
;The separator string which basically describes how to split lines of text.
;NOTE, if your text lines can contain quoted strings then you would add Chr(34) to this
;separator string. Note also how I have included a colon and a semicolon in this string.
;These act as delimiters.
#MY_SEPARATORS = " ,;():"
;Now parse the line of text.
If PARSING_TokeniseCommand(test$, #MY_SEPARATORS)
  ;Let us have a look at the symbols.
  For i = 0 To gParse\numberOfTokens - 1
    Debug gParse\tokens$[i]
  Next
EndIf
;Voila - we are doneth! The symbol array is ready for analysis etc.
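The note about quoted strings can be tried with a small variation of the example. This is only a sketch: it assumes Tokeniser.pbi sits next to the source file, and the names `separators$` and `quoted$` are made up for the illustration. With Chr(34) in the separator string, the quoted text should come back as a single symbol, quotes included; without it, the text would be split at the comma and colon inside the quotes.

```purebasic
;Sketch only. Assumes Tokeniser.pbi is in the same folder as this file.
XIncludeFile "Tokeniser.pbi"

;Same separators as the example, plus the double quote character.
separators$ = " ,;():" + Chr(34)

;A line containing a quoted string with separators inside it.
quoted$ = "Global msg$ = " + Chr(34) + "a, b : c" + Chr(34) + " ; note"

If PARSING_TokeniseCommand(quoted$, separators$)
  For i = 0 To gParse\numberOfTokens - 1
    Debug gParse\tokens$[i]
  Next
Else
  ;A zero return here means an opening quote had no closing quote.
  Debug "Unterminated quoted string."
EndIf
```

Note that the '=' is not in the separator string, so it is only picked out here because it has a space on either side.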
Tokeniser.pbi :
;///////////////////////////////////////////////////////////////////////////////////////////
;'Tokeniser'.
;
;February 2008.
;Developed with Purebasic 4.2 beta 2.
;Fully cross-platform.
;///////////////////////////////////////////////////////////////////////////////////////////
;///////////////////////////////////////////////////////////////////////////////////////////
;-NOTES.
; ======
; This small utility will parse a string of text and separate it into individual 'symbols' based
; upon a given 'separator' string (whose characters act as delimiters).
; The symbols are placed into the global variable 'gParse', together with a count of the number
; of symbols.
;
; Separator strings can be constructed to parse individual constructs (as in the attached
; example) or to parse entire languages (I used essentially this routine to parse FASM assembly
; code files without a problem).
;
; Of course this does not a 'formal' tokeniser make as the resulting symbols haven't been tokenised.
; This would be a relatively simple addition, however, but would require a 'dictionary' of tokens.
;///////////////////////////////////////////////////////////////////////////////////////////
;-CONSTANTS and STRUCTURES.
#Tokeniser_MAXNUMSYMBOLSINALINE = 200
Structure _tokeniserParseGlobals
  numberOfTokens.l
  tokens$[#Tokeniser_MAXNUMSYMBOLSINALINE]
EndStructure
;-GLOBALS.
Global gParse._tokeniserParseGlobals
;///////////////////////////////////////////////////////////////////////////////////////////
;The following function tokenises the given command.
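;Returns zero if the line cannot be parsed (an opening quote has no matching end quote).
;Parsing stops, without an error, once #Tokeniser_MAXNUMSYMBOLSINALINE symbols have been stored.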
Procedure.l PARSING_TokeniseCommand(line$, separator$)
  Define left, right, length, char$, i
  Define result = #True
  gParse\numberOfTokens=0
  length=Len(line$)
  If length
    ;'left' marks the start of the current symbol, 'right' is the scan position.
    left=1 : right=1
    Repeat
      char$=Mid(line$,right,1)
      If FindString(separator$, char$,1)
        If left<right ;A symbol ends just before this separator.
          gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,right-left)
          gParse\numberOfTokens+1
          left=right
        ElseIf char$=Chr(34) ;Open quote (here left=right).
          right = FindString(line$, char$,left+1)
          If right = 0 ;No end quote.
            result = 0
            Break
          EndIf
          ;Store the whole quoted string, quotes included, as one symbol.
          gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,right-left+1)
          gParse\numberOfTokens+1
          right+1
          left = right
        ElseIf char$<>" " ;Non-space separator (here left=right), stored as a symbol of its own.
          gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,1)
          gParse\numberOfTokens+1
          left+1 : right+1
        Else ;Spaces are skipped.
          left+1 : right+1
        EndIf
      ElseIf right=length ;Last character of the line, so store the final symbol.
        right+1
        gParse\tokens$[gParse\numberOfTokens]=Mid(line$,left,right-left)
        gParse\numberOfTokens+1
      Else
        right+1
      EndIf
    Until right>length Or gParse\numberOfTokens = #Tokeniser_MAXNUMSYMBOLSINALINE
  EndIf
  ProcedureReturn result
EndProcedure
;///////////////////////////////////////////////////////////////////////////////////////////
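The notes in Tokeniser.pbi mention that a 'dictionary' of tokens would be needed to turn the symbols into real tokens. As a rough sketch of what that extra step might look like, the following maps a symbol to a numeric token id. It is not part of the utility: the `#TOKEN_` constants and the `ClassifySymbol` procedure are hypothetical names, and a hard-coded Select stands in for a proper dictionary.

```purebasic
;Sketch only: hypothetical token ids for a handful of PB keywords.
Enumeration
  #TOKEN_Unknown
  #TOKEN_Global
  #TOKEN_Dim
  #TOKEN_NewList
  #TOKEN_Punctuation
EndEnumeration

;Returns a token id for the given symbol (keywords are matched case-insensitively).
Procedure.l ClassifySymbol(symbol$)
  Define result = #TOKEN_Unknown
  Select LCase(symbol$)
    Case "global"
      result = #TOKEN_Global
    Case "dim"
      result = #TOKEN_Dim
    Case "newlist"
      result = #TOKEN_NewList
    Case ",", ";", "(", ")", ":"
      result = #TOKEN_Punctuation
  EndSelect
  ProcedureReturn result
EndProcedure

;Try it on a few hand-written symbols.
Dim sample$(3)
sample$(0) = "Global" : sample$(1) = "a" : sample$(2) = "," : sample$(3) = "b"
For i = 0 To 3
  Debug sample$(i) + " -> " + Str(ClassifySymbol(sample$(i)))
Next
```

In practice you would run the same loop over gParse\tokens$[i] after a successful call to PARSING_TokeniseCommand(). Anything that comes back as #TOKEN_Unknown (variable names, numbers and so on) would need further rules of its own.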