Tokenizer
Posted: Thu Jun 18, 2009 8:28 am
Hello.
I recently needed a way to tokenize some of my source files (for a precompiler), and came up with the code below.
I think it works, but I'm definitely no regular expression wizard. The regex that finds the tokens can probably be improved a thousandfold, so if you have any suggestions, please post them!
Code:
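; One match per line of text. Assumes CRLF line endings; a final line without CRLF is not matched.
RegexLines = CreateRegularExpression ( #PB_Any , ".*\r\n" )
; Alternatives are tried left to right, so multi-character operators must come before their one-character prefixes.
; <> is listed before < and > so the inequality operator comes out as a single token.
RegexTokens = CreateRegularExpression ( #PB_Any , #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$ + "|[\*]?[a-zA-Z_]+[\w]*[\x24]?|#[a-zA-Z_]+[\w]*[\x24]?|[\[\]\(\)\{\}]|[-+]?[0-9]*\.?[0-9]+|;.*|\.|\+|-|[&@!\\\/\*,\|]|::|:|<>|>>|<<|=>{1}|>={1}|<={1}|=<{1}|={1}|<{1}|>{1}|\x24+[0-9a-fA-F]+|\%[0-1]*|%|'" )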
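If ReadFile ( 0 , #PB_Compiler_Home + "\Examples\Sources\GadgetAdvanced.pb" )
  Length = Lof ( 0 )
  *Memory = AllocateMemory ( Length )
  If Not *Memory : End : EndIf
  ReadData ( 0 , *Memory , Length )
  CloseFile ( 0 )
  ; The buffer is not null-terminated, so pass the length explicitly.
  ; Assumes the source file is ASCII encoded.
  String$ = PeekS ( *Memory , Length , #PB_Ascii )
  FreeMemory ( *Memory )
Else
  End
EndIf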
Dim Lines$ ( 0 )
Dim Tokens$ ( 0 )
LineCount = ExtractRegularExpression ( RegexLines , String$ , Lines$ ( ) )
For LineCounter = 0 To LineCount - 1
  TokenCount = ExtractRegularExpression ( RegexTokens , Lines$ ( LineCounter ) , Tokens$ ( ) )
  Debug ""
  Debug "Line " + Str ( LineCounter + 1 )
  For TokenCounter = 0 To TokenCount - 1
    Debug "Token " + Str ( TokenCounter + 1 ) + ": " + Tokens$ ( TokenCounter )
  Next
Next
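One direction for cleaning up the token regex, as a rough sketch rather than a tested replacement: build the pattern from named pieces so each token class can be tweaked on its own. The part names (StringLit$, Comment$, and so on) and the sample line are made up for this example. Unlike the code above, the number pattern here drops the optional sign, so + and - always come out as operators; that is a deliberate simplification, not necessarily what a precompiler wants.

```purebasic
; Sketch only: the part names are invented for illustration.
Define StringLit$  = #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$
Define Comment$    = ";.*"
Define HexNumber$  = "\$[0-9a-fA-F]+"          ; e.g. $FF
Define BinNumber$  = "%[01]+"                  ; e.g. %1010
Define Number$     = "[0-9]*\.?[0-9]+"         ; unsigned integer or float
Define Identifier$ = "[*#]?[a-zA-Z_]\w*\$?"    ; variable, *pointer, #constant, name$
; Two-character operators first, then a class holding all one-character tokens
Define Operator$   = "<>|<<|>>|<=|=<|>=|=>|::|[-+*/\\&|!~@%<>=.,:()\[\]{}']"

; Alternatives are tried left to right, so the order of the parts matters
Define Pattern$ = StringLit$ + "|" + Comment$ + "|" + HexNumber$ + "|" + BinNumber$ + "|" + Number$ + "|" + Identifier$ + "|" + Operator$

Define Sample$ = "If a <> $FF : b$ = " + #DOUBLEQUOTE$ + "x" + #DOUBLEQUOTE$ + " ; note"
Define Regex = CreateRegularExpression ( #PB_Any , Pattern$ )
If Regex
  Dim SketchTokens$ ( 0 )
  Define Count = ExtractRegularExpression ( Regex , Sample$ , SketchTokens$ ( ) )
  Define i
  For i = 0 To Count - 1
    Debug SketchTokens$ ( i )
  Next
  FreeRegularExpression ( Regex )
Else
  ; Show why the pattern failed to compile
  Debug RegularExpressionError ( )
EndIf
```

Keeping the multi-character operators ahead of the single-character class is what should let <> and :: come out as one token each; anything added to Operator$ later needs the same ordering.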