Tokenizer

Posted: Thu Jun 18, 2009 8:28 am
by eesau
Hello.

I recently needed a way to tokenize some of my sources (for a precompiler), and came up with this:

Code:

RegexLines = CreateRegularExpression ( #PB_Any , ".*\r\n" )
; Alternatives are tried left to right: strings, identifiers, constants, brackets, numbers, comments, operators, hex and binary literals
RegexTokens = CreateRegularExpression ( #PB_Any , #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$ + "|[\*]?[a-zA-Z_]+[\w]*[\x24]?|#[a-zA-Z_]+[\w]*[\x24]?|[\[\]\(\)\{\}]|[-+]?[0-9]*\.?[0-9]+|;.*|\.|\+|-|[&@!\\\/\*,\|]|::|:|<>|>>|<<|=>{1}|>={1}|<={1}|=<{1}|={1}|<{1}|>{1}|\x24+[0-9a-fA-F]+|\%[0-1]*|%|'" )

If ReadFile ( 0 , #PB_Compiler_Home + "\Examples\Sources\GadgetAdvanced.pb" )

	Length = Lof ( 0 )
	*Memory = AllocateMemory ( Length )
		
	If Not *Memory : End : EndIf
	
	ReadData ( 0 , *Memory , Length )
	CloseFile ( 0 )
					
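	; The buffer is not null-terminated, so pass the length explicitly (assumes an ASCII source file)
	String$ = PeekS ( *Memory , Length , #PB_Ascii )
	FreeMemory ( *Memory )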
	
Else

	End
	
EndIf

Dim Lines$ ( 0 )
Dim Tokens$ ( 0 )

LineCount = ExtractRegularExpression ( RegexLines , String$ , Lines$ ( ) )

For LineCounter = 0 To LineCount - 1

	TokenCount = ExtractRegularExpression ( RegexTokens , Lines$ ( LineCounter ) , Tokens$ ( ) )

	Debug ""
	Debug "Line " + Str ( LineCounter + 1 )

	For TokenCounter = 0 To TokenCount - 1
		
		Debug "Token " + Str ( TokenCounter + 1 ) + ":   " + Tokens$ ( TokenCounter )
		
	Next
		
Next
I think it works, but I'm definitely no regular expression wizard. The regex that finds the tokens can probably be improved a thousandfold, so if you have any suggestions for improvements, please post them!

Posted: Fri Jun 19, 2009 6:04 am
by fsw
Nice one.
The Regular Expression Library simplifies things a lot.

fsw

Posted: Fri Jun 19, 2009 10:54 am
by srod
fsw wrote:Nice one.
The Regular Expression Library simplifies things a lot.

fsw
hehe, are you kidding? That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor! :)

No offence eesau, looks like some very slick code there; I just hate reg exps! :) Not sure that using a reg exp for anything but a very simple tokeniser would be suitable; too slow probably. I wait to be corrected though on that score as my knowledge of reg exps is about on a par with my knowledge of nuclear physics - a bit sketchy at best! :wink:

Posted: Fri Jun 19, 2009 12:55 pm
by eesau
srod wrote:That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor! :)
Understatement of the year :)
srod wrote:No offence eesau, looks like some very slick code there; I just hate reg exps! :) Not sure that using a reg exp for anything but a very simple tokeniser would be suitable; too slow probably.
None taken! And yes, it's probably slower with large sources than traditional tokenizers, but fast enough for me with small ones.
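
For comparison, here is a rough sketch of what a "traditional" tokenizer looks like: it walks the text one character at a time with a pointer and cuts out runs of characters by hand. This is only an illustration, not code from anyone in this thread. The procedure names IsWordChar and ScanLine are made up, and it only recognises runs of word characters (letters, digits, underscore), skipping everything else.

```purebasic
; Returns #True if the character code is a letter, a digit or an underscore.
; (Hypothetical helper, named for this sketch only.)
Procedure IsWordChar ( CharCode )

	If ( CharCode >= 'a' And CharCode <= 'z' ) Or ( CharCode >= 'A' And CharCode <= 'Z' ) Or ( CharCode >= '0' And CharCode <= '9' ) Or CharCode = '_'
		ProcedureReturn #True
	EndIf

	ProcedureReturn #False

EndProcedure

; Walks one line with a character pointer and prints every run of word characters.
Procedure ScanLine ( Line$ )

	Protected *Cursor.Character
	Protected *Start

	; An empty string may have no buffer at all, so bail out early
	If Line$ = "" : ProcedureReturn : EndIf

	*Cursor = @Line$

	While *Cursor\c

		If IsWordChar ( *Cursor\c )

			; Remember where the run starts, then advance to its end
			*Start = *Cursor

			While IsWordChar ( *Cursor\c )
				*Cursor + SizeOf ( Character )
			Wend

			; The length passed to PeekS is counted in characters, not bytes
			Debug PeekS ( *Start , ( *Cursor - *Start ) / SizeOf ( Character ) )

		Else

			; Anything else (spaces, operators, brackets) is skipped in this sketch
			*Cursor + SizeOf ( Character )

		EndIf

	Wend

EndProcedure

ScanLine ( "Length = Lof ( 0 )" )
```

A full tokenizer written this way needs one branch per token kind (strings, comments, operators and so on), which is why it ends up so much longer than the regex version, though it avoids building a string array for every line.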

Posted: Fri Jun 19, 2009 1:05 pm
by srod
eesau wrote:
srod wrote:That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor! :)
Understatement of the year :)
You haven't seen how much my dog eats! :wink:

I understand what you are saying about small sources - in those cases yes, very concise code. Must admit that I use reg exps to check individual tokens; e.g. checking whether a token represents a valid variable name etc. Anything bigger though and it's roll the sleeves up, allocate a memory buffer and grab hold of a few pointers etc. Nowhere near as concise as using a reg exp! :)
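
Checking a single token against a regex could look something like the sketch below. The pattern is an assumption for illustration, not srod's actual code: an optional leading '*', a letter or underscore, any number of word characters, and an optional '$' suffix. It ignores type suffixes such as ".l".

```purebasic
; Anchored with ^ and $ so the whole token has to match, not just a part of it
RegexName = CreateRegularExpression ( #PB_Any , "^\*?[a-zA-Z_]\w*\$?$" )

If RegexName

	; MatchRegularExpression returns non-zero when the string matches
	Debug MatchRegularExpression ( RegexName , "Counter" )
	Debug MatchRegularExpression ( RegexName , "Name$" )
	Debug MatchRegularExpression ( RegexName , "*Memory" )

	; Starts with a digit, so this one should give 0
	Debug MatchRegularExpression ( RegexName , "9Lives" )

	FreeRegularExpression ( RegexName )

EndIf
```

A short anchored pattern like this stays readable, which is probably why it works well for per-token checks even if a whole-language pattern does not.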

Posted: Fri Jun 19, 2009 2:10 pm
by jack
What we need is a regular expression translator/compiler, where you write the expression in human readable form then feed it to the translator to output a regex.
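
Short of a real translator, one low-tech step in that direction is to build the pattern from named pieces and join them with "|". The sketch below does that for a tokenizer pattern. The Part...$ names and the simplified alternatives are assumptions made for illustration, so this is not a drop-in replacement for eesau's regex (for one thing, the sign of a number is left to the operator piece here).

```purebasic
; One named fragment per token kind
PartString$     = #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$
PartComment$    = ";.*"
PartIdentifier$ = "\*?[a-zA-Z_]\w*\$?"      ; optional pointer '*', optional string '$'
PartConstant$   = "#[a-zA-Z_]\w*\$?"
PartHex$        = "\$[0-9a-fA-F]+"
PartBinary$     = "%[01]+"
PartNumber$     = "[0-9]*\.?[0-9]+"
PartOperator$   = "<>|<=|>=|=<|=>|<<|>>|::|[-+*/\\&|!@,.:=<>%()\[\]{}]"

; Order matters: alternatives are tried left to right, so multi-character
; operators are listed before the single-character class inside PartOperator$
Pattern$ = PartString$ + "|" + PartComment$ + "|" + PartIdentifier$ + "|" + PartConstant$
Pattern$ = Pattern$ + "|" + PartHex$ + "|" + PartBinary$ + "|" + PartNumber$ + "|" + PartOperator$

RegexTokens = CreateRegularExpression ( #PB_Any , Pattern$ )

If RegexTokens

	Dim Tokens$ ( 0 )

	Sample$ = "If a$ <> " + #DOUBLEQUOTE$ + "x" + #DOUBLEQUOTE$ + " : b = $FF ; done"
	TokenCount = ExtractRegularExpression ( RegexTokens , Sample$ , Tokens$ ( ) )

	For TokenCounter = 0 To TokenCount - 1
		Debug Tokens$ ( TokenCounter )
	Next

	FreeRegularExpression ( RegexTokens )

EndIf
```

The resulting regex is just as ugly once assembled, but each piece can be read, tested and changed on its own.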

Posted: Sat Jun 20, 2009 7:00 pm
by luis
jack wrote:where you write the expression in human readable form then feed it to the translator to output a regex.
http://www.regexbuddy.com/

If someone knows something better than this please let me (us) know...

Posted: Wed Sep 16, 2009 8:41 pm
by rsts
luis wrote:
jack wrote:where you write the expression in human readable form then feed it to the translator to output a regex.
http://www.regexbuddy.com/

If someone knows something better than this please let me (us) know...
Possibly their new product http://regexMagic.com ?

cheers

Posted: Wed Sep 16, 2009 9:51 pm
by luis
Oh, thank you. I'll look into it!

Posted: Wed Sep 16, 2009 11:21 pm
by jack
yes, it looks promising :)

Posted: Thu Sep 17, 2009 3:29 am
by rsts
The demo looks promising, but then I know very little about regexes.

On the one hand, this approach could leave me regex illiterate, since the tool does the work for me. On the other hand, if it generates quality code, as I imagine it will, and I study the output and learn from it, it will be worth it.

RegexBuddy has an option to generate the code for RealBasic and several other languages, but not PB :(

Don't want to take this topic into "offtopic" land, but any other comments on either tool?

cheers

Posted: Thu Sep 17, 2009 9:22 am
by Little John
Hi!

A tool that I like is Regex Coach. It might not be as powerful as the other tools mentioned before, but it's very useful anyway, and since it's free you can just try it without risk. It's even portable.

Regards, Little John