Tokenizer

eesau
Enthusiast
Posts: 589
Joined: Fri Apr 27, 2007 12:38 pm
Location: Finland

Post by eesau »

Hello.

I recently needed a way to tokenize some of my sources (for a precompiler), and came up with this:

Code:

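; Matches one line of source text, including its terminating CRLF
RegexLines = CreateRegularExpression ( #PB_Any , ".*\r\n" )
; One alternative per token kind: strings, names, constants, brackets, numbers, comments, operators, hex and binary literals
RegexTokens = CreateRegularExpression ( #PB_Any , #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$ + "|[\*]?[a-zA-Z_]+[\w]*[\x24]?|#[a-zA-Z_]+[\w]*[\x24]?|[\[\]\(\)\{\}]|[-+]?[0-9]*\.?[0-9]+|;.*|\.|\+|-|[&@!\\\/\*,\|]|::|:|<>|>>|<<|=>{1}|>={1}|<={1}|=<{1}|={1}|<{1}|>{1}|\x24+[0-9a-fA-F]+|\%[0-1]*|%|'" )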

If ReadFile ( 0 , #PB_Compiler_Home + "\Examples\Sources\GadgetAdvanced.pb" )

	Length = Lof ( 0 )
	*Memory = AllocateMemory ( Length )
		
	If Not *Memory : End : EndIf
	
	ReadData ( 0 , *Memory , Length )
	CloseFile ( 0 )
					
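	; The buffer holds exactly Length bytes and no terminating null, so the length is passed explicitly.
	; #PB_Ascii assumes the source file uses a single-byte encoding.
	String$ = PeekS ( *Memory , Length , #PB_Ascii )
	FreeMemory ( *Memory )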
	
Else

	End
	
EndIf

Dim Lines$ ( 0 )
Dim Tokens$ ( 0 )

LineCount = ExtractRegularExpression ( RegexLines , String$ , Lines$ ( ) )

For LineCounter = 0 To LineCount - 1

	TokenCount = ExtractRegularExpression ( RegexTokens , Lines$ ( LineCounter ) , Tokens$ ( ) )

	Debug ""
	Debug "Line " + Str ( LineCounter + 1 )

	For TokenCounter = 0 To TokenCount - 1
		
		Debug "Token " + Str ( TokenCounter + 1 ) + ":   " + Tokens$ ( TokenCounter )
		
	Next
		
Next
I think it works, but I'm definitely no regular expression wizard. The regex that finds the tokens can probably be improved a thousandfold, so if you have any suggestions for improvements, please post them!
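
One thing to watch in the code above: the line pattern ".*\r\n" only matches lines that end in CRLF, so a file with LF-only endings, or a last line without a trailing newline, would lose those lines. Below is a minimal sketch of one possible alternative that normalises the endings first and then splits with StringField. SplitLines and SourceLines$ are hypothetical names used only for illustration, and StringField is not fast on very large texts.

```purebasic
; Hypothetical helper (not part of the original code): splits Text$ into lines
; regardless of CRLF, LF or CR endings and returns the number of lines.
Procedure.i SplitLines ( Text$ , List Lines$ ( ) )

	Protected Count , Index

	; Normalise every line ending to a single LF
	Text$ = ReplaceString ( Text$ , #CRLF$ , #LF$ )
	Text$ = ReplaceString ( Text$ , #CR$ , #LF$ )

	; Drop one trailing newline so it does not produce an empty last line
	If Right ( Text$ , 1 ) = #LF$
		Text$ = Left ( Text$ , Len ( Text$ ) - 1 )
	EndIf

	ClearList ( Lines$ ( ) )
	Count = CountString ( Text$ , #LF$ ) + 1

	; Blank lines are kept, so line numbers stay aligned with the source
	For Index = 1 To Count
		AddElement ( Lines$ ( ) )
		Lines$ ( ) = StringField ( Text$ , Index , #LF$ )
	Next

	ProcedureReturn Count

EndProcedure

NewList SourceLines$ ( )
Debug SplitLines ( "a = 1" + #CRLF$ + "b = 2" + #LF$ + "c = 3" , SourceLines$ ( ) )

ForEach SourceLines$ ( )
	Debug SourceLines$ ( )
Next
```

Each list element can then be passed to ExtractRegularExpression with the token regex, exactly as Lines$ ( LineCounter ) is in the loop above.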
fsw
Addict
Posts: 1603
Joined: Tue Apr 29, 2003 9:18 pm
Location: North by Northwest

Post by fsw »

Nice one.
The Regular Expression Library simplifies things a lot.

fsw
srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

fsw wrote:Nice one.
The Regular Expression Library simplifies things a lot.

fsw
hehe, are you kidding? That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor! :)

No offence eesau, looks like some very slick code there; I just hate reg exps! :) Not sure that using a reg exp for anything but a very simple tokeniser would be suitable; too slow probably. I wait to be corrected though on that score as my knowledge of reg exps is about on a par with my knowledge of nuclear physics - a bit sketchy at best! :wink:
I may look like a mule, but I'm not a complete ass.

Post by eesau »

srod wrote:That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor! :)
Understatement of the year :)
srod wrote:No offence eesau, looks like some very slick code there; I just hate reg exps! :) Not sure that using a reg exp for anything but a very simple tokeniser would be suitable; too slow probably.
None taken! And yes, it's probably slower with large sources than traditional tokenizers, but fast enough for me with small ones.
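
For comparison, a "traditional" tokenizer walks the text one character at a time. The following is only a rough sketch of that idea, not a replacement for the regex above: it recognises runs of letters, digits and underscores (so names and plain integers come out the same way) and treats every other non-blank character as a token of its own. ScanTokens, IsNameChar and Found$ are hypothetical names.

```purebasic
; Hypothetical helper: true for characters allowed inside a word run
Procedure.i IsNameChar ( c )
	If ( c >= 'a' And c <= 'z' ) Or ( c >= 'A' And c <= 'Z' ) Or ( c >= '0' And c <= '9' ) Or c = '_'
		ProcedureReturn #True
	EndIf
	ProcedureReturn #False
EndProcedure

; Hypothetical sketch of a character-by-character scanner.
; Fills Tokens$ ( ) and returns the number of tokens found.
Procedure.i ScanTokens ( Text$ , List Tokens$ ( ) )

	Protected *Cursor.Character , *Start.Character

	ClearList ( Tokens$ ( ) )
	If Text$ = "" : ProcedureReturn 0 : EndIf

	*Cursor = @Text$

	While *Cursor\c ; stop at the terminating null character

		If *Cursor\c = ' ' Or *Cursor\c = #TAB Or *Cursor\c = #CR Or *Cursor\c = #LF
			; Skip whitespace
			*Cursor + SizeOf ( Character )

		ElseIf IsNameChar ( *Cursor\c )
			; Collect a run of letters, digits and underscores
			*Start = *Cursor
			While IsNameChar ( *Cursor\c )
				*Cursor + SizeOf ( Character )
			Wend
			AddElement ( Tokens$ ( ) )
			Tokens$ ( ) = PeekS ( *Start , ( *Cursor - *Start ) / SizeOf ( Character ) )

		Else
			; Any other character becomes a one-character token
			AddElement ( Tokens$ ( ) )
			Tokens$ ( ) = Chr ( *Cursor\c )
			*Cursor + SizeOf ( Character )

		EndIf

	Wend

	ProcedureReturn ListSize ( Tokens$ ( ) )

EndProcedure

NewList Found$ ( )
ScanTokens ( "Count = Count + 10" , Found$ ( ) )

ForEach Found$ ( )
	Debug Found$ ( )
Next
```

Strings, comments, multi-character operators and hex literals would each need a branch of their own, which is where the extra code compared to the regex version comes from.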

Post by srod »

eesau wrote:
srod wrote:That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor! :)
Understatement of the year :)
You haven't seen how much my dog eats! :wink:

I understand what you are saying about small sources - in those cases yes, very concise code. Must admit that I use reg exps to check individual tokens; e.g. checking whether a token represents a valid variable name etc. Anything bigger though and it's roll the sleeves up, allocate a memory buffer and grab ahold of a few pointers etc. Nowhere near as concise as using a reg exp! :)
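
Checking a single token against a small pattern could look roughly like this. It is a minimal sketch under a simplified naming rule (a letter or underscore, then word characters, then an optional $), which is an assumption and not the compiler's exact rule. IsValidName and NameRegex are hypothetical names.

```purebasic
; Anchored pattern: the whole token has to match, not just a part of it
Global NameRegex
NameRegex = CreateRegularExpression ( #PB_Any , "\A[a-zA-Z_]\w*\$?\z" )

; Hypothetical helper: returns non-zero if Token$ looks like a variable name
; under the simplified rule described above.
Procedure.i IsValidName ( Token$ )

	If NameRegex = 0
		ProcedureReturn #False ; the pattern could not be compiled
	EndIf

	ProcedureReturn MatchRegularExpression ( NameRegex , Token$ )

EndProcedure

Debug IsValidName ( "Counter" )
Debug IsValidName ( "Name$" )
Debug IsValidName ( "2fast" )
```

Because the pattern is tiny and compiled only once, this kind of check stays readable, unlike one large expression that tries to cover every token kind.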
jack
Addict
Posts: 1358
Joined: Fri Apr 25, 2003 11:10 pm

Post by jack »

What we need is a regular expression translator/compiler, where you write the expression in human readable form then feed it to the translator to output a regex.
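
Short of a real translator, one low-tech way to get closer to "human readable" is to give each alternative a name and join the pieces. The sketch below does that for a subset of the token kinds from the first post. The variable names and the sample line are made up for illustration, and the operator list is shortened.

```purebasic
; Each token kind gets its own named piece (a subset of the pattern from the first post)
Quoted$   = #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$
Name$     = "[\*]?[a-zA-Z_]+[\w]*[\x24]?"
Constant$ = "#[a-zA-Z_]+[\w]*[\x24]?"
Number$   = "[-+]?[0-9]*\.?[0-9]+"
Comment$  = ";.*"
Hex$      = "\x24+[0-9a-fA-F]+"

; Alternatives are tried from left to right, so longer operators come first
Operator$ = "<>|>>|<<|>=|<=|=|<|>|\+|-|[&@!\\\/\*,\|]"

Pattern$ = Quoted$ + "|" + Name$ + "|" + Constant$ + "|" + Number$ + "|" + Comment$ + "|" + Hex$ + "|" + Operator$

Regex = CreateRegularExpression ( #PB_Any , Pattern$ )

If Regex

	Dim Tokens$ ( 0 )
	Count = ExtractRegularExpression ( Regex , "Total = Price * 2 ; sum" , Tokens$ ( ) )

	For Index = 0 To Count - 1
		Debug Tokens$ ( Index )
	Next

Else

	; Shows why the pattern was rejected
	Debug RegularExpressionError ( )

EndIf
```

The resulting pattern is no shorter, but a mistake in one piece is easier to spot than in a single long string.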
luis
Addict
Posts: 3895
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Post by luis »

jack wrote:where you write the expression in human readable form then feed it to the translator to output a regex.
http://www.regexbuddy.com/

If someone knows something better than this please let me (us) know...
rsts
Addict
Posts: 2736
Joined: Wed Aug 24, 2005 8:39 am
Location: Southwest OH - USA

Post by rsts »

luis wrote:
jack wrote:where you write the expression in human readable form then feed it to the translator to output a regex.
http://www.regexbuddy.com/

If someone knows something better than this please let me (us) know...
Possibly their new product http://regexMagic.com ?

cheers

Post by luis »

Oh, thank you. I'll look into it !
"Have you tried turning it off and on again ?"

Post by jack »

yes, it looks promising :)

Post by rsts »

The demo looks promising, but then I know very little about regexes.

On the one hand it seems that this approach could lead to my still being regex illiterate, just let the tool do it for me, but if it generates quality code, as I imagine it will, AND I study the output and learn from it, it will be worth it.

RegexBuddy has an option to generate the code for RealBasic and several other languages, but not PB :(

Don't want to take this topic into "offtopic" land, but any other comments on either tool?

cheers
Little John
Addict
Posts: 4791
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John »

Hi!

A tool that I like is Regex Coach. It might not be as powerful as the other tools mentioned before, but it's very useful anyway, and since it's free you can just try it without risk. It's even portable.

Regards, Little John