PureBasic Forums - English

Posted: **Thu Jun 18, 2009 8:28 am**

Hello.

I recently needed a way to tokenize some of my sources (for a precompiler), and came up with this:

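; Matches one line including its trailing CRLF (a final line without CRLF is not matched)
RegexLines = CreateRegularExpression ( #PB_Any , ".*\r\n" )
; Alternatives are tried left to right: strings, identifiers, #constants, brackets, numbers, comments, operators, hex and binary literals.
; "<>" replaces the original "\|<>", which only matched the literal text "|<>"; the redundant {1} quantifiers are dropped.
RegexTokens = CreateRegularExpression ( #PB_Any , #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$ + "|[\*]?[a-zA-Z_]+[\w]*[\x24]?|#[a-zA-Z_]+[\w]*[\x24]?|[\[\]\(\)\{\}]|[-+]?[0-9]*\.?[0-9]+|;.*|\.|\+|-|[&@!\\\/\*,\|]|::|:|<>|>>|<<|=>|>=|<=|=<|=|<|>|\x24+[0-9a-fA-F]+|\%[0-1]*|%|'" )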

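If ReadFile ( 0 , #PB_Compiler_Home + "\Examples\Sources\GadgetAdvanced.pb" )

	Length = Lof ( 0 )
	*Memory = AllocateMemory ( Length )
		
	; Close the file before bailing out if the buffer could not be allocated
	If Not *Memory : CloseFile ( 0 ) : End : EndIf
	
	ReadData ( 0 , *Memory , Length )
	CloseFile ( 0 )
					
	; The buffer has no terminating null, so pass the length explicitly (assumes an ASCII source file)
	String$ = PeekS ( *Memory , Length , #PB_Ascii )
	FreeMemory ( *Memory )
	
Else

	End
	
EndIf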

Dim Lines$ ( 0 )
Dim Tokens$ ( 0 )

LineCount = ExtractRegularExpression ( RegexLines , String$ , Lines$ ( ) )

For LineCounter = 0 To LineCount - 1

	TokenCount = ExtractRegularExpression ( RegexTokens , Lines$ ( LineCounter ) , Tokens$ ( ) )

	Debug ""
	Debug "Line " + Str ( LineCounter + 1 )

	For TokenCounter = 0 To TokenCount - 1
		
		Debug "Token " + Str ( TokenCounter + 1 ) + ":   " + Tokens$ ( TokenCounter )
		
	Next
		
Next

I think it works, but I'm definitely no regular expression wizard. The regex that finds the tokens can probably be improved a thousandfold, so if you have any suggestions for improvements, please post them!
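One possible improvement, offered only as a sketch: the line regex ".*\r\n" needs a CRLF after every line, so a last line without one (or a file with LF-only line endings) is dropped. Splitting with the string functions avoids that. Sample$ below is a made-up stand-in for the file contents (String$ in the code above).

```purebasic
; Sketch: split text into lines without a regex.
; Sample$ is an illustrative stand-in for the file contents.
Sample$ = "a = 1" + #CRLF$ + "b = 2" + #LF$ + "c = 3"

; Normalize CRLF to LF first, then split on LF
Sample$ = ReplaceString ( Sample$ , #CRLF$ , #LF$ )

; One more line than there are line breaks
LineCount = CountString ( Sample$ , #LF$ ) + 1

For LineCounter = 1 To LineCount
	; StringField uses a 1-based index
	Debug "Line " + Str ( LineCounter ) + ": " + StringField ( Sample$ , LineCounter , #LF$ )
Next
```

If the text ends with a line break, the last field is simply an empty string. StringField rescans the string on every call, so this is meant for small sources only.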

Posted: **Fri Jun 19, 2009 6:04 am**

Nice one.
The Regular Expression Library simplifies things a lot.

fsw

Posted: **Fri Jun 19, 2009 10:54 am**

fsw wrote:Nice one.
The Regular Expression Library simplifies things a lot.

fsw

hehe, are you kidding? That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor!

No offence eesau, looks like some very slick code there; I just hate reg exps!

Not sure that using a reg exp for anything but a very simple tokeniser would be suitable; too slow probably. I wait to be corrected though on that score as my knowledge of reg exps is about on a par with my knowledge of nuclear physics - a bit sketchy at best!

Posted: **Fri Jun 19, 2009 12:55 pm**

srod wrote:That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor!

Understatement of the year

srod wrote:No offence eesau, looks like some very slick code there; I just hate reg exps! Not sure that using a reg exp for anything but a very simple tokeniser would be suitable; too slow probably.

None taken! And yes, it's probably slower with large sources than traditional tokenizers, but fast enough for me with small ones.

Posted: **Fri Jun 19, 2009 1:05 pm**

eesau wrote:
srod wrote:That particular RegExp looks like my dog's breakfast after it has been thrown up all over the kitchen floor!
Understatement of the year

You haven't seen how much my dog eats!

I understand what you are saying about small sources - in those cases yes, very concise code. Must admit that I use reg exps to check individual tokens; e.g. checking whether a token represents a valid variable name etc. Anything bigger though and it's roll the sleeves up, allocate a memory buffer and grab hold of a few pointers etc. Nowhere near as concise as using a reg exp!
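A minimal sketch of that kind of single-token check, for illustration only. The procedure name IsVariableName is made up, and the naming rule is an assumption borrowed from the identifier part of the tokenizer regex in the first post (optional leading *, a letter or underscore, word characters, optional trailing $), not an official definition.

```purebasic
; Sketch: check whether a single token looks like a valid variable name.
; IsVariableName is a hypothetical helper; the rule is an assumption (see above).
Procedure.i IsVariableName ( Token$ )
	
	Static Regex
	
	; Compile the pattern once; ^ and $ force the whole token to match
	If Regex = 0
		Regex = CreateRegularExpression ( #PB_Any , "^\*?[a-zA-Z_]\w*\$?$" )
	EndIf
	
	If Regex = 0 : ProcedureReturn #False : EndIf
	
	If MatchRegularExpression ( Regex , Token$ )
		ProcedureReturn #True
	EndIf
	
	ProcedureReturn #False
	
EndProcedure

Debug IsVariableName ( "Counter" ) ; 1
Debug IsVariableName ( "*Memory" ) ; 1
Debug IsVariableName ( "String$" ) ; 1
Debug IsVariableName ( "9Lives" )  ; 0
```

This only checks the shape of the name; it does not reject reserved keywords.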

Posted: **Fri Jun 19, 2009 2:10 pm**

What we need is a regular expression translator/compiler, where you write the expression in human-readable form and then feed it to the translator to output a regex.
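Short of a real translator, a low-tech way to get part of the way there is to build the pattern from named pieces. The sketch below is only an illustration: the variable names are made up, it covers just a subset of the token types, and the sub-patterns are simplified from the regex in the first post.

```purebasic
; Sketch: assemble a token pattern from named, readable pieces.
; All names are illustrative; only a subset of the token types is covered.
StringPart$     = #DOUBLEQUOTE$ + "[^" + #DOUBLEQUOTE$ + "]*" + #DOUBLEQUOTE$
IdentifierPart$ = "\*?[a-zA-Z_]\w*\$?"
ConstantPart$   = "#[a-zA-Z_]\w*\$?"
HexPart$        = "\$[0-9a-fA-F]+"
NumberPart$     = "[-+]?[0-9]*\.?[0-9]+"
CommentPart$    = ";.*"

; Alternatives are tried left to right, so the order matters
Pattern$ = StringPart$ + "|" + IdentifierPart$ + "|" + ConstantPart$ + "|" + HexPart$ + "|" + NumberPart$ + "|" + CommentPart$

Regex = CreateRegularExpression ( #PB_Any , Pattern$ )

If Regex
	
	Dim Found$ ( 0 )
	Count = ExtractRegularExpression ( Regex , "Value$ = $FF + 10 ; sum" , Found$ ( ) )
	
	For Index = 0 To Count - 1
		Debug Found$ ( Index )
	Next
	
EndIf
```

Characters that match none of the pieces (the operators here) are skipped, so a full tokenizer would still need the operator alternatives from the original pattern.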

Posted: **Sat Jun 20, 2009 7:00 pm**

jack wrote:where you write the expression in human readable form then feed it to the translator to output a regex.

http://www.regexbuddy.com/

If someone knows something better than this please let me (us) know...

Posted: **Wed Sep 16, 2009 8:41 pm**

luis wrote:
jack wrote:where you write the expression in human readable form then feed it to the translator to output a regex.
http://www.regexbuddy.com/

If someone knows something better than this please let me (us) know...

Possibly their new product http://regexMagic.com ?

cheers

Posted: **Wed Sep 16, 2009 9:51 pm**

Oh, thank you. I'll look into it!

Posted: **Wed Sep 16, 2009 11:21 pm**

Yes, it looks promising.

Posted: **Thu Sep 17, 2009 3:29 am**

The demo looks promising, but then I know very little about regexes.

On the one hand, this approach could leave me regex illiterate, since the tool does the work for me. But if it generates quality code, as I imagine it will, and I study the output and learn from it, it will be worth it.

RegexBuddy has an option to generate the code for RealBasic and several other languages, but not PB :(

Don't want to take this topic into "offtopic" land, but any other comments on either tool?

cheers

Posted: **Thu Sep 17, 2009 9:22 am**

Hi!

A tool that I like is Regex Coach. It might not be as powerful as the other tools mentioned before, but it's very useful anyway, and since it's free you can just try it without risk. It's even portable.

Regards, Little John


Tokenizer
