Help with RegExp library for PB

Posted: Fri Mar 10, 2006 1:55 am
by neomember
I'm trying to use regular expressions to parse an HTML file.

I've done some tests with the RegExp library for PB by 'FloHimself', found at the link below:
http://www.purearea.net/pb/english/userlibs.php

The expression <HTML\b[^>]*> should match the tag <HTML> in:
<HTML><HEAD><TITLE>PureBasic : ...i/TITLE></HEAD><BODY></BODY></HTML>

I want to remove the tag.

This is the code I've tried:

Code:

; Simple example on using the RegExp PureBasic library
; FloHimself (FloHimself@web.de) - Oct 14, 2003

;*Reg = RegComp ("(<TITLE>|<title>)(.*)(</TITLE>|</title>)") ; compiles a regular expression 

*Reg = RegComp ("<HTML\b[^>]*>") ; compiles a regular expression 
Html$ = "<HTML><HEAD><TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE></HEAD><BODY></BODY></HTML>"
RegExec(*Reg, Html$)  ; Returns 1 for success (match) and 0 for failure (no match)
Title$ = Space(500)          ; Size destination buffer to store substitution
RegSub(*Reg, "\2", Title$)   ; Copy substitution to destination buffer
Debug Title$

It doesn't work.

I don't understand the second argument of the RegSub() function ("\2").
The help file says:
RegSub()

Syntax

RegSub(*Reg, Source$, Dest$)
Description

RegSub(*Reg, Source$, Dest$) copies Source$ to Dest$, making substitutions according to the most recent regexec performed using *Reg. Size the Dest$ buffer large enough to store the substitution, otherwise a runtime error may occur!
Each instance of '&' in Source$ is replaced by the substring. Each instance of '\n', where n is a digit, is replaced by a stored substring. To get a literal '&' or '\n' into Dest$, prefix it with \; to get a literal \ preceding '&' or \n, prefix it with another \.

Has anybody had any luck with that library?

Here's a link for testing regular expressions:
http://www.javaregex.com/test.html
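
The substitution rules quoted from the help file can be modelled in plain PureBasic without the library. The sketch below only illustrates those rules as they are written in the help text; it is not the library's actual code. ExpandTemplate() and the Groups$() array are hypothetical names, with Groups$(0) standing for the whole match and Groups$(1) to Groups$(9) for the stored substrings.

```purebasic
; Illustration only: ExpandTemplate() and Groups$() are hypothetical names,
; not part of the RegExp library. Groups$(0) holds the whole match,
; Groups$(1) to Groups$(9) hold the stored substrings.
Global Dim Groups$(9)

Procedure.s ExpandTemplate(Template$)
  Protected Result$, Char$, NextChar$, Digit, i
  i = 1
  While i <= Len(Template$)
    Char$     = Mid(Template$, i, 1)
    NextChar$ = Mid(Template$, i + 1, 1)
    Digit     = Asc(NextChar$) - '0'      ; 0..9 only if NextChar$ is a digit
    If Char$ = "&"
      Result$ + Groups$(0)                ; '&' inserts the whole match
    ElseIf Char$ = "\" And Digit >= 0 And Digit <= 9
      Result$ + Groups$(Digit)            ; '\n' inserts stored substring n
      i + 1
    ElseIf Char$ = "\" And (NextChar$ = "&" Or NextChar$ = "\")
      Result$ + NextChar$                 ; '\&' and '\\' insert a literal character
      i + 1
    Else
      Result$ + Char$                     ; everything else is copied as-is
    EndIf
    i + 1
  Wend
  ProcedureReturn Result$
EndProcedure

; Sample values, made up for the illustration
Groups$(0) = "<TITLE>PureBasic</TITLE>"
Groups$(1) = "<TITLE>"
Groups$(2) = "PureBasic"
Groups$(3) = "</TITLE>"

Debug ExpandTemplate("\2")        ; stored substring 2
Debug ExpandTemplate("[&]")       ; whole match inside brackets
Debug ExpandTemplate("\& and \\") ; literal & and literal \
```

Read this way, a template of "\2" only produces text when the compiled expression has at least two groups in round brackets.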

Posted: Fri Mar 10, 2006 3:33 am
by neomember
OK, I found out that the second argument is for something called a "backreference".

I still need help getting any expression to work.

From:
http://www.regular-expressions.info/brackets.html

Backreference
Besides grouping part of a regular expression together, round brackets also create a "backreference". A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.

To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets.
Example:
(<TITLE>|<title>)(.*)(</TITLE>|</title>)

Applied to:
<HTML><HEAD><TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE></HEAD><BODY></BODY></HTML>


"&" = the entire regex match (can be used multiple times)
"\0" = the entire regex match, as backreference zero
"\1" = the match of the first group (backreference 1)
"\2" = the match of the second group (backreference 2)
"\3" = the match of the third group (backreference 3)
... = and so on...

so...

"&" = <TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE>

"&&" = <TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE><TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE>

"\0" = <TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE>

"\1" = <TITLE>

"\2" = PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler

"\3" = </TITLE>

Posted: Fri Mar 10, 2006 7:58 pm
by neomember
Well... I guess the library doesn't support all expressions, then.
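
The group numbering from the backreference walkthrough can be checked against the library itself. This is a minimal sketch, assuming FloHimself's RegExp userlib is installed and that RegComp(), RegExec() and RegSub() behave as described in the help text quoted earlier. The shortened Html$ string and the variable names are made up for the example.

```purebasic
; Assumes the RegExp userlib is installed. The three functions are called
; the same way as in the example that ships with the library.
*Reg  = RegComp("(<TITLE>|<title>)(.*)(</TITLE>|</title>)")
Html$ = "<HTML><HEAD><TITLE>PureBasic</TITLE></HEAD><BODY></BODY></HTML>"

If RegExec(*Reg, Html$)          ; 1 = match, 0 = no match
  Whole$ = Space(500)            ; each buffer must be large enough for its result
  Tag$   = Space(500)
  Text$  = Space(500)
  RegSub(*Reg, "&",  Whole$)     ; whole match
  RegSub(*Reg, "\1", Tag$)       ; first group: the opening tag
  RegSub(*Reg, "\2", Text$)      ; second group: the text between the tags
  Debug Whole$
  Debug Tag$
  Debug Text$
Else
  Debug "no match"
EndIf
```

Testing the result of RegExec() before calling RegSub() keeps a failed match from being mistaken for an empty substitution.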

Posted: Sat Mar 11, 2006 2:57 am
by Armoured
Hi neomember. :)

Code:

; Simple example on using the RegExp PureBasic library
; FloHimself (FloHimself@web.de) - Oct 14, 2003

;*Reg = RegComp ("(<TITLE>|<title>)(.*)(</TITLE>|</title>)") ; compiles a regular expression

*Reg = RegComp ("(<HTML>)(.*)") ; compiles a regular expression
Html$ = "<HTML><HEAD><TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE></HEAD><BODY></BODY></HTML>"
RegExec(*Reg, Html$)  ; Returns 1 for success (match) and 0 for failure (no match)
Title$ = Space(500)          ; Size destination buffer to store substitution
RegSub(*Reg, "\2", Title$)   ; Copy substitution to destination buffer
Debug Title$

And if you want to remove both the "<HTML>" and the "</HTML>":

Code:

; Simple example on using the RegExp PureBasic library
; FloHimself (FloHimself@web.de) - Oct 14, 2003

;*Reg = RegComp ("(<TITLE>|<title>)(.*)(</TITLE>|</title>)") ; compiles a regular expression

*Reg = RegComp ("(<HTML>)(.*)(</HTML>)") ; compiles a regular expression
Html$ = "<HTML><HEAD><TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE></HEAD><BODY></BODY></HTML>"
RegExec(*Reg, Html$) ; Returns 1 for success (match) and 0 for failure (no match)
Title$ = Space(500) ; Size destination buffer to store substitution
RegSub(*Reg, "\2", Title$) ; Copy substitution to destination buffer
Debug Title$
bye!
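
When the tags to remove are fixed text, as in the two examples above, PureBasic's built-in ReplaceString() may be enough on its own. This is a sketch of that alternative, not something proposed in the thread; the shortened Html$ string is made up for the example.

```purebasic
; Built-in string functions only; no RegExp library needed.
; Only the exact text "<HTML>" and "</HTML>" is removed (case-sensitive);
; a tag with attributes, such as <HTML lang="en">, would not be matched.
Html$     = "<HTML><HEAD><TITLE>PureBasic</TITLE></HEAD><BODY></BODY></HTML>"
Stripped$ = ReplaceString(Html$, "<HTML>", "")
Stripped$ = ReplaceString(Stripped$, "</HTML>", "")
Debug Stripped$
```

A regular expression is still the better fit once the tag can carry attributes or vary in case.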