ReplaceRegularExpression
Posted: Fri Nov 07, 2008 12:02 pm
by lionel_om
Hi all,
I have an issue: I'd like to use advanced regex replacements.
Normally, when we create a regex, we can define capture groups (with parentheses). Then, in the replace functions, we can insert a captured group into the replacement string.
In PHP we use "\\x", where x is the index of the group. But that doesn't work in PB.
Here is my code:
Code: Select all
Procedure.s Ereg_Replace(Text$, Pattern$, Replace$ = "", Options.l = #PB_RegularExpression_DotAll | #PB_RegularExpression_Extended | #PB_RegularExpression_AnyNewLine)
  hRegex = CreateRegularExpression(#PB_Any, Pattern$, Options)
  If hRegex
    Text$ = ReplaceRegularExpression(hRegex, Text$, Replace$)
    FreeRegularExpression(hRegex)
  Else
    Debug "Can't create a Regex with this pattern : " + Pattern$
  EndIf
  ProcedureReturn Text$
EndProcedure

; HTML code : removes tags properties (id, class, name, onXXX, ...)
Text$ = "<a onclick='test'></a>"
Text$ = Ereg_Replace(Text$, "<([a-zA-Z]+)\ *[^>]+>", "<\\1>")
Debug Text$
It doesn't work. It replaces every tag that has properties with the literal text "<\\1>".
Please share any tips or fixes you have.
Regards
/Lionel
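For comparison, here is a minimal sketch of the group-reference replacement described in the post above. It is written in Python rather than PureBasic, because Python's `re.sub` is documented to expand `\1` in the replacement string; it says nothing about how PB's ReplaceRegularExpression treats such references. The name `strip_attributes` is hypothetical, and the pattern is an assumption that differs slightly from the one above: it requires whitespace after the tag name (`\s+[^>]*`), so tags without attributes are left untouched.

```python
import re

# Hypothetical helper, not part of any library: keep the tag name
# (capture group 1) and drop everything up to the closing '>'.
# Assumption: attribute values never contain a literal '>'.
TAG_WITH_ATTRIBUTES = re.compile(r"<([a-zA-Z]+)\s+[^>]*>")


def strip_attributes(html: str) -> str:
    # r"<\1>" re-inserts the captured tag name into the replacement.
    return TAG_WITH_ATTRIBUTES.sub(r"<\1>", html)


print(strip_attributes("<a onclick='test'></a>"))
```

Closing tags such as `</a>` never match, because `/` is not a letter, so they pass through unchanged.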
Posted: Fri Nov 07, 2008 5:02 pm
by lionel_om
Solved, thanks to AND51 and this post:
http://www.purebasic.fr/english/viewtop ... c&start=15:
Replacement strings should be written this way: "/\1".
/Lio
Posted: Thu Dec 18, 2008 3:25 pm
by lionel_om
Hi all,
This doesn't seem to work anymore.
Here is an example:
Code: Select all
Options.l = #PB_RegularExpression_DotAll | #PB_RegularExpression_Extended | #PB_RegularExpression_AnyNewLine
Pattern$ = "<([a-zA-Z]+)\ *[^>]+>"
Text$ = " <p class=hello> <a href=test><script> text inside </script></a> after"
Replace$ = "</\1>"
hRegex = CreateRegularExpression(#PB_Any, Pattern$, Options)
If hRegex
  Debug ReplaceRegularExpression(hRegex, Text$, Replace$)
  FreeRegularExpression(hRegex)
Else
  Debug "Can't create a Regex with this pattern : " + Pattern$
EndIf
It's a simple Regex to remove the properties of every HTML tag.
Does anyone have an idea how it could be fixed?
Thanks
/Lio
Posted: Thu Dec 18, 2008 4:40 pm
by AND51
lionel_om wrote:It's a simple Regex to remove the properties of every HTML tag.
Do I understand you right?
You just want to eliminate the attributes?
<a href="..."> -> <a>
<img src="..."> -> <img>
<p class="hello"> -> <p>
Posted: Thu Dec 18, 2008 4:43 pm
by lionel_om
AND51 wrote:Do I understand you right?
You just want to eliminate the attributes?
You're right !
Posted: Thu Dec 18, 2008 4:47 pm
by AND51
I'm working on it...
Posted: Thu Dec 18, 2008 5:02 pm
by AND51
This could have been a possible solution, but I didn't know that lookbehind assertions must have a fixed length.
Code: Select all
Procedure.s RemoveHtmlAttributes(html$)
  ; The lookbehind (?<=<\w+) has a variable length, so the expression
  ; is rejected and the error branch below is taken.
  Protected exp=CreateRegularExpression(#PB_Any, "(?Us)(?<=<\w+).*(?=>)")
  If Not exp
    Debug RegularExpressionError()
    End
  EndIf
  html$=ReplaceRegularExpression(exp, html$, "")
  FreeRegularExpression(exp)
  ProcedureReturn html$
EndProcedure

Define test.s="<a href=http://www.and51.de>Click this <hr size=6>image to get there <img src='images/logo.png' border=0></a>"
Debug RemoveHtmlAttributes(test)
Posted: Thu Dec 18, 2008 5:09 pm
by AND51
This version gives my lookbehind a fixed length, because I count from 1 to 15 in a For loop and build one expression per tag-name length.
The lookbehind makes sure that A, P, CENTER, BODY, etc. remain, while their attributes are eliminated. The longest HTML tag that quickly came to my mind is BLOCKQUOTE (10 letters). Is there any tag that is longer? To catch all tags, my For loop counts up to 15 (15 should be enough).
Code: Select all
Procedure.s RemoveHtmlAttributes(html$)
  Protected exp, n
  For n=1 To 15 ; <BLOCKQUOTE> has 10 letters, but to be sure we take 15
    ; One expression per tag-name length n keeps the lookbehind fixed-length
    exp=CreateRegularExpression(#PB_Any, "(?Us)(?<=<\w{"+Str(n)+"})\s.*(?=>)")
    If exp ; skip this length if the expression could not be created
      html$=ReplaceRegularExpression(exp, html$, "")
      FreeRegularExpression(exp)
    EndIf
  Next
  ProcedureReturn html$
EndProcedure

Define test.s="<a href=http://www.and51.de>Click this <hr "+#CRLF$+"size=6>image To get there <img src='images/logo.png' border=0></a>"
Debug RemoveHtmlAttributes(test)
Posted: Thu Dec 18, 2008 5:20 pm
by AND51
Sorry for spamming.
You can also use a Repeat loop to keep replacing automatically until there is nothing left to replace.
Although this solution copies the string (and thus needs twice the memory), it might be the fastest, because it only counts as far as necessary.
For example, if there is no tag longer than 6 letters (e.g. CENTER), then my loop only counts up to 7.
Code: Select all
Procedure.s RemoveHtmlAttributes(html$)
  Protected exp, n=1, old.s
  ; Caveat: the loop also stops at the first length for which no tag has
  ; attributes, so a longer tag after such a gap would be left untouched.
  Repeat
    old=html$
    exp=CreateRegularExpression(#PB_Any, "(?Us)(?<=<\w{"+Str(n)+"})\s.*(?=>)")
    If exp = 0 ; stop if the expression could not be created
      Break
    EndIf
    html$=ReplaceRegularExpression(exp, html$, "")
    FreeRegularExpression(exp)
    n+1
  Until html$ = old ; until there's nothing left to replace
  ProcedureReturn html$
EndProcedure

Define test.s="<a href=http://www.and51.de>Click this <hr "+#CRLF$+"size=6>image To get there <img src='images/logo.png' border=0></a>"
Debug RemoveHtmlAttributes(test)
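The fixed-length lookbehind rule discussed in these posts is not unique to PB's regex engine: Python's `re` module also refuses a variable-width lookbehind. As a rough sketch of the same counting idea in Python (not PureBasic), the code below builds one pattern per tag-name length. The names `remove_html_attributes` and `max_len` are hypothetical, and since Python has no ungreedy `(?U)` flag, the lazy quantifier `.*?` is swapped in as an assumption about the intended behaviour.

```python
import re


def remove_html_attributes(html: str, max_len: int = 15) -> str:
    """Strip attributes using one fixed-width lookbehind per name length."""
    for n in range(1, max_len + 1):
        # (?<=<\w{n}) is fixed-width because n is a concrete number here.
        pattern = re.compile(r"(?<=<\w{%d})\s.*?(?=>)" % n, re.DOTALL)
        html = pattern.sub("", html)
    return html


# A variable-width lookbehind (\w+) is rejected at compile time.
try:
    re.compile(r"(?<=<\w+).*?(?=>)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

test = ("<a href=http://www.and51.de>Click this <hr \r\nsize=6>image to get "
        "there <img src='images/logo.png' border=0></a>")
print(remove_html_attributes(test))
print("variable-width lookbehind accepted:", variable_width_ok)
```

Counting up to a fixed `max_len` mirrors the For-loop version; it always tries every length, so a gap in tag-name lengths does not end the loop early.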
Posted: Thu Dec 18, 2008 5:45 pm
by lionel_om
Thanks for your help AND51.
I know what a lookahead is. I prefer the approach I tried, as it is faster and consumes fewer resources. It was working before, but it isn't working anymore, which is strange.
I've read the documentation of the Regex plugin, but I can't find the syntax for retrieving the captured group.
I hope someone can come up with a solution.
/Lio
Posted: Thu Dec 18, 2008 5:53 pm
by lionel_om
AND51 wrote:Sorry for spamming.
Your help is welcome !!!
/Lio