
Strip away all HTML - Regular Expression

Posted: Mon Sep 14, 2009 4:37 pm
by naw
Hi, I posted this as an answer to another thread, but it's a useful trick and shows how powerful Regular Expressions are.
Basically, this short routine strips away all the HTML leaving only the content.

RESULT1$ only strips out the HTML (so words may run together or be left with stray spaces)
RESULT2$ replaces each HTML tag with a single space, then collapses any repeated spaces and trims the ends

Code:

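; Regex 0 matches one tag: "<", then anything that is not ">", then ">"
CreateRegularExpression(0, "<[^>]+>")
; Regex 1 matches a run of two or more spaces (two ReplaceString passes can still leave doubles behind)
CreateRegularExpression(1, " {2,}")

STRING$="hELLO<P ALIGN=left HEIGHT=22>Hello World</p>     wORLD <Br>"   ;  Yes I know this is bullsh*t HTML
RESULT1$=ReplaceRegularExpression(0, STRING$,"")
RESULT2$=Trim(ReplaceRegularExpression(1, ReplaceRegularExpression(0, STRING$," ")," "))
Debug RESULT1$
Debug RESULT2$
End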

Re: Strip away all HTML - Regular Expression

Posted: Mon Sep 14, 2009 4:43 pm
by dige
naw wrote:Hi, I posted this as an answer to another thread, but it's a useful trick and shows how powerful Regular Expressions are.
Indeed! Thanks for sharing.
It's a pity that my lonely brain cell is overstrained by reading or creating regular expressions.

Re: Strip away all HTML - Regular Expression

Posted: Mon Sep 14, 2009 9:33 pm
by UserOfPure
naw wrote:this short routine strips away all the HTML leaving only the content
Totally fails with http://www.google.com and most sites' HTML pages, sorry to say. Even this thread's HTML.

Re: Strip away all HTML - Regular Expression

Posted: Tue Sep 15, 2009 3:07 pm
by naw
UserOfPure wrote:
naw wrote:this short routine strips away all the HTML leaving only the content
Totally fails with http://www.google.com and most sites' HTML pages, sorry to say. Even this thread's HTML.
Well, it essentially just strips away all "<" & ">" signs and anything that's inside them, so yes, there are limitations, e.g.:

Code:

<HEAD>
<TITLE>Demo of stuff that wont work</TITLE>
<STYLE>
  H1 { color: blue; line-height: 10px }
</STYLE>
</HEAD>
<H1>Hello World</H1>
<SCRIPT>
 window.open("x","y")
</SCRIPT>
would likely show:

Code:

Demo of stuff that wont work H1 { color: blue; line-height: 10px } Hello World window.open("x","y")
which obviously you don't want, but a couple more Regular Expressions could soon be written to remove any embedded style sheets or JavaScript.
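
As a rough sketch of that idea (not tested against real-world pages): one extra expression can remove whole STYLE and SCRIPT blocks, contents included, before the tag-stripping pass runs. The pattern, the #Flags constant, and the HTML$ and NoBlocks$ variables are assumptions made up for illustration. It relies on the #PB_RegularExpression_DotAll and #PB_RegularExpression_NoCase flags so that "." spans line breaks and tag names match in any case.

```purebasic
; Sketch only: the patterns and names here are illustrative assumptions.
; Regex 0 removes <style>...</style> and <script>...</script> blocks, contents included.
; Regex 1 removes every remaining tag.
#Flags = #PB_RegularExpression_DotAll | #PB_RegularExpression_NoCase

If CreateRegularExpression(0, "<(style|script)\b[^>]*>.*?</\1\s*>", #Flags) And CreateRegularExpression(1, "<[^>]+>")
  HTML$ = "<HEAD><TITLE>Demo</TITLE><STYLE> H1 { color: blue } </STYLE></HEAD>"
  HTML$ = HTML$ + "<H1>Hello World</H1><SCRIPT> window.open('x','y') </SCRIPT>"

  ; First pass: drop the style and script blocks as a whole
  NoBlocks$ = ReplaceRegularExpression(0, HTML$, " ")
  ; Second pass: drop the remaining tags, then trim the ends
  Debug Trim(ReplaceRegularExpression(1, NoBlocks$, " "))

  FreeRegularExpression(0)
  FreeRegularExpression(1)
Else
  ; Reports why a pattern failed to compile
  Debug RegularExpressionError()
EndIf
```

With this input only the title and heading text should survive, separated by runs of spaces that could be collapsed afterwards. It is still pattern matching rather than real parsing, so comments, CDATA sections, or a ">" inside an attribute value can still trip it up.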

It was meant as a bit of a taster of Regular Expressions and how powerful they are.