Strip away all HTML - Regular Expression

naw
Enthusiast
Posts: 573
Joined: Fri Apr 25, 2003 4:57 pm

Post by naw »

Hi, I posted this as an answer to another thread, but it's a useful trick and shows how powerful Regular Expressions are.
Basically, this short routine strips away all the HTML tags, leaving only the content.

RESULT1$ only strips off the tags (so words on either side of a tag may end up run together).
RESULT2$ replaces each tag with a single space, then collapses repeated spaces and trims the ends.

Code: Select all

; 0: one tag - "<", then anything that is not ">", then ">"
CreateRegularExpression(0, "<[^>]+>")
; 1: a run of two or more spaces
CreateRegularExpression(1, " {2,}")

STRING$ = "hELLO<P ALIGN=left HEIGHT=22>Hello World</p>     wORLD <Br>"   ; Yes, I know this is bullsh*t HTML
RESULT1$ = ReplaceRegularExpression(0, STRING$, "")
RESULT2$ = Trim(ReplaceRegularExpression(1, ReplaceRegularExpression(0, STRING$, " "), " "))
Debug RESULT1$
Debug RESULT2$
End
Ta - N
dige
Addict
Posts: 1410
Joined: Wed Apr 30, 2003 8:15 am
Location: Germany

Re: Strip away all HTML - Regular Expression

Post by dige »

naw wrote:Hi, I posted this as an answer to another thread, but it's a useful trick and shows how powerful Regular Expressions are.
Indeed! Thanks for sharing.
It's a pity that my lonely brain cell is overstrained by reading or writing regular expressions.
UserOfPure
Enthusiast
Posts: 469
Joined: Sun Mar 16, 2008 9:18 am

Re: Strip away all HTML - Regular Expression

Post by UserOfPure »

naw wrote:this short routine strips away all the HTML leaving only the content
Totally fails with http://www.google.com and most sites' HTML pages, sorry to say. Even this thread's HTML.
naw
Enthusiast
Posts: 573
Joined: Fri Apr 25, 2003 4:57 pm

Re: Strip away all HTML - Regular Expression

Post by naw »

UserOfPure wrote:
naw wrote:this short routine strips away all the HTML leaving only the content
Totally fails with http://www.google.com and most sites' HTML pages, sorry to say. Even this thread's HTML.
Well, it essentially just strips away every "<" and ">" and anything that's between them, so yes, there are limitations, e.g.:

Code: Select all

<HEAD>
<TITLE>Demo of stuff that wont work</TITLE>
<STYLE>
  H1 { color: blue; line-height: 10px }
</STYLE>
</HEAD>
<H1>Hello World</H1>
<SCRIPT>
 window.open("x","y")
</SCRIPT>
would likely show:

Code: Select all

Demo of stuff that wont work H1 { color: blue; line-height: 10px } Hello World window.open("x","y")
which obviously you don't want, but a couple more Regular Expressions could soon be written to remove any embedded style sheets or JavaScript.

It was meant as a bit of a taster of Regular Expressions and how powerful they are.
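As a rough sketch of what those extra expressions might look like (this is an assumption, not something tested against real-world pages): one pattern removes whole STYLE and SCRIPT elements together with their contents, a second removes the remaining tags, and a third collapses whitespace. The patterns, the variable names HTML$ and TEXT$, and the expression numbers are all made up for illustration. Only CreateRegularExpression, ReplaceRegularExpression, Trim and the #PB_RegularExpression_DotAll / #PB_RegularExpression_NoCase flags come from PureBasic itself.

```purebasic
; Sketch only: these patterns are illustrative assumptions.
; DotAll lets "." span line breaks, NoCase makes <STYLE> and <style> equivalent.
Flags = #PB_RegularExpression_DotAll | #PB_RegularExpression_NoCase

; 0: a whole <style>...</style> or <script>...</script> element, contents included
CreateRegularExpression(0, "<(style|script)\b[^>]*>.*?</\1>", Flags)
; 1: any remaining tag
CreateRegularExpression(1, "<[^>]+>")
; 2: any run of whitespace (spaces, tabs, line breaks)
CreateRegularExpression(2, "\s+")

HTML$ = "<HEAD><TITLE>Demo</TITLE><STYLE> H1 { color: blue } </STYLE></HEAD>"
HTML$ = HTML$ + "<H1>Hello World</H1><SCRIPT> window.open('x','y') </SCRIPT>"

TEXT$ = ReplaceRegularExpression(0, HTML$, " ")        ; drop style sheets and scripts first
TEXT$ = ReplaceRegularExpression(1, TEXT$, " ")        ; then drop the remaining tags
TEXT$ = Trim(ReplaceRegularExpression(2, TEXT$, " "))  ; finally tidy up the whitespace
Debug TEXT$
```

The order matters: the STYLE/SCRIPT pass has to run before the generic tag pass, otherwise the opening and closing tags are gone and their contents can no longer be identified. HTML comments, entities such as &amp;amp; and a ">" inside an attribute value would still slip through, so a proper HTML parser remains the safer choice for arbitrary pages.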
Ta - N