Get Text From HTML

Post by collectordave »

Just playing around trying to get the text from an HTML file and came up with this:

Code: Select all

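EnableExplicit

Define MyString.s, ReturnString.s, Char.s
Define iLoop.i
Define Ignore.i     ; #True while we are inside a <...> tag
Define BodyFound.i

If ReadFile(0, "My Test.html") ; Your HTML file

  ; Ignore everything until the opening body tag.
  ; "<body" also matches a tag with attributes, e.g. <body class="x">.
  ; Any text on the same line as the body tag is skipped as well.
  ; The Eof() check avoids an endless loop if the file has no body tag.
  While BodyFound = #False And Eof(0) = 0
    If FindString(ReadString(0), "<body", 1, #PB_String_NoCase)
      BodyFound = #True
    EndIf
  Wend

  ; Ignore is not reset for each line, so a tag that spans
  ; several lines is still skipped completely.
  Ignore = #False

  While Eof(0) = 0

    MyString = ReadString(0)
    ReturnString = ""

    For iLoop = 1 To Len(MyString)

      Char = Mid(MyString, iLoop, 1)

      If Char = "<"
        Ignore = #True
      EndIf

      If Ignore = #False
        ReturnString + Char
      EndIf

      If Char = ">"
        Ignore = #False
      EndIf

    Next

    Debug ReturnString

  Wend

  CloseFile(0)

Else
  Debug "Could not open the file."
EndIf

End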
It reads the file until <body> is found then removes anything enclosed by <> to return the text line by line.

Any improvements welcome.

CD
Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage to move in the opposite direction.

Post by Derren »

You should handle tags like <p></p> and <br> (or <br/>).

This HTML code is rendered as the text shown after it:

Code: Select all

<p>Hello World</p><p>This is going to be a new line:<br/>And here it is</p>

Rendered result:

Hello World

This is going to be a new line:
And here it is

While your code will return:

Hello WorldThis is going to be a new line:And here it is

One easy way to do this is a multi-pass approach: first replace any #CR$, #LF$ and #CRLF$ with "", then replace any <br>, <br/> and </p> with #CRLF$.
The problem is that <br /> (with a space) is technically a valid line break too. It's probably rare, but it could happen in hand-written code.
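
For illustration, the multi-pass idea described above could look roughly like this. It is only a sketch: the procedure name BreaksToNewlines is made up for this example, and only the three spellings <br>, <br/> and <br /> are covered (a tag such as <br class="x"> would still slip through). The remaining tags still have to be stripped afterwards.

```purebasic
; Hypothetical helper (name chosen for this example only): turns <br>, <br/>,
; <br /> and </p> into line breaks after dropping the line breaks that were
; already present in the HTML source.
Procedure.s BreaksToNewlines(Html.s)
  Protected Result.s = Html

  ; Pass 1: remove the original line breaks.
  ; Removing #CR$ and #LF$ separately also covers #CRLF$.
  Result = ReplaceString(Result, #CR$, "")
  Result = ReplaceString(Result, #LF$, "")

  ; Pass 2: replace the tags that end a line or a paragraph.
  Result = ReplaceString(Result, "<br />", #CRLF$, #PB_String_NoCase)
  Result = ReplaceString(Result, "<br/>", #CRLF$, #PB_String_NoCase)
  Result = ReplaceString(Result, "<br>", #CRLF$, #PB_String_NoCase)
  Result = ReplaceString(Result, "</p>", #CRLF$, #PB_String_NoCase)

  ProcedureReturn Result
EndProcedure

Debug BreaksToNewlines("<p>Hello World</p><p>This is going to be a new line:<br/>And here it is</p>")
```

Because pass 1 removes the line breaks of the source, this has to run on the whole document (or at least on whole paragraphs) rather than on single lines read with ReadString().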

Post by collectordave »

Hi

Not intended to parse HTML, just to get the text in as simple a manner as possible.

In a few spare moments I am looking at writing a parser based around regular expressions, but that is a long way off.

CD

Post by Little John »

Separators such as spaces, line breaks, and the start and end of a paragraph are part of a text! Derren's example shows that not taking them into account will give undesired results. A longer text extracted this way might be hard to read.

Another problem is that HTML allows attribute values to contain the '>' character. So the following is valid HTML5 code:

Code: Select all

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>Demo</title>
</head>

<body>
  <p title="3 > 2, that' true.">
    This is a paragraph.
  </p>
</body>
</html>
Reading this HTML code, your program returns:

2, that' true.">
This is a paragraph.

collectordave wrote: Not intended to parse HTML, just to get the text in as simple a manner as possible.

To seriously extract the text from any valid HTML code (in a way that keeps it readable), some basic parsing of that HTML code is required.
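
As a rough idea of what "some basic parsing" might mean for the '>' problem, here is a minimal sketch that remembers whether it is inside a quoted attribute value while skipping a tag. The procedure name StripTags is made up for this example. It is not a full HTML parser: comments, <script> and <style> content, character entities and a stray '<' in the text are all not handled.

```purebasic
; Hypothetical helper (name chosen for this example only): removes tags from
; Html and keeps track of quoted attribute values, so that a '>' inside
; quotes does not end the tag.
Procedure.s StripTags(Html.s)
  Protected Result.s, Char.s
  Protected Quote.s      ; the opening quote character while inside an attribute value
  Protected InTag.i, i.i

  For i = 1 To Len(Html)
    Char = Mid(Html, i, 1)

    If InTag
      If Quote <> ""
        ; Inside a quoted value: only the matching quote character closes it.
        If Char = Quote
          Quote = ""
        EndIf
      ElseIf Char = #DQUOTE$ Or Char = "'"
        Quote = Char
      ElseIf Char = ">"
        InTag = #False
      EndIf
    ElseIf Char = "<"
      InTag = #True
    Else
      Result + Char
    EndIf
  Next

  ProcedureReturn Result
EndProcedure

Debug StripTags("<p title=" + #DQUOTE$ + "3 > 2, that' true." + #DQUOTE$ + ">This is a paragraph.</p>")
```

The single quote in the title value is ignored here because the value was opened with a double quote, and only the matching quote character closes it.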