Get Text From HTML

Posted: Fri Jan 24, 2020 11:22 am
by collectordave
Just playing around trying to get the text from an HTML file and came up with this:

Code:

Define MyString.s, ReturnString.s, Char.s
Define iLoop.i
Define Ignore.i
Define BodyFound.i

If ReadFile(0, "My Test.html") ; Your HTML file

  While Not Eof(0)

    MyString = ReadString(0)

    ; Ignore everything until the body tag.
    ; Searching for "<body" also matches a tag with attributes.
    If BodyFound = #False
      If FindString(MyString, "<body", 1, #PB_String_NoCase)
        BodyFound = #True
      EndIf
      Continue ; the line holding the body tag itself is skipped
    EndIf

    Ignore = #False ; reset per line, so a tag spanning several lines is not handled
    ReturnString = ""

    For iLoop = 1 To Len(MyString)
      Char = Mid(MyString, iLoop, 1)

      If Char = "<"
        Ignore = #True ; start of a tag
      EndIf

      If Ignore = #False
        ReturnString = ReturnString + Char
      EndIf

      If Char = ">"
        Ignore = #False ; end of a tag
      EndIf
    Next

    Debug ReturnString

  Wend

  CloseFile(0)

Else
  Debug "Could not open the file"
EndIf

End
It reads the file until <body> is found then removes anything enclosed by <> to return the text line by line.

Any improvements welcome.

CD

Re: Get Text From HTML

Posted: Fri Jan 24, 2020 4:35 pm
by Derren
You should parse tags like <p></p> and <br> (or <br/>).

In a browser, this HTML code will look like the quote that follows it.

Code:

<p>Hello World</p><p>This is going to be a new line:<br/>And here it is</p>

Hello World

This is going to be a new line:
And here it is

While your code will return:
Hello WorldThis is going to be a new line:And here it is

One easy way to do this is a multi-pass approach: first replace every #CR$, #LF$ and #CRLF$ with "", then replace every <br>, <br/> and </p> with #CRLF$.
The problem is that <br /> (with a space) is technically a valid line break too. It's probably rare, but it could turn up in hand-written code.
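
The multi-pass idea could be sketched roughly like this. It is only a sketch under a few assumptions: HtmlLineBreaks() is a made-up helper name, the whole HTML text is already in one string (not read line by line), and a regular expression is used for the second pass so that <br>, <br/> and <br /> are all caught. Note that replacing source line breaks with "" can glue two words together if the HTML wraps in the middle of a sentence; replacing them with a space may be the safer choice.

```purebasic
; Sketch only: HtmlLineBreaks() is a hypothetical helper, not part of PureBasic.
Procedure.s HtmlLineBreaks(Html.s)
  Protected Result.s = Html
  Protected Regex.i

  ; Pass 1: remove the line breaks of the source text, as suggested above
  Result = ReplaceString(Result, #CR$, "")
  Result = ReplaceString(Result, #LF$, "")

  ; Pass 2: turn <br>, <br/>, <br /> and </p> into real line breaks
  Regex = CreateRegularExpression(#PB_Any, "<br\s*/?>|</p\s*>", #PB_RegularExpression_NoCase)
  If Regex
    Result = ReplaceRegularExpression(Regex, Result, #CRLF$)
    FreeRegularExpression(Regex)
  EndIf

  ProcedureReturn Result
EndProcedure

Debug HtmlLineBreaks("<p>Hello World</p><p>This is going to be a new line:<br/>And here it is</p>")
```

The remaining tags (the opening <p> here) are still in the result, so the tag-stripping loop would have to run after this step.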

Re: Get Text From HTML

Posted: Sat Jan 25, 2020 4:15 am
by collectordave
Hi

Not intended to parse HTML just get the text in as simple a manner as possible.

In a few spare moments I am looking at writing a parser based around regular expressions, but that is a long way off.

CD

Re: Get Text From HTML

Posted: Sat Jan 25, 2020 11:31 am
by Little John
Separators such as spaces, line breaks, and the start and end of a paragraph are part of a text! Derren's example shows that not taking them into account will give undesired results. A longer text extracted this way might be hard to read.

Another problem is that HTML allows attribute values to contain the '>' character. So the following is valid HTML5 code:

Code:

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>Demo</title>
</head>

<body>
  <p title="3 > 2, that' true.">
    This is a paragraph.
  </p>
</body>
</html>

Reading this HTML code, your program returns:
 2, that' true.">
    This is a paragraph.

collectordave wrote: Not intended to parse HTML just get the text in as simple a manner as possible.
To seriously extract the text from any valid HTML code (so that the result is readable), some basic parsing of that HTML code is required.
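
As an illustration of what "some basic parsing" might mean for the '>' problem, here is a minimal sketch that remembers whether it is inside a quoted attribute value and only treats '>' as the end of a tag when it is not. StripTags() is a made-up name, and the sketch assumes the HTML is passed in as one string. It does not try to handle comments, <script> or <style> content, or entities such as &amp;, so it is a starting point at best.

```purebasic
; Sketch only: StripTags() is a hypothetical helper, not part of PureBasic.
Procedure.s StripTags(Html.s)
  Protected Result.s, Char.s, Quote.s
  Protected i.i, InTag.i

  For i = 1 To Len(Html)
    Char = Mid(Html, i, 1)

    If InTag
      If Quote <> ""
        ; Inside an attribute value: only the matching quote ends it
        If Char = Quote : Quote = "" : EndIf
      ElseIf Char = #DQUOTE$ Or Char = "'"
        Quote = Char     ; an attribute value starts
      ElseIf Char = ">"
        InTag = #False   ; a '>' outside quotes closes the tag
      EndIf
    ElseIf Char = "<"
      InTag = #True      ; a tag starts
    Else
      Result = Result + Char
    EndIf
  Next

  ProcedureReturn Result
EndProcedure

Debug StripTags("<p title=" + #DQUOTE$ + "3 > 2, that' true." + #DQUOTE$ + ">This is a paragraph.</p>")
```

With the paragraph from the example above, the '>' and the apostrophe inside the title attribute are skipped together with the rest of the tag, so only the paragraph text is left.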