All times are UTC + 1 hour




 Post subject: Get Text From HTML
PostPosted: Fri Jan 24, 2020 11:22 am 
Just playing around trying to get the text from an HTML file and came up with this:

Code:
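Define MyString.s, ReturnString.s, Char.s
Define iLoop.i
Define Ignore.i
Define BodyFound.i

If ReadFile(0, "My Test.html") ; Your HTML file

  ; Ignore everything up to and including the line that holds the <body> tag.
  ; Searching for "<body" also matches a body tag that carries attributes,
  ; and the Eof() check stops the loop if the file has no body tag at all.
  While Eof(0) = 0 And BodyFound = #False
    If FindString(ReadString(0), "<body", 1, #PB_String_NoCase)
      BodyFound = #True
    EndIf
  Wend

  While Not Eof(0)

    MyString = ReadString(0)
    ReturnString = ""
    Ignore = #False ; the tag state is reset on every line, so a tag must not span lines

    For iLoop = 1 To Len(MyString)
      Char = Mid(MyString, iLoop, 1)

      If Char = "<"
        Ignore = #True ; start of a tag: stop copying
      EndIf

      If Ignore = #False
        ReturnString + Char
      EndIf

      If Char = ">"
        Ignore = #False ; end of a tag: copy again from the next character
      EndIf
    Next

    Debug ReturnString

  Wend

  CloseFile(0)

Else
  Debug "Could not open the HTML file"
EndIf

End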


It reads the file until <body> is found, then removes anything enclosed in <> and returns the text line by line.

Any improvements welcome.

CD

_________________
Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage to move in the opposite direction.


 Post subject: Re: Get Text From HTML
PostPosted: Fri Jan 24, 2020 4:35 pm 
You should parse tags like <p></p> and <br> (or <br/>).

In a browser, this HTML code will be rendered like the following quote.

Code:
<p>Hello World</p><p>This is going to be a new line:<br/>And here it is</p>

Quote:
Hello World

This is going to be a new line:
And here it is


While your code will return:
Quote:
Hello WorldThis is going to be a new line:And here it is


One easy way to do this is to make multiple passes: first replace any #CR$, #LF$ and #CRLF$ with "", then replace any <br>, <br/> and </p> with #CRLF$.
The problem is that, technically, <br /> (with a space) is also a valid line break. It's probably rare, but it could happen in hand-written code.
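
The multi-pass replacement described above could be sketched roughly like this. BreakTagsToNewlines() is a made-up helper name, not something from the posts, and it assumes the whole HTML text has already been read into one string. It only inserts the line breaks; the remaining tags (such as the opening <p>) still have to be stripped afterwards.

```purebasic
; Hypothetical helper (name is an assumption): turns line-break tags into #CRLF$
Procedure.s BreakTagsToNewlines(Html.s)
  ; Pass 1: drop the line breaks of the source file
  ; (replacing #CR$ and #LF$ separately also covers #CRLF$)
  Html = ReplaceString(Html, #CR$, "")
  Html = ReplaceString(Html, #LF$, "")

  ; Pass 2: map the tags that end a line to #CRLF$, ignoring case
  Html = ReplaceString(Html, "<br>",   #CRLF$, #PB_String_NoCase)
  Html = ReplaceString(Html, "<br/>",  #CRLF$, #PB_String_NoCase)
  Html = ReplaceString(Html, "<br />", #CRLF$, #PB_String_NoCase)
  Html = ReplaceString(Html, "</p>",   #CRLF$, #PB_String_NoCase)

  ProcedureReturn Html
EndProcedure

Debug BreakTagsToNewlines("<p>Hello World</p><p>This is going to be a new line:<br/>And here it is</p>")
```

Listing every spelling of <br> by hand obviously does not scale to variants with extra spaces or attributes. Also note that deleting source line breaks outright can glue two words together when a line break was the only separator; replacing them with a space instead may be the safer choice.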


 Post subject: Re: Get Text From HTML
PostPosted: Sat Jan 25, 2020 4:15 am 
Hi

It's not intended to parse HTML, just to get the text in as simple a manner as possible.

In a few spare moments I am looking at writing a parser based around regular expressions, but that is a long way off.

CD

 Post subject: Re: Get Text From HTML
PostPosted: Sat Jan 25, 2020 11:31 am 
Separators such as spaces, line breaks, and the start and end of a paragraph are part of a text! Derren's example shows that not taking them into account will give undesired results. A longer text extracted this way might be hard to read.

Another problem is, for example, that HTML allows attribute values to contain the '>' character. So the following is valid HTML5 code:
Code:
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>Demo</title>
</head>

<body>
  <p title="3 > 2, that' true.">
    This is a paragraph.
  </p>
</body>
</html>

Reading this HTML code, your program returns
Quote:
2, that' true.">
This is a paragraph.


collectordave wrote:
It's not intended to parse HTML, just to get the text in as simple a manner as possible.

For seriously extracting the text of any valid HTML code (in a way so that it's readable), some basic parsing of that HTML code is required.
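
As a rough idea of what "some basic parsing" could mean for the '>' problem, here is a minimal sketch that remembers whether it is inside a quoted attribute value while skipping a tag. StripTags() is a made-up name, not an existing PureBasic function, and the sketch assumes well-formed tags. It does not handle comments, <script> or <style> content, or character entities such as &amp;.

```purebasic
; Hypothetical helper (name is an assumption): removes tags from a string,
; treating '>' inside a quoted attribute value as ordinary tag content
Procedure.s StripTags(Html.s)
  Protected Result.s, Char.s, Quote.s
  Protected i.i, InTag.i

  For i = 1 To Len(Html)
    Char = Mid(Html, i, 1)

    If InTag
      If Quote <> ""
        ; Inside a quoted value: only the matching quote character ends it
        If Char = Quote
          Quote = ""
        EndIf
      ElseIf Char = #DQUOTE$ Or Char = "'"
        Quote = Char      ; remember which quote character opened the value
      ElseIf Char = ">"
        InTag = #False    ; an unquoted '>' really closes the tag
      EndIf
    ElseIf Char = "<"
      InTag = #True
    Else
      Result + Char       ; plain text outside of any tag
    EndIf
  Next

  ProcedureReturn Result
EndProcedure

; The paragraph from the example above, written on one line
Debug StripTags("<p title=" + #DQUOTE$ + "3 > 2, that' true." + #DQUOTE$ + ">This is a paragraph.</p>")
```

Because the state is kept for the whole string instead of being reset for every line, a tag that spans several lines is also skipped, provided the complete text is passed in one call.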

_________________
Please excuse my flawed English. My native language is PureBasic.