Downloading to an HTML file then converting it to plain text

Seymour Clufley
Addict
Posts: 1265
Joined: Wed Feb 28, 2007 9:13 am
Location: London

Post by Seymour Clufley »

I want to make a program that reads a webpage address from file (I know how to do this bit!) and then snatches text from that page.

As far as I know, the procedure would be for the program to download the webpage to an HTML file, then get the entire contents of that file and parse it somehow.

I'm aware that it would be possible to simply remove all text enclosed within < and > tags, but that may cut out all the text from a webpage so it isn't a solution in itself.

The input could be any webpage whatsoever. (I'm using Yahoo's random URL generator.)

Formatting isn't important. I just want random pieces of text from the Internet.

I know there are already programs available that convert HTML to plain text. But to avoid licensing complications and the idiosyncrasies of each program, I'd like to make one myself.

So what I need help with, and I'd be very grateful for any, is:

1. How to download a webpage from a URL to an HTML file.

2. How to convert it to plain text.

Basically what I want to do is like clicking on any webpage and pressing CTRL+A and CTRL+C to select all text and copy it. Could that be done with PB coding?

Thanks in advance for any help,
Seymour.
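
For point 1, a minimal sketch, assuming the installed PureBasic version ships the built-in HTTP library with ReceiveHTTPFile(). The URL and the output filename are placeholders, not values from the thread:

```purebasic
; Sketch: download one page to a local HTML file.
; url$ and file$ are placeholder values chosen for illustration.
; Assumption: PureBasic versions before 6.00 also need InitNetwork()
; called once before any HTTP command is used.
url$  = "http://www.purebasic.com/"
file$ = GetTemporaryDirectory() + "page.html"

If ReceiveHTTPFile(url$, file$)   ; non-zero on success
  Debug "Saved to " + file$
Else
  Debug "Download failed"
EndIf
```

The saved file still contains raw HTML, so point 2 (turning it into plain text) remains a separate step.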
Kiffi
Addict
Posts: 1504
Joined: Tue Mar 02, 2004 1:20 pm
Location: Amphibios 9

Post by Kiffi »

Hello Seymour,

With PureDispHelper from ts-soft, it is quite easy to extract all the text from
a website:

Code:

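EnableExplicit

Define.l oIE, result

dhToggleExceptions(#True) ; report COM errors instead of failing silently
oIE = dhCreateObject("InternetExplorer.Application")

If oIE
  
  dhPutValue(oIE, "Visible = %b", #False) ; keep the browser window hidden
  dhCallMethod(oIE, "Navigate (%T)", @"www.purebasic.com")
  
  ; Wait until the page has finished loading (ReadyState 4 = complete)
  Repeat
    Delay(10) ; yield the CPU instead of spinning
    dhGetValue("%d", @result, oIE, "ReadyState")
  Until result = 4
  
  ; Fetch the rendered text of the page body; result receives a string pointer
  dhGetValue("%T", @result, oIE, ".document.body.innertext")
  
  If result
    
    MessageRequester("Plain text", PeekS(result), #MB_ICONINFORMATION)
    dhFreeString(result) : result = 0 ; the string is owned by the library
    
  EndIf
  
  dhReleaseObject(oIE)
  
EndIf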
Greetings ... Kiffi

Post by Seymour Clufley »

Thanks for telling me about this.

Please don't think I want to avoid doing any work... but how could that code be adapted to work with a single webpage instead of a whole website?

Post by Kiffi »

Seymour Clufley wrote: but how could that code be adapted to work with a single webpage instead of a whole website?
Oh, I see, there is a misunderstanding. The code loads the content of a
single webpage.

I wrote website but I meant webpage ;-)

Greetings ... Kiffi

Post by Seymour Clufley »

Oh, excellent. Thanks very much!

Since I'm new to PB, I have to ask a few embarrassing questions... is the resulting plain text retrievable using the PeekS(result) variable? So I could for example write:

Code:

result$=PeekS(result)
file$="C:\plaintext.txt"
WriteString(file$,result$)
Also, how do I use the PureDispHelper library? I've placed it in the same folder as the PB project file, and tried to compile the executable to the same folder, but I get an error on the line:

"dhToggleExceptions(#True)"

It says it isn't an array or macro etc.

The help file doesn't seem to explain how to use a library with a PB project.
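
On the first question: PeekS(result) is a function call that returns the string, not a variable, and WriteString() expects the number of an open file rather than a path. A minimal sketch of saving the text, assuming the path from the snippet above is writable and using a placeholder string in place of PeekS(result):

```purebasic
; Sketch: write already-extracted text to a file.
; text$ is a placeholder standing in for PeekS(result).
text$ = "Example plain text"
file$ = "C:\plaintext.txt"     ; path taken from the question; assumed writable

If CreateFile(0, file$)        ; open file #0 for writing, replacing any existing file
  WriteString(0, text$)        ; first argument is the file number, not the filename
  CloseFile(0)
Else
  Debug "Could not create " + file$
EndIf
```

If the root of C: is not writable on a given system, any other writable folder can be substituted for file$.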
Flype
Addict
Posts: 1542
Joined: Tue Jul 22, 2003 5:02 pm
Location: In a long distant galaxy

Post by Flype »

I've placed it in the same folder as the PB project file
Hi,

No, not in the PB project folder.

You must place it in the PB home folder, which is commonly 'c:\program files\purebasic\',

and then restart PureBasic.

In PureBasic, the language can be extended with such libs; this way you have new functions available (just like an included file).

Post by Seymour Clufley »

Damn, it still isn't working.

PureBasic is in the default location "C:\Program Files\PureBasic" and I have placed the PureDispHelper in that folder, and even copied it into the Library folder and the UserLibraries folder within that. So there are three instances of it! I'm still getting the "dhToggleExceptions is not a macro, array etc." error. And I've restarted PB several times.

Perhaps it's to do with the code? All I've done is pasted the code you gave me into a new PB project. Is there other stuff to do?
ts-soft
Always Here
Posts: 5756
Joined: Thu Jun 24, 2004 2:44 pm
Location: Berlin - Germany

Post by ts-soft »

PureArea.net wrote: Notes on installing user libs:

The actual command library from the respective archive must be copied into the directory PureBasic\PureLibraries\UserLibraries\. After the next restart of the PureBasic editor, the included commands will be recognized and can be used for programming.

If predefined constants (in a .res file) are included in the archive, this file must be copied into the directory PureBasic\Residents\.

If the archive includes a manual in .chm format, this file can be copied into the directory PureBasic\Help\; then the context-sensitive help (via F1) can be used.
srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Yes it should be placed in the \PureLibraries\UserLibraries\ subfolder of your purebasic installation folder.

You don't have 2 versions of Purebasic on your computer do you by any chance? A couple of beta versions perhaps?
PB
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Post by PB »

No need for a lib. And this is not my code, I think it was Freak's originally?

Code:

; ----- START WEBGADGET COPY
DataSection
  IID_IHTMLDocument2: ; {332C4425-26CB-11D0-B483-00C04FD90119}
    Data.l $332C4425
    Data.w $26CB,$11D0
    Data.b $B4,$83,$00,$C0,$4F,$D9,$01,$19
  IID_IHTMLDocument3: ; {3050F485-98B5-11CF-BB82-00AA00BDCE0B}
    Data.l $3050F485
    Data.w $98B5,$11CF
    Data.b $BB,$82,$00,$AA,$00,$BD,$CE,$0B
  IID_NULL: ; {00000000-0000-0000-0000-000000000000}
    Data.l $00000000
    Data.w $0000,$0000
    Data.b $00,$00,$00,$00,$00,$00,$00,$00
EndDataSection
Procedure WebGadget_Document(gad,*IID)
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA)
  If Browser
    If Browser\get_Document(@DocumentDispatch.IDispatch)=#S_OK And DocumentDispatch
      DocumentDispatch\QueryInterface(*IID,@Document) : DocumentDispatch\Release()
    EndIf
  EndIf
  ProcedureReturn Document
EndProcedure
Procedure.s WebGadget_Selection(gad)
  Document.IHTMLDocument2=WebGadget_Document(gad,?IID_IHTMLDocument2)
  If Document
    If Document\get_selection(@Selection.IHTMLSelectionObject)=#S_OK
      If Selection\get_type(@bstr_type)=#S_OK And bstr_type
        If LCase(PeekS(bstr_type,-1,#PB_Unicode))="text"
          If Selection\createRange(@TextRange.IDispatch)=#S_OK And TextRange
            UnicodeText$=Space(10) : PokeS(@UnicodeText$,"text",-1,#PB_Unicode) : pUnicodeText=@UnicodeText$
            If TextRange\GetIDsOfNames(?IID_NULL,@pUnicodeText,1,#LOCALE_SYSTEM_DEFAULT,@dispid_text)=#S_OK
              params.DISPPARAMS\cArgs=0 : params\cNamedArgs=0
              If TextRange\Invoke(dispid_text,?IID_NULL,#LOCALE_SYSTEM_DEFAULT,#DISPATCH_PROPERTYGET,@params,@varResult.VARIANT,0,0)=#S_OK
                If varResult\vt<>#VT_BSTR ; coerce other variant types to a string first
                  VariantChangeType_(@varResult,@varResult,0,#VT_BSTR)
                EndIf
                If varResult\vt=#VT_BSTR And varResult\bstrVal
                  r$=PeekS(varResult\bstrVal,-1,#PB_Unicode)
                EndIf
                VariantClear_(@varResult)
              EndIf
            EndIf
            TextRange\Release()
          EndIf
        EndIf
        SysFreeString_(bstr_type)
      EndIf
      Selection\Release()
    EndIf
    Document\Release()
  EndIf
  ProcedureReturn r$
EndProcedure
Procedure.s WebGadget_CopyText(gad)
  Document.IHTMLDocument2=WebGadget_Document(gad,?IID_IHTMLDocument2)
  If Document
    If Document\get_body(@Body.IHTMLElement)=#S_OK
      If Body\get_innerText(@bstr_text)=#S_OK And bstr_text
        r$=PeekS(bstr_text,-1,#PB_Unicode) : SysFreeString_(bstr_text)
      EndIf
      Body\Release()
    EndIf
    Document\Release()
  EndIf
  ProcedureReturn r$
EndProcedure
; ----- END WEBGADGET COPY

Procedure WaitForWebGadget(gad)
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA)
  Repeat
    While WindowEvent() : Wend : Browser\get_Busy(@busy.l)
    If busy=#VARIANT_TRUE : Sleep_(1) : EndIf
  Until busy=#VARIANT_FALSE
EndProcedure

Enumeration
  #Web
  #Editor
  #Button1
EndEnumeration

If OpenWindow(0,0,0,1024,768,"Web",#PB_Window_ScreenCentered|#PB_Window_SystemMenu)
  CreateGadgetList(WindowID(0))
  WebGadget(#Web,0,0,900,350,"")
  ButtonGadget(#Button1,10,360,100,25,"Convert to plain text")
  EditorGadget(#Editor,0,390,900,350)
  SetGadgetText(#Web,"http://www.google.com")
  WaitForWebGadget(#Web)
  Repeat
    Event=WaitWindowEvent()
    If Event=#PB_Event_Gadget
      Select EventGadget()
        Case #Button1
          SetGadgetText(#Editor,WebGadget_CopyText(#Web))
      EndSelect
    EndIf
  Until Event=#PB_Event_CloseWindow
EndIf

Post by Seymour Clufley »

I'm back on to this project now. It has ceased being an exe and become a dll.

Using Freak's code, I can get the text from a webpage but there's a strange problem. As I said, the code is inside a dll that gets called by an exe. I can call other functions in the dll and there are no problems, but when I call the "ReadWebpage" function (which uses Freak's code), the exe can't be shut down.

It "closes", but it remains visible in TaskManager>Processes. And the only way to shut it down is to terminate the process there.

The dll seems to keep on operating. Or rather, it keeps the exe that used it operating.

As Kiffi said, the other option is to use PureDispHelper. I tried that and couldn't get it to work at all. The download I got didn't contain any .res files (should there be?), so nothing to paste into the Residents folder. Either way, trying to compile throws up the error "dhToggleExceptions is not an array, macro, etc."

All in all it's a bit confusing. Does anyone know any other ways to get plain text from the webgadget?
xgp
Enthusiast
Posts: 128
Joined: Mon Jun 13, 2005 6:03 pm

Post by xgp »

Code:

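input$  = "<html><head><title>title example</title></head><body><h2>my homepage</h2></body></html>"
iSize   = Len(input$)
iInput  = FindString(input$,"<body>",1) ; string positions are 1-based
If iInput
  iInput = iInput+6 ; start after the <body> tag
Else
  iInput = 1        ; no <body> tag found: start at the beginning
EndIf
PlainText$ = ""

While iInput <= iSize ; <= so the last character is not skipped
  
  temp$ = Mid(input$,iInput,1)
  
  If temp$ = "<"
    iClose = FindString(input$,">",iInput+1)
    If iClose = 0 : Break : EndIf ; unterminated tag: stop instead of looping forever
    iInput = iClose+1
  Else
    PlainText$ = PlainText$ + temp$
    iInput = iInput+1
  EndIf
  
Wend

Debug PlainText$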
While reading your post, this approach came to mind, but after reading some of the answers here, I am not sure if it really helps.
Anyway...

Regards
xgp

Post by Seymour Clufley »

Thanks, XGP. That code works, but it's very slow (about 10 seconds from start to finish on my PC).

I'm amazed there's no way just to copy the text from the webgadget???

Any ideas would be gratefully appreciated,
Seymour.
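
The slowness most likely comes from building PlainText$ one character at a time with Mid() and string concatenation, which gets more expensive as the page grows. A sketch of the same tag-stripping idea that scans the string once with pointers and writes into a preallocated buffer. StripTags is a hypothetical name, and like the original loop it ignores entities, scripts and comments:

```purebasic
; Sketch: single-pass tag stripper (hypothetical helper, not from the thread).
Procedure.s StripTags(html$)
  Protected out$, inTag = #False
  Protected *src.Character, *dst.Character

  If Len(html$) = 0 : ProcedureReturn "" : EndIf

  out$ = Space(Len(html$))   ; the result can never be longer than the input
  *src = @html$
  *dst = @out$

  While *src\c               ; stop at the terminating null character
    If *src\c = '<'
      inTag = #True          ; entering a tag: stop copying
    ElseIf *src\c = '>' And inTag
      inTag = #False         ; leaving a tag: resume copying
    ElseIf inTag = #False
      *dst\c = *src\c        ; ordinary text: copy the character
      *dst + SizeOf(Character)
    EndIf
    *src + SizeOf(Character)
  Wend

  *dst\c = 0                 ; terminate the shortened result
  ProcedureReturn out$
EndProcedure

Debug StripTags("<h2>my homepage</h2>") ; shows: my homepage
```

Because each character is read and written at most once, the run time should grow roughly linearly with the size of the page.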

Post by PB »

> I'm amazed there's no way just to copy the text from the webgadget???

The code I posted above does it, but yes, I'd prefer a native solution too.
Fluid Byte
Addict
Posts: 2336
Joined: Fri Jul 21, 2006 4:41 am
Location: Berlin, Germany

Post by Fluid Byte »

This just saved my butt, thanks PB!