Downloading to an HTML file then converting it to plain text
-
- Addict
- Posts: 1265
- Joined: Wed Feb 28, 2007 9:13 am
- Location: London
I want to make a program that reads a webpage address from file (I know how to do this bit!) and then snatches text from that page.
As far as I know, the procedure would be for the program to download the webpage to an HTML file, then get the entire contents of that file and parse it somehow.
I'm aware that it would be possible to simply remove all text enclosed within < and > tags, but that may cut out all the text from a webpage so it isn't a solution in itself.
The input could be any webpage whatsoever. (I'm using Yahoo's random URL generator.)
Formatting isn't important. I just want random pieces of text from the Internet.
I know there are already programs available that convert HTML to plain text. But to avoid licensing complications and the idiosyncrasies of each program, I'd like to make one myself.
So what I need help with, and I'd be very grateful for any, is:
1. How to download a webpage from a URL to an HTML file.
2. How to convert it to plain text.
Basically what I want to do is like clicking on any webpage and pressing CTRL+A and CTRL+C to select all text and copy it. Could that be done with PB coding?
Thanks in advance for any help,
Seymour.
Re: Downloading to an HTML file then converting it to plain
Hello Seymour,
with the PureDispHelper from ts-soft it is quite easy to extract all text from
a website:
Greetings ... Kiffi
Code: Select all
EnableExplicit
Define.l oIE, result
dhToggleExceptions(#True)
; Drive a hidden Internet Explorer instance via COM
oIE = dhCreateObject("InternetExplorer.Application")
If oIE
  dhPutValue(oIE, "Visible = %b", #False)
  dhCallMethod(oIE, "Navigate (%T)", @"www.purebasic.com")
  ; Wait until the page has finished loading (ReadyState 4 = complete)
  Repeat
    dhGetValue("%d", @result, oIE, "ReadyState")
    Delay(10) ; avoid hogging the CPU while polling
  Until result = 4
  ; innerText of the body is the page's visible text
  dhGetValue("%T", @result, oIE, ".document.body.innertext")
  If result
    MessageRequester("Plain text", PeekS(result), #MB_ICONINFORMATION)
    dhFreeString(result) : result = 0
  EndIf
  dhReleaseObject(oIE)
EndIf
-
- Addict
- Posts: 1265
- Joined: Wed Feb 28, 2007 9:13 am
- Location: London
Oh, excellent. Thanks very much!
Since I'm new to PB, I have to ask a few embarrassing questions... is the resulting plain text retrievable using the PeekS(result) variable? So I could for example write:
Code: Select all
result$=PeekS(result)
file$="C:\plaintext.txt"
WriteString(file$,result$)
Also, how do I use the PureDispHelper library? I've placed it in the same folder as the PB project file, and tried to compile the executable to the same folder, but I get an error on the line:
"dhToggleExceptions(#True)"
It says it isn't an array or macro etc.
The help file doesn't seem to explain how to use a library with a PB project.
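On the file-writing part of the question: WriteString() expects the number of a file that has already been opened, not a file name, so the three lines above would not compile as written. A minimal sketch of the usual pattern, assuming the text is already in a string (a placeholder stands in for PeekS(result) here); the file number 0 and the file name in the temp folder are arbitrary choices for illustration:

```purebasic
; Sketch: save a string to a text file.
result$ = "plain text from the page"               ; placeholder for PeekS(result)
file$   = GetTemporaryDirectory() + "plaintext.txt" ; assumed, writable location

If CreateFile(0, file$)        ; creates/overwrites the file and opens it as file #0
  WriteString(0, result$)      ; first argument is the file number, not the name
  CloseFile(0)
Else
  Debug "Could not create " + file$
EndIf
```

CreateFile() returns zero on failure, so the check also covers the case where the target folder is not writable.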
hi,
"I've placed it in the same folder as the PB project file"
no, not in the same folder as the PB project file.
you must place it in the PB home folder, which is commonly 'c:\program files\purebasic\'
and then restart purebasic.
in purebasic, the language can be extended with such libs, this way you have new functions available (just like an included file).
No programming language is perfect. There is not even a single best language.
There are only languages well suited or perhaps poorly suited for particular purposes. Herbert Mayer
-
- Addict
- Posts: 1265
- Joined: Wed Feb 28, 2007 9:13 am
- Location: London
Damn, it still isn't working.
PureBasic is in the default location "C:\Program Files\PureBasic" and I have placed the PureDispHelper in that folder, and even copied it into the Library folder and the UserLibraries folder within that. So there are three instances of it! I'm still getting the "dhToggleExceptions is not a macro, array etc." error. And I've restarted PB several times.
Perhaps it's to do with the code? All I've done is pasted the code you gave me into a new PB project. Is there other stuff to do?
PureArea.net wrote: Notes for installing user-libs:
The real command library from the respective archive must be copied into the directory PureBasic\PureLibraries\UserLibraries\. After the next restart of the PureBasic editor the included commands will be recognized and can be used for programming.
If predefined constants (in a .res file) are included in the archive, this file must be copied into the directory PureBasic\Residents\.
If the archive includes a manual in .chm format, this file can be copied into the directory PureBasic\Help\; then the context-sensitive help (via F1) can be used.
PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.

No need for a lib. And this is not my code, I think it was Freak's originally?
Code: Select all
; ----- START WEBGADGET COPY
DataSection
  IID_IHTMLDocument2: ; {332C4425-26CB-11D0-B483-00C04FD90119}
    Data.l $332C4425
    Data.w $26CB,$11D0
    Data.b $B4,$83,$00,$C0,$4F,$D9,$01,$19
  IID_IHTMLDocument3: ; {3050F485-98B5-11CF-BB82-00AA00BDCE0B}
    Data.l $3050F485
    Data.w $98B5,$11CF
    Data.b $BB,$82,$00,$AA,$00,$BD,$CE,$0B
  IID_NULL: ; {00000000-0000-0000-0000-000000000000}
    Data.l $00000000
    Data.w $0000,$0000
    Data.b $00,$00,$00,$00,$00,$00,$00,$00
EndDataSection
Procedure WebGadget_Document(gad,*IID)
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA)
  If Browser
    If Browser\get_Document(@DocumentDispatch.IDispatch)=#S_OK And DocumentDispatch
      DocumentDispatch\QueryInterface(*IID,@Document) : DocumentDispatch\Release()
    EndIf
  EndIf
  ProcedureReturn Document
EndProcedure
Procedure.s WebGadget_Selection(gad)
  Document.IHTMLDocument2=WebGadget_Document(gad,?IID_IHTMLDocument2)
  If Document
    If Document\get_selection(@Selection.IHTMLSelectionObject)=#S_OK
      If Selection\get_type(@bstr_type)=#S_OK And bstr_type
        If LCase(PeekS(bstr_type,-1,#PB_Unicode))="text"
          If Selection\createRange(@TextRange.IDispatch)=#S_OK And TextRange
            UnicodeText$=Space(10) : PokeS(@UnicodeText$,"text",-1,#PB_Unicode) : pUnicodeText=@UnicodeText$
            If TextRange\GetIDsOfNames(?IID_NULL,@pUnicodeText,1,#LOCALE_SYSTEM_DEFAULT,@dispid_text)=#S_OK
              params.DISPPARAMS\cArgs=0 : params\cNamedArgs=0
              If TextRange\Invoke(dispid_text,?IID_NULL,#LOCALE_SYSTEM_DEFAULT,#DISPATCH_PROPERTYGET,@params,@varResult.VARIANT,0,0)=#S_OK
                If varResult\vt=#VT_BSTR
                  r$=PeekS(varResult\bstrVal,-1,#PB_Unicode)
                Else
                  ; Not a string yet: convert in place, then read it
                  ; (the original converted the variant but never read the result)
                  If VariantChangeType_(@varResult,@varResult,0,#VT_BSTR)=#S_OK
                    r$=PeekS(varResult\bstrVal,-1,#PB_Unicode)
                  EndIf
                EndIf
                VariantClear_(@varResult)
              EndIf
            EndIf
            TextRange\Release()
          EndIf
        EndIf
        SysFreeString_(bstr_type)
      EndIf
      Selection\Release()
    EndIf
    Document\Release()
  EndIf
  ProcedureReturn r$
EndProcedure
Procedure.s WebGadget_CopyText(gad)
  Document.IHTMLDocument2=WebGadget_Document(gad,?IID_IHTMLDocument2)
  If Document
    If Document\get_body(@Body.IHTMLElement)=#S_OK
      If Body\get_innerText(@bstr_text)=#S_OK And bstr_text
        r$=PeekS(bstr_text,-1,#PB_Unicode) : SysFreeString_(bstr_text)
      EndIf
      Body\Release()
    EndIf
    Document\Release()
  EndIf
  ProcedureReturn r$
EndProcedure
; ----- END WEBGADGET COPY
Procedure WaitForWebGadget(gad)
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA)
  Repeat
    While WindowEvent() : Wend : Browser\get_Busy(@busy.l)
    If busy=#VARIANT_TRUE : Sleep_(1) : EndIf
  Until busy=#VARIANT_FALSE
EndProcedure
Enumeration
  #Web
  #Editor
  #Button1
EndEnumeration
If OpenWindow(0,0,0,1024,768,"Web",#PB_Window_ScreenCentered|#PB_Window_SystemMenu)
  CreateGadgetList(WindowID(0))
  WebGadget(#Web,0,0,900,350,"")
  ButtonGadget(#Button1,10,360,100,25,"Convert to plain text")
  EditorGadget(#Editor,0,390,900,350)
  SetGadgetText(#Web,"http://www.google.com")
  WaitForWebGadget(#Web)
  Repeat
    Event=WaitWindowEvent()
    If Event=#PB_Event_Gadget
      Select EventGadget()
        Case #Button1
          SetGadgetText(#Editor,WebGadget_CopyText(#Web))
      EndSelect
    EndIf
  Until Event=#PB_Event_CloseWindow
EndIf
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
-
- Addict
- Posts: 1265
- Joined: Wed Feb 28, 2007 9:13 am
- Location: London
I'm back on to this project now. It has ceased being an exe and become a dll.
Using Freak's code, I can get the text from a webpage but there's a strange problem. As I said, the code is inside a dll that gets called by an exe. I can call other functions in the dll and there are no problems, but when I call the "ReadWebpage" function (which uses Freak's code), the exe can't be shut down.
It "closes", but it remains visible in TaskManager>Processes. And the only way to shut it down is to terminate the process there.
The dll seems to keep on operating. Or rather, it keeps the exe that used it operating.
As Kiffi said, the other option is to use PureDispHelper. I tried that and couldn't get it to work at all. The download I got didn't contain any .res files (should there be?), so nothing to paste into the Residents folder. Either way, trying to compile throws up the error "dhToggleExceptions is not an array, macro, etc."
All in all it's a bit confusing. Does anyone know any other ways to get plain text from the webgadget?
Code: Select all
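input$ = "<html><head><title>title example</title></head><body><h2>my homepage</h2></body></html>"
iSize = Len(input$)
iInput = FindString(input$,"<body>",1)+6 ; start after the <body> tag
PlainText$ = ""
While iInput <= iSize
  temp$ = Mid(input$,iInput,1)
  If temp$ = "<"
    ; a tag starts here: jump to just past its closing ">"
    iClose = FindString(input$,">",iInput+1)
    If iClose = 0 : Break : EndIf ; unclosed tag: stop instead of looping forever
    iInput = iClose+1
  Else
    ; ordinary character: keep it
    PlainText$ = PlainText$ + temp$
    iInput = iInput+1
  EndIf
Wend
Debug PlainText$ ; shows "my homepage"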
Anyway,...
Regards
xgp
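The first half of the original question (getting the page into an HTML file) can be combined with the tag-skipping loop above. The following is only a sketch under several assumptions: the URL and the file name are examples, StripTags() is a hypothetical helper name, and the tag handling is as naive as the loop it is based on (script and style contents and HTML entities such as &amp;amp; are left in the output). ReceiveHTTPFile() comes from PureBasic's HTTP library; older versions need InitNetwork() to be called first, while recent versions may not need or accept that call.

```purebasic
; Sketch: download a page to an HTML file, read it back, strip the tags.

Procedure.s StripTags(html$)   ; hypothetical helper, not a built-in command
  Protected pos, closePos, char$, text$
  pos = FindString(LCase(html$), "<body", 1)   ; skip the <head> section
  If pos = 0 : pos = 1 : EndIf                 ; no <body> tag: start at the top
  While pos <= Len(html$)
    char$ = Mid(html$, pos, 1)
    If char$ = "<"
      closePos = FindString(html$, ">", pos + 1)
      If closePos = 0 : Break : EndIf          ; unclosed tag: stop
      pos = closePos + 1
    Else
      text$ + char$
      pos + 1
    EndIf
  Wend
  ProcedureReturn text$
EndProcedure

InitNetwork()   ; required on older PureBasic versions before any network call

url$  = "http://www.purebasic.com/"                ; example URL
file$ = GetTemporaryDirectory() + "page.html"      ; example file name

If ReceiveHTTPFile(url$, file$)
  If ReadFile(0, file$)
    While Eof(0) = 0
      html$ + ReadString(0) + " "   ; join the lines with a space
    Wend
    CloseFile(0)
    Debug StripTags(html$)
  EndIf
Else
  Debug "Download failed"
EndIf
```

Building the result one character at a time is slow on large pages; it is kept here only because it mirrors the loop above.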
- Fluid Byte
- Addict
- Posts: 2336
- Joined: Fri Jul 21, 2006 4:41 am
- Location: Berlin, Germany