Downloading to an HTML file then converting it to plain text
Posted: Thu Jul 12, 2007 8:33 pm
by Seymour Clufley
I want to make a program that reads a webpage address from a file (I know how to do this bit!) and then snatches text from that page.
As far as I know, the procedure would be for the program to download the webpage to an HTML file, then get the entire contents of that file and parse it somehow.
I'm aware that it would be possible to simply remove all text enclosed within < and > tags, but that may cut out all the text from a webpage, so it isn't a solution in itself.
The input could be any webpage whatsoever. (I'm using Yahoo's random URL generator.)
Formatting isn't important. I just want random pieces of text from the Internet.
I know there are already programs available that convert HTML to plain text. But to avoid licensing complications and the idiosyncrasies of each program, I'd like to make one myself.
So what I need help with, and I'd be very grateful for any, is:
1. How to download a webpage from a URL to an HTML file.
2. How to convert it to plain text.
Basically what I want to do is like clicking on any webpage and pressing CTRL+A and CTRL+C to select all text and copy it. Could that be done with PB coding?
Thanks in advance for any help,
Seymour.
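For question 1, one possibility on Windows is the URLDownloadToFile API call, which PureBasic exposes with a trailing underscore. The following is only a minimal sketch under stated assumptions: it is Windows-only, the URL and file name are placeholders that do not come from this thread, and file number 0 is an arbitrary choice. It saves the page and then reads the saved HTML back into a string for later parsing.

```purebasic
; Sketch: download a page to an HTML file, then load that file into a string.
; url$ and file$ are placeholder values chosen for illustration.
url$  = "http://www.purebasic.com"
file$ = "C:\page.html"

If URLDownloadToFile_(0, url$, file$, 0, 0) = #S_OK
  If ReadFile(0, file$)                 ; open the saved file as file #0
    While Eof(0) = 0                    ; read it line by line
      html$ = html$ + ReadString(0) + #CRLF$
    Wend
    CloseFile(0)
    Debug "Read " + Str(Len(html$)) + " characters"
  EndIf
Else
  Debug "Download failed"
EndIf
```

The string html$ then holds the raw markup, which still has to be converted to plain text (question 2).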
Re: Downloading to an HTML file then converting it to plain
Posted: Thu Jul 12, 2007 10:01 pm
by Kiffi
Hello Seymour,
with the PureDispHelper from ts-soft it is quite easy to extract all text from a website:
Code:
EnableExplicit
Define.l oIE, result
dhToggleExceptions(#True)                  ; let the library report COM errors
oIE = dhCreateObject("InternetExplorer.Application")
If oIE
  dhPutValue(oIE, "Visible = %b", #False)  ; keep the browser window hidden
  dhCallMethod(oIE, "Navigate (%T)", @"www.purebasic.com")
  Repeat                                   ; wait until the page has loaded
    dhGetValue("%d", @result, oIE, "ReadyState")
    Delay(10)                              ; avoid a busy loop while waiting
  Until result = 4                         ; 4 = READYSTATE_COMPLETE
  dhGetValue("%T", @result, oIE, ".document.body.innertext")
  If result                                ; result now points to the text
    MessageRequester("Plain text", PeekS(result), #MB_ICONINFORMATION)
    dhFreeString(result) : result = 0
  EndIf
  dhReleaseObject(oIE)
EndIf
Greetings ... Kiffi
Posted: Fri Jul 13, 2007 12:00 am
by Seymour Clufley
Thanks for telling me about this.
Please don't think I want to avoid doing any work... but how could that code be adapted to work with a single webpage instead of a whole website?
Posted: Fri Jul 13, 2007 7:56 am
by Kiffi
Seymour Clufley wrote: but how could that code be adapted to work with a single webpage instead of a whole website?
Oh, I see, there is a misunderstanding. The code loads the content of a single webpage.
I wrote website but I meant webpage.
Greetings ... Kiffi
Posted: Fri Jul 13, 2007 4:53 pm
by Seymour Clufley
Oh, excellent. Thanks very much!
Since I'm new to PB, I have to ask a few embarrassing questions... is the resulting plain text retrievable using the PeekS(result) variable? So I could for example write:
Code:
result$=PeekS(result)
file$="C:\plaintext.txt"
WriteString(file$,result$)
Also, how do I use the PureDispHelper library? I've placed it in the same folder as the PB project file, and tried to compile the executable to the same folder, but I get an error on the line:
"dhToggleExceptions(#True)"
It says it isn't an array or macro etc.
The help file doesn't seem to explain how to use a library with a PB project.
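On the file-writing question: PeekS() is a function that returns the string stored at the given address, so result$=PeekS(result) is fine. WriteString(), however, expects the number of an open file rather than a file name, so the file has to be created first. A minimal sketch, assuming the text is already in result$ and using file number 0 (an arbitrary choice for this illustration):

```purebasic
; Sketch: write a string to a text file.
; The literal below stands in for PeekS(result) from the code above.
result$ = "example text"
file$   = "C:\plaintext.txt"

If CreateFile(0, file$)        ; create (or overwrite) the file as file #0
  WriteString(0, result$)      ; write the text, without a trailing newline
  CloseFile(0)
Else
  Debug "Could not create " + file$
EndIf
```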
Posted: Fri Jul 13, 2007 6:17 pm
by Flype
I've placed it in the same folder as the PB project file
hi,
no, not in the PB project file.
you must place it in the PB home folder, which is commonly 'c:\program files\purebasic\'
and then restart purebasic.
in purebasic, the language can be extended with such libs, this way you have new functions available (just like an included file).
Posted: Sat Jul 14, 2007 12:29 am
by Seymour Clufley
Damn, it still isn't working.
PureBasic is in the default location "C:\Program Files\PureBasic" and I have placed the PureDispHelper in that folder, and even copied it into the Library folder and the UserLibraries folder within that. So there are three instances of it! I'm still getting the "dhToggleExceptions is not a macro, array etc." error. And I've restarted PB several times.
Perhaps it's to do with the code? All I've done is pasted the code you gave me into a new PB project. Is there other stuff to do?
Posted: Sat Jul 14, 2007 12:51 am
by ts-soft
PureArea.net wrote: Notes for installing of user-libs:
The real command library from the respective archive must be copied into the directory PureBasic\PureLibraries\UserLibraries\. After the next restart of the PureBasic editor the included commands will be recognized and can be used for programming.
If predefined constants (in a .res file) are included in the archive, this file must be copied into the directory PureBasic\Residents\.
If the archive includes a manual in .chm format, this file can be copied into the directory PureBasic\Help\, then the context-sensitive help (via F1) can be used.
Posted: Sat Jul 14, 2007 12:54 am
by srod
Yes it should be placed in the \PureLibraries\UserLibraries\ subfolder of your purebasic installation folder.
You don't have 2 versions of Purebasic on your computer do you by any chance? A couple of beta versions perhaps?
Posted: Sat Jul 14, 2007 2:30 am
by PB
No need for a lib. And this is not my code, I think it was Freak's originally?
Code:
; ----- START WEBGADGET COPY
DataSection
  IID_IHTMLDocument2: ; {332C4425-26CB-11D0-B483-00C04FD90119}
  Data.l $332C4425
  Data.w $26CB,$11D0
  Data.b $B4,$83,$00,$C0,$4F,$D9,$01,$19
  IID_IHTMLDocument3: ; {3050F485-98B5-11CF-BB82-00AA00BDCE0B}
  Data.l $3050F485
  Data.w $98B5,$11CF
  Data.b $BB,$82,$00,$AA,$00,$BD,$CE,$0B
  IID_NULL: ; {00000000-0000-0000-0000-000000000000}
  Data.l $00000000
  Data.w $0000,$0000
  Data.b $00,$00,$00,$00,$00,$00,$00,$00
EndDataSection
Procedure WebGadget_Document(gad,*IID)
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA)
  If Browser
    If Browser\get_Document(@DocumentDispatch.IDispatch)=#S_OK And DocumentDispatch
      DocumentDispatch\QueryInterface(*IID,@Document) : DocumentDispatch\Release()
    EndIf
  EndIf
  ProcedureReturn Document
EndProcedure
Procedure.s WebGadget_Selection(gad)
  Document.IHTMLDocument2=WebGadget_Document(gad,?IID_IHTMLDocument2)
  If Document
    If Document\get_selection(@Selection.IHTMLSelectionObject)=#S_OK
      If Selection\get_type(@bstr_type)=#S_OK And bstr_type
        If LCase(PeekS(bstr_type,-1,#PB_Unicode))="text"
          If Selection\createRange(@TextRange.IDispatch)=#S_OK And TextRange
            UnicodeText$=Space(10) : PokeS(@UnicodeText$,"text",-1,#PB_Unicode) : pUnicodeText=@UnicodeText$
            If TextRange\GetIDsOfNames(?IID_NULL,@pUnicodeText,1,#LOCALE_SYSTEM_DEFAULT,@dispid_text)=#S_OK
              params.DISPPARAMS\cArgs=0 : params\cNamedArgs=0
              If TextRange\Invoke(dispid_text,?IID_NULL,#LOCALE_SYSTEM_DEFAULT,#DISPATCH_PROPERTYGET,@params,@varResult.VARIANT,0,0)=#S_OK
                If varResult\vt=#VT_BSTR
                  r$=PeekS(varResult\bstrVal,-1,#PB_Unicode)
                Else
                  ; convert to a string first, then read it (the original never read the converted value)
                  If VariantChangeType_(@varResult,@varResult,0,#VT_BSTR)=#S_OK
                    r$=PeekS(varResult\bstrVal,-1,#PB_Unicode)
                  EndIf
                EndIf
                VariantClear_(@varResult)
              EndIf
            EndIf
            TextRange\Release()
          EndIf
        EndIf
        SysFreeString_(bstr_type)
      EndIf
      Selection\Release()
    EndIf
    Document\Release()
  EndIf
  ProcedureReturn r$
EndProcedure
Procedure.s WebGadget_CopyText(gad)
  Document.IHTMLDocument2=WebGadget_Document(gad,?IID_IHTMLDocument2)
  If Document
    If Document\get_body(@Body.IHTMLElement)=#S_OK
      If Body\get_innerText(@bstr_text)=#S_OK And bstr_text
        r$=PeekS(bstr_text,-1,#PB_Unicode) : SysFreeString_(bstr_text)
      EndIf
      Body\Release()
    EndIf
    Document\Release()
  EndIf
  ProcedureReturn r$
EndProcedure
; ----- END WEBGADGET COPY
Procedure WaitForWebGadget(gad)
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA)
  Repeat
    While WindowEvent() : Wend : Browser\get_Busy(@busy.l)
    If busy=#VARIANT_TRUE : Sleep_(1) : EndIf
  Until busy=#VARIANT_FALSE
EndProcedure
Enumeration
  #Web
  #Editor
  #Button1
EndEnumeration
If OpenWindow(0,0,0,1024,768,"Web",#PB_Window_ScreenCentered|#PB_Window_SystemMenu)
  CreateGadgetList(WindowID(0))
  WebGadget(#Web,0,0,900,350,"")
  ButtonGadget(#Button1,10,360,100,25,"Convert to plain text")
  EditorGadget(#Editor,0,390,900,350)
  SetGadgetText(#Web,"http://www.google.com")
  WaitForWebGadget(#Web)
  Repeat
    Event=WaitWindowEvent()
    If Event=#PB_Event_Gadget
      Select EventGadget()
        Case #Button1
          SetGadgetText(#Editor,WebGadget_CopyText(#Web))
      EndSelect
    EndIf
  Until Event=#PB_Event_CloseWindow
EndIf
Posted: Sun Aug 05, 2007 8:19 am
by Seymour Clufley
I'm back on to this project now. It has ceased being an exe and become a dll.
Using Freak's code, I can get the text from a webpage but there's a strange problem. As I said, the code is inside a dll that gets called by an exe. I can call other functions in the dll and there are no problems, but when I call the "ReadWebpage" function (which uses Freak's code), the exe can't be shut down.
It "closes", but it remains visible in TaskManager>Processes. And the only way to shut it down is to terminate the process there.
The dll seems to keep on operating. Or rather, it keeps the exe that used it operating.
As Kiffi said, the other option is to use PureDispHelper. I tried that and couldn't get it to work at all. The download I got didn't contain any .res files (should there be?), so nothing to paste into the Residents folder. Either way, trying to compile throws up the error "dhToggleExceptions is not an array, macro, etc."
All in all it's a bit confusing. Does anyone know any other ways to get plain text from the webgadget?
Posted: Sun Aug 05, 2007 11:23 am
by xgp
Code:
input$ = "<html><head><title>title example</title></head><body><h2>my homepage</h2></body></html>"
iSize = Len(input$)
iInput = FindString(input$,"<body>",1)+6 ; start after the <body> tag
PlainText$ = ""
While iInput <= iSize
  temp$ = Mid(input$,iInput,1)
  If temp$ = "<"
    iEnd = FindString(input$,">",iInput+1) ; find the end of this tag
    If iEnd = 0 : Break : EndIf            ; unterminated tag: stop instead of looping forever
    iInput = iEnd+1
  Else
    PlainText$ = PlainText$ + temp$
    iInput = iInput+1
  EndIf
Wend
Debug PlainText$
While reading your post, this methodology came into my mind, but after reading some of the answers here, i am not sure if this really helps.
Anyway,...
Regards
xgp
Posted: Mon Aug 06, 2007 10:55 am
by Seymour Clufley
Thanks, XGP. That code works, but it's very slow (about 10 seconds from start to finish on my PC).
I'm amazed there's no way just to copy the text from the webgadget???
Any ideas would be gratefully appreciated,
Seymour.
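Part of the slowness probably comes from building the result one character at a time, since every concatenation copies the string built so far. A variation on the same idea copies whole runs of text between tags instead. This is only a sketch: StripTags is a hypothetical helper name, and it makes no attempt to skip script or style blocks, comments, or to decode entities such as &amp;.

```purebasic
; Sketch: remove everything between < and >, copying text in chunks.
Procedure.s StripTags(html$)
  Protected text$, pos, tagStart, tagEnd, size
  pos  = 1
  size = Len(html$)
  While pos <= size
    tagStart = FindString(html$, "<", pos)
    If tagStart = 0                        ; no more tags: keep the rest
      text$ = text$ + Mid(html$, pos, size - pos + 1)
      Break
    EndIf
    If tagStart > pos                      ; copy the text before the tag
      text$ = text$ + Mid(html$, pos, tagStart - pos)
    EndIf
    tagEnd = FindString(html$, ">", tagStart + 1)
    If tagEnd = 0 : Break : EndIf          ; unterminated tag: stop
    pos = tagEnd + 1                       ; continue after the tag
  Wend
  ProcedureReturn text$
EndProcedure

Debug StripTags("<h2>my homepage</h2><p>hello</p>")
```

On a large page this should need far fewer string operations than the character-by-character loop, though the actual speed-up has not been measured here.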
Posted: Sun Sep 30, 2007 7:01 am
by PB
> I'm amazed there's no way just to copy the text from the webgadget???
The code I posted above does it, but yes, I'd prefer a native solution too.
Posted: Mon Aug 04, 2008 10:46 am
by Fluid Byte
This just saved my butt, thanks PB!