Download any section of .html from web

Share your advanced PureBasic knowledge/code with the community.
LJ
Enthusiast
Posts: 177
Joined: Wed Apr 30, 2003 4:00 pm

Download any section of .html from web

Post by LJ »

Code updated for 5.20+

Ever wanted to download a table from a web page, or any section of a web site? The code below downloads a section from a web page and saves it to a .html file. The best way to use this code: go to the web page you want to grab data from, view the source, copy the piece of HTML where you want to start grabbing data and plug it into the START variable, then do the same with the STOP variable.
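The START/STOP idea can be sketched in a few lines. This is only an illustrative helper (the procedure name and the marker strings are examples, not part of the code further down):

```purebasic
; Sketch: extract everything between two marker strings.
; Returns "" if either marker is missing.
Procedure.s ExtractSection(Html.s, StartMark.s, StopMark.s)
  Protected startPos = FindString(Html, StartMark, 1)
  Protected stopPos  = FindString(Html, StopMark, startPos + 1)
  If startPos And stopPos
    ; include the closing marker in the result
    ProcedureReturn Mid(Html, startPos, stopPos - startPos + Len(StopMark))
  EndIf
  ProcedureReturn ""
EndProcedure

Debug ExtractSection("<p>hi</p><table>data</table>", "<table>", "</table>")
```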

The code below grabs the table from the web site http://www.rsp.wisc.edu/html/tab2001-2.html

This code compiles to less than a 20K .exe!!

Code: Select all

;Download .html code from web
;By Lance Jepsen, 7/19/2003
;Based on code by Pille, 14.07.2003
;Adapted from Jost Schwider's VB OpenUrl
;Greetings Pille! Mirabile dictu!


Enumeration
  #Window 
  #Editor
  #Url 
  #cmdOpenUrl
  #View
EndEnumeration

defaultUrl.s="http://www.google.fr";Put URL you want to grab data from in here

Procedure.s OpenURL(URL.s, OpenType.b)
  
  isLoop.b=1
  INET_RELOAD.l=$80000000
  hInet.l=0: hURL.l=0: Bytes.l=0
  Buffer.s=Space(2048)
  res.s=""
  
  hInet = InternetOpen_("PB@INET", OpenType, #Null, #Null, 0)
  hURL = InternetOpenUrl_(hInet, URL, #Null, 0, INET_RELOAD, 0)
  
  Repeat
    InternetReadFile_(hURL, @Buffer, Len(Buffer), @Bytes)
    If Bytes = 0
      isLoop=0
    Else
      res = res + Left(Buffer, Bytes)
    EndIf
  Until isLoop=0
  
  Debug res
  
  InternetCloseHandle_(hURL)
  InternetCloseHandle_(hInet)
  
  ProcedureReturn res
EndProcedure

If OpenWindow(#Window, 0, 0, 500, 300, "Download", #PB_Window_SystemMenu | #PB_Window_TitleBar | #PB_Window_ScreenCentered)
  
  EditorGadget(#Url, 5, 5, 410, 20)
  EditorGadget(#Editor, 5, 30, 490, 260)
  ;HideGadget(#Url,1)
  HideGadget(#Editor,1)
  ButtonGadget(#cmdOpenUrl, 420, 5, 75, 20, "Download")
  
  SetGadgetText(#Url, defaultUrl)
  
  Repeat
    EventID.l = WaitWindowEvent()
    If EventID = #PB_Event_Gadget
      Select EventGadget()
          
        Case #cmdOpenUrl
          
          result = MessageRequester("Download","Are you connected to the Internet?",#PB_MessageRequester_YesNo)
          If result = #PB_MessageRequester_Yes
            ProgressBarGadget(0, 80, 80,250, 25, 0,100)
            SetGadgetState   (0, 0)
            result$ = OpenUrl(GetGadgetText(#Url),1) ; download once and reuse the result
            SetGadgetState   (0, 20)
            SetGadgetText(#Editor, result$)
            SetGadgetState   (0, 80)
            If CreateFile(0,"test.html") ; Now we save the data we grabbed to a html file. Open up this file with a web browser to see the data you grabbed.
              SetGadgetState   (0, 90)
              WriteString(0, result$)
              SetGadgetState   (0, 100)
              CloseFile(0)
            EndIf
            Delay(500)
            HideGadget(0,1)
            SetGadgetText(#Editor,"Downloading complete. File TEST.HTML created.")
            HideGadget(#Editor,0)
            DisableGadget(#cmdOpenUrl,1)           
          EndIf
          
      EndSelect
    EndIf       
  Until EventID = #PB_Event_CloseWindow
  
EndIf
End 
ricardo
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: Download any section of .html from web

Post by ricardo »

Hi,

Nice example, but I think it's preferable to access the page through the DOM. Of course you must use the WebGadget as an ActiveX control, but the advantages are great.

The problem with parsing the web page as a text file is that you must find a unique piece of markup to anchor the grab, and if that piece changes you can say goodbye to the fetch.

Using the DOM you only need to say which table you want (the first, the second, the third, etc.) and get its innerHTML with something like this:

Html$ = document.all.mytable.innerHTML

For managing web pages (or XML, etc.) the DOM is the best choice, imho.

DOM (Document Object Model) explanation is available at:

http://www.w3.org/DOM/
ARGENTINA WORLD CHAMPION
Justin
Addict
Posts: 948
Joined: Sat Apr 26, 2003 2:49 pm

Post by Justin »

If the page is bigger than 64 KB it will probably crash, because you are using a string.

I'm still waiting for simple memory-parsing functions in PB to do things like this: download the file to a buffer and parse it.

findmemorystring(*memoryaddress,string$)
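A rough sketch of the wished-for function is possible with PB's existing CompareMemoryString. Note this is a hypothetical helper (the name comes from Justin's post, it is not a built-in command), and it assumes a plain ASCII buffer:

```purebasic
; Hypothetical FindMemoryString sketch - not a built-in PB command.
; Scans an ASCII buffer byte by byte for the given string.
Procedure.l FindMemoryString(*Buffer, BufferLen.l, Search.s)
  Protected SearchLen.l = Len(Search)
  Protected i.l
  For i = 0 To BufferLen - SearchLen
    If CompareMemoryString(*Buffer + i, @Search, #PB_String_CaseSensitive, SearchLen) = #PB_String_Equal
      ProcedureReturn i   ; offset of the first match
    EndIf
  Next
  ProcedureReturn -1      ; not found
EndProcedure
```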
LJ
Enthusiast
Posts: 177
Joined: Wed Apr 30, 2003 4:00 pm

Here we go...

Post by LJ »

I've erased this message. But do consider what I said if you read it, and if you didn't...it's a
Last edited by LJ on Sun Jul 20, 2003 9:22 pm, edited 1 time in total.
ricardo
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: Here we go...

Post by ricardo »

@LJ

Sorry if you felt that I was criticising your code. I was not.

Here (in the forum) we usually try to find solutions, and when someone shares one, many of us discuss it to find alternatives; please don't take that as criticism.

In my case, I am fetching and grabbing info from different sites too, and I found that relying on some fixed part of the markup gives me trouble. The best solution I found (not perfect, of course) is to use the DOM 'path' to the particular element I want to fetch.

I agree the 64 KB limit is often a problem, because with many functions (in external DLLs) you must pass a string pointer to receive data, and sometimes you can't pass a pointer to a memory bank because it doesn't work.

So if the string is bigger than 64 KB your app will crash.

I'm not sure why PB has the 64 KB limit, since other languages don't have it.

In this very case the DOM doesn't have this limitation, which is a problem here: if I ask for the innerHTML of some part of the web page, the result is commonly bigger than 64 KB, and msscript.ocx doesn't accept a pointer to a memory bank for the result.
LJ
Enthusiast
Posts: 177
Joined: Wed Apr 30, 2003 4:00 pm

Re: Here we go...

Post by LJ »

Aries Moon. ;)

Lj
Pille
New User
Posts: 7
Joined: Thu Jul 03, 2003 11:29 pm

Post by Pille »

@LJ:
Nice to see that OpenUrl was useful to someone else :D
I also used it to write a parser for a website (in my case www.spiegel.de) to put all the headlines, with links, into a PB-written newsticker.
(In this case, as in many others, 64 KB ought to be enough for dealing with websites.)
Funny thing - you mentioned the 20 KB of your program - try packing it with FSG and you'll have just 8 KB left ;)

(Don't take the critical comments too personally - you only grow through criticism; if you only ever get good feedback you will always stay at the same level. By the way, I like critical comments a lot more than 'Huh, what beautiful code'. Maybe there are tons of things you could do better that you don't know about yet, so be glad there's someone trying to give you advice ;) )



@ricardo
I admit it's nice and comfortable to use the DOM object (although sometimes you can't avoid writing your own parser) - I often use it within VB and JS, but I've never used it in PB... could you drop a few lines of code showing how to access the DOM within PB?
LJ
Enthusiast
Posts: 177
Joined: Wed Apr 30, 2003 4:00 pm

Greetings

Post by LJ »

Greetings Pille:

Actually, personally, I don't grow from critics. I grow through my own relentless pursuit of perfection, because quite often the critics are wrong.

Criticism is easy, it's human; but creation is God-like, and that is where it's at: Ad eundum quo nemo ante iit.
ricardo
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Post by ricardo »

Pille wrote:@ricardo
I admit it's nice and comfortable to use the DOM object (although sometimes you can't avoid writing your own parser) - I often use it within VB and JS, but I've never used it in PB... could you drop a few lines of code showing how to access the DOM within PB?
Ok.

Since I don't want this to seem like an ad, I will use code that doesn't run from inside PB. (Of course it's easier to do it from inside PB.)

With this example we will parse a page, get ALL the links on it, and store them as the result:

Code: Select all

Dim MyVar, myIE

Set myIE = WScript.CreateObject("InternetExplorer.Application", "IE_")
myIE.ToolBar = False
myIE.StatusBar = False
myIE.Resizable = False
myIE.MenuBar = False
myIE.AddressBar = False
myIE.Width = 700
myIE.Height = 500
myIE.Left = 10
myIE.Top = 10

'Wait until the page is loaded
myIE.Navigate "www.google.com"
Do
  WScript.Sleep 100 ' avoid hogging the CPU while the page loads
Loop While myIE.Busy


myIE.Visible = False
'myIE.Visible = True


With myIE
 For each link in .document.links
 'get each link on the web page
  MyVar = MyVar & link.InnerText & " : " & link & "<br>"
 Next
End With

myIE.document.Open()
myIE.document.write cstr(MyVar)
MyIE.document.Close()

myIE.ExecWB 4,2,"Google Links"
MyIE.Quit
You just need to write these lines into a .vbs file and run it, and you will get all the links from the Google page.
It could be done with IE in 'silent' (invisible) mode - just change the Visible property - and there is a way to save the result without prompting the user, but at the moment I don't remember that part.

From PB you can write .vbs files, run them, and then read the result.
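A rough sketch of that round trip from PB (the file name is illustrative, and Script$ is assumed to hold the VBScript source shown above):

```purebasic
; Sketch: write a VBScript to disk, run it, and wait for it to finish.
; "getlinks.vbs" is an example name; Script$ would contain the VBS code.
If CreateFile(0, "getlinks.vbs")
  WriteString(0, Script$)
  CloseFile(0)
  RunProgram("wscript.exe", "getlinks.vbs", "", #PB_Program_Wait)
  ; ...then open and read whatever result file the script produced
EndIf
```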
ricardo
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Post by ricardo »

Now, you can grab pages in realtime from PHP or ASP, but you can use JavaScript too. The limitation is that it must run in IE 5+ and that it is a little slow.


Look at this beauty :D

Just save it as an .html file and open it in Internet Explorer; it will parse the Google page at runtime.

Code: Select all

<html><head>
<script>
function getWeb(url) {
  var http = new ActiveXObject('Microsoft.XMLHTTP');
  http.open('GET', url, false);
  http.send();
  return http.responseText;
}
</script>
<script language="vbs">
Function ParseaGoogle()
 Dim xHtml, Position, Position1, xxHtml, Parte, Parte1
 Parte = "<div"
 Parte1 = "</div>"
 xHtml = getWeb("http://www.google.com/search?q=purebasic")
 Position = instr(1,xHtml,Parte)
  for i = 1 to 3
    Position = instr(Position+1,xHtml,Parte)
  next
 Position1 = instr(Position+1,xHtml,Parte1)
 xxHtml = Mid(xHtml,Position,Position1 - Position)
 document.all.aqui.insertAdjacentHtml "afterbegin",xxHtml
End Function
</script>
</head>
<body onload="ParseaGoogle()">
&nbsp;<FONT COLOR='#FF0000'><B>Parsed from Google in RealTime</B></FONT><P>*Only works with IE<p>
<table border="1" style='position:absolute;width:400px;height:100px;'>
<tr>

<td id="aqui">&nbsp;</td>
</tr>
</table>
</body></html>
LJ
Enthusiast
Posts: 177
Joined: Wed Apr 30, 2003 4:00 pm

Nice

Post by LJ »

Nice Ricardo!

Here is PureBasic code that achieves the exact same result (simply change the START and STOP parser text), with a nice 'Are you connected to the Internet?' prompt, a progress bar, and the ability to save to a .html file.

Code: Select all

;Download .html code from web 
;By Lance Jepsen, 7/19/2003 
;Based on code by Pille, 14.07.2003 
;Adapted from Jost Schwider's VB OpenUrl 
;Greetings Pille! Mirabile dictu! 

Enumeration 
  #Window 
  #Editor
  #Url 
  #cmdOpenUrl
  #View
EndEnumeration 

defaultUrl.s="http://www.google.com/search?q=purebasic";Put URL you want to grab data from in here 

Procedure.s OpenURL(URL.s, OpenType.b) 
  
  isLoop.b=1 
  INET_RELOAD.l=$80000000 
  hInet.l=0: hURL.l=0: Bytes.l=0 
  Buffer.s=Space(2048) 
  res.s="" 

   hInet = InternetOpen_("PB@INET", OpenType, #Null, #Null, 0) 
   hURL = InternetOpenUrl_(hInet, URL, #Null, 0, INET_RELOAD, 0) 
    
   Repeat 
      InternetReadFile_(hURL, @Buffer, Len(Buffer), @Bytes) 
      If Bytes = 0 
         isLoop=0 
      Else 
         res = res + Left(Buffer, Bytes) 
      EndIf 
   Until isLoop=0 

   InternetCloseHandle_(hURL) 
   InternetCloseHandle_(hInet) 

   ProcedureReturn res 
EndProcedure 

If OpenWindow(#Window, 0, 0, 500, 300, "Download", #PB_Window_SystemMenu | #PB_Window_TitleBar | #PB_Window_ScreenCentered) 
    
   EditorGadget(#Url, 5, 5, 410, 20) 
   EditorGadget(#Editor, 5, 30, 490, 260) 
   HideGadget(#Url,1) 
   HideGadget(#Editor,1) 
   ButtonGadget(#cmdOpenUrl, 420, 5, 75, 20, "Download") 
    
   SetGadgetText(#Url, defaultUrl) 
    
   Repeat 
      EventID.l = WaitWindowEvent() 
      If EventID = #PB_Event_Gadget    
         Select EventGadget() 
                  
          
         Case #cmdOpenUrl 
          
          result = MessageRequester("Download","Are you connected to the Internet?",#PB_MessageRequester_YesNo) 
          If result = #PB_MessageRequester_Yes 
          ProgressBarGadget(0, 80, 80,250, 25, 0,100) 
          SetGadgetState   (0, 0) 
             html$ = OpenUrl(GetGadgetText(#Url),1) ; download once and reuse the result 
             SetGadgetText(#Editor, html$) 
             SetGadgetState   (0, 20) 
             startPos = FindString(html$, "<div",1); In view source of a web page: this is where you want to begin grabbing data 
             SetGadgetState   (0, 40) 
             stopPos = FindString(html$, "</div>",1); this is where you want to stop getting the data 
             SetGadgetState   (0, 60) 
             result$ = Mid(html$, startPos, stopPos-startPos) 
             SetGadgetState   (0, 80) 
             CreateFile(0,"test.html") ; Now we save the data we grabbed to a html file. Open up this file with a web browser to see the data you grabbed. 
             SetGadgetState   (0, 90) 
             WriteString(0, result$) 
             SetGadgetState   (0, 100) 
             CloseFile(0) 
             Delay(500) 
             HideGadget(0,1) 
             SetGadgetText(#Editor,"Downloading complete. File TEST.HTML created.") 
             HideGadget(#Editor,0) 
             DisableGadget(#cmdOpenUrl,1)            
           EndIf 
                    
         EndSelect 
      EndIf        
   Until EventID = #PB_Event_CloseWindow 

EndIf 
End 
ricardo
Addict
Posts: 2438
Joined: Fri Apr 25, 2003 7:06 pm
Location: Argentina

Re: Nice

Post by ricardo »

@LJ

Very nice!!!

Let me explain my interest in the DOM (in fact, in XML):

HTML code (like many other formats) is not plain text, because it has structure (even though HTML is often badly structured), and we CAN use that structure to 'navigate' the document (that is the Document Object Model).

See this:

Code: Select all

<body>
    <table>
        <tr>
            <td> Cell1 </td>
            <td> Cell2  </td>
        </tr>
        <tr>
            <td> Cell1 </td>
            <td> Cell2  </td>
        </tr>
    </table>
</body>
If we look carefully we can see some STRUCTURE (something like a tree) that lets us pick an element (and its content) more easily than parsing it as a string.

In the DOM we find a hierarchy showing that some elements are children of other elements (just like a tree), and it lets us search using that hierarchy.

In this case, if I want to get the content of the first cell, I could do something like this (in pseudo-code):

Content$ = body.table.tr.td(1)

If we can parse and manage HTML (and XHTML and XML) this way, we can do magic!!!

It's possible to do it in PureBasic. Once I wrote some PB code (I don't know where I have put it!!) that did two things:

1. Read the string and put every element at its hierarchy level.

2. Let me ask for the innerText (the plain text inside an element and its children) or the innerHTML (all the elements and text inside an element).

The problem is that I never finished this code, because I preferred to start working directly with the DOM.

But writing some kind of DOM parser could be great fun!!
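A minimal sketch of such an element tree in PB might look like this. The structure and field names are purely illustrative, not from any existing parser:

```purebasic
; Hypothetical sketch of the element hierarchy described above.
; Each parsed tag becomes one entry; Parent links form the tree.
Structure HtmlElement
  Tag.s        ; e.g. "table", "tr", "td"
  InnerText.s  ; plain text directly inside the element
  Parent.l     ; index of the parent element, -1 for the root
EndStructure

Dim Elements.HtmlElement(100)
Elements(0)\Tag = "body"  : Elements(0)\Parent = -1
Elements(1)\Tag = "table" : Elements(1)\Parent = 0
Elements(2)\Tag = "tr"    : Elements(2)\Parent = 1
```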

(We need to keep in mind that <br>, <hr> and others are elements with no closing tag.)

*Please NEVER take my comments as criticism of your work or ideas; I just started dreaming when I saw your code, NOT making critiques. :D :D

In fact, for the last 3 weeks I have been working on a program that lets me fetch hundreds of web pages and keep the content in a database (that's why I have been asking a lot about SQLite).
I'm finishing it, but I do all the work by combining JavaScript DOM and PureBasic.

I even developed (as part of this same software) a WYSIWYG HTML editor. That's why, when I read your code, I got interested in the matter and wrote some comments, never because I wanted to criticise.
LJ
Enthusiast
Posts: 177
Joined: Wed Apr 30, 2003 4:00 pm

Good man

Post by LJ »

You are a good man, Ricardo. I better understand where you are coming from now; I hope you never lose the kindness that you have.

I think your DOM method sounds great. I also think we should consider the .NET framework. Here is an excellent article: http://www.devx.com/dotnet/Article/16273/0/page/2

Microsoft is spending big money, and many very large companies are implementing the .NET framework and communicating via SOAP. Most ISPs will allow the use of the .NET framework on their web hosting services in the future, so it would be nice to use alongside the DOM method.
El_Choni
TailBite Expert
Posts: 1007
Joined: Fri Apr 25, 2003 6:09 pm
Location: Spain

Post by El_Choni »

I've coded an HTML2DB app recently for a project I'm working on, using MySQL and MyODBC (and PB's Database functions). It takes an HTML page and, recursively, stores each element in a MySQL table as (idx, tag, class, id, attributes, content). Each nested tag is stored in "content" as "<tag=idx>", where idx is an index into the nested element's row, so when the page is retrieved by the client, PHP can use recursion again to output the desired HTML page. Of course, the HTML page must be well structured (no empty <p>, <i>no <b>overlapped</i> elements</b>, etc.).

Let me know if someone is interested.
El_Choni
dige
Addict
Posts: 1391
Joined: Wed Apr 30, 2003 8:15 am
Location: Germany

Post by dige »

Yeah! I'm very interested!

Regards,

dige