All times are UTC + 1 hour




 Post subject: Crawling Websites
PostPosted: Wed May 16, 2007 6:47 pm 
Addict

Joined: Thu Jul 01, 2004 2:51 am
Posts: 905
Location: Tacoma, WA
As a small side project, I wanted to look at making a simple web crawler. Not the Spiderman kind but the kind that could be fed a domain name like "google.com" and then crawl through the allowed pages and store some information.

I'd like to get both the plain text and source of the page I'm looking at.

I would also like to conform to ... what do they call them? The text files that control where things like this are allowed to go.

I don't really need anything complex. Just a simple little thing to retrieve page info.

Is there code that can already retrieve web page information? I'd like to do that without resorting to the WebGadget if possible.

Thanks for any information!


 Post subject:
PostPosted: Wed May 16, 2007 6:50 pm 
Addict

Joined: Thu Apr 05, 2007 12:15 am
Posts: 899
Location: Nuremberg, Germany
Just google for "HTTrack". It's open source.

_________________
Windows 7 & PureBasic 4.4


 Post subject:
PostPosted: Thu May 17, 2007 1:23 am 
Addict

Joined: Mon May 29, 2006 1:01 am
Posts: 1966
Location: Outback
I crawl my supplier sites using URLDownloadToFile or, in one case, a site-specific crawler app that uses num3-derived PB code.
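
For illustration only, here is a rough Python sketch of the same "download the page to a file" step (not PureBasic, and not the code described above). The function name download_to_file and the "ExampleCrawler/0.1" user-agent string are made up for this example.

```python
import shutil
import urllib.request


def download_to_file(url, path, user_agent="ExampleCrawler/0.1"):
    """Fetch url and write the raw response bytes to path.

    The user_agent default is a placeholder, not a real crawler name.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Stream the response straight to disk instead of holding it in memory.
    with urllib.request.urlopen(request, timeout=30) as response, \
            open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path
```

Calling download_to_file("http://example.com/", "page.html") would save the page source to page.html, ready to be read back and scanned.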

Once the file is downloaded, I read it, convert it to lowercase, and replace double quotes with single quotes so that I don't have to parse it properly. Then I use FindString to locate key tags like "<a" and "<img" and extract the links, files, etc. from href='xxx', src='xxx' and so on. Whatever is desired.
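
A minimal sketch of that lowercase, swap-quotes, then search approach, written in Python for illustration rather than PureBasic. The helper name extract_attr_values is hypothetical. One thing to watch: lowercasing the whole page also lowercases the URLs, which may matter on servers with case-sensitive paths.

```python
def extract_attr_values(html, tag, attr):
    """Collect attr='...' values from <tag ...> elements using plain string search."""
    # Normalise as described: lowercase everything, use one quote style.
    text = html.lower().replace('"', "'")
    opener = "<" + tag
    needle = attr + "='"
    values = []
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            break
        end = text.find(">", start)
        if end == -1:
            break
        pos = end + 1
        # Skip longer tag names, e.g. "<abbr" when looking for "<a".
        after = text[start + len(opener):start + len(opener) + 1]
        if not (after.isspace() or after == ">"):
            continue
        element = text[start:end]
        value_start = element.find(needle)
        if value_start == -1:
            continue
        value_start += len(needle)
        value_end = element.find("'", value_start)
        if value_end != -1:
            values.append(element[value_start:value_end])
    return values
```

For example, extract_attr_values(page, "a", "href") gathers link targets and extract_attr_values(page, "img", "src") gathers image files. Unquoted attribute values are not handled by this sketch.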

If a site has a "robots.txt" in the root directory, you can download it and parse it for rules. If it does not, in theory it doesn't care where you go, but robots meta tags embedded in pages (values like noindex and nofollow) give some guidance.
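
As a hedged illustration of the robots.txt part, Python's standard library ships a parser for these rules (the sketch is Python, not PureBasic). The wrapper name is_allowed, the "MyCrawler" agent string, and the sample rules are assumptions made for this example. Robots meta tags inside pages are not covered here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_text, user_agent, url):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the text can come from any download step.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for demonstration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "MyCrawler", "http://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS, "MyCrawler", "http://example.com/private/x.html"))
```

With the sample rules, the first URL is permitted and the second is not.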

I understand that many sites offer a Google-style sitemap to make it easier for crawlers (like Googlebot).
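
If a site does publish an XML sitemap in the sitemaps.org format, pulling the page URLs out of it could look roughly like this Python sketch (illustrative only; the function name sitemap_urls is made up, and sitemap index files that point to other sitemaps are not handled).

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

The resulting list can seed the crawl queue in place of, or alongside, links found in the pages themselves.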

BTW, if your code is fast, put in a delay of some sort so you don't hammer the site you're crawling. If you can still surf the site OK while your crawler crawls it, then you are being nice.
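
One simple way to add such a delay, sketched in Python for illustration (not PureBasic). The class name Throttle and the interval value are assumptions; pick whatever gap keeps the site responsive.

```python
import time


class Throttle:
    """Make sure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the logic can be tested without sleeping
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be to create throttle = Throttle(2.0) once and call throttle.wait() immediately before each page request.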

HTTrack, btw, can pull the legs off a website, so tweak it to be nice if you use it.

Hope this was useful. If not, ignore. :)

_________________
Dare2 cut down to size


 Post subject:
PostPosted: Thu May 17, 2007 1:32 am 
PureBasic Team

Joined: Fri Apr 25, 2003 5:21 pm
Posts: 5188
Location: Germany
For robots.txt, see here: http://www.robotstxt.org/wc/robots.html

_________________
Perl – The only language that looks the same before and after RSA encryption.
-- Keith Bostic

