I crawl my supplier sites using URLDownloadToFile; in one case, the site-specific crawler app uses num3-derived PB code instead.
Once the file is downloaded, I read it in, convert it to lowercase and replace double quotes with single quotes, so that I can use plain string searches instead of a real parser. Then I search for key tags like "<a" and "<img" and pull out the links, files, etc. from href='xxx', src='xxx' and so on. Whatever is desired.
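Here is a rough Python sketch of that lowercase-and-search approach, in case it helps. It is not my actual code, and the function name extract_links is made up for illustration. It only handles "<a" and "<img", and it assumes every attribute value is quoted.

```python
def extract_links(html):
    """Return (tag, value) pairs for href/src values found in <a and <img tags."""
    # Normalise: lowercase everything and use one quote style only.
    text = html.lower().replace('"', "'")
    wanted = (("<a", "href='"), ("<img", "src='"))
    found = []
    for tag, attr in wanted:
        pos = 0
        while True:
            start = text.find(tag, pos)
            if start == -1:
                break
            end = text.find(">", start)
            if end == -1:
                break
            pos = end + 1
            body = text[start + len(tag):end]
            # Skip look-alike tags such as <abbr> or <area> when searching for "<a".
            if not body[:1].isspace():
                continue
            a = body.find(attr)
            if a == -1:
                continue
            a += len(attr)
            b = body.find("'", a)
            if b != -1:
                found.append((tag[1:], body[a:b]))
    return found


page = '<A HREF="Page2.html">next</A> <img src=\'pic.jpg\'> <abbr title="x">y</abbr>'
links = extract_links(page)
```

Note that lowercasing the whole page also lowercases the URLs themselves, which can matter on servers with case-sensitive paths.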
If a site has a "robots.txt" in its root directory, you can download it and parse it for rules. If it does not, the site theoretically doesn't care where you go, but robots meta tags can also be embedded in individual pages; values like noindex and nofollow give some guidance.
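If you happen to be working in Python, the standard library can do the robots.txt parsing for you. A minimal sketch, assuming you have already downloaded the file as text; the crawler name "mycrawler" and the example.com URLs are placeholders, not anything from a real site.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_text):
    """Parse the text of an already-downloaded robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


# Hypothetical robots.txt content, for illustration only.
sample = "User-agent: *\nDisallow: /private/\n"
rules = build_rules(sample)

blocked = rules.can_fetch("mycrawler", "http://example.com/private/x.html")
allowed = rules.can_fetch("mycrawler", "http://example.com/index.html")
```

Check can_fetch before each download and skip anything it rejects.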
I understand that many sites offer a google-like sitemap to make it easier for crawlers (like the googlebot).
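For what it's worth, a sitemap in the sitemaps.org format is just XML with one "loc" element per page, so reading it is short. A sketch in Python, assuming a plain urlset sitemap (not a sitemap index) that has already been downloaded as text; the function name and the sample URLs are made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content, for illustration only.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/</loc></url>"
    "<url><loc>http://example.com/products.html</loc></url>"
    "</urlset>"
)
pages = sitemap_urls(sample_sitemap)
```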
BTW, if your code is fast, put a delay of some sort in so you don't hammer the site you're crawling. If you can still surf the site OK while your crawler is crawling it, then you are being nice.
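The delay can be as simple as a sleep between downloads. A sketch in Python; the 2 second default is just a guess, not a recommendation, and fetch stands in for whatever download routine you already use.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

The sleep parameter is only there so the pause can be swapped out when testing; in normal use, leave it alone.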
HTTrack, btw, can pull the legs off a website so tweak it to be nice if you use it.
Hope this was useful. If not, ignore.
