Re: Determining if a URL is a web page or file download

Posted: Tue Nov 01, 2011 12:36 pm
by MachineCode
TomS wrote: Sigh. It's NOT reliable.
Just revisiting this now, as my app nears completion... so there's no reliable way other than downloading the target locally first, and then checking if it's a web page versus a binary file?

Posted: Wed Nov 02, 2011 1:16 am
by kenmo
MachineCode wrote: Just revisiting this now, as my app nears completion... so there's no reliable way other than downloading the target locally first, and then checking if it's a web page versus a binary file?
Well, it looks like you have three options:

1. Check the URL -- not reliable.
2. Check the HTTP content type -- usually reliable, but can be intentionally/unintentionally incorrect.
3. Check the file itself -- always works, but requires a full download and parse of the file.

In theory, if all you know about a file is its name/extension/size, you CAN'T determine if it's text or binary, unless you look inside it.

But it's sometimes unreasonable to download and scan every file. Depending on your application, option 2 might be fine.

(Also, theoretically, a "binary" file output by some other program might consist entirely of valid ASCII characters, just by chance... Then option 3 would not consider it a "binary" file, unless the content type is also checked! It's not likely, but possible. Depending on what your program does with the files, it may not really matter.)
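
A minimal sketch of option 2, written in Python rather than the language used in this thread. It asks the server for headers only (a HEAD request) and decides from the Content-Type value. The function names is_web_page and fetch_content_type are made up for illustration, and the choice of which MIME types count as a "web page" is an assumption, not something settled in the thread.

```python
import urllib.request


def is_web_page(content_type):
    """Return True if a Content-Type header value names an HTML page."""
    if not content_type:
        # Header missing or empty: this check alone cannot decide.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    # Assumption: only these two types are treated as web pages.
    return mime in ("text/html", "application/xhtml+xml")


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

Usage would be something like is_web_page(fetch_content_type(url)). As noted above, the answer is only as good as the header the server sends.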

Posted: Wed Nov 02, 2011 8:20 am
by MachineCode
It's all too hard. I'm just going to do my own option:

4. Obey the user.

If they submit a binary download instead of a web page URL, then too bad. :P

Posted: Wed Nov 02, 2011 10:28 pm
by greyhoundcode
Do you expect and/or intend to let your users view resources such as images (by themselves), or rather pages of HTML? If the latter, perhaps some crude testing by regex or even a more complex solution such as attempting to load and inspect the DOM would suffice?

Posted: Wed Nov 02, 2011 11:07 pm
by Trond
ContentType is the correct way. If the server sends wrong information, then that's a problem with the server and not your program.

Posted: Thu Nov 03, 2011 1:16 am
by citystate
Couldn't you download the first X bytes and check them? I always thought the first few characters of any page would be "<HTML>" or the like.
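
A rough sketch of that idea, in Python rather than the language used in this thread. Pages do not always begin with "<HTML>" exactly: many start with a doctype, leading whitespace, or a byte order mark, so the check below allows for those. The names looks_like_html and fetch_prefix and the 512-byte default are assumptions for illustration, and this is a heuristic, not a guarantee.

```python
import urllib.request


def looks_like_html(first_bytes):
    """Guess whether the start of a response body is HTML."""
    # Skip a UTF-8 byte order mark and leading whitespace, then lower-case.
    head = first_bytes.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    # bytes.startswith accepts a tuple of candidate prefixes.
    return head.startswith((b"<!doctype html", b"<html", b"<head"))


def fetch_prefix(url, size=512, timeout=10):
    """Download at most `size` bytes of the body, then close the connection."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read(size)
```

Calling looks_like_html(fetch_prefix(url)) reads only the first few hundred bytes of the body, so a large download is not pulled in just to classify it.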

Posted: Thu Nov 03, 2011 9:37 am
by MachineCode
greyhoundcode wrote: Do you expect and/or intend to let your users view resources such as images (by themselves), or rather pages of HTML?
They can enter any URL, just like a web browser's address bar. But like a web browser, if you enter a URL to a 10 GB file, then it's going to be a long time before it downloads. That's what I was hoping to prevent.

@Trond: I'll look more into ContentType then. Thanks.
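
For the 10 GB case specifically, one possible guard is sketched below in Python (not the language used in this thread): look at the Content-Length header before downloading, and cap the read in case the header is missing or wrong. The names small_enough and read_capped and the 5 MB default limit are assumptions for illustration.

```python
import io


def small_enough(content_length, max_bytes=5 * 1024 * 1024):
    """Check a Content-Length header value against a size limit.

    Returns True or False, or None when the size is unknown
    (header absent, not a number, or negative).
    """
    if content_length is None:
        return None
    try:
        size = int(content_length)
    except ValueError:
        return None
    if size < 0:
        return None
    return size <= max_bytes


def read_capped(stream, max_bytes):
    """Read at most max_bytes from a file-like object.

    Returns (data, truncated); one extra byte is requested so that
    truncation can be detected without reading the whole body.
    """
    data = stream.read(max_bytes + 1)
    return data[:max_bytes], len(data) > max_bytes


# Demo on an in-memory stream instead of a real network response.
data, truncated = read_capped(io.BytesIO(b"x" * 10), max_bytes=4)
```

A server is free to omit Content-Length, which is why the capped read is kept as a second line of defence instead of trusting the header alone.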

Posted: Thu Nov 03, 2011 3:13 pm
by ultralazor
MIME header...done

Posted: Sun Nov 13, 2011 7:15 am
by MachineCode
After extensive testing, Kenmo's ContentType() procedure is the most reliable, but it still fails to return a content type for Wikipedia's "Random" link, which is http://en.wikipedia.org/wiki/Special:Random . I'm sure I can work around that, though. Thanks Kenmo! :)

Posted: Sun Nov 13, 2011 11:52 am
by Trond
MachineCode wrote: After extensive testing, Kenmo's ContentType() procedure is the most reliable, but it still fails to return a content type for Wikipedia's "Random" link, which is http://en.wikipedia.org/wiki/Special:Random . I'm sure I can work around that, though. Thanks Kenmo! :)
GetHTTPHeader() returns an empty string for this URL. I think it's a bug in this function.