Determining if a URL is a web page or file download

MachineCode
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Determining if a URL is a web page or file download

Post by MachineCode »

In my app, the user can specify a web page to view by directly typing it into a StringGadget. But it was reported to me today that if the user types a filename, such as "http://www.example.com/App.zip", then my app sends back the file as gibberish text.

So I need to determine whether the typed URL is actually a web page or a download. What's the best way to test this? I was just going to check whether the extension is HTM, HTML, PHP, and so on, but some URLs don't end in these. For example, this thread's URL on these forums ends in "viewtopic.php?f=13&t=46967" and not just ".php".

What to do? Thanks.
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
skywalk
Addict
Posts: 4220
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Determining if a URL is a web page or file download

Post by skywalk »

I put this in a wrapper to check LAN connectivity.

Code: Select all

InitNetwork()
Define.s URL$
URL$ = "http://www.w3.org/Protocols/HTTP/HTRESP.html"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))

URL$ = "http://www.example.com/App.zip"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))

; Checking a file or folder does not work! --> "file://c:\z.txt"
; Even though it will load in a WebGadget.
; Use FileSize() on the plain local path instead (it does not understand "file://").
URL$ = "file://c:\z.txt"
;Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))  ; <--- This will fail after a long timeout
Debug FileSize(RemoveString(URL$, "file://")) + 1  ; 0 = file not found (FileSize() returned -1)
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
TomS
Enthusiast
Posts: 342
Joined: Sun Mar 18, 2007 2:26 pm
Location: Munich, Germany

Re: Determining if a URL is a web page or file download

Post by TomS »

@skywalk: Your code debugs "302 found" for the second file which clearly doesn't exist.

@MachineCode: The simplest way is to check for Chr(0). A text file won't contain any.
This way you can display .pb etc, too, like Firefox does for example: http://purearea.net/pb/CodeArchiv/Datab ... atabase.pb
skywalk
Addict
Posts: 4220
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Determining if a URL is a web page or file download

Post by skywalk »

TomS wrote: @skywalk: Your code debugs "302 found" for the second file which clearly doesn't exist.
HTTP/1.1 200 OK
HTTP/1.0 302 FOUND
0

You are correct. :wink:
The HTTP response status code '302 Found' is the most common way of performing a redirection.
It is up to you to decide if the new URL is valid.

Code: Select all

InitNetwork()
Define.s URL$
URL$ = "http://www.w3.org/Protocols/HTTP/HTRESP.html"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))

URL$ = "http://www.example.com/App.zip"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))  ; <--- Redirected implies failure

URL$ = "http://www.tedia.eu/download/files/udaq_ftdi_w98.zip"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))  ; <--- This file does exist and returns 200
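
For what it's worth, the status code can be pulled out of that first header line and sorted into ranges instead of comparing whole strings. This is only a minimal sketch, assuming the header text has the usual "HTTP/x.y code reason" first line; HTTPStatusCode() is a made-up helper name, and the sample header string is invented for illustration so the snippet runs without a network connection.

```purebasic
; Sketch only: HTTPStatusCode() is a hypothetical helper, not a PureBasic command.
; Header is expected to be the raw header text, e.g. from GetHTTPHeader().
Procedure.i HTTPStatusCode(Header.s)
  Protected StatusLine.s = StringField(Header, 1, #CR$)
  ; "HTTP/1.1 200 OK" -> the second space-separated field is the code
  ProcedureReturn Val(StringField(StatusLine, 2, " "))
EndProcedure

; Invented sample header, used here in place of a real request
Define Sample.s = "HTTP/1.0 302 Found" + #CRLF$ + "Location: http://www.example.com/" + #CRLF$
Define Code.i = HTTPStatusCode(Sample)

Select Code
  Case 200 To 299
    Debug "OK"
  Case 300 To 399
    Debug "Redirect: check the Location header"
  Case 0
    Debug "No usable response"
  Default
    Debug "Error"
EndSelect
```

An empty header string gives 0, which covers the case where the request failed or timed out.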
kenmo
Addict
Posts: 2047
Joined: Tue Dec 23, 2003 3:54 am

Re: Determining if a URL is a web page or file download

Post by kenmo »

You can check the Content-Type of the returned HTTP header, something like this (quick'n'ugly code):

Code: Select all

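Procedure.s ContentType(URL.s)
  Protected Header.s, Type.s, LocStart.i, TypeStart.i, Redirects.i
  
  Debug URL
  
  ; Follow "Location:" redirects, but stop after 10 hops to avoid an endless loop
  Repeat
    Header = GetHTTPHeader(URL)
    LocStart = FindString(Header, "Location: ", 1)
    If LocStart And Redirects < 10
      URL = Mid(Header, LocStart + Len("Location: "))
      URL = StringField(URL, 1, #CR$)
      Redirects + 1
    Else
      Break
    EndIf
  ForEver
  
  ; Only parse Content-Type if the header is present; otherwise return ""
  TypeStart = FindString(Header, "Content-Type: ", 1)
  If TypeStart
    Type = Mid(Header, TypeStart + Len("Content-Type: "))
    Type = StringField(Type, 1, #CR$)
    Type = LCase(Trim(StringField(Type, 1, ";")))  ; drop "; charset=..." and stray spaces
  EndIf
  
  Debug "  --->  " + Type
  
  ProcedureReturn Type
EndProcedure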


InitNetwork()

ContentType("http://www.purebasic.fr/english/viewtopic.php?f=13&p=356948&sid=b4b1d1bf5a2d4aa082ced0b67fff63ff")
ContentType("http://google.com")
ContentType("http://www.purebasic.fr/english/download/file.php?avatar=5039_1292709976.jpg")
ContentType("http://download.savannah.nongnu.org/releases/tinycc/tcc-0.9.25.tar.bz2")
ContentType("http://download.savannah.nongnu.org/releases/tinycc/tcc-0.9.25-win32-bin.zip?dummyparameter=5")

Debug ""

PS. Skywalk, keep in mind that StringField() only works with single-character delimiters (for now...?).

Code: Select all

; Doesn't work as expected!
Debug StringField("Hello World!", 1, "or")
kenmo
Addict
Posts: 2047
Joined: Tue Dec 23, 2003 3:54 am

Re: Determining if a URL is a web page or file download

Post by kenmo »

Or something like this... Much simpler and faster, but it doesn't actually check the file online, just its name. Also it doesn't work if the user just enters a path with no file (see third example).

Code: Select all

Procedure.s URLExtension(URL.s)
  ProcedureReturn (LCase(GetExtensionPart(GetURLPart(URL, #PB_URL_Path))))
EndProcedure

Debug URLExtension("http://www.purebasic.fr/english/viewtopic.php?f=13&p=356948&sid=b4b1d1bf5a2d4aa082ced0b67fff63ff")
Debug URLExtension("http://download.savannah.nongnu.org/releases/tinycc/tcc-0.9.25.tar.bz2?yes=no")
Debug URLExtension("http://audiere.sourceforge.net/audiere-1.9.4-users-doxygen/") ; Uh oh
Debug ""
MachineCode
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Determining if a URL is a web page or file download

Post by MachineCode »

kenmo wrote: LCase(GetExtensionPart(GetURLPart(URL, #PB_URL_Path)))
Ah, that's what I need! :) Thanks!

But the next question is: is there an official list of web page extensions somewhere? HTM and HTML I know, and PHP I know, but who are you? I mean, who are the rest?
TomS
Enthusiast
Posts: 342
Joined: Sun Mar 18, 2007 2:26 pm
Location: Munich, Germany

Re: Determining if a URL is a web page or file download

Post by TomS »

The extension is not reliable at all.
Any extension can be routed to PHP or any other language interpreter.
And a PHP file can output any data.

www.example.com/image.php can very well output an image, and then your StringGadget contains the same binary "rubbish" as with a normal image file.

You could check the MIME type in the HTTP header and compare it to a list of plain-text MIME types. But that's not reliable either, as the server or the PHP script can set it to anything.
I could load an image in PHP and output its data with a plain-text MIME type. Every browser will display the binary contents of the image as text, and so will your program.

To check if the file does NOT contain any characters 0-31 is the only 100% reliable way.
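
A rough sketch of such a scan, in case it helps. Note that tab (9), line feed (10) and carriage return (13) are also in the 0-31 range and show up in ordinary text, so the sketch lets those through. LooksLikeText() is a made-up helper name, the data is assumed to be ASCII or UTF-8 (UTF-16 text would contain zero bytes), and how the bytes get downloaded into the buffer is left out.

```purebasic
; Sketch only: LooksLikeText() is a hypothetical helper, not a PureBasic command.
; Returns #True if the first Length bytes at *Buffer contain no control
; characters other than tab (9), line feed (10) and carriage return (13).
Procedure.i LooksLikeText(*Buffer, Length.i)
  Protected i.i, Value.a
  
  For i = 0 To Length - 1
    Value = PeekA(*Buffer + i)
    If Value < 32 And Value <> 9 And Value <> 10 And Value <> 13
      ProcedureReturn #False
    EndIf
  Next
  
  ProcedureReturn #True
EndProcedure

; Usage with two small in-memory samples instead of a real download
Define *Text = AllocateMemory(16)
Define *Binary = AllocateMemory(16)

PokeS(*Text, "Hello" + #CRLF$, -1, #PB_Ascii)  ; 7 bytes of plain text
PokeA(*Binary, 'P')                             ; a printable byte ...
PokeA(*Binary + 1, 0)                           ; ... followed by a Chr(0)

Debug LooksLikeText(*Text, 7)    ; text sample
Debug LooksLikeText(*Binary, 2)  ; binary sample

FreeMemory(*Text)
FreeMemory(*Binary)
```

The scan has to run on the raw memory buffer, not on a PureBasic string, because a string ends at the first Chr(0) and would hide exactly the bytes being looked for.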
MachineCode
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Determining if a URL is a web page or file download

Post by MachineCode »

TomS wrote: To check if the file does NOT contain any characters 0-31 is the only 100% reliable way.
Okay, thanks, that's the approach I will take. :)
skywalk
Addict
Posts: 4220
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Determining if a URL is a web page or file download

Post by skywalk »

I agree that checking the file extension is not correct, but how is scanning a file for Chr(0) easier than UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))?
TomS
Enthusiast
Posts: 342
Joined: Sun Mar 18, 2007 2:26 pm
Location: Munich, Germany

Re: Determining if a URL is a web page or file download

Post by TomS »

What is this supposed to do?
It doesn't tell you if it's a binary file or not. It just tells if the file is there.
Of course it's not easy to check the whole content of the file. Well, it is easy; it's just not fast, and you have to download the whole file first. But it's the only reliable method for MachineCode's question.
skywalk
Addict
Posts: 4220
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Determining if a URL is a web page or file download

Post by skywalk »

True, but a redirect implies a stale URL, or at least raises the question of whether to download the file in the first place.
How is that not equally or more important?
TomS
Enthusiast
Posts: 342
Joined: Sun Mar 18, 2007 2:26 pm
Location: Munich, Germany

Re: Determining if a URL is a web page or file download

Post by TomS »

It's a totally different problem.
Of course it doesn't hurt to check if the file exists before attempting a download ;)
RichAlgeni
Addict
Posts: 935
Joined: Wed Sep 22, 2010 1:50 am
Location: Bradenton, FL

Re: Determining if a URL is a web page or file download

Post by RichAlgeni »

I agree with Kenmo, use the line starting with 'Content-Type:'. The first word will tell you the type of data to expect, after the slash you will get the specifics. This should be on a line by itself in the returning header, according to the RFC.
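
As a rough illustration of splitting that value into the part before and after the slash: the sketch below takes an already extracted Content-Type value, so it runs without a network connection. IsTextContentType() is a made-up helper name, and the short list of "readable" application types is my own assumption, not an official or complete list.

```purebasic
; Sketch only: IsTextContentType() is a hypothetical helper, and the list of
; text-like "application/..." types is an assumption, not an official list.
Procedure.i IsTextContentType(Value.s)
  Protected MediaType.s, MainType.s, SubType.s
  
  ; Drop parameters such as "; charset=UTF-8", then normalise case and spaces
  MediaType = LCase(Trim(StringField(Value, 1, ";")))
  MainType  = StringField(MediaType, 1, "/")  ; e.g. "text"
  SubType   = StringField(MediaType, 2, "/")  ; e.g. "html" (empty if there is no slash)
  
  If MainType = "text"
    ProcedureReturn #True
  EndIf
  
  ; A few non-"text/" types that are still readable as text
  If MainType = "application"
    Select SubType
      Case "xhtml+xml", "xml", "json", "javascript"
        ProcedureReturn #True
    EndSelect
  EndIf
  
  ProcedureReturn #False
EndProcedure

Debug IsTextContentType("text/html; charset=UTF-8")
Debug IsTextContentType("application/zip")
Debug IsTextContentType("Application/XHTML+XML")
```

This only reports what the server claims about the data, so it says nothing about whether the claim is true.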
TomS
Enthusiast
Posts: 342
Joined: Sun Mar 18, 2007 2:26 pm
Location: Munich, Germany

Re: Determining if a URL is a web page or file download

Post by TomS »

RichAlgeni wrote: I agree with Kenmo, use the line starting with 'Content-Type:'. The first word will tell you the type of data to expect, after the slash you will get the specifics. This should be on a line by itself in the returning header, according to the RFC.
Sigh. It's NOT reliable.

Code: Select all

<?php   //An image with content type: Text
$my_img = imagecreate( 200, 80 );
$background = imagecolorallocate( $my_img, 0, 0, 255 );
$text_colour = imagecolorallocate( $my_img, 255, 255, 0 );
$line_colour = imagecolorallocate( $my_img, 128, 255, 0 );
imagestring( $my_img, 4, 30, 25, "This is an image",
  $text_colour );
imagesetthickness ( $my_img, 5 );
imageline( $my_img, 30, 45, 165, 45, $line_colour );

header( "Content-type: text" );
imagepng( $my_img );
imagecolordeallocate( $my_img, $line_colour );
imagecolordeallocate( $my_img, $text_colour );
imagecolordeallocate( $my_img, $background );
imagedestroy( $my_img );
?>
Here is a test file:

Code: Select all

Debug GetHTTPHeader("http://purebasicusermap.bplaced.de/contenttype/image_contenttype_text.php")
Debug ReceiveHTTPString("http://purebasicusermap.bplaced.de/contenttype/image_contenttype_text.php")

Code: Select all

<?php //A text-output with content-type: Image/PNG
header( "Content-type: image/png" );
echo("Hello World");
?>

Code: Select all

Debug GetHTTPHeader("http://purebasicusermap.bplaced.de/contenttype/text_contenttype_image.php")
Debug ReceiveHTTPString("http://purebasicusermap.bplaced.de/contenttype/text_contenttype_image.php")

ReceiveHTTPString:

Code: Select all

Procedure.s ReceiveHTTPString(URL$, TimeOut=5000)
   Protected Event, Time, Size, String$, Inhalt
   Protected BufferSize = $1000, *Buffer = AllocateMemory(BufferSize)
   Protected ServerName$ = GetURLPart(URL$, #PB_URL_Site)
   Protected ConnectionID = OpenNetworkConnection(ServerName$, 80)
   If ConnectionID
      ; HTTP lines end with CR+LF, and an empty line ends the request
      SendNetworkString(ConnectionID, "GET "+URL$+" HTTP/1.0"+#CRLF$+#CRLF$)
      Time = ElapsedMilliseconds()
      Repeat
         Delay(10)
         Event = NetworkClientEvent(ConnectionID)
         If Event = #PB_NetworkEvent_Data
            Repeat
               Size = ReceiveNetworkData(ConnectionID, *Buffer, BufferSize)
               ; Note: a string ends at the first Chr(0), so binary data gets cut off here
               String$ + PeekS(*Buffer, Size, #PB_Ascii)
            Until Not Size
            ; Inhalt ("content"): position of the blank line between header and body
            Inhalt = FindString(String$, #CRLF$+#CRLF$, 1)
            If Inhalt
               String$ = Mid(String$, Inhalt+4)
            EndIf
         EndIf   
      Until ElapsedMilliseconds()-Time > TimeOut Or String$
      CloseNetworkConnection(ConnectionID)
   EndIf
   FreeMemory(*Buffer)
   ProcedureReturn String$
EndProcedure
Don't forget InitNetwork()^^