
Determining if a URL is a web page or file download

Posted: Thu Jul 21, 2011 10:49 pm
by MachineCode
In my app, the user can specify a web page to view by directly typing it into a StringGadget. But it was reported to me today that if the user types a filename, such as "http://www.example.com/App.zip", then my app sends back the file as gibberish text.

So, I need to determine if the typed URL is actually a web page or download. What's the best way to test this? I was just going to check if the extension is HTM, HTML, PHP, and so on; but some URLs don't end in these (like these forums, this thread's URL ends in "viewtopic.php?f=13&t=46967" and not just ".php").

What to do? Thanks.

Re: Determining if a URL is a web page or file download

Posted: Thu Jul 21, 2011 11:20 pm
by skywalk
I put this in a wrapper to check LAN connectivity.

Code: Select all

InitNetwork()
Define.s URL$
URL$ = "http://www.w3.org/Protocols/HTTP/HTRESP.html"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))

URL$ = "http://www.example.com/App.zip"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))

; Checking a file or folder does not work! --> "file://c:\z.txt"
; Even though it will load in a webgadget.
; Use FileSize() instead.
URL$ = "file://c:\z.txt"
;Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))  ; <--- This will fail after a long timeout
Debug FileSize(URL$) + 1
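
For the local-file case, a minimal sketch of the FileSize() approach might look like the following. LocalFileExists() is a hypothetical helper name, and it assumes the two-slash "file://" form used above; other spellings of the scheme would need extra handling.

```purebasic
; Hypothetical helper: #True if a "file://" URL or plain path names an existing file
Procedure.i LocalFileExists(URL$)
  Protected Path$ = URL$
  
  ; Strip the scheme so FileSize() receives a plain path such as "c:\z.txt"
  If LCase(Left(Path$, 7)) = "file://"
    Path$ = Mid(Path$, 8)
  EndIf
  
  ; FileSize() returns -1 when the file is missing and -2 for a directory
  If FileSize(Path$) >= 0
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug LocalFileExists("file://c:\z.txt")
```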

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 12:31 am
by TomS
@skywalk: Your code debugs "302 found" for the second file which clearly doesn't exist.

@MachineCode: The simplest way is to check for Chr(0). A text file won't contain any.
This way you can display .pb etc, too, like Firefox does for example: http://purearea.net/pb/CodeArchiv/Datab ... atabase.pb
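
A rough sketch of the Chr(0) check, assuming the content is first saved to disk with ReceiveHTTPFile(). The procedure name ContainsNullByte() and the scratch file name are made up for illustration. Note that UTF-16 text also contains zero bytes, so this would classify it as binary.

```purebasic
; Hypothetical helper: #True if the buffer holds at least one zero byte
Procedure.i ContainsNullByte(*Buffer, Size.i)
  Protected i.i
  For i = 0 To Size - 1
    If PeekA(*Buffer + i) = 0
      ProcedureReturn #True
    EndIf
  Next
  ProcedureReturn #False
EndProcedure

InitNetwork()

Define File$ = GetTemporaryDirectory() + "url_probe.tmp"  ; assumed scratch file name
Define Size.i, *Mem

If ReceiveHTTPFile("http://www.example.com/App.zip", File$)
  If ReadFile(0, File$)
    Size = Lof(0)
    If Size > 0
      *Mem = AllocateMemory(Size)
      If *Mem
        ReadData(0, *Mem, Size)                ; load the whole download into memory
        Debug ContainsNullByte(*Mem, Size)     ; 1 = treat as binary, 0 = probably text
        FreeMemory(*Mem)
      EndIf
    EndIf
    CloseFile(0)
  EndIf
EndIf
```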

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 12:53 am
by skywalk
TomS wrote:@skywalk: Your code debugs "302 found" for the second file which clearly doesn't exist.
HTTP/1.1 200 OK
HTTP/1.0 302 FOUND
0

You are correct. :wink:
The HTTP response status code '302 Found' is the most common way of performing a redirection.
It is up to you to decide if the new URL is valid.
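
To act on the status rather than just print it, the numeric code can be pulled out of the first header line. This is only a sketch: HTTPStatusCode() is a hypothetical name, and the header strings below are hand-written samples rather than real server replies.

```purebasic
; Hypothetical helper: returns the numeric status code of an HTTP response header
Procedure.i HTTPStatusCode(Header$)
  ; The first line looks like "HTTP/1.1 200 OK"; the code is the second space-separated field
  Protected StatusLine$ = StringField(Header$, 1, #LF$)
  ProcedureReturn Val(StringField(StatusLine$, 2, " "))
EndProcedure

; Sample headers for illustration only
Debug HTTPStatusCode("HTTP/1.1 200 OK" + #CRLF$ + "Content-Type: text/html")             ; 200
Debug HTTPStatusCode("HTTP/1.0 302 Found" + #CRLF$ + "Location: http://www.example.com/") ; 302
```

Codes in the 300 range indicate a redirect, so the caller can decide whether to follow the Location header or reject the URL.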

Code: Select all

InitNetwork()
Define.s URL$
URL$ = "http://www.w3.org/Protocols/HTTP/HTRESP.html"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))

URL$ = "http://www.example.com/App.zip"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))  ; <--- Redirected implies failure

URL$ = "http://www.tedia.eu/download/files/udaq_ftdi_w98.zip"
Debug UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$))  ; <--- This file does exist and returns 200

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 1:33 am
by kenmo
You can check the Content-Type of the returned HTTP header, something like this (quick'n'ugly code):

Code: Select all

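Procedure.s ContentType(URL.s)
  Protected Header.s, Type.s, LocStart.i, TypeStart.i, Redirects.i
  
  Debug URL
  
  ; Follow "Location:" redirects, giving up after 10 hops so a redirect loop cannot hang the program
  Repeat
    Header = GetHTTPHeader(URL)
    LocStart = FindString(Header, "Location: ", 1)
    If LocStart And Redirects < 10
      URL = Mid(Header, LocStart + Len("Location: "))
      URL = StringField(URL, 1, #CR$)
      Redirects + 1
    Else
      Break
    EndIf
  ForEver
  
  ; Take the value of the Content-Type line and drop parameters such as "; charset=..."
  ; (the match is case-sensitive, so a server sending "Content-type:" is not recognised)
  TypeStart = FindString(Header, "Content-Type: ", 1)
  If TypeStart
    Type = Mid(Header, TypeStart + Len("Content-Type: "))
    Type = StringField(Type, 1, #CR$)
    Type = LCase(Trim(StringField(Type, 1, ";")))
  EndIf
  
  Debug "  --->  " + Type
  
  ProcedureReturn Type
EndProcedure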


InitNetwork()

ContentType("http://www.purebasic.fr/english/viewtopic.php?f=13&p=356948&sid=b4b1d1bf5a2d4aa082ced0b67fff63ff")
ContentType("http://google.com")
ContentType("http://www.purebasic.fr/english/download/file.php?avatar=5039_1292709976.jpg")
ContentType("http://download.savannah.nongnu.org/releases/tinycc/tcc-0.9.25.tar.bz2")
ContentType("http://download.savannah.nongnu.org/releases/tinycc/tcc-0.9.25-win32-bin.zip?dummyparameter=5")

Debug ""

PS. Skywalk, keep in mind that StringField() only works with single-character delimiters (for now...?).

Code: Select all

; Doesn't work as expected!
Debug StringField("Hello World!", 1, "or")
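
If a multi-character delimiter is needed, one possible workaround is to do the splitting with FindString(). FieldByDelimiter() below is a hypothetical helper written for this post, not a built-in command.

```purebasic
; Hypothetical helper: like StringField(), but the delimiter may be longer than one character
Procedure.s FieldByDelimiter(String$, Index.i, Delimiter$)
  Protected Start.i = 1, Pos.i, n.i
  
  ; Skip over the first (Index - 1) fields
  For n = 1 To Index - 1
    Pos = FindString(String$, Delimiter$, Start)
    If Pos = 0
      ProcedureReturn ""        ; fewer fields than requested
    EndIf
    Start = Pos + Len(Delimiter$)
  Next
  
  ; Return everything up to the next delimiter, or the rest of the string
  Pos = FindString(String$, Delimiter$, Start)
  If Pos
    ProcedureReturn Mid(String$, Start, Pos - Start)
  EndIf
  ProcedureReturn Mid(String$, Start)
EndProcedure

Debug FieldByDelimiter("Hello World!", 1, "or")  ; "Hello W"
Debug FieldByDelimiter("Hello World!", 2, "or")  ; "ld!"
```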

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 1:37 am
by kenmo
Or something like this... Much simpler and faster, but it doesn't actually check the file online, just its name. Also it doesn't work if the user just enters a path with no file (see third example).

Code: Select all

Procedure.s URLExtension(URL.s)
  ProcedureReturn (LCase(GetExtensionPart(GetURLPart(URL, #PB_URL_Path))))
EndProcedure

Debug URLExtension("http://www.purebasic.fr/english/viewtopic.php?f=13&p=356948&sid=b4b1d1bf5a2d4aa082ced0b67fff63ff")
Debug URLExtension("http://download.savannah.nongnu.org/releases/tinycc/tcc-0.9.25.tar.bz2?yes=no")
Debug URLExtension("http://audiere.sourceforge.net/audiere-1.9.4-users-doxygen/") ; Uh oh
Debug ""

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 1:27 pm
by MachineCode
kenmo wrote:LCase(GetExtensionPart(GetURLPart(URL, #PB_URL_Path)))
Ah, that's what I need! :) Thanks!

But the next question is: is there an official list of web page extensions somewhere? HTM and HTML I know, and PHP I know, but who are you? I mean, who are the rest?

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 3:17 pm
by TomS
The extension is not reliable at all.
Any extension can be routed to be interpreted by php or any other language interpreter.
And a php file can output any data.

www.example.com/image.php can very well output an image, in which case your StringGadget shows the same binary "rubbish" as with a normal image file.

You could check the MIME type in the HTTP header and compare it to a list of plain-text MIME types. But that is not reliable either, as the type can be set arbitrarily by the server or the PHP script.
I could load an image in PHP and output its data with a plain-text MIME type. Every browser will display the binary contents of the image, and so will your program.
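
For reference, the MIME comparison could be sketched like this. IsTextMime() is a hypothetical name, and the list of "text-like" application types is only an assumption about what one might want to display. As said, it only reflects what the server claims.

```purebasic
; Hypothetical helper: #True if a Content-Type value claims to be some kind of text
Procedure.i IsTextMime(Type$)
  ; Drop parameters such as "; charset=utf-8" and normalise the case
  Type$ = LCase(Trim(StringField(Type$, 1, ";")))
  
  If Left(Type$, 5) = "text/"
    ProcedureReturn #True
  EndIf
  
  ; A few non-"text/" types that are still plain text (assumed list, extend as needed)
  Select Type$
    Case "application/xhtml+xml", "application/xml", "application/json", "application/javascript"
      ProcedureReturn #True
  EndSelect
  
  ProcedureReturn #False
EndProcedure

Debug IsTextMime("text/html; charset=utf-8")  ; 1
Debug IsTextMime("application/zip")           ; 0
```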

Checking that the file does NOT contain any control characters 0-31 (apart from tab, line feed and carriage return, which are normal in text) is the only 100% reliable way.
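
A minimal sketch of such a scan over a memory buffer. LooksLikeText() is a hypothetical name; the sketch allows tab (9), line feed (10) and carriage return (13), since ordinary text files contain them, and rejects every other byte below 32. The test data is built by hand instead of downloaded.

```purebasic
; Hypothetical helper: #False as soon as a control character other than Tab/LF/CR is found
Procedure.i LooksLikeText(*Buffer, Size.i)
  Protected i.i, Value.i
  For i = 0 To Size - 1
    Value = PeekA(*Buffer + i)
    If Value < 32 And Value <> 9 And Value <> 10 And Value <> 13
      ProcedureReturn #False
    EndIf
  Next
  ProcedureReturn #True
EndProcedure

; Hand-made test data
Define Text$ = "Hello" + #CRLF$ + "World"
Define *Mem = AllocateMemory(Len(Text$) + 1)
If *Mem
  PokeS(*Mem, Text$, -1, #PB_Ascii)
  Debug LooksLikeText(*Mem, Len(Text$))  ; 1
  PokeA(*Mem + 2, 0)                     ; overwrite one character with a zero byte
  Debug LooksLikeText(*Mem, Len(Text$))  ; 0
  FreeMemory(*Mem)
EndIf
```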

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 3:28 pm
by MachineCode
TomS wrote: Checking that the file does NOT contain any control characters 0-31 (apart from tab, line feed and carriage return, which are normal in text) is the only 100% reliable way.
Okay, thanks, that's the approach I will take. :)

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 3:55 pm
by skywalk
I agree checking the file extension is not correct, but how is scanning a file for Chr(0) easier than UCase(StringField(GetHTTPHeader(URL$),1,#CRLF$)) :?:

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 4:03 pm
by TomS
What is this supposed to do?
It doesn't tell you whether it's a binary file or not. It just tells you whether the file is there.
Of course, checking the whole content of the file has a cost. It is easy, but it's not fast, and you have to download the whole file first. Still, it's the only reliable method for MachineCode's question.

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 4:12 pm
by skywalk
True, but a redirect implies a stale URL or some other question as to whether to download the file in the 1st place.
How is that not equally or more important?

Re: Determining if a URL is a web page or file download

Posted: Fri Jul 22, 2011 4:16 pm
by TomS
It's a totally different problem.
Of course it doesn't hurt to check if the file exists before attempting a download ;)

Re: Determining if a URL is a web page or file download

Posted: Mon Jul 25, 2011 10:01 pm
by RichAlgeni
I agree with Kenmo, use the line starting with 'Content-Type:'. The first word will tell you the type of data to expect, after the slash you will get the specifics. This should be on a line by itself in the returning header, according to the RFC.
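
As a sketch of that idea, the header can be walked line by line and the value split at the slash. HeaderValue() is a hypothetical helper, the header name is compared case-insensitively, and the header string is a hand-written sample.

```purebasic
; Hypothetical helper: returns the value of the named header line, or "" if it is absent
Procedure.s HeaderValue(Header$, Name$)
  Protected i.i, Line$, Prefix$ = LCase(Name$) + ":"
  
  For i = 1 To CountString(Header$, #LF$) + 1
    ; Each line may still carry a trailing CR, so remove it before comparing
    Line$ = Trim(RemoveString(StringField(Header$, i, #LF$), #CR$))
    If LCase(Left(Line$, Len(Prefix$))) = Prefix$
      ProcedureReturn Trim(Mid(Line$, Len(Prefix$) + 1))
    EndIf
  Next
  
  ProcedureReturn ""
EndProcedure

; Sample header for illustration only
Define Sample$ = "HTTP/1.1 200 OK" + #CRLF$ + "content-type: image/png" + #CRLF$
Define Value$ = HeaderValue(Sample$, "Content-Type")

Debug StringField(Value$, 1, "/")  ; "image"  (the general type)
Debug StringField(Value$, 2, "/")  ; "png"    (the specifics)
```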

Re: Determining if a URL is a web page or file download

Posted: Mon Jul 25, 2011 10:35 pm
by TomS
RichAlgeni wrote:I agree with Kenmo, use the line starting with 'Content-Type:'. The first word will tell you the type of data to expect, after the slash you will get the specifics. This should be on a line by itself in the returning header, according to the RFC.
Sigh. It's NOT reliable.

Code: Select all

<?php   //An image with content type: Text
$my_img = imagecreate( 200, 80 );
$background = imagecolorallocate( $my_img, 0, 0, 255 );
$text_colour = imagecolorallocate( $my_img, 255, 255, 0 );
$line_colour = imagecolorallocate( $my_img, 128, 255, 0 );
imagestring( $my_img, 4, 30, 25, "This is an image",
  $text_colour );
imagesetthickness ( $my_img, 5 );
imageline( $my_img, 30, 45, 165, 45, $line_colour );

header( "Content-type: text" );
imagepng( $my_img );
imagecolordeallocate( $my_img, $line_colour );
imagecolordeallocate( $my_img, $text_colour );
imagecolordeallocate( $my_img, $background );
imagedestroy( $my_img );
?>
Here is a test file:

Code: Select all

Debug GetHTTPHeader("http://purebasicusermap.bplaced.de/contenttype/image_contenttype_text.php")
Debug ReceiveHTTPString("http://purebasicusermap.bplaced.de/contenttype/image_contenttype_text.php")

Code: Select all

<?php //A text-output with content-type: Image/PNG
header( "Content-type: image/png" );
echo("Hello World");
?>

Code: Select all

Debug GetHTTPHeader("http://purebasicusermap.bplaced.de/contenttype/text_contenttype_image.php")
Debug ReceiveHTTPString("http://purebasicusermap.bplaced.de/contenttype/text_contenttype_image.php")

ReceiveHTTPString:

Code: Select all

Procedure.s ReceiveHTTPString(URL$, TimeOut=5000)
   Protected Event, Time, Size, String$, Inhalt
   Protected BufferSize = $1000, *Buffer = AllocateMemory(BufferSize)
   Protected ServerName$ = GetURLPart(URL$, #PB_URL_Site)
   Protected ConnectionID = OpenNetworkConnection(ServerName$, 80)
   If ConnectionID
      ; HTTP lines end with CR LF, and an empty line terminates the request
      SendNetworkString(ConnectionID, "GET "+URL$+" HTTP/1.0"+#CRLF$+#CRLF$)
      Time = ElapsedMilliseconds()
      Repeat
         Delay(10)
         Event = NetworkClientEvent(ConnectionID)
         If Event = #PB_NetworkEvent_Data
            Repeat
               Size = ReceiveNetworkData(ConnectionID, *Buffer, BufferSize)
               String$ + PeekS(*Buffer, Size, #PB_Ascii)
            Until Not Size
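            ; #LFCR$ matches the middle of the CR LF CR LF sequence that ends the header,
            ; so skipping 3 characters lands on the first character of the body
            Inhalt = FindString(String$, #LFCR$, 1)
            If Inhalt
               String$ = Mid(String$, Inhalt+3)
            EndIf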
         EndIf   
      Until ElapsedMilliseconds()-Time > TimeOut Or String$
      CloseNetworkConnection(ConnectionID)
   EndIf
   FreeMemory(*Buffer)
   ProcedureReturn String$
EndProcedure
Don't forget InitNetwork()^^