
[Implemented] ReceiveHTTPMemory() needs user-agent setting

Posted: Sat Jan 09, 2016 10:18 am
by Dude
The following two snippets both fail to load the web page. :(

This fails to get anything at all:

Code:

InitNetwork()

*Buffer = ReceiveHTTPMemory("http://whatismyipaddress.com/",#PB_HTTP_NoRedirect)

If *Buffer
  Size = MemorySize(*Buffer)
  Debug "Content: " + PeekS(*Buffer, Size, #PB_UTF8)
  FreeMemory(*Buffer)
Else
  Debug "Failed"
EndIf

And this fails because the site says it needs a valid user-agent:

Code:

InitNetwork()

*Buffer = ReceiveHTTPMemory("http://whatismyipaddress.com/")

If *Buffer
  Size = MemorySize(*Buffer)
  Debug "Content: " + PeekS(*Buffer, Size, #PB_UTF8)
  FreeMemory(*Buffer)
Else
  Debug "Failed"
EndIf

So, can this command be enhanced to specify a user-agent for websites that need it? Thanks!

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Sat Jan 09, 2016 11:14 am
by Dude
Currently I have to use this instead, which works fine:

Code:

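; Windows-only workaround built on the WinINet API; returns at most 1 MB, decoded as ASCII.
Procedure.s ReceiveHTTPMemoryFixed(url$)
  Protected html$, hInet, hURL, *buf, ok, total, bytes.l
  Protected bufsize = 1048576
  *buf = AllocateMemory(bufsize)
  If *buf = 0
    ProcedureReturn ""
  EndIf
  ; The first argument of InternetOpen_() is the user-agent string; here the
  ; URL is passed simply to supply a non-empty value.
  hInet = InternetOpen_(url$, 1, 0, 0, 0)
  If hInet
    hURL = InternetOpenUrl_(hInet, url$, 0, 0, #INTERNET_FLAG_RELOAD, 0)
    If hURL
      ; InternetReadFile_() may deliver the page in pieces, so read until it
      ; fails, reports 0 bytes (end of data), or the buffer is full.
      Repeat
        ok = InternetReadFile_(hURL, *buf + total, bufsize - total, @bytes)
        If ok And bytes > 0
          total + bytes
        EndIf
      Until ok = 0 Or bytes = 0 Or total >= bufsize
      If total > 0
        html$ = PeekS(*buf, total, #PB_Ascii)
      EndIf
      InternetCloseHandle_(hURL)
    EndIf
    InternetCloseHandle_(hInet)
  EndIf
  FreeMemory(*buf)
  ProcedureReturn html$
EndProcedure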

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Sun Jan 10, 2016 9:11 pm
by tj1010
Yeah, if you're scraping you have to use the API, because the user-agent is checked by WAFs and other anti-bot and anti-attack systems. Then all you have to worry about is JS, HTML5 canvas, Flash, applets, cache tricks and, of course, CAPTCHAs.

I think you have to watch out for HTTPS certificate trickery too, like revoked and self-signed certs on phishing sites and malicious ad networks. I'm not sure if the PB functions validate TLS certificates.

EDIT: A lot of web hosts use a WAF or stat collection that will block you even if you own the domain and pay for the hosting. I've seen it with transparent stat collection systems built into Apache modules.

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Mon Jan 11, 2016 8:17 am
by Dude
I'm thinking this is pretty much a bug report actually, because if I have to write my own procedure to do what a command is supposed to do...?

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Mon Jan 11, 2016 8:49 pm
by tj1010
A custom user-agent parameter would be nice. Just adding a default user-agent isn't going to help much if you're using PB for web scraping or web APIs.

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Tue Jun 28, 2016 10:50 pm
by Lunasole
+1 to this request.
Currently PureBasic forces the UA string to always be null and hides any info like OS version, screen resolution, etc. That's nicely done, but there are some poor & stupid sites which simply forbid any access if the received UA differs from those used by common browsers. So at least the ability to spoof the user agent is necessary, not to mention the ability to build a request with fully custom headers ^^ (which is in fact the same thing: just an additional string argument)

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Tue Jun 28, 2016 11:05 pm
by le_magn
+1

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Fri Nov 18, 2016 6:36 pm
by tj1010
Lunasole wrote: +1 to this request.
Currently PureBasic forces the UA string to always be null and hides any info like OS version, screen resolution, etc. That's nicely done, but there are some poor & stupid sites which simply forbid any access if the received UA differs from those used by common browsers. So at least the ability to spoof the user agent is necessary, not to mention the ability to build a request with fully custom headers ^^ (which is in fact the same thing: just an additional string argument)

Just a generic header-setting parameter or call, and HTTP proxy support. For the coolest stuff out there that you'd want to interact with, you typically need to set the XMLHttpRequest and user-agent headers, for example. The Cookie header as well. Also the ability to read the returned headers.

I don't personally need SOCKS5 or HTTP-proxy support but I know places where it'd be handy and people have requested it. I just redirect through TOR via 127.0.0.1.

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Mon Dec 05, 2016 5:23 pm
by thyphoon
+1, a lot of people are waiting for proxy support and a user-agent option...

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Wed Dec 07, 2016 2:50 pm
by Fred
Would a generic user agent like "Mozilla/5.0 Gecko/41.0 Firefox/41.0" be enough?

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Wed Dec 07, 2016 10:23 pm
by thyphoon
Fred wrote: Would a generic user agent like "Mozilla/5.0 Gecko/41.0 Firefox/41.0" be enough?

On a professional project I must use the user agent as an authorisation key.

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Thu Dec 08, 2016 7:38 am
by Fred
I see, thanks!

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Fri Dec 09, 2016 8:21 am
by tj1010
If you added POST-variable and optional HTTP-header map parameters, you'd only ever have to do bug fixes pertaining to HTTP. PB already does TLS and chunking, so there is nothing else left to add for HTTP, and the low-level socket stuff has long been covered by the network library.

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Tue Jan 10, 2017 5:26 pm
by uwekel
+1 for this request

Re: ReceiveHTTPMemory() needs user-agent setting

Posted: Thu Feb 09, 2017 12:13 pm
by endo
We should be able to set any HTTP header for the request.
I need to send a request to http://api.example.com, but it requires me to set a custom header, X-AUTH-KEY, which is not yet possible with PB without dropping down to raw network code.
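
For reference, here is a hedged sketch of how such a custom-header request might look with the HTTPRequest() command found in newer PureBasic releases. It assumes your version provides HTTPRequest(), HTTPInfo() and FinishHTTP(). The user-agent string "MyApp/1.0" and the key "my-secret-key" are made-up placeholders, and this is not necessarily how the feature was implemented for ReceiveHTTPMemory() itself.

```purebasic
; Sketch only: assumes a PureBasic version that provides HTTPRequest().
; Older versions may also need InitNetwork() first, as in the earlier snippets.
NewMap Header$()
Header$("User-Agent") = "MyApp/1.0"       ; placeholder user-agent string
Header$("X-AUTH-KEY") = "my-secret-key"   ; placeholder key, not a real credential

Request = HTTPRequest(#PB_HTTP_Get, "http://api.example.com", "", 0, Header$())
If Request
  Debug "Status: " + HTTPInfo(Request, #PB_HTTP_StatusCode)
  Debug "Content: " + HTTPInfo(Request, #PB_HTTP_Response)
  FinishHTTP(Request)
Else
  Debug "Failed"
EndIf
```

Passing the headers as a map keeps the call generic: a user-agent, a cookie, or an API key are all just additional entries, which covers the requests made earlier in this thread.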