[Implemented] ReceiveHTTPMemory() needs user-agent setting

Dude
Addict
Posts: 1907
Joined: Mon Feb 16, 2015 2:49 pm

Post by Dude »

The following two snippets both fail to load the web page. :(

This fails to get anything at all:

Code: Select all

InitNetwork()

*Buffer = ReceiveHTTPMemory("http://whatismyipaddress.com/",#PB_HTTP_NoRedirect)

If *Buffer
  Size = MemorySize(*Buffer)
  Debug "Content: " + PeekS(*Buffer, Size, #PB_UTF8)
  FreeMemory(*Buffer)
Else
  Debug "Failed"
EndIf
And this fails because the site says it needs a valid user-agent:

Code: Select all

InitNetwork()

*Buffer = ReceiveHTTPMemory("http://whatismyipaddress.com/")

If *Buffer
  Size = MemorySize(*Buffer)
  Debug "Content: " + PeekS(*Buffer, Size, #PB_UTF8)
  FreeMemory(*Buffer)
Else
  Debug "Failed"
EndIf
So, can this command be enhanced to specify a user-agent for websites that need it? Thanks!
Dude
Addict
Posts: 1907
Joined: Mon Feb 16, 2015 2:49 pm

Post by Dude »

Currently I have to use this instead, which works fine:

Code: Select all

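Procedure.s ReceiveHTTPMemoryFixed(url$)
  ; Windows-only workaround that goes through WinINet instead of the PB HTTP library
  Protected html$, bytes.l, hInet, hURL, *buf
  Protected bufsize = 65536
  ; The first argument of InternetOpen_() is the user-agent string;
  ; the URL is simply reused for it here. 1 = direct connection, no proxy
  hInet = InternetOpen_(url$, 1, 0, 0, 0)
  If hInet
    hURL = InternetOpenUrl_(hInet, url$, 0, 0, #INTERNET_FLAG_RELOAD, 0)
    If hURL
      *buf = AllocateMemory(bufsize)
      If *buf
        ; The page can arrive in several chunks; a successful read of 0 bytes marks the end
        While InternetReadFile_(hURL, *buf, bufsize, @bytes) And bytes > 0
          html$ = html$ + PeekS(*buf, bytes, #PB_Ascii)
        Wend
        FreeMemory(*buf)
      EndIf
      InternetCloseHandle_(hURL) ; the URL handle was never closed in the first version
    EndIf
    InternetCloseHandle_(hInet)
  EndIf
  ProcedureReturn html$
EndProcedure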
tj1010
Enthusiast
Posts: 716
Joined: Mon Feb 25, 2013 5:51 pm

Post by tj1010 »

Yeah, if you're scraping you have to use the OS API, because WAFs and other anti-bot and anti-attack systems check the user-agent. After that, all you have to worry about is JS, HTML5 canvas, Flash, applets, cache tricks and, of course, captchas.

I think you have to watch out for HTTPS certificate trickery too, like revoked and self-signed certs on phishing sites and malicious ad networks. I'm not sure whether the PB functions validate TLS certificates.

EDIT: A lot of web hosts use a WAF or stat collection that will block you even if you own the domain and pay for the hosting. I've seen it with transparent stat collection systems built into Apache modules.
Dude
Addict
Posts: 1907
Joined: Mon Feb 16, 2015 2:49 pm

Post by Dude »

I'm thinking this is pretty much a bug report, actually. If I have to write my own procedure to do what a built-in command is supposed to do, isn't that a bug?
tj1010
Enthusiast
Posts: 716
Joined: Mon Feb 25, 2013 5:51 pm

Post by tj1010 »

A custom user-agent parameter would be nice. Just adding a default user-agent isn't going to help much if you're using PB for web scraping or web APIs.
Lunasole
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA

Post by Lunasole »

+1 to this request.
Currently PureBasic forces the UA string to always be null and hides any info like OS version, screen resolution, etc. That's nicely done, but there are some poor and stupid sites that simply forbid any access if the received UA differs from the ones used by common browsers. So at least the ability to spoof the user agent is necessary, not to mention the ability to build a request with fully custom headers ^^ (which is in fact the same thing: just one additional string argument).
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
le_magn
Enthusiast
Posts: 277
Joined: Wed Aug 24, 2005 12:11 pm
Location: Italia

Post by le_magn »

+1
tj1010
Enthusiast
Posts: 716
Joined: Mon Feb 25, 2013 5:51 pm

Post by tj1010 »

Lunasole wrote:
+1 to this request.
Currently PureBasic forces the UA string to always be null and hides any info like OS version, screen resolution, etc. That's nicely done, but there are some poor and stupid sites that simply forbid any access if the received UA differs from the ones used by common browsers. So at least the ability to spoof the user agent is necessary, not to mention the ability to build a request with fully custom headers ^^ (which is in fact the same thing: just one additional string argument).

Just a generic header-setting parameter or call, plus HTTP proxy support. For the coolest stuff out there that you'd want to interact with, you typically need to set the XMLHttpRequest (X-Requested-With) and User-Agent headers, for example. The Cookie header as well. Also the ability to read the returned headers.

I don't personally need SOCKS5 or HTTP proxy support, but I know places where it'd be handy, and people have requested it. I just redirect through Tor via 127.0.0.1.
thyphoon
Enthusiast
Posts: 345
Joined: Sat Dec 25, 2004 2:37 pm

Post by thyphoon »

+1, a lot of people are waiting for proxy support and a user-agent option...
Fred
Administrator
Posts: 18154
Joined: Fri May 17, 2002 4:39 pm
Location: France

Post by Fred »

Would a generic user agent like "Mozilla/5.0 Gecko/41.0 Firefox/41.0" be enough?
thyphoon
Enthusiast
Posts: 345
Joined: Sat Dec 25, 2004 2:37 pm

Post by thyphoon »

Fred wrote:
Would a generic user agent like "Mozilla/5.0 Gecko/41.0 Firefox/41.0" be enough?

On a professional project I have to use the user agent as an authorisation key.
Fred
Administrator
Posts: 18154
Joined: Fri May 17, 2002 4:39 pm
Location: France

Post by Fred »

I see, thanks!
tj1010
Enthusiast
Posts: 716
Joined: Mon Feb 25, 2013 5:51 pm

Post by tj1010 »

If you added POST-variable and optional HTTP-header map parameters, you'd only ever have to do bug fixes for HTTP from then on. PB already does TLS and chunking, so there is nothing else left to add for HTTP, and socket work that is low-level enough has long been possible with the network library.
uwekel
Enthusiast
Posts: 740
Joined: Sat Dec 03, 2011 5:54 pm
Location: Oldenburg (Germany)

Post by uwekel »

+1 for this request
PB 5.70 LTS (x64) - Debian Testing, Gnome 3.30.2
endo
Enthusiast
Posts: 141
Joined: Fri Apr 30, 2004 10:44 pm
Location: Turkiye (istanbul)

Post by endo »

We should be able to set any HTTP header for the request.
I need to send a request to http://api.example.com, but it requires me to set a custom header, X-AUTH-KEY, which is not yet possible with PB without dropping down to raw network code.
-= endo (registered user of purebasic since 98) =-