How to read a webpage

TI-994A
Addict
Posts: 2751
Joined: Sat Feb 19, 2011 3:47 am
Location: Singapore

How to read a webpage

Post by TI-994A »

Is there a way to stream and read the contents of a webpage that is not a file (e.g. .html, .php, .png, etc.)?

I am currently facing this issue with the AccuWeather site where the ReceiveHTTPFile() or even HTTPRequest() functions don't work.

Here's an example:

Code:

https://www.accuweather.com/en/br/s%C3%A3o-paulo/45881/weather-forecast/45881
It loads successfully into a WebGadget(), but I believe that reading it back with the GetGadgetItemText() function works only on Windows.
Texas Instruments TI-99/4A Home Computer: the first home computer with a 16bit processor, crammed into an 8bit architecture. Great hardware - Poor design - Wonderful BASIC engine. And it could talk too! Please visit my YouTube Channel :D
Kiffi
Addict
Posts: 1509
Joined: Tue Mar 02, 2004 1:20 pm
Location: Amphibios 9

Re: How to read a webpage

Post by Kiffi »

I think that in most cases it makes more sense to use any available APIs. You can then process their results with PureBasic's own JSON or XML commands.

See also: https://developer.accuweather.com/ or https://openweathermap.org/api
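For illustration, here is a minimal sketch of processing an API result with PureBasic's JSON commands. The JSON text and its member names ("city", "temperature") are made up for this example and are not the real AccuWeather or OpenWeatherMap response format; in practice the string would come from HTTPInfo() with #PB_HTTP_Response.

```purebasic
EnableExplicit

; Hypothetical API response, hard-coded so the sketch runs without a network call
Define Json$ = ~"{\"city\": \"Sao Paulo\", \"temperature\": 21.5}"
Define Root

If ParseJSON(0, Json$)
  Root = JSONValue(0)                                      ; Top-level JSON object
  Debug GetJSONString(GetJSONMember(Root, "city"))         ; Read a string member
  Debug GetJSONDouble(GetJSONMember(Root, "temperature"))  ; Read a numeric member
  FreeJSON(0)
Else
  Debug "JSON error: " + JSONErrorMessage()
EndIf
```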
Hygge

Re: How to read a webpage

Post by TI-994A »

Kiffi wrote: Sun Aug 11, 2024 8:03 pm
...it makes more sense to use any available APIs. ...
Hi Kiffi. You're right, and I'm already developing with AccuWeather through their APIs. But access to them is subscription-based, and even the free tier allows only limited access. So I was looking for a workaround, by trying to read the results directly from their public webpages.

I'm already getting the desired result through WebGadget() and the GetGadgetItemText() function with the #PB_Web_HtmlCode flag. But sadly, this is documented as a Windows-only solution, although it appears to work on macOS as well. :shock:

Nonetheless, I'm sure there must be a way to capture the incoming web stream of any URL - without the overhead of an embedded browser.

Any ideas, anyone? :lol:
plouf
Enthusiast
Posts: 282
Joined: Fri Apr 25, 2003 6:35 pm
Location: Athens,Greece

Re: How to read a webpage

Post by plouf »

Most websites nowadays use HTTPS, so it's difficult with ReceiveHTTPFile().

There is a libcurl.pbi wrapper written by a forum user. It's the preferred choice, since the curl library is up to date and fully compliant.
Christos
Fred
Administrator
Posts: 18350
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: How to read a webpage

Post by Fred »

ReceiveHTTPFile() supports HTTPS
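A minimal sketch of such a download, assuming a static page: the URL is the one used in the example further down this thread, and the target file name is hypothetical. As far as I know, PureBasic versions before 6.00 also need a call to InitNetwork() first.

```purebasic
EnableExplicit

; Hypothetical target path in the temporary directory
Define File$ = GetTemporaryDirectory() + "countries.html"

; ReceiveHTTPFile() returns nonzero when the download succeeded
If ReceiveHTTPFile("https://history.state.gov/countries/all", File$)
  Debug "Saved " + Str(FileSize(File$)) + " bytes to " + File$
Else
  Debug "Download failed"
EndIf
```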
infratec
Always Here
Posts: 7662
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: How to read a webpage

Post by infratec »

But here it makes no sense, since this page is not a static file.

It is rendered via JavaScript.
So you need a WebGadget or WebViewGadget, and you have to wait until the page is completely rendered.
Then you can get the content.
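A rough sketch of that approach, under some assumptions: it polls #PB_Web_Busy with a window timer and reads the HTML once the gadget reports it is no longer loading. The one-second interval is arbitrary, and "not busy" does not guarantee that every script on the page has finished rendering, so a real program may need an extra delay or a check for the expected content. #PB_Web_HtmlCode is documented as Windows-only.

```purebasic
EnableExplicit

#Window = 0
#Web    = 0
#Timer  = 1

Define Event, Html$
Define URL$ = "https://www.accuweather.com/en/br/sao-paulo/45881/weather-forecast/45881"

If OpenWindow(#Window, 0, 0, 800, 600, "Page capture", #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
  WebGadget(#Web, 0, 0, 800, 600, URL$)
  AddWindowTimer(#Window, #Timer, 1000)       ; Poll once per second (arbitrary interval)

  Repeat
    Event = WaitWindowEvent()
    If Event = #PB_Event_Timer And EventTimer() = #Timer
      ; #PB_Web_Busy is nonzero while the gadget is still loading the page
      If GetGadgetAttribute(#Web, #PB_Web_Busy) = 0
        RemoveWindowTimer(#Window, #Timer)    ; Capture only once
        Html$ = GetGadgetItemText(#Web, #PB_Web_HtmlCode)
        Debug "Captured " + Str(Len(Html$)) + " characters"
      EndIf
    EndIf
  Until Event = #PB_Event_CloseWindow
EndIf
```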

Re: How to read a webpage

Post by TI-994A »

Fred wrote: Mon Aug 12, 2024 12:57 pm
ReceiveHTTPFile() supports HTTPS
Absolutely, Fred. I took your suggestion in the other thread and adapted the REST API example to use the HTTPRequest() function. It now works with both HTTP and HTTPS.

REST API with HTTPRequest()

However, this particular URL continues to fail with all the PureBasic HTTP functions, although it loads normally into a WebGadget(). From there, I'm able to extract the HTML code with the GetGadgetItemText() function and the #PB_Web_HtmlCode flag.

Code:

https://www.accuweather.com/en/br/sao-paulo/45881/weather-forecast/45881
Like infratec just said, I now realise that it's a JavaScript-rendered webpage, so the WebGadget() is probably the only way to go.
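For anyone who wants to check what the plain HTTP functions actually receive for such a URL, here is a small diagnostic sketch. It only prints the status code, any error message and the length of the returned body; it makes no assumption about why the request fails.

```purebasic
EnableExplicit

Define Request, Body$
Define URL$ = "https://www.accuweather.com/en/br/sao-paulo/45881/weather-forecast/45881"

Request = HTTPRequest(#PB_HTTP_Get, URL$)
If Request
  Debug "Status: " + HTTPInfo(Request, #PB_HTTP_StatusCode)
  Debug "Error:  " + HTTPInfo(Request, #PB_HTTP_ErrorMessage)
  Body$ = HTTPInfo(Request, #PB_HTTP_Response)
  Debug "Body length: " + Str(Len(Body$))
  FinishHTTP(Request)                         ; Always release the request
Else
  Debug "The request could not be created"
EndIf
```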

Re: How to read a webpage

Post by TI-994A »

infratec wrote: Mon Aug 12, 2024 2:14 pm
... It is rendered via javascript. ...
Thanks for pointing that out, Bernd. I was so focused on finding the fault that I missed the cause.

Clearly one of those missed-the-forest-for-the-trees scenarios. :lol:
Fred

Re: How to read a webpage

Post by Fred »

:lol:
boddhi
Enthusiast
Posts: 524
Joined: Mon Nov 15, 2010 9:53 pm

Re: How to read a webpage

Post by boddhi »

It's certainly not the ideal solution, but in this kind of situation, here's how I proceed:

Code:

HTTPRequest=HTTPRequest(#PB_HTTP_Get,URL$)
If HTTPRequest
  Status$=HTTPInfo(HTTPRequest,#PB_HTTP_StatusCode) ; HTTPInfo() returns a string, so use a string variable
  Select Status$
    Case "200"
      HTMLContent$=HTTPInfo(HTTPRequest,#PB_HTTP_Response)
      If HTMLContent$
        ; * Here, see comment below
      EndIf
    ;Case [...]
  EndSelect
  FinishHTTP(HTTPRequest)                           ; Release the resources of the request
EndIf
* Regarding the comment marked in the code above:
To analyze/retrieve tags and the content between tags, I use the procedure in this thread (see 'Ajout 1') as a basis.
The ReadFile()/ReadString()/CloseFile() part can easily be replaced by the contents of the HTMLContent$ string, passed as the procedure's parameter, and the information you want to retrieve can be stored temporarily in whatever way you want (string with precise separators, map, list, etc.).
The first part of the procedure (after the comment 'Analyse du contenu entre deux balises ou commentaires HTML', i.e. analysis of the content between two HTML tags or comments) handles the text content between two tags or comments, while the second (after the comment 'Analyse balise HTML', i.e. HTML tag analysis) handles the tag and its potential attributes.

Of course, this will require a bit of work to adapt the code to get what you want to retrieve.
 
Here is a short and simple example of use:

Code:

EnableExplicit

Structure COUNTRY
  Name.s
  URL.s
EndStructure
#IHTML_REGEXHTMLBAL="</?\w+:?\w*((\s+(\w+-?+)+:?\w?+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
#REGEX_HTMLBALISE=0
Global NewList Countries.COUNTRY()

Procedure.a Pc_ContentAnalysis(ArgHTMLContent.s)
  Protected.i AncPosition=1                   ; Previous position in the HTML content
  Protected.i NouvPosition                    ; Current position in the HTML content after the RegEx call
  Protected.i LongChaine                      ; Length of the text string between two tags
  Protected.a Commentaire                     ; HTML comment currently being processed (Boolean)
  Protected.a BaliseListeTrouvee              ; Block containing the wanted info found
  Protected.a BalisePaysTrouvee               ; Country tag found
  Protected.s TexteHTML
  
  ; Replace LF, CR & TAB with spaces
  ;ArgHTMLContent=ReplaceString(ReplaceString(ReplaceString(ArgHTMLContent,Chr(10),""),Chr(13),""),Chr(9),"")
  ReplaceString(ArgHTMLContent,Chr(10)," ",#PB_String_InPlace)
  ReplaceString(ArgHTMLContent,Chr(13)," ",#PB_String_InPlace)
  ReplaceString(ArgHTMLContent,Chr(9)," ",#PB_String_InPlace)
  ArgHTMLContent=Trim(ArgHTMLContent)
  If PeekA(@ArgHTMLContent)=$FF And PeekA(@ArgHTMLContent+1)=$FE:ArgHTMLContent=Mid(ArgHTMLContent,2):EndIf
  ; Check the HTML document header
  If UCase(Left(ArgHTMLContent,15))<>"<!DOCTYPE HTML>"
    MessageRequester("HTML tag and attribute analysis","The content does not appear to be a valid HTML page",#PB_MessageRequester_Error)
    ProcedureReturn #False
  EndIf
  ; Tag reading loop
  If CreateRegularExpression(#REGEX_HTMLBALISE,#IHTML_REGEXHTMLBAL)
    If ExamineRegularExpression(#REGEX_HTMLBALISE,ArgHTMLContent)
      While NextRegularExpressionMatch(#REGEX_HTMLBALISE)
        NouvPosition=RegularExpressionMatchPosition(#REGEX_HTMLBALISE)
        ; Analysis of the content between two HTML tags or comments "<!-- blabla -->"
        If AncPosition<>NouvPosition                                      ; Text or comment
          LongChaine=NouvPosition-AncPosition
          TexteHTML=Mid(ArgHTMLContent,AncPosition,LongChaine)
          If Left(LTrim(TexteHTML),4)="<!--"                              ; Comment start tag
            If Right(RTrim(TexteHTML),3)<>"-->"                           ; Enclosing comment:"<!-- blabla > <blabla> blabla <!-->"
              Commentaire=#True
            EndIf
          ElseIf Right(RTrim(TexteHTML),3)="-->"                          ; End tag of an enclosing comment
            Commentaire=#False
          ElseIf BalisePaysTrouvee
            Countries()\Name=TexteHTML
            BalisePaysTrouvee=#False
          EndIf
        EndIf
        ; HTML tag analysis
        LongChaine=RegularExpressionMatchLength(#REGEX_HTMLBALISE)
        TexteHTML=RegularExpressionMatchString(#REGEX_HTMLBALISE)
        
        If Left(TexteHTML,18)="<div class="+Chr(34)+"lists"+Chr(34)
          BaliseListeTrouvee=#True
        ElseIf BaliseListeTrouvee
          If Left(TexteHTML,20)="<a href="+Chr(34)+"/countries/"
            BalisePaysTrouvee=#True
            AddElement(Countries())
            Countries()\URL=StringField(StringField(TexteHTML,2,"href="+Chr(34)),1,Chr(34))
          ElseIf Left(TexteHTML,31)="<div class="+Chr(34)+"hsg-width-sidebar"+Chr(34)+">"
            Break
          EndIf
        EndIf
        AncPosition=NouvPosition+LongChaine
      Wend
    EndIf
  EndIf
EndProcedure
Procedure.s Fc_WebsiteRequest(ArgURL.s)
  Protected.i HTTPRequest
  Protected.s Status,HTMLContent
  
  Debug "Request sent to the site"
  HTTPRequest=HTTPRequest(#PB_HTTP_Get,ArgURL)
  If HTTPRequest
    Status=HTTPInfo(HTTPRequest,#PB_HTTP_StatusCode)
    Debug "Status: "+Status
    Select Status
      Case "200"
        HTMLContent=HTTPInfo(HTTPRequest,#PB_HTTP_Response)
        If HTMLContent
          Pc_ContentAnalysis(HTMLContent)
        EndIf
        HTMLContent=""
      ;Case [...]
    EndSelect
    FinishHTTP(HTTPRequest) ; Release the resources of the request
  Else
    Debug "Error!"
  EndIf
EndProcedure
;
Fc_WebsiteRequest("https://history.state.gov/countries/all")
If ListSize(Countries())
  Debug ~"*--------------------------*\nCountry list:"
  ForEach Countries()
    Debug "  "+Countries()\Name+": URL="+Countries()\URL
  Next
Else
  Debug "No countries"
EndIf
If my English syntax and lexicon are incorrect, please bear with Google translate and DeepL. They rarely agree with each other!
Except on this sentence...
BarryG
Addict
Posts: 4219
Joined: Thu Apr 18, 2019 8:17 am

Re: How to read a webpage

Post by BarryG »

TI-994A wrote: Mon Aug 12, 2024 4:02 am
I'm already getting the desired result through WebGadget() and the GetGadgetItemText() function with the #PB_Web_HtmlCode flag. But sadly, this is indicated to be a Windows-only solution - although it appears to work on MacOS as well.
Seems like the best solution would be for #PB_Web_HtmlCode to be officially supported on Mac and Linux.

Re: How to read a webpage

Post by TI-994A »

BarryG wrote: Tue Aug 13, 2024 3:39 am
...for #PB_Web_HtmlCode to be officially supported on Mac and Linux.
As per the documentation, the HTML extraction feature is supported only on the Windows platform. But strangely enough, it works well on macOS Sonoma (PureBasic v6.11 LTS arm64) and macOS Catalina (PureBasic 5.73 LTS x64). :?