How to read a webpage

Post by **TI-994A** » Sun Aug 11, 2024 6:52 pm

Is there a way to stream and read the contents of a webpage that is not a file (e.g. .html, .php, .png, etc.)?

I am currently facing this issue with the AccuWeather site where the ReceiveHTTPFile() or even HTTPRequest() functions don't work.

Here's an example:

Code:

https://www.accuweather.com/en/br/s%C3%A3o-paulo/45881/weather-forecast/45881

It loads successfully into a WebGadget(), but I believe that the GetGadgetItemText() function for this works only on Windows.

Post by **Kiffi** » Sun Aug 11, 2024 8:03 pm

I think that in most cases it makes more sense to use any available APIs. You can then process their results with PureBasic's own JSON or XML commands.

See also: https://developer.accuweather.com/ or https://openweathermap.org/api
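As a rough illustration of processing an API result with PureBasic's JSON commands, here is a minimal sketch. The response string, and the `city` and `temperature` keys, are hypothetical and do not reflect the actual AccuWeather or OpenWeatherMap payloads.

```purebasic
EnableExplicit

; Hypothetical response body, for illustration only (not a real API payload).
Define Response$ = ~"{\"city\":\"Sao Paulo\",\"temperature\":23.5}"
Define Root

If ParseJSON(0, Response$)
  Root = JSONValue(0)                                      ; top-level JSON object
  Debug GetJSONString(GetJSONMember(Root, "city"))         ; read a string member
  Debug GetJSONDouble(GetJSONMember(Root, "temperature"))  ; read a numeric member
  FreeJSON(0)
Else
  Debug "JSON error: " + JSONErrorMessage()
EndIf
```

In practice the response string would come from HTTPRequest() and HTTPInfo() rather than from a literal.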

Post by **TI-994A** » Mon Aug 12, 2024 4:02 am

Kiffi wrote: Sun Aug 11, 2024 8:03 pm ...it makes more sense to use any available APIs. ...

Hi Kiffi. You're right, and I'm already developing with AccuWeather through their APIs. But access to them is subscription based, and even the free tier allows only limited access. So I was looking for a workaround, by trying to read the results directly from their public webpages.

I'm already getting the desired result through WebGadget() and the GetGadgetItemText() function with the #PB_Web_HtmlCode flag. But sadly, this is indicated to be a Windows-only solution - although it appears to work on macOS as well.

Nonetheless, I'm sure there must be a way to capture the incoming web stream of any URL - without the overhead of an embedded browser.

Any ideas, anyone?

Post by **plouf** » Mon Aug 12, 2024 6:12 am

Most websites nowadays use HTTPS, so it's difficult with the ReceiveHTTP functions.

There is a libcurl.pbi wrapper by a forum user. It's the preferred choice, since the curl library is up to date and fully compliant.

Post by **Fred** » Mon Aug 12, 2024 12:57 pm

ReceiveHTTPFile() supports HTTPS
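For reading a page into a string without going through a file, a minimal sketch using ReceiveHTTPMemory() might look like this. The URL is only an example, and the UTF-8 decoding is an assumption about the page's encoding. Older PureBasic versions may additionally require a call to InitNetwork() first.

```purebasic
EnableExplicit

Define URL$ = "https://www.purebasic.com/index.php"  ; example URL, any static page would do
Define *Buffer, Html$

*Buffer = ReceiveHTTPMemory(URL$)
If *Buffer
  ; Assumes the page is UTF-8 encoded; the size is given in bytes.
  Html$ = PeekS(*Buffer, MemorySize(*Buffer), #PB_UTF8 | #PB_ByteLength)
  FreeMemory(*Buffer)
  Debug "Received " + Str(Len(Html$)) + " characters"
Else
  Debug "Download failed"
EndIf
```

This only returns what the server sends, so content that a page builds later with JavaScript will not be in the string.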

Post by **infratec** » Mon Aug 12, 2024 2:14 pm

But here it makes no sense, since this page is not a static file.

It is rendered via JavaScript.
So you need WebGadget or WebViewGadget and wait until the page is completely rendered.
Then you can get the content.
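A minimal sketch of that approach, assuming that a few idle seconds after #PB_Web_Busy clears is enough for the scripts to finish. The `#SettleMs` grace period is a made-up value to tune, and #PB_Web_HtmlCode is documented as Windows-only, as discussed in this thread.

```purebasic
EnableExplicit

#Window   = 0
#Web      = 0
#SettleMs = 3000   ; assumed grace period for scripts to finish, tune as needed

Define URL$ = "https://www.accuweather.com/en/br/sao-paulo/45881/weather-forecast/45881"
Define Event, Html$
Define IdleSince.q = 0

If OpenWindow(#Window, 0, 0, 800, 600, "Page capture", #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
  WebGadget(#Web, 0, 0, 800, 600, URL$)
  Repeat
    Event = WaitWindowEvent(100)                ; keep the GUI responsive while polling
    If Html$ = ""
      If GetGadgetAttribute(#Web, #PB_Web_Busy)
        IdleSince = 0                           ; still loading, reset the timer
      ElseIf IdleSince = 0
        IdleSince = ElapsedMilliseconds()       ; loading has just stopped
      ElseIf ElapsedMilliseconds() - IdleSince >= #SettleMs
        Html$ = GetGadgetItemText(#Web, #PB_Web_HtmlCode)
        Debug "Captured " + Str(Len(Html$)) + " characters"
      EndIf
    EndIf
  Until Event = #PB_Event_CloseWindow
EndIf
```

A fixed delay is a crude heuristic. Checking the captured HTML for an expected element and retrying if it is missing would be more robust.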

Post by **TI-994A** » Mon Aug 12, 2024 2:44 pm

Fred wrote: Mon Aug 12, 2024 12:57 pm ReceiveHTTPFile() supports HTTPS

Absolutely, Fred. I took your suggestion in the other thread and adapted the REST API example to use the HTTPRequest() function. It now works with both http and https as well.

REST API with HTTPRequest()

However, this particular URL continues to fail with all the PureBasic http functions, although it could be loaded into a WebGadget() normally. From there, I'm able to extract the script with the GetGadgetItemText() function and the #PB_Web_HtmlCode directive.

Code:

https://www.accuweather.com/en/br/sao-paulo/45881/weather-forecast/45881

Like infratec just said, I now realise that it's a JavaScript-rendered webpage, so the WebGadget() is probably the only way to go.
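To see how the plain HTTP functions fail for a URL like this, a small diagnostic sketch can print the status code, error message and response headers. The User-Agent value is an assumption, added only because some servers may reject requests without one. It is not a confirmed fix for this site.

```purebasic
EnableExplicit

Define URL$ = "https://www.accuweather.com/en/br/sao-paulo/45881/weather-forecast/45881"
Define Request
NewMap Header$()
Header$("User-Agent") = "Mozilla/5.0"   ; assumed value, for illustration only

Request = HTTPRequest(#PB_HTTP_Get, URL$, "", 0, Header$())
If Request
  Debug "Status:  " + HTTPInfo(Request, #PB_HTTP_StatusCode)
  Debug "Error:   " + HTTPInfo(Request, #PB_HTTP_ErrorMessage)
  Debug "Headers: " + HTTPInfo(Request, #PB_HTTP_Headers)
  Debug "Body length: " + Str(Len(HTTPInfo(Request, #PB_HTTP_Response)))
  FinishHTTP(Request)                   ; release the request's resources
Else
  Debug "Request could not be created"
EndIf
```

Even with a 200 status, the body of a JavaScript-rendered page may contain only the initial markup and not the data of interest.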

Post by **TI-994A** » Mon Aug 12, 2024 2:53 pm

infratec wrote: Mon Aug 12, 2024 2:14 pm ... It is rendered via javascript. ...

Thanks for pointing that out, Bernd. I was so focused on finding the fault that I missed the cause.

Clearly one of those "missed the forest for the trees" scenarios.

Post by **Fred** » Mon Aug 12, 2024 4:22 pm

Post by **boddhi** » Mon Aug 12, 2024 6:26 pm

It's certainly not the ideal solution, but in this kind of situation, here's how I proceed:

Code:

HTTPRequest=HTTPRequest(#PB_HTTP_Get,URL$)
If HTTPRequest
   Status$=HTTPInfo(HTTPRequest,#PB_HTTP_StatusCode) ; HTTPInfo() returns a string
   Select Status$
     Case "200"
       HTMLContent$=HTTPInfo(HTTPRequest,#PB_HTTP_Response)
       If HTMLContent$
         ; * Here, see comment below
       EndIf
       ;Case [...]
   EndSelect
   FinishHTTP(HTTPRequest) ; release the request's resources
 EndIf

*Here, see comment below
And to analyze/retrieve tags and the content between tags, I use the procedure in this thread (see 'Ajout 1', i.e. 'Addition 1') as a basis.
The ReadFile()/ReadString()/CloseFile() part can easily be replaced by the contents of the HTMLContent$ string passed as the procedure's parameter, and the information you want to retrieve can be stored temporarily in whatever way you want (string with precise separators, map, list, etc.).
The first part of the procedure (after the comment 'Analyse du contenu entre deux balises ou commentaires HTML', i.e. analysis of the content between two HTML tags or comments) manages the text content between two tags or comments, while the second (after the comment 'Analyse balise HTML', i.e. HTML tag analysis) manages the tag and its potential attributes.

Of course, this will require a bit of work to adapt the code to get what you want to retrieve.

Here is a short and simple example of use:

Code:

EnableExplicit

Structure COUNTRY
  Name.s
  URL.s
EndStructure
#IHTML_REGEXHTMLBAL="</?\w+:?\w*((\s+(\w+-?+)+:?\w?+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
#REGEX_HTMLBALISE=0
Global NewList Countries.COUNTRY()

Procedure.a Pc_ContentAnalysis(ArgHTMLContent.s)
  Protected.i AncPosition=1                   ; Previous position in ArgHTMLContent
  Protected.i NouvPosition                    ; Current position in ArgHTMLContent after the RegEx call
  Protected.i LongChaine                      ; Length of the text string between two tags
  Protected.a Commentaire                     ; HTML comment currently being processed (Boolean)
  Protected.a BaliseListeTrouvee              ; Table containing the information has been found
  Protected.a BalisePaysTrouvee               ; Country tag found
  Protected.s TexteHTML
  
  ; Replace LF, CR & TAB with spaces
  ;ArgHTMLContent=ReplaceString(ReplaceString(ReplaceString(ArgHTMLContent,Chr(10),""),Chr(13),""),Chr(9),"")
  ReplaceString(ArgHTMLContent,Chr(10)," ",#PB_String_InPlace)
  ReplaceString(ArgHTMLContent,Chr(13)," ",#PB_String_InPlace)
  ReplaceString(ArgHTMLContent,Chr(9)," ",#PB_String_InPlace)
  ArgHTMLContent=Trim(ArgHTMLContent)
  If PeekA(@ArgHTMLContent)=$FF And PeekA(@ArgHTMLContent+1)=$FE:ArgHTMLContent=Mid(ArgHTMLContent,2):EndIf
  ; Check the HTML file header
  If UCase(Left(ArgHTMLContent,15))<>"<!DOCTYPE HTML>"
    MessageRequester("HTML tag and attribute analysis","The file does not appear to be a valid HTML page",#PB_MessageRequester_Error)
    ProcedureReturn #False
  EndIf
  ; Tag reading loop
  If CreateRegularExpression(#REGEX_HTMLBALISE,#IHTML_REGEXHTMLBAL)
    If ExamineRegularExpression(#REGEX_HTMLBALISE,ArgHTMLContent)
      While NextRegularExpressionMatch(#REGEX_HTMLBALISE)
        NouvPosition=RegularExpressionMatchPosition(#REGEX_HTMLBALISE)
        ; Analysis of the content between two HTML tags or comments "<!-- blabla -->"
        If AncPosition<>NouvPosition                                      ; Text or comment
          LongChaine=NouvPosition-AncPosition
          TexteHTML=Mid(ArgHTMLContent,AncPosition,LongChaine)
          If Left(LTrim(TexteHTML),4)="<!--"                              ; Comment start tag
            If Right(RTrim(TexteHTML),3)<>"-->"                           ; Enclosing comment:"<!-- blabla > <blabla> blabla <!-->"
              Commentaire=#True
            EndIf
          ElseIf Right(RTrim(TexteHTML),3)="-->"                          ; End tag of an enclosing comment
            Commentaire=#False
          ElseIf BalisePaysTrouvee
            Countries()\Name=TexteHTML
            BalisePaysTrouvee=#False
          EndIf
        EndIf
        ; HTML tag analysis
        LongChaine=RegularExpressionMatchLength(#REGEX_HTMLBALISE)
        TexteHTML=RegularExpressionMatchString(#REGEX_HTMLBALISE)
        
        If Left(TexteHTML,18)="<div class="+Chr(34)+"lists"+Chr(34)
          BaliseListeTrouvee=#True
        ElseIf BaliseListeTrouvee
          If Left(TexteHTML,20)="<a href="+Chr(34)+"/countries/"
            BalisePaysTrouvee=#True
            AddElement(Countries())
            Countries()\URL=StringField(StringField(TexteHTML,2,"href="+Chr(34)),1,Chr(34))
          ElseIf Left(TexteHTML,31)="<div class="+Chr(34)+"hsg-width-sidebar"+Chr(34)+">"
            Break
          EndIf
        EndIf
        AncPosition=NouvPosition+LongChaine
      Wend
    EndIf
  EndIf
  ProcedureReturn #True ; No header error was detected
EndProcedure
Procedure.s Fc_WebsiteRequest(ArgURL.s)
  Protected.i HTTPRequest
  Protected.s Status,HTMLContent
  
  Debug "Request sent to the site"
  HTTPRequest=HTTPRequest(#PB_HTTP_Get,ArgURL)
  If HTTPRequest
    Status=HTTPInfo(HTTPRequest,#PB_HTTP_StatusCode)
    Debug "Status: "+Status
    Select Status
      Case "200"
        HTMLContent=HTTPInfo(HTTPRequest,#PB_HTTP_Response)
        If HTMLContent
          Pc_ContentAnalysis(HTMLContent)
        EndIf
        HTMLContent=""
      ;Case [...]
    EndSelect
    FinishHTTP(HTTPRequest) ; Release the request's resources
  Else
    Debug "Error!"
  EndIf
EndProcedure
;
Fc_WebsiteRequest("https://history.state.gov/countries/all")
If ListSize(Countries())
  Debug ~"*--------------------------*\nCountry list:"
  ForEach Countries()
    Debug "  "+Countries()\Name+": URL="+Countries()\URL
  Next
Else
  Debug "No countries"
EndIf

Post by **BarryG** » Tue Aug 13, 2024 3:39 am

TI-994A wrote: Mon Aug 12, 2024 4:02 am I'm already getting the desired result through WebGadget() and the GetGadgetItemText() function with the #PB_Web_HtmlCode flag. But sadly, this is indicated to be a Windows-only solution - although it appears to work on MacOS as well.

Seems like the best solution would be for #PB_Web_HtmlCode to be officially supported on Mac and Linux.

Post by **TI-994A** » Tue Aug 13, 2024 6:47 am

BarryG wrote: Tue Aug 13, 2024 3:39 am ...for #PB_Web_HtmlCode to be officially supported on Mac and Linux.

As per the documentation, the HTML extraction feature is supported only on the Windows platform. But strangely enough, it works well on macOS Sonoma (PureBasic v6.11 LTS arm64) and macOS Catalina (PureBasic 5.73 LTS x64).

PureBasic Forums - English
