It's certainly not the ideal solution, but in this kind of situation, here's how I proceed:
Code:
HTTPRequest=HTTPRequest(#PB_HTTP_Get,URL$)
If HTTPRequest
  Status$=HTTPInfo(HTTPRequest,#PB_HTTP_StatusCode) ; HTTPInfo() returns a string
  Select Status$
    Case "200"
      HTMLContent$=HTTPInfo(HTTPRequest,#PB_HTTP_Response)
      If HTMLContent$
        ; * Here, see comment below
      EndIf
    ;Case [...]
  EndSelect
  FinishHTTP(HTTPRequest) ; Release the resources allocated for the request
EndIf
* Here: to analyze/retrieve tags and the content between tags, I use the procedure in this thread (see 'Ajout 1') as a basis.
The ReadFile()/ReadString()/CloseFile() part can easily be replaced by the contents of the HTMLContent$ string, passed as the procedure's parameter, and the information you want to retrieve can be stored temporarily in whatever form suits you (string with precise separators, map, list, etc.).
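To illustrate the storage side only, here is a minimal sketch that takes the HTML as a string parameter and fills a map. It is not the procedure from the linked thread: `CollectLinks()`, `#LINK_REGEX`, the simplified link pattern and the sample HTML are all made up for the illustration, and the pattern only handles plain `<a href="...">text</a>` links.

```purebasic
EnableExplicit

#LINK_REGEX = 0 ; Hypothetical regex ID, for this sketch only

; Hypothetical helper: fills Links() with "link text" -> "href" pairs
; found in the HTML string passed as parameter. Returns the map size.
Procedure.i CollectLinks(HTML.s, Map Links.s())
  ; Simplified pattern (an assumption): group 1 = href value, group 2 = link text
  Protected.s Pattern = "<a\s+href=" + #DQUOTE$ + "([^" + #DQUOTE$ + "]*)" + #DQUOTE$ + "[^>]*>([^<]*)</a>"
  If CreateRegularExpression(#LINK_REGEX, Pattern)
    If ExamineRegularExpression(#LINK_REGEX, HTML)
      While NextRegularExpressionMatch(#LINK_REGEX)
        Links(Trim(RegularExpressionGroup(#LINK_REGEX, 2))) = RegularExpressionGroup(#LINK_REGEX, 1)
      Wend
    EndIf
    FreeRegularExpression(#LINK_REGEX)
  EndIf
  ProcedureReturn MapSize(Links())
EndProcedure

; Made-up sample content standing in for HTMLContent$
Define.s Sample = ~"<ul><li><a href=\"/countries/chile\">Chile</a></li><li><a href=\"/countries/peru\">Peru</a></li></ul>"
NewMap Links.s()
CollectLinks(Sample, Links())
ForEach Links()
  Debug MapKey(Links()) + " -> " + Links()
Next
```

A map is convenient when you later need to look up one entry by name; a list, as in the full example further down, keeps the order in which the entries appear on the page.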
The first part of the procedure (after the comment 'Analyse du contenu entre deux balises ou commentaires HTML') manages the text content between two tags or comments, while the second (after the comment 'Analyse balise HTML') manages the tag and its potential attributes.
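Stripped down to its skeleton, that two-part loop could look like the sketch below. It uses a deliberately naive tag pattern (`<[^>]+>`) instead of the full regular expression, and `SplitTextAndTags()`, `#SKETCH_REGEX` and the sample string are hypothetical names chosen for the illustration.

```purebasic
EnableExplicit

#SKETCH_REGEX = 0               ; Hypothetical regex ID, for this sketch only
#SKETCH_TAGPATTERN = "<[^>]+>"  ; Naive tag pattern (an assumption, not the real one)

; Hypothetical helper: walks through HTML and records, in order,
; the text found between tags and the tags themselves.
Procedure SplitTextAndTags(HTML.s, List Pieces.s())
  Protected.i PrevPos = 1, MatchPos, MatchLen
  Protected.s Between

  If CreateRegularExpression(#SKETCH_REGEX, #SKETCH_TAGPATTERN)
    If ExamineRegularExpression(#SKETCH_REGEX, HTML)
      While NextRegularExpressionMatch(#SKETCH_REGEX)
        MatchPos = RegularExpressionMatchPosition(#SKETCH_REGEX)
        MatchLen = RegularExpressionMatchLength(#SKETCH_REGEX)

        ; Part 1: text located between the previous tag and this one
        If MatchPos > PrevPos
          Between = Trim(Mid(HTML, PrevPos, MatchPos - PrevPos))
          If Between <> ""
            AddElement(Pieces()) : Pieces() = "TEXT: " + Between
          EndIf
        EndIf

        ; Part 2: the tag itself, with its potential attributes
        AddElement(Pieces()) : Pieces() = "TAG: " + RegularExpressionMatchString(#SKETCH_REGEX)

        ; The next text chunk starts right after this tag
        PrevPos = MatchPos + MatchLen
      Wend
    EndIf
    FreeRegularExpression(#SKETCH_REGEX)
  EndIf
EndProcedure

NewList Pieces.s()
SplitTextAndTags("<p>Hello <a href='/x'>world</a></p>", Pieces())
ForEach Pieces()
  Debug Pieces()
Next
```

The key point is the pair of positions: the gap between the end of the previous match and the start of the current one is the text content, and the match itself is the tag.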
Of course, this will require a bit of work to adapt the code to get what you want to retrieve.
Here is a short, simple usage example:
Code:
EnableExplicit
Structure COUNTRY
  Name.s
  URL.s
EndStructure
#IHTML_REGEXHTMLBAL="</?\w+:?\w*((\s+(\w+-?+)+:?\w?+(\s*=\s*(?:"+#DQUOTE$+".*?"+#DQUOTE$+"|'.*?'|[^'"+#DQUOTE$+">\s]+))?)+\s*|\s*)/?>"
#REGEX_HTMLBALISE=0
Global NewList Countries.COUNTRY()
Procedure.a Pc_ContentAnalysis(ArgHTMLContent.s)
  Protected.i AncPosition=1       ; Previous position in ArgHTMLContent
  Protected.i NouvPosition        ; Current position in ArgHTMLContent after the RegEx call
  Protected.i LongChaine          ; Length of the text between two tags
  Protected.a Commentaire         ; HTML comment currently being processed (Boolean)
  Protected.a BaliseListeTrouvee  ; Block containing the wanted data has been found
  Protected.a BalisePaysTrouvee   ; Country tag found
  Protected.s TexteHTML
  ; Replace LF, CR & TAB with spaces (the commented-out line removes them instead)
  ;ArgHTMLContent=ReplaceString(ReplaceString(ReplaceString(ArgHTMLContent,Chr(10),""),Chr(13),""),Chr(9),"")
  ReplaceString(ArgHTMLContent,Chr(10)," ",#PB_String_InPlace)
  ReplaceString(ArgHTMLContent,Chr(13)," ",#PB_String_InPlace)
  ReplaceString(ArgHTMLContent,Chr(9)," ",#PB_String_InPlace)
  ArgHTMLContent=Trim(ArgHTMLContent)
  ; Skip a leading byte order mark (BOM), if any
  If PeekA(@ArgHTMLContent)=$FF And PeekA(@ArgHTMLContent+1)=$FE:ArgHTMLContent=Mid(ArgHTMLContent,2):EndIf
  ; Check the HTML header
  If UCase(Left(ArgHTMLContent,15))<>"<!DOCTYPE HTML>"
    MessageRequester("HTML tag and attribute analysis","The content does not appear to be a valid HTML page",#PB_MessageRequester_Error)
    ProcedureReturn #False
  EndIf
  ; Tag reading loop
  If CreateRegularExpression(#REGEX_HTMLBALISE,#IHTML_REGEXHTMLBAL)
    If ExamineRegularExpression(#REGEX_HTMLBALISE,ArgHTMLContent)
      While NextRegularExpressionMatch(#REGEX_HTMLBALISE)
        NouvPosition=RegularExpressionMatchPosition(#REGEX_HTMLBALISE)
        ; Analyse du contenu entre deux balises ou commentaires HTML (content between two tags, or HTML comments "<!-- blabla -->")
        If AncPosition<>NouvPosition ; Text or comment
          LongChaine=NouvPosition-AncPosition
          TexteHTML=Mid(ArgHTMLContent,AncPosition,LongChaine)
          If Left(LTrim(TexteHTML),4)="<!--" ; Comment opening tag
            If Right(RTrim(TexteHTML),3)<>"-->" ; Enclosing comment:"<!-- blabla > <blabla> blabla <!-->"
              Commentaire=#True
            EndIf
          ElseIf Right(RTrim(TexteHTML),3)="-->" ; Closing tag of an enclosing comment
            Commentaire=#False
          ElseIf BalisePaysTrouvee
            Countries()\Name=TexteHTML
            BalisePaysTrouvee=#False
          EndIf
        EndIf
        ; Analyse balise HTML (analysis of the HTML tag itself)
        LongChaine=RegularExpressionMatchLength(#REGEX_HTMLBALISE)
        TexteHTML=RegularExpressionMatchString(#REGEX_HTMLBALISE)
        If Left(TexteHTML,18)="<div class="+Chr(34)+"lists"+Chr(34)
          BaliseListeTrouvee=#True
        ElseIf BaliseListeTrouvee
          If Left(TexteHTML,20)="<a href="+Chr(34)+"/countries/"
            BalisePaysTrouvee=#True
            AddElement(Countries())
            Countries()\URL=StringField(StringField(TexteHTML,2,"href="+Chr(34)),1,Chr(34))
          ElseIf Left(TexteHTML,31)="<div class="+Chr(34)+"hsg-width-sidebar"+Chr(34)+">"
            Break
          EndIf
        EndIf
        AncPosition=NouvPosition+LongChaine
      Wend
    EndIf
    FreeRegularExpression(#REGEX_HTMLBALISE)
  EndIf
  ProcedureReturn #True
EndProcedure
Procedure.s Fc_WebsiteRequest(ArgURL.s)
  Protected.i HTTPRequest
  Protected.s Status,HTMLContent
  Debug "Request sent to the site"
  HTTPRequest=HTTPRequest(#PB_HTTP_Get,ArgURL)
  If HTTPRequest
    Status=HTTPInfo(HTTPRequest,#PB_HTTP_StatusCode)
    Debug "Status: "+Status
    Select Status
      Case "200"
        HTMLContent=HTTPInfo(HTTPRequest,#PB_HTTP_Response)
        If HTMLContent
          Pc_ContentAnalysis(HTMLContent)
        EndIf
        HTMLContent=""
      ;Case [...]
    EndSelect
    FinishHTTP(HTTPRequest) ; Release the resources allocated for the request
  Else
    Debug "Error!"
  EndIf
EndProcedure
;
Fc_WebsiteRequest("https://history.state.gov/countries/all")
If ListSize(Countries())
  Debug ~"*--------------------------*\nCountry list:"
  ForEach Countries()
    Debug " "+Countries()\Name+": URL="+Countries()\URL
  Next
Else
  Debug "No countries"
EndIf