Extract all URLs from page html
Posted: Fri Aug 26, 2016 1:33 pm
by Lunasole
I finally got tired of rewriting such trivial parsing stuff every time it's needed, so I just made a simple unified function ^^
Code: Select all
EnableExplicit
; Searches for <a> tags within an HTML page and returns the URLs from all matches
; Links$()  an array to receive the result
; HTML$     any HTML text
; RETURN    number of links found
Procedure ExtractLinks (Array Links$(1), HTML$)
  ; First pass: collect every complete <a ... href=...>...</a> element
  Protected exp = CreateRegularExpression(#PB_Any, ~"<a.*?href=.*?<\\/a>", #PB_RegularExpression_NoCase)
  Protected res = ExtractRegularExpression(exp, HTML$, Links$()) : FreeRegularExpression(exp)
  Protected count = res ; keep the match count, since res is counted down by the loop below
  ; Second pass: isolate the href="..." attribute inside each element
  exp = CreateRegularExpression(#PB_Any, ~"href=.*?\\\".*?\\\"", #PB_RegularExpression_NoCase)
  Protected Dim T$(0)
  While res
    res - 1
    ExtractRegularExpression(exp, Links$(res), T$())
    Links$(res) = StringField(T$(0), 2, #DOUBLEQUOTE$) ; extract the URL itself from the <a> tag
    Debug Links$(res)
  Wend
  FreeRegularExpression(exp)
  ProcedureReturn count ; 0 when nothing matched
EndProcedure
;; Example
Define HTML$ = ~"<A pff \"attributes\" before href=\"./ucp.php?mode=logout&sid=abcdfe\"><img src=\"./styles/subsilver2/theme/images/icon_mini_login.gif\" width=\"12\" height=\"13\" alt=\"*\" /> Logout [ ]</a>"
Dim Out$(0)
Debug "Links found: " + ExtractLinks(Out$(), HTML$)
Re: Extract all URLs from page html
Posted: Mon Aug 29, 2016 9:23 am
by Kwai chang caine
I get a "Doublequote missing" error on line 22

Re: Extract all URLs from page html
Posted: Mon Aug 29, 2016 12:57 pm
by mk-soft
PB v5.42 or greater
Literal strings

Re: Extract all URLs from page html
Posted: Mon Aug 29, 2016 1:43 pm
by Keya
Just looking at the code, though, it appears it would miss URLs that have any extra space (or tab) in "<a href", or any <a> tags with other attributes before the "href" (there are a dozen or so of them), and would also miss capitalized <A etc. </devils advocate>

and yes I agree parsing HTML is a -deleted-

Re: Extract all URLs from page html
Posted: Tue Aug 30, 2016 3:38 pm
by Kwai chang caine
mk-soft wrote:PB v5.42 or greater
Literal strings

Thanks...
I didn't know about this feature yet

Re: Extract all URLs from page html
Posted: Tue Aug 30, 2016 4:06 pm
by Marc56us
I didn't know about this feature yet

Code: Select all
; Hello "World"
; OLD
Debug "Hello " + #DQUOTE$ + "World" + #DQUOTE$ + " <- Old PB system"
; or
Debug "Hello " + Chr(34) + "World" + Chr(34) + " <- Old Classic system"
; NEW
Debug ~"Hello \"World\" <- New quotes inside string!"
(~" => Escape sequences in string like in C)
https://www.purebasic.com/documentation ... rules.html
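
A few more escape sequences next to their plain-string counterparts, as a small sketch. The sample strings are made up for illustration. The last line shows why the regex patterns in the first post contain so many backslashes: inside a ~"..." string every backslash meant for the regex engine has to be doubled.

```purebasic
; Escaped (~"...") and plain strings side by side
Debug ~"Column1\tColumn2"        ; \t inserts a tab, same as Chr(9)
Debug ~"C:\\Temp\\file.txt"      ; \\ yields a single backslash
Debug "C:\Temp\file.txt"         ; plain strings take backslashes literally
Debug ~"href=\"index.html\""     ; \" embeds a double quote
; Regex patterns need their backslashes doubled inside ~"..." strings
Debug ~"<\\/a>"                  ; the pattern text handed to the regex engine is <\/a>
```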

Re: Extract all URLs from page html
Posted: Wed Aug 31, 2016 8:46 am
by Kwai chang caine
Wouaaaaahhh !!! Marc, you really are an angel !!
Imagine, yesterday I searched everywhere on the forum and did not find an explanation of this feature
I even tried several ways to use it and always got a red line
I saw the tilde character, but did not understand how to write the end
So I waited, because I did not dare to ask yet another fatal donkey question
And you are like a mother to me... thanks to you, I have the answer now....
Hyper cool, this feature...
Thanks a lot my friend

Re: Extract all URLs from page html
Posted: Wed Aug 31, 2016 9:43 pm
by tj1010
Most of the DOM is generated procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that allows a work-around for same-origin, so that AJAX requests could be used too.
Re: Extract all URLs from page html
Posted: Wed Aug 31, 2016 9:45 pm
by Lunasole
Keya wrote:Just looking at the code, though, it appears it would miss URLs that have any extra space (or tab) in "<a href", or any <a> tags with other attributes before the "href" (there are a dozen or so of them), and would also miss capitalized <A etc. </devils advocate>

and yes I agree parsing HTML is a -deleted-

Thanks, I've modified the function to ignore case and parameters before href (as well as extra spaces, etc.). It should work nicely now, unless some unforeseen side effects show up ^^
tj1010 wrote:Most of the DOM is generated procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that allows a work-around for same-origin, so that AJAX requests could be used too.
That's true for many dynamic sites (parsing them is a pain anyway, as it requires authorization, etc.), but there are still many cases where you can just get the HTML and follow links from it.
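
For comparison, the edge cases Keya listed (extra whitespace, attributes before href, capitalized tags) could also be covered in a single pass by capturing the href value in a regex group. This is only a sketch under assumptions, not the function from the first post: the name ExtractLinksGrouped, the list parameter and the pattern are hypothetical, and it additionally accepts single-quoted values. Unquoted href values are still missed.

```purebasic
EnableExplicit

; Hypothetical single-pass variant: the URL is read from capture group 2
; Urls$()  list that receives the URLs (cleared first)
; HTML$    any HTML text
; RETURN   number of URLs found
Procedure ExtractLinksGrouped(List Urls$(), HTML$)
  ; Regex text: <a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1
  ; group 1 = opening quote, group 2 = URL, \1 = the same quote again
  Protected pattern$ = ~"<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1"
  Protected exp = CreateRegularExpression(#PB_Any, pattern$, #PB_RegularExpression_NoCase | #PB_RegularExpression_DotAll)
  ClearList(Urls$())
  If exp
    If ExamineRegularExpression(exp, HTML$)
      While NextRegularExpressionMatch(exp)
        AddElement(Urls$())
        Urls$() = RegularExpressionGroup(exp, 2)
      Wend
    EndIf
    FreeRegularExpression(exp)
  EndIf
  ProcedureReturn ListSize(Urls$())
EndProcedure

;; Example (made-up markup)
NewList Found$()
Define n = ExtractLinksGrouped(Found$(), ~"<A class=\"x\"  HREF = './page.html'>go</a> <a href=\"b.html\">b</a>")
Debug "Links found: " + Str(n)
ForEach Found$()
  Debug Found$()
Next
```

Using \s in front of href keeps attributes such as data-href from matching, and [^>]*? keeps the search inside the opening tag, so the closing </a> is not needed at all.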
Re: Extract all URLs from page html
Posted: Thu Sep 01, 2016 5:24 am
by Keya
Great work!! I'm not good with regex, so I thought I'd thrown some really annoying spanners into the works there, but it looks like you found them easy enough to handle

Re: Extract all URLs from page html
Posted: Sat Sep 03, 2016 3:35 am
by Lunasole
Keya wrote:Great work!! I'm not good with regex, so I thought I'd thrown some really annoying spanners into the works there, but it looks like you found them easy enough to handle :)
Nobody is good with them :) I've known them for years but still use only the simplest constructions, because if you write some complex regexp, you will soon forget how it works and why it works at all. That's the "expressive power" of the language ^^