Extract all URLs from page html

Posted: Fri Aug 26, 2016 1:33 pm
by Lunasole
I finally got tired of rewriting such trivial parsing stuff every time it is needed, so I just made a simple unified function ^^

Code: Select all

EnableExplicit

; Searches for <a> tags within html page and returns URLs from all matches
; Links$()			an array to receive the result
; HTML$				any HTML text
; RETURN			number of links found
Procedure ExtractLinks (Array Links$(1), HTML$)
	; First pass: collect every complete <a ... href=... </a> element
	Protected exp = CreateRegularExpression(#PB_Any, ~"<a.*?href=.*?<\\/a>", #PB_RegularExpression_NoCase)
	Protected count = ExtractRegularExpression(exp, HTML$, Links$()) : FreeRegularExpression(exp)
	Protected res = count
	; Second pass: isolate the href="..." attribute inside each match
	exp = CreateRegularExpression(#PB_Any, ~"href=.*?\\\".*?\\\"", #PB_RegularExpression_NoCase)
	Protected Dim T$(0)
	While res
		res - 1
		ExtractRegularExpression(exp, Links$(res), T$())
		Links$(res) = StringField(T$(0), 2, #DOUBLEQUOTE$) ; extract the URL itself from the <a> tag
		Debug Links$(res)
	Wend
	FreeRegularExpression(exp)
	ProcedureReturn count ; match count, so 0 is returned when no link was found
EndProcedure

;; Example
Define HTML$ = ~"<A pff \"attributes\" before href=\"./ucp.php?mode=logout&sid=abcdfe\"><img src=\"./styles/subsilver2/theme/images/icon_mini_login.gif\" width=\"12\" height=\"13\" alt=\"*\" /> Logout [ ]</a>"
Dim Out$(0)
Debug "Links found: " + ExtractLinks(Out$(), HTML$)
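
A possible variation, offered as a sketch rather than as the function above: the URL can be read from a capture group, so a single pattern and a single pass are enough. The procedure name ExtractLinksGrouped, the list Urls$() and the sample HTML are made up for illustration, and the pattern assumes that href values are wrapped in double quotes.

```purebasic
EnableExplicit

; Hypothetical variant: one pattern, URL taken from capture group 1.
; Assumption: only double-quoted href values are handled.
Procedure ExtractLinksGrouped(List Urls$(), HTML$)
	; Decoded pattern: <a\s[^>]*?href\s*=\s*"([^"]*)"
	Protected exp = CreateRegularExpression(#PB_Any, ~"<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", #PB_RegularExpression_NoCase)
	If exp = 0
		ProcedureReturn 0 ; the pattern could not be compiled
	EndIf
	ClearList(Urls$())
	If ExamineRegularExpression(exp, HTML$)
		While NextRegularExpressionMatch(exp)
			AddElement(Urls$())
			Urls$() = RegularExpressionGroup(exp, 1) ; text inside the quotes
		Wend
	EndIf
	FreeRegularExpression(exp)
	ProcedureReturn ListSize(Urls$())
EndProcedure

;; Example
NewList Found$()
Debug "Links found: " + ExtractLinksGrouped(Found$(), ~"<A class=\"x\"  HREF = \"./index.php\">Home</a> <a href=\"./faq.php\">FAQ</a>")
ForEach Found$()
	Debug Found$()
Next
```

A list is used here instead of an array only so that the result can grow match by match; single-quoted or unquoted href values would need a wider pattern.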

Re: Extract all URLs from page html

Posted: Mon Aug 29, 2016 9:23 am
by Kwai chang caine
I get a "Doublequote missing" error on line 22 :|

Re: Extract all URLs from page html

Posted: Mon Aug 29, 2016 12:57 pm
by mk-soft
PB v5.42 or greater
Literal strings :wink:

Re: Extract all URLs from page html

Posted: Mon Aug 29, 2016 1:43 pm
by Keya
just looking at the code though it appears it would miss URLs that have any extra space "<a  href" (or tab), or any of the other <a tags with attributes before the "href" (there's a dozen or so of them), and would also miss capitalized <A etc. </devils advocate> :D and yes I agree parsing HTML is a -deleted- :P

Re: Extract all URLs from page html

Posted: Tue Aug 30, 2016 3:38 pm
by Kwai chang caine
mk-soft wrote:PB v5.42 or greater
Literal strings :wink:
Thanks... 8)
I didn't know about this feature yet :shock:

Re: Extract all URLs from page html

Posted: Tue Aug 30, 2016 4:06 pm
by Marc56us
Kwai chang caine wrote:I didn't know about this feature yet :shock:

Code: Select all

; Hello "World"

; OLD
Debug "Hello " + #DQUOTE$ + "World" + #DQUOTE$ + " <- Old PB system"
; or
Debug "Hello " + Chr(34) + "World" + Chr(34) + " <- Old Classic system"

; NEW 
Debug ~"Hello \"World\" <- New quotes inside string!"
(~" => escape sequences in strings, like in C)
https://www.purebasic.com/documentation ... rules.html

:wink:
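
To connect this with the first post, here is a small sketch of what the two escaped patterns used in ExtractLinks contain once the compiler has decoded them. It relies only on the escape rules shown above (\\ gives one backslash, \" gives one double quote); the last line is an illustrative rebuild of the second pattern the old way, not code from the thread.

```purebasic
; \\ decodes to a single backslash
Debug ~"<a.*?href=.*?<\\/a>"      ; shows: <a.*?href=.*?<\/a>

; \\ followed by \" decodes to a backslash and a double quote
Debug ~"href=.*?\\\".*?\\\""      ; shows: href=.*?\".*?\"

; The same second pattern without escape strings:
; in a normal string the backslash is an ordinary character
Debug "href=.*?\" + #DQUOTE$ + ".*?\" + #DQUOTE$
```

The backslash in front of / and of the double quote is only a regex-level escape, so the engine simply matches those characters literally.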

Re: Extract all URLs from page html

Posted: Wed Aug 31, 2016 8:46 am
by Kwai chang caine
Wouaaaaahhh !!! Marc, you are really an angel !!
Imagine, yesterday I searched everywhere on the forum and did not find an explanation of this feature :oops:
I even tried several ways to use it and always got a red line :cry:

Code: Select all

a$ = ~"""""aaa""""ttt"
Debug a$ 
I had seen the tilde character, but did not understand how to write the rest :oops:

Then I waited, because I did not dare to ask another fatal donkey question :mrgreen:
And you are like a mother to me... thanks to you, I have the answer now.... :D
Really cool, this feature...
Thanks a lot my friend 8)

Re: Extract all URLs from page html

Posted: Wed Aug 31, 2016 9:43 pm
by tj1010
Most of the DOM is generated procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that allows a work-around for same-origin, so that AJAX requests could be handled too.

Re: Extract all URLs from page html

Posted: Wed Aug 31, 2016 9:45 pm
by Lunasole
Keya wrote:just looking at the code though it appears it would miss URLs that have any extra space "<a  href" (or tab), or any of the other <a tags with attributes before the "href" (there's a dozen or so of them), and would also miss capitalized <A etc. </devils advocate> :D and yes I agree parsing HTML is a -deleted- :P
Thanks, I've modified the function to ignore case and attributes before href (as well as extra spaces, etc.). It should be fine now, provided no unforeseen side effects have appeared ^^
tj1010 wrote:Most of the DOM is generated procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that allows a work-around for same-origin, so that AJAX requests could be handled too.
That's right for many dynamic sites (parsing them is a pain anyway, as it requires authorization, etc.), but there are still many cases where you can just get the HTML and follow the links from it.

Re: Extract all URLs from page html

Posted: Thu Sep 01, 2016 5:24 am
by Keya
great work!! I'm not good with regex so i thought i'd thrown some really annoying spanners into the works there but it looks like you found them easy enough to handle :)

Re: Extract all URLs from page html

Posted: Sat Sep 03, 2016 3:35 am
by Lunasole
Keya wrote:great work!! I'm not good with regex so i thought i'd thrown some really annoying spanners into the works there but it looks like you found them easy enough to handle :)
Nobody is good with them :) I have known them for years but still use only the simplest constructions, because if you write a complex regexp, you will soon forget how it works and why it works at all. That's the "expressive power" of the language ^^