Extract all URLs from page html

Lunasole
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA

Post by Lunasole »

I finally got tired of rewriting this kind of trivial parsing code every time it is needed, so I made a simple, reusable function ^^

Code: Select all

EnableExplicit

; Searches for <a> tags within an HTML page and returns the URLs from all matches
; Links$()			an array to receive the result
; HTML$				any HTML text
; RETURN			number of links found
Procedure ExtractLinks (Array Links$(1), HTML$)
	; First pass: collect every complete <a ... href=...>...</a> element
	Protected exp = CreateRegularExpression(#PB_Any, ~"<a.*?href=.*?<\\/a>", #PB_RegularExpression_NoCase)
	Protected count = ExtractRegularExpression(exp, HTML$, Links$()) : FreeRegularExpression(exp)
	; Second pass: isolate the href="..." part of each element
	exp = CreateRegularExpression(#PB_Any, ~"href=.*?\\\".*?\\\"", #PB_RegularExpression_NoCase)
	Protected Dim T$(0)
	Protected res = count
	While res
		res - 1
		If ExtractRegularExpression(exp, Links$(res), T$())
			Links$(res) = StringField(T$(0), 2, #DOUBLEQUOTE$) ; extract the URL itself from the href attribute
		Else
			Links$(res) = "" ; no double-quoted href in this tag (e.g. single quotes)
		EndIf
		Debug Links$(res)
	Wend
	FreeRegularExpression(exp)
	ProcedureReturn count ; count from the first pass, also correct when nothing matched
EndProcedure

;; Example
Define HTML$ = ~"<A pff \"attributes\" before href=\"./ucp.php?mode=logout&sid=abcdfe\"><img src=\"./styles/subsilver2/theme/images/icon_mini_login.gif\" width=\"12\" height=\"13\" alt=\"*\" /> Logout [ ]</a>"
Dim Out$(0)
Debug "Links found: " + ExtractLinks(Out$(), HTML$)
Last edited by Lunasole on Wed Aug 31, 2016 10:01 pm, edited 2 times in total.
Kwai chang caine
Always Here
Posts: 5494
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Extract all URLs from page html

Post by Kwai chang caine »

I get a "Doublequote missing" error in line 22 :|
mk-soft
Always Here
Posts: 6250
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Extract all URLs from page html

Post by mk-soft »

PB v5.42 or greater
Literal strings :wink:
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Extract all URLs from page html

Post by Keya »

Just looking at the code, though, it appears it would miss URLs that have any extra space ("<a  href") or tab, or any <a tags with other attributes before the "href" (there are a dozen or so of them), and it would also miss capitalized <A etc. </devils advocate> :D And yes, I agree parsing HTML is a -deleted- :P
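The cases listed above are easy to turn into a quick check. The following is a minimal sketch, assuming PB v5.42 or greater. The single-pass pattern with a capture group and the Test$() inputs are made up for illustration and are not the pattern from the first post.

```purebasic
EnableExplicit

; Hypothetical single-pass pattern (an assumption, not the one from the first post):
; "<a", one whitespace character, any other attributes, then href="..." with the URL in group 1
Define Pattern$ = ~"<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\""
Define Regex = CreateRegularExpression(#PB_Any, Pattern$, #PB_RegularExpression_NoCase)

; Inputs built from the cases named above, plus one tag that should not match
Dim Test$(5)
Test$(0) = ~"<a href=\"plain.html\">x</a>"
Test$(1) = ~"<a  href=\"two-spaces.html\">x</a>"
Test$(2) = ~"<a\thref=\"tab.html\">x</a>"
Test$(3) = ~"<a class=\"c\" href=\"attr-first.html\">x</a>"
Test$(4) = ~"<A HREF=\"upper.html\">x</A>"
Test$(5) = ~"<abbr title=\"href\">x</abbr>"

Define i, Found
If Regex
	For i = 0 To ArraySize(Test$())
		Found = #False
		If ExamineRegularExpression(Regex, Test$(i))
			Found = NextRegularExpressionMatch(Regex)
		EndIf
		If Found
			Debug "match:    " + RegularExpressionGroup(Regex, 1)
		Else
			Debug "no match: " + Test$(i)
		EndIf
	Next
	FreeRegularExpression(Regex)
Else
	Debug RegularExpressionError()
EndIf
```

With this pattern the first five inputs should report a match and the <abbr> line should not, because \s after <a requires a whitespace character. URLs quoted with single quotes are not handled by this sketch.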
Kwai chang caine
Always Here
Posts: 5494
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Extract all URLs from page html

Post by Kwai chang caine »

mk-soft wrote:PB v5.42 or greater
Literal strings :wink:
Thanks... 8)
I didn't know about this feature yet :shock:
Marc56us
Addict
Posts: 1600
Joined: Sat Feb 08, 2014 3:26 pm

Re: Extract all URLs from page html

Post by Marc56us »

Kwai chang caine wrote:I didn't know about this feature yet :shock:

Code: Select all

; Hello "World"

; OLD
Debug "Hello " + #DQUOTE$ + "World" + #DQUOTE$ + " <- Old PB system"
; or
Debug "Hello " + Chr(34) + "World" + Chr(34) + " <- Old Classic system"

; NEW 
Debug ~"Hello \"World\" <- New quotes inside string!"
(~" => Escape sequences in string like in C)
https://www.purebasic.com/documentation ... rules.html

:wink:
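To make the difference concrete, here is a small sketch (assuming PB v5.42 or greater, as noted earlier in the thread) that compares both styles and shows why regex patterns written as escaped strings need doubled backslashes. The variable names are only for illustration.

```purebasic
EnableExplicit

; Both assignments build the same text: href="index.html"
Define Classic$ = "href=" + #DQUOTE$ + "index.html" + #DQUOTE$
Define Escaped$ = ~"href=\"index.html\""
If Classic$ = Escaped$
	Debug "identical: " + Escaped$
EndIf

; Inside ~"..." a backslash must itself be doubled, which matters for regex patterns
Debug ~"<\\/a>"       ; the string contains <\/a>
Debug ~"one\ttwo"     ; \t becomes a tab character
```

Note that doubling the quote character ("") is not an escape in PureBasic; inside an escaped string a quote is always written as \".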
Kwai chang caine
Always Here
Posts: 5494
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Extract all URLs from page html

Post by Kwai chang caine »

Wouaaaaahhh !!! Marc, you really are an angel !!
Imagine, yesterday I searched everywhere on the forum and couldn't find an explanation of this feature :oops:
I even tried several ways to use it and always got a red line :cry:

Code: Select all

a$ = ~"""""aaa""""ttt"
Debug a$ 
I had seen the tilde character, but didn't understand how to write the rest :oops:

So I waited, because I didn't dare ask yet another fatally dumb question :mrgreen:
And you are like a mother to me... thanks to you, I have the answer now.... :D
This feature is hyper cool...
Thanks a lot, my friend 8)
tj1010
Enthusiast
Posts: 716
Joined: Mon Feb 25, 2013 5:51 pm

Re: Extract all URLs from page html

Post by tj1010 »

Most of the DOM is generated procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that also allowed a workaround for the same-origin policy, so that AJAX requests could be used too.
Lunasole
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA

Re: Extract all URLs from page html

Post by Lunasole »

Keya wrote:Just looking at the code, though, it appears it would miss URLs that have any extra space ("<a  href") or tab, or any <a tags with other attributes before the "href" (there are a dozen or so of them), and it would also miss capitalized <A etc. </devils advocate> :D And yes, I agree parsing HTML is a -deleted- :P
Thanks, I've modified the function to ignore case and any attributes before href (as well as extra spaces, etc.). It should be fine now, unless some unforeseen side effects show up ^^
tj1010 wrote:Most of the DOM is generated procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that also allowed a workaround for the same-origin policy, so that AJAX requests could be used too.
That's true for many dynamic sites (parsing them is a pain anyway, since it requires authorization, etc.), but there are still many cases where you can just fetch the HTML and follow the links in it.
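For the simple case of a static page, fetching the HTML and listing its links can be sketched as follows. This is a rough sketch, not a robust crawler: the address is a placeholder, the page is assumed to be UTF-8, and the pattern only picks up double-quoted href attributes (from any tag, not just <a>).

```purebasic
EnableExplicit

; Placeholder address, an assumption for illustration only
#PageURL$ = "http://www.purebasic.com/index.php"

CompilerIf #PB_Compiler_Version < 600
	InitNetwork() ; older PB versions need this before any network call
CompilerEndIf

Define *Buffer = ReceiveHTTPMemory(#PageURL$)
If *Buffer
	; Assumes the page is UTF-8 encoded
	Define HTML$ = PeekS(*Buffer, MemorySize(*Buffer), #PB_UTF8 | #PB_ByteLength)
	FreeMemory(*Buffer)

	; Collect every double-quoted href attribute, then cut the URL out of each match
	Define Regex = CreateRegularExpression(#PB_Any, ~"href\\s*=\\s*\"[^\"]*\"", #PB_RegularExpression_NoCase)
	If Regex
		Dim Found$(0)
		Define Count = ExtractRegularExpression(Regex, HTML$, Found$())
		Define i
		For i = 0 To Count - 1
			Debug StringField(Found$(i), 2, #DQUOTE$)
		Next
		FreeRegularExpression(Regex)
	EndIf
Else
	Debug "Download failed"
EndIf
```

Relative URLs such as ./ucp.php are printed as they appear in the page; they would still have to be resolved against the page address before they can be followed.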
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Extract all URLs from page html

Post by Keya »

Great work!! I'm not good with regex, so I thought I'd thrown some really annoying spanners into the works there, but it looks like you found them easy enough to handle :)
Lunasole
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA

Re: Extract all URLs from page html

Post by Lunasole »

Keya wrote:Great work!! I'm not good with regex, so I thought I'd thrown some really annoying spanners into the works there, but it looks like you found them easy enough to handle :)
Nobody is good with them :) I've known them for years but still use only the simplest constructions, because if you write a complex regexp, you will soon forget how it works and why it works at all. That's the "expressive power" of the language ^^