PureBasic Forums - English

Posted: **Fri Aug 26, 2016 1:33 pm**

I finally got tired of rewriting such trivial parsing stuff every time it's needed, so I just made a simple unified function ^^

Code: Select all

EnableExplicit

; Searches for <a> tags within an HTML page and returns the URLs from all matches
; Links$()			an array to receive the result
; HTML$				any HTML text
; RETURN			number of links found (0 if none)
Procedure ExtractLinks (Array Links$(1), HTML$)
	Protected exp = CreateRegularExpression(#PB_Any, ~"<a.*?href=.*?<\\/a>", #PB_RegularExpression_NoCase)
	Protected count = ExtractRegularExpression(exp, HTML$, Links$()) : FreeRegularExpression(exp)
	Protected res = count ; loop index, counts down to 0
	exp = CreateRegularExpression(#PB_Any, ~"href=.*?\\\".*?\\\"", #PB_RegularExpression_NoCase)
	Protected Dim T$(0)
	While res
		res - 1
		ExtractRegularExpression(exp, Links$(res), T$())
		Links$(res) = StringField(T$(0), 2, #DOUBLEQUOTE$) ; extract the URL itself from the <a> tag
		Debug Links$(res)
	Wend
	FreeRegularExpression(exp)
	ProcedureReturn count ; match count, so an empty result is reported as 0
EndProcedure

;; Example
Define HTML$ = ~"<A pff \"attributes\" before href=\"./ucp.php?mode=logout&sid=abcdfe\"><img src=\"./styles/subsilver2/theme/images/icon_mini_login.gif\" width=\"12\" height=\"13\" alt=\"*\" /> Logout [ ]</a>"
Dim Out$(0)
Debug "Links found: " + ExtractLinks(Out$(), HTML$)
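
A variation on the procedure above, as a rough sketch only: instead of matching the whole tag and then splitting on quotes, a capture group can pull the href value out in a single pass. The name ExtractLinksByGroup and the use of a list instead of an array are assumptions for illustration, not part of the original post. The pattern only handles double-quoted href values and would also match attributes that merely end in "href".

```purebasic
EnableExplicit

; Hypothetical helper (not from the original post): collects href values in one pass
; Urls$()   list that receives the URLs
; HTML$     any HTML text
; RETURN    number of URLs found
Procedure ExtractLinksByGroup(List Urls$(), HTML$)
	; Decoded pattern: <a\s[^>]*?href\s*=\s*"([^"]*)"   (group 1 captures the URL)
	Protected exp = CreateRegularExpression(#PB_Any, ~"<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", #PB_RegularExpression_NoCase)
	If exp = 0 : ProcedureReturn 0 : EndIf
	ClearList(Urls$())
	If ExamineRegularExpression(exp, HTML$)
		While NextRegularExpressionMatch(exp)
			AddElement(Urls$())
			Urls$() = RegularExpressionGroup(exp, 1)
		Wend
	EndIf
	FreeRegularExpression(exp)
	ProcedureReturn ListSize(Urls$())
EndProcedure

;; Example
NewList Found$()
Debug "Links found: " + ExtractLinksByGroup(Found$(), ~"<A class=\"x\" href=\"./ucp.php?mode=logout\">Logout</a>")
ForEach Found$() : Debug Found$() : Next
```

ExamineRegularExpression, NextRegularExpressionMatch and RegularExpressionGroup need a reasonably recent PB version; the escaped string already requires one anyway.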

Posted: **Mon Aug 29, 2016 9:23 am**

I have an error "Doublequote missing" in line 22

Posted: **Mon Aug 29, 2016 12:57 pm**

PB v5.42 or greater
Literal strings

Posted: **Mon Aug 29, 2016 1:43 pm**

just looking at the code though it appears it would miss URLs that have any extra space "<a href" (or tab), or any of the other <a tags with attributes before the "href" (there's a dozen or so of them), and would also miss capitalized <A etc. </devils advocate>

and yes I agree parsing HTML is a -deleted-

Posted: **Tue Aug 30, 2016 3:38 pm**

mk-soft wrote: PB v5.42 or greater
Literal strings

Thanks...

I didn't know about this feature yet

Posted: **Tue Aug 30, 2016 4:06 pm**

I didn't know about this feature yet

Code: Select all

; Hello "World"

; OLD
Debug "Hello " + #DQUOTE$ + "World" + #DQUOTE$ + " <- Old PB system"
; or
Debug "Hello " + Chr(34) + "World" + Chr(34) + " <- Old Classic system"

; NEW 
Debug ~"Hello \"World\" <- New quotes inside string!"

(~" => Escape sequences in string like in C)
https://www.purebasic.com/documentation ... rules.html
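
The same rule explains the doubled backslashes in the regular expressions earlier in the thread. A small sketch (the sample strings are picked for illustration): in a ~"..." string, \\ becomes one backslash and \" becomes one double quote, while a normal "..." string keeps backslashes untouched.

```purebasic
; Normal string: the backslash is just a character
Debug "\d+"                       ; shows \d+

; Escaped string: \\ is decoded to a single backslash
Debug ~"\\d+"                     ; shows \d+

; The second pattern from ExtractLinks, as the regex engine receives it
Debug ~"href=.*?\\\".*?\\\""      ; shows href=.*?\".*?\"

; \" counts as one character
Debug Len(~"\"")                  ; shows 1
```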

Posted: **Wed Aug 31, 2016 8:46 am**

Wouaaaaahhh !!! Marc, you are really an angel !!
Imagine, yesterday I searched everywhere on the forum and didn't find an explanation of this feature

I even tried several ways to use it and always got a red line

Code: Select all

a$ = ~"""""aaa""""ttt"
Debug a$

I had seen the tilde character, but didn't understand how to write the rest

So I waited, because I didn't dare ask another fatal donkey question

You are like a mother to me... thanks to you, I have the answer now....

This feature is hyper cool...
Thanks a lot my friend

Posted: **Wed Aug 31, 2016 9:43 pm**

Most of the DOM is built procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that also allowed a work-around for same-origin restrictions, so that AJAX requests could be used too.

Posted: **Wed Aug 31, 2016 9:45 pm**

Keya wrote:just looking at the code though it appears it would miss URLs that have any extra space "<a href" (or tab), or any of the other <a tags with attributes before the "href" (there's a dozen or so of them), and would also miss capitalized <A etc. </devils advocate> and yes I agree parsing HTML is a -deleted-

Thanks, I've modified the function to ignore case and attributes before href (as well as extra spaces, etc). It should be fine now, unless some unforeseen side effects show up ^^

tj1010 wrote: Most of the DOM is built procedurally with JS these days. A generic solution would be impossible without a runtime debugger for emulation that also allowed a work-around for same-origin restrictions, so that AJAX requests could be used too.

That's true for many dynamic sites (parsing them is a pain anyway, since it requires authorization, etc.), but there are still many cases where you can just fetch the HTML and follow the links in it.

Posted: **Thu Sep 01, 2016 5:24 am**

great work!! I'm not good with regex so I thought I'd thrown some really annoying spanners into the works there, but it looks like you found them easy enough to handle

Posted: **Sat Sep 03, 2016 3:35 am**

Keya wrote: great work!! I'm not good with regex so I thought I'd thrown some really annoying spanners into the works there, but it looks like you found them easy enough to handle :)

Nobody is good with them :) I've known them for years but still use only the simplest constructions, because if you write a complex regexp, you will soon forget how it works and why it works at all. That's the "expressive power" of the language ^^

Extract all URLs from page html
