
URL validation

Posted: Fri Nov 19, 2010 11:29 am
by greyhoundcode
Thought I'd share a snippet I use for validating URLs.

Code:

; Validates URLs
; --------------
; Must include a scheme such as http:// or ftp://
; Support for port numbers and numeric IPs
; 
; Returns bool (#True or #False)
; -----------------------------------------------
Procedure.b ValidURL(url.s)

    Protected regex.i, result.b = #False
    Protected pattern.s = "^([a-z0-9]+://)(([0-9a-z_!~*'().&=+$%-]+:)?[0-9a-z_!~*'().&=+$%-]+@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-z_!~*'()-]+\.)*([0-9a-z][0-9a-z-]{0,61})?[0-9a-z]\.[a-z]{2,6})(:[0-9]{1,4})?((/?)|(/[0-9a-z_!~*'().;?:@&=+$,%#-]+)+/?)$"
    
    ; #PB_Any lets PureBasic pick a free regex number; 0 means creation failed
    regex = CreateRegularExpression(#PB_Any, pattern)
    
    If regex
    
        If MatchRegularExpression(regex, url)
            result = #True
        EndIf
        
        ; Free the regex only if it was actually created
        FreeRegularExpression(regex)
        
    EndIf
    
    ProcedureReturn result

EndProcedure
25/11/10 edited to remove accidental whitespace in the regex

Re: URL validation

Posted: Fri Nov 19, 2010 5:07 pm
by JHPJHP
Very cool - thanks for sharing...

Re: URL validation

Posted: Fri Nov 19, 2010 10:30 pm
by greyhoundcode
Pleased to. :)

Re: URL validation

Posted: Sun Nov 21, 2010 12:38 am
by DarkPlayer
Hi,

This is nice, short code, but it does not recognize all URLs. Some time ago I wrote a small snippet that converts such exotic URLs to a more common format, and I copied a few examples from it that your code does not accept.

This does not work:
http://test:hehe@80.237.159.41:80
It is only a small error, which can be fixed by removing the space in the following part of your regex:

Code:

&=+$%-]+: )
This is a hex-encoded IP address. If you don't believe that it is valid, click on it and see what your browser does. (IE and Chrome will automatically convert it to decimal when opening the page; Firefox will show the hex.)
http://0x50ed9f29/blog/

This is also valid:
http://gOoGlE.de

These have also been valid for some time:
http://www.müller.de/
http://straße.de/

Also a nice example:
http://உதாரணம்.பரிட்சை/

localhost and any other plain hostname (not a DNS name!) is not recognized:
http://localhost/

This is forbidden by most registrars, but it is defined:
http://example_test.test

An IPv6 address would not be valid either:
http://[::1]

A DNS top-level domain with more than 6 characters is also possible. They are not used on the internet, but can be set up locally for an internal network:
http://example.myownnetwork

The BB Code Parser does not recognize all of them either :D
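
For the mixed-case example (http://gOoGlE.de), one possible fix is to pass the #PB_RegularExpression_NoCase flag when creating the regex. The sketch below only illustrates the flag: its pattern is a deliberately simplified stand-in chosen for this example, not the full pattern from the first post.

```purebasic
; Sketch: case-insensitive matching with #PB_RegularExpression_NoCase.
; The pattern is a simplified stand-in (scheme, host, optional slash),
; not the full URL regex from the first post.
Define regex.i = CreateRegularExpression(#PB_Any, "^[a-z0-9]+://[a-z0-9.-]+/?$", #PB_RegularExpression_NoCase)

If regex
    ; Non-zero result: the upper-case letters no longer prevent a match
    Debug MatchRegularExpression(regex, "http://gOoGlE.de")
    FreeRegularExpression(regex)
EndIf
```

The same flag could presumably be added to the CreateRegularExpression() call in ValidURL() without touching the pattern itself.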

DarkPlayer

Re: URL validation

Posted: Sun Nov 21, 2010 4:24 pm
by greyhoundcode
Good points :D

Re: URL validation

Posted: Sun Nov 21, 2010 9:08 pm
by Joakim Christiansen
greyhoundcode wrote:Thought I'd share a snippet I use for validating URLs.
You could also make it do an HTTP request to validate if the URL actually points to a real website.

Re: URL validation

Posted: Sun Nov 21, 2010 10:09 pm
by greyhoundcode
Joakim Christiansen wrote:You could also make it do an HTTP request to validate if the URL actually points to a real website.
That's true, although my intent was basically to avoid making unnecessary requests for badly formed URLs (the URLs come from potentially untrusted sources). But yeah, good point.

Re: URL validation

Posted: Mon Nov 22, 2010 5:38 am
by kvitaliy
Test this address:
http://россия.рф/main/page8.htm
Your code does not work with it :D

URL validation - regular expressions and non-Latin characters

Posted: Mon Nov 22, 2010 1:29 pm
by greyhoundcode
No, my code wouldn't work with something like россия.рф. However, I'd suggest creating a separate procedure to implement IDNA if this were a concern for an individual application, seeing as non-Latin or accented Latin characters are converted to an ASCII form anyway (like xn--h1alffa9f.xn--p1ai in the case of россия.рф).

So something like ValidURL( TransformIDNA_URL(url.s) ) maybe. Joakim's suggestion would probably be far easier in this case!
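
One caveat with that approach: the ASCII form of an internationalised top-level domain (xn--p1ai here) contains digits and hyphens, so the [a-z]{2,6} rule at the end of the host part would probably reject it as well. The sketch below compares that rule with a relaxed alternative on the converted host name. Both patterns are simplified stand-ins that only look at the end of the host, and the relaxed one is just an assumption about what might be acceptable, not a complete rule.

```purebasic
; Sketch: the original TLD rule versus a relaxed one, applied to a Punycode host.
; Both patterns are simplified stand-ins that inspect only the end of the host name.
Define host.s = "xn--h1alffa9f.xn--p1ai"
Define strict.i = CreateRegularExpression(#PB_Any, "\.[a-z]{2,6}$")
Define relaxed.i = CreateRegularExpression(#PB_Any, "\.([a-z]{2,}|xn--[0-9a-z-]+)$")

If strict And relaxed
    Debug MatchRegularExpression(strict, host)   ; 0: "xn--p1ai" contains digits and hyphens
    Debug MatchRegularExpression(relaxed, host)  ; non-zero: the xn-- form is accepted
EndIf

; Free whichever expressions were created
If strict : FreeRegularExpression(strict) : EndIf
If relaxed : FreeRegularExpression(relaxed) : EndIf
```

Dropping the upper limit of 6 in the relaxed pattern would also let longer internal top-level names such as example.myownnetwork through.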

Actually, I have to say I don't know too much about UTF-8/UTF-16 in regular expressions. Is anyone aware of good tutorials or resources for this? I wonder whether POSIX notations like [:upper:] apply irrespective of character set.