Some simple regular expression examples

luis
Addict
Posts: 3893
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Some simple regular expression examples

Post by luis »

This is not really a "Tricks 'n' Tips" post, but I thought this was a better place for it than "General Discussion".

Someone asked for some easy examples, so I put a few together. I'm not a regex expert, so I hope this is not a pile of *beep*. I made it with good intentions :)

Just don't ask me for more complicated patterns!

Some more or less simple regexes:


RegExMatch()

Code: Select all

Procedure.i RegExMatch (text$, regex$ = "")
; [DESC]
; Check whether a regular expression matches the passed string.
;
; [INPUT]
; text$ : The string to be checked.
; regex$ : The regular expression to be used.
; 	
; [RETURN]
; 1 if there is a match, 0 if not, -1 if the regular expression is invalid.
;
; [NOTES]
; You can omit the regular expression between multiple calls if the regex doesn't change.
;
; Here is a list of some of the more useful commands for pattern matching, and remember:
; the regex engine is eager, and ?, *, +, are greedy
;
; ^     = look at the start of the line (if used at the beginning of the expression)
; $     = look at the end of the line
; \A    = look at the start of the string
; \Z    = look at the end of the string (or just before a final newline)
; |     = OR
; ^     = NOT (if used as the first char inside a char specifier, ie: [^abc])
;
; .     = any char that is not a newline (equivalent to [^\r\n] under Windows)
; ?     = the preceding char/token is optional (also used to switch from greedy to lazy, ie: *?)
; \s    = any whitespace such as [ \t\r\n]
; \S    = anything that is not a whitespace
; \d    = any decimal digit [0-9]
; \D    = anything that is not a decimal digit
; \w    = any "word" character [A-Za-z0-9_]
; \W    = anything that is not a "word" character
; \b    = a boundary between a word and a non-word
; \B    = any position that is not a word boundary
; \Q \E = interpret all the characters between the \Q and the \E as literal characters
;
; *     = 0 or many times  {0,} 
; +     = 1 or many times  {1,} (also used to switch from greedy to possessive, ie: *+)
; (     = start sub expression
; )     = end sub expression
; {n}   = repeated n times
; {n,m} = repeated n times min, but no more than m
; {n,}  = repeated n times min, unlimited max
; [     = start char specifier 
; ]     = end char specifier 
; (?i)  = case insensitive matching 
; (?-i) = case sensitive matching 
;
; ?:    = turn off capturing
; ?!    = negative lookahead, usually to match something not followed by something else
;         example: a(?!b) = a not followed by b
;
; \     = the escape character
; \e    = escape ($1B)
; \n    = newline ($0A)
; \r    = carriage return ($0D)
; \t    = tab ($09)
; \xhh  = character with hex code hh
;
; Character classes, for example: [[:alpha:]]
;
; alnum    letters and digits
; alpha    letters
; ascii    character codes 0 - 127
; blank    space or tab only
; cntrl    control characters
; digit    decimal digits (same as \d)
; graph    printing characters, excluding space
; lower    lower case letters
; print    printing characters, including space
; punct    printing characters, excluding letters and digits and space
; space    white space (not quite the same as \s)
; upper    upper case letters
; word     "word" characters (same as \w)
; xdigit   hexadecimal digits
;
; refer to PCRESYNTAX(3) at http://www.pcre.org/pcre.txt for more info about the syntax

 Static iRegEx 
 Protected iRetVal = -1 ; regex error 
 
 If regex$
    If iRegEx > 0
        FreeRegularExpression(iRegEx)
    EndIf    
    iRegEx = CreateRegularExpression(#PB_Any, regex$)
 EndIf
 
 If iRegEx
    iRetVal = MatchRegularExpression(iRegEx, text$)   
 EndIf
 
 ProcedureReturn iRetVal
EndProcedure
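A side note on the -1 return value: as far as I know, RegularExpressionError() gives a readable description when CreateRegularExpression() fails. A minimal sketch under that assumption (the broken pattern and the variable name are only illustrations) to see why a regex was rejected:

```purebasic
; An unbalanced parenthesis makes this pattern invalid.
regex = CreateRegularExpression(#PB_Any, "(abc")

If regex
  Debug MatchRegularExpression(regex, "abc")
  FreeRegularExpression(regex)
Else
  ; RegularExpressionError() describes the problem with the last pattern that failed to compile.
  Debug "Invalid regex: " + RegularExpressionError()
EndIf
```

The same call could be added to RegExMatch() right after CreateRegularExpression() if you prefer a message over a bare -1.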
RegExMatch examples

Code: Select all

Define r$

r$ = "(?i)insensitive(?-i:sensitive)insensitive" ; mixing case sensitive/insensitive modifiers
Debug RegExMatch("INSENSITIVEsensitiveINSENSITIVE", r$) ; 1
Debug RegExMatch("INSENSITIVEsEnSiTiVeINSENSITIVE") ; 0

r$ = "colou?r" ; one optional char
Debug RegExMatch("English says colour.", r$) ; 1
Debug RegExMatch("The other ones says color.") ; 1

r$ = "test" ; "test" somewhere
Debug RegExMatch("This is a test string.", r$) ; 1
Debug RegExMatch("This is a TEST string.") ; 0

r$ = "test|strong" ; "test" or "strong" somewhere
Debug RegExMatch("This is a test string.", r$) ; 1
Debug RegExMatch("This is a strong string.") ; 1

r$ = "^test" ; "test" at the beginning 
Debug RegExMatch("This is a test string.", r$) ; 0

r$ = "^This" ; "This" at the beginning 
Debug RegExMatch("This is a test string.", r$) ; 1

r$ = "string\.$" ; "string." at the end
Debug RegExMatch("This is a test string.", r$) ; 1

r$="^num[0-9]$" ; match any single digit 
Debug RegExMatch("num1", r$) ; 1
Debug RegExMatch("numa") ; 0
Debug RegExMatch("num12") ; 0

r$="^num[0-9]+$" ; match one or more digits
Debug RegExMatch("num1", r$) ; 1
Debug RegExMatch("numa") ; 0
Debug RegExMatch("num12") ; 1

r$ = ".*\r\n$" ; any strings terminated by a #CRLF$
Debug RegExMatch("This end with CR+LF" + #CRLF$, r$) ; 1
Debug RegExMatch("This one does not" + #CRLF$ + ".") ; 0

r$ = "^B.*" ; any number of chars, even zero
Debug RegExMatch("BB DDD", r$) ; 1
Debug RegExMatch("B") ; 1
Debug RegExMatch("A") ; 0

r$ = "\b(cat|dog)\b" ; "cat" or "dog" as whole words
Debug RegExMatch("I like my dog.", r$) ; 1
Debug RegExMatch("You are a copycat.") ; 0

r$ = "^\d{3}$" ; exactly three digits
Debug RegExMatch("123", r$) ; 1
Debug RegExMatch("1234", r$) ; 0
Debug RegExMatch("a23", r$) ; 0

r$ = "^Match th.. Text!" ; two chars
Debug RegExMatch("Match this Text!", r$) ; 1
Debug RegExMatch("Match the Text!")  ; 0

r$ = "^Match th.? Text!" ; zero or one char
Debug RegExMatch("Match the Text!", r$)  ; 1

r$ = "^ERROR: \d{3}$" ; three digits at the end
Debug RegExMatch("ERROR: 404", r$) ; 1

r$ = "^Star: \* ABC\d$" ; one escaped '*', one digit at the end
Debug RegExMatch("Star: * ABC3", r$) ; 1

r$ = "(?m)X$" ; enable multiline mode, match X at the end of a line
Debug RegExMatch("At the end of this line there is one X" + #LF$ + "and nothing here", r$) ; 1 
Debug RegExMatch("No X here" + #LF$ + "this one end with X") ; 1  

r$ = "(?m)X\Z" ; enable multiline mode, match X at the end of the string only
Debug RegExMatch("At the end of this line there is one X" + #LF$ + "and nothing here", r$) ; 0
Debug RegExMatch("No X here" + #LF$ + "this one end with X") ; 1  

r$ = "^AAA[[:lower:]]{3}CCC$" ; three lowercase chars
Debug RegExMatch("AAAbbbCCC", r$) ; 1
Debug RegExMatch("AAABBBCCC") ; 0

r$="\Q[]\^$.|?*+()\E" ; match the problematic string "[]\^$.|?*+()" using \Q \E
Debug RegExMatch("abc []\^$.|?*+() def", r$) ; 1

r$="^((?!beer).)*$" ; match the string if it does not contain "beer" using negative lookahead
Debug RegExMatch("bread and water match", r$) ; 1
Debug RegExMatch("bread and beer doesn't") ; 0

r$ = "^\w+\d{3}\.(?i)(jpg|bmp)$" ; one or more "word" chars, followed by 3 digits, one dot, and "jpg" or "bmp" (case insensitive)
Debug RegExMatch("Image_001.bmp", r$) ; 1
Debug RegExMatch("TEST123.BMP") ; 1
Debug RegExMatch("123.bmp") ; 0

r$ = "^\w+\\\w+\.(?i)txt$" ; one or more "word" chars, followed by '\', one or more "word" chars, '.' and "txt" (case insensitive)
Debug RegExMatch("Example\File.txt", r$) ; 1 
Debug RegExMatch("\File.txt") ; 0 
Debug RegExMatch("c:\File.txt") ; 0 

r$= "^(?i)[a-z]:\\([^/:*?\x22.<>]+\\)*[^/:*?\x22<>]*$" ; fully qualified path to a Windows file or folder
Debug RegExMatch("c:\example\test\file.txt", r$) ; 1 
Debug RegExMatch("c:\f\ile.txt", r$) ; 1 
Debug RegExMatch("\test\file.txt") ; 0
Debug RegExMatch("c:\test\fil*.txt") ; 0
Debug RegExMatch("C:\test\", r$) ; 1

r$ = "^(?i)[A-Z0-9+_.-]+@[A-Z0-9.-]+$" ; not too strict email address validator
Debug RegExMatch("president@whitehouse.org", r$) ; 1 
Debug RegExMatch("president") ; 0
Debug RegExMatch("@whitehouse.org") ; 0
Debug RegExMatch("pre$ident@whitehouse.org") ; 0
Debug RegExMatch("user@host") ; 1
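Besides the inline (?i) modifier used above, CreateRegularExpression() also takes an optional flags parameter. A small sketch, assuming the #PB_RegularExpression_NoCase flag of the RegularExpression library (the variable names are made up):

```purebasic
; Same pattern, case insensitivity selected in two different ways.
rxInline = CreateRegularExpression(#PB_Any, "(?i)^test$")
rxFlag   = CreateRegularExpression(#PB_Any, "^test$", #PB_RegularExpression_NoCase)

If rxInline And rxFlag
  Debug MatchRegularExpression(rxInline, "TEST") ; non-zero: match
  Debug MatchRegularExpression(rxFlag, "TEST")   ; non-zero: match
  FreeRegularExpression(rxInline)
  FreeRegularExpression(rxFlag)
EndIf
```

The inline form has the advantage that it can be switched on and off inside the pattern, as in the first example of the list.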
ExtractRegExMatch()

Code: Select all

Procedure ExtractRegExMatch (text$, Array result$(1), regex$ = "")
; [DESC]
; Extract all the matching strings and copy them to the result$() array.
;
; [INPUT]
; text$ : The string to be checked.
; regex$ : The regular expression to be used.
;
; [OUTPUT]
; result$() : An array of strings dimensioned to 0, it will contain all the matching strings.
; 	
; [RETURN]
; Return the number of matches or -1 if the regular expression is invalid.
;
; [NOTES]
; You can omit the regular expression between multiple calls if the regex doesn't change.
; See RegExMatch() for help on the PCRE syntax.

 Static iRegEx 
 Protected iRetVal = -1 ; regex error 
 
 If regex$
    If iRegEx > 0
        FreeRegularExpression(iRegEx)
    EndIf     
    iRegEx = CreateRegularExpression(#PB_Any, regex$)
 EndIf
 
 If iRegEx
    iRetVal = ExtractRegularExpression(iRegEx, text$, result$())   
 EndIf
 
 ProcedureReturn iRetVal
EndProcedure
ExtractRegExMatch examples

Code: Select all

Define r$
Dim result$(0)

r$ = "\b\d{3}$" ; three digits at the end, preceded by a word boundary.
Debug ExtractRegExMatch ("ERROR: 404", result$(), r$) ; 1 
Debug result$(0) ; "404" extracted


r$ = "\$[A-Fa-f0-9]+" ; a valid PB hex number 
Debug ExtractRegExMatch ("Escape in hex is $1B, Return is $0d and 4096 is $1000.", result$(), r$) ; 3
Debug result$(0) ; "$1B" extracted
Debug result$(1) ; "$0d" extracted
Debug result$(2) ; "$1000" extracted

r$="\d\d[- /.]\d\d[- /.]\d\d" ; simple date match extractor with various separators
Debug ExtractRegExMatch ("Today is 10/08/12 and it's raining.", result$(), r$) ; 1
Debug result$(0) ; "10/08/12" extracted

r$="\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b" ; extract only numbers between 0-255
Debug ExtractRegExMatch ("111,256,0,1,50,100,777,255,1000", result$(), r$) ; 6
Debug result$(0) ; "111" extracted
Debug result$(1) ; "0" extracted
Debug result$(2) ; "1" extracted
Debug result$(3) ; "50" extracted
Debug result$(4) ; "100" extracted
Debug result$(5) ; "255" extracted
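The notes in RegExMatch() mention that ?, * and + are greedy and that a trailing ? makes them lazy. A minimal sketch of the difference, using the library calls directly (the sample text and variable names are only an illustration):

```purebasic
Dim found$(0)

rxGreedy = CreateRegularExpression(#PB_Any, "<.+>")   ; + is greedy: takes as much as possible
rxLazy   = CreateRegularExpression(#PB_Any, "<.+?>")  ; +? is lazy: takes as little as possible

If rxGreedy And rxLazy
  text$ = "<b>bold</b>"

  Debug ExtractRegularExpression(rxGreedy, text$, found$()) ; 1
  Debug found$(0)                                            ; "<b>bold</b>"

  Debug ExtractRegularExpression(rxLazy, text$, found$())   ; 2
  Debug found$(0)                                            ; "<b>"
  Debug found$(1)                                            ; "</b>"

  FreeRegularExpression(rxGreedy)
  FreeRegularExpression(rxLazy)
EndIf
```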
Last edited by luis on Wed Sep 05, 2012 10:42 pm, edited 3 times in total.
"Have you tried turning it off and on again ?"
A little PureBasic review
idle
Always Here
Posts: 5836
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Some simple regular expression example

Post by idle »

it's a good tip, thanks
Windows 11, Manjaro, Raspberry Pi OS
electrochrisso
Addict
Posts: 989
Joined: Mon May 14, 2007 2:13 am
Location: Darling River

Re: Some simple regular expression example

Post by electrochrisso »

Yeh!, thanks from me too. :)
PureBasic! Purely the best 8)
blueznl
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am

Re: Some simple regular expression example

Post by blueznl »

Regular expressions, at your service...

http://bluez.home.xs4all.nl/purebasic/p ... 15.htm#top
( PB6.00 LTS Win11 x64 Asrock AB350 Pro4 Ryzen 5 3600 32GB GTX1060 6GB)
( The path to enlightenment and the PureBasic Survival Guide right here... )
buddymatkona
Enthusiast
Posts: 252
Joined: Mon Aug 16, 2010 4:29 am

Re: Some simple regular expression example

Post by buddymatkona »

Here are a couple more that helped me get started with regex in PB. :)

Load a PB source file into PBfileInString$, then run this (quick and dirty concept code):

Code: Select all

Dim ProcLines$(0)   ; resized and filled by ExtractRegularExpression()
Dim ProcTokens$(0)

RegexProcTokens = CreateRegularExpression ( #PB_Any , "(\w+)" )
RegexProcLines = CreateRegularExpression ( #PB_Any , "\b(Procedure |Procedure\.. ).*" )

LineCount = ExtractRegularExpression ( RegexProcLines , PBfileInString$ , ProcLines$ ( ) )  ; Extract Procedure lines to string array
For LineCounter = 0 To LineCount - 1
  TokenCount = ExtractRegularExpression ( RegexProcTokens , ProcLines$ ( LineCounter ) , ProcTokens$ ( ) )  ; Parse Procedure lines              
  For TokenCounter = 0 To TokenCount - 1   
     Debug   Str ( TokenCounter + 1 ) + " Token= " + ProcTokens$(TokenCounter) ; Show Procedure Names, Arguments, Types     
  Next     
Next
Or use "^(?!.*;).*Procedure[ \.].*$" for RegexProcLines to ignore comment lines. This needs the "Multiline" option.
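For the "Multiline" option, a sketch assuming the #PB_RegularExpression_MultiLine and #PB_RegularExpression_AnyNewLine flags of CreateRegularExpression(). The sample source text and the variable names are made up; in practice the text would be the loaded file:

```purebasic
Dim ProcLines$(0)

; Hypothetical sample source standing in for PBfileInString$.
Source$ = "; Procedure Commented()" + #LF$
Source$ + "Procedure.i Add(a, b)" + #LF$
Source$ + "  ProcedureReturn a + b" + #LF$
Source$ + "EndProcedure" + #LF$

; MultiLine lets ^ and $ match at every line; AnyNewLine accepts CR, LF and CRLF line ends.
flags = #PB_RegularExpression_MultiLine | #PB_RegularExpression_AnyNewLine
regex = CreateRegularExpression(#PB_Any, "^(?!.*;).*Procedure[ \.].*$", flags)

If regex
  count = ExtractRegularExpression(regex, Source$, ProcLines$())
  For i = 0 To count - 1
    Debug ProcLines$(i) ; only the "Procedure.i Add(a, b)" line should show up
  Next
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```

Note that the lookahead rejects every line containing a ';', so a procedure declaration with a trailing comment is skipped as well.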
Kukulkan
Addict
Posts: 1396
Joined: Mon Jun 06, 2005 2:35 pm
Location: germany

Re: Some simple regular expression example

Post by Kukulkan »

Thank you for the examples. But sadly, I'm not even able to get the username out of this sample:

Code: Select all

userid: 1234
username: test
usertype: private
I would like to search for the content after the string "username:". But we need grouping and backreference support for this (which PB does not support).

I know, I can use pcre directly by using IncludeC(), but in that case the PB functions are useless :(

Kukulkan
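For readers on a later PureBasic release: the RegularExpression library has since gained ExamineRegularExpression(), NextRegularExpressionMatch() and RegularExpressionGroup(), which give access to capture groups. A minimal sketch assuming those functions are available in your version (the pattern and variable names are only an illustration):

```purebasic
text$ = "userid: 1234" + #LF$ + "username: test" + #LF$ + "usertype: private"

; Group 1 captures everything after "username:" and any blanks or tabs, up to the line end.
regex = CreateRegularExpression(#PB_Any, "username:[ \t]*([^\r\n]+)")

If regex
  If ExamineRegularExpression(regex, text$)
    While NextRegularExpressionMatch(regex)
      Debug RegularExpressionGroup(regex, 1) ; "test"
    Wend
  EndIf
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```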
blueznl
PureBasic Expert
Posts: 6166
Joined: Sat May 17, 2003 11:31 am

Re: Some simple regular expression example

Post by blueznl »

How is the original text organised, is it three lines and do you process each line, or is it part of a large file, or what? Doesn't seem to be too difficult to do it even with standard stuff...
Kukulkan
Addict
Posts: 1396
Joined: Mon Jun 06, 2005 2:35 pm
Location: germany

Re: Some simple regular expression example

Post by Kukulkan »

Hi blueznl,

In the end, it is OCR or PDF-extracted text, and I need to find this anywhere in it. The idea is to find the data to the right of "username:", or to the right of or below "invoice", etc.

It has to be flexible, as the customer should be able to define his own RegExp, too. And this is the reason why there should be no PureBasic string handling (like processing each line, using StringField() or some other string stuff). I need fully supported RegExp.

Kukulkan
srod
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Re: Some simple regular expression example

Post by srod »

Nice tip Luis, thanks for sharing.
I may look like a mule, but I'm not a complete ass.
Danilo
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Some simple regular expression example

Post by Danilo »

Kukulkan wrote: In the end, it is OCR or PDF-extracted text, and I need to find this anywhere in it. The idea is to find the data to the right of "username:", or to the right of or below "invoice", etc.

You only want to get the word after "username:" or "invoice"?

Code: Select all

input.s = "userid: 1234 username:"+#CRLF$+"test usertype: private "+#CRLF$
input   + "username:Kukulkan invoice #1234 username:   blueznl usertype:public invoice $123.99 "+#CRLF$
input   + "invoice"+#CRLF$
input   + "$0.99"

Dim result.s(0)

Procedure FindWordAfter(input.s,stringToFind.s,Array a.s(1))
    If CreateRegularExpression(0,stringToFind+"( )*[^ ]+")
        ProcedureReturn ExtractRegularExpression(0,input,a())
    EndIf
    ProcedureReturn -1
EndProcedure

Procedure showdata(header.s,Array a.s(1))
    Debug "- results for '"+header+"'"
    For i = 0 To ArraySize(a())
        Debug Trim(Mid(a(i),Len(header)+1))
    Next i
EndProcedure

count = FindWordAfter(input,"username:",result())
If count > 0
    showdata("username:",result())
EndIf

count = FindWordAfter(input,"invoice",result())
If count > 0
    showdata("invoice",result())
EndIf

count = FindWordAfter(input,"usertype:",result())
If count > 0
    showdata("usertype:",result())
EndIf
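One thing to watch if the search string comes from a user: characters like '(', '$' or '.' are regex commands. A sketch of a variant that wraps the search string in \Q \E so it is taken literally (FindLiteralWordAfter and the sample text are made up for illustration, not part of the code above):

```purebasic
Procedure FindLiteralWordAfter(text$, literal$, Array a$(1))
  ; \Q...\E makes PCRE treat literal$ as plain text
  ; (this breaks only if literal$ itself contains "\E").
  Protected regex = CreateRegularExpression(#PB_Any, "\Q" + literal$ + "\E *[^ \r\n]+")
  Protected count = -1 ; stays -1 if the regex is invalid

  If regex
    count = ExtractRegularExpression(regex, text$, a$())
    FreeRegularExpression(regex)
  EndIf

  ProcedureReturn count
EndProcedure

Dim hits$(0)
Debug FindLiteralWordAfter("price (net): 10 price (gross): 12", "price (net):", hits$()) ; 1
Debug hits$(0) ; "price (net): 10"
```

Without the \Q \E the parentheses in "price (net):" would be read as a sub expression and the literal brackets in the text would not match.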
Kukulkan
Addict
Posts: 1396
Joined: Mon Jun 06, 2005 2:35 pm
Location: germany

Re: Some simple regular expression example

Post by Kukulkan »

Hi Danilo,

thank you, but this needs special programming (showdata()). The users will be able to enter their own RegExp into my product, and it is not possible to let them write their own showdata() routine. I want to stay flexible by giving them the option to enter their own RegExp as filters for mandatory information in scanned (OCR) or extracted PDF documents.

The use case may be very different from my example, and the customers are not able to program their own routines.

It is like I wrote before:
And this is the reason why there should be no PureBasic string handling (like processing each line, using StringField() or some other string stuff). I need fully supported RegExp.
Your routine needs Trim(), Mid() and Len(). But this should be done using RegExp and not in code. It could be done if PB supported all the RegExp functionality of PCRE. And this is what I am complaining about...

Best,

Kukulkan
luis
Addict
Posts: 3893
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Re: Some simple regular expression example

Post by luis »

I was experimenting with regexes again to learn more, so I edited the original post with some better matching examples and added some extraction examples.
luis
Addict
Posts: 3893
Joined: Wed Aug 31, 2005 11:09 pm
Location: Italy

Re: Some simple regular expression example

Post by luis »

Kukulkan wrote:

Code: Select all

userid: 1234
username: test
usertype: private
I like to search for the content after the string "username:".

Hi, maybe I didn't understand your requirement, but doesn't this work (positive lookbehind)?

Code: Select all

Define a$, r$
Dim result$(0) ; filled by ExtractRegExMatch()

a$ = "userid: 1234" + #CRLF$
a$ + "username: test" + #CRLF$
a$ + "usertype: private" + #CRLF$

r$="(?<=username:).*" ; extract the string preceded by "username:"

Debug ExtractRegExMatch (a$, result$(), r$) ; 1
Debug result$(0) ; " test" extracted (including the blank after the colon)
I used the proc in the first post for my convenience :)
Kukulkan
Addict
Posts: 1396
Joined: Mon Jun 06, 2005 2:35 pm
Location: germany

Re: Some simple regular expression example

Post by Kukulkan »

Hi Luis,

thanks for the example. I had forgotten about this thread. We are currently guiding our customers towards "look-around assertions". We suggest they use something like

(?<=Username\:).+ or
(?<=Kunden-Nr\.:\s)[0-9]{5}[0-9]*
etc.

I can see now that this is exactly what you wrote. Thanks!!!!

The only problem is that we are not able to handle both of these (note the blank after the colon):
Username: Testuser
Username:Testuser

The RegExp matches either one or the other, because the lookbehind condition is not allowed to use * or + as quantifiers. We currently trim all results to give users a chance...

The

Kukulkan
Perkin
Enthusiast
Posts: 504
Joined: Thu Jul 03, 2008 10:13 pm
Location: Kent, UK

Re: Some simple regular expression example

Post by Perkin »

I haven't actually tested this, but it would work in a text editor's regex:
(?<=Username\: ?).+ or
(?<=Kunden-Nr\.:\s?)[0-9]{5}[0-9]*
Use a space or \s followed by a ?, which means it's optional.
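One caveat, as far as I understand PCRE (the engine behind PureBasic's regex library): a lookbehind has to be of fixed length, so an optional blank inside it will probably be rejected as an invalid pattern. Top-level alternatives of different fixed lengths are allowed, though. A minimal sketch under that assumption (the sample text and variable names are made up):

```purebasic
Dim result$(0)

; Two fixed-length alternatives inside the lookbehind. The (?! ) rejects the
; position right after the colon when a blank follows, so the match starts after the blank.
regex = CreateRegularExpression(#PB_Any, "(?<=Username: |Username:)(?! )[^\r\n]+")

If regex
  text$ = "Username: Testuser" + #CRLF$ + "Username:Testuser2"
  count = ExtractRegularExpression(regex, text$, result$())
  For i = 0 To count - 1
    Debug result$(i) ; "Testuser", then "Testuser2"
  Next
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```

This only covers zero or one blank. More blanks would need more alternatives, or trimming the result afterwards.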
%101010 = $2A = 42