Some simple regular expression examples

luis · Post by **luis** » Sat Aug 27, 2011 5:03 pm

This is not really a "trick'n'tips" post, but I thought this was a better place for it than "general discussion".

Someone asked for some easy examples, so I put a few together. I'm not a regex expert, so I hope this is not a pile of *beep*; I made it with good intentions.

Just don't ask me for more complicated patterns!

Some more or less simple regexes:

RegExMatch()

Code: Select all

Procedure.i RegExMatch (text$, regex$ = "")
; [DESC]
; Check whether a regular expression matches the passed string.
;
; [INPUT]
; text$ : The string to be checked.
; regex$ : The regular expression to be used.
; 	
; [RETURN]
; 1 if there is a match, 0 if not, -1 if the regular expression is invalid.
;
; [NOTES]
; You can omit the regular expression between multiple calls if the regex doesn't change.
;
; Here is a list of some of the more useful commands for pattern matching, and remember:
; the regex engine is eager, and ?, *, +, are greedy
;
; ^     = look at the start of the line (if used at the beginning of the expression)
; $     = look at the end of the line
; \A    = look at the start of the string
; \Z    = look at the end of the string
; |     = OR
; ^     = NOT (if used as the first char inside a char specifier, ie: [^abc])
;
; .     = any char that is not a newline (equivalent to [^\r\n] under Windows)
; ?     = the preceding char/token is optional (also used to switch from greedy to lazy, ie: *?)
; \s    = any whitespace such as [ \t\r\n]
; \S    = anything that is not a whitespace
; \d    = any decimal digit [0-9]
; \D    = anything that is not a decimal digit
; \w    = any "word" character [A-Za-z0-9_]
; \W    = anything that is not a "word" character
; \b    = a boundary between a word and a non-word
; \B    = any position that is not a word boundary
; \Q \E = interpret all the characters between the \Q and the \E as literal characters
;
; *     = 0 or many times  {0,} 
; +     = 1 or many times  {1,} (also used to switch from greedy to possessive, ie: *+)
; (     = start sub expression
; )     = end sub expression
; {n}   = repeated n times
; {n,m} = repeated n times min, but no more than m
; {n,}  = repeated n times min, unlimited max
; [     = start char specifier 
; ]     = end char specifier 
; (?i)  = case insensitive matching 
; (?-i) = case sensitive matching 
;
; ?:    = turn off capturing
; ?!    = negative lookahead, usually to match something not followed by something else
;         example: a(?!b) = a not followed by b
;
; \     = the escape character
; \e    = escape ($1B)
; \n    = newline ($0A)
; \r    = carriage return ($0D)
; \t    = tab ($09)
; \xhh  = character with hex code hh
;
; Character classes, for example: [[:alpha:]]
;
; alnum    letters and digits
; alpha    letters
; ascii    character codes 0 - 127
; blank    space or tab only
; cntrl    control characters
; digit    decimal digits (same as \d)
; graph    printing characters, excluding space
; lower    lower case letters
; print    printing characters, including space
; punct    printing characters, excluding letters and digits and space
; space    white space (not quite the same as \s)
; upper    upper case letters
; word     "word" characters (same as \w)
; xdigit   hexadecimal digits
;
; refer to PCRESYNTAX(3) at http://www.pcre.org/pcre.txt for more info about the syntax

 Static iRegEx 
 Protected iRetVal = -1 ; regex error 
 
 If regex$
    If iRegEx > 0
        FreeRegularExpression(iRegEx)
    EndIf    
    iRegEx = CreateRegularExpression(#PB_Any, regex$)
 EndIf
 
 If iRegEx
    iRetVal = MatchRegularExpression(iRegEx, text$)   
 EndIf
 
 ProcedureReturn iRetVal
EndProcedure

RegExMatch examples

Code: Select all

Define r$

r$ = "(?i)insensitive(?-i:sensitive)insensitive" ; mixing case sensitive/insensitive modifiers
Debug RegExMatch("INSENSITIVEsensitiveINSENSITIVE", r$) ; 1
Debug RegExMatch("INSENSITIVEsEnSiTiVeINSENSITIVE") ; 0

r$ = "colou?r" ; one optional char
Debug RegExMatch("English says colour.", r$) ; 1
Debug RegExMatch("The other ones says color.") ; 1

r$ = "test" ; "test" somewhere
Debug RegExMatch("This is a test string.", r$) ; 1
Debug RegExMatch("This is a TEST string.") ; 0

r$ = "test|strong" ; "test" or "strong" somewhere
Debug RegExMatch("This is a test string.", r$) ; 1
Debug RegExMatch("This is a strong string.") ; 1

r$ = "^test" ; "test" at the beginning 
Debug RegExMatch("This is a test string.", r$) ; 0

r$ = "^This" ; "This" at the beginning 
Debug RegExMatch("This is a test string.", r$) ; 1

r$ = "string\.$" ; "string" at the end
Debug RegExMatch("This is a test string.", r$) ; 1

r$="^num[0-9]$" ; match any single digit 
Debug RegExMatch("num1", r$) ; 1
Debug RegExMatch("numa") ; 0
Debug RegExMatch("num12") ; 0

r$="^num[0-9]+$" ; match any number of digits
Debug RegExMatch("num1", r$) ; 1
Debug RegExMatch("numa") ; 0
Debug RegExMatch("num12") ; 1

r$ = ".*\r\n$" ; any strings terminated by a #CRLF$
Debug RegExMatch("This end with CR+LF" + #CRLF$, r$) ; 1
Debug RegExMatch("This one does not" + #CRLF$ + ".") ; 0

r$ = "^B.*" ; any number of chars, even zero
Debug RegExMatch("BB DDD", r$) ; 1
Debug RegExMatch("B") ; 1
Debug RegExMatch("A") ; 0

r$ = "\b(cat|dog)\b" ; "cat" or "dog" as whole words
Debug RegExMatch("I like my dog.", r$) ; 1
Debug RegExMatch("You are a copycat.") ; 0

r$ = "^\d{3}$" ; exactly three digits
Debug RegExMatch("123", r$) ; 1
Debug RegExMatch("1234", r$) ; 0
Debug RegExMatch("a23", r$) ; 0

r$ = "^Match th.. Text!" ; two chars
Debug RegExMatch("Match this Text!", r$) ; 1
Debug RegExMatch("Match the Text!")  ; 0

r$ = "^Match th.? Text!" ; zero or one char
Debug RegExMatch("Match the Text!", r$)  ; 1

r$ = "^ERROR: \d{3}$" ; three digits at the end
Debug RegExMatch("ERROR: 404", r$) ; 1

r$ = "^Star: \* ABC\d$" ; one escaped '*', one digit at the end
Debug RegExMatch("Star: * ABC3", r$) ; 1

r$ = "(?m)X$" ; enable multiline mode, match X at the end of a line
Debug RegExMatch("At the end of this line there is one X" + #LF$ + "and nothing here", r$) ; 1 
Debug RegExMatch("No X here" + #LF$ + "this one end with X") ; 1  

r$ = "(?m)X\Z" ; enable multiline mode, match X at the end of the string only
Debug RegExMatch("At the end of this line there is one X" + #LF$ + "and nothing here", r$) ; 0
Debug RegExMatch("No X here" + #LF$ + "this one end with X") ; 1  

r$ = "^AAA[[:lower:]]{3}CCC$" ; three lowercase chars
Debug RegExMatch("AAAbbbCCC", r$) ; 1
Debug RegExMatch("AAABBBCCC") ; 0

r$="\Q[]\^$.|?*+()\E" ; match the problematic string "[]\^$.|?*+()" using \Q \E
Debug RegExMatch("abc []\^$.|?*+() def", r$) ; 1

r$="^((?!beer).)*$" ; match the string if it does not contain "beer" using negative lookahead
Debug RegExMatch("bread and water match", r$) ; 1
Debug RegExMatch("bread and beer doesn't") ; 0

r$ = "^\w+\d{3}\.(?i)(jpg|bmp)$" ; one or more "word" chars, followed by 3 digits, one dot, and "jpg" or "bmp" (case insensitive)
Debug RegExMatch("Image_001.bmp", r$) ; 1
Debug RegExMatch("TEST123.BMP") ; 1
Debug RegExMatch("123.bmp") ; 0

r$ = "^\w+\\\w+\.(?i)txt$" ; one or more "word" chars, followed by '\', one or more "word" chars, '.' and "txt" (case insensitive)
Debug RegExMatch("Example\File.txt", r$) ; 1 
Debug RegExMatch("\File.txt") ; 0 
Debug RegExMatch("c:\File.txt") ; 0 

r$= "^(?i)[a-z]:\\([^/:*?\x22.<>]+\\)*[^/:*?\x22<>]*$" ; fully qualified path to a Windows file or folder
Debug RegExMatch("c:\example\test\file.txt", r$) ; 1 
Debug RegExMatch("c:\f\ile.txt", r$) ; 1 
Debug RegExMatch("\test\file.txt") ; 0
Debug RegExMatch("c:\test\fil*.txt") ; 0
Debug RegExMatch("C:\test\", r$) ; 1

r$ = "^(?i)[A-Z0-9+_.-]+@[A-Z0-9.-]+$" ; not too strict email address validator
Debug RegExMatch("president@whitehouse.org", r$) ; 1 
Debug RegExMatch("president") ; 0
Debug RegExMatch("@whitehouse.org") ; 0
Debug RegExMatch("pre$ident@whitehouse.org") ; 0
Debug RegExMatch("user@host") ; 1
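
The notes in RegExMatch() say that ?, * and + are greedy and that a trailing ? switches them to lazy. RegExMatch() only reports whether there is a match, so it cannot show the difference. Here is a minimal sketch that calls ExtractRegularExpression() directly instead; the sample text and the variable names are made up for illustration.

```purebasic
Define text$ = "<b>one</b> and <b>two</b>"
Dim found$(0)
Define count, rx

; Greedy: .* grabs as much as possible, so both tags end up in a single match
rx = CreateRegularExpression(#PB_Any, "<b>.*</b>")
If rx
  count = ExtractRegularExpression(rx, text$, found$())
  Debug count      ; 1
  Debug found$(0)  ; "<b>one</b> and <b>two</b>" extracted
  FreeRegularExpression(rx)
EndIf

; Lazy: .*? stops at the first possible </b>, so each tag is matched on its own
rx = CreateRegularExpression(#PB_Any, "<b>.*?</b>")
If rx
  count = ExtractRegularExpression(rx, text$, found$())
  Debug count      ; 2
  Debug found$(0)  ; "<b>one</b>" extracted
  Debug found$(1)  ; "<b>two</b>" extracted
  FreeRegularExpression(rx)
EndIf
```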

ExtractRegExMatch()

Code: Select all

Procedure ExtractRegExMatch (text$, Array result$(1), regex$ = "")
; [DESC]
; Extract all the matching strings and copy them to the result$() array.
;
; [INPUT]
; text$ : The string to be checked.
; regex$ : The regular expression to be used.
;
; [OUTPUT]
; result$() : An array of strings dimensioned to 0, it will contain all the matching strings.
; 	
; [RETURN]
; Return the number of matches or -1 if the regular expression is invalid.
;
; [NOTES]
; You can omit the regular expression between multiple calls if the regex doesn't change.
; See RegExMatch() for help on the PCRE syntax.

 Static iRegEx 
 Protected iRetVal = -1 ; regex error 
 
 If regex$
    If iRegEx > 0
        FreeRegularExpression(iRegEx)
    EndIf     
    iRegEx = CreateRegularExpression(#PB_Any, regex$)
 EndIf
 
 If iRegEx
    iRetVal = ExtractRegularExpression(iRegEx, text$, result$())   
 EndIf
 
 ProcedureReturn iRetVal
EndProcedure

ExtractRegExMatch examples

Code: Select all

Define r$
Dim result$(0)

r$ = "\b\d{3}$" ; three digits at the end, preceded by a word boundary.
Debug ExtractRegExMatch ("ERROR: 404", result$(), r$) ; 1 
Debug result$(0) ; "404" extracted


r$ = "\$[A-Fa-f0-9]+" ; a valid PB hex number 
Debug ExtractRegExMatch ("Escape in hex is $1B, Return is $0d and 4096 is $1000.", result$(), r$) ; 3
Debug result$(0) ; "$1B" extracted
Debug result$(1) ; "$0d" extracted
Debug result$(2) ; "$1000" extracted

r$="\d\d[- /.]\d\d[- /.]\d\d" ; simple date match extractor with various separators
Debug ExtractRegExMatch ("Today is 10/08/12 and it's raining.", result$(), r$) ; 1
Debug result$(0) ; "10/08/12" extracted

r$="\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b" ; extract only numbers between 0-255
Debug ExtractRegExMatch ("111,256,0,1,50,100,777,255,1000", result$(), r$) ; 6
Debug result$(0) ; "111" extracted
Debug result$(1) ; "0" extracted
Debug result$(2) ; "1" extracted
Debug result$(3) ; "50" extracted
Debug result$(4) ; "100" extracted
Debug result$(5) ; "255" extracted
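
Both procedures above return -1 for an invalid pattern without saying what is wrong with it. PureBasic's RegularExpressionError() returns a description of the last failed CreateRegularExpression() call, so a small wrapper can report it. This is only a sketch: CompileOrReport() is a hypothetical helper name, not part of the code in this post.

```purebasic
; CompileOrReport() is a hypothetical helper, not part of the procedures above
Procedure.i CompileOrReport (regex$)
  Protected iRegEx = CreateRegularExpression(#PB_Any, regex$)
  If iRegEx = 0
    ; RegularExpressionError() describes why the last CreateRegularExpression() failed
    Debug "Invalid regex: " + regex$
    Debug RegularExpressionError()
  EndIf
  ProcedureReturn iRegEx
EndProcedure

Define rx

rx = CompileOrReport("^\d{3}$") ; valid pattern, nothing is printed
If rx
  FreeRegularExpression(rx)
EndIf

rx = CompileOrReport("(unclosed") ; missing ')', so the error text is printed and rx is 0
```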

Post by **idle** » Sun Aug 28, 2011 12:01 am

it's a good tip, thanks

electrochrisso · Post by **electrochrisso** » Sun Aug 28, 2011 2:43 am

Yeh!, thanks from me too.

blueznl · Post by **blueznl** » Tue Nov 15, 2011 8:43 pm

Regular expressions, at your service...

http://bluez.home.xs4all.nl/purebasic/p ... 15.htm#top

buddymatkona · Post by **buddymatkona** » Wed Nov 16, 2011 12:13 pm

Here are a couple more that helped me get started with regex in PB.

Load a PB source file into PBfileInString$, then run this (quick and dirty concept code):

Code: Select all

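Dim ProcLines$(0)  ; receives the extracted Procedure lines
Dim ProcTokens$(0) ; receives the tokens of one Procedure line

RegexProcTokens = CreateRegularExpression ( #PB_Any , "(\w+)" )
RegexProcLines = CreateRegularExpression ( #PB_Any , "\b(Procedure |Procedure\.. ).*" )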

LineCount = ExtractRegularExpression ( RegexProcLines , PBfileInString$ , ProcLines$ ( ) )  ; Extract Procedure lines to string array
For LineCounter = 0 To LineCount - 1
  TokenCount = ExtractRegularExpression ( RegexProcTokens , ProcLines$ ( LineCounter ) , ProcTokens$ ( ) )  ; Parse Procedure lines              
  For TokenCounter = 0 To TokenCount - 1   
     Debug   Str ( TokenCounter + 1 ) + " Token= " + ProcTokens$(TokenCounter) ; Show Procedure Names, Arguments, Types     
  Next     
Next

Or use "^(?!.*;).*Procedure[ \.].*$" for RegexProcLines to ignore comments. This needs the "Multiline" option (#PB_RegularExpression_MultiLine).
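
A minimal sketch of that multiline variant, assuming the pattern is compiled with the #PB_RegularExpression_MultiLine flag and the source uses LF line ends. The three-line sample source and the variable names are made up for illustration.

```purebasic
Define source$, rx, count, i
Dim lines$(0)

source$ = "Procedure.i Foo(a)" + #LF$
source$ + "; Procedure Bar() is commented out" + #LF$
source$ + "ProcedureReturn a" + #LF$

; With the MultiLine flag, ^ and $ match at every line, not only at the ends of the string
rx = CreateRegularExpression(#PB_Any, "^(?!.*;).*Procedure[ \.].*$", #PB_RegularExpression_MultiLine)
If rx
  count = ExtractRegularExpression(rx, source$, lines$())
  For i = 0 To count - 1
    Debug lines$(i) ; only lines with a Procedure declaration and no ';' anywhere
  Next
  FreeRegularExpression(rx)
EndIf
```

Note that the lookahead rejects every line containing a ';', so a declaration with a trailing comment on the same line is skipped too.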

Kukulkan · Post by **Kukulkan** » Wed Nov 16, 2011 5:23 pm

Thank you for the examples. But sadly, I'm not even able to get the username out of this sample:

Code: Select all

userid: 1234
username: test
usertype: private

I would like to search for the content after the string "username:". But we need grouping and backreference support for this (which PB does not support).

I know I can use PCRE directly via ImportC, but in that case the PB functions are useless.

Kukulkan
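
For readers on newer compilers: later PureBasic releases (5.20 and up, as far as I know) added ExamineRegularExpression(), NextRegularExpressionMatch() and RegularExpressionGroup(), which expose capture groups. A minimal sketch, assuming such a version is available; the variable names are only illustrative.

```purebasic
Define text$, rx

text$ = "userid: 1234" + #LF$
text$ + "username: test" + #LF$
text$ + "usertype: private" + #LF$

; Group 1 captures everything after "username:" and any following blanks or tabs
rx = CreateRegularExpression(#PB_Any, "username:[ \t]*(.*)")
If rx
  If ExamineRegularExpression(rx, text$)
    While NextRegularExpressionMatch(rx)
      Debug RegularExpressionGroup(rx, 1) ; "test"
    Wend
  EndIf
  FreeRegularExpression(rx)
Else
  Debug RegularExpressionError()
EndIf
```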

blueznl · Post by **blueznl** » Thu Nov 17, 2011 1:54 pm

How is the original text organised, is it three lines and do you process each line, or is it part of a large file, or what? Doesn't seem to be too difficult to do it even with standard stuff...

Kukulkan · Post by **Kukulkan** » Thu Nov 17, 2011 2:39 pm

Hi blueznl,

In the end, it is OCR or PDF-extracted text, and I need to find this anywhere in it. The idea is to find the data to the right of "username:", or to the right of or below "invoice", etc.

It has to be flexible, as the customer should be able to define his own RegExp, too. This is the reason why there should be no PureBasic string handling (like processing each line, or using StringField() or some other string stuff). I need fully supported RegExp.

Kukulkan

srod · Post by **srod** » Fri Nov 18, 2011 9:37 am

Nice tip Luis, thanks for sharing.

Danilo · Post by **Danilo** » Fri Nov 18, 2011 12:08 pm

Kukulkan wrote: In the end, it is OCR or PDF-extracted text, and I need to find this anywhere in it. The idea is to find the data to the right of "username:", or to the right of or below "invoice", etc.

You only want to get the word after "username:" or "invoice"?

Code: Select all

input.s = "userid: 1234 username:"+#CRLF$+"test usertype: private "+#CRLF$
input   + "username:Kukulkan invoice #1234 username:   blueznl usertype:public invoice $123.99 "+#CRLF$
input   + "invoice"+#CRLF$
input   + "$0.99"

Dim result.s(0)

Procedure FindWordAfter(input.s,stringToFind.s,Array a.s(1))
    If CreateRegularExpression(0,stringToFind+"( )*[^ ]+")
        ProcedureReturn ExtractRegularExpression(0,input,a())
    EndIf
    ProcedureReturn -1
EndProcedure

Procedure showdata(header.s,Array a.s(1))
    Debug "- results for '"+header+"'"
    For i = 0 To ArraySize(a())
        Debug Trim(Mid(a(i),Len(header)+1))
    Next i
EndProcedure

count = FindWordAfter(input,"username:",result())
If count > 0
    showdata("username:",result())
EndIf

count = FindWordAfter(input,"invoice",result())
If count > 0
    showdata("invoice",result())
EndIf

count = FindWordAfter(input,"usertype:",result())
If count > 0
    showdata("usertype:",result())
EndIf

Kukulkan · Post by **Kukulkan** » Fri Nov 18, 2011 12:15 pm

Hi Danilo,

thank you, but this needs special programming (showdata()). The users will be able to enter their own RegExp into my product - it is not possible to give them the possibility to do their own showdata() routine. I want to be flexible by giving them the option to enter their own RegExp as Filters for mandatory information in scanned (OCR) or extracted PDF documents.

It may be a very different use case than my example, and the customers are not able to program their own routines.

It is like I wrote before:

And this is the reason why there should be no PureBasic string handling (like processing each line, or using StringField() or some other string stuff). I need fully supported RegExp.

Your routine needs Trim(), Mid() and Len(). But this should be done using RegExp and not in code. It could be done, if PB would support all RegExp functionalities of PCRE. And this is what I complain about...

Best,

Kukulkan

luis · Post by **luis** » Mon Sep 03, 2012 11:45 pm

I was doing some more experiments with regexes to learn, so I edited the original post with some better matching examples and added some extraction examples.

luis · Post by **luis** » Tue Sep 04, 2012 12:09 am

Kukulkan wrote:
Code: Select all
userid: 1234
username: test
usertype: private
I like to search for the content after the string "username:".

Hi, maybe I didn't understand your requirement, but doesn't this work (positive lookbehind)?

Code: Select all

Define a$, r$
Dim result$(0)

a$ = "userid: 1234" + #CRLF$
a$ + "username: test" + #CRLF$
a$ + "usertype: private" + #CRLF$

r$="(?<=username:).*" ; extract the string preceded by "username:"

Debug ExtractRegExMatch (a$, result$(), r$) ; 1
Debug result$(0) ; " test" extracted (the blank after the colon is part of the match)

I used the proc in the first post for my convenience

Kukulkan · Post by **Kukulkan** » Tue Sep 04, 2012 8:08 am

Hi Luis,

thanks for the example. I had forgotten about this thread. We are currently guiding our customers toward "look-around assertions". We suggest they use something like

(?<=Username\:).+ or
(?<=Kunden-Nr\.:\s)[0-9]{5}[0-9]*
etc.

I can see now that this is exactly what you wrote. Thanks!!!!

The only problem is that we are not able to handle both of these (note the blank after the colon):
Username: Testuser
Username:Testuser

The RegExp matches either one or the other, because the lookbehind condition is not allowed to use * or + as quantifiers. We currently trim all results to give users a chance...

Kukulkan

Perkin · Post by **Perkin** » Tue Sep 04, 2012 12:34 pm

I haven't actually tested this, but it should work in a text editor's regex:

(?<=Username\: ?).+ or
(?<=Kunden-Nr\.:\s?)[0-9]{5}[0-9]*

Use a space or \s followed by a ?, which makes it optional.
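
One caveat for PureBasic: the PCRE engine it uses requires every lookbehind alternative to have a fixed length, so an optional blank inside (?<=...) is normally rejected when the pattern is compiled. PCRE does allow several alternatives of different fixed lengths, though. A minimal sketch of that workaround, with made-up sample text and variable names:

```purebasic
Define text$, rx, count, i
Dim found$(0)

text$ = "Username: Testuser" + #LF$ + "Username:Testuser" + #LF$

; Two fixed-length alternatives inside the lookbehind; \S makes the match start at the first non-blank char
rx = CreateRegularExpression(#PB_Any, "(?<=Username: |Username:)\S.*")
If rx
  count = ExtractRegularExpression(rx, text$, found$())
  For i = 0 To count - 1
    Debug found$(i) ; "Testuser" for both lines
  Next
  FreeRegularExpression(rx)
Else
  Debug RegularExpressionError() ; shows the reason if the pattern is rejected
EndIf
```

The design choice here is to keep everything inside the regex, so users who enter their own patterns do not need any Trim() call afterwards.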
