Split a string by a Regular Expression

Posted: Thu Aug 31, 2017 11:33 pm
by Little John
Hi,

there are already some procedures here on the forum for splitting strings, but I'm not aware of a general-purpose procedure that splits a string by a Regular Expression.
Although PureBasic has the built-in function ExtractRegularExpression(), it retrieves only the parts of the source string that match the Regular Expression; the parts that don't match are not returned. Sometimes this is exactly what is needed.
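
As an illustration of that behaviour, here is a minimal sketch (not part of the original post; the sample string and the array name match$() are chosen only for this example). It should show that ExtractRegularExpression() returns the matching tags but not the text between them.

```purebasic
; Sketch: ExtractRegularExpression() returns only the matching parts.
; The sample string and the names below are illustrative assumptions.
EnableExplicit

Define rex.i, n.i, i.i
Dim match$(0)

rex = CreateRegularExpression(#PB_Any, "<[^>]*>")
If rex = 0
   Debug "Error: " + RegularExpressionError()
   End
EndIf

; n receives the number of matches stored in match$()
n = ExtractRegularExpression(rex, "I like <b>bold</b> fonts.", match$())
For i = 0 To n - 1
   Debug match$(i)      ; only the tags <b> and </b>; "I like ", "bold", " fonts." are not returned
Next

FreeRegularExpression(rex)
```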

However, we often want all parts of the source string in the result. The following function splits a string into parts that either match or don't match the given Regular Expression. In other words, the generated list of parts covers the whole string: concatenating the parts in order yields the source string again.

Code:

; -- Split a string by a Regular Expression
; tested with PB 5.60
; by Little John, <http://www.purebasic.fr/english/viewtopic.php?f=12&t=69071>

CompilerIf #PB_Compiler_IsMainFile
   EnableExplicit
CompilerEndIf


Structure SplitString
   content$
   match.i
EndStructure


Procedure.i SplitByRegEx (regEx.i, source$, List part.SplitString())
   ; -- split a string into parts that either match or don't match a Regular Expression
   ; in : regEx  : number of a Regular Expression generated by CreateRegularExpression()
   ;      source$: string to be split into parts
   ; out: part()      : resulting list of parts
   ;      return value: number of elements in part():
   ;                    0 if source$ = "", > 0 otherwise;
   ;                   -1 on error
   Protected.i left, right
   
   If ExamineRegularExpression(regEx, source$) = 0
      ProcedureReturn -1              ; error
   EndIf
   
   ClearList(part())
   
   left = 1
   While NextRegularExpressionMatch(regEx)
      right = RegularExpressionMatchPosition(regEx)
      If left < right
         AddElement(part())
         part()\content$ = Mid(source$, left, right-left)
         part()\match = #False
      EndIf   
      AddElement(part())
      part()\content$ = RegularExpressionMatchString(regEx)
      part()\match = #True
      left = right + RegularExpressionMatchLength(regEx)
   Wend 
   
   If left <= Len(source$)
      AddElement(part())
      part()\content$ = Mid(source$, left)
      part()\match = #False
   EndIf   
   
   ProcedureReturn ListSize(part())   ; success
EndProcedure


CompilerIf #PB_Compiler_IsMainFile
   ; -- Demo
   
   Procedure SplitDemo (rex.i, source$)
      Protected n.i
      Protected NewList part.SplitString()
      
      n = SplitByRegEx(rex, source$, part())
      If n = -1
         Debug "Error with ExamineRegularExpression()"
         End
      EndIf
      
      Debug "'" + source$ + "' split into " + n + " part(s):"
      ForEach part()
         If part()\match = #False
            Debug " [-] " + part()\content$
         Else
            Debug " [+] " + part()\content$
         EndIf   
      Next
      Debug ""
   EndProcedure
   
   
   Define regEx$, rex.i, source$
   
   regEx$ = "<[^>]*>"
   
   rex = CreateRegularExpression(#PB_Any, regEx$)
   If rex = 0
      Debug "Error: " + RegularExpressionError()
      End
   EndIf
   
   source$ = "I like <b>bold</b> and <i>italic</i> fonts."
   SplitDemo(rex, source$)
   
   source$ = "<b></b> and <i></i>"
   SplitDemo(rex, source$)
   
   source$ = "plain text"
   SplitDemo(rex, source$)
CompilerEndIf
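
Since the list of parts is meant to contain the whole string, one way to sanity-check a result is to concatenate the parts and compare them with the source. The following is only a sketch under assumptions: it presumes the code above has been saved as "SplitByRegEx.pbi" (a hypothetical file name), and JoinParts() is a helper invented for this example, not part of the original code.

```purebasic
; Sketch: joining all parts should reproduce the source string.
; Assumption: the code above is saved as "SplitByRegEx.pbi" (hypothetical name).
; When included, its demo section is skipped because #PB_Compiler_IsMainFile is #False there.
EnableExplicit

XIncludeFile "SplitByRegEx.pbi"

; Hypothetical helper: concatenate the content of all parts in list order
Procedure.s JoinParts (List part.SplitString())
   Protected result$
   
   ForEach part()
      result$ + part()\content$
   Next
   
   ProcedureReturn result$
EndProcedure

Define rex.i, source$
NewList part.SplitString()

rex = CreateRegularExpression(#PB_Any, "<[^>]*>")
If rex = 0
   Debug "Error: " + RegularExpressionError()
   End
EndIf

source$ = "I like <b>bold</b> and <i>italic</i> fonts."
If SplitByRegEx(rex, source$, part()) > 0
   If JoinParts(part()) = source$
      Debug "OK: the parts add up to the source string"
   Else
      Debug "Mismatch: the parts do not add up to the source string"
   EndIf
EndIf

FreeRegularExpression(rex)
```

The same loop could be adapted to process only the parts where match = #False (for example the text outside the tags) and leave the matching parts untouched.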

Re: Split a string by a Regular Expression

Posted: Fri Sep 01, 2017 7:44 am
by RSBasic
Very useful, thank you.

Re: Split a string by a Regular Expression

Posted: Fri Sep 01, 2017 2:55 pm
by Little John
Hi RSBasic,

you are welcome.
I consider this an important RegEx function, and I was surprised to find that it isn't built in. :-)

Re: Split a string by a Regular Expression

Posted: Fri Sep 01, 2017 9:00 pm
by Kwai chang caine
Yes, very useful.
Now I just have to learn complex regexes :oops:
Thanks for sharing 8)