RegularExpression functions: Add start pos and length param

Post by Sicro »

It would be very good if the regular expression functions had a parameter for the start position.
Currently, I have to preprocess the string with Mid() before I can pass the string to the regular expression functions, which slows down the code considerably.

There should also be a length parameter so that the function does not have to calculate the length of the string every time it is called.
Even better would be this: the string length should be stored for string variables.

I hope that the functions receive the string by reference and not by value ...
Last edited by Sicro on Sat Jul 18, 2020 1:29 pm, edited 2 times in total.

Post by GedB »

Sicro wrote:It would be very good if the regular expression functions had a parameter for the start position.
Currently, I have to preprocess the string with Mid() before I can pass the string to the regular expression functions, which slows down the code considerably.

I hope that the functions receive the string by reference and not by value ...
Have you tried putting .{n} at the beginning of your regular expression? This will ignore the first n characters in your string.



Post by Sicro »

Thank you for your suggestion.

However, the first n characters are then not ignored; they are included in the match result.
This can be worked around with RegularExpressionNamedGroup():

Code: Select all

Define string$                    = "Hello Bob"
Define regEx$                     = ".*"
Define numberOfCharactersToIgnore = 6

If CreateRegularExpression(0, "(.{" + Str(numberOfCharactersToIgnore) + "})(?<root_group>" + regEx$ + ")")
  If ExamineRegularExpression(0, string$) And NextRegularExpressionMatch(0)
    Debug RegularExpressionNamedGroup(0, "root_group")
  EndIf
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
Maybe you meant it that way.
Unfortunately, my Lexer doesn't run faster with this variant. Apparently, the regular expression functions copy the string passed as a parameter on every call instead of receiving it by reference. That slows it down enormously.

My current variant with Mid() is therefore still the best solution, because I can also limit the string length, so that the regular expression function has less to copy each time and the whole process runs faster.
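
For illustration, the Mid() variant described above might look roughly like the following sketch. The pattern and the values of startPos and maxLength are made-up assumptions, not taken from the actual lexer. Because the regular expression only sees the slice, the reported match position is shifted back so that it refers to the original string.

```purebasic
; Sketch only: pattern$, startPos and maxLength are illustrative assumptions.
Define string$   = "Hello Bob"
Define pattern$  = "[A-Za-z]+"
Define startPos  = 7   ; 1-based position at which matching should begin
Define maxLength = 3   ; limits how much of the string is copied

If CreateRegularExpression(0, pattern$)
  ; Mid() copies the wanted slice; only that slice is handed to the regex functions
  If ExamineRegularExpression(0, Mid(string$, startPos, maxLength)) And NextRegularExpressionMatch(0)
    Debug RegularExpressionMatchString(0)
    ; Match positions are relative to the slice, so shift them back
    Debug RegularExpressionMatchPosition(0) + startPos - 1
  EndIf
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
```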

Post by GedB »

Thanks for the detail. That was how I meant it and I was curious about the performance.



Post by #NULL »

Maybe try PeekS() instead of Mid(); it seems to be much faster, if my test is correct. I guess Mid() copies the source string first and adds some calls to strlen(), whereas PeekS() just reads and copies the substring.

Code: Select all

If #PB_Compiler_Debugger
  MessageRequester("", "disable the debugger!")
  End
EndIf
OpenConsole()

t1 = 0
t2 = 0

alternatingCases = 10
maxPerCase = 100000

s1.s = Space(10000)
s2.s = ""

;PokeS(@ s1 + 4999 * SizeOf(Character), "-0123456789-")
;alternatingCases = 1
;maxPerCase = 1;50000

For n=1 To alternatingCases
  
  ; ----------------------------------
  
  s2 = ""
  t = ElapsedMilliseconds()
  For i=0 To maxPerCase
    s2 = Mid(s1, 5000 + 1, 10)
    ;PrintN(s2)
  Next
  t = ElapsedMilliseconds()-t
  PrintN("case 1: " + t)
  t1 + t
  
  ; ----------------------------------
  
  s2 = ""
  t = ElapsedMilliseconds()
  For i=0 To maxPerCase
    s2 = PeekS(@ s1 + 5000 * SizeOf(Character), 10)
    ;PrintN(s2)
  Next
  t = ElapsedMilliseconds()-t
  PrintN("case 2: " + t)
  t2 + t
  
  ; ----------------------------------
  
Next

PrintN("")
PrintN("case 1 total: " + t1)
PrintN("case 2 total: " + t2)
Input()

Code: Select all

case 1 total: 3777
case 2 total: 49

Post by Little John »

#NULL wrote:Maybe try PeekS() instead of Mid(), seems to be much faster if my test is correct.
Interesting. Thanks for the idea and the test!

Code: Select all

Macro FastMid (_string_, _start_, _length_=-1)
   PeekS(@ _string_ + (_start_-1)*SizeOf(Character), _length_)
EndMacro

Post by RSBasic »

Little John wrote:
#NULL wrote:Maybe try PeekS() instead of Mid(), seems to be much faster if my test is correct.
Interesting. Thanks for the idea and the test!

Code: Select all

Macro FastMid (_string_, _start_, _length_=-1)
   PeekS(@ _string_ + (_start_-1)*SizeOf(Character), _length_)
EndMacro
Nice tip!

Post by Sicro »

Very cool, but also shocking at the same time. With PeekS() my Lexer is 80% faster than with Mid(). Mid() should definitely be optimized.

Thanks #NULL for the idea to use PeekS() instead of Mid(), and thanks to Little John for the macro version.
I improved the macro slightly by wrapping the _start_ parameter in parentheses. This ensures that the arithmetic is evaluated in the correct order even when an expression is passed as the argument:

Code: Select all

Macro FastMid(_string_, _start_, _length_=-1)
  PeekS(@_string_ + ((_start_) - 1) * SizeOf(Character), _length_)
EndMacro
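
A small usage sketch, assuming a Unicode build and made-up values for text$ and offset, shows the macro next to Mid() with an expression as the start argument. Note that the macro does no bounds checking, as far as I can tell: Mid() copes with a start position beyond the end of the string, while PeekS() simply reads whatever memory is at that address, so the caller has to make sure the start position lies within the string.

```purebasic
Macro FastMid(_string_, _start_, _length_=-1)
  PeekS(@_string_ + ((_start_) - 1) * SizeOf(Character), _length_)
EndMacro

; Illustrative values only
Define text$  = "Hello Bob"
Define offset = 4

; Both calls start at character 7 and copy 3 characters
Debug Mid(text$, offset + 3, 3)
Debug FastMid(text$, offset + 3, 3)

; With the default length of -1, PeekS() reads up to the terminating null character
Debug FastMid(text$, offset + 3)
```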

Post by BarryG »

Hi, as far as I know, SizeOf(Character) is evaluated at runtime. So replacing it with a constant should make the macro a tad faster again.
Last edited by BarryG on Sat Apr 20, 2019 7:06 am, edited 1 time in total.

Post by Little John »

BarryG wrote:Hi, as far as I know, SizeOf(Character) is evaluated at runtime.
That assumption is wrong.

Try e.g.

Code: Select all

#Length = Len("abc")
Len() is in fact evaluated at runtime, and that's why this code does not work.

The value of a constant is assigned at compile time. So from the fact that your above code works, we can draw the conclusion that SizeOf() is evaluated at compile time, too. That's probably the reason why in the help it is mentioned in the section "Compiler Functions". :-)
Your code works, but there is no advantage in it.
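
As a counterpart to the Len() example, a minimal sketch of the working case could look like this. The constant name #CharSize is just an assumption for illustration.

```purebasic
; SizeOf() is accepted in a constant definition, so it must be resolved at compile time
#CharSize = SizeOf(Character)

; Prints the size of one character in bytes for the current compile mode
Debug #CharSize
```

Since the compiler already resolves SizeOf(Character) to a fixed value, using the constant inside FastMid() should produce the same result as the original macro.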