RegEx engine module that builds NFA/DFA and uses it for matching
Posted: Mon Sep 12, 2022 11:13 am
RegEx engine module for PureBasic
I've been working on a "regex engine" module for PureBasic for a while. The engine compiles a regular expression into an NFA (Nondeterministic finite automaton), optionally also into a very fast DFA (Deterministic finite automaton), and can run the NFA/DFA against a string.
When matching, the engine always chooses the longest match among several possible matches. No backtracking is required during this process, since all alternatives are checked simultaneously. So for a regular expression "0123x|0123y" and the string "0123y", after the failure at "x", the engine does not start again from the beginning at "0", but is behind "3" and the engine only has to read "y". In contrast to an NFA, a DFA does not try anything out, i.e. the DFA reads “y” directly.
It is also possible to pass multiple regular expressions to the engine and set unique ID numbers for them to be able to determine which regular expression matched in case of a match. This functionality makes it easy to create lexers. This is also the real motivation that led to this project. However, the engine is kept flexible in its usage and can easily be used for other purposes as well, which is why I don't consider it exclusively a lexer engine and therefore didn't name it that way.
You can find some code examples, the listing of the supported syntax for the regular expressions, further information and the module itself on GitHub.
Currently, the project is in beta phase, so I would be very grateful for feedback from you.
Features that are currently planned for upcoming versions: Click
These features will not be included in the current beta phase of version 1.0.0, but only in the next minor versions, e.g. 1.1.0, 1.2.0 etc.
Note that this project does not intend to provide a RegEx engine to replace more feature-rich engines like the PCRE natively integrated in PureBasic. So there will be no support for capturing groups, backreferences, word boundaries and anchors.
I've been working on a "regex engine" module for PureBasic for a while. The engine compiles a regular expression into an NFA (Nondeterministic finite automaton), optionally also into a very fast DFA (Deterministic finite automaton), and can run the NFA/DFA against a string.
When matching, the engine always chooses the longest match among several possible matches. No backtracking is required during this process, since all alternatives are checked simultaneously. So for a regular expression "0123x|0123y" and the string "0123y", after the failure at "x", the engine does not start again from the beginning at "0", but is behind "3" and the engine only has to read "y". In contrast to an NFA, a DFA does not try anything out, i.e. the DFA reads “y” directly.
It is also possible to pass multiple regular expressions to the engine and set unique ID numbers for them to be able to determine which regular expression matched in case of a match. This functionality makes it easy to create lexers. This is also the real motivation that led to this project. However, the engine is kept flexible in its usage and can easily be used for other purposes as well, which is why I don't consider it exclusively a lexer engine and therefore didn't name it that way.
You can find some code examples, the listing of the supported syntax for the regular expressions, further information and the module itself on GitHub.
Currently, the project is in beta phase, so I would be very grateful for feedback from you.
Features that are currently planned for upcoming versions: Click
These features will not be included in the current beta phase of version 1.0.0, but only in the next minor versions, e.g. 1.1.0, 1.2.0 etc.
Note that this project does not intend to provide a RegEx engine to replace more feature-rich engines like the PCRE natively integrated in PureBasic. So there will be no support for capturing groups, backreferences, word boundaries and anchors.