Slow regexp replacement on long strings? Alternatives?
Posted: Wed Apr 08, 2020 4:07 pm
by Kukulkan
Hi,
I'm using PB regex to replace sensitive information in logfile strings (up to 5 MB long). These can be any kind of value, as the system handles a lot of credentials. For this I use a regexp like this:
(search1|search2|search3|...|searchN)
to replace searchN with "---".
The problem is, that there are up to 500 "searchN" words in the regex to clean out and replace by "---".
This takes around 12 seconds on a string of only 217 KB (Windows, Xeon 2.6 GHz).
I don't insist on regex and am open to any other solution. Any ideas for a faster replacement of that many words in strings?
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Wed Apr 08, 2020 4:33 pm
by Marc56us
Kukulkan wrote:
(search1|search2|search3|...|searchN)
to replace searchN with "---".
If they are fixed strings, ReplaceString() will be much faster.
(No point in using a Regex if you can't use a mask.)
If the log file is a CSV, it is also possible to load it into an SQL database.
(Disable auto-commit to load faster)
The problem is, that there are up to 500 "searchN" words in the regex to clean out and replace by "---".
No matter how you do it, searching for 500 different words and replacing them will be slow.
Considering that log files often have a constant number of fields, it is simpler and faster to remove the fields to be hidden using StringField().
Sometimes, to quickly create a test data log, we just replace part (the beginning or the end) of the fields to hide, whatever their content; that goes very fast.

Re: Slow regexp replacement on long strings? Alternatives?
Posted: Wed Apr 08, 2020 6:01 pm
by skywalk
Regex is 600x slower than PB MemoryString code.
So 12sec would drop to ~20msec.
I never use regex for this reason.
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Wed Apr 08, 2020 10:07 pm
by idle
If you're looking for exact or prefix matches, you could try the FindStrings example in Squint. The example isn't particularly optimised, but you could easily try it and overwrite the data in the source string as below. It should be faster than ReplaceString() as well.
https://www.purebasic.fr/english/viewto ... 12&t=74786
Code:
Global String1.s = "373 ac3 b9d45 b iPdC ks23 al97 373 ac5 al99 346 vs42159ssbpx roro ask ePOC foo bar xyz 12dk tifer erer e"
Global String2.s = "346 373 iPdC roro ePOC ac3" ;<-strings you're interested in finding
Global Replace.s = "-----------------------------------------------------------"
Global FindStringsItems.FindStrings
Global *squint.squint = Squint_New()
FindStrings(*squint,@String1,@String2,@FindStringsItems) ;builds trie and returns the count
ForEach FindStringsItems\item()
  Debug FindStringsItems\item()\key + " " + Str(FindStringsItems\item()\count)
  ForEach FindStringsItems\item()\positions()
    CopyMemory(@Replace,@string1+FindStringsItems\item()\positions(),FindStringsItems\item()\len*SizeOf(Character))
  Next
Next
results in
--- --- b9d45 b ---- ks23 al97 --- ac5 al99 --- vs42159ssbpx ---- ask ---- foo bar xyz 12dk tifer erer e
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Thu Apr 09, 2020 7:09 am
by Kukulkan
Based on your answers, I'm considering using ReplaceString() with #PB_String_InPlace. I only need to replace with a placeholder string of the same byte length. I will give it a try.
Thanks all of you!
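
As an illustration of the same-length, in-place idea described above, here is a minimal sketch in C (the language the thread later settles on). The function name mask_word_in_place and the fill character are assumptions made for this example; it is not PureBasic's ReplaceString() and not anyone's actual code from this thread. Because the placeholder has exactly the length of the match, the buffer never grows or shrinks and nothing has to be shifted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: overwrite every occurrence of `word` in `text`
   with `fill` characters. The text length never changes, so there is
   no reallocation and no shifting of the tail of the buffer.
   Returns the number of occurrences that were masked. */
size_t mask_word_in_place(char *text, const char *word, char fill)
{
    assert(text != NULL && word != NULL);

    size_t len = strlen(word);
    size_t count = 0;
    if (len == 0)
        return 0;               /* an empty keyword would match everywhere */

    char *hit = text;
    while ((hit = strstr(hit, word)) != NULL) {
        memset(hit, fill, len); /* same-length placeholder */
        hit += len;             /* continue after the masked span */
        count++;
    }
    return count;
}
```

Calling this once per keyword still scans the whole text once per keyword, so with 500 keywords the cost grows with the keyword count; it only removes the cost of rebuilding the string.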

Re: Slow regexp replacement on long strings? Alternatives?
Posted: Thu Apr 09, 2020 8:26 am
by Josh
Are the searched sequences in the string always whole words, which are delimited by spaces, dots, commas or similar?
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Thu Apr 09, 2020 8:30 am
by Kukulkan
Josh wrote: Are the searched sequences in the string always whole words, which are delimited by spaces, dots, commas or similar?
Hi Josh. It's mostly passwords or hex sequences (keys). Some regular, some in quotes and some in square brackets.
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Sat Apr 11, 2020 2:57 am
by mchael
Have a look at xombie's post in this thread:
viewtopic.php?t=26689
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Wed Apr 15, 2020 2:48 pm
by Marc56us
Just in case you didn't find a viable solution, I made a test with PB's internal functions: lists and ReplaceString() with #PB_String_InPlace.
To use #PB_String_InPlace, I make the replacement the same length as the keyword with RSet():
Code:
ReplaceString(Txt$, All_KeyWords$(), RSet("", Len(All_KeyWords$()), "X"), #PB_String_InPlace)
I load all the keywords into a list, then loop over it (ForEach), running each keyword against the whole file, which was previously loaded into a single variable.
Log test file: 6.8 MB (35,526 lines)
Search: 500 keywords (all different, so no regex)
Result: 264,000 keywords replaced
Time: 27 sec (14 without debug output)
Computer: i7-8700 @ 3.2 GHz, file on SSD drive
And again, it's not very optimized; I think we can do better with Peek and Poke.
Re: Slow regexp replacement on long strings? Alternatives?
Posted: Wed Apr 15, 2020 3:02 pm
by Kukulkan
@Marc56us: Thanks for the tests. I also found replacing faster than the RegEx, but not fast enough.
We are now trying a B-Tree implementation for the keywords, so that only one pass through the initial logfile content is needed. We are doing it in C, as we will need it in other places, too. No results yet, though, as it is low priority...
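
For readers who want to see what a single-pass approach could look like, here is a rough sketch in C. Note that it uses a simple trie (prefix tree) rather than the B-Tree mentioned above, and that every name in it (TrieNode, trie_new, trie_add, trie_mask, trie_free) is made up for this example; it is not the implementation referred to in the post. All keywords are stored in the trie once, and the text is then walked a single time, masking the longest keyword that starts at each position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One node per byte of a keyword; 256 child slots keep lookups trivial. */
typedef struct TrieNode {
    struct TrieNode *child[256];
    int is_end;                       /* non-zero if a keyword ends here */
} TrieNode;

TrieNode *trie_new(void)
{
    return calloc(1, sizeof(TrieNode));
}

/* Insert one keyword. Returns 0 on success, -1 if allocation fails. */
int trie_add(TrieNode *root, const char *word)
{
    assert(root != NULL && word != NULL);
    TrieNode *n = root;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        if (n->child[*p] == NULL) {
            n->child[*p] = trie_new();
            if (n->child[*p] == NULL)
                return -1;
        }
        n = n->child[*p];
    }
    n->is_end = 1;
    return 0;
}

void trie_free(TrieNode *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < 256; i++)
        trie_free(n->child[i]);
    free(n);
}

/* Walk the text once. At each position, follow the trie as far as the
   text allows and remember the longest keyword seen; overwrite it with
   `fill` and skip past it. Returns the number of masked matches. */
size_t trie_mask(const TrieNode *root, char *text, char fill)
{
    assert(root != NULL && text != NULL);
    size_t count = 0;
    size_t i = 0;
    while (text[i] != '\0') {
        const TrieNode *n = root;
        size_t best = 0;
        for (size_t j = i; text[j] != '\0'; j++) {
            n = n->child[(unsigned char)text[j]];
            if (n == NULL)
                break;
            if (n->is_end)
                best = j - i + 1;
        }
        if (best > 0) {
            memset(text + i, fill, best);
            i += best;
            count++;
        } else {
            i++;
        }
    }
    return count;
}
```

With this layout the work per text position is bounded by the longest keyword rather than by the number of keywords. The price is memory: each node holds 256 pointers, so a more compact child representation would be worth considering for large keyword sets.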