Speed up processing a certain string
Posted: Mon Aug 02, 2021 11:58 pm
by jacdelad
Hello,
I have a specific problem and hope I can explain it. I am processing a lot of CAD files. Each line in these files usually contains 8 values separated by one or more spaces or tabs (or a mix of both). Leading and trailing spaces/tabs are also possible. Lines that start with a "*" are to be ignored, and the "*" may itself be preceded by spaces/tabs.
I want a clean version of the string so I can extract the values. My current code is this:
Code: Select all
;Just for the Example
Define temp.s,start
temp=" 1 2 3 4 entry 6 dummy 8 "
;String processing
ReplaceString(temp,#TAB$," ",#PB_String_InPlace) ;tab -> space (same length, so in-place works)
temp=Trim(temp)
If Len(temp)<>0 And Left(temp,1)<>"*"
  start=FindString(temp,"  ") ;first occurrence of two consecutive spaces
  While start
    temp=ReplaceString(temp,"  "," ",#PB_String_CaseSensitive,start)
    start=FindString(temp,"  ")
  Wend
  ;Process String further...
EndIf
The code does the following:
- Replace all tabs with space; inplace
- Trim the left and right spaces away
- Check if the string is empty or to be ignored
- If not: Find the first occurrence of double spaces
- While there are double spaces in the string -> replace them with single spaces
This gives me a cleaned up version which can be easily read via StringField().
Is there an approach that is faster? I am processing millions of entries.
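For reference, a minimal sketch of the StringField() step mentioned above, assuming the line has already been cleaned so that exactly one space separates the values. The variable names are illustrative only. As far as I know, every StringField() call searches from the start of the string again, which is one reason this step can become a bottleneck with millions of lines.

```purebasic
; Sketch only: read the values of an already-cleaned line with StringField().
; Assumes single spaces between values and no leading/trailing whitespace.
EnableExplicit

Define line.s = "1 2 3 4 entry 6 dummy 8"
Define count, i

count = CountString(line, " ") + 1   ; number of fields = number of separators + 1
For i = 1 To count                   ; StringField() indexes start at 1
  Debug StringField(line, i, " ")
Next
```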
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 12:32 am
by STARGÅTE
What about just using a regular expression?
What is the pattern of the entries? Just numbers and characters?
Code: Select all
Enumeration
  #RegularExpression
EndEnumeration
;Just for the Example
Define temp.s = " 1 2 3 4 entry 6 dummy 8 "
If CreateRegularExpression(#RegularExpression, "\w+")
  If ExamineRegularExpression(#RegularExpression, temp)
    While NextRegularExpressionMatch(#RegularExpression)
      Debug RegularExpressionMatchString(#RegularExpression)
    Wend
  EndIf
EndIf
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 1:42 am
by jacdelad
The pattern is roughly "word word number number number word number number"
e.g.: "D1 SOD523 285 705 0 DX.410052E 0 0"
Can I expect a regular expression to be faster, yet flexible enough to cover all my requirements?
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 5:14 am
by deathmx
Here is another way to do it, which seems to be a bit faster. Not sure if this is useful to you or not. It basically filters the whole string in a single pass.
If this looks OK but needs some changes, a bigger sample would be great, maybe 100 lines?
Here are my results. Code 1 is mine, code 2 is yours.
-------------------------------------------------------
running each code 100000time/s
code 1 results:|1 2 3 4 entry 6 dummy 8|
code 1 :107ms
code 2 results:|1 2 3 4 entry 6 dummy 8|
code 2 :501ms
press any key to exit
-------------------------------------------------------
Code: Select all
temp.s=" 1 2 3 4 entry 6 dummy 8 "
#repeats = 100000
OpenConsole()
DisableDebugger
PrintN("running each code " + Str(#repeats) + "time/s")
timer = ElapsedMilliseconds()
newstring$ = Space(1000) ; give newstring a simple buffer
For i = 0 To #repeats
  temp=" 1 2 3 4 entry 6 dummy 8 "
  *chr.unicode = @temp
  *newstring.UNICODE = @newstring$
  If *chr\u = 0 Or *chr\u = '*'
  Else
    While *chr\u = ' ' Or *chr\u = #TAB ; searches for spaces in front and ignores them
      *chr + 2
    Wend
    space = 0; required to make sure a space is not remembered from the previous string
    While *chr\u <> 0 ;goes through string until end of string
      If *chr\u = ' ' Or *chr\u = #TAB ; if space adds 1
        space +1
      Else
        If space ; if there was a space or more before this character was found then add a single space
          ;newstring$ + " " + Chr(*chr\u) ;prevents space at the end of string
          *newstring\u = ' ':*newstring + 2
          *newstring\u = *chr\u:*newstring + 2
          space = 0
        Else
          *newstring\u = *chr\u:*newstring + 2
          ;newstring$ + Chr(*chr\u)
        EndIf
      EndIf
      *chr + 2 ; goes to next position in string
    Wend
    *newstring\u = 0
  EndIf
Next
code1 = ElapsedMilliseconds() - timer
PrintN("code 1 results:|"+newstring$+"|")
PrintN("code 1 :" +Str(code1)+"ms")
timer = ElapsedMilliseconds()
For i = 0 To #repeats
  temp=" 1 2 3 4 entry 6 dummy 8 "
  ;String processing
  ReplaceString(temp,#TAB$," ",#PB_String_InPlace)
  temp=Trim(temp)
  If Len(temp)<>0 And Left(temp,1)<>"*"
    start=FindString(temp,"  ") ; look for two consecutive spaces
    While start
      temp=ReplaceString(temp,"  "," ",#PB_String_CaseSensitive,start)
      start=FindString(temp,"  ")
    Wend
    ;Process String further...
  EndIf
Next
code2 = ElapsedMilliseconds() - timer
PrintN("code 2 results:|"+temp+"|")
PrintN("code 2 :" +Str(code2)+"ms")
PrintN("")
PrintN("press any key to exit")
EnableDebugger
Input()
End
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 5:27 am
by deathmx
If it's very slow and you're using ReadString on the files, it may be possible to speed it up by using ReadData and reading from memory directly.
You can also speed it up a little more by using PureBasic 6 (C backend) with optimizations enabled in the compiler options.
By the way, my computer is running at 0.79 GHz at the moment, so my results will probably be slower than yours.
C backend with optimizations, results:
running each code 100000time/s
code 1 results:|1 2 3 4 entry 6 dummy 8|
code 1 :77ms
code 2 results:|1 2 3 4 entry 6 dummy 8|
code 2 :523ms
press any key to exit
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 5:56 am
by wilbert
It's best to get the fields at the same time and not use StringField.
If you know for sure that the fields never exceed a certain length, for example 20 characters, it can be made a lot faster.
The code below could also be made a bit faster by using ASM if desired.
Code: Select all
DisableDebugger
#MaxFields = 10
Global Dim Fields.s(#MaxFields - 1)
Procedure.i GetFields(*String.Character)
  Protected *Start, Count
  If *String
    While Count < #MaxFields
      While *String\c <= 32
        If *String\c = 0
          Break 2
        EndIf
        *String + SizeOf(Character)
      Wend
      *Start = *String
      While *String\c > 32
        *String + SizeOf(Character)
      Wend
      Fields(Count) = PeekS(*Start, (*String - *Start) >> #PB_Compiler_Unicode)
      Count + 1
    Wend
  EndIf
  ProcedureReturn Count
EndProcedure
EnableDebugger
Define temp.s
temp=" 1 2 3 4 entry 6 dummy 8 "
field_count = GetFields(@temp)
field = 0
While field < field_count
  Debug Fields(field)
  field + 1
Wend
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 7:57 am
by infratec
If you want speed, I need more information:
1. Load the entire file into memory.
Is it ASCII, Unicode or UTF-8?
2. Process the whole buffer in one go.
The result is a list.
As far as I remember, I already wrote something like this ...
viewtopic.php?f=13&t=76303&p=562080
But as mentioned: don't use a string directly if your files are ASCII.
In that case it is better to 'scan' the buffer directly.
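A minimal sketch of what scanning the buffer directly could look like, assuming an ASCII file. It is not the code from the linked topic. The file name "test.cad" is a placeholder, and the loop only counts the data lines (non-empty lines whose first non-blank character is not "*") to show the pointer walk; real field extraction would go in the same loop.

```purebasic
; Sketch only: load an ASCII file with ReadData() and walk the buffer with a pointer.
; "test.cad" is a placeholder file name.
EnableExplicit

Define *buffer, *p.Ascii, *end
Define size.q, lines, atLineStart = #True

If ReadFile(0, "test.cad")
  size = Lof(0)
  If size > 0
    *buffer = AllocateMemory(size)
    If *buffer
      size = ReadData(0, *buffer, size)   ; number of bytes actually read
      *p = *buffer
      *end = *buffer + size
      While *p < *end
        Select *p\a
          Case 10, 13                     ; LF or CR: a new line begins
            atLineStart = #True
          Case ' ', 9                     ; space or tab: nothing to do
          Default
            If atLineStart                ; first non-blank character of the line
              atLineStart = #False
              If *p\a <> '*' : lines + 1 : EndIf
            EndIf
        EndSelect
        *p + 1                            ; ASCII: one byte per character
      Wend
      FreeMemory(*buffer)
    EndIf
  EndIf
  CloseFile(0)
  Debug "Data lines: " + Str(lines)
EndIf
```

No PB string is created per line here, which is the point of the approach: strings are only built (for example with PeekS() and #PB_Ascii) for the fields that are actually needed.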
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 10:12 am
by STARGÅTE
jacdelad wrote: Tue Aug 03, 2021 1:42 am
The pattern is roughly "word word number number number word number number"
e.g.: "D1 SOD523 285 705 0 DX.410052E 0 0"
Can I expect a regular expression to be faster, yet flexible enough to cover all my requirements?
Yes, of course. Regular expressions are designed for this kind of parsing. You can express every case that is valid content and exclude comments.
Can you post a complete example file that needs to be parsed?
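As a rough sketch of that idea, assuming every valid line has exactly eight whitespace-separated fields and that a comment is any line whose first non-blank character is "*": the pattern below only matches valid lines and captures each field as a group. The pattern and the constant name #LineRegex are my own assumptions, not something taken from a real file.

```purebasic
; Sketch only: one pattern that accepts a data line, rejects comment lines
; and captures the eight fields as groups.
EnableExplicit

Enumeration
  #LineRegex
EndEnumeration

Define line.s = "  D1 SOD523 285  705 0 DX.410052E 0 0 "
Define i

; [^\s*] = first character of the first field must be neither whitespace nor "*"
If CreateRegularExpression(#LineRegex, "^\s*([^\s*]\S*)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")
  If ExamineRegularExpression(#LineRegex, line) And NextRegularExpressionMatch(#LineRegex)
    For i = 1 To CountRegularExpressionGroups(#LineRegex)
      Debug RegularExpressionGroup(#LineRegex, i)   ; group indexes start at 1
    Next
  EndIf
  FreeRegularExpression(#LineRegex)
EndIf
```

Lines with fewer or more than eight fields simply do not match, so they would need their own handling if they occur.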
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 11:57 am
by jacdelad
Thanks a lot for all the approaches. A little more information, as requested: the files are ASCII. I tried ReadData, but it didn't speed things up (at least not for me). I cannot predict how long a single field is. The C backend is not an option right now (I am using pointers to maps, which is not fixed yet).
Right now, the best approach seems to be RegEx or using a buffer and pointers. I'll try this.
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 1:58 pm
by Marc56us
I am processing millions of entries.
With a file of this size, reading from disk will take much longer than parsing the lines anyway, so there is no need to look for algorithm optimization.
I did a test on a file with 18,000,000 lines (2.8 GB).
Machine: i7-8700 @ 3.20GHz, SSD
Regular expression:
- PB: 9 seconds
- FindStr: 6 seconds
- Grep (WSL2): 4 seconds

Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 3:40 pm
by jacdelad
It's also literally tens of thousands of files, each between 1 kB and maybe 1 MB.
I know I can't speed up reading the files, besides maybe using ReadData, but speeding up the processing part could still help.
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 4:02 pm
by fabulouspaul
jacdelad wrote: Tue Aug 03, 2021 1:42 am
The pattern is roughly "word word number number number word number number"
e.g.: "D1 SOD523 285 705 0 DX.410052E 0 0"
Can I expect a regular expression to be faster, yet flexible enough to cover all my requirements?
I took STARGÅTE's example and just let the regular expression replace your blanks:
Code: Select all
EnableExplicit
Enumeration
  #RegularExpression
EndEnumeration
;Just for the Example
Define temp.s = " 1 2 3 4 entry 6 dummy 8 "
Define timer
timer = ElapsedMilliseconds()
If CreateRegularExpression(#RegularExpression, "\s+")
  If ExamineRegularExpression(#RegularExpression, temp)
    temp = Trim(ReplaceRegularExpression(#RegularExpression, temp, " "))
    If Left(temp, 1) <> "*" And Len(temp) > 0
      Debug temp
    EndIf
  EndIf
EndIf
timer = ElapsedMilliseconds() - timer
Debug "Time: " + Str(timer) + " ms"
Maybe this works for you.
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 4:35 pm
by wilbert
jacdelad wrote: Tue Aug 03, 2021 11:57 am
I tried ReadData, but it didn't speed up (at least not for me)
ReadData is great when you are working directly on the loaded content with memory pointers.
PB strings are not that fast when you have to process millions of strings.
jacdelad wrote: Tue Aug 03, 2021 11:57 am
Right now, the best approach seems like RegEx or using a buffer and pointers. I'll try this.
RegEx is easier to code, but using pointers was faster when I tried.
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 4:51 pm
by infratec
I'm also sure that a PB solution with pointers and a single pass is faster than regex.
Re: Speed up processing a certain string
Posted: Tue Aug 03, 2021 9:57 pm
by jacdelad
I used the code from @deathmx, modified it a bit and squeezed it into a macro. It took me a while to understand how it works (plus I had to change it to detect the * wherever it is), but it works.
Thanks for everyone's contribution. This already speeds up my program significantly, and I can use this method on another problem, which will bring even more speed.
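For anyone finding this later, a rough sketch of that kind of change. This is not the actual macro: CleanLine() is a hypothetical procedure name, and the logic is only my reading of "detect the * wherever it is", namely skipping the leading spaces/tabs first and checking for "*" afterwards.

```purebasic
; Sketch only: single-pass cleanup that also ignores comment lines
; whose "*" is preceded by spaces or tabs.
EnableExplicit

; Hypothetical helper: returns the cleaned line, or "" for blank and comment lines.
Procedure.s CleanLine(line.s)
  Protected *src.Character, *dst.Character
  Protected result.s, pendingSpace

  If line = "" : ProcedureReturn "" : EndIf

  result = Space(Len(line))            ; the output is never longer than the input
  *src = @line
  *dst = @result

  While *src\c = ' ' Or *src\c = #TAB  ; skip leading spaces and tabs
    *src + SizeOf(Character)
  Wend
  If *src\c = 0 Or *src\c = '*'        ; blank line or comment line
    ProcedureReturn ""
  EndIf

  While *src\c
    If *src\c = ' ' Or *src\c = #TAB
      pendingSpace = #True             ; remember the gap, write nothing yet
    Else
      If pendingSpace                  ; one space for the whole gap
        *dst\c = ' ' : *dst + SizeOf(Character)
        pendingSpace = #False
      EndIf
      *dst\c = *src\c : *dst + SizeOf(Character)
    EndIf
    *src + SizeOf(Character)
  Wend
  *dst\c = 0                           ; terminate; trailing whitespace was never written
  ProcedureReturn result
EndProcedure

Debug "|" + CleanLine("  1 2 " + #TAB$ + " 3  ") + "|"
Debug "|" + CleanLine(#TAB$ + " * a comment") + "|"
```

A procedure that returns a string copies it on return, so a macro or a reused output buffer, as in the benchmark code earlier in the thread, would presumably be faster in a tight loop.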