Speed up processing a certain string

jacdelad
Addict
Posts: 1431
Joined: Wed Feb 03, 2021 12:46 pm
Location: Planet Riesa

Post by jacdelad »

Hello,
I have a specific problem and hope I can explain it. I am processing a lot of CAD files. Each line in these files usually contains 8 values separated by one or more spaces or tabs (or a mix of both). A line may also have spaces/tabs at the beginning or end. Some lines must be ignored: they start with a "*", possibly preceded by spaces/tabs.
I want a cleaned-up version of the string so I can extract the values. My current code is this:

Code: Select all

;Just for the Example
Define temp.s,start
temp="  1		2    3 4 entry  6  dummy 8 "

;String processing
ReplaceString(temp,#TAB$," ",#PB_String_InPlace)
temp=Trim(temp)
If Len(temp)<>0 And Left(temp,1)<>"*"
  start=FindString(temp,"  ")
  While start
    temp=ReplaceString(temp,"  "," ",#PB_String_CaseSensitive,start)
    start=FindString(temp,"  ")
  Wend
  ;Process String further...
EndIf
The code does the following:
  • Replace all tabs with space; inplace
  • Trim the left and right spaces away
  • Check if the string is empty or to be ignored
  • If not: Find the first occurrence of double spaces
  • While there are double spaces in the string -> replace them with single spaces
This gives me a cleaned up version which can be easily read via StringField().
Is there an approach that is faster? I am processing millions of entries.
PureBasic 6.04/XProfan X4a/Embarcadero RAD Studio 11/Perl 5.2/Python 3.10
Windows 11/Ryzen 5800X/32GB RAM/Radeon 7770 OC/3TB SSD/11TB HDD
Synology DS1821+/36GB RAM/130TB
Synology DS920+/20GB RAM/54TB
Synology DS916+ii/8GB RAM/12TB
STARGÅTE
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Post by STARGÅTE »

What about just using a regular expression?
What is the pattern of the entries? Just numbers and characters?

Code: Select all

Enumeration
	#RegularExpression
EndEnumeration

;Just for the Example
Define temp.s = "  1		2    3 4 entry  6  dummy 8 "

If CreateRegularExpression(#RegularExpression, "\w+")
	If ExamineRegularExpression(#RegularExpression, temp)
		While NextRegularExpressionMatch(#RegularExpression)
			Debug RegularExpressionMatchString(#RegularExpression)
		Wend
	EndIf
EndIf
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more
Typeface - Sprite-based font include/module

Post by jacdelad »

The pattern is roughly "word word number number number word number number"
e.g.: "D1 SOD523 285 705 0 DX.410052E 0 0"

Can I expect a regular expression to be faster, yet flexible enough to cover all my requirements?
deathmx
User
Posts: 27
Joined: Mon Feb 26, 2018 3:14 am

Post by deathmx »

Here is another way to do it. It seems to be a bit faster. Not sure if this is useful to you or not. It basically just filters everything.
If you think this is OK and need some changes, it would be great if you could provide a bigger sample, maybe 100 lines? lol

Here are my results. Code 1 is mine, code 2 is yours.
-------------------------------------------------------
running each code 100000time/s
code 1 results:|1 2 3 4 entry 6 dummy 8|
code 1 :107ms
code 2 results:|1 2 3 4 entry 6 dummy 8|
code 2 :501ms

press any key to exit
-------------------------------------------------------

Code: Select all


temp.s="  1		2    3 4 entry  6  dummy 8 "

#repeats = 100000

OpenConsole()
DisableDebugger
PrintN("running each code " + Str(#repeats) + "time/s")

timer = ElapsedMilliseconds()
newstring$ = Space(1000) ; give newstring a simple buffer

For i = 0 To #repeats
  temp="  1		2    3 4 entry  6  dummy 8 "
  *chr.unicode = @temp
  *newstring.UNICODE = @newstring$
  If *chr\u = 0 Or *chr\u = '*'
    ; empty line, or line starting with "*": nothing to do
  Else
    
    While *chr\u = ' ' Or *chr\u = #TAB ; searches for spaces in front and ignores them
      *chr + 2
    Wend
    space = 0 ; required to make sure a space is not remembered from the previous string
    While *chr\u <> 0 ; goes through string until end of string
      If *chr\u = ' ' Or *chr\u = #TAB ; if space adds 1
        space + 1
      Else
        If space ; if there was a space or more before this character was found then add a single space
          ;newstring$ + " " + Chr(*chr\u) ;prevents space at the end of string
          *newstring\u = ' ' : *newstring + 2
          *newstring\u = *chr\u : *newstring + 2
          space = 0
        Else
          *newstring\u = *chr\u : *newstring + 2
          ;newstring$ + Chr(*chr\u)
        EndIf
      EndIf
      *chr + 2 ; goes to next position in string
    Wend
    *newstring\u = 0 ; terminate the output string
  EndIf
  
Next


code1 = ElapsedMilliseconds() - timer
PrintN("code 1 results:|"+newstring$+"|")
PrintN("code 1 :" +Str(code1)+"ms")

timer = ElapsedMilliseconds()


For i = 0 To #repeats
  temp="  1		2    3 4 entry  6  dummy 8 "
  ;String processing
  ReplaceString(temp,#TAB$," ",#PB_String_InPlace)
  temp=Trim(temp)
  If Len(temp)<>0 And Left(temp,1)<>"*"
    start=FindString(temp,"  ")
    While start
      temp=ReplaceString(temp,"  "," ",#PB_String_CaseSensitive,start)
      start=FindString(temp,"  ")
    Wend
    ;Process String further...
  EndIf
Next

code2 = ElapsedMilliseconds() - timer

PrintN("code 2 results:|"+temp+"|")
PrintN("code 2 :" +Str(code2)+"ms")
PrintN("")
PrintN("press any key to exit")
EnableDebugger
Input()

End




Post by deathmx »

If it's very slow and you're using ReadString on the files, it may be possible to speed it up by using ReadData and reading directly from memory.
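
A minimal sketch of that idea, in case it helps: the whole file is loaded into one memory buffer with ReadData, so it can be scanned with pointers afterwards. The procedure name LoadFileToMemory and the file name "example.cad" are made up for illustration, and error handling is kept to the bare minimum.

```purebasic
; Hypothetical helper: loads a whole file into a memory buffer.
; Returns the buffer address (0 on failure) and stores the number
; of bytes actually read in *Size.
Procedure.i LoadFileToMemory(FileName.s, *Size.Integer)
  Protected file, *buffer, length.q
  file = ReadFile(#PB_Any, FileName)
  If file
    length = Lof(file)                 ; file size in bytes
    If length > 0
      *buffer = AllocateMemory(length)
      If *buffer
        *Size\i = ReadData(file, *buffer, length)
      EndIf
    EndIf
    CloseFile(file)
  EndIf
  ProcedureReturn *buffer
EndProcedure

Define size.i, *mem
*mem = LoadFileToMemory("example.cad", @size) ; hypothetical file name
If *mem
  Debug Str(size) + " bytes loaded"
  ; ... scan the buffer with pointers here ...
  FreeMemory(*mem)
EndIf
```

Whether this beats ReadString probably depends on how the buffer is processed afterwards; converting every line back into a PB string would likely give away most of the gain.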

You can also speed it up a little more by using the PureBasic 6 C backend with optimizations enabled in the compiler options.

Btw, my computer is running at 0.79 GHz at the moment, so my results will probably be slower than yours.

C backend with optimizations, results:
running each code 100000time/s
code 1 results:|1 2 3 4 entry 6 dummy 8|
code 1 :77ms
code 2 results:|1 2 3 4 entry 6 dummy 8|
code 2 :523ms

press any key to exit
wilbert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Post by wilbert »

It's best to get the fields at the same time and not use StringField.
If you knew for sure that the fields never exceed a certain length, for example 20 characters, it could be made a lot faster.

The code below could also be made a bit faster by using ASM if desired.

Code: Select all

DisableDebugger

#MaxFields = 10

Global Dim Fields.s(#MaxFields - 1)

Procedure.i GetFields(*String.Character)
  Protected *Start, Count
  If *String
    
    While Count < #MaxFields
      
      While *String\c <= 32
        If *String\c = 0
          Break 2
        EndIf
        *String + SizeOf(Character)
      Wend
      *Start = *String
      While *String\c > 32
        *String + SizeOf(Character)
      Wend
      
      Fields(Count) = PeekS(*Start, (*String - *Start) >> #PB_Compiler_Unicode)
      Count + 1
      
    Wend
    
  EndIf
  ProcedureReturn Count
EndProcedure

EnableDebugger


Define temp.s
temp="  1		2    3 4 entry  6  dummy 8 "

field_count = GetFields(@temp)
field = 0
While field < field_count
  Debug Fields(field)
  field + 1
Wend
Last edited by wilbert on Tue Aug 03, 2021 7:58 am, edited 1 time in total.
Windows (x64)
Raspberry Pi OS (Arm64)
infratec
Always Here
Posts: 6817
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Post by infratec »

If you want speed, I need more information:

1. Load the entire file into memory.
Is it ASCII, Unicode or UTF-8?

2. Process the whole buffer in one go.
The result is a list.

As far as I remember, I already wrote something like this ...

viewtopic.php?f=13&t=76303&p=562080

But as mentioned: don't use strings directly if your files are ASCII.
It is better to 'scan' the buffer directly.
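
To illustrate what 'scanning' the buffer could mean, here is a rough sketch under some assumptions: the buffer holds plain ASCII, lines end with LF or CR/LF, and every byte up to 32 counts as a separator. The procedure name ScanAsciiBuffer is made up, and the sample is poked into memory only to keep the example self-contained; in practice the buffer would come from ReadData.

```purebasic
; Walks an ASCII buffer once and reports each field of every line
; that does not start (after optional spaces/tabs) with "*".
Procedure ScanAsciiBuffer(*Buffer, Length)
  Protected *p.Ascii = *Buffer
  Protected *limit = *Buffer + Length
  Protected *start, atLineStart = #True, skipLine = #False

  While *p < *limit
    If *p\a = 10 Or *p\a = 13           ; LF / CR: a new line begins
      atLineStart = #True
      skipLine = #False
      *p + 1
    ElseIf *p\a <= 32                   ; space, tab, other control bytes
      *p + 1
    Else
      If atLineStart And *p\a = '*'     ; first visible character is "*"
        skipLine = #True
      EndIf
      atLineStart = #False
      *start = *p
      While *p < *limit And *p\a > 32   ; run to the end of the field
        *p + 1
      Wend
      If skipLine = #False
        Debug PeekS(*start, *p - *start, #PB_Ascii)
      EndIf
    EndIf
  Wend
EndProcedure

; Self-contained demo data: one entry line and one comment line
Define sample.s = "  D1" + #TAB$ + "SOD523  285 705 0 DX.410052E 0 0 " + #LF$ + " * a comment" + #LF$
Define length = StringByteLength(sample, #PB_Ascii)
Define *mem = AllocateMemory(length + 1) ; +1 for the terminating null written by PokeS
If *mem
  PokeS(*mem, sample, -1, #PB_Ascii)
  ScanAsciiBuffer(*mem, length)
  FreeMemory(*mem)
EndIf
```

The PeekS call is only there to show the fields in the debugger. For real speed the fields would probably be converted to numbers or stored straight from the buffer instead of creating a string each time.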

Post by STARGÅTE »

jacdelad wrote: Tue Aug 03, 2021 1:42 am The pattern is roughly "word word number number number word number number"
e.g.: "D1 SOD523 285 705 0 DX.410052E 0 0"

Can I expect a regular expression to be faster, yet flexible enough to cover all my requirements?
Yes, of course. Regular expressions are designed for this kind of parsing. You can express every case that counts as valid content and exclude comments.

Can you provide a complete example file that has to be parsed?
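
As a rough sketch of how that could look (the patterns and the constant names #RegexIgnore and #RegexField are only assumptions, not tested against real files): one expression detects empty and comment lines, a second one returns each run of non-whitespace characters as a field. Note that "\w+" would split an entry like "DX.410052E" at the dot, which is why "\S+" is used here.

```purebasic
Enumeration
  #RegexIgnore   ; hypothetical names, for illustration only
  #RegexField
EndEnumeration

Define entry.s = "  D1" + #TAB$ + "SOD523  285 705 0 DX.410052E 0 0 "

; "^\s*(\*|$)" matches lines that are empty, whitespace only, or that
; start with "*" after optional spaces/tabs.
; "\S+" matches one field (any run of non-whitespace characters).
If CreateRegularExpression(#RegexIgnore, "^\s*(\*|$)") And CreateRegularExpression(#RegexField, "\S+")
  If MatchRegularExpression(#RegexIgnore, entry) = #False
    If ExamineRegularExpression(#RegexField, entry)
      While NextRegularExpressionMatch(#RegexField)
        Debug RegularExpressionMatchString(#RegexField)
      Wend
    EndIf
  EndIf
  FreeRegularExpression(#RegexIgnore)
  FreeRegularExpression(#RegexField)
EndIf
```

Both expressions should be created once and reused for all lines; creating them per line would most likely cost more than the matching itself.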

Post by jacdelad »

Thanks a lot for all the approaches. A little more information, as requested: the files are ASCII. I tried ReadData, but it didn't speed things up (at least not for me). I cannot predict how long a single field is. The C backend is not an option right now (I am using pointers on maps, which is not fixed yet).

Right now, the best approach seems to be RegEx or using a buffer and pointers. I'll try this.
Marc56us
Addict
Posts: 1477
Joined: Sat Feb 08, 2014 3:26 pm

Post by Marc56us »

I am processing millions of entries.
With a file of this size, reading from disk will take much longer than parsing the lines anyway, so there is no need to look for algorithm optimizations.

I did a test on a file with 18,000,000 lines (2.8 GB)
i7-8700 @ 3.20GHz
HD SSD
regular expression

PB: 9 seconds
FindStr: 6 seconds
Grep (WSL2): 4 seconds

:wink:

Post by jacdelad »

There are also literally tens of thousands of files. Each file is between 1 kB and maybe 1 MB.

I know I can't speed up reading the files, besides maybe using ReadData,
but speeding up the processing part could still help.
fabulouspaul
User
Posts: 34
Joined: Sun Nov 23, 2014 1:18 pm

Post by fabulouspaul »

jacdelad wrote: Tue Aug 03, 2021 1:42 am The pattern is roughly "word word number number number word number number"
e.g.: "D1 SOD523 285 705 0 DX.410052E 0 0"

Can I expect a regular expression to be faster, yet flexible enough to cover all my requirements?
I took STARGÅTE's example and just let the regular expression replace your blanks:

Code: Select all

EnableExplicit

Enumeration
	#RegularExpression
EndEnumeration

;Just for the Example
Define temp.s = "  1		2    3 4 entry  6  dummy 8 "
Define timer

timer = ElapsedMilliseconds()

If CreateRegularExpression(#RegularExpression, "\s+")
  If ExamineRegularExpression(#RegularExpression, temp)
    temp = Trim(ReplaceRegularExpression(#RegularExpression, temp, " "))
    If Left(temp, 1) <> "*" And Len(temp) > 0
      Debug temp
    EndIf    
  EndIf
EndIf

timer = ElapsedMilliseconds() - timer

Debug "Time: " + Str(timer) + " ms"
Maybe this works for you.

Post by wilbert »

jacdelad wrote: Tue Aug 03, 2021 11:57 am I tried ReadData, but it didn't speed up (at least not for me)
ReadData is great when you are working directly on the loaded content with memory pointers.
PB strings are not that fast when you have to process millions of strings.
jacdelad wrote: Tue Aug 03, 2021 11:57 am Right now, the best approach seems like RegEx or using a buffer and pointers. I'll try this.
RegEx is easier to code, but using pointers was faster when I tried.

Post by infratec »

I'm also sure that a PB solution with pointers and a single pass is faster than regex.

Post by jacdelad »

I used the code from @deathmx, modified it a bit and squeezed it into a macro. Took me a while to understand how it works (plus I had to change it to detect the * wherever it is) and it works.
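
For anyone reading along, the change could look roughly like the sketch below. This is not my actual macro, just an illustration written as a procedure; the name CleanLine is made up. The leading spaces/tabs are skipped first and only then is the "*" checked, so comment lines are detected wherever the "*" starts.

```purebasic
; Returns the line with leading/trailing blanks removed and every run of
; spaces/tabs collapsed to one space. Returns "" for empty lines and for
; lines whose first visible character is "*".
Procedure.s CleanLine(Text.s)
  Protected *src.Character = @Text
  Protected result.s, *dst.Character, pendingSpace

  If *src = 0                        ; an empty string may have no buffer
    ProcedureReturn ""
  EndIf

  While *src\c = ' ' Or *src\c = #TAB ; skip leading spaces/tabs
    *src + SizeOf(Character)
  Wend
  If *src\c = 0 Or *src\c = '*'      ; nothing left, or a comment line
    ProcedureReturn ""
  EndIf

  result = Space(Len(Text))          ; output is never longer than the input
  *dst = @result
  While *src\c
    If *src\c = ' ' Or *src\c = #TAB
      pendingSpace = #True           ; remember it, write it only before a field
    Else
      If pendingSpace
        *dst\c = ' ' : *dst + SizeOf(Character)
        pendingSpace = #False
      EndIf
      *dst\c = *src\c : *dst + SizeOf(Character)
    EndIf
    *src + SizeOf(Character)
  Wend
  *dst\c = 0                         ; terminate the output string
  ProcedureReturn result
EndProcedure

Debug "|" + CleanLine("  1" + #TAB$ + #TAB$ + "2    3 4 entry  6  dummy 8 ") + "|"
Debug "|" + CleanLine(" " + #TAB$ + "* comment") + "|"
```

Because a pending space is only written in front of the next field, trailing blanks never reach the output and no separate Trim is needed.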

Thanks for everyone's contribution. This already speeds up my program significantly, and I can use this method on another problem, which will bring even more speed.