Why FindString() is so terribly slow?

Just starting out? Need help? Post your questions and find answers here.
User avatar
Josh
Addict
Addict
Posts: 1183
Joined: Sat Feb 13, 2010 3:45 pm

Re: Why FindString() is so terribly slow?

Post by Josh »

mestnyi wrote:Maybe there was no need to write here, but I think it is somehow connected. :)
Can you help me speed up the process?
On macos I could not adapt your options. :oops:

Code: Select all

EnableExplicit

Procedure StringFields (String$, List Fields$(), Separator$)
 ;BESCHREIBUNG: Zerlegt String$ in Teilstrings, welche durch Separator$ getrennt sind und
 ;              stellt diese in die List Fields$.
 ;PARAMETER   : String$    = Der zu durchsuchende String
 ;              Fields$()  = List in die die Teilstrings gestellt werden
 ;              Separator$ = Trennzeichen (auch mehrere Zeichen)
 ;RÜCKGABEWERT: keiner
 ;¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
  Structure CHARARRAY : c.c[0] : EndStructure
  Protected *Sta.CHARARRAY
  Protected *End.CHARARRAY
  Protected *Sep.CHARARRAY
  Protected LenSeparator
  Protected IsMatched
  Protected i

  #SOC  = SizeOf (Character)

  *Sta = @String$
  *End = @String$
  *Sep = @Separator$

  LenSeparator = Len (Separator$)

 ;Single Sign Separator
  If LenSeparator = 1

    While *End\c
      If *End\c = *Sep\c
        AddElement (Fields$())
        Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)
        *Sta = *End + #SOC
      EndIf
      *End + #SOC
    Wend

    AddElement (Fields$())
    Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)

 ;Multi Sign Separator
  Else

    While *End\c
      IsMatched = #True
      For i = 0 To LenSeparator - 1
        If *End\c[i] <> *Sep\c[i]
          IsMatched = #False
          Break
        EndIf
      Next
      If IsMatched
        AddElement (Fields$())
        Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)
        *Sta = *End + LenSeparator * #SOC
        *End + LenSeparator * #SOC
      Else
        *End + #SOC
      EndIf
    Wend

    AddElement (Fields$())
    Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)

  EndIf

EndProcedure

Global Text.s, String.s, a,f,f1,l,time.i, c.s = #LF$, LN.i=50000
Global NewList String.s()

Global *Buffer = AllocateMemory(LN*100)
Global *Pointer = *Buffer


time = ElapsedMilliseconds()
For a = 0 To LN
  Text.s = "Item "+Str(a) + c
  CopyMemoryString(PeekS(@Text), @*Pointer) ; 3
Next
Text = PeekS(*Buffer)
Text.s = Trim(Text.s, c)
Debug Str(ElapsedMilliseconds()-time) + " text collection time"



time = ElapsedMilliseconds()
StringFields (Text, String(), c)
Debug Str(ElapsedMilliseconds()-time) + " text parse time "
Debug "all count "+ListSize(String())
P.S.: Don't use constants in strings (c.s = #LF$)
sorry for my bad english
User avatar
skywalk
Addict
Addict
Posts: 4224
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Why FindString() is so terribly slow?

Post by skywalk »

Why no constants in strings?
s$ + #CRLF$
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
mestnyi
Addict
Addict
Posts: 1098
Joined: Mon Nov 25, 2013 6:41 am

Re: Why FindString() is so terribly slow?

Post by mestnyi »

Josh Your work is pretty good, but not fast enough.

242 text collection time
1195 text parse time
all count 500001

I found a solution with regular expressions, but I would like even faster.

244 text collection time
479 text parse time
all count 500001

Code: Select all

EnableExplicit

Global Text.s, String.s, a,f,f1,l,time.i, c.s = #LF$, LN.i=500000;0
Global NewList String.s()

Global *Buffer = AllocateMemory(LN*100)
Global *Pointer = *Buffer

time = ElapsedMilliseconds()
For a = 0 To LN
  Text.s = "Item "+Str(a) + c
  CopyMemoryString(PeekS(@Text), @*Pointer) ; 3
Next
Text = PeekS(*Buffer)
Text.s = Trim(Text.s, c)
Debug Str(ElapsedMilliseconds()-time) + " text collection time"


time = ElapsedMilliseconds()
If CreateRegularExpression(0, ~"^.*", #PB_RegularExpression_MultiLine)
  If ExamineRegularExpression(0, Text.s)
    While NextRegularExpressionMatch(0)
      
      AddElement(String()) 
      String() =  RegularExpressionMatchString(0)
      
    Wend
  EndIf
Else
  Debug RegularExpressionError()
EndIf

Debug Str(ElapsedMilliseconds()-time) + " text parse time "
Debug "all count "+ListSize(String())

; ForEach String()
;     Debug "get - "+String()
;  Next
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Why FindString() is so terribly slow?

Post by wilbert »

mestnyi wrote:Josh Your work is pretty good, but not fast enough.

242 text collection time
1195 text parse time
all count 500001

I found a solution with regular expressions, but I would like even faster.

244 text collection time
479 text parse time
all count 500001
Don't do timings with debug enabled.
It's not a fair comparison.

For the fastest approach, it's important to think about what you need.
If you want to split a string into a list with the items separated by a single separator character, the fastest approach is different compared to finding a multi character string.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
Josh
Addict
Addict
Posts: 1183
Joined: Sat Feb 13, 2010 3:45 pm

Re: Why FindString() is so terribly slow?

Post by Josh »

mestnyi wrote:Josh Your work is pretty good, but not fast enough.

242 text collection time
1195 text parse time
all count 500001

I found a solution with regular expressions, but I would like even faster.

244 text collection time
479 text parse time
all count 500001
Didn't want to make much changes in your code. Like wilbert wrote, try it without debugging. Times on my old and not fast machine:

My Code:
226 text collection time
135 text parse time
all count 500001

Your Code with regular expressions:
225 text collection time
488 text parse time
all count 500001

skywalk wrote:Why no constants in strings?
s$ + #CRLF$
I meant that it makes no sense to use a string instead of a constant (Code mestnyi: c.s = # LF$). Apart from the fact that I find it confusing, the code with constants is faster.

Code: Select all

LF$ = #LF$

time = ElapsedMilliseconds()
For i = 1 To 1000000
  x$ = LF$
Next
time = ElapsedMilliseconds() - time
msg$ + time + #CRLF$

time = ElapsedMilliseconds()
For i = 1 To 1000000
  x$ = #LF$
Next
time = ElapsedMilliseconds() - time
msg$ + time + #CRLF$

MessageRequester ("", msg$)
sorry for my bad english
mestnyi
Addict
Addict
Posts: 1098
Joined: Mon Nov 25, 2013 6:41 am

Re: Why FindString() is so terribly slow?

Post by mestnyi »

wilbert wrote:Don't do timings with debug enabled.
It's not a fair comparison.
Why is it not fair, I check both codes equally? :oops:
wilbert wrote:For the fastest approach, it's important to think about what you need.
If you want to split a string into a list with the items separated by a single separator character, the fastest approach is different compared to finding a multi character string.
This is enough for me, but 2-3 times faster. :D

Code: Select all

While *End\c
  If *End\c = #LF 
    String.s = PeekS (*Sta, (*End-*Sta)/#SOC)
    
    
    *Sta = *End + #SOC
  EndIf
  
  *End + #SOC
Wend
User avatar
Josh
Addict
Addict
Posts: 1183
Joined: Sat Feb 13, 2010 3:45 pm

Re: Why FindString() is so terribly slow?

Post by Josh »

mestnyi wrote:
wilbert wrote:Don't do timings with debug enabled.
It's not a fair comparison.
Why is it not fair, I check both codes equally? :oops:
No, you can not say that you use both codes with debugger and then get a usable result. In the first case, you use a code with many lines and in the other case a one-line Pb function (which does the same thing), your code will always lose, because there are many lines to debug in your code.
sorry for my bad english
User avatar
the.weavster
Addict
Addict
Posts: 1577
Joined: Thu Jul 03, 2003 6:53 pm
Location: England

Re: Why FindString() is so terribly slow?

Post by the.weavster »

Fred wrote:About the comparison to other languages, some use 'tricks' to perform well in 'real wold' scenario. It's the Python case, which looks at the end of the string first (which is exactly your case).
I just changed the Python code to this:

Code: Select all

from time import time

s = ' ' * 500000 + '!' + ' ' * 500000
t = time()
for i in range(1000):
    x = s.find('!')
print(int((time()-t)*1000), 'ms')
And it's still only 49ms so I don't think your theory about Python's trick is correct.
mestnyi
Addict
Addict
Posts: 1098
Joined: Mon Nov 25, 2013 6:41 am

Re: Why FindString() is so terribly slow?

Post by mestnyi »

So actually your code is faster. :)
In my real code, for some reason, it is inferior to a regular expression. :shock:
how else can you speed it up?
Can I go through not characters, but words?

Code: Select all

EnableExplicit
DisableDebugger

Global Text.s, String.s, a,time.i,time1.i,time2.i,time3.i, c.s = #LF$, LN.i=500000
Global NewList String.s()

Global *Buffer = AllocateMemory(1000000000)
Global *Pointer = *Buffer
Global pos,len


time = ElapsedMilliseconds()
For a = 0 To LN
  Text.s = "Item qwertyuiop[]asdfghjkl;'\`zxcvbnm,./ asdfghjkll;;;llkkjjjhuhdjkvbkjdskvjsdnvsdf1234567890-=qwertyuiop[]asdfghjkl;'\`zxcvbnm,./"+Str(a) + c
  CopyMemoryString(PeekS(@Text), @*Pointer) ; 3
Next
Text = PeekS(*Buffer)


time = ElapsedMilliseconds()
Define *Sta.Character = @Text
Define *End.Character = @Text

#SOC  = SizeOf (Character)

While *End\c
  
  If *End\c = #LF             
    len = (*End-*Sta)/#SOC
    String.s = PeekS (*Sta, len)
    
    AddElement(String()) 
    String() = String.s
    
    *Sta = *End + #SOC
    pos+len
  EndIf
  
  *End + #SOC
Wend
time1 = ElapsedMilliseconds()-time

time = ElapsedMilliseconds()
If CreateRegularExpression(0, ~".*\n?")
  If ExamineRegularExpression(0, Text.s)
    While NextRegularExpressionMatch(0)
      
      String.s = Trim(RegularExpressionMatchString(0), #LF$)
      AddElement(String()) 
      String() = String.s
      
    Wend
  EndIf
Else
  Debug RegularExpressionError()
EndIf
time2 = ElapsedMilliseconds()-time

; time = ElapsedMilliseconds()
; For a = 1 To LN
;   String.s = StringField(Text.s, a, #LF$)
;   Len = Len(String)
;   
;   AddElement(String()) 
;   String() = String.s
;   
;   pos + Len + 1
; Next
; time3 = ElapsedMilliseconds()-time

MessageRequester( "", Str(time1) + " - time_1 text parse" +#CRLF$+ Str(time2) + " - time_2 text parse" +#CRLF$+ Str(time3) + " - time_3 text parse")

wilbert
PureBasic Expert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Why FindString() is so terribly slow?

Post by wilbert »

mestnyi wrote:In my real code, for some reason, it is inferior to a regular expression. :shock:
how else can you speed it up?
Can I go through not characters, but words?
This is already a bit faster

Code: Select all

time = ElapsedMilliseconds()

Define *Sta.Character = @Text
Define *End.Character = @Text

#SOC  = SizeOf (Character)

While *End\c
  If *End\c = #LF             
    AddElement(String())
    String() = PeekS(*Sta, (*End-*Sta) >> #PB_Compiler_Unicode)
    *Sta = *End + #SOC
  EndIf
  *End + #SOC
Wend

time1 = ElapsedMilliseconds()-time
When you are using assembly code, you can check 16 characters at once for being a #LF or not.
That usually speeds things up but is more complicated to implement.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
skywalk
Addict
Addict
Posts: 4224
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Why FindString() is so terribly slow?

Post by skywalk »

Code posted here...
I am confused by the result of PB FindString() no case vs C's no case function. :shock:
PB = 3.9s vs StrStrIW() = 12.5s?

Code: Select all

; Count(n),                                            200
; findstring_+case(ms),                                227
; findstringmem_+case(ms),                             1660
; findstring_-case(ms),                                3886
; findstringmem_-case(ms),                             2073
; findstringC_+case(ms),                               233
; findstringASMU_+case(ms),                            229
; findstringSSE2_+case(ms),                            125
; findstringC_-case(ms),                               12539
; findstring_+case : findstringmem_+case(%),           631
; findstring_-case : findstringmem_-case(%),           -47
; findstring_+case : findstring_-case(%),              1612
; findstringmem_+case : findstringmem_-case(%),        25
; findstring_+case : findstringC_+case(%),             3
; findstring_+case : findstringASMU_+case(%),          1
; findstring_+case : findstringSSE2_+case(%),          -45
; findstring_+case : findstringC_-case(%),             5424
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Why FindString() is so terribly slow?

Post by wilbert »

skywalk wrote:I am confused by the result of PB FindString() no case vs C's no case function. :shock:
The problem with comparing case insensitive functions is that there is no standard approach.
Some routines only do a case insensitive compare for A-Z, others use a lookup table to handle all unicode characters and also take into account the rules for a specific language. The second approach is much slower.
Windows (x64)
Raspberry Pi OS (Arm64)
User avatar
skywalk
Addict
Addict
Posts: 4224
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Why FindString() is so terribly slow?

Post by skywalk »

Makes sense, but I assumed PB's FindString(casesensitive) was a wrapper for the underlying wcsstr().
Their results are nearly identical.
But why the big difference when comparing FindString(nocase) vs StrStrIW()?
This is counterintuitive.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Post Reply