Why FindString() is so terribly slow?

Josh · Post by **Josh** » Tue Dec 18, 2018 5:09 am

mestnyi wrote:Maybe there was no need to write here, but I think it is somehow connected.
Can you help me speed up the process?
On macos I could not adapt your options.

Code: Select all

EnableExplicit

Procedure StringFields (String$, List Fields$(), Separator$)
 ;BESCHREIBUNG: Zerlegt String$ in Teilstrings, welche durch Separator$ getrennt sind und
 ;              stellt diese in die List Fields$.
 ;PARAMETER   : String$    = Der zu durchsuchende String
 ;              Fields$()  = List in die die Teilstrings gestellt werden
 ;              Separator$ = Trennzeichen (auch mehrere Zeichen)
 ;RÜCKGABEWERT: keiner
 ;¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
  Structure CHARARRAY : c.c[0] : EndStructure
  Protected *Sta.CHARARRAY
  Protected *End.CHARARRAY
  Protected *Sep.CHARARRAY
  Protected LenSeparator
  Protected IsMatched
  Protected i

  #SOC  = SizeOf (Character)

  *Sta = @String$
  *End = @String$
  *Sep = @Separator$

  LenSeparator = Len (Separator$)

 ;Single Sign Separator
  If LenSeparator = 1

    While *End\c
      If *End\c = *Sep\c
        AddElement (Fields$())
        Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)
        *Sta = *End + #SOC
      EndIf
      *End + #SOC
    Wend

    AddElement (Fields$())
    Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)

 ;Multi Sign Separator
  Else

    While *End\c
      IsMatched = #True
      For i = 0 To LenSeparator - 1
        If *End\c[i] <> *Sep\c[i]
          IsMatched = #False
          Break
        EndIf
      Next
      If IsMatched
        AddElement (Fields$())
        Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)
        *Sta = *End + LenSeparator * #SOC
        *End + LenSeparator * #SOC
      Else
        *End + #SOC
      EndIf
    Wend

    AddElement (Fields$())
    Fields$() = PeekS (*Sta, (*End-*Sta)/#SOC)

  EndIf

EndProcedure

Global Text.s, String.s, a,f,f1,l,time.i, c.s = #LF$, LN.i=50000
Global NewList String.s()

Global *Buffer = AllocateMemory(LN*100)
Global *Pointer = *Buffer


time = ElapsedMilliseconds()
For a = 0 To LN
  Text.s = "Item "+Str(a) + c
  CopyMemoryString(PeekS(@Text), @*Pointer) ; 3
Next
Text = PeekS(*Buffer)
Text.s = Trim(Text.s, c)
Debug Str(ElapsedMilliseconds()-time) + " text collection time"



time = ElapsedMilliseconds()
StringFields (Text, String(), c)
Debug Str(ElapsedMilliseconds()-time) + " text parse time "
Debug "all count "+ListSize(String())

P.S.: Don't use constants in strings (c.s = #LF$)

skywalk · Post by **skywalk** » Tue Dec 18, 2018 6:06 am

Why no constants in strings?
s$ + #CRLF$

mestnyi · Post by **mestnyi** » Tue Dec 18, 2018 6:08 am

Josh Your work is pretty good, but not fast enough.

242 text collection time
1195 text parse time
all count 500001

I found a solution with regular expressions, but I would like even faster.

244 text collection time
479 text parse time
all count 500001

Code: Select all

EnableExplicit

Global Text.s, String.s, a,f,f1,l,time.i, c.s = #LF$, LN.i=500000;0
Global NewList String.s()

Global *Buffer = AllocateMemory(LN*100)
Global *Pointer = *Buffer

time = ElapsedMilliseconds()
For a = 0 To LN
  Text.s = "Item "+Str(a) + c
  CopyMemoryString(PeekS(@Text), @*Pointer) ; 3
Next
Text = PeekS(*Buffer)
Text.s = Trim(Text.s, c)
Debug Str(ElapsedMilliseconds()-time) + " text collection time"


time = ElapsedMilliseconds()
If CreateRegularExpression(0, ~"^.*", #PB_RegularExpression_MultiLine)
  If ExamineRegularExpression(0, Text.s)
    While NextRegularExpressionMatch(0)
      
      AddElement(String()) 
      String() =  RegularExpressionMatchString(0)
      
    Wend
  EndIf
Else
  Debug RegularExpressionError()
EndIf

Debug Str(ElapsedMilliseconds()-time) + " text parse time "
Debug "all count "+ListSize(String())

; ForEach String()
;     Debug "get - "+String()
;  Next

wilbert · Post by **wilbert** » Tue Dec 18, 2018 7:04 am

mestnyi wrote:Josh Your work is pretty good, but not fast enough.

242 text collection time
1195 text parse time
all count 500001

I found a solution with regular expressions, but I would like even faster.

244 text collection time
479 text parse time
all count 500001

Don't do timings with debug enabled.
It's not a fair comparison.

For the fastest approach, it's important to think about what you need.
If you want to split a string into a list with the items separated by a single separator character, the fastest approach is different compared to finding a multi character string.

Josh · Post by **Josh** » Tue Dec 18, 2018 8:30 am

mestnyi wrote:Josh Your work is pretty good, but not fast enough.

242 text collection time
1195 text parse time
all count 500001

I found a solution with regular expressions, but I would like even faster.

244 text collection time
479 text parse time
all count 500001

Didn't want to make much changes in your code. Like wilbert wrote, try it without debugging. Times on my old and not fast machine:

My Code:
226 text collection time
135 text parse time
all count 500001

Your Code with regular expressions:
225 text collection time
488 text parse time
all count 500001

skywalk wrote:Why no constants in strings?
s$ + #CRLF$

I meant that it makes no sense to use a string instead of a constant (Code mestnyi: c.s = # LF$). Apart from the fact that I find it confusing, the code with constants is faster.

Code: Select all

LF$ = #LF$

time = ElapsedMilliseconds()
For i = 1 To 1000000
  x$ = LF$
Next
time = ElapsedMilliseconds() - time
msg$ + time + #CRLF$

time = ElapsedMilliseconds()
For i = 1 To 1000000
  x$ = #LF$
Next
time = ElapsedMilliseconds() - time
msg$ + time + #CRLF$

MessageRequester ("", msg$)

mestnyi · Post by **mestnyi** » Tue Dec 18, 2018 10:05 am

wilbert wrote:Don't do timings with debug enabled.
It's not a fair comparison.

Why is it not fair, I check both codes equally?

wilbert wrote:For the fastest approach, it's important to think about what you need.
If you want to split a string into a list with the items separated by a single separator character, the fastest approach is different compared to finding a multi character string.

This is enough for me, but 2-3 times faster.

Code: Select all

While *End\c
  If *End\c = #LF 
    String.s = PeekS (*Sta, (*End-*Sta)/#SOC)
    
    
    *Sta = *End + #SOC
  EndIf
  
  *End + #SOC
Wend

Josh · Post by **Josh** » Tue Dec 18, 2018 10:29 am

mestnyi wrote:
wilbert wrote:Don't do timings with debug enabled.
It's not a fair comparison.
Why is it not fair, I check both codes equally?

No, you can not say that you use both codes with debugger and then get a usable result. In the first case, you use a code with many lines and in the other case a one-line Pb function (which does the same thing), your code will always lose, because there are many lines to debug in your code.

the.weavster · Post by **the.weavster** » Tue Dec 18, 2018 12:47 pm

Fred wrote:About the comparison to other languages, some use 'tricks' to perform well in 'real wold' scenario. It's the Python case, which looks at the end of the string first (which is exactly your case).

I just changed the Python code to this:

Code: Select all

from time import time

s = ' ' * 500000 + '!' + ' ' * 500000
t = time()
for i in range(1000):
    x = s.find('!')
print(int((time()-t)*1000), 'ms')

And it's still only 49ms so I don't think your theory about Python's trick is correct.

mestnyi · Post by **mestnyi** » Tue Dec 18, 2018 1:07 pm

So actually your code is faster.

In my real code, for some reason, it is inferior to a regular expression.

how else can you speed it up?
Can I go through not characters, but words?

Code: Select all

EnableExplicit
DisableDebugger

Global Text.s, String.s, a,time.i,time1.i,time2.i,time3.i, c.s = #LF$, LN.i=500000
Global NewList String.s()

Global *Buffer = AllocateMemory(1000000000)
Global *Pointer = *Buffer
Global pos,len


time = ElapsedMilliseconds()
For a = 0 To LN
  Text.s = "Item qwertyuiop[]asdfghjkl;'\`zxcvbnm,./ asdfghjkll;;;llkkjjjhuhdjkvbkjdskvjsdnvsdf1234567890-=qwertyuiop[]asdfghjkl;'\`zxcvbnm,./"+Str(a) + c
  CopyMemoryString(PeekS(@Text), @*Pointer) ; 3
Next
Text = PeekS(*Buffer)


time = ElapsedMilliseconds()
Define *Sta.Character = @Text
Define *End.Character = @Text

#SOC  = SizeOf (Character)

While *End\c
  
  If *End\c = #LF             
    len = (*End-*Sta)/#SOC
    String.s = PeekS (*Sta, len)
    
    AddElement(String()) 
    String() = String.s
    
    *Sta = *End + #SOC
    pos+len
  EndIf
  
  *End + #SOC
Wend
time1 = ElapsedMilliseconds()-time

time = ElapsedMilliseconds()
If CreateRegularExpression(0, ~".*\n?")
  If ExamineRegularExpression(0, Text.s)
    While NextRegularExpressionMatch(0)
      
      String.s = Trim(RegularExpressionMatchString(0), #LF$)
      AddElement(String()) 
      String() = String.s
      
    Wend
  EndIf
Else
  Debug RegularExpressionError()
EndIf
time2 = ElapsedMilliseconds()-time

; time = ElapsedMilliseconds()
; For a = 1 To LN
;   String.s = StringField(Text.s, a, #LF$)
;   Len = Len(String)
;   
;   AddElement(String()) 
;   String() = String.s
;   
;   pos + Len + 1
; Next
; time3 = ElapsedMilliseconds()-time

MessageRequester( "", Str(time1) + " - time_1 text parse" +#CRLF$+ Str(time2) + " - time_2 text parse" +#CRLF$+ Str(time3) + " - time_3 text parse")

wilbert · Post by **wilbert** » Tue Dec 18, 2018 3:35 pm

mestnyi wrote:In my real code, for some reason, it is inferior to a regular expression.
how else can you speed it up?
Can I go through not characters, but words?

This is already a bit faster

Code: Select all

time = ElapsedMilliseconds()

Define *Sta.Character = @Text
Define *End.Character = @Text

#SOC  = SizeOf (Character)

While *End\c
  If *End\c = #LF             
    AddElement(String())
    String() = PeekS(*Sta, (*End-*Sta) >> #PB_Compiler_Unicode)
    *Sta = *End + #SOC
  EndIf
  *End + #SOC
Wend

time1 = ElapsedMilliseconds()-time

When you are using assembly code, you can check 16 characters at once for being a #LF or not.
That usually speeds things up but is more complicated to implement.

skywalk · Post by **skywalk** » Tue Dec 18, 2018 5:16 pm

Code posted here...
I am confused by the result of PB FindString() no case vs C's no case function.

PB = 3.9s vs StrStrIW() = 12.5s?

Code: Select all

; Count(n),                                            200
; findstring_+case(ms),                                227
; findstringmem_+case(ms),                             1660
; findstring_-case(ms),                                3886
; findstringmem_-case(ms),                             2073
; findstringC_+case(ms),                               233
; findstringASMU_+case(ms),                            229
; findstringSSE2_+case(ms),                            125
; findstringC_-case(ms),                               12539
; findstring_+case : findstringmem_+case(%),           631
; findstring_-case : findstringmem_-case(%),           -47
; findstring_+case : findstring_-case(%),              1612
; findstringmem_+case : findstringmem_-case(%),        25
; findstring_+case : findstringC_+case(%),             3
; findstring_+case : findstringASMU_+case(%),          1
; findstring_+case : findstringSSE2_+case(%),          -45
; findstring_+case : findstringC_-case(%),             5424

wilbert · Post by **wilbert** » Tue Dec 18, 2018 5:35 pm

skywalk wrote:I am confused by the result of PB FindString() no case vs C's no case function.

The problem with comparing case insensitive functions is that there is no standard approach.
Some routines only do a case insensitive compare for A-Z, others use a lookup table to handle all unicode characters and also take into account the rules for a specific language. The second approach is much slower.

skywalk · Post by **skywalk** » Tue Dec 18, 2018 7:04 pm

Makes sense, but I assumed PB's FindString(casesensitive) was a wrapper for the underlying wcsstr().
Their results are nearly identical.
But why the big difference when comparing FindString(nocase) vs StrStrIW()?
This is counterintuitive.

PureBasic Forums - English

Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?