Why FindString() is so terribly slow?

STARGÅTE · Post by **STARGÅTE** » Mon Nov 26, 2018 8:41 am

@uwekel:

Your Python code takes on my system 330ms (in comparison to 2787ms in PureBasic).
And your code need more time (663ms), if the string contains unicode characters (which is in PB automatically).

Code: Select all

from time import time

s = '√' * 1000000 + '∑'
t = time()
for i in range(1000):
    x = s.find('∑')
print(int((time()-t)*1000), 'ms')
input()

Code: Select all

OpenConsole()
s.s = LSet("", 1000000, "√") + "∑"
t = ElapsedMilliseconds()
For i = 1 To 1000
  x = FindString(s, "∑")
Next
PrintN(Str(ElapsedMilliseconds()-t))
Input()

So, PB is not terribly slow!

I think we can speed up the FindString in PB by using commands such PCMPEQB/PCMPEQW to compare multiply bytes/word in one instruction. Maybe that uses Python, because 2787ms (PB) / 8 Bytes = 348ms (Python)

#NULL · Post by **#NULL** » Mon Nov 26, 2018 8:44 am

Maybe there is some optimization going on. every loop iteration produces the same x and x is not even used. Can you try to add x up to some sum and print that so the compiler can't elide anything?

Marc56us · Post by **Marc56us** » Mon Nov 26, 2018 8:54 am

Regular expression version: 2 ms (1000 time faster than FindString)

Code: Select all

Text$ = Space(1000000) + "!"
Debug "String Lenght: " + Len(Text$)
Start = ElapsedMilliseconds()

CreateRegularExpression(0, "!")
If MatchRegularExpression(0, Text$)
    Stop = ElapsedMilliseconds()
    ExamineRegularExpression(0, Text$)
    While NextRegularExpressionMatch(0)
        Debug "Found: « " + RegularExpressionMatchString(0) + 
              " » at position: " + RegularExpressionMatchPosition(0) +
              " in: " + Str(Stop - Start) + " ms"
    Wend
EndIf

FreeRegularExpression(0)

Code: Select all

String Lenght: 1000001
Found: « ! » at position: 1000001 in: 2 ms

(and with debug on)

So, yes FindString() is slow, but RegEx engine is here and do the job and do more things easyly for strings search.

the.weavster · Post by **the.weavster** » Mon Nov 26, 2018 9:36 am

Marc56us wrote:Regular expression version: 2 ms (1000 time faster than FindString)
..//..
So, yes FindString() is slow, but RegEx engine is here and do the job and do more things easyly for strings search.

I think you're supposed to perform the search 1000 times for the test

Marc56us · Post by **Marc56us** » Mon Nov 26, 2018 9:43 am

the.weavster wrote:I think you're supposed to perform the search 1000 times for the test

Oh

yes

With 1000 time: 2049 ms so same as FindString()

NicTheQuick · Post by **NicTheQuick** » Mon Nov 26, 2018 11:00 am

I tested it for fun too:

Code: Select all

Purebasic     1640 ms
Python2.7.15  1280 ms
Python3.6.7     76 ms

That is very impressive. Somehow Python 3.6 does a much better job at finding something.

Here is the Python code I used:

Code: Select all

# -*- coding: utf-8 -*-
from time import time

s = '√' * 1000000 + '∑'
t = time()
sum = 0
for i in range(1000):
	x = s.find('∑')
	sum += x
print(int((time()-t)*1000))
print(sum)

Edit:
You can even make it faster with Unicode in Python2.7:

Code: Select all

# -*- coding: utf-8 -*-
from time import time

s = u'√' * 1000000 + u'∑'
t = time()
sum = 0
for i in range(1000):
	x = s.find(u'∑')
	sum += x
print(int((time()-t)*1000))
print(sum)

Now the timings are:

Code: Select all

Python2.7.15   575 ms
Python3.6.7     70 ms

Post by **RSBasic** » Mon Nov 26, 2018 11:21 am

ccode wrote:@RSBasic

x = FindString(LCase(s), LCase("!"))

is slower!!!

Hm, that's weird. I tested on an old laptop with 1 to 100:

Code: Select all

s.s = Space(1000000) + "!"
t = ElapsedMilliseconds()
For i = 1 To 100
  x = FindString(s, "!")
Next
MessageRequester("", Str(ElapsedMilliseconds() - t))
t = ElapsedMilliseconds()
For i = 1 To 100
  x = FindString(s, "!", 0, #PB_String_NoCase)
Next
MessageRequester("", Str(ElapsedMilliseconds() - t))
t = ElapsedMilliseconds()
For i = 1 To 100
  x = FindString(LCase(s), LCase("!"))
Next
MessageRequester("", Str(ElapsedMilliseconds() - t))

340
4289
975

LCase() is faster than #PB_String_NoCase!

Marc56us · Post by **Marc56us** » Mon Nov 26, 2018 11:59 am

RSBasic wrote:LCase() is faster than #PB_String_NoCase!

This is logical: #PB_String_NoCase must test the two forms of each letter while LCase (and UCase) automatically transform all letters (before) so they do only one test after

Code: Select all

  #PB_String_CaseSensitive: case sensitive search (a=a) (default).
  #PB_String_NoCase       : case insensitive search (A=a).

It therefore seems that the automatic transformation is faster than the double test

infratec · Post by **infratec** » Mon Nov 26, 2018 12:53 pm

Ups, already cleared that 100x faster is not true

NicTheQuick · Post by **NicTheQuick** » Mon Nov 26, 2018 12:56 pm

infratec wrote:In my test it runs in python 1.5 times faster then PB.
But not 100x faster

Same PC, with Linux x86.

If you read my post, you will see that Python 3.6 runs more than 23 times faster than PB.

infratec · Post by **infratec** » Mon Nov 26, 2018 1:02 pm

Maybe Python 3 starts searching from the end

#NULL · Post by **#NULL** » Mon Nov 26, 2018 1:11 pm

infratec wrote:Maybe Python 3 starts searching from the end

If i add the sigma to the beginning of the string instead then it takes 0 milliseconds with Python 2.7.15rc1

infratec · Post by **infratec** » Mon Nov 26, 2018 1:22 pm

I was talking about 3 and not 2.

Since 3 is faster then it is possible if it start from beginning.

It is also possible that 3 starts 2 threads one starts from the beginning one from the end.
If the pointer matches in the middle the string was not found.
Else one of them finds it earlier and stop the searching.

Derren · Post by **Derren** » Mon Nov 26, 2018 2:02 pm

I tried this code by Nic (viewtopic.php?p=529390#p529390)with an online python compiler: https://repl.it/repls/CommonMeagerVendor

Code: Select all

# -*- coding: utf-8 -*-
from time import time

s = '√' * 1000000 + '∑'
t = time()
sum = 0
for i in range(1000):
   x = s.find('∑')
   sum += x
print(int((time()-t)*1000))
print(sum)

I can't get under 300 ms, for some reason, BUT changing the string like this:

Code: Select all

s = '√' * 500000 + '∑' + '√' * 500000

which I suppose (not fluent in python) puts the Sigma in the middle of the 1 mio character string, it significantly faster (about 50ms).

Also, shouldn't a "from end" search need only 0ms, like #NULL wrote, if that's the case when the Sigma is in the beginning.

NicTheQuick · Post by **NicTheQuick** » Mon Nov 26, 2018 2:07 pm

infratec wrote:Since 3 is faster then it is possible if it start from beginning.

It is also possible that 3 starts 2 threads one starts from the beginning one from the end.
If the pointer matches in the middle the string was not found.
Else one of them finds it earlier and stop the searching.

That would be a bad idea. The creation of the threads alone would last a small amount of time. That is the reason why you should use thread pools when you want to distribute work. But Python 3 does nothing of that. Also traversing a memory block from the end back to the beginning is not a good choice regarding cache management and the result you want to know. You want to know the first occurrence of a string. Why should you start from the end then? That makes no sense.
But what I think Python 3 is doing is that it knows that s and '∑' were not changed since the last iteration of the for loop and the 'find' method has no side effects, so it can simply use the result of the previous call instead of searching for '∑' in s a second time.

PureBasic Forums - English

Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?

Re: Why FindString() is so terribly slow?