Is findstring really slow?

Randy Walker · Post by **Randy Walker** » Wed Apr 23, 2025 2:26 am

Obvious;y as seen at the bottom of this code snippit, 'countstring" is the simple way to get a word count.
But recently I saw someone complain that "findstring" was slow, so I wrote this simple routine to count occurrences of a word to see just how slow it really was. Seems quite fast to me, or maybe its just the clock speed of my CPU that gives it speed. I let you be the judge.
I used The constitution of the US as a text source for my word search.
I just right clicked this link and used the [Save link as] option to download the Titan Bug.txt file : https://www.gutenberg.org/cache/epub/5/pg5.txt

Code: Select all

; text source: https://www.gutenberg.org/cache/epub/5/pg5.txt
Debug FileSize("D:\TEMP\TEMP\Titan Bug.txt")
search$ = "free"
ReadFile(0,"D:\TEMP\TEMP\Titan Bug.txt",#PB_File_SharedRead | #PB_Ascii)
Buffer.s = ReadString(o,#PB_File_IgnoreEOL)
CloseFile(0)
time = ElapsedMilliseconds()
L = Len(buffer)
Debug L
P = 1
While P < L
  P = FindString(buffer,search$,P+1)
  If P
    C +1
  Else
    Break
  EndIf
Wend
Debug C
Debug ElapsedMilliseconds() - time

time = ElapsedMilliseconds()
Debug CountString(buffer,search$)
Debug ElapsedMilliseconds() - time

I count zero milliseconds to be pretty fast.

Post by **idle** » Wed Apr 23, 2025 4:02 am

No it's not slow it's just a wrapper to the c strstr function which it's fine for small strings it's a naive search which isn't optimized.

If you want faster search you can often use a boyer moore which is pretty much the gold standard for string search functions but the assumption is that you're searching for repeated occurrences of the search term and that your search term is multiple characters long. The search speed gets faster as the length of the search term increases as it doesn't' need to compare each character of the string and so it will often jump through the search string by the length of the search term when there's no partial match. but it has a slow start since it has to build the skip table with each change of the search term or specifically when the pos=0

there are also SSE simd parallel searches that will beat boyer moore but as a generic function a BMSearch is hard to beat.

This should really start to shine when your search term is more than 4 characters long and it reports the byte position in the memory so you can use peeks to extract the string

Code: Select all

 Structure ara
  a.a[0] 
 EndStructure

Procedure BMSearch(*pinput,inlen,*pat.ara,palen,pos=0) 
  ;booyer moore search 
  Protected i,t,len,skip,*input, *pa.Ascii,*pb.Ascii   
  
  Structure ST
    a.a[256] 
  EndStructure  
  Static skiptable.ST 
  inlen-pos 
  *input = *pinput+pos 
  len = inlen - palen
  If pos = 0  
    For i = 0 To 255
      Skiptable\a[i] = 255;
    Next  
    t= palen-1
    For i = 0 To t
      skiptable\a[*pat\a[i]] = i
    Next     
  EndIf 
  i=0
  skip=0
  While skip <= len
    i = palen - 1;
    *pa = (*input + skip + i)
    *pb = *pat+i
    While (*pb\a = *pa\a) 
      i-1 
      *pa - 1 
      *pb - 1
    Wend   
    If i > 0 
      t = i - Skiptable\a[*pa\a]   
      If t > 1 
        skip + t 
      Else 
        skip + 1
      EndIf 
    Else   
      ProcedureReturn skip + pos 
    EndIf     
  Wend  
  ProcedureReturn 0
  
EndProcedure

AZJIO · Post by **AZJIO** » Wed Apr 23, 2025 5:39 am

Randy Walker wrote: Wed Apr 23, 2025 2:26 am
Code: Select all
P+1)

Code: Select all

length = Len(search$)
P + length)

wilbert · Post by **wilbert** » Wed Apr 23, 2025 6:15 am

Randy Walker wrote: Wed Apr 23, 2025 2:26 am But recently I saw someone complain that "findstring" was slow

It depends on what you need it for.
It is especially slower as what you can achieve if you need to do multiple searches on a long string.
In the example code you posted you start FindString at a certain position but all characters prior to that position have to be scanned each time to see if there is no string termination.

BarryG · Post by **BarryG** » Wed Apr 23, 2025 8:23 am

Randy Walker wrote: Wed Apr 23, 2025 2:26 amrecently I saw someone complain that "findstring" was slow

Was that an old post? Because it used to be slow in older versions of PureBasic, but this was fixed some time ago. It can find a match at the end of a huge 100 MB string in less than half a second here on my old PC. Hardly what I'd call "slow".

Code: Select all

Debug FindString(Space(100*1048756)+"z","z")

NicTheQuick · Post by **NicTheQuick** » Wed Apr 23, 2025 8:59 am

As long as you test it with an enabled debugger your code will always be slowed down by it drastically.
Don't compare speeds with an enabled debugger!

NicTheQuick · Post by **NicTheQuick** » Wed Apr 23, 2025 9:43 am

I rewrote the code to make it usable with disabled debugger. I tested it with the `pg5.txt` file.
Even with the debugger it only takes 0 ms. We need a way bigger text file to create a reasonable test case.

Besides of that, idle already posted a code with the Boyer–Moore string-search algorithm if I see that right. It has a performance of O(m + n) when m and n are the lengths of the input text and search pattern. This algorithm especially makes sense if you want to search for the same pattern over and over.

Here's the code:

Code: Select all

; text source: https://www.gutenberg.org/cache/epub/5/pg5.txt

EnableExplicit
Define file.s, buffer.s, search.s

file.s = "/home/nicolas/tmp/purebasic/pg5.txt"
search = "free"


Debug "Filesize: " + StrU(FileSize(file))

If Not ReadFile(0, file,#PB_File_SharedRead | #PB_Ascii)
	DebuggerError("Could not open file: " + file)
	End 1
EndIf
buffer.s = ReadString(0, #PB_File_IgnoreEOL)
CloseFile(0)

Define time1.i, time2.i, L.i, P.i, C.i

time1 = ElapsedMilliseconds()
L = Len(buffer)
Debug "String legngth: " + StrU(L)
P = 1
While P < L
	P = FindString(buffer, search, P + 1)
	If P
		C + 1
	Else
		Break
	EndIf
Wend
Debug "Occurences: " + StrU(C)
time1 = ElapsedMilliseconds() - time1


time2 = ElapsedMilliseconds()
C = CountString(buffer, search)
time2 = ElapsedMilliseconds() - time2

MessageRequester("Times", "FindString: " + time1 + " ms" + #LF$ + "CountString: " + time2 + " ms")

Axolotl · Post by **Axolotl** » Wed Apr 23, 2025 10:29 am

I haven't had any speed problems yet. If possible the use of optional StartPosition is a speed-booster.
Depending on what you have in mind, there are other algorithms that all have their strengths and weaknesses ...... but the professionals have already started to explain this.

BTW: I like this pages for learning stuff. (However, you need to be able to read and understand C/C++, Java, Python or similar.)
Maybe it helps a bit (if you want to move in that direction.)
Tutorial to learn Data Structures and Algorithms

Or as Overview and Introduction:
Searching Algorithms

Little John · Post by **Little John** » Wed Apr 23, 2025 11:21 am

Axolotl wrote: Wed Apr 23, 2025 10:29 am Tutorial to learn Data Structures and Algorithms

I have some books about data structures and algorithms. The website mentioned above is apparently also a good source of information on the subject. Many thanks for the link!

Quin · Post by **Quin** » Wed Apr 23, 2025 2:17 pm

I don't think it's FindString() that's truly slow, as others have mentioned here it's not the fastest string algorithm out there but it's not too bad either. What is slow, at least as far as I recall last time I tested in I believe PB 6.10, StringField is horrifyingly slow.

Little John · Post by **Little John** » Wed Apr 23, 2025 2:24 pm

Quin wrote: Wed Apr 23, 2025 2:17 pm What is slow, at least as far as I recall last time I tested in I believe PB 6.10, StringField is horrifyingly slow.

The reason is, that StringField() uses Shlemiel the painter’s algorithm, see freak's corresponding message.

Randy Walker · Post by **Randy Walker** » Wed Apr 23, 2025 7:33 pm

BarryG wrote: Wed Apr 23, 2025 8:23 am
Randy Walker wrote: Wed Apr 23, 2025 2:26 amrecently I saw someone complain that "findstring" was slow
Was that an old post?

It was afairly recent pos. I don'rt know where exactly -- have to go back and review all the threads I've posted in.

... It can find a match at the end of a huge 100 MB string in less than half a second here on my old PC. Hardly what I'd call "slow".
Code: Select all
Debug FindString(Space(100*1048756)+"z","z")

Thank you. Exactly my point.

Randy Walker · Post by **Randy Walker** » Wed Apr 23, 2025 7:42 pm

wilbert wrote: Wed Apr 23, 2025 6:15 am
Randy Walker wrote: Wed Apr 23, 2025 2:26 am But recently I saw someone complain that "findstring" was slow
It depends on what you need it for.
It is especially slower as what you can achieve if you need to do multiple searches on a long string.
In the example code you posted you start FindString at a certain position but all characters prior to that position have to be scanned each time to see if there is no string termination.

It's very true. My code was not as optimized as it could be. OH!!, Did I happen to mention somewhere? "I *never* claimed to be a programmer."

PureBasic Forums - English

Is findstring really slow?

Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?

Re: Is findstring really slow?