Page 1 of 2

Find an string in a linked list, two ways one good one BAD!!

Posted: Mon Jul 17, 2006 4:36 pm
by Jan Vooijs
Code updated for 5.20+

Hello,

I would like to share my findings into how to find an item in a linked list, in this case a certain string. This is part of a much larger program, but simple to understand. In my app i have a list of known spam adresses (12000 of them) and want to find if the email i recieve is from a spammer.

Look at this code run for yourself and see the difference. One warning do NOT try with more than 10000 since the 'wrong' for-next loop takes ages to run.

Notice i have had to 'slow down' the right way 520 times!!

Here is my code contribution:

Code: Select all

EnableExplicit

NewList slTest.s()

Define Seek.s

Define X.l
Define Count.l
Define Idx.l
Define T1.l, T2.l

Seek.s = "xxtest"

For x = 1 To 10000
   AddElement( slTest.s())
   slTest.s() = "test+" + Str( x)
Next

; this is index number 10000 and not 10001 since we start from ZERO!!
AddElement( slTest())
slTest() = Seek.s

OpenConsole()

; Try to find the "xxtest" in the linkedlist and how long it takes, version 1 for-next loop...
PrintN( "=Start = " + Seek.s)

Idx. l = -1
T1.l = ElapsedMilliseconds()
Count.l = ListSize( slTest.s())
For Idx.l = 0 To Count.l
   SelectElement( slTest.s(), Idx.l)
   If slTest.s() = Seek.s
      Break
   EndIf
Next
T2.l = ElapsedMilliseconds()
    
If Idx.l > -1
   PrintN( "=Found in: " + Str( T2 - T1) + "ms")
   PrintN( "At index: "+ Str( Idx.l))
EndIf


; Now the ForEach methode. It is SOO fast had to slow it down 520 times..
Idx. l = -1
T1.l = ElapsedMilliseconds()
For x = 1 To 520
   ForEach slTest.s()
      If slTest.s() = Seek.s
         Idx.l = ListIndex( slTest.s())
         Break
      EndIf
   Next
Next
T2.l = ElapsedMilliseconds()
    
If Idx.l > -1
   PrintN( "=Found in: " + Str( T2 - T1) + "ms")
   PrintN( "At index: "+ Str( Idx.l))
Else
   PrintN( "!!NOT!! Found in: " + Str( T2 - T1) + "ms")
EndIf

Input()

CloseConsole()

End
Have fun with it, oh by the way my app runs now so fast (yes 100 times faster)!!

Jan V.

Posted: Mon Jul 17, 2006 5:21 pm
by Killswitch
Christ that's a dramatic speed improvement - I'm a bit dubious as to how it manages to do it this quickly though?

Posted: Mon Jul 17, 2006 5:28 pm
by srod
That certainly is fast! :)

It's not that the 'fast method' is fast as such, but that the first method, relying on SelectElement(), is notoriously slow! :) This is because (I think!) that when you use SelectElement() to say, move to the 10th element, PB first iterates through the first 9 elements to get there etc!

Posted: Mon Jul 17, 2006 5:44 pm
by Jan Vooijs
srod wrote:That certainly is fast! :)

It's not that the 'fast method' is fast as such, but that the first method, relying on SelectElement(), is notoriously slow! :) This is because (I think!) that when you use SelectElement() to say, move to the 10th element, PB first iterates through the first 9 elements to get there etc!
@Srod,
Oops forget that but there is a warning in my above post! It certainly looks likes that because if your run the for-next loop version with 100000 (100k) it takes for EVER to run (ahem 179 secs against 486ms), i inserted an 'every 10k blib message' and it runs slower and slower every next 10k items in the list, so yes SelectElemnt() traves every time from the start, tough luck if you need the last item.

First I did not notice the lack in performance since i have so many spam adresses i figured oke it is slow. But then converted to test and was amazed!!

Jan V.

Posted: Mon Jul 17, 2006 7:08 pm
by Killswitch
However it works (I'm still hanning onto the "More of Fred's Witchcraft" explination :P) I'm so glad you posted it, it's come at exactly the right time for me!

Posted: Mon Jul 17, 2006 7:47 pm
by Trond
Turn the first loop into this and you'll see it keep up without problems:

Code: Select all

Idx. l = -1 
T1.l = ElapsedMilliseconds() 
Count.l = CountList( slTest.s()) 
FirstElement(slTest())
For Idx.l = 0 To Count.l 
   If slTest.s() = Seek.s 
      Break 
   EndIf 
   NextElement(slTest())
Next 
T2.l = ElapsedMilliseconds()

Posted: Mon Jul 17, 2006 8:02 pm
by dioxin
I'm curious. Why would you use a linked list for this task? Surely an array is more appropriate and would be a lot faster? You can probably scan a 100,000 element array in under 10ms compared to your current linked list search of 480ms.


Of course, if you want real speed then you'd use an entirely different approach such as a sorted array searched with a binary chop (slow to insert new elements but fast to find them) or a binary tree structure (fast to insert and search but more complex to program) both of which I'd expect you to be able to scan in under 10usec.

Paul.

Posted: Mon Jul 17, 2006 9:49 pm
by Joakim Christiansen
It will jump compulsory to every element until the target position is reached
That is why SelectElement() is slow!

Posted: Tue Jul 18, 2006 2:36 am
by Jan Vooijs
dioxin wrote:I'm curious. Why would you use a linked list for this task? Surely an array is more appropriate and would be a lot faster? You can probably scan a 100,000 element array in under 10ms compared to your current linked list search of 480ms.


Of course, if you want real speed then you'd use an entirely different approach such as a sorted array searched with a binary chop (slow to insert new elements but fast to find them) or a binary tree structure (fast to insert and search but more complex to program) both of which I'd expect you to be able to scan in under 10usec.

Paul.
Yes and NO, if you read cairfully I SLOWED the ForEach loop DOWN because it is so fast i could not time it!! It runs on 100000 (100k) in just 16ms (on averidge) not bad for a 100K scan. I did not try an array this is fast enough for the future!!

Why linked list? Easy to progam, sorting is quick (and very easy). Think of it as a large stack of cards. While in an array one has to be cairfull not to add 'out of bounds'! Look at how easy i fill the database with 10000 (10k) strings, four lines of code!!

Jan V.

(edit an typo it was 4 lines and not three!!)

Posted: Tue Jul 18, 2006 2:39 am
by Jan Vooijs
Trond wrote:Turn the first loop into this and you'll see it keep up without problems:

Code: Select all

Idx. l = -1 
T1.l = ElapsedMilliseconds() 
Count.l = CountList( slTest.s()) 
FirstElement(slTest())
For Idx.l = 0 To Count.l 
   If slTest.s() = Seek.s 
      Break 
   EndIf 
   NextElement(slTest())
Next 
T2.l = ElapsedMilliseconds()
You are right it is as fast as the second methode. But because the ForEach looks more 'nicer' i stick to that one.

As this proofs you can do one thing in more than two ways!!

Thanks Trond

Jan V.

Posted: Wed Jul 19, 2006 1:22 am
by sverson
@Jan Vooijs: If you want it really fast try my X-Tree.pbi...

It's quite slow when you add elements - but it's just in time sorted.
But XTree_FindKey is verry fast!!!
500.000 records - max 19 steps to find the key!

You need X-Tree.pbi (beta) to compile the Demo.

Code: Select all

#ListSize = 500000

XIncludeFile "X-Tree.pbi"

Global Dim SourceList.s(#ListSize)
Global NewList LinkedList.s()
Global *XTree.XTree_RN
Global StrToFind.s

Procedure.s FillSourceList() ;/ Same source for LinkedList and X-Tree
  Protected ListPos.l, KeyStr.s, KeyPos.l, EMS.l
  EMS = ElapsedMilliseconds()
  For ListPos = 1 To #ListSize
    KeyStr=""
    For KeyPos=1 To 10
      KeyStr + RSet(Hex(Random($FF)),2,"0")
    Next
    SourceList(ListPos) = KeyStr
  Next
  ProcedureReturn "SourceList(): "+Str(#ListSize)+" Elements stored in " +Str(ElapsedMilliseconds()-EMS)+"ms."
EndProcedure

Procedure.s FillLinkedList()
  Protected ListPos.l, EMS.l
  EMS = ElapsedMilliseconds()
  For ListPos = 1 To #ListSize
    AddElement(LinkedList())
    LinkedList() = SourceList(ListPos)
  Next
  ProcedureReturn "LinkedList(): "+Str(#ListSize)+" Elements stored in " +Str(ElapsedMilliseconds()-EMS)+"ms."
EndProcedure

Procedure.s FillXTree()
  Protected ListPos.l, EMS.l
  EMS = ElapsedMilliseconds()
  *XTree.XTree_RN = XTree_NewTree(10)
  For ListPos = 1 To #ListSize
    XTree_NewKey(*XTree,SourceList(ListPos))
  Next
  ProcedureReturn "XTree: "+Str(#ListSize)+" Elements stored in " +Str(ElapsedMilliseconds()-EMS)+"ms."
EndProcedure

Procedure.s FindLinkedList(KeyStr.s)
  Protected EMS.l
  EMS = ElapsedMilliseconds()
  ForEach LinkedList()
    If LinkedList() = KeyStr
      ProcedureReturn "LinkedList(): '"+KeyStr+"' found after " +Str(ElapsedMilliseconds()-EMS)+"ms."
    EndIf
  Next
  ProcedureReturn "LinkedList(): '"+KeyStr+"' NOT found after " +Str(ElapsedMilliseconds()-EMS)+"ms."
EndProcedure
  
Procedure.s FindXTree(KeyStr.s)
  Protected EMS.l
  EMS = ElapsedMilliseconds()
  If XTree_FindKey(*XTree,KeyStr)
    ProcedureReturn "XTree: '"+KeyStr+"' found after " +Str(ElapsedMilliseconds()-EMS)+"ms."
  Else
    ProcedureReturn "XTree: '"+KeyStr+"' NOT found after " +Str(ElapsedMilliseconds()-EMS)+"ms."
  EndIf
EndProcedure


Debug FillSourceList()
Debug FillLinkedList()
Debug FillXTree()

For x=1 To 20
  Debug "___________________________"
  StrToFind = SourceList(Random(#ListSize))
  If Not Random(3) : StrToFind+" ERROR!" : EndIf
  Debug FindLinkedList(StrToFind)
  Debug FindXTree(StrToFind)
  Debug "¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯"
Next
;-) sverson

Posted: Wed Jul 19, 2006 1:42 am
by Xombie
My red-black tree stuff is still hiding out here somewhere on the forum. At the time, it took roughly 4.5 seconds to add 2,000,000 random elements and somewhere around 0 to find a key from those 2 million records. It was actually one of my first tidbits I released on the forum :) I should optimize that sucker....

Posted: Wed Jul 19, 2006 1:56 am
by Jan Vooijs
sverson wrote:@Jan Vooijs: If you want it really fast try my X-Tree.pbi...

... SNIP EDIT ...

;-) sverson
I will look into it but seeing the code the first time i'am afriad my code needs a rewrite to use yours, maybe if i run into perfomance problems i will use it. One thing if i remember correctly a X-tree it not so fast if there are a lot off 'almost similair' entries. I have +350 Hotmail accounts sending spam (among the 18000+ entries).

My current version (the ForEach-methode) looks in under 15ms through the current database of 18000+ adresses, which in my opinion is very fast

And until i hit the 10.000.000 (10 million mark i find it in just under 1200ms). And 1 million is 'just' 93ms. In the last 3 years i build the spam adresses from zero to hero, pardon to 18000+ so i expect to run a long time with the ForEach-methode.

An other idea is to use an "index", every letter has a starting point in the linked list since it is sorted, so switching to that first index an search from there will likely speed it up. Or better making a linked list for every letter and digit (ouch)...

But thanks for the sugestion.

And by the way found a third version which is also very fast:

Code: Select all

Idx. l = -1
T1.l = ElapsedMilliseconds()
ResetList( slTest.s())
While NextElement(slTest())
   If slTest.s() = Seek.s
      Break
   EndIf
Wend
T2.l = ElapsedMilliseconds()
The trick is the "ResetList()' and 'NextElement()'. It is slightly modified from the help-file where i found it today. And yes as fast (sometimes 1ms faster) then my ForEach-methode. But i find that ForEach so elegant i stick to it (for now).

But than this post was more of showing "how to kiil a bird with more than one stone" (example not that I would kill a bird, heavens!!) ;)

Jan V.

Posted: Wed Jul 19, 2006 2:01 am
by Jan Vooijs
Xombie wrote:My red-black tree stuff is still hiding out here somewhere on the forum. At the time, it took roughly 4.5 seconds to add 2,000,000 random elements and somewhere around 0 to find a key from those 2 million records. It was actually one of my first tidbits I released on the forum :) I should optimize that sucker....
Xombie is XSub not a program of yours? Found it somewhere and looks promessing for subtitleing some TV shows I have.

And again if i hit a perfomance barrier i look into yours code too.

Thanks,

Jan V.

Posted: Wed Jul 19, 2006 2:38 am
by sverson
Xombie wrote:At the time, it took roughly 4.5 seconds to add 2,000,000 random elements....
@Xombie:
On my PC your Red/Black Tree took 25 seconds to add 500.000 elements.
I could not test the lookop function because RBTree does not support string-keys.
XTree added 500.000 sorted string elements in 17,5 seconds. ;-)
Debugger wrote:SourceList(): 500000 Elements stored in 2516ms.
LinkedList(): 500000 Elements stored in 390ms.
XTree: 500000 Elements stored in 17438ms.
RBTree: 500000 Elements stored in 25031ms.
___________________________
LinkedList(): '3855BDC7418D8C00560A ERROR!' NOT found after 63ms.
XTree: '3855BDC7418D8C00560A ERROR!' NOT found after 0ms.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
___________________________
LinkedList(): '60D2EACF61D94470E73F' found after 15ms.
XTree: '60D2EACF61D94470E73F' found after 0ms.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
___________________________
LinkedList(): 'DDDBF81FEB5641553150' found after 63ms.
XTree: 'DDDBF81FEB5641553150' found after 0ms.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
___________________________
LinkedList(): '3181D1212E5100ECC98E ERROR!' NOT found after 62ms.
XTree: '3181D1212E5100ECC98E ERROR!' NOT found after 0ms.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
... ...
___________________________
LinkedList(): 'BDDD3EE1B29A21F264A5' found after 31ms.
XTree: 'BDDD3EE1B29A21F264A5' found after 0ms.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
___________________________
LinkedList(): '3A1F49558F907763DD08' found after 63ms.
XTree: '3A1F49558F907763DD08' found after 0ms.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Code: Select all

Procedure.s FillRBTree()
  Protected ListPos.l, EMS.l
  EMS = ElapsedMilliseconds()
  *RBTree = rbtCreate()
  For ListPos = 1 To #ListSize
    rbtAdd(*RBTree,ListPos,SourceList(ListPos))
  Next
  ProcedureReturn "RBTree: "+Str(#ListSize)+" Elements stored in " +Str(ElapsedMilliseconds()-EMS)+"ms."
EndProcedure
;-) sverson