Multi-pattern searching

Just starting out? Need help? Post your questions and find answers here.
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Multi-pattern searching

Post by Inf0Byt3 »

I'm trying these days to find a set of fixed length binary patterns in a big binary memory buffer. There are approx. 10000 patterns, each of them has exactly 32 bytes in length and the main buffer I must search in is about 10mb big. Time is critical and I can't find a way to find the data quick enough (e.g under one second). I've tried everything:

- Building a CRC32 table to store fragments of the buffer and searching them
- Building a lookup table (just like QuickSearch does) and skipping n bytes
- Dumb search (awfully slow)
- Boyer-Moore and others (linear)

I know that it is possible to search even faster, but I can't create something that optimized.

Has anyone needed this in the past and has some code available or an idea? One with some brain could make an automaton like Aho-Corasick but that's not my case...
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

Sounds interesting, though I'm still not sure exactly what you need, by the sounds of it what you want is a hash table.
User avatar
Rook Zimbabwe
Addict
Addict
Posts: 4322
Joined: Tue Jan 02, 2007 8:16 pm
Location: Cypress TX
Contact:

Post by Rook Zimbabwe »

I think my good friend Inf0byt3 is building a virus scanner or malware scanner and needs to look for certain signatures...

OK this may sound stupid...

After you assemble the data in the buffer have you tried sorting it into ascending order THEN searching it?

I wiggled the sort example from the help files a bit... it dies work, but tends to DROP the 0's I think

Code: Select all


; Change the number of elements here, to see the speed of the sort routine :)
;
#NbElements = 10

Dim Array.s(#NbElements)

For k = 0 To #NbElements
  Array(k) = BinQ(Random(10))
Next

OpenConsole()

; Fill the array with lot of random number
;

For k=0 To #NbElements
  Print(Array(k)+" ")
Next

; Sort the whole array !
;
SortArray(Array(), 0)

PrintN("")
PrintN("")

For k=0 To #NbElements
  Print(Array(k)+" ")
Next

MessageRequester("Sort Example","Sort finished.",0)
Input()

End
Binarily speaking... it takes 10 to Tango!!!

Image
http://www.bluemesapc.com/
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

He's going to have to cheat I think.

If this is a virus scanner then the patterns have very limited number of places in a file that they can appear in order to execute or be infectious. (ie have a pattern and a list of possible offsets) macro viruses might be an exception there but there are probably shortcuts to tell is certain doc types contain macros or not.

If this isn't a virus scanner than what is it? (A hint?) if things like quicksearch and what not are still way too slow I think you are going to need a solution based very specifically on your data and patterns.

---- an idea though ---
How many of the patterns start with the same byte(s), if you just do one pass of the file and at each byte just check the patterns that start with that byte (need the patterns hashed or something I guess) then move a byte and drop off the patterns that don't match that one too.. or something
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

Hehe you got me... It's that scanning engine I'm trying to build again. I am searching for a solution for this matter for a long time (2 years) but I had moments I thought it will never be ready (though hope dies last). I've covered pretty much everything so far, except for the pattern matcher.

@Rook, Idle
A hashmap would be indeed good if I were to compare a pattern with others. The real problem is that a 10 MB file generates 10485760 such 32 bit long "32-grams". Trying to match each of these with the other 10000, even with a hashmap is terribly slow. I've tried many tricks, e.g comparing only the first and last byte or using a crc32 to minimize the scanning time. While it was really fast compared to a linear search, it was too slow for a time-critical program (e.g scanning for 500 signatures took 15 seconds in a 10 mb file).

@Paul,
I already covered that part, e.g only the important areas of an exe are scanned (the sections), for documents as well. The unneeded data is skipped automatically. The engine must eventually support 2 scanning modes, one will use hashes (done) and this one with simple pattern matching. With just hashes, it won't get too far because you can't detect families of viruses and if a byte gets changed no detection occurs.
How many of the patterns start with the same byte(s), if you just do one pass of the file and at each byte just check the patterns that start with that byte (need the patterns hashed or something I guess) then move a byte and drop off the patterns that don't match that one too.. or something
I tried something like this but it was slow because there are too many patterns.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

What if the hash was 32bit made up of the first 4 bytes of the pattern. This can be compared with a long = long comparison which will be quick and which would have not too many doubleups. You would have to move through the file one byte at a time still rather than 4byte jumps but the hash wouldn't match anyware near as often and when it did the "buckets" would not contain as many items.

If 32bit turns out to be too big and you want to use arrays for very quick access then 16 bit lookups could be put in memory 65k items with doubleups and I guess plenty of gaps. How similar are the first could of bytes in the patterns? using 16bits would there be many items in each bucket?
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

Interesting idea. I'll have to try this one and make some speed comparisons.
How similar are the first could of bytes in the patterns? using 16bits would there be many items in each bucket?
The first bytes are not very similar. And I can make sure they're even more dispersed depending on the place the signature is extracted from.

As for the number of items in a bucket, it wouldn't be that big... I can extract common signatures to have less of them.

Ideally, this algorithm should have supported variable length patterns but being able to choose the signatures manually I thought that searching for fixed patterns would be faster.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

In a perfect world then you could have an array of 65k members with the pattern as the member (eg a struct with the byte vals) then loop through the file's bytes once and for each byte just check if that byte and the next making a 16bit number as the array index has a value, if it does then you have a pattern match. (you would do one lookup instead of 10,000 for each byte of the file) if you couldn't avoid a few double ups then you still use the index as the two byte vals but you might have a structure with several patterns in there that have the same index so you may have to do a couple of checks per byte in the file instead of just one.

On the flip side, the more double ups you have (adding comparisons and slowing things) then the more empty array members you will have so that you are flying through those bytes very fast instead, hard to guess is this would balance out or not, depends on the contents of the file I guess
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

something like this maybe, store hashes of the crc32 from the sigs into the hashtable.

feed files to be scanned into a buffer in a thread doing a recursive directroy search.

swap pointers to buffers

mutex buffer

Code: Select all


for a = 1 to lenBuff - maxsize 

 for b = minsize to maxsize  
  
      crc = CRC32Fingerprint(*tbuff+a+b,maxsize-b) 
     
       If HashObj 
          HT_ITEM.HashTData
          HT_ITEM\key$ = Str(crc)
          If HashObj\GetItem(HT_ITEM)
            ;Hello
          endif 
      endif 

   next 
next 

this assumes you use srods ***** hashtable class (thats 5 stars, not abuse :lol: )
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Thanks for the endorsement there idle, :wink: but in this case, with out and out speed being crucial, my hash table class isn't really an option as the gains in flexibility come with the cost of a drop in speed. Fine for general applications, but not a specialised one like info's! :)
I may look like a mule, but I'm not a complete ass.
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

srod wrote:Thanks for the endorsement there idle, :wink: but in this case, with out and out speed being crucial, my hash table class isn't really an option as the gains in flexibility come with the cost of a drop in speed. Fine for general applications, but not a specialised one like info's! :)
speed may be an issue if he's only scanning what's already in memory but as soon the hd get involved, there won't be any penalty to speak of.
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

Haha unbelievable... Searching 10000 patterns in 5mb of data in 2.6 seconds. There must be a mistake somewhere I tell you.

Run this without debugger:

The code is a few posts down...

Am I dreaming or it really works?
Last edited by Inf0Byt3 on Thu Aug 21, 2008 4:44 pm, edited 1 time in total.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
Rook Zimbabwe
Addict
Addict
Posts: 4322
Joined: Tue Jan 02, 2007 8:16 pm
Location: Cypress TX
Contact:

Post by Rook Zimbabwe »

That is very cool bit of code there!!! :D
Binarily speaking... it takes 10 to Tango!!!

Image
http://www.bluemesapc.com/
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

Thanks! I hope it's not a mirage and it really finds that string :lol:. Now it needs 4.4 secs for 50000 patterns in 5mb of data. It seems that the number of patterns doesn't affect it <too> much (perfect for its purpose). Thank you for all the help guys 8).

I've fixed a few bits in the code. With some optimization (ASM level maybe) it could lead to some really fast searches. If anyone needs it here it is:

Code: Select all

Structure Pattern
 CRC32.l
 Pointer.b[16] ;16 bytes of memory (a pattern's length)
EndStructure

Global PatternIndex.l
Global Dim Patterns.Pattern(0)

;Borrowed from Guimauve -> http://www.purebasic.fr/english/viewtopic.php?t=16506
;Binary Search
Procedure.l DichotomicSearch(ValeurRecherchee.l)

 Position.l = -1
 Fin.l = PatternIndex-1
 
 While Debut <= Fin
  Milieu = (Debut + Fin+1) >> 1
  If ValeurRecherchee < Patterns(Milieu)\CRC32
   Fin = Milieu - 1
  ElseIf ValeurRecherchee > Patterns(Milieu)\CRC32
   Debut = Milieu + 1
  Else   
   Debut = Fin + 1
   Position = Milieu               
  EndIf
 Wend
 
 ProcedureReturn Position
 
EndProcedure

Procedure AddPattern(*Pattern)
 
 ReDim Patterns.Pattern(PatternIndex)
 CopyMemory(*Pattern,@Patterns(PatternIndex)\Pointer[0],16) ;Copy the bytes in the array   
 Patterns(PatternIndex)\CRC32 = CRC32Fingerprint(*Pattern,16) ;The index is made with the CRC32
 PatternIndex + 1
 
EndProcedure

Procedure.l FindPattern(*MainMem, MainLen.l)
 
 SeekOnce = 16 ;(because we wouldn't like to peek from *MainMem+MainLen+16)
 
 While SeekOnce <= MainLen
 
  CRC32 = CRC32Fingerprint(*MainMem+SeekOnce,16)
  Result = DichotomicSearch(CRC32)
  If Result <> -1
   If PeekB(*MainMem+SeekOnce) = Patterns(Result)\Pointer[0]
    If CompareMemory(*MainMem+SeekOnce,@Patterns(Result)\Pointer[0],16) = 1
     Break
    EndIf
   EndIf
  EndIf

  SeekOnce + 1
 
 Wend
 
 ProcedureReturn Result

EndProcedure

OpenConsole()
 
;A ~ 5mb file
LenFile = (1024*1024*5)+16 + 2
*Buff = AllocateMemory(LenFile)
For Fill = 0 To LenFile-17
 PokeB(*Buff+Pointer,1+Random(254))
 Pointer + 1
Next
PokeB(*Buff+Pointer,123)
Pointer + 1
PokeS(*Buff+Pointer,"alex-1234-12bx56",16) ; Add inside the "file" a pattern that we'll match
Pointer + 16
PokeB(*Buff+Pointer,123)

;Fill the table with patterns
For AddPatt = 0 To 50000
 Patt.s = Left(HexQ(Random(1234567890))+HexQ(Random(1234567890))+HexQ(Random(1234567890)),16)
 If Patt <> "" And Len(Patt) = 16
  AddPattern(@Patt)
 EndIf
Next

;Here comment this to remove the known pattern
AddPattern(@"alex-1234-12bx56") ;(+1) Adding a known pattern here

;Now sort them
SortStructuredArray(Patterns(),0,OffsetOf(Pattern\CRC32), #PB_Sort_Long)

;Start the scan
PrintN("Loaded: "+Str(PatternIndex)+" patterns")
StartTime = timeGetTime_()
Result = FindPattern(*Buff, LenFile)
PrintN ("Search took: "+Str(timeGetTime_()-StartTime)+"ms")
PrintN ("Search returned: "+Str(Result))
If Result <> -1
 PrintN ("Found string: "+PeekS(@Patterns(Result)\Pointer[0],16))
Else
 PrintN ("String not found")
EndIf
If *Buff
 FreeMemory(*Buff)
EndIf
Input()
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

will still be faster with hashes, pm'd you the code.

table size of 10000, collisions to be expected

Loaded: 50002 patterns
Binary Search took: 2391ms
Search returned: 5777
Found string: alex-1234-12bx56
Hash Search took: 563ms
Search returned: alex-1234-12bx56

table size of 100000 less chance of collisions

Loaded: 50002 patterns
Binary Search took: 2359ms
Search returned: 5675
Found string: alex-1234-12bx56
Hash Search took: 62ms
Search returned: alex-1234-12bx56


ps don't listen to srod putting down his ***** hashtable (five stars, not abuse), he's talking four stars (****) :wink:
Post Reply