Multi-pattern searching
Multi-pattern searching
I'm trying these days to find a set of fixed length binary patterns in a big binary memory buffer. There are approx. 10000 patterns, each of them has exactly 32 bytes in length and the main buffer I must search in is about 10mb big. Time is critical and I can't find a way to find the data quick enough (e.g under one second). I've tried everything:
- Building a CRC32 table to store fragments of the buffer and searching them
- Building a lookup table (just like QuickSearch does) and skipping n bytes
- Dumb search (awfully slow)
- Boyer-Moore and others (linear)
I know that it is possible to search even faster, but I can't create something that optimized.
Has anyone needed this in the past and has some code available or an idea? One with some brain could make an automaton like Aho-Corasick but that's not my case...
- Building a CRC32 table to store fragments of the buffer and searching them
- Building a lookup table (just like QuickSearch does) and skipping n bytes
- Dumb search (awfully slow)
- Boyer-Moore and others (linear)
I know that it is possible to search even faster, but I can't create something that optimized.
Has anyone needed this in the past and has some code available or an idea? One with some brain could make an automaton like Aho-Corasick but that's not my case...
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
- Rook Zimbabwe
- Addict
- Posts: 4322
- Joined: Tue Jan 02, 2007 8:16 pm
- Location: Cypress TX
- Contact:
I think my good friend Inf0byt3 is building a virus scanner or malware scanner and needs to look for certain signatures...
OK this may sound stupid...
After you assemble the data in the buffer have you tried sorting it into ascending order THEN searching it?
I wiggled the sort example from the help files a bit... it dies work, but tends to DROP the 0's I think
OK this may sound stupid...
After you assemble the data in the buffer have you tried sorting it into ascending order THEN searching it?
I wiggled the sort example from the help files a bit... it dies work, but tends to DROP the 0's I think
Code: Select all
; Change the number of elements here, to see the speed of the sort routine :)
;
#NbElements = 10
Dim Array.s(#NbElements)
For k = 0 To #NbElements
Array(k) = BinQ(Random(10))
Next
OpenConsole()
; Fill the array with lot of random number
;
For k=0 To #NbElements
Print(Array(k)+" ")
Next
; Sort the whole array !
;
SortArray(Array(), 0)
PrintN("")
PrintN("")
For k=0 To #NbElements
Print(Array(k)+" ")
Next
MessageRequester("Sort Example","Sort finished.",0)
Input()
End
He's going to have to cheat I think.
If this is a virus scanner then the patterns have very limited number of places in a file that they can appear in order to execute or be infectious. (ie have a pattern and a list of possible offsets) macro viruses might be an exception there but there are probably shortcuts to tell is certain doc types contain macros or not.
If this isn't a virus scanner than what is it? (A hint?) if things like quicksearch and what not are still way too slow I think you are going to need a solution based very specifically on your data and patterns.
---- an idea though ---
How many of the patterns start with the same byte(s), if you just do one pass of the file and at each byte just check the patterns that start with that byte (need the patterns hashed or something I guess) then move a byte and drop off the patterns that don't match that one too.. or something
If this is a virus scanner then the patterns have very limited number of places in a file that they can appear in order to execute or be infectious. (ie have a pattern and a list of possible offsets) macro viruses might be an exception there but there are probably shortcuts to tell is certain doc types contain macros or not.
If this isn't a virus scanner than what is it? (A hint?) if things like quicksearch and what not are still way too slow I think you are going to need a solution based very specifically on your data and patterns.
---- an idea though ---
How many of the patterns start with the same byte(s), if you just do one pass of the file and at each byte just check the patterns that start with that byte (need the patterns hashed or something I guess) then move a byte and drop off the patterns that don't match that one too.. or something
Paul Dwyer
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Hehe you got me... It's that scanning engine I'm trying to build again. I am searching for a solution for this matter for a long time (2 years) but I had moments I thought it will never be ready (though hope dies last). I've covered pretty much everything so far, except for the pattern matcher.
@Rook, Idle
A hashmap would be indeed good if I were to compare a pattern with others. The real problem is that a 10 MB file generates 10485760 such 32 bit long "32-grams". Trying to match each of these with the other 10000, even with a hashmap is terribly slow. I've tried many tricks, e.g comparing only the first and last byte or using a crc32 to minimize the scanning time. While it was really fast compared to a linear search, it was too slow for a time-critical program (e.g scanning for 500 signatures took 15 seconds in a 10 mb file).
@Paul,
I already covered that part, e.g only the important areas of an exe are scanned (the sections), for documents as well. The unneeded data is skipped automatically. The engine must eventually support 2 scanning modes, one will use hashes (done) and this one with simple pattern matching. With just hashes, it won't get too far because you can't detect families of viruses and if a byte gets changed no detection occurs.
@Rook, Idle
A hashmap would be indeed good if I were to compare a pattern with others. The real problem is that a 10 MB file generates 10485760 such 32 bit long "32-grams". Trying to match each of these with the other 10000, even with a hashmap is terribly slow. I've tried many tricks, e.g comparing only the first and last byte or using a crc32 to minimize the scanning time. While it was really fast compared to a linear search, it was too slow for a time-critical program (e.g scanning for 500 signatures took 15 seconds in a 10 mb file).
@Paul,
I already covered that part, e.g only the important areas of an exe are scanned (the sections), for documents as well. The unneeded data is skipped automatically. The engine must eventually support 2 scanning modes, one will use hashes (done) and this one with simple pattern matching. With just hashes, it won't get too far because you can't detect families of viruses and if a byte gets changed no detection occurs.
I tried something like this but it was slow because there are too many patterns.How many of the patterns start with the same byte(s), if you just do one pass of the file and at each byte just check the patterns that start with that byte (need the patterns hashed or something I guess) then move a byte and drop off the patterns that don't match that one too.. or something
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
What if the hash was 32bit made up of the first 4 bytes of the pattern. This can be compared with a long = long comparison which will be quick and which would have not too many doubleups. You would have to move through the file one byte at a time still rather than 4byte jumps but the hash wouldn't match anyware near as often and when it did the "buckets" would not contain as many items.
If 32bit turns out to be too big and you want to use arrays for very quick access then 16 bit lookups could be put in memory 65k items with doubleups and I guess plenty of gaps. How similar are the first could of bytes in the patterns? using 16bits would there be many items in each bucket?
If 32bit turns out to be too big and you want to use arrays for very quick access then 16 bit lookups could be put in memory 65k items with doubleups and I guess plenty of gaps. How similar are the first could of bytes in the patterns? using 16bits would there be many items in each bucket?
Paul Dwyer
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Interesting idea. I'll have to try this one and make some speed comparisons.
As for the number of items in a bucket, it wouldn't be that big... I can extract common signatures to have less of them.
Ideally, this algorithm should have supported variable length patterns but being able to choose the signatures manually I thought that searching for fixed patterns would be faster.
The first bytes are not very similar. And I can make sure they're even more dispersed depending on the place the signature is extracted from.How similar are the first could of bytes in the patterns? using 16bits would there be many items in each bucket?
As for the number of items in a bucket, it wouldn't be that big... I can extract common signatures to have less of them.
Ideally, this algorithm should have supported variable length patterns but being able to choose the signatures manually I thought that searching for fixed patterns would be faster.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
In a perfect world then you could have an array of 65k members with the pattern as the member (eg a struct with the byte vals) then loop through the file's bytes once and for each byte just check if that byte and the next making a 16bit number as the array index has a value, if it does then you have a pattern match. (you would do one lookup instead of 10,000 for each byte of the file) if you couldn't avoid a few double ups then you still use the index as the two byte vals but you might have a structure with several patterns in there that have the same index so you may have to do a couple of checks per byte in the file instead of just one.
On the flip side, the more double ups you have (adding comparisons and slowing things) then the more empty array members you will have so that you are flying through those bytes very fast instead, hard to guess is this would balance out or not, depends on the contents of the file I guess
On the flip side, the more double ups you have (adding comparisons and slowing things) then the more empty array members you will have so that you are flying through those bytes very fast instead, hard to guess is this would balance out or not, depends on the contents of the file I guess
Paul Dwyer
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
something like this maybe, store hashes of the crc32 from the sigs into the hashtable.
feed files to be scanned into a buffer in a thread doing a recursive directroy search.
swap pointers to buffers
mutex buffer
this assumes you use srods ***** hashtable class (thats 5 stars, not abuse
)
feed files to be scanned into a buffer in a thread doing a recursive directroy search.
swap pointers to buffers
mutex buffer
Code: Select all
for a = 1 to lenBuff - maxsize
for b = minsize to maxsize
crc = CRC32Fingerprint(*tbuff+a+b,maxsize-b)
If HashObj
HT_ITEM.HashTData
HT_ITEM\key$ = Str(crc)
If HashObj\GetItem(HT_ITEM)
;Hello
endif
endif
next
next

Thanks for the endorsement there idle,
but in this case, with out and out speed being crucial, my hash table class isn't really an option as the gains in flexibility come with the cost of a drop in speed. Fine for general applications, but not a specialised one like info's! 


I may look like a mule, but I'm not a complete ass.
speed may be an issue if he's only scanning what's already in memory but as soon the hd get involved, there won't be any penalty to speak of.srod wrote:Thanks for the endorsement there idle,but in this case, with out and out speed being crucial, my hash table class isn't really an option as the gains in flexibility come with the cost of a drop in speed. Fine for general applications, but not a specialised one like info's!
Haha unbelievable... Searching 10000 patterns in 5mb of data in 2.6 seconds. There must be a mistake somewhere I tell you.
Run this without debugger:
The code is a few posts down...
Am I dreaming or it really works?
Run this without debugger:
The code is a few posts down...
Am I dreaming or it really works?
Last edited by Inf0Byt3 on Thu Aug 21, 2008 4:44 pm, edited 1 time in total.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
- Rook Zimbabwe
- Addict
- Posts: 4322
- Joined: Tue Jan 02, 2007 8:16 pm
- Location: Cypress TX
- Contact:
Thanks! I hope it's not a mirage and it really finds that string
. Now it needs 4.4 secs for 50000 patterns in 5mb of data. It seems that the number of patterns doesn't affect it <too> much (perfect for its purpose). Thank you for all the help guys
.
I've fixed a few bits in the code. With some optimization (ASM level maybe) it could lead to some really fast searches. If anyone needs it here it is:


I've fixed a few bits in the code. With some optimization (ASM level maybe) it could lead to some really fast searches. If anyone needs it here it is:
Code: Select all
Structure Pattern
CRC32.l
Pointer.b[16] ;16 bytes of memory (a pattern's length)
EndStructure
Global PatternIndex.l
Global Dim Patterns.Pattern(0)
;Borrowed from Guimauve -> http://www.purebasic.fr/english/viewtopic.php?t=16506
;Binary Search
Procedure.l DichotomicSearch(ValeurRecherchee.l)
Position.l = -1
Fin.l = PatternIndex-1
While Debut <= Fin
Milieu = (Debut + Fin+1) >> 1
If ValeurRecherchee < Patterns(Milieu)\CRC32
Fin = Milieu - 1
ElseIf ValeurRecherchee > Patterns(Milieu)\CRC32
Debut = Milieu + 1
Else
Debut = Fin + 1
Position = Milieu
EndIf
Wend
ProcedureReturn Position
EndProcedure
Procedure AddPattern(*Pattern)
ReDim Patterns.Pattern(PatternIndex)
CopyMemory(*Pattern,@Patterns(PatternIndex)\Pointer[0],16) ;Copy the bytes in the array
Patterns(PatternIndex)\CRC32 = CRC32Fingerprint(*Pattern,16) ;The index is made with the CRC32
PatternIndex + 1
EndProcedure
Procedure.l FindPattern(*MainMem, MainLen.l)
SeekOnce = 16 ;(because we wouldn't like to peek from *MainMem+MainLen+16)
While SeekOnce <= MainLen
CRC32 = CRC32Fingerprint(*MainMem+SeekOnce,16)
Result = DichotomicSearch(CRC32)
If Result <> -1
If PeekB(*MainMem+SeekOnce) = Patterns(Result)\Pointer[0]
If CompareMemory(*MainMem+SeekOnce,@Patterns(Result)\Pointer[0],16) = 1
Break
EndIf
EndIf
EndIf
SeekOnce + 1
Wend
ProcedureReturn Result
EndProcedure
OpenConsole()
;A ~ 5mb file
LenFile = (1024*1024*5)+16 + 2
*Buff = AllocateMemory(LenFile)
For Fill = 0 To LenFile-17
PokeB(*Buff+Pointer,1+Random(254))
Pointer + 1
Next
PokeB(*Buff+Pointer,123)
Pointer + 1
PokeS(*Buff+Pointer,"alex-1234-12bx56",16) ; Add inside the "file" a pattern that we'll match
Pointer + 16
PokeB(*Buff+Pointer,123)
;Fill the table with patterns
For AddPatt = 0 To 50000
Patt.s = Left(HexQ(Random(1234567890))+HexQ(Random(1234567890))+HexQ(Random(1234567890)),16)
If Patt <> "" And Len(Patt) = 16
AddPattern(@Patt)
EndIf
Next
;Here comment this to remove the known pattern
AddPattern(@"alex-1234-12bx56") ;(+1) Adding a known pattern here
;Now sort them
SortStructuredArray(Patterns(),0,OffsetOf(Pattern\CRC32), #PB_Sort_Long)
;Start the scan
PrintN("Loaded: "+Str(PatternIndex)+" patterns")
StartTime = timeGetTime_()
Result = FindPattern(*Buff, LenFile)
PrintN ("Search took: "+Str(timeGetTime_()-StartTime)+"ms")
PrintN ("Search returned: "+Str(Result))
If Result <> -1
PrintN ("Found string: "+PeekS(@Patterns(Result)\Pointer[0],16))
Else
PrintN ("String not found")
EndIf
If *Buff
FreeMemory(*Buff)
EndIf
Input()
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
will still be faster with hashes, pm'd you the code.
table size of 10000, collisions to be expected
Loaded: 50002 patterns
Binary Search took: 2391ms
Search returned: 5777
Found string: alex-1234-12bx56
Hash Search took: 563ms
Search returned: alex-1234-12bx56
table size of 100000 less chance of collisions
Loaded: 50002 patterns
Binary Search took: 2359ms
Search returned: 5675
Found string: alex-1234-12bx56
Hash Search took: 62ms
Search returned: alex-1234-12bx56
ps don't listen to srod putting down his ***** hashtable (five stars, not abuse), he's talking four stars (****)
table size of 10000, collisions to be expected
Loaded: 50002 patterns
Binary Search took: 2391ms
Search returned: 5777
Found string: alex-1234-12bx56
Hash Search took: 563ms
Search returned: alex-1234-12bx56
table size of 100000 less chance of collisions
Loaded: 50002 patterns
Binary Search took: 2359ms
Search returned: 5675
Found string: alex-1234-12bx56
Hash Search took: 62ms
Search returned: alex-1234-12bx56
ps don't listen to srod putting down his ***** hashtable (five stars, not abuse), he's talking four stars (****)
