Multi-pattern searching

Just starting out? Need help? Post your questions and find answers here.
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

Excellent! Thank you for the help.

Never thought a piece of code would make me so happy :P .
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

idle wrote:ps don't listen to srod putting down his ***** hashtable (five stars, not abuse), he's talking four stars (****) :wink:
:lol:

I always talk **** lad; quality **** mind, not any old ****. That's the problem with being full of ****! :)

Those are some impressive timings there. Wow.
I may look like a mule, but I'm not a complete ass.
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

Now srod just has to come back and accept his five starts.

opps simultaneous post :lol:
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

I'm not really seeing why you need to hash in this case, or use crc's. Could you describe how you did your tests (like what you used for patterns etc) so I can give my idea a go. If the idea I have is slower I would like to understand why for my own learning purposes.
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

I used hashes so I can make the upper algorithm (e.g the one who compares the patterns with a piece of code in the buffer) to skip bytes. Comparing one-on-one in that much data is too slow. The hash, being a number is easy to spot with hash-maps or binary search.

About the test, I used random patterns made from HexQ(Random(number)), all of them 16 bytes long. I added about 50000 of them and a known one as well. When added, a CRC is calculated for each pattern. Then I enumerated the bytes from the main memory (just like a sliding window), calculated crc32 for each of the pieces and used binary search/hash map to spot that crc in the already-built list. If it is there, we got a match, if not we continue to hash and check.

I hope I explained it correctly.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

I'm a bit tired so forgive the lecture style talk.

Hashing is a mostly O(1) constant operation and at the worst case O(n) Binary search is log2(n) it's quite a big difference.

As long as you've got the right mix, table structure, size and hash function it's just like looking up an index in an array except you don't need to know the index, there is no searching as such.

In the case of exact pattern matching, you have a couple of choices with a hash table, you can directly key the pattern into the hash table (could be variable size may be large, and may take a while to compare ) or you could just key a unique hash of the pattern into the table with either a crc32 or md5 and store a index along with the element in the hash structure that points to the index of the pattern in an external array. Then its just a case of finding a suitable hash key function.
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

ok, I whipped up a quick and dirty version of what I was talking about. Not sure how it compares as I'm not sure I'm comparing apples to apples here. The file is a copy of shell32.dll (8.2mb) there are 10,000 random patterns with that weird text string of yours in at number 5000 (but it checks all of them them in the whole file anyway to test speed.

With debugger on it's 1.4s with debugger off it's 0.17sec

Anyones guess as to whether I'm doing the right thing though. :P

Code: Select all

Structure ByteWord
    StructureUnion 
        B.c[2]
        W.l  ;words are signed :(
    EndStructureUnion 
EndStructure

Structure Patterns
    Count.l
    Pattern.s[10]
EndStructure

Structure MemoryArray 
    Byte.c[0] 
EndStructure 

#PatternCount = 10000
Dim RawPatterns.s(#PatternCount)
Dim StructPtn.Patterns(66000)

;create some 32byte randome patterns
For i = 1 To #PatternCount
    For j = 1 To 32
        RawPatterns(i) = RawPatterns(i) + Chr(Random(254) + 1)  ;don't want nulls for the test
    Next
Next

;Set test pattern 
rawpatterns(5000) = "alex-1234-12bx56"

;Load Into search structure. (speed not so important here so I used the asc(mid()) rather than pointers as its quicker to write
For i = 1 To #PatternCount 
    Idx.ByteWord
    Idx\b[0] = Asc(Mid(RawPatterns(i),1,1))
    Idx\b[1] = Asc(Mid(RawPatterns(i),2,1))
    ;check bucket overflow 
    If StructPtn(Idx\w)\count > 8
        Debug "Bucket Full"
    Else
        StructPtn(Idx\w)\count = StructPtn(Idx\w)\count + 1
        StructPtn(Idx\w)\Pattern[StructPtn(Idx\w)\count] = RawPatterns(i)
    EndIf    
Next

;Load a File of a few meg (shell32.dll = >8mb)
OpenFile(0,"C:\WINDOWS\ServicePackFiles\i386\shell32.dll")
    FileLen.l = Lof(0)
    *FileData = AllocateMemory(FileLen)
    ReadData(0,*Filedata,FileLen)
CloseFile(0)

;Add search string a fair way through the file
PokeS(*Filedata + 5000 ,"alex-1234-12bx56")

;Search
*FileByteArray.MemoryArray = *Filedata

Time1 = ElapsedMilliseconds()

For i = 1 To FileLen -32

    ;check for index match
    Idx\b[0] = *FileByteArray\byte[i]
    Idx\b[1] = *FileByteArray\byte[i+1]
    
    For j = 1 To StructPtn(Idx\w)\count
        If CompareMemory(@StructPtn(Idx\w)\Pattern[j],*Filedata + i,16)
            Debug i
        EndIf
    Next
Next

Debug "Done"
;Debug Str(ElapsedMilliseconds() - time1)
MessageRequester("", Str(ElapsedMilliseconds() - time1))
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Inf0Byt3
PureBasic Fanatic
PureBasic Fanatic
Posts: 2236
Joined: Fri Dec 09, 2005 12:15 pm
Location: Elbonia

Post by Inf0Byt3 »

:shock:

Interesting approach :).

I need to have a closer look.
None are more hopelessly enslaved than those who falsely believe they are free. (Goethe)
User avatar
Rook Zimbabwe
Addict
Addict
Posts: 4322
Joined: Tue Jan 02, 2007 8:16 pm
Location: Cypress TX
Contact:

Post by Rook Zimbabwe »

Oy! Paul... that is brain stretching me!!! 8)
Binarily speaking... it takes 10 to Tango!!!

Image
http://www.bluemesapc.com/
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

Sorry about that, the code turned out to be very hard to read :( Personally, when other's post code like this my eyes glaze over! It started out as a not-so-complex idea that got kind of convoluted in the implementation but it felt good to actually realise the idea in code (although in the end idle's hash was probably faster).

The main "dirty" part of this little snippet is the hard coded bucket size (10) which I just pulled out of thin air, I chucked a check on it's overflow as I really had no idea if it would be big enough, although it seems to be for this example. If you had a pattern list that was planned to expand you'd want some kind of dynamic list there. 66k is hardcoded to be higher than a 16bit var, that's kind of ugly, and 16 in the comparison is just to match the size of the string we are looking for, that could be a dynamic length in the structure too I guess if all patterns are different lengths

The simple point is that the first twp bytes of the pattern are the array index so there's no looping or searching there, the bytes you are checking become the lookup for the pattern.
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

pdwyer wrote:
The simple point is that the first twp bytes of the pattern are the array index so there's no looping or searching there, the bytes you are checking become the lookup for the pattern.
I can see that working, though in the cases of a pattern having the same leading bytes you'd run into complications. I suppose thats where Associative array implimentaions take over from look up tables
http://en.wikipedia.org/wiki/Associative_array
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

In this case I can handle up to 10 collisions in any one index but yes, in production code you would want to have something a little more elastic
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
idle
Always Here
Always Here
Posts: 5897
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

I hadn't noticed that :oops:
I've still got the flu and am well off the pace today.
Post Reply