Page 2 of 2

Posted: Fri Aug 22, 2008 11:29 am
by Inf0Byt3
Excellent! Thank you for the help.

Never thought a piece of code would make me so happy :P .

Posted: Fri Aug 22, 2008 11:46 am
by srod
idle wrote:ps don't listen to srod putting down his ***** hashtable (five stars, not abuse), he's talking four stars (****) :wink:
:lol:

I always talk **** lad; quality **** mind, not any old ****. That's the problem with being full of ****! :)

Those are some impressive timings there. Wow.

Posted: Fri Aug 22, 2008 11:48 am
by idle
Now srod just has to come back and accept his five starts.

opps simultaneous post :lol:

Posted: Fri Aug 22, 2008 12:15 pm
by pdwyer
I'm not really seeing why you need to hash in this case, or use crc's. Could you describe how you did your tests (like what you used for patterns etc) so I can give my idea a go. If the idea I have is slower I would like to understand why for my own learning purposes.

Posted: Fri Aug 22, 2008 12:56 pm
by Inf0Byt3
I used hashes so I can make the upper algorithm (e.g the one who compares the patterns with a piece of code in the buffer) to skip bytes. Comparing one-on-one in that much data is too slow. The hash, being a number is easy to spot with hash-maps or binary search.

About the test, I used random patterns made from HexQ(Random(number)), all of them 16 bytes long. I added about 50000 of them and a known one as well. When added, a CRC is calculated for each pattern. Then I enumerated the bytes from the main memory (just like a sliding window), calculated crc32 for each of the pieces and used binary search/hash map to spot that crc in the already-built list. If it is there, we got a match, if not we continue to hash and check.

I hope I explained it correctly.

Posted: Fri Aug 22, 2008 1:51 pm
by idle
I'm a bit tired so forgive the lecture style talk.

Hashing is a mostly O(1) constant operation and at the worst case O(n) Binary search is log2(n) it's quite a big difference.

As long as you've got the right mix, table structure, size and hash function it's just like looking up an index in an array except you don't need to know the index, there is no searching as such.

In the case of exact pattern matching, you have a couple of choices with a hash table, you can directly key the pattern into the hash table (could be variable size may be large, and may take a while to compare ) or you could just key a unique hash of the pattern into the table with either a crc32 or md5 and store a index along with the element in the hash structure that points to the index of the pattern in an external array. Then its just a case of finding a suitable hash key function.

Posted: Fri Aug 22, 2008 1:55 pm
by pdwyer
ok, I whipped up a quick and dirty version of what I was talking about. Not sure how it compares as I'm not sure I'm comparing apples to apples here. The file is a copy of shell32.dll (8.2mb) there are 10,000 random patterns with that weird text string of yours in at number 5000 (but it checks all of them them in the whole file anyway to test speed.

With debugger on it's 1.4s with debugger off it's 0.17sec

Anyones guess as to whether I'm doing the right thing though. :P

Code: Select all

Structure ByteWord
    StructureUnion 
        B.c[2]
        W.l  ;words are signed :(
    EndStructureUnion 
EndStructure

Structure Patterns
    Count.l
    Pattern.s[10]
EndStructure

Structure MemoryArray 
    Byte.c[0] 
EndStructure 

#PatternCount = 10000
Dim RawPatterns.s(#PatternCount)
Dim StructPtn.Patterns(66000)

;create some 32byte randome patterns
For i = 1 To #PatternCount
    For j = 1 To 32
        RawPatterns(i) = RawPatterns(i) + Chr(Random(254) + 1)  ;don't want nulls for the test
    Next
Next

;Set test pattern 
rawpatterns(5000) = "alex-1234-12bx56"

;Load Into search structure. (speed not so important here so I used the asc(mid()) rather than pointers as its quicker to write
For i = 1 To #PatternCount 
    Idx.ByteWord
    Idx\b[0] = Asc(Mid(RawPatterns(i),1,1))
    Idx\b[1] = Asc(Mid(RawPatterns(i),2,1))
    ;check bucket overflow 
    If StructPtn(Idx\w)\count > 8
        Debug "Bucket Full"
    Else
        StructPtn(Idx\w)\count = StructPtn(Idx\w)\count + 1
        StructPtn(Idx\w)\Pattern[StructPtn(Idx\w)\count] = RawPatterns(i)
    EndIf    
Next

;Load a File of a few meg (shell32.dll = >8mb)
OpenFile(0,"C:\WINDOWS\ServicePackFiles\i386\shell32.dll")
    FileLen.l = Lof(0)
    *FileData = AllocateMemory(FileLen)
    ReadData(0,*Filedata,FileLen)
CloseFile(0)

;Add search string a fair way through the file
PokeS(*Filedata + 5000 ,"alex-1234-12bx56")

;Search
*FileByteArray.MemoryArray = *Filedata

Time1 = ElapsedMilliseconds()

For i = 1 To FileLen -32

    ;check for index match
    Idx\b[0] = *FileByteArray\byte[i]
    Idx\b[1] = *FileByteArray\byte[i+1]
    
    For j = 1 To StructPtn(Idx\w)\count
        If CompareMemory(@StructPtn(Idx\w)\Pattern[j],*Filedata + i,16)
            Debug i
        EndIf
    Next
Next

Debug "Done"
;Debug Str(ElapsedMilliseconds() - time1)
MessageRequester("", Str(ElapsedMilliseconds() - time1))

Posted: Fri Aug 22, 2008 2:43 pm
by Inf0Byt3
:shock:

Interesting approach :).

I need to have a closer look.

Posted: Fri Aug 22, 2008 5:57 pm
by Rook Zimbabwe
Oy! Paul... that is brain stretching me!!! 8)

Posted: Sat Aug 23, 2008 3:28 am
by pdwyer
Sorry about that, the code turned out to be very hard to read :( Personally, when other's post code like this my eyes glaze over! It started out as a not-so-complex idea that got kind of convoluted in the implementation but it felt good to actually realise the idea in code (although in the end idle's hash was probably faster).

The main "dirty" part of this little snippet is the hard coded bucket size (10) which I just pulled out of thin air, I chucked a check on it's overflow as I really had no idea if it would be big enough, although it seems to be for this example. If you had a pattern list that was planned to expand you'd want some kind of dynamic list there. 66k is hardcoded to be higher than a 16bit var, that's kind of ugly, and 16 in the comparison is just to match the size of the string we are looking for, that could be a dynamic length in the structure too I guess if all patterns are different lengths

The simple point is that the first twp bytes of the pattern are the array index so there's no looping or searching there, the bytes you are checking become the lookup for the pattern.

Posted: Sat Aug 23, 2008 4:09 am
by idle
pdwyer wrote:
The simple point is that the first twp bytes of the pattern are the array index so there's no looping or searching there, the bytes you are checking become the lookup for the pattern.
I can see that working, though in the cases of a pattern having the same leading bytes you'd run into complications. I suppose thats where Associative array implimentaions take over from look up tables
http://en.wikipedia.org/wiki/Associative_array

Posted: Sat Aug 23, 2008 4:15 am
by pdwyer
In this case I can handle up to 10 collisions in any one index but yes, in production code you would want to have something a little more elastic

Posted: Sat Aug 23, 2008 4:23 am
by idle
I hadn't noticed that :oops:
I've still got the flu and am well off the pace today.