Page 2 of 2
Posted: Fri Aug 22, 2008 11:29 am
by Inf0Byt3
Excellent! Thank you for the help.
Never thought a piece of code would make me so happy

.
Posted: Fri Aug 22, 2008 11:46 am
by srod
idle wrote:ps don't listen to srod putting down his ***** hashtable (five stars, not abuse), he's talking four stars (****)

I always talk **** lad; quality **** mind, not any old ****. That's the problem with being full of ****!
Those are some impressive timings there. Wow.
Posted: Fri Aug 22, 2008 11:48 am
by idle
Now srod just has to come back and accept his five starts.
opps simultaneous post

Posted: Fri Aug 22, 2008 12:15 pm
by pdwyer
I'm not really seeing why you need to hash in this case, or use crc's. Could you describe how you did your tests (like what you used for patterns etc) so I can give my idea a go. If the idea I have is slower I would like to understand why for my own learning purposes.
Posted: Fri Aug 22, 2008 12:56 pm
by Inf0Byt3
I used hashes so I can make the upper algorithm (e.g the one who compares the patterns with a piece of code in the buffer) to skip bytes. Comparing one-on-one in that much data is too slow. The hash, being a number is easy to spot with hash-maps or binary search.
About the test, I used random patterns made from HexQ(Random(number)), all of them 16 bytes long. I added about 50000 of them and a known one as well. When added, a CRC is calculated for each pattern. Then I enumerated the bytes from the main memory (just like a sliding window), calculated crc32 for each of the pieces and used binary search/hash map to spot that crc in the already-built list. If it is there, we got a match, if not we continue to hash and check.
I hope I explained it correctly.
Posted: Fri Aug 22, 2008 1:51 pm
by idle
I'm a bit tired so forgive the lecture style talk.
Hashing is a mostly O(1) constant operation and at the worst case O(n) Binary search is log2(n) it's quite a big difference.
As long as you've got the right mix, table structure, size and hash function it's just like looking up an index in an array except you don't need to know the index, there is no searching as such.
In the case of exact pattern matching, you have a couple of choices with a hash table, you can directly key the pattern into the hash table (could be variable size may be large, and may take a while to compare ) or you could just key a unique hash of the pattern into the table with either a crc32 or md5 and store a index along with the element in the hash structure that points to the index of the pattern in an external array. Then its just a case of finding a suitable hash key function.
Posted: Fri Aug 22, 2008 1:55 pm
by pdwyer
ok, I whipped up a quick and dirty version of what I was talking about. Not sure how it compares as I'm not sure I'm comparing apples to apples here. The file is a copy of shell32.dll (8.2mb) there are 10,000 random patterns with that weird text string of yours in at number 5000 (but it checks all of them them in the whole file anyway to test speed.
With debugger on it's 1.4s with debugger off it's 0.17sec
Anyones guess as to whether I'm doing the right thing though.
Code: Select all
Structure ByteWord
StructureUnion
B.c[2]
W.l ;words are signed :(
EndStructureUnion
EndStructure
Structure Patterns
Count.l
Pattern.s[10]
EndStructure
Structure MemoryArray
Byte.c[0]
EndStructure
#PatternCount = 10000
Dim RawPatterns.s(#PatternCount)
Dim StructPtn.Patterns(66000)
;create some 32byte randome patterns
For i = 1 To #PatternCount
For j = 1 To 32
RawPatterns(i) = RawPatterns(i) + Chr(Random(254) + 1) ;don't want nulls for the test
Next
Next
;Set test pattern
rawpatterns(5000) = "alex-1234-12bx56"
;Load Into search structure. (speed not so important here so I used the asc(mid()) rather than pointers as its quicker to write
For i = 1 To #PatternCount
Idx.ByteWord
Idx\b[0] = Asc(Mid(RawPatterns(i),1,1))
Idx\b[1] = Asc(Mid(RawPatterns(i),2,1))
;check bucket overflow
If StructPtn(Idx\w)\count > 8
Debug "Bucket Full"
Else
StructPtn(Idx\w)\count = StructPtn(Idx\w)\count + 1
StructPtn(Idx\w)\Pattern[StructPtn(Idx\w)\count] = RawPatterns(i)
EndIf
Next
;Load a File of a few meg (shell32.dll = >8mb)
OpenFile(0,"C:\WINDOWS\ServicePackFiles\i386\shell32.dll")
FileLen.l = Lof(0)
*FileData = AllocateMemory(FileLen)
ReadData(0,*Filedata,FileLen)
CloseFile(0)
;Add search string a fair way through the file
PokeS(*Filedata + 5000 ,"alex-1234-12bx56")
;Search
*FileByteArray.MemoryArray = *Filedata
Time1 = ElapsedMilliseconds()
For i = 1 To FileLen -32
;check for index match
Idx\b[0] = *FileByteArray\byte[i]
Idx\b[1] = *FileByteArray\byte[i+1]
For j = 1 To StructPtn(Idx\w)\count
If CompareMemory(@StructPtn(Idx\w)\Pattern[j],*Filedata + i,16)
Debug i
EndIf
Next
Next
Debug "Done"
;Debug Str(ElapsedMilliseconds() - time1)
MessageRequester("", Str(ElapsedMilliseconds() - time1))
Posted: Fri Aug 22, 2008 2:43 pm
by Inf0Byt3
Interesting approach

.
I need to have a closer look.
Posted: Fri Aug 22, 2008 5:57 pm
by Rook Zimbabwe
Oy! Paul... that is brain stretching me!!!

Posted: Sat Aug 23, 2008 3:28 am
by pdwyer
Sorry about that, the code turned out to be very hard to read

Personally, when other's post code like this my eyes glaze over! It started out as a not-so-complex idea that got kind of convoluted in the implementation but it felt good to actually realise the idea in code (although in the end idle's hash was probably faster).
The main "dirty" part of this little snippet is the hard coded bucket size (10) which I just pulled out of thin air, I chucked a check on it's overflow as I really had no idea if it would be big enough, although it seems to be for this example. If you had a pattern list that was planned to expand you'd want some kind of dynamic list there. 66k is hardcoded to be higher than a 16bit var, that's kind of ugly, and 16 in the comparison is just to match the size of the string we are looking for, that could be a dynamic length in the structure too I guess if all patterns are different lengths
The simple point is that the first twp bytes of the pattern
are the array index so there's no looping or searching there, the bytes you are checking become the lookup for the pattern.
Posted: Sat Aug 23, 2008 4:09 am
by idle
pdwyer wrote:
The simple point is that the first twp bytes of the pattern are the array index so there's no looping or searching there, the bytes you are checking become the lookup for the pattern.
I can see that working, though in the cases of a pattern having the same leading bytes you'd run into complications. I suppose thats where Associative array implimentaions take over from look up tables
http://en.wikipedia.org/wiki/Associative_array
Posted: Sat Aug 23, 2008 4:15 am
by pdwyer
In this case I can handle up to 10 collisions in any one index but yes, in production code you would want to have something a little more elastic
Posted: Sat Aug 23, 2008 4:23 am
by idle
I hadn't noticed that
I've still got the flu and am well off the pace today.