XORC32 (A XOR based 32bit checksum)

Share your advanced PureBasic knowledge/code with the community.
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

XORC32 (A XOR based 32bit checksum)

Post by Rescator »

Behavior is similar to PureBasic's CRC32Fingerprint().
Performance is about ten times faster than PureBasic's CRC32.

Then again this is a much simpler form of checksum since it's not a Cyclic Redundancy Check (CRC),
XORC32 has more in common with a Longitudinal Redundancy Check (LRC)
The ASM can probably be optimized some more too.

Please note that I have not done any collision tests or entropy spread tests.
In fact, I don't really care about the typical spread that other checksums or hashes have.
This is just intended to quickly check if the data is changed or not.
A small change will cause a small difference, and a large a large difference.
This checksum is best used in combination with filename + filesize + XORC32 and where possible file modified date as well.

If anyone has some links for collision test tools handy please post'em,
I will most likely tweak this one in the future if there are any particular issues and update this post.

I ended up making this after a tired night surfing around looking for a simple XOR based checksum
(as opposed to the several dozens of cycling redundancy checksums (CRC) out there)
so if anyone got some XOR checksum code they know about please shout out.

Remember! ANY checksum or hash algorithm has collisions, if a checksum or hash match, there is still a chance the file is different, even with SHA512 etc.
The only way to 100% positively check if a files content is different or not is to do a byte for byte comparison.
That being said, there is one thing you can count on always... If a checksum/hash does NOT match, then it's 100% likely the file is NOT identical.
In other words you may get false positives, but you will never get false negatives, a lot of people don't always remember that fact! ;)

Code: Select all

;XORC32 v1.1
;Copyright (c) 2009 Roger Hågensen, EmSai.
;http://EmSai.net/
;
;This software is provided 'as-is', without any express or implied
;warranty. In no event will the authors be held liable for any damages
;arising from the use of this software.
;
;Permission is granted to anyone to use this software for any purpose,
;including commercial applications, and to alter it and redistribute it
;freely, subject to the following restrictions:
;
;   1. The origin of this software must not be misrepresented; you must not
;   claim that you wrote the original software. If you use this software
;   in a product, an acknowledgment in the product documentation would be
;   appreciated but is not required.
;   2. Altered source versions must be plainly marked as such, and must not be
;   misrepresented as being the original software.
;   3. This notice may not be removed or altered from any source
;   distribution.

EnableExplicit

;Trivia: $43524F58=CROX in other words "XORC" but in little endian order.

;Note!
;We will use $43524F58 to make sure $0 or $FFFFFFFF when XOR'ed with
;$0 or $FFFFFFFF do not result in a checksum of $0 and similar issues,
;which is important if a file start with or contains a long series of these,
;which is not that uncommon in data files for example.
;Obviously this means we move this XOR behaviour to someplace else,
;but we're much less likely to find a series of $BB (example) than $0 or $FF in files.
;The ROL (rotate left) and BSWAP (byteswap) is done to ensure that repeats of the same data in a row
;produces different checksums for the next 4 bytes read.

Procedure.l XORC32(*mem,len.i,crc.l=$43524F58)
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!MOV rcx,qword[p.v_len]
		!MOV rdx,qword[p.p_mem]
	CompilerElse
		!MOV ecx,dword[p.v_len]
		!MOV edx,dword[p.p_mem]
	CompilerEndIf
	!MOV eax,dword[p.v_crc]
	!MOV edi,$43524F58
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!MOV rbx,rdx
		!CMP rbx,0
	CompilerElse
		!MOV ebx,edx
		!CMP ebx,0
	CompilerEndIf
	!JE XORC32_END
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!CMP rcx,3
	CompilerElse
		!CMP ecx,3
	CompilerEndIf
	!JLE XORC32_LOOPEND
	!XORC32_LOOP:
		!BSWAP eax
		!ROL eax,1
		!ADD eax,edi
		CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
 		!XOR eax,dword[rdx]
			!ADD rdx,4
			!ADD rcx,-4
			!CMP rcx,3
		CompilerElse
 		!XOR eax,dword[edx]
			!ADD edx,4
			!ADD ecx,-4
			!CMP ecx,3
		CompilerEndIf
		!JG XORC32_LOOP
	!XORC32_LOOPEND:
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!CMP rcx,0
	CompilerElse
		!CMP ecx,0
	CompilerEndIf
	!JE XORC32_END
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!MOV rbx,0
		!CMP rcx,1
	CompilerElse
		!MOV ebx,0
		!CMP ecx,1
	CompilerEndIf
	!JE XORC32_END1
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!CMP rcx,3
	CompilerElse
		!CMP ecx,3
	CompilerEndIf
	!JE XORC32_END3
	!XORC32_END2:
		CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
			!MOV bx,word[rdx]
		CompilerElse
			!MOV bx,word[edx]
		CompilerEndIf
		!JMP XORC32_ENDXOR
	!XORC32_END3:
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!MOV bx,word[rdx+1]
	CompilerElse
		!MOV bx,word[edx+1]
	CompilerEndIf
	!SAL ebx,8
	!XORC32_END1:
		CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
			!ADD bl,byte[rdx]
		CompilerElse
			!ADD bl,byte[edx]
		CompilerEndIf
	!XORC32_ENDXOR:
		!BSWAP eax
		!ROL eax,1
		!ADD eax,edi
		!XOR eax,ebx
	!XORC32_END:
	ProcedureReturn
EndProcedure
Examples:

Code: Select all

Define dstl.l
Define start.l,stop.l,n.l,i.l,*src
Define a.l,b.l,len.l

timeBeginPeriod_(1)

ReadFile(1,#PB_Compiler_Home+"PureBasic.exe")
len=Lof(1)
*src=AllocateMemory(len)

i=1 ;change the loop count to something higher for speed tests.

dstl=$43524F58 ;XORC
start=timeGetTime_()
For n=1 To i
 dstl=CRC32Fingerprint(*src,len)
Next
stop=timeGetTime_()
Debug Hex(dstl,#PB_Long)
Debug Bin(dstl,#PB_Long)
a=stop-start

dstl=$43524F58 ;XORC
start=timeGetTime_()
For n=1 To i
 dstl=XORC32(*src,len)
Next
stop=timeGetTime_()
Debug Hex(dstl,#PB_Long)
Debug Bin(dstl,#PB_Long)
b=stop-start

timeEndPeriod_(1)

MessageRequester("",Str(a)+#LF$+Str(b))

Code: Select all

;Make sure to turn off "Create unicode executable" in compiler options
;so you can see the collision matches of CRC32.
;Collisions exists for XORC32 obviously, but I'm too lazy to search for them! :)
;The number of collisions in XORC32 should be close to that of CRC32,
;but may possibly be better since XORC32 does not have issues with $0
;that CRC32 especially struggles with at the beginning of files.

Define.l c
Define test1$,test2$

test1$="fc0591"
test2$="123rainerbommert"
c=XORC32(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=XORC32(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
Debug ""

test1$="a1sellers"
test2$="advertees"
c=XORC32(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=XORC32(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
Debug ""

test1$="Maria has nine red beds."
test2$="Steven has fifteen white tables."
c=XORC32(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=XORC32(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
Debug ""

test1$="Joe has fourteen magenta things."
test2$="Lars has thirteen black balls."
c=XORC32(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=XORC32(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
Debug ""

;Let's see how the checksum algos behave when there is a tiny change,
;this lets you see the fundamental difference in how XORC32 and CRC32 work.
test1$="The quick brown fox jumps over the lazy cog"
test2$="The quick brown fox jumps over the lazy dog"
c=XORC32(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=XORC32(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test1$,Len(test1$))
Debug Hex(c,#PB_Long)
c=CRC32Fingerprint(@test2$,Len(test2$))
Debug Hex(c,#PB_Long)
Debug ""

;And here is a empty string, let's see what they return with that.
test1$=""
c=XORC32(@test1$,Len(test1$)) ;will either return $43524F58 or whatever you use as the init/chain value.
Debug Hex(c,#PB_Long)
;c=CRC32Fingerprint(@test1$,Len(test2$)) ;commented out since it crashes... oops? ;) (PB 4.40B7)
;Debug Hex(c,#PB_Long)
Debug ""
Last edited by Rescator on Sun Nov 22, 2009 1:32 am, edited 6 times in total.
+18
Enthusiast
Enthusiast
Posts: 228
Joined: Fri Oct 24, 2008 2:07 pm

Re: XORC32 (A XOR based 32bit checksum)

Post by +18 »

Thanks for sharing :D
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: XORC32 (A XOR based 32bit checksum)

Post by Rescator »

Updated the code. version is still 1.0 since the only thing changed is x64 support.
The checksum is still 32bit though and identical, it's just that now x64 memory addressing and memlength is 64bit on x64 and 32bit on x86.

The code got a bit messier, but hopefully it'll be a simple example on how to support both x86 and x64 "easily" with PureBasic and ASM.

As seen, only things related to memory pointers and memory length was changed, the rest are normal x86 instructions which x64 happily uses
(thank AMD for that, Intel had a way different idea with their original 64bit CPUs).
Oh yeah, and please note the .i for the len variable, it was a .l previously
Fred
Administrator
Administrator
Posts: 18249
Joined: Fri May 17, 2002 4:39 pm
Location: France
Contact:

Re: XORC32 (A XOR based 32bit checksum)

Post by Fred »

The problem with such simple algorithm is that actually you're not really sure than your data didn't changed, as collision are high. That makes the routine almost useless, as you can't trust it. That's why checksum based algo aren't that simple, and requiers some times to compute (CRC32 is still very fast compared to other).
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: XORC32 (A XOR based 32bit checksum)

Post by Rescator »

True. But the reason I'm looking for and experimenting with simple XOR checksums is that CRC32 was designed for transmission errors (originating back to the modem days I believe?)
while I only need to check if the data has been changed. (like say an installer or copy feature or file patcher, etc.) under the assumption that the two "files" being compared are both undamaged by any transmission data corruption (which are checked during unpacking or downloading separately anyway)
This means that intentional changes is what should be looked at. Like say a v1.0 and v1.1 of a program etc. Something simpler than CRC32 might work just as well in that case.

I will update this thread if I find out anything interesting and I hope others do as well,
I'm still looking for examples of XOR checksums and collision tests and tools sites, Google searching is not being very helpful for me so far. :roll:
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: XORC32 (A XOR based 32bit checksum)

Post by Rescator »

Updated the first post with some more text/babble, and some examples plus some CRC32 collisions I found when searching the net.

PS! A nice tip (not just for XORC32 but any checksum/hash function) is to give weight values to various info.

For example:

If the filename is different you can obviously assume it's not the same file so that's a weight of 1.0
If the filename is a match then that is a weight of 0.0
If the filesize is different that too is a obvious weight of 1.0
If the filesize is a match then that is a weight of 0.0
If the checksum is different that is also a weight of 1.0
If the checksum is a match then that is a weight of 0.5
If the last modified date is different then that is a weight of 0.5

If the resulting weight sum is 1.0 or higher then you can safely assume the file content (if file exists) is not the
same as it originally was supposed to be and you can replace/restore it etc.

Why this weight check system? Because it avoids unnecessarily copying a file or data from a disc or downloading from the net,
obviously you need to figure out what to do in the case you get a weight sum between 0.0 and 1.0 but that is a issue you and the user need to figure out :P

Oh! And a weight sum close to 0.0 does not mean the file is unchanged, you'd still need to do a byte to byte comparison for that, but it's just as fast to copy/replace it "just in case" at that stage anyway. Last modified date can usually be relied upon.
Thorium
Addict
Addict
Posts: 1305
Joined: Sat Aug 15, 2009 6:59 pm

Re: XORC32 (A XOR based 32bit checksum)

Post by Thorium »

A XOr based ultra fast checksum is very interessting for checks that are not vital. For example on anti-cheating for games. You don't want to slow down the game but you want to do some checks to make cheating harder. So a XOr based checksum is pretty good for that job. You don't need high accuracy because it can be beaten anyway but you don't want to slow down performance for a none vital calculation.

Now it would be exciting if we had a XOr based checksum that can be calculated in parallel. Checksums like CRC32 must be calculated sequential but it must be possible to create a low accuracy checksum that can be calculated parallel. So we could use SIMD for it and boost up performance by like 10 times.
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: XORC32 (A XOR based 32bit checksum)

Post by Rescator »

Added a byteswap which basically does a little to big endian and vice versa swap of bytes.
This increases the entropy between each internal loop of XORC32,
otherwise the only other change for v1.1 is that the ROL was moved.
Thorium wrote:A XOr based ultra fast checksum is very interessting for checks that are not vital.
Which was the idea behind it. Obviously XORC32 is not as resilient as CRC32 when it comes to transmission errors where one or two adjoining bits are changed
or a burst of 1's or 0's. But when it comes to checking if a file is changed or not it should do an ok job.
Still haven't found any test tools unfortunately, would be nice to see how close (or not) I am to CRC32's collision rate, especially since XORC32 does not have an issue with leading zeroes in a file etc.
Thorium wrote:Now it would be exciting if we had a XOr based checksum that can be calculated in parallel. Checksums like CRC32 must be calculated sequential but it must be possible to create a low accuracy checksum that can be calculated parallel. So we could use SIMD for it and boost up performance by like 10 times.
Not sure there are many "fast" checksums or CRCs or hashes that are able to do that,
especially if you want them to be easily chainable (taking the returned checksum and using it as the init for the next round of block of data you want to check).
There are at least two out there that can do this but no idea about their speed etc.

Then again one could always do parallel processing using XORC32 and then XOR the results of datablock A and B together as long as they are consistently XORed in a A,B,A,B sequence that would work, obviously this would not be the same as doing A,A,A,A etc.

PS! It might be possible if the routine used a separate checksum in the loop, and when done it xor that with the init and return the result, in that case A,B,A,B should give the same checksum as A,A,A,A processing. But considering how simple and tight the loop is I'm not sure how much faster it could be.
My advise is to process files in parallel, sure you get some extra filesystem overhead, but if you buffering is nice (65K buffers used maybe?) then that would be minimal disk overhead.
freak
PureBasic Team
PureBasic Team
Posts: 5944
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

Re: XORC32 (A XOR based 32bit checksum)

Post by freak »

It might also be interesting to test this against the Adler32 algorithm (from zlib) as well. It is also designed to get better speeds than crc32 at the expense of the collision performance.

http://en.wikipedia.org/wiki/Adler-32
http://www.zlib.net/zlib_tech.html (a comparison to crc32 is at the bottom of the page)

You should be able to import the adler32() function from the zlib included in the PB package directly if you want to try it.
quidquid Latine dictum sit altum videtur
User avatar
Rescator
Addict
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: XORC32 (A XOR based 32bit checksum)

Post by Rescator »

Thanks freak.
But if I where using zlib.lib then I probably use crc32() from that lib instead as it's 3 or so times faster than the native PureBasic one.
zlibs crc32 (at least the one in PureBasic's zlib.lib which is one of the fastest crc32's I've seen) is about 140% slower than XORC32, adler32 is about 30% slower than XORC32
as to a comparison of collisions between adler32 and XORC32 I have no idea but the same or better than adler32 I hope.
adler32 uses two 16bit values if I recall correctly.

I do plan to do a 32bit collision test later on XORC32, I just need to roll up my sleeves and dump out 4+ billion values.
If it behaves as I expect then XORC32 will just like CRC32 be able to 100% reliably checksum a 0-4byte value, and like CRC32 be able to restore the original value (aka dictionary lookup)

But there is no way I can test beyond 32bits of data in a similar way, I'd need either houndred of terrabyte of free disk space or years to test that heh.
Unless someone (hint hint) has links to some test tools that do this the smart way (which I've got no clue on how to do). :)
Post Reply