Faster MD5 File Hashes.

jassing
Addict
Posts: 1885
Joined: Wed Feb 17, 2010 12:00 am


Post by jassing »

I almost posted this in Tips & Tricks, but I'm not sure it's faster everywhere. Here at the office I tested files from 1 MB to 800 MB, and the chunked approach was faster than the built-in MD5FileFingerprint(). Not by a great margin, but faster, and when you're hashing lots of files every little bit helps.
I even tested it while updating a progress bar in a window, and it was still faster...

Could some folks check this on Linux/Mac (and Windows) to see if it is universally faster? I tested on WinXP, 2003, and 7, on both 32- and 64-bit OSes, on machines ranging from 7-year-old single-core boxes to a brand-new quad-core system.

It may have been my test files, which were just random files I grabbed off the server...

Code: Select all


Global gnChunkSize=(1024 * 32)  ; 32 seemed to be more universally usable; experiment at will.
;                                                       there is a 'sweet spot' -- it becomes slower if too small or too big of a buffer is used.

Procedure.s iMD5FileFingerprint( cFile.s )
	Protected cHash.s, nBytes, hFile, *pDataChunk, hMD5
	Shared gnChunkSize
	
	If FileSize(cFile)>-1
		hFile = ReadFile(#PB_Any, cFile, #PB_File_SharedRead)
		If hFile
			FileBuffersSize(hFile, gnChunkSize )
			*pDataChunk = AllocateMemory( gnChunkSize )
			If *pDataChunk
				hMD5 = ExamineMD5Fingerprint(#PB_Any)
				If hMD5
					While Not Eof(hFile)
						; You could update a progress bar...
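						nBytes = ReadData(hFile, *pDataChunk, gnChunkSize)  ; read the next chunk into the allocated buffer
						NextFingerprint(hMD5, *pDataChunk, nBytes)          ; feed only the bytes actually read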
					Wend
					cHash = FinishFingerprint(hMD5)
				Else
					Debug "Failed to init md5"
				EndIf 
				FreeMemory(*pDataChunk)
			Else
				Debug "Failed to allocate "+Str(gnChunkSize)+" bytes of memory"
			EndIf 
			CloseFile(hFile)
		Else
			Debug "Failed to open file"
		EndIf
	Else
		Debug "File does not exist"
	EndIf
	
	ProcedureReturn cHash
EndProcedure
Last edited by jassing on Wed May 01, 2013 3:11 pm, edited 1 time in total.
infratec
Always Here
Posts: 7582
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Faster MD5 File Hashes.

Post by infratec »

Hi,

no time to test it yet,
but one point:

Your CloseFile(hfile) is at the wrong place. :wink:

Bernd
jassing
Addict
Posts: 1885
Joined: Wed Feb 17, 2010 12:00 am

Re: Faster MD5 File Hashes.

Post by jassing »

infratec wrote:Hi,

no time to test it yet,
but one point:

Your CloseFile(hfile) is at the wrong place. :wink:

Bernd
aye - it is indeed -- thanks -- I had re-ordered the things that could fail and missed that...
infratec
Always Here
Posts: 7582
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Faster MD5 File Hashes.

Post by infratec »

Hi,

just made a few tests:

You are right :!:

With the following code

Code: Select all

#LargeFile$ = "t:\tmp\ISO-Images\UltimateBootCD\ubcd503.iso"
#gnChunkSize = 1024 * 32


Procedure.s iMD5FileFingerprint(cFile.s)
  
  Protected cHash.s, nBytes, hFile, *pDataChunk, hMD5
  
  
  hFile = ReadFile(#PB_Any, cFile, #PB_File_SharedRead)
  If hFile
    *pDataChunk = AllocateMemory(#gnChunkSize)
    If *pDataChunk
      hMD5 = ExamineMD5Fingerprint(#PB_Any)
      If hMD5
        While Not Eof(hFile)
          ; You could update a progress bar...
          nBytes = ReadData(hFile, *pDataChunk, #gnChunkSize)
          NextFingerprint(hMD5, *pDataChunk, nBytes)
        Wend
        cHash = FinishFingerprint(hMD5)
      Else
        Debug "Failed to init md5"
      EndIf
      FreeMemory(*pDataChunk)
    Else
      Debug "Failed to allocate "+Str(#gnChunkSize)+" bytes of memory"
    EndIf
    CloseFile(hFile)
  Else
    Debug "Failed to open file"
  EndIf
  
  ProcedureReturn cHash
  
EndProcedure
 

Define.i StartTime, EndTime, i

DisableDebugger

StartTime = ElapsedMilliseconds()
For i = 0 To 10
  iMD5FileFingerprint(#LargeFile$)
Next i
EndTime = ElapsedMilliseconds()
MessageRequester("Time", Str(EndTime - StartTime))

StartTime = ElapsedMilliseconds()
For i = 0 To 10
  MD5FileFingerprint(#LargeFile$)
Next i
EndTime = ElapsedMilliseconds()
MessageRequester("Time", Str(EndTime - StartTime))
and a 300MB file your 'home made' solution needs <27s
and the PB one needs >31s

But in my case the result is independent of the FileBuffersSize() command, so I removed it.
It also has no effect if I call FileBuffersSize(#PB_Default, ...) in front of the PB version.

Also, larger buffer sizes have no effect or a negative one.
Maybe this depends on the hard disk used (internal cache).

Maybe Fred should use the 'handmade' version :wink:

Bernd


P.S.: Your listing as posted does not work (*p should be *pDataChunk).
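
To look for the chunk-size 'sweet spot' mentioned in this thread, something like the following could be used. It is only a rough sketch: it assumes PureBasic 5.40 or later (the StartFingerprint() family instead of the older ExamineMD5Fingerprint() calls), and TimeChunkedMD5() and #TestFile$ are made-up names for illustration. Timings will also depend on the OS file cache, so the first pass over a file may not be comparable with later ones.

```purebasic
; Sketch: time the chunked MD5 loop for several chunk sizes.
; Assumption: PureBasic 5.40+ fingerprint API.
; #TestFile$ and TimeChunkedMD5() are hypothetical; point #TestFile$ at a large local file.
UseMD5Fingerprint()

#TestFile$ = "testfile.bin"

; Returns the elapsed milliseconds, or -1 if anything failed.
Procedure.q TimeChunkedMD5(FileName.s, ChunkSize)
  Protected File, Handle, BytesRead, *Chunk
  Protected StartTime.q, Elapsed.q = -1

  File = ReadFile(#PB_Any, FileName)
  If File
    *Chunk = AllocateMemory(ChunkSize)
    If *Chunk
      Handle = StartFingerprint(#PB_Any, #PB_Cipher_MD5)
      If Handle
        StartTime = ElapsedMilliseconds()
        BytesRead = ReadData(File, *Chunk, ChunkSize)
        While BytesRead > 0
          AddFingerprintBuffer(Handle, *Chunk, BytesRead)
          BytesRead = ReadData(File, *Chunk, ChunkSize)
        Wend
        FinishFingerprint(Handle)   ; digest is discarded, only the time matters here
        Elapsed = ElapsedMilliseconds() - StartTime
      EndIf
      FreeMemory(*Chunk)
    EndIf
    CloseFile(File)
  EndIf

  ProcedureReturn Elapsed
EndProcedure

; Try 4 KB, 8 KB, ... up to 1 MB (run without the debugger for meaningful numbers).
OpenConsole()
Define ChunkSize = 4 * 1024
While ChunkSize <= 1024 * 1024
  PrintN(Str(ChunkSize / 1024) + " KB: " + Str(TimeChunkedMD5(#TestFile$, ChunkSize)) + " ms")
  ChunkSize * 2
Wend
CloseConsole()
```

A result of -1 ms means the file could not be opened or the buffer could not be allocated.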
freak
PureBasic Team
Posts: 5940
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

Re: Faster MD5 File Hashes.

Post by freak »

The PB version does the same thing. Its buffer size is 1 MB.
quidquid Latine dictum sit altum videtur
infratec
Always Here
Posts: 7582
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Faster MD5 File Hashes.

Post by infratec »

If I use

Code: Select all

FileBuffersSize(#PB_Default, 1024 * 1024)
in front of everything, the 'home made' version needs >37s and the PB version still needs >31s.

With a small FileBuffersSize() or the default one, the 'home made' version is definitely faster.

Maybe it is not useful to use a large file buffer.

Bernd
Fred
Administrator
Posts: 18162
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: Faster MD5 File Hashes.

Post by Fred »

With a larger buffer size, the data read goes into the internal buffer first and is then copied to the supplied buffer. When the internal buffer is small and the ReadData() request is larger, the data is put directly into the supplied buffer, which is why it's faster. Buffered reads are useful when doing a lot of small reads, like with ReadByte() or ReadString().
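
The two read patterns described here could look roughly like this. It is a sketch under assumptions, not a statement of how the library works internally: #TestFile$, SumBytewise() and SumChunked() are made-up names, the byte sum is only there to give the reads something to do, and the call FileBuffersSize(File, 0) is assumed to switch the internal buffer off (please check the manual for your PureBasic version).

```purebasic
; Sketch only: many small reads versus a few large reads.
; Assumptions: #TestFile$ is a hypothetical path, and FileBuffersSize(File, 0)
; is assumed to disable the internal file buffer.
#TestFile$ = "testfile.bin"
#ChunkSize = 32 * 1024

; Many small reads: leave the internal buffer enabled so each
; ReadAsciiCharacter() is served from memory.
Procedure.q SumBytewise(FileName.s)
  Protected File, Sum.q

  File = ReadFile(#PB_Any, FileName)
  If File
    While Not Eof(File)
      Sum + ReadAsciiCharacter(File)
    Wend
    CloseFile(File)
  EndIf

  ProcedureReturn Sum
EndProcedure

; Few large reads: we already have our own buffer, so the internal
; one is switched off and ReadData() fills *Chunk directly.
Procedure.q SumChunked(FileName.s)
  Protected File, Sum.q, BytesRead, i, *Chunk

  File = ReadFile(#PB_Any, FileName)
  If File
    FileBuffersSize(File, 0)
    *Chunk = AllocateMemory(#ChunkSize)
    If *Chunk
      BytesRead = ReadData(File, *Chunk, #ChunkSize)
      While BytesRead > 0
        For i = 0 To BytesRead - 1
          Sum + PeekA(*Chunk + i)
        Next
        BytesRead = ReadData(File, *Chunk, #ChunkSize)
      Wend
      FreeMemory(*Chunk)
    EndIf
    CloseFile(File)
  EndIf

  ProcedureReturn Sum
EndProcedure

Debug SumBytewise(#TestFile$)
Debug SumChunked(#TestFile$)   ; same byte sum, different read pattern
```

Both procedures return 0 if the file cannot be opened, so an unexpected 0 usually means the path is wrong.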
Karl-Uwe Frank
User
Posts: 17
Joined: Sat Sep 03, 2011 12:33 am

Re: Faster MD5 File Hashes.

Post by Karl-Uwe Frank »

I just changed the code as shown below, and now it takes only 0.4600000083 seconds for the 100MB file and 3.3949999809 seconds for the 680MB file, which is faster than the md5 program of the OS. Is that possible?

Code: Select all

;-----------------------------------------------------------
;
; Calculate the MD5 digest of a file
;
;-----------------------------------------------------------
DisableDebugger

OpenConsole()

If (CountProgramParameters() < 1) ; Check If a Parameter is passed through
  PrintN("Please pass a file name.")
  End 1
EndIf

UseMD5Fingerprint()

#BufferSize = 16384 
*Buffer = AllocateMemory(#BufferSize)
Define.s FileName = ProgramParameter(0)
Define.i readBufferSize = #BufferSize ; integer, so a larger #BufferSize cannot overflow a 16-bit word
Define.q readByteRemain = 0
Define readByte.i = 0
Define MD5digest.s{32}

Define.d t1 = ElapsedMilliseconds()

If (ReadFile(0, FileName))
  If (*Buffer) And (StartFingerprint(0, #PB_Cipher_MD5))
    readByteRemain = FileSize(FileName)
  
    While readByteRemain > 0
      If (readBufferSize > readByteRemain) : readBufferSize = readByteRemain : EndIf
      
      readByte = ReadData(0, *Buffer, readBufferSize) 

      AddFingerprintBuffer(0, *Buffer, readByte)
      readByteRemain = readByteRemain - readByte
    Wend
    
    CloseFile(0)

    MD5digest = FinishFingerprint(0)
    FreeMemory(*Buffer)
  EndIf
Else
  PrintN("File: "+ FileName +" not found")
EndIf

Define.d t2 = ElapsedMilliseconds()
PrintN("Elapsed: "+ StrF((t2-t1)/1000) + " seconds")

PrintN(MD5digest)

CloseConsole()

End 0

; IDE Options = PureBasic 5.40 LTS (MacOS X - x64)
; ExecutableFormat = Console
; CursorPosition = 1
; EnableAsm
; EnableXP
; Executable = md5sum
; DisableDebugger
; Compiler = PureBasic 5.40 LTS (MacOS X - x64)
Cheers,
Karl-Uwe
Karl-Uwe Frank
User
Posts: 17
Joined: Sat Sep 03, 2011 12:33 am

Re: Faster MD5 File Hashes.

Post by Karl-Uwe Frank »

A bit of streamlining of the inner loop by eliminating the unnecessary calculation of the remaining bytes, etc., which gives a fraction of a second more speed, depending on the file size.

Code: Select all

If (ReadFile(0, FileName))
	If (*Buffer) And (StartFingerprint(0, #PB_Cipher_MD5))
		readByte = ReadData(0, *Buffer, #BufferSize) 
		
		While readByte > 0				
			AddFingerprintBuffer(0, *Buffer, readByte)
			readByte = ReadData(0, *Buffer, #BufferSize) 
		Wend

		CloseFile(0)

		MD5digest = FinishFingerprint(0)
		FreeMemory(*Buffer)
	EndIf
Else
	PrintN("File: "+ FileName +" not found")
EndIf
Cheers,
Karl-Uwe
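
When a hand-made loop suddenly looks much faster, it may be worth checking that it still produces the right digest. A minimal sketch of such a check, assuming PureBasic 5.40 or later where FileFingerprint() is available; ChunkedMD5() and the temporary file name are made up for this example:

```purebasic
; Sketch: compare a chunked MD5 with the built-in FileFingerprint().
; Assumption: PureBasic 5.40+ fingerprint API.
; ChunkedMD5() and the test file name are hypothetical, for illustration only.
UseMD5Fingerprint()

#ChunkSize = 32 * 1024

; Returns the MD5 digest as a hex string, or "" on any failure.
Procedure.s ChunkedMD5(FileName.s)
  Protected Hash.s, File, Handle, BytesRead, *Chunk

  File = ReadFile(#PB_Any, FileName)
  If File
    *Chunk = AllocateMemory(#ChunkSize)
    If *Chunk
      Handle = StartFingerprint(#PB_Any, #PB_Cipher_MD5)
      If Handle
        BytesRead = ReadData(File, *Chunk, #ChunkSize)
        While BytesRead > 0
          AddFingerprintBuffer(Handle, *Chunk, BytesRead)
          BytesRead = ReadData(File, *Chunk, #ChunkSize)
        Wend
        Hash = FinishFingerprint(Handle)
      EndIf
      FreeMemory(*Chunk)
    EndIf
    CloseFile(File)
  EndIf

  ProcedureReturn Hash
EndProcedure

; Write a small test file whose size is not a multiple of the chunk size.
Define TestFile.s = GetTemporaryDirectory() + "md5_chunk_test.bin"
Define PayloadSize = 100000
Define *Payload = AllocateMemory(PayloadSize)
Define OutFile

If *Payload
  RandomData(*Payload, PayloadSize)
  OutFile = CreateFile(#PB_Any, TestFile)
  If OutFile
    WriteData(OutFile, *Payload, PayloadSize)
    CloseFile(OutFile)
  EndIf
  FreeMemory(*Payload)
EndIf

; The two lines should show the same digest.
Debug ChunkedMD5(TestFile)
Debug FileFingerprint(TestFile, #PB_Cipher_MD5)

DeleteFile(TestFile)
```

If the two digests differ, the usual suspects are hashing the full buffer size instead of the number of bytes actually read, or stopping the loop before the last partial chunk.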