To speed up the pre-comparison of files, I divide the file length into 32 sections and read one byte from each section. Now, if a TV series consists of 200 episodes of the same size, then instead of calculating the MD5 of large files with a total size of 100 GB, I read just 32 bytes from each file, which is about 10 times faster. Only if this preliminary comparison still suggests the files might be identical do I calculate the MD5.
I added the source code with the prefix PseudoHash.
Code:
DisableDebugger
EnableExplicit

UseMD5Fingerprint()

Define Path$, StartTime, Res.s, md5$

; Sample one byte at a time across the file, skipping Shift bytes
; between reads, and finish with the file's last byte.
Procedure.s GetPseudoHash(Path$, Shift.q)
  Protected res$, file_id
  Protected length.q                      ; quad, so files over 2 GB work on 32-bit builds too

  file_id = ReadFile(#PB_Any, Path$)
  If file_id
    length = Lof(file_id)
    FileSeek(file_id, 4, #PB_Relative)    ; start 4 bytes into the file
    While Eof(file_id) = 0
      res$ + Hex(ReadByte(file_id), #PB_Byte)
      FileSeek(file_id, Shift, #PB_Relative)
    Wend
    If length > 0                         ; always include the last byte (guard against empty files)
      FileSeek(file_id, length - 1, #PB_Absolute)
      res$ + Hex(ReadByte(file_id), #PB_Byte)
    EndIf
    CloseFile(file_id)
  EndIf

  ProcedureReturn res$
EndProcedure
; Time the pseudo-hash
Path$ = "path_to_video"
StartTime = ElapsedMilliseconds()
md5$ = GetPseudoHash(Path$, FileSize(Path$) / 31)
Res = "hash time = " + Str(ElapsedMilliseconds() - StartTime) + " ms"
MessageRequester("hash_0", md5$ + #LF$ + #LF$ + Res)

; Time a full MD5 of a same-size file, for comparison
Path$ = "path_to_movie_of_the_same_size_but_different_hash"
StartTime = ElapsedMilliseconds()
md5$ = FileFingerprint(Path$, #PB_Cipher_MD5)
Res = "hash time md5 = " + Str(ElapsedMilliseconds() - StartTime) + " ms"
MessageRequester("md5", md5$ + #LF$ + #LF$ + Res)
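For anyone who wants to see the whole pipeline the post describes (group files by size, compare sampled bytes, and only then fall back to a full hash), here is a sketch of the same idea in Python, since the PureBasic code above only shows the sampling step. The function names (`pseudo_hash`, `full_md5`, `find_duplicates`) are my own illustrative choices, not from the original; the 4-byte starting offset and the last-byte sample mirror the PureBasic version.

```python
import hashlib
import os

def pseudo_hash(path, sections=32):
    """Sample one byte from evenly spaced offsets, plus the last byte,
    as a cheap pre-comparison key (mirrors GetPseudoHash above)."""
    size = os.path.getsize(path)
    if size == 0:
        return ""
    shift = max(size // (sections - 1), 1)   # same as FileSize / 31
    samples = bytearray()
    with open(path, "rb") as f:
        # read 1 byte, then skip `shift` bytes, starting 4 bytes in
        for offset in range(4, size, shift + 1):
            f.seek(offset)
            samples += f.read(1)
        f.seek(size - 1)                     # always include the last byte
        samples += f.read(1)
    return samples.hex()

def full_md5(path):
    """Full MD5, computed in chunks so huge files do not fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Group by size, then by pseudo-hash; only MD5 the survivors."""
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)
    duplicates = []
    for group in by_size.values():
        if len(group) < 2:
            continue                          # unique size: cannot be a duplicate
        by_pseudo = {}
        for p in group:
            by_pseudo.setdefault(pseudo_hash(p), []).append(p)
        for candidates in by_pseudo.values():
            if len(candidates) < 2:
                continue                      # pseudo-hash already ruled it out
            by_md5 = {}
            for p in candidates:
                by_md5.setdefault(full_md5(p), []).append(p)
            duplicates += [g for g in by_md5.values() if len(g) > 1]
    return duplicates
```

The point of the layering is the same as in the post: the expensive `full_md5` only ever runs on files that survived both the size check and the 32-byte sample, so a folder of 200 same-size episodes costs 200 tiny sampled reads instead of 100 GB of hashing.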