Image duplicate detection
Posted: Sun Feb 15, 2009 5:10 am
by pdwyer
I have a need and an idea for writing a simple app to detect the "distance" between two images that are of different formats and sizes (rotation doesn't matter). It's not going to be perfect but I think it will suit my needs.
i.e. a score of "0" between two images shows they are identical (as far as they were checked), and the higher the number, the bigger the likely difference between them... something like that.
Before I start though, I was just wondering if I was reinventing the wheel here and whether there is a known good algorithm (possibly done in PB already) that I could use.
I've seen a couple of apps that can do this but I'd like to roll my own to get the other functionality that I want to use this for.
Posted: Sun Feb 15, 2009 7:55 am
by Rings
I know that pcfreak has done something similar (searching for picture duplicates of any size and format) and that it worked well here:
1. Load and resize each picture to 2x2 pixels
2. Grey them
3. You now have 4 bytes; that's your 'fingerprint' of the pic.
4. For rotation, swap the bytes
5. Move the fingerprint with the file information into a linked list
6. Sort the list
7. Find duplicates, or near duplicates, by checking the bits that differ
8. For more detail, increase the resize dimensions
But first try it on 2x2 pixels, the results are amazing.
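Steps 3, 4 and 7 could be sketched roughly as follows. This is not pcfreak's code: the procedure names are made up, and the byte order inside the 32-bit value (top-left in the lowest byte, then top-right, bottom-left, bottom-right) is an assumption for illustration only.

```purebasic
; Hypothetical helpers for comparing packed 4-byte fingerprints.
; Assumed byte order (lowest byte first): top-left, top-right, bottom-left, bottom-right.

; Step 7: count the bits that differ between two fingerprints
Procedure.l DiffBits(a.l, b.l)
  Protected x.l = a ! b          ; XOR leaves a 1 wherever the two values differ
  Protected i.l, count.l = 0
  For i = 0 To 31
    count + ((x >> i) & 1)
  Next
  ProcedureReturn count
EndProcedure

; Step 4: fingerprint of the same picture turned 90 degrees clockwise
Procedure.l Rotate90(f.l)
  Protected tl.l = f & $FF
  Protected tr.l = (f >> 8) & $FF
  Protected bl.l = (f >> 16) & $FF
  Protected br.l = (f >> 24) & $FF
  ; new top-left = old bottom-left, new top-right = old top-left,
  ; new bottom-left = old bottom-right, new bottom-right = old top-right
  ProcedureReturn bl | (tl << 8) | (br << 16) | (tr << 24)
EndProcedure

; Smallest bit difference over the four possible orientations of b
Procedure.l BestDiffBits(a.l, b.l)
  Protected best.l = DiffBits(a, b)
  Protected i.l, d.l
  For i = 1 To 3
    b = Rotate90(b)
    d = DiffBits(a, b)
    If d < best : best = d : EndIf
  Next
  ProcedureReturn best
EndProcedure

Define fpA.l = $10203040
Define fpB.l = $10203041               ; same fingerprint with one bit changed
Debug DiffBits(fpA, fpB)               ; 1
Debug BestDiffBits(fpA, Rotate90(fpA)) ; 0
```

One thing to keep in mind with bit counting: two grey values that are numerically adjacent, such as 127 and 128, differ in all eight bits, so summing the absolute difference per byte may be a more forgiving measure.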
Posted: Sun Feb 15, 2009 8:22 am
by pdwyer
Interesting idea, I guess a little larger than 2x2 would increase accuracy. I wonder how CPU intensive the resize would be if we were talking about 30,000 files...
I'm thinking of taking r,g,b values for a grid of points on an image (say 100, on a 10x10 evenly spaced grid). Take the average of the RGB at each point and the sig would be 100 values like xxx-xxx-xxx-xxx-xxx etc.
The distance between two images would be abs(sig1[0]-sig2[0]) + abs(sig1[1]-sig2[1]) + ... giving a distance of 0-25500 (if 100 points are used, which may be too many).
I'm going to think more about the resize idea though, that sounds interesting
Posted: Sun Feb 15, 2009 11:00 am
by idle
A lot of it depends on the source of the image data, though for a generalized method, I'd look into FFT for resolution independence with a Hough transform to classify each target image. It'd still need fuzzy matching and may lead to false positives.
Posted: Sun Feb 15, 2009 3:25 pm
by pdwyer
Here's my first attempt. A score of under about 400 in image distance seems to be a likely candidate for being the same image but resaved, stretched or resized. (not handling rotation). Stretched seems to work okay since it's a proportional placing of the points. You can change the point count used at the top
Using 5x5 (25) check points. Change the pic paths at the bottom.
Still needs more testing with a wider range of images
Code:
UseJPEGImageDecoder()
UsePNGImageDecoder()
UseJPEG2000ImageDecoder()

#SigPtCount = 25 ;16 ; number of check points, should be a square number

Structure ImgSig
  ImgPath.s
  PtAvg.w[#SigPtCount]
EndStructure

;================================================================
; Sum of absolute differences between two signatures
Procedure.l GetImgDist(*Img1.ImgSig, *Img2.ImgSig)
  Protected Distance.l = 0
  Protected i.l, Diff.l

  For i = 0 To #SigPtCount - 1
    Diff = *Img1\PtAvg[i] - *Img2\PtAvg[i]
    If Diff < 0 : Diff = -Diff : EndIf ; integer abs (Abs() is documented for floats)
    Distance + Diff
  Next

  ProcedureReturn Distance
EndProcedure

;================================================================
; Path populated, points to be added. Returns #True if the image could be loaded.
Procedure.l GetImgSig(*Img.ImgSig)
  Protected ImgID.i, ImgWidth.l, ImgHeight.l
  Protected XStep.l, YStep.l, x.l, y.l, Colour.l
  Protected PointNo.l = 0

  Debug *Img\ImgPath
  ImgID = LoadImage(#PB_Any, *Img\ImgPath)
  If ImgID = 0
    ProcedureReturn #False ; missing file or unsupported format
  EndIf

  ImgHeight = ImageHeight(ImgID)
  ImgWidth = ImageWidth(ImgID)
  Debug Str(ImgWidth) + " x " + Str(ImgHeight)

  YStep = Int(Round(ImgHeight / (Sqr(#SigPtCount) + 1), #PB_Round_Nearest))
  XStep = Int(Round(ImgWidth / (Sqr(#SigPtCount) + 1), #PB_Round_Nearest))

  If XStep > 0 And YStep > 0 ; avoid a modulo by zero on tiny images
    If StartDrawing(ImageOutput(ImgID))
      For y = 1 To ImgHeight - 5
        For x = 1 To ImgWidth - 5
          If y % YStep = 0 And x % XStep = 0 And PointNo < #SigPtCount
            Colour = Point(x, y)
            ; The parentheses matter: without them only Blue is divided by 3.
            ; Scores measured with the old formula (such as the ~400 threshold)
            ; may need re-checking after this change.
            *Img\PtAvg[PointNo] = (Red(Colour) + Green(Colour) + Blue(Colour)) / 3
            PointNo + 1
          EndIf
        Next
      Next
      StopDrawing()
    EndIf
  EndIf

  FreeImage(ImgID)
  ProcedureReturn #True
EndProcedure

;================================================================
Define Img1.ImgSig, Img2.ImgSig
Img1\ImgPath = "F:\photos\12240004_1.jpg"
Img2\ImgPath = "F:\photos\12240004.jpg"

If GetImgSig(@Img1) And GetImgSig(@Img2)
  Debug GetImgDist(@Img1, @Img2)
Else
  Debug "Could not load one of the images"
EndIf
Posted: Sun Feb 15, 2009 5:17 pm
by Kaeru Gaman
just an idea for comparison:
- resize both images in question to a quite small size,
so that, let's say, 1 pixel stands for 8x8 up to 32x32 pixels of the originals.
- draw one over the other using XOr
- count the amount of the bits left set
this will make up a value that directly counts the "difference" between both images.
sure, for this approach rotation matters.
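A rough sketch of the XOR comparison, working on grey values that have already been read out of two equally sized thumbnails. The procedure name and the plain arrays are assumptions; an actual version would fill the arrays from the resized images.

```purebasic
; Hypothetical sketch: XOR two small greyscale thumbnails (values 0..255)
; cell by cell and count the bits that are left set.
Procedure.l XorDistance(Array a.l(1), Array b.l(1), cells.l)
  Protected i.l, x.l, bits.l = 0
  For i = 0 To cells - 1
    x = (a(i) ! b(i)) & $FF      ; differing bits of this pair of pixels
    While x <> 0
      bits + (x & 1)
      x = x >> 1
    Wend
  Next
  ProcedureReturn bits
EndProcedure

; Two made-up 2x2 thumbnails that differ in a single bit of one pixel
Dim thumbA.l(3)
Dim thumbB.l(3)
thumbA(0) = 200 : thumbA(1) = 10 : thumbA(2) = 64 : thumbA(3) = 128
thumbB(0) = 200 : thumbB(1) = 11 : thumbB(2) = 64 : thumbB(3) = 128

Debug XorDistance(thumbA(), thumbB(), 4) ; 1
```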
Posted: Mon Feb 16, 2009 5:03 am
by pdwyer
Image resizing seems to be a consensus. I made a modification to mine to resize to the number of points being checked (eg 5x5) and there seems to be some more improvement.
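The resize variant might look something like the sketch below. It is not the modified code referred to here, only an illustration under assumptions: the procedure name, the #Side constant and the use of a copy (so the caller's image keeps its size) are all made up for the example.

```purebasic
; Sketch only: shrink a copy of the image to #Side x #Side and use every
; pixel of the result as one signature value.
#Side = 5 ; 5x5 = 25 values

Procedure.l ResizeSig(Image.i, Array Sig.l(1))
  Protected Copy.i, x.l, y.l, c.l

  Copy = CopyImage(Image, #PB_Any)
  If Copy = 0
    ProcedureReturn #False
  EndIf

  ResizeImage(Copy, #Side, #Side)
  If StartDrawing(ImageOutput(Copy))
    For y = 0 To #Side - 1
      For x = 0 To #Side - 1
        c = Point(x, y)
        Sig(y * #Side + x) = (Red(c) + Green(c) + Blue(c)) / 3
      Next
    Next
    StopDrawing()
  EndIf

  FreeImage(Copy)
  ProcedureReturn #True
EndProcedure

; Usage with a generated image instead of a file on disk
Define Img.i = CreateImage(#PB_Any, 100, 80)
Dim Sig.l(#Side * #Side - 1)
If Img And ResizeSig(Img, Sig())
  Debug Sig(0)
EndIf
```

Two signatures produced this way can be compared with the same sum of absolute differences as before.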
@Yadoku,

The XOR is an interesting idea, and considering the new size probably not too CPU intensive. I'd need to test it, as I can't picture in my mind how XOR would behave under different conditions.
One thing I thought of but haven't tried yet: rather than taking an average, which I think would give similar results to greyscale for this, take the differences between the RGB values. One pic I found was originally the same pic but had been lightened. The RGB values didn't move higher in exact proportion to each other, but the ratios between them were less affected than the average of those values, which did go up overall.
I think I'll try swapping out avg with abs(R-G)+abs(R-B)+abs(G-B) and see how that goes.
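To see why that might help, here is a small sketch comparing the two measures on one pixel and a lightened copy of it. The colour values are invented, and the lightening is modelled as simply adding 30 to every channel, which is an idealisation; real edits will not be that regular.

```purebasic
; Hypothetical comparison of "average" versus "channel difference" for one pixel.
Procedure.l IAbs(v.l)
  If v < 0 : ProcedureReturn -v : EndIf
  ProcedureReturn v
EndProcedure

Procedure.l ChannelAvg(Colour.l)
  ProcedureReturn (Red(Colour) + Green(Colour) + Blue(Colour)) / 3
EndProcedure

Procedure.l ChannelSpread(Colour.l)
  Protected r.l = Red(Colour), g.l = Green(Colour), b.l = Blue(Colour)
  ProcedureReturn IAbs(r - g) + IAbs(r - b) + IAbs(g - b)
EndProcedure

Define Original.l = RGB(100, 150, 200)
Define Lighter.l  = RGB(130, 180, 230) ; same pixel, 30 added to each channel

Debug ChannelAvg(Original)    ; 150
Debug ChannelAvg(Lighter)     ; 180
Debug ChannelSpread(Original) ; 200
Debug ChannelSpread(Lighter)  ; 200
```

The average moves with the brightness while the channel difference stays put. The flip side is that any pure grey pixel scores 0 regardless of how dark or light it is, so this measure on its own cannot tell grey areas apart.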
Posted: Mon Feb 16, 2009 12:31 pm
by PB
> 1. Load and Resize any pictures to 2x2 pixel
> 2. Grey them
Done that.
> 3. you got 4 Bytes, thats your 'Fingerprint' of the pic.
This part I can't do. Any tips? I can't find pcfreak's code after searching.

Posted: Mon Feb 16, 2009 12:42 pm
by Rings
pcfreak never posted that in public.
If you have the 4 bytes (1 byte per grey pixel), copy them into a long (4 bytes) and you have your fingerprint/checksum, or whatever you want to call it.
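One way to do the copy is with shifts, as in this minimal sketch. The procedure name and the choice of putting the first grey value in the lowest byte are assumptions; any fixed order works as long as every picture uses the same one.

```purebasic
; Hypothetical helper: pack four grey values (0..255) into one 32-bit long.
; g0 ends up in the lowest byte, g3 in the highest.
Procedure.l PackGrey(g0.l, g1.l, g2.l, g3.l)
  ProcedureReturn (g0 & $FF) | ((g1 & $FF) << 8) | ((g2 & $FF) << 16) | ((g3 & $FF) << 24)
EndProcedure

Debug Hex(PackGrey($40, $30, $20, $10)) ; 10203040
```

A structure with a StructureUnion that overlays a 4-byte array and a long would give the same result without any shifting.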

Posted: Mon Feb 16, 2009 1:00 pm
by Kaeru Gaman
erm... hello?
PULLERALARM!
this must be a JOKE!
you can't consider four 8-bit grey pixels to be a picture's fingerprint.
Posted: Mon Feb 16, 2009 1:10 pm
by pdwyer
See Rings' point 8 above.
Remember, it's not designed to be a unique fingerprint, as we are trying to catch similarities too. If we were just after "identical" we wouldn't be working at an image level; an MD5 would be fine.
Posted: Mon Feb 16, 2009 6:48 pm
by Demivec
You would probably want to also have an option in your image check to check for matching aspect ratios.
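Such an option could be as small as the sketch below. The procedure name and the default tolerance of 2 percent are made up for illustration, and the check would have to stay optional if stretched copies should still count as matches.

```purebasic
; Hypothetical helper: #True when the aspect ratios of two images differ
; by no more than TolerancePct percent.
Procedure.l SameAspect(w1.l, h1.l, w2.l, h2.l, TolerancePct.f = 2.0)
  Protected r1.f, r2.f

  If w1 <= 0 Or h1 <= 0 Or w2 <= 0 Or h2 <= 0
    ProcedureReturn #False
  EndIf

  r1 = w1 : r1 = r1 / h1  ; convert to float first, then divide
  r2 = w2 : r2 = r2 / h2

  If Abs(r1 - r2) * 100.0 <= TolerancePct * r1
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug SameAspect(1600, 1200, 800, 600) ; 1: both are 4:3
Debug SameAspect(1600, 1200, 800, 800) ; 0: 4:3 versus 1:1
```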
Posted: Tue Feb 17, 2009 1:11 am
by pdwyer
I was finding that with proportional checking points this was okay. I tried stretching an image in one direction so as to screw the proportions up, and the distance didn't change much.
Posted: Tue Feb 17, 2009 10:23 am
by Rings
Well, I recoded some stuff (with permission from pcfreak).
If you run the code, watch the grey bytes column.
If the values are close to each other, the images are similar too.
Pictures that are resized to, say, 100%, 70%, 50% or 30% have mostly identical grey bytes.
OK, code:
Code:
#IdWidth = 2
#IdHeight = 2
UseJPEGImageDecoder()
UsePNGImageDecoder()
#List = 1

Structure strucImgList
  Filename.s
  FileSize.i
  Width.l
  Height.l
  StructureUnion
    col.b[4]  ; the four grey bytes...
    value.l   ; ...and the same memory read as one 32-bit fingerprint
  EndStructureUnion
  sGrey.s
EndStructure

NewList ImgList.strucImgList()

; Weighted grey value (0..255) of one pixel of the current drawing output
Procedure.l GreyAt(x.l, y.l)
  Protected c.l = Point(x, y)
  ProcedureReturn Int((Red(c) * 0.299) + (Green(c) * 0.587) + (Blue(c) * 0.114)) & $FF
EndProcedure

Procedure FileScan(FilePath.s, List ImgList.strucImgList())
  Protected FullName.s, iHdl.i

  If ExamineDirectory(1024, FilePath, "*.*")
    While NextDirectoryEntry(1024)
      If DirectoryEntryType(1024) = #PB_DirectoryEntry_File
        FullName = FilePath + "\" + DirectoryEntryName(1024)
        iHdl = LoadImage(#PB_Any, FullName)
        If iHdl <> 0
          AddElement(ImgList())
          ImgList()\Filename = DirectoryEntryName(1024)
          ImgList()\FileSize = FileSize(FullName)
          ImgList()\Width = ImageWidth(iHdl)
          ImgList()\Height = ImageHeight(iHdl)
          ResizeImage(iHdl, #IdWidth + 1, #IdHeight + 1, #PB_Image_Smooth)
          If StartDrawing(ImageOutput(iHdl))
            ; sample the four edge midpoints of the 3x3 image:
            ;   o 0 o
            ;   1 o 2
            ;   o 3 o
            ImgList()\col[0] = GreyAt(1, 0)
            ImgList()\col[1] = GreyAt(0, 1)
            ImgList()\col[2] = GreyAt(2, 1)
            ImgList()\col[3] = GreyAt(1, 2)
            ; build the string (for sorting only); & $FF shows the signed bytes as 0..255
            ImgList()\sGrey = RSet(Str(ImgList()\col[0] & $FF), 3, "0") + " | " + RSet(Str(ImgList()\col[1] & $FF), 3, "0") + " | " + RSet(Str(ImgList()\col[2] & $FF), 3, "0") + " | " + RSet(Str(ImgList()\col[3] & $FF), 3, "0")
            StopDrawing()
          EndIf
          FreeImage(iHdl)
        EndIf
      EndIf
    Wend
    FinishDirectory(1024)
  EndIf
EndProcedure

Directory.s = "d:\Bilder\Bild"
OpenWindow(0, 0, 0, 800, 600, "Dir Scan..." + Directory, #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
Flags = #PB_ListIcon_GridLines | #PB_ListIcon_FullRowSelect | #PB_ListIcon_AlwaysShowSelection
ListIconGadget(#List, 1, 1, 800, 600, "File", 150, Flags)
AddGadgetColumn(#List, 1, "Size", 50)
AddGadgetColumn(#List, 2, "Width", 50)
AddGadgetColumn(#List, 3, "Height", 50)
AddGadgetColumn(#List, 4, "Grey Bytes", 180)
AddGadgetColumn(#List, 5, "Value", 100)

FileScan(Directory, ImgList())
SortStructuredList(ImgList(), #PB_Sort_Ascending, OffsetOf(strucImgList\sGrey), #PB_Sort_String) ; sort to see better results

ForEach ImgList()
  AddGadgetItem(#List, -1, ImgList()\Filename + Chr(10) + Str(ImgList()\FileSize) + Chr(10) + Str(ImgList()\Width) + Chr(10) + Str(ImgList()\Height) + Chr(10) + ImgList()\sGrey + Chr(10) + Str(ImgList()\value))
Next

Repeat
  EventID = WaitWindowEvent()
Until EventID = #PB_Event_CloseWindow
End
Posted: Thu Feb 19, 2009 9:29 am
by Rings
pdwyer : any results ?