Image duplicate detection
I need, and have an idea for, a simple app that measures the "distance" between two images of different formats and sizes (rotation doesn't matter). It won't be perfect, but I think it will suit my needs.
I.e. a score of "0" between two images means they are identical (as far as they were checked), and the higher the number, the bigger the likely difference between them... something like that.
Before I start, though, I was wondering whether I'm reinventing the wheel here and there is a known good algorithm (possibly already done in PB) that I could use.
I've seen a couple of apps that can do this, but I'd like to roll my own to get the other functionality I want to use this for.
Paul Dwyer
“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
I know that pcfreak has done something similar (searching for duplicate pictures of any size and format) and that it worked well here:
1. Load each picture and resize it to 2x2 pixels
2. Grey them
3. You now have 4 bytes; that's the 'fingerprint' of the picture
4. For rotation, swap the bytes
5. Put the fingerprint, together with the file information, into a linked list
6. Sort the list
7. Find duplicates or near-duplicates by checking the bits that differ
8. For more detail, increase the resize dimensions
But start with 2x2 pixels first; the results are amazing.
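The steps above can be sketched roughly as follows. This is a Python illustration, not pcfreak's code: the image is assumed to be already decoded into rows of (r, g, b) tuples, the resize is approximated by block averaging, and the names `to_grey`, `fingerprint`, `rotate90` and `bit_difference` are made up for this example.

```python
def to_grey(rgb):
    """Luma-style grey value (0-255) from an (r, g, b) tuple."""
    r, g, b = rgb
    return int(r * 0.299 + g * 0.587 + b * 0.114)


def fingerprint(pixels, size=2):
    """Shrink a grid of (r, g, b) rows to size x size grey cells by block averaging.

    Assumes the image is at least size x size pixels.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(size):
        for cx in range(size):
            ys = range(cy * height // size, (cy + 1) * height // size)
            xs = range(cx * width // size, (cx + 1) * width // size)
            block = [to_grey(pixels[y][x]) for y in ys for x in xs]
            cells.append(sum(block) // len(block))
    return bytes(cells)  # order: top-left, top-right, bottom-left, bottom-right


def rotate90(fp):
    """Fingerprint of the same image rotated 90 degrees clockwise (2x2 only)."""
    top_left, top_right, bottom_left, bottom_right = fp
    return bytes([bottom_left, top_left, bottom_right, top_right])


def bit_difference(fp_a, fp_b):
    """Number of differing bits between two equally long fingerprints."""
    return sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
```

A larger `size` gives a longer fingerprint and more detail, as in step 8; `rotate90` as written only covers the 2x2 case.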
SPAMINATOR NR.1
Interesting idea. I guess a little larger than 2x2 would increase accuracy. I wonder how CPU-intensive the resize would be if we were talking about 30,000 files...
I'm thinking of taking r,g,b values for a grid of points on an image (say 100 points, 10x10 evenly spaced). Take the average of the r, g and b of each point, and the sig would be 100 values of the form xxx-xxx-xxx-xxx-xxx etc.
The distance between two images would be abs(sig1[0]-sig2[0]) + abs(sig1[1]-sig2[1]) + ..., giving a distance of 0-25500 (if 100 points are used, which may be too many).
I'm going to think more about the resize idea though; that sounds interesting.
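As a rough sketch of the grid-signature idea (Python, with the image assumed to be already decoded into rows of (r, g, b) tuples; `grid_signature` and `signature_distance` are hypothetical names, and the exact point placement is an assumption):

```python
def grid_signature(pixels, grid=10):
    """Average of r, g, b at grid x grid evenly spaced points of an RGB pixel grid."""
    height, width = len(pixels), len(pixels[0])
    signature = []
    for gy in range(1, grid + 1):
        for gx in range(1, grid + 1):
            # Proportional placement, so the same relative spots are sampled
            # whatever the image size.
            y = gy * height // (grid + 1)
            x = gx * width // (grid + 1)
            r, g, b = pixels[y][x]
            signature.append((r + g + b) // 3)
    return signature


def signature_distance(sig_a, sig_b):
    """Sum of absolute differences: 0 for identical signatures."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))
```

With 100 points, an all-black image against an all-white one gives 100 * 255 = 25500, the top of the range mentioned above.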
Paul Dwyer
Here's my first attempt. An image distance under about 400 seems to mark a likely candidate for the same image re-saved, stretched or resized (rotation is not handled). Stretching seems to work okay since the points are placed proportionally. You can change the point count at the top.
Using 5x5 (25) check points. Change the pic paths at the bottom.
Still needs more testing with a wider range of images.
Code:
UseJPEGImageDecoder()
UsePNGImageDecoder()
UseJPEG2000ImageDecoder()

#SigPtCount = 25 ; number of check points (a square number, e.g. 16 or 25)

Structure ImgSig
  ImgPath.s
  PtAvg.w[#SigPtCount]
EndStructure

;================================================================
; Sum of absolute differences between two signatures
Procedure.l GetImgDist(*Img1.ImgSig, *Img2.ImgSig)
  Protected Distance.l = 0, i.l
  For i = 0 To #SigPtCount - 1
    Distance = Distance + Abs(*Img1\PtAvg[i] - *Img2\PtAvg[i])
  Next
  ProcedureReturn Distance
EndProcedure

;================================================================
; Path is already populated; the points are filled in here
Procedure GetImgSig(*Img.ImgSig)
  Protected ImgID.i, ImgWidth.l, ImgHeight.l, XStep.l, YStep.l
  Protected x.l, y.l, Colour.l, PointNo.l = 0
  Debug *Img\ImgPath
  ImgID = LoadImage(#PB_Any, *Img\ImgPath)
  If ImgID = 0
    Debug "Could not load image"
    ProcedureReturn #False
  EndIf
  ImgHeight = ImageHeight(ImgID)
  ImgWidth = ImageWidth(ImgID)
  Debug Str(ImgWidth) + " x " + Str(ImgHeight)
  YStep = Int(Round(ImgHeight / (Sqr(#SigPtCount) + 1), #PB_Round_Nearest))
  XStep = Int(Round(ImgWidth / (Sqr(#SigPtCount) + 1), #PB_Round_Nearest))
  ; A step of 0 (very small image) would make the modulo below divide by zero
  If XStep > 0 And YStep > 0 And StartDrawing(ImageOutput(ImgID))
    For y = 1 To ImgHeight - 5
      For x = 1 To ImgWidth - 5
        If y % YStep = 0 And x % XStep = 0 And PointNo < #SigPtCount
          Colour = Point(x, y)
          ; Parenthesised so the three channels are averaged. The first version
          ; computed R + G + B/3, so distances measured with it run higher.
          *Img\PtAvg[PointNo] = (Red(Colour) + Green(Colour) + Blue(Colour)) / 3
          PointNo = PointNo + 1
        EndIf
      Next
    Next
    StopDrawing()
  EndIf
  FreeImage(ImgID)
  ProcedureReturn #True
EndProcedure

;================================================================
Define Img1.ImgSig, Img2.ImgSig
Img1\ImgPath = "F:\photos\12240004_1.jpg"
Img2\ImgPath = "F:\photos\12240004.jpg"
GetImgSig(@Img1)
GetImgSig(@Img2)
Debug GetImgDist(@Img1, @Img2)
Paul Dwyer
Kaeru Gaman
Just an idea for comparison:
- resize both images in question to a quite small size, so that, let's say, 1 pixel represents 8x8 up to 32x32 pixels of the originals.
- draw one over the other using XOR
- count the number of bits left set
This gives a value that directly measures the "difference" between both images.
Sure, for this approach rotation matters.
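A minimal Python sketch of the XOR-and-count step, assuming both images have already been shrunk to the same small size and greyed into rows of 0-255 values (`xor_difference` is a name invented for this example):

```python
def xor_difference(grey_a, grey_b):
    """Count the bits left set after XOR-ing two equally sized grey thumbnails."""
    if len(grey_a) != len(grey_b) or len(grey_a[0]) != len(grey_b[0]):
        raise ValueError("thumbnails must have the same dimensions")
    return sum(
        bin(a ^ b).count("1")
        for row_a, row_b in zip(grey_a, grey_b)
        for a, b in zip(row_a, row_b)
    )
```

One thing to watch: the bit count is not proportional to the brightness difference. Grey values 127 and 128 are neighbours but differ in all 8 bits, so it may be worth testing this against a plain absolute difference.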
oh... and have a nice day.
Image resizing seems to be the consensus. I modified mine to resize to the number of points being checked (e.g. 5x5) and there seems to be some further improvement.
@Yadoku,
The XOR is an interesting idea and, considering the new size, probably not too CPU-intensive. I'd need to test it, as I can't picture how XOR would behave under different conditions.
One thing I thought of but haven't tried yet: rather than taking an average, which I think gives results similar to greyscale here, take the differences between the RGB values. One pic I found was originally the same pic but had been lightened. The RGB values didn't rise in exact proportion to each other, but the ratios between them were less affected than the average of those values, which went up overall.
I think I'll try swapping out avg with abs(R-G)+abs(R-B)+abs(G-B) and see how that goes.
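To illustrate why the channel-difference value might hold up better under lightening, here is a small Python sketch. It assumes the simplest possible lightening, the same amount added to every channel, which real edits only approximate; the function names are hypothetical.

```python
def average_metric(rgb):
    """Plain average of the three channels."""
    r, g, b = rgb
    return (r + g + b) // 3


def spread_metric(rgb):
    """abs(R-G) + abs(R-B) + abs(G-B): depends only on channel differences."""
    r, g, b = rgb
    return abs(r - g) + abs(r - b) + abs(g - b)


def lighten(rgb, amount):
    """Add the same amount to every channel, clipped at 255."""
    return tuple(min(255, c + amount) for c in rgb)
```

For the pixel (120, 80, 40) lightened by 30, `average_metric` moves from 80 to 110 while `spread_metric` stays at 160. Once a channel clips at 255 the spread changes too, and a pure grey pixel always scores 0, so this value carries no brightness information on its own.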
Paul Dwyer
> 1. Load and Resize any pictures to 2x2 pixel
> 2. Grey them
Done that.
> 3. you got 4 Bytes, thats your 'Fingerprint' of the pic.
This part I can't do. Any tips? I can't find pcfreak's code after searching.
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
Kaeru Gaman
See Ring's Point 8 above.
It's not designed to be a unique fingerprint, remember, as we are trying to catch similarities too. If we were just after "identical" we wouldn't be working at an image level; an MD5 would be fine.
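For contrast, a byte-level check looks like the Python sketch below (`md5_of` is a made-up helper; in practice the bytes would come from reading each file). It only answers "identical or not":

```python
import hashlib


def md5_of(data):
    """Hex MD5 digest of a bytes object, e.g. the raw contents of an image file."""
    return hashlib.md5(data).hexdigest()
```

Two byte-identical files give the same digest, but changing a single byte gives an unrelated one, so a re-saved or resized copy will not match. That is why the similarity check has to work on the pixels instead.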
Paul Dwyer
I was finding that this was okay with proportional check points: I tried stretching an image in one direction so as to screw up the proportions, and the distance didn't change much.
Paul Dwyer
Well, I recoded some stuff (with permission from pcfreak).
If you run the code, watch the Grey Bytes column.
If the values are close to each other, the images are close too.
Pictures that are resized to, say, 100%, 70%, 50% or 30% mostly have identical grey bytes.
OK, code:
Code:
#IdWidth = 2
#IdHeight = 2
#List = 1

UseJPEGImageDecoder()
UsePNGImageDecoder()

Structure strucImgList
  Filename.s
  FileSize.i
  Width.l
  Height.l
  StructureUnion
    col.a[4] ; unsigned bytes, so grey values 128-255 are not read back as negative
    value.l
  EndStructureUnion
  sGrey.s
EndStructure

NewList ImgList.strucImgList()

Procedure FileScan(FilePath.s, List ImgList.strucImgList())
  Protected Filename.s, iHdl.i, rgb.l, f.l
  If ExamineDirectory(1024, FilePath, "*.*")
    While NextDirectoryEntry(1024)
      ; NextDirectoryEntry() only reports whether there is another entry;
      ; the entry type comes from DirectoryEntryType()
      If DirectoryEntryType(1024) = #PB_DirectoryEntry_File
        Filename = DirectoryEntryName(1024)
        iHdl = LoadImage(#PB_Any, FilePath + "\" + Filename)
        If iHdl <> 0
          AddElement(ImgList())
          ImgList()\Filename = Filename
          ImgList()\FileSize = FileSize(FilePath + "\" + Filename)
          ImgList()\Width = ImageWidth(iHdl)
          ImgList()\Height = ImageHeight(iHdl)
          ; Shrink to 3x3, then sample the middle pixel of each edge
          ResizeImage(iHdl, #IdWidth + 1, #IdHeight + 1, #PB_Image_Smooth)
          If StartDrawing(ImageOutput(iHdl))
            ;R1
            ;oxo
            ;ooo
            ;ooo
            rgb = Point(1, 0)
            f = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
            ImgList()\col[0] = f & $FF
            ;R2
            ;ooo
            ;xoo
            ;ooo
            rgb = Point(0, 1)
            f = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
            ImgList()\col[1] = f & $FF
            ;R3
            ;ooo
            ;oox
            ;ooo
            rgb = Point(2, 1)
            f = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
            ImgList()\col[2] = f & $FF
            ;R4
            ;ooo
            ;ooo
            ;oxo
            rgb = Point(1, 2)
            f = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
            ImgList()\col[3] = f & $FF
            ; Build the string (for sorting only)
            ImgList()\sGrey = RSet(StrU(ImgList()\col[0]), 3, "0") + " | " + RSet(StrU(ImgList()\col[1]), 3, "0") + " | " + RSet(StrU(ImgList()\col[2]), 3, "0") + " | " + RSet(StrU(ImgList()\col[3]), 3, "0")
            StopDrawing()
          EndIf
          FreeImage(iHdl)
        EndIf
      EndIf
    Wend
    FinishDirectory(1024)
  EndIf
EndProcedure

Directory.s = "d:\Bilder\Bild"
OpenWindow(0, 0, 0, 800, 600, "Dir Scan..." + Directory, #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
Flags = #PB_ListIcon_GridLines | #PB_ListIcon_FullRowSelect | #PB_ListIcon_AlwaysShowSelection
ListIconGadget(#List, 1, 1, 800, 600, "File", 150, Flags)
AddGadgetColumn(#List, 1, "Size", 50)
AddGadgetColumn(#List, 2, "Width", 50)
AddGadgetColumn(#List, 3, "Height", 50)
AddGadgetColumn(#List, 4, "Grey Bytes", 180)
AddGadgetColumn(#List, 5, "Value", 100)
FileScan(Directory, ImgList())
; Sort by the grey-byte string so similar images end up next to each other
SortStructuredList(ImgList(), #PB_Sort_Ascending, OffsetOf(strucImgList\sGrey), #PB_Sort_String)
ForEach ImgList()
  AddGadgetItem(#List, -1, ImgList()\Filename + Chr(10) + Str(ImgList()\FileSize) + Chr(10) + Str(ImgList()\Width) + Chr(10) + Str(ImgList()\Height) + Chr(10) + ImgList()\sGrey + Chr(10) + Str(ImgList()\value))
Next
Repeat
  EventID = WaitWindowEvent()
Until EventID = #PB_Event_CloseWindow
End

