Image duplicate detection

Posted: Sun Feb 15, 2009 5:10 am
by pdwyer
I have a need and an idea for writing a simple app to detect the "distance" between two images that are of different formats and sizes (rotation doesn't matter). It's not going to be perfect but I think it will suit my needs.

i.e. a score of "0" between two images means they are identical (as far as they were checked), and the higher the number, the bigger the likely difference between them... something like that

Before I start, though, I was wondering whether I'm reinventing the wheel here and there's a known good algorithm (possibly done in PB already) that I could use.

I've seen a couple of apps that can do this, but I'd like to roll my own to get the other functionality that I want to use this for.

Posted: Sun Feb 15, 2009 7:55 am
by Rings
I know that pcfreak has done something
similar (searching for picture duplicates of any size and
format) and that it worked well. Here's how:

1. Load each picture and resize it to 2x2 pixels
2. Grey them
3. You now have 4 bytes; that's your 'fingerprint' of the pic.
4. For rotation, swap the bytes
5. Move the fingerprint with the file information into a linked list
6. Sort the list
7. Find duplicates or near-duplicates by checking the bits that differ
8. For more detail, increase the size the pictures are resized to,
but start with 2x2 pixels first; the results are amazing
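
Step 7 could be sketched roughly like this. It is only an illustration, not pcfreak's code: the helper name FingerprintBitDiff is made up here, and it assumes the four grey bytes have already been packed into one 32-bit long.

```purebasic
; Hypothetical helper (not from pcfreak's code): number of bits that
; differ between two 4-byte fingerprints held in longs.
Procedure.l FingerprintBitDiff(a.l, b.l)
  Protected diff.l = a ! b      ; XOR leaves a 1 wherever the two values disagree
  Protected count.l = 0
  Protected i.l
  For i = 0 To 31
    If (diff >> i) & 1          ; test bit i of the XOR result
      count + 1
    EndIf
  Next
  ProcedureReturn count
EndProcedure

Debug FingerprintBitDiff($10203040, $10203041) ; 1 (only the lowest bit differs)
```

One caveat with counting bits: grey values 127 and 128 are neighbours but differ in all 8 bits, so comparing the bytes by absolute difference may be the more forgiving measure.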

Posted: Sun Feb 15, 2009 8:22 am
by pdwyer
Interesting idea. I guess a little larger than 2x2 would increase accuracy. I wonder how CPU intensive the resize would be if we were talking about 30,000 files...

I'm thinking of taking r,g,b values for a grid of points on an image (say 100 points on an evenly spaced 10x10 grid), taking the average of the RGB at each point, so the sig would be 100 values of the form xxx-xxx-xxx-xxx-xxx etc.

The distance between two images would be abs(sig1[0]-sig2[0]) + abs(sig1[1]-sig2[1]) + ... giving a distance of (if 100 points are used which may be too many) 0-25500.

I'm going to think more about the resize idea though, that sounds interesting

Posted: Sun Feb 15, 2009 11:00 am
by idle
A lot of it depends on the source of the image data. For a generalized method, though, I'd look into FFT for resolution independence, with a Hough transform to classify each target image. It'd still need fuzzy matching and may lead to false positives.

Posted: Sun Feb 15, 2009 3:25 pm
by pdwyer
Here's my first attempt. An image distance of under about 400 seems to make a pair a likely candidate for being the same image, just resaved, stretched or resized (rotation is not handled). Stretched images seem to work okay since the points are placed proportionally. You can change the point count used at the top.

Using 5x5 (25) check points. Change the pic paths at the bottom.

Still needs more testing with a wider range of images

Code: Select all

UseJPEGImageDecoder()
UsePNGImageDecoder()
UseJPEG2000ImageDecoder()

#SigPtCount = 25 ; number of check points, must be a perfect square (e.g. 16 or 25)

Structure ImgSig
	ImgPath.s
	PtAvg.w[#SigPtCount]
EndStructure

;================================================================

Procedure.l GetImgDist(*Img1.ImgSig, *Img2.ImgSig)

    Distance.l = 0
    
    For i = 0 To #SigPtCount -1
        Distance = Distance + Abs(*Img1\PtAvg[i]-*Img2\PtAvg[i])    
    Next
    
    ProcedureReturn Distance

EndProcedure

;================================================================

Procedure GetImgSig(*Img.ImgSig) ; ImgPath already populated; fills PtAvg[]

    Debug *Img\ImgPath
    
    Protected ImgWidth.l
    Protected ImgHeight.l
    Protected PointNo.l = 0
    Protected ImgID.i = LoadImage(#PB_Any, *Img\ImgPath)

    If ImgID = 0
        ProcedureReturn #False ; file missing or not a decodable image
    EndIf

    ImgHeight = ImageHeight(ImgID) 
    ImgWidth = ImageWidth(ImgID) 

    Debug Str(ImgWidth) + " x " + Str(ImgHeight)
    
    ; Grid spacing: Sqr(#SigPtCount) points per row and per column
    Ystep.l = Int(Round(ImgHeight/(Sqr(#SigPtCount) +1 ),#PB_Round_Nearest))
    XStep.l = Int(Round(ImgWidth/(Sqr(#SigPtCount) +1),#PB_Round_Nearest))

    If Ystep > 0 And XStep > 0 ; tiny images would otherwise divide by zero below
        If StartDrawing(ImageOutput(ImgID)) 
            For y = 1 To ImgHeight - 5 
                For x = 1 To ImgWidth - 5 
                    ; PointNo check keeps us inside PtAvg[] whatever the rounding does
                    If y % Ystep = 0 And x % XStep = 0 And PointNo < #SigPtCount
                        Colour.l = Point(x, y) 
                        ; Parentheses added so this is a real average. Without them the
                        ; value was R + G + B/3, so a threshold tuned on that version
                        ; may need re-tuning.
                        *Img\PtAvg[PointNo] = (Red(Colour) + Green(Colour) + Blue(Colour)) / 3
                        PointNo = PointNo + 1
                    EndIf
                Next
            Next
            StopDrawing() 
        EndIf
    EndIf

    FreeImage(ImgID) ; matters when scanning thousands of files
    ProcedureReturn #True
    
EndProcedure

;================================================================


Define Img1.ImgSig, Img2.ImgSig  
IMG1\ImgPath = "F:\photos\12240004_1.jpg"
IMG2\ImgPath = "F:\photos\12240004.jpg"

GetImgSig(@Img1)
GetImgSig(@Img2)

Debug GetImgDist(@Img1,@Img2)



Posted: Sun Feb 15, 2009 5:17 pm
by Kaeru Gaman
just an idea for comparison:

- resize both images in question to a quite small size,
so that, let's say, 1 pixel stands for 8x8 up to 32x32 pixels of the originals.

- draw one over the other using XOr

- count the amount of the bits left set

this will make up a value that directly counts the "difference" between both images.
sure, for this approach rotation matters.
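
A rough sketch of the XOR-and-count idea, working on plain arrays rather than on drawn images. The 8x8 thumbnail size, the greyscale input and the name XorBitCount are all assumptions made for the example; getting the pixels into the arrays is left out.

```purebasic
#ThumbPixels = 64 ; assumed: 8x8 greyscale thumbnail, one unsigned byte per pixel

; Hypothetical helper: XOR two thumbnails pixel by pixel and count the bits left set.
Procedure.l XorBitCount(Array a.a(1), Array b.a(1))
  Protected i.l, k.l, x.l
  Protected count.l = 0
  For i = 0 To #ThumbPixels - 1
    x = a(i) ! b(i)             ; bits that differ in this pixel
    For k = 0 To 7
      count + ((x >> k) & 1)    ; add 1 for every bit still set
    Next
  Next
  ProcedureReturn count
EndProcedure

Dim t1.a(#ThumbPixels - 1)      ; all black
Dim t2.a(#ThumbPixels - 1)
t2(0) = 255                     ; one pixel flipped from black to white
Debug XorBitCount(t1(), t2())   ; 8
```

Identical thumbnails give 0, and the count grows as more pixels disagree.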

Posted: Mon Feb 16, 2009 5:03 am
by pdwyer
Image resizing seems to be the consensus. I modified mine to resize to the number of points being checked (e.g. 5x5) and there seems to be some more improvement.

@Yadoku, :wink: The xor is an interesting idea, and considering the new image size probably not too CPU intensive. I'd need to test it, as I can't picture in my mind how XOR would behave under different conditions.

One thing I thought of but haven't tried yet: rather than taking an average, which I think would give similar results to grey scale here, take the differences between the RGB values. One pic I found was originally the same pic but had been lightened. The RGB values didn't move higher in exact proportion to each other, but the ratios between them were less impacted than the average of those values, which did go up overall.

I think I'll try swapping out avg with abs(R-G)+abs(R-B)+abs(G-B) and see how that goes.
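
As a sketch of what that swap might look like for a single pixel (ChannelSpread is a made-up name, and the second example assumes a lightening step that adds the same offset to every channel, which real edits only approximate):

```purebasic
; Hypothetical helper: sum of the pairwise differences between the
; colour channels of one pixel, abs(R-G) + abs(R-B) + abs(G-B).
Procedure.l ChannelSpread(Colour.l)
  Protected r.l = Red(Colour)
  Protected g.l = Green(Colour)
  Protected b.l = Blue(Colour)
  ProcedureReturn Abs(r - g) + Abs(r - b) + Abs(g - b)
EndProcedure

Debug ChannelSpread(RGB(200, 100, 50)) ; 100 + 150 + 50 = 300
Debug ChannelSpread(RGB(220, 120, 70)) ; same pixel lightened by 20: still 300
```

The result ranges from 0 (pure grey) to 510, so it still fits in the .w field used for the signature.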

Posted: Mon Feb 16, 2009 12:31 pm
by PB
> 1. Load and Resize any pictures to 2x2 pixel
> 2. Grey them

Done that.

> 3. you got 4 Bytes, thats your 'Fingerprint' of the pic.

This part I can't do. Any tips? I can't find pcfreak's code after searching. :(

Posted: Mon Feb 16, 2009 12:42 pm
by Rings
pcfreak never posted that in public.

If you have the 4 bytes (1 byte per grey pixel),
copy them to a long (4 bytes)
and you will have your fingerprint/checksum
or whatever you want to call it :)
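
One possible way to do that copy, shown only as a sketch (PackFingerprint is a made-up name). It puts the first grey value in the lowest byte, which is the same layout you get from overlaying a 4-byte array on a long on x86:

```purebasic
; Hypothetical helper: pack four grey values (0..255 each) into one 32-bit long.
Procedure.l PackFingerprint(g0.l, g1.l, g2.l, g3.l)
  ProcedureReturn (g0 & $FF) | ((g1 & $FF) << 8) | ((g2 & $FF) << 16) | ((g3 & $FF) << 24)
EndProcedure

Debug Hex(PackFingerprint($11, $22, $33, $44), #PB_Long) ; 44332211
```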

Posted: Mon Feb 16, 2009 1:00 pm
by Kaeru Gaman
erm... hello?

PULLERALARM!

this must be a JOKE!

you can't consider four 8-bit grey pixels to be a picture's fingerprint.

Posted: Mon Feb 16, 2009 1:10 pm
by pdwyer
See Rings' point 8 above.

It's not designed to be a unique fingerprint, remember, as we are trying to catch similarities too. If we were just after "identical", we wouldn't be working at an image level; an MD5 would be fine.

Posted: Mon Feb 16, 2009 6:48 pm
by Demivec
You would probably want to also have an option in your image check to check for matching aspect ratios.
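
Such an option could be as simple as the sketch below. The name SameAspect and the 2% default tolerance are assumptions for illustration, not values anyone in this thread has tested.

```purebasic
; Hypothetical helper: #True when two images have nearly the same
; width/height ratio. Tolerance is a relative margin (0.02 = 2%).
Procedure.l SameAspect(w1.l, h1.l, w2.l, h2.l, Tolerance.f = 0.02)
  Protected r1.f = w1           ; copy to floats first so the division
  Protected r2.f = w2           ; below is done in floating point
  If h1 <= 0 Or h2 <= 0
    ProcedureReturn #False
  EndIf
  r1 = r1 / h1
  r2 = r2 / h2
  If Abs(r1 - r2) <= Tolerance * r1
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug SameAspect(1600, 1200, 800, 600) ; 1 (both are 4:3)
Debug SameAspect(1600, 1200, 800, 800) ; 0 (4:3 against 1:1)
```

Keeping it as a separate switch means stretched copies can still be caught when the check is turned off.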

Posted: Tue Feb 17, 2009 1:11 am
by pdwyer
I was finding that with proportional checking points this was okay: I tried stretching an image in one direction so as to screw the proportions up, and the distance didn't change much.

Posted: Tue Feb 17, 2009 10:23 am
by Rings
Well, I recoded some stuff (with permission from pcfreak).
If you run the code, watch the "Grey Bytes" column.
If the values are close to each other, the images are similar too.
Pictures that are resized to, say, 100%, 70%, 50% or 30% have mostly identical grey bytes.

ok, code:

Code: Select all

#IdWidth  = 2
#IdHeight = 2
 
 UseJPEGImageDecoder()
 UsePNGImageDecoder()
 #List=1
 

Structure strucImgList
 Filename.s
 FileSize.i
 Width.l
 Height.l
 StructureUnion
  col.b[4]
  value.l
 EndStructureUnion
 sGrey.s
EndStructure 
NewList ImgList.strucIMGList()

; Grey value (0.299 R + 0.587 G + 0.114 B) of the pixel at x,y.
; Must be called between StartDrawing() and StopDrawing().
Procedure.l GreyAt(x.l, y.l)
  Protected c.l = Point(x, y)
  ProcedureReturn Int((Red(c) * 0.299) + (Green(c) * 0.587) + (Blue(c) * 0.114)) & $FF
EndProcedure

Procedure FileScan(FilePath.s, List ImgList.strucImgList()) 
  Protected FileName.s, FullName.s, iHdl.i
  
  If ExamineDirectory(1024, FilePath, "*.*") 
    While NextDirectoryEntry(1024) 
      ; ask for the entry type explicitly instead of relying on the return value above
      If DirectoryEntryType(1024) = #PB_DirectoryEntry_File 
        FileName = DirectoryEntryName(1024) 
        FullName = FilePath + "\" + FileName
        iHdl = LoadImage(#PB_Any, FullName) ; 0 if the file is not a decodable image
        If iHdl <> 0
        
          AddElement(ImgList())
          ImgList()\Filename = FileName
          ImgList()\FileSize = FileSize(FullName) 
          ImgList()\Width = ImageWidth(iHdl)
          ImgList()\Height = ImageHeight(iHdl)
          
          ; Shrink to 3x3 and sample the four edge midpoints R1..R4:
          ;   o 1 o
          ;   2 o 3
          ;   o 4 o
          ResizeImage(iHdl, #IdWidth + 1, #IdHeight + 1, #PB_Image_Smooth)
          If StartDrawing(ImageOutput(iHdl))
            ImgList()\col[0] = GreyAt(1, 0) ; R1
            ImgList()\col[1] = GreyAt(0, 1) ; R2
            ImgList()\col[2] = GreyAt(2, 1) ; R3
            ImgList()\col[3] = GreyAt(1, 2) ; R4
            StopDrawing()
            
            ; build the string (for sorting only); col is a signed byte, so mask
            ; with $FF to get 0..255 back before converting
            ImgList()\sGrey = RSet(Str(ImgList()\col[0] & $FF), 3, "0") + " | " + RSet(Str(ImgList()\col[1] & $FF), 3, "0") + " | " + RSet(Str(ImgList()\col[2] & $FF), 3, "0") + " | " + RSet(Str(ImgList()\col[3] & $FF), 3, "0")
          EndIf
          FreeImage(iHdl)
        EndIf
      EndIf 
    Wend 
    FinishDirectory(1024)
  EndIf 
EndProcedure 

Directory.s="d:\Bilder\Bild"

OpenWindow(0, 0, 0, 800, 600, "Dir Scan..." + Directory, #PB_Window_SystemMenu | #PB_Window_ScreenCentered) 
 Flags=#PB_ListIcon_GridLines|#PB_ListIcon_FullRowSelect|#PB_ListIcon_AlwaysShowSelection
 ListIconGadget(#List, 1 ,1, 800, 600, "File", 150, Flags) 
  AddGadgetColumn(#List, 1, "Size", 50) 
  AddGadgetColumn(#List, 2, "Width", 50) 
  AddGadgetColumn(#List, 3, "Height",50) 
  AddGadgetColumn(#List, 4, "Grey Bytes",180) 
  AddGadgetColumn(#List, 5, "Value",100) 

 
 
  FileScan(Directory,IMGList() )
 
  SortStructuredList(IMGList(), #PB_Sort_Ascending, OffsetOf(strucImgList\sGrey), #PB_Sort_String) ;sort to see better results
  
  ForEach ImgList()
   AddGadgetItem(#List, -1, ImgList()\FileName + Chr(10) + Str(ImgList()\FileSize) + Chr(10) + Str(ImgList()\width) + Chr(10) + Str(ImgList()\height) + Chr(10) + ImgList()\sGrey + Chr(10) +Str(ImgList()\Value)) 
  Next

  Repeat 
   EventID = WaitWindowEvent() 
    If EventID = #PB_Event_Gadget 
      Select EventGadget() 
    
      EndSelect 
    EndIf 
  Until EventID = #PB_Event_CloseWindow 

End 

Posted: Thu Feb 19, 2009 9:29 am
by Rings
pdwyer : any results ?