Image duplicate detection

Post by pdwyer »

I have a need, and an idea, for a simple app that measures the "distance" between two images of different formats and sizes (rotation doesn't matter). It's not going to be perfect but I think it will suit my needs.

i.e. a score of "0" between two images means they are identical (as far as they were checked), and the higher the number, the bigger the likely difference between them... something like that.

Before I start, though, I was wondering whether I'm reinventing the wheel here and there is a known good algorithm (possibly done in PB already) that I could use.

I've seen a couple of apps that can do this, but I'd like to roll my own to get the other functionality that I want to use this for.
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein

Post by Rings »

I know that pcfreak has done something similar (searching for duplicate pictures of any size and format), and it worked well here:

1. Load each picture and resize it to 2x2 pixels.
2. Convert it to greyscale.
3. You now have 4 bytes; that's your 'fingerprint' of the picture.
4. To handle rotation, swap the bytes.
5. Put the fingerprint, together with the file information, into a linked list.
6. Sort the list.
7. Find duplicates, or near-duplicates, by checking the bits that differ.
8. For more detail, increase the size the picture is resized to, but start with 2x2 pixels; the results are amazing.
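
Steps 4 and 7 could be sketched roughly as follows. This is only an illustration, not pcfreak's code: the byte layout (byte 0 = top-left, byte 1 = top-right, byte 2 = bottom-left, byte 3 = bottom-right) and the names RotateFingerprint and DifferingBits are assumptions, and the fingerprint is held in a quad so the shifts never work on a negative number.

```purebasic
; Sketch only: byte layout and procedure names are assumptions.
; Byte 0 (lowest 8 bits) = top-left, byte 1 = top-right,
; byte 2 = bottom-left, byte 3 = bottom-right of the 2x2 grey image.

; Step 4: fingerprint of the same picture turned 90 degrees clockwise.
Procedure.q RotateFingerprint(Fp.q)
  Protected TopLeft.q     = Fp & $FF
  Protected TopRight.q    = (Fp >> 8) & $FF
  Protected BottomLeft.q  = (Fp >> 16) & $FF
  Protected BottomRight.q = (Fp >> 24) & $FF
  ; After the turn: top-left <- bottom-left, top-right <- top-left,
  ; bottom-left <- bottom-right, bottom-right <- top-right
  ProcedureReturn BottomLeft | (TopLeft << 8) | (BottomRight << 16) | (TopRight << 24)
EndProcedure

; Step 7: number of bits that differ between two fingerprints (0 = identical).
Procedure.l DifferingBits(Fp1.q, Fp2.q)
  Protected Diff.q = Fp1 ! Fp2      ; XOR leaves only the differing bits set
  Protected n.l, Count.l = 0
  For n = 0 To 31                   ; only the low 32 bits hold pixel data
    If ((Diff >> n) & 1) = 1
      Count = Count + 1
    EndIf
  Next
  ProcedureReturn Count
EndProcedure

Define A.q = $40302010
Define B.q = $40302011              ; same picture, one grey value off by one
Debug DifferingBits(A, B)           ; 1
Debug Hex(RotateFingerprint(A))     ; 20401030
```

A pair would then count as a likely duplicate when the bit count stays under some small limit, also after comparing against the rotated fingerprints. Which limit works is something to find out by testing.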
SPAMINATOR NR.1

Post by pdwyer »

Interesting idea. I guess something a little larger than 2x2 would increase accuracy. I wonder how CPU intensive the resize would be if we were talking about 30,000 files...

I'm thinking of taking r,g,b values for a grid of points on an image (say 100 points, 10x10 evenly spaced) and averaging the rgb values at each point. The sig would then be 100 values, like xxx-xxx-xxx-xxx-xxx etc.

The distance between two images would be abs(sig1[0]-sig2[0]) + abs(sig1[1]-sig2[1]) + ... giving a distance of 0-25500 (100 points x 255, although 100 points may be too many).

I'm going to think more about the resize idea though, that sounds interesting.

Post by idle »

A lot of it depends on the source of the image data. For a generalized method, though, I'd look into an FFT for resolution independence, with a Hough transform to classify each target image. It would still need fuzzy matching and may lead to false positives.

Post by pdwyer »

Here's my first attempt. An image distance under about 400 seems to mark a likely candidate for the same image that has been resaved, stretched or resized (rotation is not handled). Stretching seems to work okay since the points are placed proportionally. You can change the number of points at the top (#SigPtCount).

It's using 5x5 (25) check points. Change the pic paths at the bottom.

It still needs more testing with a wider range of images.

Code:

UseJPEGImageDecoder()
UsePNGImageDecoder()
UseJPEG2000ImageDecoder()

#SigPtCount = 25;16

Structure ImgSig
	ImgPath.s
	PtAvg.w[#SigPtCount]
EndStructure

;================================================================

Procedure.l GetImgDist(*Img1.ImgSig, *Img2.ImgSig)

    Distance.l = 0
    
    For i = 0 To #SigPtCount -1
        Distance = Distance + Abs(*Img1\PtAvg[i]-*Img2\PtAvg[i])    
    Next
    
    ProcedureReturn Distance

EndProcedure

;================================================================

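Procedure GetImgSig(*Img.ImgSig) ; path populated, points to be added

    Protected ImgID.i
    Protected ImgWidth.l
    Protected ImgHeight.l
    Protected XStep.l, YStep.l, x.l, y.l, Colour.l
    Protected PointNo.l = 0

    Debug *Img\ImgPath

    ImgID = LoadImage(#PB_Any, *Img\ImgPath)
    If ImgID = 0
        Debug "Could not load image"
        ProcedureReturn #False
    EndIf

    ImgHeight = ImageHeight(ImgID)
    ImgWidth = ImageWidth(ImgID)

    Debug Str(ImgWidth) + " x " + Str(ImgHeight)

    ; Spacing between check points, proportional to the image size
    YStep = Int(Round(ImgHeight / (Sqr(#SigPtCount) + 1), #PB_Round_Nearest))
    XStep = Int(Round(ImgWidth / (Sqr(#SigPtCount) + 1), #PB_Round_Nearest))

    ; A step of 0 (very small image) would make the modulo below fail
    If XStep > 0 And YStep > 0 And StartDrawing(ImageOutput(ImgID))

        For y = 1 To ImgHeight - 5
            For x = 1 To ImgWidth - 5
                ; PointNo check: never write past the end of PtAvg
                If y % YStep = 0 And x % XStep = 0 And PointNo < #SigPtCount
                    Colour = Point(x, y)
                    ;Debug Str(PointNo) + " x= " + Str(x) + ", y= " + Str(y) + " Val: " + Str((Red(Colour) + Green(Colour) + Blue(Colour)) / 3)
                    ; Parentheses are needed: without them only Blue is divided by 3,
                    ; so thresholds found with the unparenthesized sum need re-checking
                    *Img\PtAvg[PointNo] = (Red(Colour) + Green(Colour) + Blue(Colour)) / 3
                    PointNo = PointNo + 1
                EndIf
            Next
        Next

        StopDrawing()
    EndIf

    FreeImage(ImgID) ; the image is no longer needed once the points are read
    ProcedureReturn #True

EndProcedure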

;================================================================


Define Img1.ImgSig, Img2.ImgSig  
IMG1\ImgPath = "F:\photos\12240004_1.jpg"
IMG2\ImgPath = "F:\photos\12240004.jpg"

GetImgSig(@Img1)
GetImgSig(@Img2)

Debug GetImgDist(@Img1,@Img2)


Post by Kaeru Gaman »

just an idea for comparison:

- resize both images in question to a quite small size,
so that, let's say, 1 pixel stands for 8x8 up to 32x32 pixels of the original.

- draw one over the other using XOr

- count the number of bits left set

this gives a value that directly measures the "difference" between both images.
sure, for this approach rotation matters.
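
a rough sketch of the XOr-and-count part, assuming both pictures were already reduced to the same small grey grid. the grid size, the array contents and the name XorDifference are made up for illustration, and the arrays stand in for the drawn images.

```purebasic
; Sketch only: grid size, names and sample values are assumptions.
#GridSize = 4   ; e.g. a 2x2 grid stored row by row

Procedure.l XorDifference(Array Img1.a(1), Array Img2.a(1))
  Protected i.l, n.l, Diff.l, Count.l = 0
  For i = 0 To #GridSize - 1
    Diff = Img1(i) ! Img2(i)        ; XOR leaves only the bits that differ
    For n = 0 To 7
      If ((Diff >> n) & 1) = 1
        Count = Count + 1
      EndIf
    Next
  Next
  ProcedureReturn Count
EndProcedure

Dim PicA.a(#GridSize - 1)
Dim PicB.a(#GridSize - 1)
PicA(0) = 200 : PicA(1) = 120 : PicA(2) = 64 : PicA(3) = 10
PicB(0) = 200 : PicB(1) = 121 : PicB(2) = 64 : PicB(3) = 10
Debug XorDifference(PicA(), PicB())   ; 1 (only the lowest bit of one pixel differs)
```

one thing to keep in mind: the bit count is not proportional to the brightness difference. 127 and 128 are neighbours in grey value but differ in all 8 bits.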
oh... and have a nice day.

Post by pdwyer »

Image resizing seems to be the consensus. I modified mine to resize the image to the number of points being checked (e.g. 5x5) and there seems to be some further improvement.

@Yadoku, :wink: The XOR is an interesting idea, and considering the small size of the resized image, probably not too CPU intensive. I'd need to test it, as I can't picture in my mind how XOR would behave under different conditions.

One thing I thought of but haven't tried yet: rather than taking an average, which I think would give results similar to greyscale here, take the differences between the r,g,b values. One pic I found was originally the same pic but had been lightened. The RGB values didn't move higher in exact proportion to each other, but the ratios between them were less affected than the average of those values, which went up overall.

I think I'll try swapping out avg with abs(R-G)+abs(R-B)+abs(G-B) and see how that goes.
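
A quick sketch of that score next to the plain average. The procedure names and the sample colours are made up, and the "lightened" pixel simply has 30 added to every channel, which is a simpler case than the real picture I mentioned. The score ranges from 0 (pure grey) to 510.

```purebasic
; Sketch only: names and sample colours are assumptions.
Procedure.l AbsDiff(a.l, b.l)
  If a > b
    ProcedureReturn a - b
  EndIf
  ProcedureReturn b - a
EndProcedure

; abs(R-G) + abs(R-B) + abs(G-B) for one check point
Procedure.l ChannelSpread(R.l, G.l, B.l)
  ProcedureReturn AbsDiff(R, G) + AbsDiff(R, B) + AbsDiff(G, B)
EndProcedure

Debug ChannelSpread(100, 80, 60)    ; 20 + 40 + 20 = 80
Debug ChannelSpread(130, 110, 90)   ; every channel +30: still 80
Debug (100 + 80 + 60) / 3           ; average: 80
Debug (130 + 110 + 90) / 3          ; average moves to 110
```

The catch is that a grey point scores 0 no matter how dark or light it is, so this probably can't replace the average completely for pictures with little colour in them.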

Post by PB »

> 1. Load and Resize any pictures to 2x2 pixel
> 2. Grey them

Done that.

> 3. you got 4 Bytes, thats your 'Fingerprint' of the pic.

This part I can't do. Any tips? I can't find pcfreak's code after searching. :(
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.

Post by Rings »

pcfreak never posted that in public.

If you have the 4 bytes (1 byte per grey pixel),
copy them into a long (4 bytes)
and you have your fingerprint/checksum,
or whatever you want to call it :)
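
A minimal sketch of that copy, assuming the first pixel goes into the lowest 8 bits. PackGrey is a made-up name, not part of pcfreak's code.

```purebasic
; Sketch only: the procedure name and the byte order are assumptions.
; P0..P3 are the four grey values (0..255); P0 ends up in the lowest byte.
Procedure.l PackGrey(P0.l, P1.l, P2.l, P3.l)
  ProcedureReturn (P0 & $FF) | ((P1 & $FF) << 8) | ((P2 & $FF) << 16) | ((P3 & $FF) << 24)
EndProcedure

Define Fingerprint.l = PackGrey(16, 32, 48, 64)
Debug Hex(Fingerprint, #PB_Long)    ; 40302010
Debug (Fingerprint >> 8) & $FF      ; 32, the second pixel read back
```

Note that a long is signed, so the value shows up as a negative number whenever the last pixel is 128 or brighter. The bits are still the same.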

Post by Kaeru Gaman »

erm... hello?

PULLERALARM!

this must be a JOKE!

you can't consider four 8-bit grey pixels to be a picture's fingerprint.

Post by pdwyer »

See Rings' point 8 above.

It's not designed to be a unique fingerprint, remember, as we are trying to catch similarities too. If we were just after "identical" files we wouldn't be working at an image level; an MD5 would be fine.

Post by Demivec »

You would probably want to also have an option in your image check to check for matching aspect ratios.
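
Something along these lines might do as a sketch. The procedure name and the default 2% tolerance are only assumptions for illustration.

```purebasic
; Sketch only: the name and the default tolerance are assumptions.
; Returns #True when the two width/height ratios differ by no more than
; the given fraction of the first ratio.
Procedure.i SameAspectRatio(W1.d, H1.d, W2.d, H2.d, Tolerance.d = 0.02)
  Protected Ratio1.d, Ratio2.d
  If H1 <= 0 Or H2 <= 0
    ProcedureReturn #False          ; avoid dividing by zero
  EndIf
  Ratio1 = W1 / H1
  Ratio2 = W2 / H2
  If Abs(Ratio1 - Ratio2) <= Tolerance * Ratio1
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug SameAspectRatio(1600, 1200, 800, 600)    ; 1 (both are 4:3)
Debug SameAspectRatio(1600, 1200, 1600, 900)   ; 0 (4:3 against 16:9)
```

As an option it could be switched off when stretched copies should still count as duplicates.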

Post by pdwyer »

I was finding that with proportionally placed check points this was okay. I tried stretching an image in one direction so as to screw the proportions up, and the distance didn't change much.

Post by Rings »

Well, I recoded some stuff (with permission from pcfreak).
If you run the code, watch the "Grey Bytes" column.
The closer those values are to each other, the more similar the images are.
Pictures that are resized to, say, 100%, 70%, 50% or 30% have mostly identical grey bytes.

OK, the code:

Code:

#IdWidth  = 2
#IdHeight = 2
 
 UseJPEGImageDecoder()
 UsePNGImageDecoder()
 #List=1
 

Structure strucImgList
 Filename.s
 FileSize.i
 Width.l
 Height.l
 StructureUnion
  col.a[4] ; unsigned bytes (.a), so grey values above 127 are not stored as negative numbers
  value.l
 EndStructureUnion
 sGrey.s
EndStructure 
NewList ImgList.strucIMGList()

Procedure FileScan(FilePath.s, List ImgList.strucImgList()) 
  
  If ExamineDirectory(1024, FilePath.s, "*.*") 
    Repeat 
      FileType     = NextDirectoryEntry(1024) 
      FileName.s   = DirectoryEntryName(1024) 
      
      If FileType = 1 
         iHdl.l = LoadImage(#PB_Any,FilePath.s + "\"+ Filename)
         If iHdl <> 0
        
          AddElement(ImgList())
          ImgList()\Filename=Filename
          ImgList()\Filesize=FileSize(FilePath.s + "\" + DirectoryEntryName(1024)) 
          ImgList()\Width=ImageWidth(iHdl)
          ImgList()\Height=ImageHeight(iHdl)
          
          ResizeImage(iHdl, #IdWidth + 1, #IdHeight + 1, #PB_Image_Smooth)
          hDC.l = StartDrawing(ImageOutput(iHdl))
          If hDC
           ;R1
           ;oxo
           ;ooo
           ;ooo
            rgb.l = Point(1, 0)
            f.l = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
            ImgList()\col[0]=  f & $FF

          ;R2
          ;ooo
          ;xoo
          ;ooo
           rgb.l = Point(0, 1)
           f.l = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
           ImgList()\col[1]= f & $FF
  
          ;R3
          ;ooo
          ;oox
          ;ooo
           rgb.l = Point(2, 1)
           f.l = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
           ImgList()\col[2]= f & $FF
  
          ;R4
          ;ooo
          ;ooo
          ;oxo
           rgb.l = Point(1, 2)
           f.l = Int((Red(rgb) * 0.299) + (Green(rgb) * 0.587) + (Blue(rgb) * 0.114))
           ImgList()\col[3] = f & $FF
 
          ;build the string(for sorting only)
          ImgList()\sGrey =RSet(StrU(ImgList()\col[0]), 3, "0") + " | " + RSet(StrU(ImgList()\col[1]), 3, "0")  + " | " +  RSet(StrU(ImgList()\col[2]), 3, "0") + " | " +  RSet(StrU(ImgList()\col[3]), 3, "0")     
         
         
         StopDrawing()
        EndIf
        FreeImage(iHdl)
       EndIf
      EndIf 
    Until FileType = 0 
  EndIf 
EndProcedure 

Directory.s="d:\Bilder\Bild"

OpenWindow(0, 0, 0, 800, 600, "Dir Scan..." + Directory, #PB_Window_SystemMenu | #PB_Window_ScreenCentered) 
 Flags=#PB_ListIcon_GridLines|#PB_ListIcon_FullRowSelect|#PB_ListIcon_AlwaysShowSelection
 ListIconGadget(#List, 1 ,1, 800, 600, "File", 150, Flags) 
  AddGadgetColumn(#List, 1, "Size", 50) 
  AddGadgetColumn(#List, 2, "Width", 50) 
  AddGadgetColumn(#List, 3, "Height",50) 
  AddGadgetColumn(#List, 4, "Grey Bytes",180) 
  AddGadgetColumn(#List, 5, "Value",100) 

 
 
  FileScan(Directory,IMGList() )
 
  SortStructuredList(IMGList(), #PB_Sort_Ascending, OffsetOf(strucImgList\sGrey), #PB_Sort_String) ;sort to see better results
  
  ForEach ImgList()
   AddGadgetItem(#List, -1, ImgList()\FileName + Chr(10) + Str(ImgList()\FileSize) + Chr(10) + Str(ImgList()\width) + Chr(10) + Str(ImgList()\height) + Chr(10) + ImgList()\sGrey + Chr(10) +Str(ImgList()\Value)) 
  Next

  Repeat 
   EventID = WaitWindowEvent() 
    If EventID = #PB_Event_Gadget 
      Select EventGadget() 
    
      EndSelect 
    EndIf 
  Until EventID = #PB_Event_CloseWindow 

End 

Post by Rings »

pdwyer: any results?