Finding Duplicate Images Made Easy

December 6, 2011 / Sebastian Trüg

It is a typical problem: we download images from a camera, maybe do not delete them from the camera right away, then download the same images again next time, maybe create an album by copying images into sub-folders (without Nepomuk, Digikam can only do so much ;), and so on. In the end there are a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where { 
  ?f1 a nexif:Photo . 
  ?f2 a nexif:Photo . 
  ?f1 nfo:hasHash ?h . 
  ?f2 nfo:hasHash ?h . 
  ?f1 nie:url ?u1 . 
  ?f2 nie:url ?u2 . 
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all pairs of nexif:Photo resources that have the same hash value but are not the same resource. This of course can be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that the photo was actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.
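Put together, a stricter variant might look like the following sketch. It only combines the patterns already shown and, like the original query, assumes that the nexif:, nfo: and nie: prefixes are predefined by the query service.

```sparql
# Sketch: photos that share hash, file name and creation time.
select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  ?f1 nfo:hasHash ?h .
  ?f2 nfo:hasHash ?h .
  # Same file name: only catch files that were downloaded more than once.
  ?f1 nfo:fileName ?fn .
  ?f2 nfo:fileName ?fn .
  # Same creation time: the photos were taken at the same moment.
  ?f1 nie:contentCreated ?cc .
  ?f2 nie:contentCreated ?cc .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}
```

Each extra pair of patterns narrows the result, so a copy that was renamed would no longer be reported by this variant.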

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, since the introduction of the Data Management Service in KDE 4.7, hash resources with the same value are merged into one, hence the slightly simpler query at the top. Still, to properly handle pre-KDE-4.7 data as well, the above addition might be prudent.
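Spelled out in full, the more defensive query could look like this sketch. It is simply the first query with the hash comparison moved to nfo:hashValue, and it again assumes that the prefixes are predefined.

```sparql
# Sketch: compare hash values instead of hash resources, so that
# unmerged (pre-KDE-4.7) hash resources are matched as well.
select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  ?f1 nfo:hasHash ?h1 .
  ?f2 nfo:hasHash ?h2 .
  ?h1 nfo:hashValue ?h .
  ?h2 nfo:hashValue ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}
```

When the hash resources have already been merged, ?h1 and ?h2 bind to the same resource and the query returns the same pairs as the simpler one.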

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)

12 thoughts on “Finding Duplicate Images Made Easy”

Phaneendra Hegde (@PhaneendraNH)

December 6, 2011 at 12:33

Duplicate removal will be one of the main features of my resource browser project, but I was struggling to implement it. Now I can do it not only for images but for all types of files. Thanks a lot :)

Ignacio Serantes

December 6, 2011 at 13:25

Very useful entry! I will add a --findduplicateimages command to Nepoogle.

Francesco Riosa

December 6, 2011 at 13:52

For the images managed in DigiKam there is Tools -> Find Duplicates. It works on Haar fingerprints, not a plain hash, so you’re able to find similar images or even sketch something and find images that match your drawing.

cheers

- Ignacio Serantes
  
  December 6, 2011 at 15:35
  
  I think Sebastian wanted to show the power of Nepomuk, not to compete with a specialized tool such as Digikam :).
  
  And not all your images are in Digikam; in my case this trick was useful for locating duplicate album covers.
  
  By the way, Digikam is a terrific tool.
  
  - Sebastian Trüg
    
    December 6, 2011 at 15:44
    
    Indeed. :)
    
Ignacio Serantes

December 6, 2011 at 15:28

I added the new command to Nepoogle and changes are available in git in case anyone is interested, http://github.com/serantes/nepoogle.

I have adapted Sebastian's query a little to suit Nepoogle, and it now looks as follows:

SELECT DISTINCT ?h AS ?id
WHERE {
…
}
ORDER BY ?h

because I want to group all duplicate files in one single entry.

Luckily today is a holiday in Spain :).

- Sebastian Trüg
  
  December 6, 2011 at 15:43
  
  Yet another little trick: avoid getting each duplicate twice by adding something like:
  
  FILTER(STR(?f1) > STR(?f2)) .
  
  - Ignacio Serantes
    
    December 7, 2011 at 02:10
    
    Coming home I was thinking about this query and I thought there was a simpler way to integrate it into Nepoogle using COUNT() and the result is the following two queries:
    
    SELECT DISTINCT ?hash AS ?id
    WHERE {
    ?r0 nfo:hasHash ?hash .
    ?r0 rdf:type nexif:Photo .
    }
    HAVING (COUNT(?r0) > 1)
    ORDER BY ?hash
    
    This is the equivalent of Sebastian's query, and if you remove the rdf:type filter you can search for any kind of duplicate:
    
    SELECT DISTINCT ?hash AS ?id
    WHERE {
    ?r0 nfo:hasHash ?hash .
    }
    HAVING (COUNT(?r0) > 1)
    ORDER BY ?hash
    
    Obviously my queries were written for use with Nepoogle and don’t display file URLs like Sebastian's does.
    
    I added a new command called --findduplicates for the second query and renamed --findduplicateimages to --findduplicatephotos because it is a better name.
    
    This was fun :).
    
Anon

December 7, 2011 at 08:57

This…is a really solid use case for nepomuk! I surely can make use of it. It needs more advertisement of this kind of feature. Just this alone convinced me to start using it by next SC release!

XGS

November 27, 2013 at 16:10

This was an old post. Did something change since then? I am trying to find duplicate files and it does not work, ie:

I have copied an image and rated both to maximum:
> kioclient list 'nepomuksearch:/rating=10'
843458834_n.jpg
843458834_n 1.jpg

Using SPARQL I can find all images (fast), including these two (I don't know how to search by file name):
> kioclient list 'nepomuksearch:/?sparql=select * where { ?f a nexif:Photo . }' | grep 843458834
843458834_n.jpg
843458834_n 1.jpg

But these return nothing:

> kioclient list 'nepomuksearch:/?sparql=select distinct ?u1 ?u2 where { ?f1 a nexif:Photo . ?f2 a nexif:Photo . ?f1 nfo:hasHash ?h . ?f2 nfo:hasHash ?h . ?f1 nie:url ?u1 . ?f2 nie:url ?u2 . filter(?f1!=?f2) . }'

> kioclient list 'nepomuksearch:/?sparql=select distinct ?u1 ?u2 where { ?f1 a nexif:Photo . ?f2 a nexif:Photo . ?f1 nfo:hasHash ?h . ?f2 nfo:hasHash ?h . ?f1 nie:url ?u1 . ?f2 nie:url ?u2 . }'

> kioclient list 'nepomuksearch:/?sparql=select DISTINCT ?hash AS ?id WHERE { ?r0 nfo:hasHash ?hash . } HAVING (COUNT(?r0) > 1) ORDER BY ?hash'

Is maybe nfo:hasHash deprecated?

- Sebastian Trüg
  
  November 27, 2013 at 16:40
  
  AFAIK nfo:hasHash is not calculated by default anymore, or even not at all due to the possible performance impact of creating the hash for large files. So sadly this tip does not work as is anymore.
  
  - XGS
    
    November 27, 2013 at 19:26
    
    Oh, thanks, I was going crazy. Do you know where it is documented (to see if I can turn it on) and where I can learn more about these queries?
    
    What would be a similar query for finding identical file names and file sizes?
    

Trueg's Blog

Semantic Webbiness, some authentication, and a whole lot of ACLs
