Finding Duplicate Images Made Easy


It is a typical problem: we download images from a camera, maybe do not delete them from the camera right away, then download the same images again next time, maybe create an album by copying images into sub-folders (without Nepomuk, Digikam can only do so much ;), and so on. The result is a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where { 
  ?f1 a nexif:Photo . 
  ?f2 a nexif:Photo . 
  ?f1 nfo:hasHash ?h . 
  ?f2 nfo:hasHash ?h . 
  ?f1 nie:url ?u1 . 
  ?f2 nie:url ?u2 . 
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all pairs of nexif:Photo resources which have the same hash value but are not the same resource. This of course can be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch files that also share a file name, i.e. the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that the photo was actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.
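
To make the join and its two refinements concrete, here is a minimal Python sketch that mimics them on an in-memory list. It only illustrates the logic and is not how Nepomuk evaluates the query. The `Photo` record, the `duplicate_pairs` helper and the sample data are all hypothetical and invented for this example.

```python
from itertools import permutations
from typing import NamedTuple


class Photo(NamedTuple):
    url: str        # stands in for nie:url
    hash: str       # stands in for the value reached via nfo:hasHash
    file_name: str  # stands in for nfo:fileName
    created: str    # stands in for nie:contentCreated


def duplicate_pairs(photos, same_name=False, same_created=False):
    """Return (url1, url2) for every ordered pair of distinct photos sharing a hash."""
    pairs = []
    # permutations() yields every ordered pair of distinct entries, much like
    # the two-variable pattern combined with filter(?f1 != ?f2).
    for p1, p2 in permutations(photos, 2):
        if p1.hash != p2.hash:
            continue
        if same_name and p1.file_name != p2.file_name:
            continue
        if same_created and p1.created != p2.created:
            continue
        pairs.append((p1.url, p2.url))
    return pairs


# Hypothetical sample data, invented for illustration only.
photos = [
    Photo("file:///camera/img1.jpg", "abc", "img1.jpg", "2011-06-01T10:00:00"),
    Photo("file:///album/img1.jpg", "abc", "img1.jpg", "2011-06-01T10:00:00"),
    Photo("file:///album/renamed.jpg", "abc", "renamed.jpg", "2011-06-01T10:00:00"),
    Photo("file:///camera/img2.jpg", "def", "img2.jpg", "2011-06-02T09:30:00"),
]

all_pairs = duplicate_pairs(photos)                    # hash only
named_pairs = duplicate_pairs(photos, same_name=True)  # hash and file name
```

One thing the sketch makes visible: because the only restriction is that the two resources differ, every pair of duplicates shows up twice, once in each order. The query above behaves the same way.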

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, with the introduction of the Data Management Service in KDE 4.7, identical hash resources are merged into one, hence the slightly simpler query at the top. Still, to properly handle pre-KDE-4.7 data as well, the longer form might be prudent.
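
The difference between comparing hash resources and comparing hash values can be sketched in a few lines of Python. This is only a model under stated assumptions: the two dictionaries are hypothetical stand-ins for the nfo:hasHash and nfo:hashValue relations, and the data imitates an unmerged store where two files carry separate hash resources with the same value.

```python
from collections import defaultdict

# Hypothetical data: file URL -> hash resource (models nfo:hasHash).
file_to_hash_resource = {
    "file:///camera/img1.jpg": "hash:1",
    "file:///album/img1.jpg": "hash:2",  # separate resource, same value
    "file:///camera/img2.jpg": "hash:3",
}

# Hypothetical data: hash resource -> hash value (models nfo:hashValue).
hash_resource_value = {
    "hash:1": "abc",
    "hash:2": "abc",
    "hash:3": "def",
}


def groups_with_duplicates(key_of):
    """Group URLs by key_of[url] and keep only groups with more than one member."""
    groups = defaultdict(list)
    for url, key in key_of.items():
        groups[key].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Comparing the hash resources themselves finds nothing in unmerged data.
by_resource = groups_with_duplicates(file_to_hash_resource)

# Following the extra hop to the hash value finds the duplicate.
file_to_value = {
    url: hash_resource_value[res] for url, res in file_to_hash_resource.items()
}
by_value = groups_with_duplicates(file_to_value)
```

In this model the simple comparison only works once both files point at the same hash resource, which is what the merging is described as achieving.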

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)

12 thoughts on “Finding Duplicate Images Made Easy”

  1. For the images managed in DigiKam there is Tools -> Find Duplicates. It works on Haar fingerprints, not plain hashes, so you are able to find similar images or even sketch something and find images that match your drawing.

    cheers

  2. I added the new command to Nepoogle and changes are available in git in case anyone is interested, http://github.com/serantes/nepoogle.

    I have adapted Sebastian's query a little bit to suit Nepoogle, and it now reads as follows:

    SELECT DISTINCT ?h AS ?id
    WHERE {

    }
    ORDER BY ?h

    because I want to group all duplicate files in one single entry.

    Luckily today is a holiday in Spain :).

      • Coming home I was thinking about this query and I thought there was a simpler way to integrate it into Nepoogle using COUNT() and the result is the following two queries:

        SELECT DISTINCT ?hash AS ?id
        WHERE {
        ?r0 nfo:hasHash ?hash .
        ?r0 rdf:type nexif:Photo .
        }
        HAVING (COUNT(?r0) > 1)
        ORDER BY ?hash

        This is equivalent to Sebastian's query, and if you remove the rdf:type restriction you can search for any kind of duplicate:

        SELECT DISTINCT ?hash AS ?id
        WHERE {
        ?r0 nfo:hasHash ?hash .
        }
        HAVING (COUNT(?r0) > 1)
        ORDER BY ?hash

        Obviously my queries were written for use with Nepoogle and do not display file URLs the way Sebastian's does.

        I added a new command called --findduplicates, associated with the second query, and renamed --findduplicateimages to --findduplicatephotos because it is a better name.

        This was fun :).

  3. This…is a really solid use case for Nepomuk! I can surely make use of it. This kind of feature needs more advertising. This alone convinced me to start using it with the next SC release!

  4. This is an old post. Has something changed since then? I am trying to find duplicate files and it does not work, i.e.:

    I have copied an image and rated both to maximum:
    > kioclient list 'nepomuksearch:/rating=10'
    843458834_n.jpg
    843458834_n 1.jpg

    Using SPARQL I can find all images (fast), including these two (I do not know how to search by file name):
    > kioclient list 'nepomuksearch:/?sparql=select * where { ?f a nexif:Photo . }' | grep 843458834
    843458834_n.jpg
    843458834_n 1.jpg

    But these return nothing:

    > kioclient list 'nepomuksearch:/?sparql=select distinct ?u1 ?u2 where { ?f1 a nexif:Photo . ?f2 a nexif:Photo . ?f1 nfo:hasHash ?h . ?f2 nfo:hasHash ?h . ?f1 nie:url ?u1 . ?f2 nie:url ?u2 . filter(?f1!=?f2) . }'

    > kioclient list 'nepomuksearch:/?sparql=select distinct ?u1 ?u2 where { ?f1 a nexif:Photo . ?f2 a nexif:Photo . ?f1 nfo:hasHash ?h . ?f2 nfo:hasHash ?h . ?f1 nie:url ?u1 . ?f2 nie:url ?u2 . }'

    > kioclient list 'nepomuksearch:/?sparql=select DISTINCT ?hash AS ?id WHERE { ?r0 nfo:hasHash ?hash . } HAVING (COUNT(?r0) > 1) ORDER BY ?hash'

    Is nfo:hasHash perhaps deprecated?

    • AFAIK nfo:hasHash is not calculated by default anymore, or possibly not at all, due to the possible performance impact of creating the hash for large files. So sadly this tip no longer works as is.

      • Oh, thanks, I was going crazy. Do you know where this is documented (to see whether I could turn it on) and where I could learn more about these queries?

        What would be a similar query for finding identical file names and file sizes?
