Finding Duplicate Images Made Easy

It is a typical problem: we downloaded images from a camera, maybe did not delete them from the camera right away, then downloaded the same images again next time, maybe created an album by copying images into sub-folders (without Nepomuk, Digikam can only do so much ;), and so on. The end result is a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  ?f1 nfo:hasHash ?h .
  ?f2 nfo:hasHash ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all pairs of nexif:Photo resources that have the same hash value but are not the same resource. This can of course be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that both photos were actually taken at the same time, although I suppose the probability that two different photos have the same hash value is rather small.
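
Put together, a tightened version of the query could look roughly like the sketch below. It only merges the original patterns with both optional constraints and, like the query above, assumes the nexif, nfo and nie prefixes are already declared wherever the query is run.

```sparql
# Sketch: duplicate photos that also share file name and creation time.
select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  # both files point to the same hash
  ?f1 nfo:hasHash ?h .
  ?f2 nfo:hasHash ?h .
  # same file name: most likely downloaded more than once
  ?f1 nfo:fileName ?fn .
  ?f2 nfo:fileName ?fn .
  # same creation time
  ?f1 nie:contentCreated ?cc .
  ?f2 nie:contentCreated ?cc .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}
```

Keep in mind that every added pattern is mandatory: a photo without a stored file name or creation time no longer matches, so this version can miss duplicates that the simpler query finds.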

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, with the introduction of the Data Management Service in KDE 4.7, similar hash resources are merged into one, hence the slightly simpler query above. Still, to properly handle pre-KDE-4.7 data as well, the above addition might be prudent.
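
Spelled out in full, the variant that compares hash values might look like the sketch below. It is the first query with its two nfo:hasHash patterns replaced by the four patterns above; the prefixes are again assumed to be declared already.

```sparql
# Sketch: match on the hash value instead of the hash resource.
select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  # each file has its own hash resource ...
  ?f1 nfo:hasHash ?h1 .
  ?f2 nfo:hasHash ?h2 .
  # ... and both hash resources carry the same value
  ?h1 nfo:hashValue ?h .
  ?h2 nfo:hashValue ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}
```

Since ?h1 and ?h2 may bind to the same resource, this form also covers data where the hash resources have already been merged.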

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)

9 thoughts on “Finding Duplicate Images Made Easy”

  1. Duplicate removal will be one of the main features of my resource browser project, but I was struggling to implement it. Now I can do it not only for images but for all types of files. Thanks a lot :)

  2. Very useful entry! I will add a --findduplicateimages command to Nepoogle.

  3. For images managed in DigiKam there is Tools -> Find Duplicates. It works on Haar fingerprints, not a plain hash, so you are able to find similar images or even sketch something and find images that match your drawing.

    cheers

    • I think Sebastian wanted to show the power of Nepomuk and was not trying to compete with a specialized tool such as Digikam :).

      And not all of your images are in Digikam; in my case this trick was useful to locate duplicate album covers.

      By the way, Digikam is a terrific tool.

      • Indeed. :)

  4. I added the new command to Nepoogle and the changes are available in git, in case anyone is interested: http://github.com/serantes/nepoogle.

    I have adapted Sebastian's query a little to suit Nepoogle, and it ended up as follows:

    SELECT DISTINCT ?h AS ?id
    WHERE {

    }
    ORDER BY ?h

    because I want to group all duplicate files in one single entry.

    Luckily today is a holiday in Spain :).

    • Yet another little trick: avoid getting each duplicate twice by adding something like:

      FILTER(STR(?f1) > STR(?f2)) .

      • On the way home I was thinking about this query, and it occurred to me that there was a simpler way to integrate it into Nepoogle using COUNT(). The result is the following two queries:

        SELECT DISTINCT ?hash AS ?id
        WHERE {
        ?r0 nfo:hasHash ?hash .
        ?r0 rdf:type nexif:Photo .
        }
        HAVING (COUNT(?r0) > 1)
        ORDER BY ?hash

        This is the equivalent of Sebastian's query, and if you remove the rdf:type pattern you can search for any kind of duplicate:

        SELECT DISTINCT ?hash AS ?id
        WHERE {
        ?r0 nfo:hasHash ?hash .
        }
        HAVING (COUNT(?r0) > 1)
        ORDER BY ?hash

        Obviously my queries were written for use with Nepoogle and don’t display file URLs like Sebastian's does.

        I added a new command called --findduplicates associated with the second query and renamed --findduplicateimages to --findduplicatephotos because it is a better name.

        This was fun :).

  5. This…is a really solid use case for Nepomuk! I can surely make use of it. This kind of feature needs more advertisement. Just this alone convinced me to start using it by the next SC release!
