It is a typical problem: we downloaded images from a camera, maybe did not delete them from the camera right away, then downloaded the same images again next time, maybe created an album by copying images into sub-folders (without Nepomuk, Digikam can only do so much ;), and so on. In the end there are a lot of duplicate photos lying around.
But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:
select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  ?f1 nfo:hasHash ?h .
  ?f2 nfo:hasHash ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1 != ?f2) .
}
Quick explanation: the query selects all pairs of nexif:Photo resources which have the same hash value but are not the same resource. This of course can be tweaked by adding something like
?f1 nfo:fileName ?fn . ?f2 nfo:fileName ?fn .
to make sure that we only catch the ones that we downloaded more than once. Or we add
?f1 nie:contentCreated ?cc . ?f2 nie:contentCreated ?cc .
to ensure that both photos were actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.
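Put together, the file-name variant could look roughly like the sketch below. Two things in it are my own additions and not part of the query above: the PREFIX lines (Nepomuk normally knows these prefixes already; the namespaces are the ones from the published ontologies) and the string comparison on the URLs, which replaces the plain inequality so that each pair is reported once instead of twice as (a, b) and (b, a). Treat it as a starting point, not as a tested recipe.

```sparql
PREFIX nexif: <http://www.semanticdesktop.org/ontologies/2007/05/10/nexif#>
PREFIX nfo:   <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX nie:   <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>

select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  # same content hash
  ?f1 nfo:hasHash ?h .
  ?f2 nfo:hasHash ?h .
  # same file name, i.e. most likely downloaded twice
  ?f1 nfo:fileName ?fn .
  ?f2 nfo:fileName ?fn .
  # alternative: same creation time instead of same file name
  # ?f1 nie:contentCreated ?cc . ?f2 nie:contentCreated ?cc .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  # assumption: ordering by URL string keeps only one of each symmetric pair
  filter(str(?u1) < str(?u2)) .
}
```

Note that three copies of the same photo still show up as three pairs; grouping them is left to whatever consumes the results.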
Maybe one last little detail. In theory it would be more correct to do the following:
?f1 nfo:hasHash ?h1 . ?f2 nfo:hasHash ?h2 . ?h1 nfo:hashValue ?h . ?h2 nfo:hashValue ?h .
However, since the introduction of the Data Management Service in KDE 4.7, hash resources with the same value are merged into one, hence the slightly simpler query above. Still, to be sure that pre-KDE-4.7 data is handled properly as well, the longer form might be prudent.
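For completeness, here is a sketch of the whole query in that more careful form, comparing hash values instead of hash resources. As far as I can tell it should also match merged hashes, since ?h1 and ?h2 may bind to the same resource. The PREFIX lines are an assumption on my side (they are normally predefined in Nepomuk; the namespaces are taken from the published ontologies), everything else is the query from above with the hash patterns swapped.

```sparql
PREFIX nexif: <http://www.semanticdesktop.org/ontologies/2007/05/10/nexif#>
PREFIX nfo:   <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX nie:   <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>

select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  # each photo may have its own hash resource (pre-KDE-4.7 data) ...
  ?f1 nfo:hasHash ?h1 .
  ?f2 nfo:hasHash ?h2 .
  # ... so compare the actual hash values
  ?h1 nfo:hashValue ?h .
  ?h2 nfo:hashValue ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1 != ?f2) .
}
```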
Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)

