Finding Duplicate Images Made Easy

It is a typical problem: we downloaded images from a camera, maybe did not delete them from the camera instantly, then downloaded the same images again next time, maybe created an album by copying images into sub-folders (without Nepomuk Digikam can only do so much ;), and so on. Essentially there are a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where { 
  ?f1 a nexif:Photo . 
  ?f2 a nexif:Photo . 
  ?f1 nfo:hasHash ?h . 
  ?f2 nfo:hasHash ?h . 
  ?f1 nie:url ?u1 . 
  ?f2 nie:url ?u2 . 
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all nexif:Photo resources which have the same hash value but are not the same resource. This can of course be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that the photo was actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, with the introduction of the Data Management Service in KDE 4.7 similar hash resources are merged into one. Thus, the slightly simpler query above. Still, to be sure to also properly handle pre-KDE-4.7 data the above addition might be prudent.

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)
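For comparison, the same idea (same hash, different file) can be sketched without Nepomuk at all. The following is only an illustration in plain Python: the function names are made up, SHA-1 is an arbitrary choice and not necessarily the hash Nepomuk stores, and the files are hashed on the spot instead of being looked up in an index.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_hash(path, chunk_size=1 << 16):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group all files below root by content hash and keep groups of two or more."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[file_hash(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling find_duplicates on a photo folder returns one list of paths per group of identical files, which corresponds to the ?u1/?u2 pairs of the query above.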

Manually Forcing the (Re-)Indexing of Folders is Easy

Ever since the unicode bug in Virtuoso 6.1.3 many of us have broken unicode strings in our Nepomuk databases. Completely re-creating the database is IMHO not an option since that would mean losing all manual annotations and things like download source URLs. One solution would be restoring a backup but I simply do not trust the Nepomuk backup until I have had a deeper look into it. The perfect solution would be if Nepomuk could simply fix the data automatically. While that is of course my goal and I am looking into it, it will take a while.

In the meantime I threw together a small desktop file which adds two new actions to the context menu of folders.

  1. (Re-)index Folder contents will make the indexer update all the files in the folder regardless of their state in Nepomuk. This includes fixing the broken unicode strings.
  2. (Re-)index Folder contents recursive does the same as the above except that it also recurses into sub folders.

Simply put the following into a file called “nepomuk-index-folder.desktop” and save it in “~/.kde/share/kde4/services/ServiceMenus”. At the next start of Dolphin or Konqueror the two new actions will be available.

[Desktop Entry]
Type=Service
X-KDE-ServiceTypes=KonqPopupMenu/Plugin,inode/directory
Actions=indexFolder;indexFolderRecursive;
X-KDE-Submenu=Desktop Search
Icon=nepomuk

[Desktop Action indexFolder]
Name=(Re-)index Folder contents
Icon=nepomuk
Exec=qdbus org.kde.nepomuk.services.nepomukfileindexer /nepomukfileindexer org.kde.nepomuk.FileIndexer.indexFolder %f 0 1

[Desktop Action indexFolderRecursive]
Name=(Re-)index Folder contents recursive
Icon=nepomuk
Exec=qdbus org.kde.nepomuk.services.nepomukfileindexer /nepomukfileindexer org.kde.nepomuk.FileIndexer.indexFolder %f 1 1

Update: The code above only works for KDE 4.8 since we renamed the “strigi service” to “file indexing service”. To make it work in KDE 4.7 and earlier, replace “nepomukfileindexer” with “nepomukstrigiservice” and “FileIndexer” with “Strigi”.
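The renaming between the two versions is easy to get wrong by hand, so here is a small sketch that builds the qdbus command line for either case. The helper name and its parameters are hypothetical. The service, path and interface names are the ones from the desktop file. Treating the first flag as "recursive" follows from the two actions above, while the meaning of the trailing 1 (presumably "force the update") is an assumption.

```python
def index_folder_command(folder, recursive=False, pre_4_8=False):
    """Build the qdbus argument list used by the service menu actions.

    pre_4_8=True switches to the old names used by KDE 4.7 and earlier.
    """
    service = "nepomukstrigiservice" if pre_4_8 else "nepomukfileindexer"
    interface = "Strigi" if pre_4_8 else "FileIndexer"
    return [
        "qdbus",
        f"org.kde.nepomuk.services.{service}",
        f"/{service}",
        f"org.kde.nepomuk.{interface}.indexFolder",
        folder,
        "1" if recursive else "0",  # recurse into sub folders or not
        "1",  # same trailing flag as in the desktop file (assumed: force update)
    ]
```

The resulting list can be handed to subprocess.run on a system where the Nepomuk services are actually running.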

Nepomuk Fundraiser – Badamm (Or Some Other Really Clever and Funny Title I Cannot Think of at the Moment)

It happened. Alf Rustad donated the missing 356€ which broke the magical barrier of 9000€ in the Nepomuk Fundraiser I started nearly three months ago.

While the actual goal – securing long-term funding for Nepomuk – has not been reached yet, this is a great opportunity to thank Alexander, Alvar, Andreas, Andre, Andrew, Angelo, Anton, Antonio-J, Ardy123, arkub, Baltasar, Bernd, Bernhard, Calogero, Carl, Ceferino, Christopher, Christoph, Claude, Cristiano, Daniel, David, dunkelschorsch, Eduard, Efthymia, Elias, the two Enriques, Fabio, Felix, Florian, Francisco, Friedhelm, Fux, Gael, Giacomo, Giorgio, Guillaume, Günter, Hans, Han, Hartmut, Hector, Hendy, Huftis, Jaroslav, Jérôme, Jesus, Josep, Jos, Jramskov, Juan, Juanjo, Junichi, Kai, Kenneth, Kevin, Kilian, Kulomi, Leopoldo, Linopolus, Luca, Luis, Luiz, Maik, Manoel, Manuel, Marco, Marc, the three Markusses and Martins, Maxime, Mguel, the two Michaels, Mikael, Mike, Morgan, Nicolas, Olaf, Olivier, Orestes, the two Pauls, Paulo, the two Peters, Philipp, Pierre-Hugues, Régis, Robert and Robert, Rodrigo, samtuke, the Sebastians, Simone, Sören, Stefano, Steffen, Stian, tanghus, Thiago, Thomas, Thomas, and Thomas, Tiago, Timothy, Tommi, Tuukka, Ulrich, Wakeley, Xavier, Yaroslav, and all the anonymous donors for their support. You have given me time to keep looking.

A special thanks goes to Carl Symons for his great dot article, his many tips and continuous encouragement.

Thank you also to Peter, George, Ivan, Vishesh, Christian, Andrew, Martin, and Laura for their great developer comments on Nepomuk.

And last but not least thanks for all the positive feedback on my blog articles, the translations into strange and exotic languages such as Spanish :P and all the encouraging words which showed how many actually get what the semantic desktop is all about and want Nepomuk to go on and change the way we work with information today.

The Different Places Something Can Go Wrong

This is just a little blog entry about the impact that the ontologies can have on functionality.

The ontologies are a set of vocabularies describing the types of resources stored in Nepomuk, the possible relations between these types, and the possible annotations. We have for example a type for local files, one for an address book entry, one for a person, one for music content and so on. We also have relations that describe that some person is the author of some piece of content and so on.

These ontologies are maintained in the Shared-Desktop-Ontologies project – to my knowledge the only real open-source project developing RDF ontologies.

Now to the actual topic. There once was a bug. Like so many other bugs it talked about file indexing in Nepomuk and like so many other bugs it said that some file could not be indexed. First it was Nepomuk’s fault, then it was the fault of libstreamanalyzer, but in the end I realized: there was a bug in the ontologies. More specifically in NMM – the Nepomuk MultiMedia ontology. (Granted this was not really the source of the hang the bug talks about but it was the reason the file could not be indexed.)

The problem was the domain of the nmm:setSize property. Each property has a domain and a range – the domain defines on which type of resource the property can be set, the range defines the type of the value. In other words they define the subject and object type of the triple. The domain is always a resource type (rdfs:Class), the range a resource or a literal type (typically one defined in the XML schema). In this case the domain of nmm:setSize was set as nmm:MusicPiece whereas it should have been nmm:MusicAlbum. Thus, Nepomuk rejected the data generated by libstreamanalyzer as being invalid due to using an invalid domain. (Update: Nepomuk treats RDF data in a closed-world fashion. In contrast to the open-world approach which is typical for RDF/S, resource types are not inferred from their relations. In an open-world situation the resource would simply end up being both a nmm:MusicPiece and a nmm:MusicAlbum.)
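To make the closed-world domain check a bit more tangible, here is a deliberately tiny sketch in Python. It is not how Nepomuk implements the check: the dictionaries and the function name are made up for illustration, and subclass relations are ignored completely.

```python
# Property -> domain, once with the bug and once as fixed in the ontology.
DOMAINS_BROKEN = {"nmm:setSize": "nmm:MusicPiece"}
DOMAINS_FIXED = {"nmm:setSize": "nmm:MusicAlbum"}


def domain_is_valid(subject_types, prop, domains):
    """Closed-world check: the subject must already have the domain type.

    An open-world reasoner would instead infer the domain type for the
    subject and accept the statement.
    """
    return domains[prop] in subject_types


# libstreamanalyzer (correctly) sets nmm:setSize on an album resource.
album_types = {"nmm:MusicAlbum"}
rejected_before_fix = not domain_is_valid(album_types, "nmm:setSize", DOMAINS_BROKEN)
accepted_after_fix = domain_is_valid(album_types, "nmm:setSize", DOMAINS_FIXED)
```

With the broken domain the statement on the album is rejected, with the fixed domain it passes, which is exactly the difference the ontology fix makes.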

The solution is shared-desktop-ontologies 0.8.1 with the fixed domain. Installing it will make Nepomuk re-parse the changed ontology and indexing the mp3 files in question will finally work.

Well, this was pretty verbose for a rather small issue. Still it gave a little introduction into how the ontologies are used in Nepomuk. One more thing to take care of in the “Nepomuk universe”.

And as always:

Click here to lend your support to: Nepomuk - The semantic desktop on KDE and make a donation at www.pledgie.com !

Update on Bugs And Stuff

There was not that much activity last week. That is simply because I took a few days off to spend time with my family on a farm – mostly so that my daughter could ride a pony every day.

Still, there are a few words I can say regarding Nepomuk bugs: 4 crashes have been reported for KDE 4.7.3. One of them I already fixed, one has a patch which is awaiting testing, one is very confusing and made me contact the mighty Dario for help, and the last one is the Akonadi feeder which still has a memory leak. Apart from that blogging about my inotify usage had a very nice side-effect: Marcel Wiesweg from Digikam wants to use KInotify and stumbled upon a serious bug which for some stupid reason I never caught. That bug means that files with umlauts and friends in their paths never keep their meta-data when moved. Urgh!

So KDE 4.7.4 will again contain a bunch of fixes. The hunt is not over yet. At least now all bugs are nicely categorized which makes triaging much easier.

And in other news: thank you so much for the amazing fundraiser. Only about 800€ to the goal which I feared was reaching for the stars. This has given me the energy to go on and try to find the required funding. No one stepped up so far but I am still hopeful as some are still “in discussion”. Keep your fingers crossed (and send ideas my way).


KDE 4.7.3 – The (First) Nepomuk Stability Release

Now that KDE 4.7.3 has been released let me look back onto the work that I put into it over the last weeks.

  • I fixed four actual crashes in 4.7.3. At first glance this might not sound like much but these four crash fixes cover 38 duplicate reports.
  • I finally managed to close the memory leak in the file watcher service.
  • I significantly improved the file indexer:
    • Exclusion filters are now also correctly taken into account for folders.
    • .xsession-errors is now always excluded from indexing.
    • Rapidly changing files are only indexed once closed. This results in a lot less IO.
    • The previous change also results in torrent downloads being indexed only once they have finished.
    • Files that are written over and over (like IRC logs for example) are only re-indexed once every five seconds.
    • Nepomuk now always extracts the plain text from PDF files via pdftotext. This is a hack to make sure that we can at least search all PDFs by content. The next step will be to extract meta-data like title and author via poppler. (This is required since the PDF analyzer in libstreamanalyzer/Strigi is not powerful enough yet.)
    • Symbolic links now have the correct mime type which means better search results.
    • In case the indexer gets stuck (runs forever on one file) it is killed after a period of time.
    • Jos van den Oever fixed a bunch of issues in libstreamanalyzer (Strigi) which results in fewer crashes and fewer endless PDF indexing runs. Stay tuned for Strigi 0.7.7.
  • With Soprano 2.7.3 Nepomuk will now restart the storage if Virtuoso goes down due to a crash or a third-party kill.
  • A running Virtuoso instance which was not shut down due to a crashed or killed Nepomuk will now gracefully be shut down before starting a new instance. This solves some startup issues.
  • A small query performance improvement based on a pointless UNION.
  • Smit Shah backported his patch which gets rid of the flickering Nepomuk indexer icon in the system tray. It now only becomes active if the indexer has been working for a certain period of time.
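The "kill the indexer if it gets stuck" item from the list above boils down to running the extraction in a separate process with a time limit. A minimal sketch of that pattern in Python follows. It is not the actual Nepomuk code: the function name and the timeout value are made up, and using it with pdftotext is only meant as an example.

```python
import subprocess


def run_with_timeout(command, timeout):
    """Run an external extractor and return its output as text.

    If the process does not finish within `timeout` seconds it is killed
    and None is returned instead of blocking forever on one file.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# Example call (assumes pdftotext is installed; "-" writes the text to stdout):
# text = run_with_timeout(["pdftotext", "paper.pdf", "-"], timeout=60)
```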

All in all 15 bugs are marked with “FIXED-IN: 4.7.3”. This does not include the fixes and improvements I made which did not have matching reports.

Today the next round of Nepomuk stability and performance begins. If all goes well KDE 4.7.4 should be rock-solid when it comes to Nepomuk. Thanks a lot for your continued support. I am still hopeful that I will find a more permanent solution soon:

Click here to donate to Nepomuk via Moneybookers

Memory Leaks in Nepomuk – Nah!

OK – this is the fifth time I start the first sentence. So now I will just write like it comes to mind… backspace, backspace, mark/delete…. Oh, damn it. I just pushed a fix to master and 4.7 which fixes the memory leak in the filewatch service. The root of it was the same as in the file indexer service: no event loop in the work thread means DBus events piling up without ever being garbage collected.

Well, that is fixed now and the filewatch service will not steal all your memory anymore.

I have three days left until KDE 4.7.3 and I want to make them count!


A Word (or Two) on Removable Storage Media Handling in Nepomuk

While fixing existing Nepomuk bugs and trying to close them as they come in I also look into other things. Last week it was the improved file indexer scheduling and file modification handling. This week it is about another improvement in the handling of queries which involve removable media. Ignacio Serantes already found one bug in the URL encoding before. This time he wanted to search through all mounted removable storage media and realized that he could not. I just fixed that. In order to understand how I did that we need to go into detail about how Nepomuk handles removable media.

Removable Storage Media in Nepomuk

Files on removable storage media are a problem when it comes to meta data stored in Nepomuk. As long as the medium is mounted we can simply identify the files through their local file path. But as soon as it is unmounted the paths are no longer valid. To make things worse we could mount the medium at another mount point the next time or mount another medium (which obviously does not contain the files in question) at the same mount point. So we need a way around that problem. Ever since 4.7 Nepomuk has a rather fancy way of doing that.

Internally Nepomuk uses a stack of Soprano::FilterModels which perform several operations on the data that passes through them. One of these models is the RemovableStorageModel. This model does one thing: it converts the local file URLs of files and folders on removable media into mount-path-independent URLs and vice versa. Currently it supports removable disks like USB keys or external hard disks (any storage that has a UUID), optical media, NFS and Samba mounts. The nice thing about it is that this conversion happens transparently to the client. Thus, a client simply uses the local file URLs according to the current mount path and does not care about anything else. It will always get the correct results.

To understand this better we should look at an example. Imagine we have a USB key inserted with UUID “xyz” which is mounted at /media/disk. Now if we add information about a file /media/disk/myfile.txt to Nepomuk the following happens: The RemovableStorageModel will convert the URL file:///media/disk/myfile.txt into filex://xyz/myfile.txt. This is a custom URL scheme which consists of the device UUID and the relative path. When querying the file the model does the conversion in the other direction. So far so simple.
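The conversion in both directions can be sketched in a few lines of Python. This is only an illustration of the URL mapping described above, not the RemovableStorageModel itself: the function names and the dictionary of mounted media are made up, and the real model gets its mount information from the system.

```python
# UUID of the medium -> current mount path (hypothetical example data)
MOUNTS = {"xyz": "/media/disk"}


def to_filex(url, mounts):
    """Convert a local file URL into a mount-path-independent filex URL."""
    if not url.startswith("file://"):
        return url
    path = url[len("file://"):]
    for uuid, mount in mounts.items():
        if path == mount or path.startswith(mount + "/"):
            return f"filex://{uuid}{path[len(mount):]}"
    return url  # not on a removable medium, leave untouched


def from_filex(url, mounts):
    """Convert a filex URL back into a local file URL for the current mount."""
    if not url.startswith("filex://"):
        return url
    uuid, _, relative = url[len("filex://"):].partition("/")
    if uuid not in mounts:
        return url  # medium is not mounted, nothing we can do here
    return "file://" + mounts[uuid] + ("/" + relative if relative else "")
```

With the example data, to_filex("file:///media/disk/myfile.txt", MOUNTS) gives filex://xyz/myfile.txt and from_filex turns it back into the original URL.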

Queries are where it gets a little more complicated. Imagine we want to query all files in a certain directory on the removable medium (ideally the SPARQL would be hidden by the Nepomuk query API). We would then perform a query like the following simplified one.

select ?r where {
  ?r nie:isPartOf ?p . 
  ?p nie:url <file:///media/disk/somefolder> . }

If we passed this query on to Virtuoso we would not get any results since there is no resource with nie:url <file:///media/disk/somefolder>. So the RemovableStorageModel steps in again and does some query tweaking (rather primitive tweaking seeing that we do not have a SPARQL parser in Nepomuk). The query is converted into

select ?r where {
  ?r nie:isPartOf ?p .
  ?p nie:url <filex://xyz/somefolder> . }

And suddenly we get the expected results.

Of course this is still rather simple. It gets more complicated when SPARQL REGEX filters are involved. Imagine we wanted to look for all files in some sub-tree on a removable medium. We would then use a query along the lines of the following:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/disk/somefolder/')) . }

As before passing this query directly on to Virtuoso would not yield any results. The RemovableStorageModel needs to do its magic first:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^filex://xyz/somefolder/')) . }

This is what the model did before Ignacio wanted to query all his removable media mounted somewhere under /media at once. Obviously he did something like:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/')) . }

The result, however, was empty. This is simply because there was no exact match to any mount path of any of the removable media and RemovableStorageModel did not replace anything. The solution was to include additional filters for all the candidates in addition to the already existing filter. We need to keep the existing filter in case there is anything else under /media which is not a removable medium and, thus, has normal local file:/ URLs.

If we imagine that we have an additional mounted removable medium with UUID “foobar” then the query would be converted into something like the following.

select ?r where {
  ?r nie:url ?url .
  FILTER((REGEX(STR(?url), '^file:///media/') ||
          REGEX(STR(?url), '^filex://xyz/') || 
          REGEX(STR(?url), '^filex://foobar/'))) . }

This way we get the expected results. (The additional brackets are necessary in case the filter already contains more than one term.)
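The two cases (filter inside one medium vs. filter above several mount points) can be sketched as follows. Again this is only an illustration in Python and not the code in the RemovableStorageModel: the function names are made up, the mount path /media/disk2 for the second medium is invented, and the sketch works on the bare URL prefix instead of rewriting a complete SPARQL string.

```python
def rewrite_prefix_patterns(prefix_url, mounts):
    """Return the REGEX patterns that replace one '^file://...' prefix filter.

    mounts maps the UUID of each mounted removable medium to its mount path.
    """
    path = prefix_url[len("file://"):]
    # Case 1: the prefix lies on a removable medium -> plain replacement.
    for uuid, mount in mounts.items():
        if path == mount or path.startswith(mount + "/"):
            return [f"^filex://{uuid}{path[len(mount):]}"]
    # Case 2: media are mounted somewhere below the prefix -> keep the
    # original pattern and add one alternative per candidate medium.
    patterns = [f"^{prefix_url}"]
    for uuid, mount in mounts.items():
        if (mount + "/").startswith(path):
            patterns.append(f"^filex://{uuid}/")
    return patterns


def build_filter(patterns, variable="?url"):
    """Join the patterns into one FILTER with the additional brackets."""
    terms = " || ".join(f"REGEX(STR({variable}), '{p}')" for p in patterns)
    return f"FILTER(({terms})) ."
```

For the mounts {"xyz": "/media/disk", "foobar": "/media/disk2"} the prefix file:///media/ yields the three alternatives from the query above, while file:///media/disk/somefolder/ is simply replaced by its filex counterpart.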

Well, I personally think this is a very clean solution where clients only have to consider filex:/ and its friends nfs:/, smb:/, and optical:/ if the media are not mounted. One way of handling that I already drafted a while back. But that will be perfected another day. ;)

For now let me, as always, close with the hint that development like this is still running on your donations.

Taking a Break From Crash Fixing For Usability

Fixing bugs is actually more fun than I thought. It is rewarding and seeing the bug count go down feels great. But after a few weeks of mostly hunting crashes I needed to do something different for a change.

Thus, I went after the file indexer for optimizations. As discussed in the comments of this very blog downloads and rapidly changing files in general have always been a problem – they are indexed way too often. This is a clear waste of resources. In a discussion the very good idea of introducing a delay after which to re-index a changed file was born. This is what I actually did. However, while doing that I found that it could be improved even further:

In addition to file modification events the file system can tell us when a file is closed after having been opened for writing. Thus, Nepomuk now uses that event instead. For downloads that means they will be indexed only a single time: when they are done.

So now Nepomuk will only re-index files that have actually been modified (modification event) and that have been closed (close after write event). And in addition the re-indexing is delayed for 5 seconds to ensure that we do not re-index rapidly changing files all the time.
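The scheduling idea can be sketched like this. It is a simplified model in Python and not the actual file watcher or indexer code: the class and method names are made up, time is passed in explicitly to keep the example easy to follow, and I assume here that the first close event starts the 5 second delay and later events within that window do not extend it.

```python
class ReindexScheduler:
    """Collect close-after-write events and release files after a delay."""

    def __init__(self, delay=5.0):
        self.delay = delay
        self._pending = {}  # path -> time of the first event in the window

    def file_closed_after_write(self, path, now):
        # Later events for a file that is already waiting do not restart
        # the delay, so a constantly written file is still indexed
        # (at most) once per delay window.
        self._pending.setdefault(path, now)

    def due(self, now):
        """Return and forget all files whose delay has passed."""
        ready = [p for p, t in self._pending.items() if now - t >= self.delay]
        for path in ready:
            del self._pending[path]
        return ready
```

An IRC log that is closed after writing at seconds 0 and 3 is thus handed to the indexer once, at second 5, instead of twice.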

All in all this is a great improvement for IO in Nepomuk. Thanks a lot for pointing it out and getting me on the right track. (I even backported this to KDE 4.7.3.)

In other news my former GSoC student Smit Shah added another delay: the nepomukcontroller icon in your system tray will now stop flickering in and out of activity whenever a small file is indexed. Instead it will wait a short while to ensure that some longer indexing operation is in progress. Another nice usability thingi that he will also backport to 4.7.3.

And now it is back to crash fixing for me. :)

In the meantime let me mention again that I am still looking for Nepomuk funding. So far no company has given a positive answer (funnily enough I did not get a negative yet either). I am still interested in your proposals and as always your support (which has been amazing – thank you so much).