Something Dry: Change Notifications

Ignoring the fact that I have not blogged in nearly two months, I will simply get some developer information out there. Getting notified about changes in the Nepomuk database has always been a problem. For a long time all we had were the ugly statementAdded and statementRemoved signals from Soprano which, when actually used, would slow down the whole system since one had to check every single statement for the information one needed.

Thus, with the introduction of the Data Management Service a while back we also gave birth to the ResourceWatcher which can be used to watch resources, properties, and types for changes. The concept is simple. Just create an instance of the watcher and tell it which resources or which types of resources you want to watch for changes. In addition you can restrict it to specific properties. Then you get nice signals which inform you about the changes when they happen.

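// Watch all resources of type nco:Contact for changes.
Nepomuk::ResourceWatcher *watcher = new Nepomuk::ResourceWatcher(this);
watcher->addType(NCO::Contact());
// Get notified whenever a new contact resource is created.
connect(watcher, SIGNAL(resourceCreated(Nepomuk::Resource, QList<QUrl>)),
        this, SLOT(slotCreated(Nepomuk::Resource, QList<QUrl>)));
// Start watching.
watcher->start();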

The problem with this has been that it only works with data manipulation which happens through the Data Management Service and libnepomuk did not use that for a long time. Now we finally fixed that (sadly I did not manage to push it in time for 4.8 but it will be in 4.8.1) and the change notifications become really useful. I also implemented a bunch of unit tests and made sure the most important types of notifications actually work.

So all in all an important step for developers using Nepomuk which was overdue.

Symbolic Links in Nepomuk – A Solution

Until now symbolic links were not handled in Nepomuk. Today I committed the last patch for the new symlink support in Nepomuk. The solution I chose is not the theoretically perfect one. That would have taken way too much effort while introducing all kinds of possible bugs, regressions, API incompatibilities, and so on. But the solution is nice and clean and simple.

Essentially each direct symlink is indexed as a separate file using the content of its target file. (This is necessary since a direct symlink might have a different file name than the target file.) More interesting are the indirect symlinks. Indirect symlinks are files in a folder which is itself a symlink to another folder. An example:

/home/trueg/
|-- subdir/
   |-- thefile.txt
|-- link/ -> subdir/
   |-- thefile.txt

Here I have a folder “subdir” which contains a file “thefile.txt”. The folder “link” is a direct symlink to “subdir” whereas “link/thefile.txt” is an indirect symlink to “subdir/thefile.txt”.

Indirect symlinks are simply stored as alternative URLs on the target file resources using the kext:altUrl property. (The property is not defined in NIE since it is not theoretically sound with respect to the design of NIE. It needs to be considered a beautiful hack.)

The only situation in which the alternative URLs are actually needed is when searching in a specific folder. Imagine searching in “/home/trueg/link” only. Since there are no nie:url values which match that prefix we need to search the kext:altUrls, too.
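To picture the folder-restricted lookup, here is a minimal stand-alone C++ sketch. It is not Nepomuk code: FileResource, isInFolder, and searchInFolder are hypothetical names made up for illustration, and the real search runs as a query against nie:url and kext:altUrl in the database, not over an in-memory list.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a file resource: one canonical
// nie:url plus any number of kext:altUrl values for indirect symlinks.
struct FileResource {
    std::string url;
    std::vector<std::string> altUrls;
};

// True if 'path' lies inside 'folder'. This is a plain prefix test on the
// folder path after making sure it ends with '/'.
bool isInFolder(const std::string &path, std::string folder)
{
    if (folder.empty() || folder.back() != '/')
        folder += '/';
    return path.compare(0, folder.size(), folder) == 0;
}

// Returns the URLs under which files show up inside 'folder': the canonical
// URL if it matches, otherwise every matching alternative URL.
std::vector<std::string> searchInFolder(const std::vector<FileResource> &files,
                                        const std::string &folder)
{
    std::vector<std::string> hits;
    for (const FileResource &file : files) {
        if (isInFolder(file.url, folder)) {
            hits.push_back(file.url);
            continue;
        }
        for (const std::string &alt : file.altUrls) {
            if (isInFolder(alt, folder))
                hits.push_back(alt);
        }
    }
    return hits;
}
```

With the example tree from above, a search in “/home/trueg/link” finds the file only through its alternative URL, while a search in “/home/trueg/subdir” matches the canonical URL directly.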

The result of all this is that nearly no additional space is required except for the kext:altUrl properties, files are not indexed more than once, and files in symlinked folders are found in addition to “normal” files.

In my tests everything seems to work nicely but I urge you to test the nepomuk/symlinkHandling branches in kdelibs and kde-runtime and report any problems back to me. The more testing I get the quicker I can merge both into KDE 4.8.

Lastly, the pledgie campaign is done but the search for funds goes on.

Finding Duplicate Images Made Easy

It is a typical problem: we downloaded images from a camera, maybe did not delete them from the camera instantly, then downloaded the same images again next time, maybe created an album by copying images into sub-folders (without Nepomuk Digikam can only do so much ;), and so on. Essentially there are a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  ?f1 nfo:hasHash ?h .
  ?f2 nfo:hasHash ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all pairs of nexif:Photo resources which have the same hash value but are not the same resource. This of course can be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that the photo was actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, with the introduction of the Data Management Service in KDE 4.7 similar hash resources are merged into one. Thus, the slightly simpler query above. Still, to be sure to also properly handle pre-KDE-4.7 data the above addition might be prudent.

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)

Manually Forcing the (Re-)Indexing of Folders is Easy

Ever since the unicode bug in Virtuoso 6.1.3 many of us have broken unicode strings in our Nepomuk databases. Completely re-creating the database is IMHO not an option since that would mean losing all manual annotations and things like download source URLs. One solution would be restoring a backup but I simply do not trust the Nepomuk backup until I have had a deeper look into it. The perfect solution would be if Nepomuk could simply fix the data automatically. While that is of course my goal and I am looking into it, it will take a while.

In the meantime I threw together a small desktop file which adds two new actions to the context menu of folders.

  1. (Re-)index Folder contents will make the indexer update all the files in the folder regardless of their state in Nepomuk. This includes fixing broken unicode strings.
  2. (Re-)index Folder contents recursive does the same as the above except that it also recurses into sub folders.

Simply put the following into a file called “nepomuk-index-folder.desktop” and save it in “~/.kde/share/kde4/services/ServiceMenus”. At the next start of Dolphin or Konqueror the two new actions will be available.

[Desktop Entry]
Type=Service
X-KDE-ServiceTypes=KonqPopupMenu/Plugin,inode/directory
Actions=indexFolder;indexFolderRecursive;
X-KDE-Submenu=Desktop Search
Icon=nepomuk

[Desktop Action indexFolder]
Name=(Re-)index Folder contents
Icon=nepomuk
Exec=qdbus org.kde.nepomuk.services.nepomukfileindexer /nepomukfileindexer org.kde.nepomuk.FileIndexer.indexFolder %f 0 1

[Desktop Action indexFolderRecursive]
Name=(Re-)index Folder contents recursive
Icon=nepomuk
Exec=qdbus org.kde.nepomuk.services.nepomukfileindexer /nepomukfileindexer org.kde.nepomuk.FileIndexer.indexFolder %f 1 1

Update: The code above only works in KDE 4.8 since we renamed the “strigi service” to “file indexing service”. To make it work in KDE 4.7 and earlier, replace “nepomukfileindexer” with “nepomukstrigiservice” and “FileIndexer” with “Strigi”.

Nepomuk Fundraiser – Badamm (Or Some Other Really Clever and Funny Title I Cannot Think of at the Moment)

It happened. Alf Rustad donated the missing 356€ which broke the magical barrier of 9000€ in the Nepomuk Fundraiser I started nearly three months ago.

While the actual goal – securing long-term funding for Nepomuk – has not been reached yet this is a great opportunity to thank Alexander, Alvar, Andreas, Andre, Andrew, Angelo, Anton, Antonio-J, Ardy123, arkub, Baltasar, Bernd, Bernhard, Calogero, Carl, Ceferino, Christopher, Christoph, Claude, Cristiano, Daniel, David, dunkelschorsch, Eduard, Efthymia, Elias, the two Enriques, Fabio, Felix, Florian, Francisco, Friedhelm, Fux, Gael, Giacomo, Giorgio, Guillaume, Günter, Hans, Han, Hartmut, Hector, Hendy, Huftis, Jaroslav, Jérôme, Jesus, Josep, Jos, Jramskov, Juan, Juanjo, Junichi, Kai, Kenneth, Kevin, Kilian, Kulomi, Leopoldo, Linopolus, Luca, Luis, Luiz, Maik, Manoel, Manuel, Marco, Marc, the three Markusses and Martins, Maxime, Mguel, the two Michaels, Mikael, Mike, Morgan, Nicolas, Olaf, Olivier, Orestes, the two Pauls, Paulo, the two Peters, Philipp, Pierre-Hugues, Régis, Robert and Robert, Rodrigo, samtuke, the Sebastians, Simone, Sören, Stefano, Steffen, Stian, tanghus, Thiago, Thomas, Thomas, and Thomas, Tiago, Timothy, Tommi, Tuukka, Ulrich, Wakeley, Xavier, Yaroslav, and all the anonymous donors for their support. You have given me time to keep looking.

A special thanks goes to Carl Symons for his great dot article, his many tips and continuous encouragement.

Thank you also to Peter, George, Ivan, Vishesh, Christian, Andrew, Martin, and Laura for their great developer comments on Nepomuk.

And last but not least thanks for all the positive feedback on my blog articles, the translations into strange and exotic languages such as Spanish :P and all the encouraging words which showed how many actually get what the semantic desktop is all about and want Nepomuk to go on and change the way we work with information today.

The Different Places Something Can Go Wrong

This is just a little blog entry about the impact that the ontologies can have on functionality.

The ontologies are a set of vocabularies describing the types of resources stored in Nepomuk, the possible relations between these types, and the possible annotations. We have for example a type for local files, one for an address book entry, one for a person, one for music content and so on. We also have relations that describe that some person is the author of some piece of content and so on.

These ontologies are maintained in the Shared-Desktop-Ontologies project – to my knowledge the only real open-source project developing RDF ontologies.

Now to the actual topic. There once was a bug. Like so many other bugs it talked about file indexing in Nepomuk and like so many other bugs it said that some file could not be indexed. First it was Nepomuk’s fault, then it was the fault of libstreamanalyzer, but in the end I realized: there was a bug in the ontologies. More specifically in NMM – the Nepomuk MultiMedia ontology. (Granted this was not really the source of the hang the bug talks about but it was the reason the file could not be indexed.)

The problem was the domain of the nmm:setSize property. Each property has a domain and a range: the domain defines on which type of resource the property can be set, the range defines the type of the value. In other words they define the subject and object type of the triple. The domain is always a resource type (rdfs:Class), the range a resource or a literal type (typically one defined in the XML schema). In this case the domain of nmm:setSize was set to nmm:MusicPiece whereas it should have been nmm:MusicAlbum. Thus, Nepomuk rejected the data generated by libstreamanalyzer as invalid because the subject did not have the required domain type. (Update: Nepomuk treats RDF data in a closed-world fashion. In contrast to the open-world approach which is typical for RDF/S, resource types are not inferred from their relations. In an open-world situation the resource would simply end up being both a nmm:MusicPiece and a nmm:MusicAlbum.)
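The difference between the two approaches can be sketched in a few lines of stand-alone C++. This is only an illustration under assumptions: Ontology, acceptsStatement, and inferDomainType are hypothetical names, the sketch ignores ranges and subclass relations, and rejecting unknown properties is a choice made here, not something taken from Nepomuk's real validation code.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical miniature ontology: maps a property URI to the class URI
// that the subject must have (the property's rdfs:domain).
struct Ontology {
    std::map<std::string, std::string> domains;
};

// Closed world: a statement is accepted only if the subject is already
// known to have the property's domain type. Nothing is inferred.
bool acceptsStatement(const Ontology &ontology,
                      const std::set<std::string> &subjectTypes,
                      const std::string &property)
{
    const auto it = ontology.domains.find(property);
    if (it == ontology.domains.end())
        return false; // assumption of this sketch: unknown properties are rejected
    return subjectTypes.count(it->second) > 0;
}

// Open world: the domain type is inferred from the use of the property and
// added to the subject's types, so the statement is never rejected.
void inferDomainType(const Ontology &ontology,
                     std::set<std::string> &subjectTypes,
                     const std::string &property)
{
    const auto it = ontology.domains.find(property);
    if (it != ontology.domains.end())
        subjectTypes.insert(it->second);
}
```

With the domain of nmm:setSize wrongly set to nmm:MusicPiece, acceptsStatement refuses the property on a resource typed only as nmm:MusicAlbum, whereas inferDomainType would quietly turn that resource into both an album and a music piece.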

The solution is shared-desktop-ontologies 0.8.1 with the fixed domain. Installing it will make Nepomuk re-parse the changed ontology and indexing the mp3 files in question will finally work.

Well, this was pretty verbose for a rather small issue. Still it gave a little introduction into how the ontologies are used in Nepomuk. One more thing to take care of in the “Nepomuk universe”.

And as always:

Click here to lend your support to: Nepomuk - The semantic desktop on KDE and make a donation at www.pledgie.com !

Update on Bugs And Stuff

There was not that much activity last week. That is simply because I took a few days off to spend time with my family on a farm – mostly so that my daughter could ride a pony every day.

Still, there are a few words I can say regarding Nepomuk bugs: four crashes have been reported for KDE 4.7.3. One of them I already fixed, one has a patch which is awaiting testing, one is very confusing and made me contact the mighty Dario for help, and the last one is the akonadi feeder which still has a memory leak. Apart from that, blogging about my inotify usage had a very nice side-effect: Marcel Wiesweg from Digikam wants to use KInotify and stumbled upon a serious bug which for some stupid reason I never caught. That bug means that files with umlauts and friends in their paths never keep their meta-data when moved. Urgh!

So KDE 4.7.4 will again contain a bunch of fixes. The hunt is not over yet. At least now all bugs are nicely categorized which makes triaging much easier.

And in other news: thank you so much for the amazing fundraiser. Only about 800€ remain to the goal which I feared was reaching for the stars. This has given me the energy to go on and try to find the required funding. No one has stepped up so far but I am still hopeful as some are still “in discussion”. Keep your fingers crossed (and send ideas my way).


KDE 4.7.3 – The (First) Nepomuk Stability Release

Now that KDE 4.7.3 has been released let me look back onto the work that I put into it over the last weeks.

  • I fixed four actual crashes in 4.7.3. At first glance this might not sound like much but these four crash fixes entail 38 duplicates.
  • I finally managed to close the memory leak in the file watcher service.
  • I significantly improved the file indexer:
    • Exclusion filters are now also correctly taken into account for folders.
    • .xsession-errors is now always excluded from indexing.
    • Rapidly changing files are only indexed once closed. This results in a lot less IO.
    • The previous change also results in torrent downloads only being indexed once they have finished.
    • Files that are written over and over (like IRC logs for example) are only re-indexed once every five seconds.
    • Nepomuk now always extracts the plain text from PDF files via pdftotext. This is a hack to make sure that we can at least search all PDFs by content. The next step will be to extract meta-data like title and author via poppler. (This is required since the PDF analyzer in libstreamanalyzer/Strigi is not powerful enough yet.)
    • Symbolic links now have the correct mime type which means better search results.
    • In case the indexer gets stuck (runs forever on one file) it is killed after a period of time.
    • Jos van den Oever fixed a bunch of issues in libstreamanalyzer (Strigi) which results in fewer crashes and less endless PDF indexing. Stay tuned for Strigi 0.7.7.
  • With Soprano 2.7.3 Nepomuk will now restart the storage if Virtuoso goes down due to a crash or a third-party kill.
  • A running Virtuoso instance which was not shut down due to a crashed or killed Nepomuk will now gracefully be shut down before starting a new instance. This solves some startup issues.
  • A small query performance improvement based on a pointless UNION.
  • Smit Shah backported his patch which gets rid of the flickering Nepomuk indexer icon in the system tray. It now only becomes active if the indexer has been working for a certain period of time.
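The rule from the list above that files written over and over are only re-indexed once every five seconds can be pictured with a tiny throttle. This is a hedged sketch, not the indexer's actual implementation: ReindexThrottle and shouldIndex are hypothetical names, and only the five-second interval comes from the text.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical throttle: remembers when each path was last indexed and
// refuses another run for the same path within the configured interval.
class ReindexThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit ReindexThrottle(std::chrono::seconds interval = std::chrono::seconds(5))
        : m_interval(interval) {}

    // Returns true if 'path' may be indexed at time 'now' and records the run.
    // Returns false if the path was already indexed less than one interval ago.
    bool shouldIndex(const std::string &path, Clock::time_point now = Clock::now())
    {
        const auto it = m_lastIndexed.find(path);
        if (it != m_lastIndexed.end() && now - it->second < m_interval)
            return false; // changed again too soon, skip this event
        m_lastIndexed[path] = now;
        return true;
    }

private:
    std::chrono::seconds m_interval;
    std::map<std::string, Clock::time_point> m_lastIndexed;
};
```

A caller would ask shouldIndex() for every change event and only hand the file to the indexer when it returns true. A real service would additionally have to schedule a delayed run for skipped events so that the last write to a file is not missed; the sketch leaves that out.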

All in all 15 bugs are marked with “FIXED-IN: 4.7.3”. This does not include the fixes and improvements I made which did not have matching reports.

Today the next round of Nepomuk stability and performance begins. If all goes well KDE 4.7.4 should be rock-solid when it comes to Nepomuk. Thanks a lot for your continued support. I am still hopeful that I will find a more permanent solution soon.

Memory Leaks in Nepomuk – Nah!

OK – this is the fifth time I start the first sentence. So now I will just write like it comes to mind… backspace, backspace, mark/delete…. Oh, damn it. I just pushed a fix to master and 4.7 which fixes the memory leak in the filewatch service. The root of it was the same as in the file indexer service: no event loop in the work thread means DBus events piling up without ever being garbage collected.

Well, that is fixed now and the filewatch service will not steal all your memory anymore.

I have three days left until KDE 4.7.3 and I want to make them count!
