Files on Removable Media – Step 1

So far Nepomuk only handled annotations for local files and did not care about mount points and the like. With KDE SC 4.4 that is about to change. (What I present here today is the first step in supporting removable media like USB keys or external hard drives. The next step will have to wait until 4.5.)

As always my blog will have two parts: the user visible one and the technical one which discusses implementation details.

Imagine you had a USB key with a file on it which you annotated, say with a rating of 6:

As long as the key is mounted searching for files with a rating of 6 will always return our image the way we know it:

But now we unmount the key. Thus, the file will not be accessible anymore. But the file still shows up in the search:

However, there is a slight change in the name: it now contains a hint at the USB key (this hint did not make it into 4.4). Opening the file still works since the key is automatically mounted.

But what happens if we remove the key completely, so that auto-mounting is not an option anymore? The search still returns the file, but trying to open it gives us an error. Sadly, Gwenview does not handle KIO errors properly, so we do not get an error message. In theory we would get the message “Please insert the removable medium ‘1,9 GiB Removable Media’ to access this file.” Just to show that I am not lying let me present the Okular error dialog (which could use some improvement, too):

Here you see the whole ugly Nepomuk query URL and, at the bottom, the unformatted error message. Hopefully we will have worked out these problems in KDE SC 4.5.

So if we saw this error message we would know where to find the file when we want to access it (well, giving proper names to USB keys would help, too).

This is already pretty nice. Step 2 will be to export the annotations to the removable storage and sync them again as soon as the medium is mounted (remember how I blogged about my experiments with that already? Well, as you can see, I did not get that finished yet).

The Technical Part

OK, now we know how it looks to the user (or how it should look). Let us have a look under the hood. Basically three players are involved in this process:

The removable storage service

The removable storage service (code in kdebase) uses the great power of Solid to act on newly inserted, mounted, and unmounted removable storage devices. The simple part is that it tells the Strigi service to index the files on the device on mount.
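In a nutshell the service hooks into Solid roughly like this (a simplified sketch of the idea, not the actual service code; the class name is made up and the slot bodies are only hints):

#include <QObject>
#include <Solid/DeviceNotifier>
#include <Solid/Device>
#include <Solid/StorageAccess>

class RemovableStorageService : public QObject
{
    Q_OBJECT
public:
    RemovableStorageService( QObject* parent = 0 )
        : QObject( parent ) {
        connect( Solid::DeviceNotifier::instance(),
                 SIGNAL(deviceAdded(QString)),
                 this, SLOT(slotDeviceAdded(QString)) );
    }

private Q_SLOTS:
    void slotDeviceAdded( const QString& udi ) {
        Solid::Device device( udi );
        // if the new device is mountable storage, watch its mount state
        if ( Solid::StorageAccess* storage = device.as<Solid::StorageAccess>() ) {
            connect( storage, SIGNAL(accessibilityChanged(bool,QString)),
                     this, SLOT(slotAccessibilityChanged(bool,QString)) );
        }
    }

    void slotAccessibilityChanged( bool accessible, const QString& udi ) {
        if ( accessible ) {
            // mounted: tell the Strigi service to index the files on the device
        }
        else {
            // unmounted: convert the file URLs to relative filex:/ ones (see below)
        }
    }
};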

More interesting, however, is what happens on unmount. The service converts all absolute URLs of files on the unmounted device to relative ones. These relative URLs use the filex:/ scheme I made up and consist of two parts: the UUID of the storage device and the relative path. In our example above the URL is filex://fc30-3da9/thepic.JPG. In addition, a new nfo:Filesystem resource is created which stores the description and the UUID of the unmounted device. The files are then related to this new nfo:Filesystem resource via nie:isPartOf.
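The conversion itself is straightforward. A minimal sketch of the idea (not the actual service code; mountPath and uuid are assumed to come from Solid):

#include <KUrl>

// Sketch: build filex://<uuid>/<relative-path> from an absolute local URL,
// e.g. /media/usbkey/thepic.JPG -> filex://fc30-3da9/thepic.JPG
KUrl toRelativeUrl( const KUrl& localUrl, const QString& mountPath, const QString& uuid )
{
    KUrl url;
    url.setScheme( QLatin1String( "filex" ) );
    url.setHost( uuid );
    // keep the leading slash of the path relative to the mount point
    url.setPath( localUrl.path().mid( mountPath.length() ) );
    return url;
}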

libnepomuk

Nepomuk::Resource can now transparently handle relative filex:/ URLs. Thus, annotating the file from the example after remounting will store the annotations with the same resource, even though that resource now uses a filex:/ URL.
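In code that means something as simple as this (a tiny sketch using the example file and rating from above):

// annotating via the relative URL ends up at the same resource as before
Nepomuk::Resource file( KUrl( "filex://fc30-3da9/thepic.JPG" ) );
file.setRating( 6 );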

The nepomuk:/ KIO slave

The nepomuk:/ KIO slave (code in kdebase) does the rest of the work. The nepomuksearch:/ KIO slave creates the virtual query folders but uses the nepomuk:/ KIO slave to stat all resources (at least the ones with a nepomuk:/ scheme URI).

So as soon as a relative filex:/ URL is encountered it is converted to a local URL if possible:

// Find the Solid storage device for the given filesystem UUID.
Solid::StorageAccess* storageFromUUID( const QString& uuid ) {
    QString solidQuery = QString::fromLatin1( "[ StorageVolume.usage=='FileSystem' AND StorageVolume.uuid=='%1' ]" ).arg( uuid.toLower() );
    QList<Solid::Device> devices = Solid::Device::listFromQuery( solidQuery );
    if ( !devices.isEmpty() )
        return devices.first().as<Solid::StorageAccess>();
    else
        return 0;
}

// Resolve a filex:/ URL to a local file URL, optionally mounting the
// medium first. Returns an empty KUrl if the medium is not available.
KUrl convertRemovableMediaFileUrl( const KUrl& url, bool evenMountIfNecessary = false ) {
    Solid::StorageAccess* storage = storageFromUUID( url.host() );
    if ( storage &&
         ( storage->isAccessible() ||
           ( evenMountIfNecessary && mountAndWait( storage ) ) ) ) {
        // url.path() already starts with a slash
        return KUrl( storage->filePath() + url.path() );
    }
    else {
        return KUrl();
    }
}

And here you can already see the auto-mounting code being called. (I do not show it here since this is enough to read already. If you are interested have a look at the full source code.) The converted URL is then simply passed to KIO::ForwardingSlaveBase which handles the rest. In case the URL cannot be converted (the medium is not mounted and auto-mounting is not used) all information is read from the Nepomuk database to create a proper KIO::UDSEntry.
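Sketched, that Nepomuk-side stat looks something like this (a rough sketch under the assumption that the needed metadata is in the database; the real code fills in more fields):

#include <sys/stat.h>
#include <kio/udsentry.h>
#include <Nepomuk/Resource>

// Build a directory entry from the Nepomuk database when the file
// itself is not accessible.
KIO::UDSEntry statFromNepomuk( const Nepomuk::Resource& res )
{
    KIO::UDSEntry uds;
    uds.insert( KIO::UDSEntry::UDS_NAME, res.genericLabel() );
    uds.insert( KIO::UDSEntry::UDS_FILE_TYPE, S_IFREG );
    // mimetype, size, and friends can be added from stored properties
    return uds;
}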

Virtuoso – for real!

[Screenshot: Nepomuk reporting the used backend as Virtuoso]

Soprano 2.3.63 – that is the magic version number you need to look out for.

And then, once you have updated your kdebase copy to the latest trunk, you run your favorite text editor on ~/.kde/share/config/nepomukserverrc. In there you set Soprano Backend=virtuosobackend in the [Basic Settings] section. After that you simply restart Nepomuk as described in the corresponding howto. You can also log out and log back in, but then you won’t be able to provide as nice bug reports.
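The relevant part of the file then looks like this:

[Basic Settings]
Soprano Backend=virtuosobackend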

Once done, Nepomuk will convert your database. This can take a loooong time if Strigi is enabled. But it will finish. :)

BTW: You need a recent snapshot of Virtuoso 5.0.12 for this to work.

Fixing Bugs is Fun

Yes, sometimes it is. And sometimes it is a good thing that David Faure does not answer your pings, because it makes you write test cases. And sometimes these test cases actually reveal the bug you have been hunting for months. And sometimes searching for the bug makes you refactor and simplify code in the process. This is exactly what happened with the annoying “reload bug” of the Nepomuk query KIO slave. It was responsible for results sometimes not showing up before hitting F5 a few times. Well, that is history. The present brings a better design using a QWaitCondition instead of a local event loop (which was ugly anyway, and I have no idea what made me use it in the first place), which as a side effect also fixes the bug and simplifies the code. (And I mean “simplify”, not “make it simple”. The code is still far from simple.)
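For the curious, the pattern is roughly this (a simplified sketch with results reduced to plain strings; not the actual slave code):

#include <QMutex>
#include <QMutexLocker>
#include <QWaitCondition>
#include <QQueue>
#include <QString>

// The searcher thread appends results; the KIO slave thread blocks on
// the wait condition until something arrives - no nested event loop.
class ResultQueue
{
public:
    void append( const QString& result ) {
        QMutexLocker locker( &m_mutex );
        m_queue.enqueue( result );
        m_haveResult.wakeAll();
    }

    QString waitForResult() {
        QMutexLocker locker( &m_mutex );
        while ( m_queue.isEmpty() )
            m_haveResult.wait( &m_mutex );
        return m_queue.dequeue();
    }

private:
    QMutex m_mutex;
    QWaitCondition m_haveResult;
    QQueue<QString> m_queue;
};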

That’s it already. Just wanted to share that. More search goodies to come when I blog about Adam’s GSoC work.


What Nepomuk Can do and How You Should Use it (as a Developer)

Nepomuk has been around for quite a while but the functionality exposed in KDE 4.3 is still not that impressive. This does not mean that there is no cool stuff. It only means that there is not enough developer power to get it all stable and integrated perfectly. Let me give you an overview of what already exists in playground, how it can be used, and how you should use it.

The Basics

For starters there is the Nepomuk API in kdelibs which you should get familiar with. Most importantly (we will use it quite a lot later on) there is Nepomuk::Resource, which gives access to arbitrary resources in Nepomuk.

Nepomuk::Resource file( myFilePath );
file.addTag( Nepomuk::Tag( "Fancy stuff" ) );
QString desc = file.description();
QList<Nepomuk::Tag> allTags = Nepomuk::Tag::allTags();

Resource allows simple manipulation of data in Nepomuk. Using some fancy cmake magic through the new NepomukAddOntologyClasses macro in kdelibs, data manipulation gets even simpler. The second basic thing you should get familiar with is Soprano and SPARQL. As a quickstart, the following code shows how I typically create queries using Soprano:

using namespace Soprano;

Model* model = Nepomuk::ResourceManager::instance()->mainModel();
QString query = QString( "prefix nao:%1 "
                         "select ?r where { "
                         "%2 nao:hasTag ?t . "
                         "?r nao:hasTag ?t . }" )
        .arg(Node::resourceToN3(Vocabulary::NAO::naoNamespace()))
        .arg(Node::resourceToN3(file.resourceUri()));
QueryResultIterator it
        = model->executeQuery( query, Query::QueryLanguageSparql );

As you can see there is always a lot of QString::arg involved to prevent hard-coding of URIs (again Soprano provides some cmake magic for generating Vocabulary namespaces).

These are the basics. Without these basics you cannot use Nepomuk.

Debugging Nepomuk Data

Now before we dive into the unstable, experimental, and really cool stuff let me mention sopranocmd.

sopranocmd is a command line tool that comes with Soprano and allows you to perform virtually any operation possible on the Nepomuk RDF database. It has an exhaustive help output and you should use it to debug your data, test your queries, and the like (if anyone is interested in creating a graphical version, please step up).

The Nepomuk database (hosting only a single Soprano model called “main”) can be accessed through D-Bus as follows:

sopranocmd --dbus org.kde.NepomukStorage --model main \
      query "select ?r where { ?r ?p ?o . }"

The Good Stuff

There is quite a lot of experimental stuff in the playground but I want to focus on the annotation framework and Scribo.

The central idea of the annotation framework is the annotation suggestion which is encapsulated in the Annotation class (hint: run “make apidox” in the annotationplugin folder). Instead of the user manually annotating resources (adding tags or relating things to other things) the system proposes annotations which the user then simply acknowledges or discards. These Annotation instances are normally created by AnnotationPlugin instances (although it is perfectly possible to create them some other way) which are triggered through an AnnotationRequest.

Before I continue a short piece of code for the impatient:

Resource res = getResource();

AnnotationPluginWrapper* wrapper = new AnnotationPluginWrapper();
wrapper->setPlugins( AnnotationPluginFactory::instance()
   ->getPluginsSupportingAnnotationForResource( res.resourceUri() ) );
connect( wrapper, SIGNAL(newAnnotation(Nepomuk::Annotation*)),
         this, SLOT(addNewAnnotation(Nepomuk::Annotation*)) );
connect( wrapper, SIGNAL(finished()),
         this, SLOT(slotFinished()) );

AnnotationRequest req;
req.setResource( res );
req.setFilter( filter );
wrapper->getPossibleAnnotations( req );

The AnnotationPluginWrapper is just a convenience class which prevents us from connecting to each plugin separately. It reproduces the same signals the plugins emit.

The interesting part is the AnnotationRequest. At the moment it has three parameters, all of which are optional (the framework is under development, which also means that your ideas, patches, and even refactoring actions are very welcome):

  1. A resource – The resource for which the annotation should be created. This parameter is a bit tricky since the Annotation::create method allows creating an annotation on an arbitrary resource, but in some cases it makes perfect sense to create annotation suggestions for one specific resource only.
  2. A filter string – A filter is supposed to be a short string entered by the user which triggers an auto-completion via annotations. Plugins should also take the resource into account if it is set.
  3. A text – An arbitrary long text which is to be analyzed by plugins. Plugins would typically extract keywords or concepts from it. Plugins should also take resource and filter into account if possible. This is where the Scribo system comes in (more later).

Plugins that I already created include very simple ones like the tag plugin, which matches the filter to existing tag names and excludes tags already set on the resource. Way more interesting are plugins like the pimotype plugin, which matches the filter to PIMO types and proposes to use that type, or the pimo relation plugin, which allows creating relations via a very simple syntax: “author:trueg”. The latter will match author to existing properties and trueg to a value based on the property range. The geonames annotation plugin goes one step further: it matches the filter or the resource label to cities or countries using the geonames web service. It will then propose to set a location or (in case the resource label was matched) to convert the resource into a city or country linking to the geonames resource.

A picture says more than a thousand words. Thus, here goes:

[Screenshot: annotation suggestions for the text “Paris”]

What do we see here? The user entered the text Paris in the AnnotationWidget (a class available in the framework) and the framework then created a set of suggested annotations. The most likely one is Paris, the city in France, as suggested by the geonames plugin. The latter also proposes a few not so likely places. The pimotype plugin proposes to create a new type named Paris and the tag plugin proposes to create a new tag named Paris. Here I see room for improvement: if we can relate to the city Paris there is no need for the tag. Thus, some more sophisticated rating and comparison may be in order.

Now let us bring Scribo into play. Scribo is another framework in the playground which provides an API for text analysis and keyword extraction. It is tied into the annotation framework through a dedicated plugin which uses the TextAnnotation class to create annotations at specific text positions. The TextAnnotation class is supposed to be used to annotate text documents. It will create a new nfo:TextDocument and relate it to the main document via nie:isPartOf. Then the new resource is annotated according to the implementation.

The Scribo framework will extract keywords and entities from the text (specified via the AnnotationRequest text field) via plugins which will then be used to create annotation suggestions. There currently exist three plugins for Scribo: the datetime plugin extracts dates and times, the pimo plugin matches words in the text to things in the Nepomuk database, and the OpenCalais plugin will use the OpenCalais webservice to extract entities from the text.

You can try the Scribo framework by using the scriboshell which can be found in the playground, too:

[Screenshot: the scriboshell]

Paste the text to analyze in the left view and press the “Start” button. The right panel will then show all found entities and keywords including the text position and relevance.

The other possibility is to directly use the resourceeditor, which is part of the annotation framework and bundles all the GUI elements the latter has to offer in one widget. Call it on a text file and you will get a window similar to the following:

[Screenshot: the resourceeditor]

At the top you have the typical things: editable label and description, the rating, and the tags. Below that you have the existing properties and annotations. In the picture these are only properties extracted by Strigi. Then comes the interesting part: the suggestions. Here you can see three different Scribo plugins in action. First, the pimo plugin matched the word “Brein” to an event I already had in my Nepomuk database. Then there is the OpenCalais plugin which extracted the “Commission of European Communities” (so far the plugin ignores the additional semantic information provided by OpenCalais) and proposes to tag the text with it.

The last suggested annotation that we can see is “Create Event”. This is a very interesting hack I did. The Scribo plugin detected the mention of a project, a date, and persons and thus proposes to create an event which has the project as its topic and takes place at the extracted time. Since it is a hack created specifically for a demo, its results will not be very great in many situations. But it shows the direction I would like to take.

Below the suggestions you can see the AnnotationWidget again which allows to manually annotate the file.

How to Write an AnnotationPlugin

This is a Howto in three sentences: Derive from AnnotationPlugin and implement doGetPossibleAnnotations. In that method trigger the creation of annotations. Your annotations can be instances of SimpleAnnotation or be based on Annotation and implement at least doCreate, exists, and equals.

class MyAnnotationPlugin : public Nepomuk::AnnotationPlugin
{
public:
    MyAnnotationPlugin(QObject* parent, const QVariantList&);
protected:
    void doGetPossibleAnnotations(const Nepomuk::AnnotationRequest&);
};

void MyAnnotationPlugin::doGetPossibleAnnotations(
      const Nepomuk::AnnotationRequest& request
)
{
    // MyFancyAnnotation can do all sorts of crazy things like creating
    // whole graphs of data or even opening another GUI
    addNewAnnotation(new MyFancyAnnotation(request));

    // SimpleAnnotation can be used to create simple key/value pairs
    Nepomuk::Types::Property property(Soprano::Vocabulary::NAO::prefLabel());
    Nepomuk::SimpleAnnotation* anno = new Nepomuk::SimpleAnnotation();
    anno->setProperty(property);
    anno->setValue("Hello World");
    // currently only the comment is used in the existing GUIs
    anno->setComment("Set label to 'Hello World'");
    addNewAnnotation(anno);

    // tell the framework that we are done. All this could also
    // be async
    emitFinished();
}

And Now?

At the Nepomuk workshop Tom Albers already experimented with integrating the annotation suggestions into Mailody. It is rather simple to do that but the framework still needs polishing. More importantly, however, the created data needs to be presented to the user in a more appealing way. In short: I need help with all this!

Integrate it into your applications, improve it, come up with new ways of presenting the information, write new plugins. Jump on board of the semantic desktop train.

Thanks for reading.

Gran Canaria Desktop Summit 2009 – The Nepomuk Perspective

Now I can look back and say: “I have been there. I have witnessed the first joint conference of the two biggest players in open-source desktop software: KDE and Gnome.” And I can tell you: it was good. I personally think it makes a lot of sense to hold the conference together. After all, it helps a lot in creating inter-desktop relationships and projects. A topic that is especially interesting to me since the Tracker project is also going towards the semantic desktop these days (and dare I say as a result of our work?) and carries in its wake projects like Beagle or Zeitgeist. Thus, the GCDS was an opportunity to meet with people from Tracker to discuss some of the many topics we deal with every day. I also had the opportunity to meet some KDE newbies, eager youngsters wanting to work on KDE and also Nepomuk. I was, of course, delighted. One idea floating around at a party in the evening was smarter caching of thumbnails using Nepomuk information.

Anyway, I am a technical blogger, a geek struggling to deliver the cold facts (and failing later down this very page). So let’s look at the GCDS from a more technical point of view:

Shared/Desktop/Nepomuk Ontologies – In Agreement

I mentioned already that I met with a bunch of the Tracker guys and had some nice chats (this is really not a cold hard fact now, is it?). We discussed a possible shared D-Bus API for metadata storage on the desktop, which looks like a project for next year, and we also discussed and agreed on the shared ontologies project. Now, the ontologies have been a problem for a long time. I already blogged about how the Nepomuk vs. Xesam discussion was settled. We discussed the problem at length on the Xesam mailing list. We created the OSCAF project on SourceForge and even moved the Nepomuk ontologies over from the Nepomuk dev server. But there was no consensus so far. No real development took place. At the GCDS we fixed that. We agreed on a layout for the subversion repository, on a packaging structure, on a file format, and even on a name for the package (we will call it shared-ontologies to match the already existing shared-mimetypes package). After we settle the last tidbits we can start tackling the problems we already have in the bug tracker and get the Nepomuk ontologies ready for shared usage in KDE and Gnome.

A Talk About Practical Nepomuk

On Monday evening I had my talk about Nepomuk. I tried to give a more practical talk which followed a simple outline of three main points:

  1. Why should I use Nepomuk?
  2. What should I use Nepomuk for?
  3. How should I use Nepomuk?

I think this went quite well, except that I am pretty sure I was robbed of a couple of minutes. I do not think that I spoke for half an hour. So we did not have much time for questions in the end, which was a pity. I am pretty sure we could have had a nice discussion. Well, there will be more opportunities for that soon (see below). Apart from my talk, Laura Dragan gave two lightning talks about Konduit and Semnotes and a talk on her new project: semantic context menus in KDE.

The Place of Nepomuk in KDE’s Future

Nepomuk was mentioned in quite a few talks. It seems people are slowly adapting to the ideas and want to integrate Nepomuk features (not only on the almighty Akonadi front). This was also backed by Sebastian Kügler’s keynote on The Momentum of KDE. He talked about the achievements of the KDE project in the last year and what is targeted for the coming year. He mentioned a survey Aaron made which showed that roughly half of the KDE projects he interviewed have plans to integrate Nepomuk in some form. And then suddenly there was one slide which had only two things on it: 1. the Nepomuk logo and 2. the words “Semantic Desktop”. And as if that was not already enough of a booster for my ego, he asked me to stand up so everyone could see me. Now this is the part where I deviate from my righteous path of hard cold facts and wander into a realm of egomania: it felt good.

Well, come back down here, brain. Let us tell the nice readers the really good part: there is funding for one or two more Nepomuk workshops this year! I am very excited about that and cannot wait to start planning the next one (probably sometime in September, taking place in Paris). So if you missed the first workshop or thought it was not for you, maybe the second one is!

And finally: the social aspect

I told you I would fail. I simply have to tell you how nice it has been to see so many familiar faces, to meet so many new people, and to finally have a face connected to an IRC nick or an email address. I am sorry that I could not properly say goodbye to everyone I would have liked to. So please know that I was happy to see you and hope to see you again soon. Maybe at the next Nepomuk workshop.


The First Nepomuk Workshop – It’s a Wrap

The first Nepomuk workshop, and the first KDE workshop ever held in Freiburg, is over. It was great but short. I could have kept working with these guys for much longer. It was a lot of fun to explain the Nepomuk ideas directly and to have people not only listening but also understanding and realizing them.

On Friday we started out slowly. Due to different travel times and also some stupidity from travel agencies and German bus drivers, we were only complete at around sixish. To get in the mood I had everyone explain what they wanted to achieve over the weekend or what they thought could be interesting to work on with Nepomuk. The beginning was not easy; I feared that we would have trouble actually getting to work. After all, you do not start working with Nepomuk just like that. It is too confusing and different for that. But the ideas were very good and the people very interested and eager.

So on Saturday my fears of not being able to handle it vanished. I explained PIMO (just to confuse everyone for real) and showed what I had done with respect to NLP in the Scribo project (Tom already mentioned it in his blog, although he confused it with PIMO. No big deal, there are way too many project and technology names to get mixed up). Sebastian Faubel showed his very interesting work on a replacement for the Gnome open and save file dialog. He also uses RDF to store metadata and then, based on that, decides on a location for the documents in a fixed (not really fixed, but based on a template) folder tree. After that, coding began.

What did we do?

Well, I did not really code anything. There was no time for that. I was too busy discussing with and helping the others. And they did do cool stuff. Let me start by mentioning my hero of the weekend: Tobias König aka tokoe. He wanted to improve the performance of the Akonadi Nepomuk feeder agents which export contact and email metadata from Akonadi to Nepomuk. He did that by introducing a new fast mode into the Nepomuk resource generator. Now he would not be tokoe if he had not been shocked by the hack that is rcgen. So over the weekend he cleaned it up. He cursed, he sweated, he nearly went mad, but he did it! And since we defined it as a bug fix it is even in 4.3. Great work, Tobias.

Then there was Tom Albers. He already blogged about what he did himself, but I will summarize it anyway: he actually dared to integrate the Scribo-based annotation suggestions into Mailody. I will blog about the details later. The idea is that the email body is analysed and a plugin system generates possible annotations such as dates, cities, persons, and also possible events mentioned in the email. Not only did he integrate it into Mailody, he also found two bugs that would have been showstoppers for the Mandriva Scribo demo yesterday. So thanks a lot, Tom.

Raptor. Now Raptor is a cool project. Raptor sets out to replace or provide an alternative to Kickoff. And their idea for Nepomuk integration is to remember the launches of applications. For starters, “only” when an application was launched. This already allows showing more frequently used applications with bigger icons. But it will not stop there. Application launches can be linked to the current context (or the current Plasma activity). I might use KPresenter at work all the time but never at home. It would also be possible to link files to the application launch that was used to open them. And so on. Alessandro, Francesco, and Lukas quickly understood how to create an ontology and use it in KDE. They now have their dedicated application launch ontology and use it via the Nepomuk libs in Raptor. I hope that the ontology can at some point be turned into somewhat of a standard in the desktop ontology project.

Daniel, while normally being an Amarok developer (he did the Nepomuk integration in the GSoC last year), was very eager to make Nepomuk really useful at the basic level. So we discussed handling of removable storage and nicer resource URI design a lot. In the end we decided:

  1. All files will have a random URI that never changes.
  2. All file systems will be represented in Nepomuk with their mount point and their mount status (Daniel already started working on the service that handles that. Also the tracker guys are working on something similar. Matching the ontologies should be fairly easy as the concepts are the same.)
  3. All file URLs will be relative.
  4. All files will have a link to their storing file system.

This helps in solving a bunch of problems. Files on removable storage like USB sticks, which can be mounted in different places, are handled exactly the same way as files on local file systems. Moving a file in most cases only means updating one property: the relative URL. Only if the file is moved to another file system does that link have to be changed, too. Through the mount state flag on the file system in Nepomuk it is very simple to see whether a file is currently available. A search client can simply tell the user that they have to mount the file system in question to access the file. I think this is a fine solution, and since I tried to design everything without relying on the file resource’s URI being the file’s URL, the transition should be fairly simple. A goal for 4.4.
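To make this a bit more concrete: with such a representation a search client could ask Nepomuk for the storing file system along these lines (a sketch only, with file being the file’s Nepomuk::Resource; nie:isPartOf is real, but the exact mount-state property was still to be defined at this point):

using namespace Soprano;

// does Nepomuk know the file system this file is stored on?
QString query = QString( "prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> "
                         "prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> "
                         "select ?fs where { %1 nie:isPartOf ?fs . ?fs a nfo:Filesystem . }" )
        .arg( Node::resourceToN3( file.resourceUri() ) );
QueryResultIterator it = Nepomuk::ResourceManager::instance()->mainModel()
        ->executeQuery( query, Query::QueryLanguageSparql );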

George already blogged about his Nepomuk integration work with Telepathy. What is so great about his work is that he actually uses PIMO in a productive way: one PIMO Person represents one person that can be contacted via Telepathy. And this one pimo:Person has a set of occurrences, these being the actual contacts like Jabber accounts and so on. Thus, if you want to chat with a person you simply click their icon and Telepathy will open one of the available systems, depending on their online status. One could even think of email as a fallback. I find this especially interesting since it so clearly uses the two different layers of information defined via the PIMO ontology: on the lower level we have all the desktop resources like files and emails and Jabber accounts, and on a higher level we have all the real-world entities like the actual person or a project or a city, represented by PIMO concepts. I hope that we will see more integration like this in the future. (I know, I know, I need to write better documentation on this.)

Marcel worked on the Nepomuk integration in Digikam. Since the Digikam team does not want to rely entirely on Nepomuk yet (with it being optional and all), he created a Nepomuk service that keeps the Digikam database and Nepomuk in sync. So rating and tagging your images in Digikam is directly reflected in Nepomuk, and the other way around. Very nice. I hope to see this hit a stable release soon.

I had hoped to have more time to work with Peter on the metadata display in Dolphin. Sadly, that fell by the wayside a bit. But I was at least able to show him the crappy formatting rule system I drafted a while back, and we cleaned up the display a bit: nicer labels and fewer useless properties shown.

What did we look like?

What about the future?

I hope that we can do this again soon. I think it was really worth it and am very happy to have done it. Thanks again to all of you. You made this a successful event.

PS: I wanted to blog earlier but first I had to sleep for two days straight. I am too old for this shit! ;)

Nepomuk in the Google Summer of Code 2009

Like last year and the years before, KDE participates in the Google Summer of Code. And as before, project ideas are gathered independently of the Google process on techbase. And like last year, Nepomuk will be part of the KDE projects proposing ideas. This is where you come in! I already posted two ideas in the list (scroll all the way to the bottom). That is not enough. I will try to come up with more, but I think you have way more ideas up your sleeves (I noticed as much in Oslo where we had a lively discussion about the topic).

So, this is the opportunity to comment on this blog post and give your ideas for GSoC Nepomuk projects. You can, of course, even post the ideas directly on the techbase page (but then you should be available to mentor, too ;)

Thank you for your input.

Akonepomuk, Neponadi – Friendly Takeover or Real Love Marriage

Meeting with developers is always great. Not only does one get challenged and dragged into interesting discussions, often the result is an actual plan or a concrete idea. In this case my trip to Oslo and the visit to the Trolltech (ah, sorry, Nokia) offices had at least three results: 1. a lively discussion about the possibilities that the Semantic Desktop could offer (including a promise from a bunch of developers to add wishes describing these ideas to the KDE bug database); 2. the plan for a Nepomuk developer sprint in the next months (more about that in the next days); 3. a very, very fruitful discussion with Tobias König (currently doing an internship at Nokia) about the integration of Nepomuk into Akonadi or vice versa. Here I will present the results of the latter discussion.

The current situation

Currently (meaning: in KDE 4.2) both Akonadi and Nepomuk use their own databases: Akonadi starts its own MySQL server and Nepomuk runs a Java-based RDF backend. Many people have problems with both MySQL and Java. I personally can understand the latter. In any case this separation of data means that at least parts of the Akonadi data need to be mirrored in Nepomuk. Otherwise we cannot search for PIM data, we cannot link to PIM items, and we cannot create relations between PIM items.

This mirroring of data is an ugly thing since it means that we convert the data in Akonadi (stored as vCards, iCal, or other formats) into RDF graphs. This is currently done by two Akonadi agents which can be found in the kdepim package. Whenever data in Akonadi changes, the data in Nepomuk has to be synced. I personally can see no advantage in this situation.

A possible solution worth discussing

The possible solution Tobias and myself came up with looks as follows:

Akonadi and Nepomuk will share one database – a Virtuoso SQL/RDF server. (For a discussion of the new Virtuoso support in Soprano see my last blog entry.) This is achieved as follows: the current database schema of Akonadi will be kept, except for the parts tables which at the moment store the actual MIME data of the items. Instead, the item table will get a new column which points to an RDF resource in Virtuoso’s own RDF tables. The RDF resource will then represent the item in a real object-oriented form: contacts are encoded using the NCO ontology, emails are encoded using the NMO ontology, and so on. Akonadi will then use an RDF encoding (we propose Turtle) for all data serialization. The advantages are numerous:

  1. All PIM data does exist as nicely encoded semantic data in Nepomuk which can directly be used to relate to and from.
  2. No syncing is necessary.
  3. Serialization plugins for Akonadi can be created automatically from an ontology using a code generator. Thus, an application developer who wants to store data in Akonadi only needs to create their own ontology. A task that is necessary for Nepomuk support anyway. And let’s face it: this is what each application should include in the future ;).
  4. Only one database server running: Virtuoso.
  5. Simpler mapping from database objects to convenience objects in C++: the data in Nepomuk is already represented using an object-oriented approach.

The Virtuoso server could be started using the same “process-singleton” approach libakonadi currently uses to start Akonadi if it is not running. This would keep both Nepomuk and Akonadi free of dependencies on each other.

Possible Problems

The one possible problem remaining might be performance: it remains to be tested whether reading a vCard from SQL and parsing it is much faster than querying an NCO contact resource. Volker, we need your help. Plus: I will see you guys at the KDE-PIM meeting in Berlin. Lots to do. ;)

A New Blog and The Possible End to the Java Dependency in Nepomuk-KDE

I changed the blog system again. Why? It is rather simple: since the update of the blogging system on kdedevelopers.org it is virtually unusable. All the nice features are gone and, I am told, one needs an account to comment. I need something spiffy that works nicely. WordPress was recommended to me.

Now back to the real content: a new backend for Soprano which could finally let us drop the Sesame2 backend, which needs a JVM.

Today I committed the new Virtuoso Soprano backend. Virtuoso is a powerful SQL/RDF DB server created by OpenLink Software. OpenLink provides an open-source version of their database server released under the GPL. The official description from their homepage reads:

At core, Virtuoso is a high-performance object-relational SQL database. As a database, it provides transactions, a smart SQL compiler, powerful stored-procedure language with optional Java and .Net server-side hosting, hot backup, SQL-99 support and more. It has all major data-access interfaces, such as ODBC, JDBC, ADO .Net and OLE/DB.
[...]
OpenLink Virtuoso supports SPARQL embedded into SQL for querying RDF data stored in Virtuoso’s database. SPARQL benefits from low-level support in the engine itself, such as SPARQL-aware type-casting rules and a dedicated IRI data type. This is the newest and fastest developing area in Virtuoso.

Virtuoso not only frees us from the shackles of the Java dependency. It also provides a bunch of features that Sesame2 does not. Most importantly, Virtuoso features full-text indexing which can be used within SPARQL queries:

select ?foo where { ?foo rdfs:label ?label . ?label bif:contains "bar" . }

At the moment (using the Sesame2 backend) we need to make use of the CLucene-based full-text indexing layer which I implemented for Soprano. While this works very well, it forces us to split queries into a full-text part and a graph part and then merge the results. That tends to be slow.

Apart from that, Virtuoso also allows quite a lot of SPARQL magic with nested queries or even embedded SQL. Plus, it supports SPARUL, the SPARQL Update Language, which will make a lot of code simpler and faster.
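To give you an idea, a SPARUL insert looks like this (a made-up example following the SPARUL syntax Virtuoso implements; graph and data are arbitrary):

insert into graph <http://nepomuk.kde.org/data> { <file:///home/me/bar.txt> rdfs:label "bar" . }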

The nice guys at OpenLink were very open to the idea of using their server in KDE. With the release of Virtuoso 5.0.10 they introduced a new lite mode which trims the server down to our needs. The memory usage goes down to a minimum: IMHO roughly 80M is acceptable (especially since the JVM easily goes up to ten times that value!). That, in combination with disabling pretty much all features except for SPARQL, gives us a decent desktop DB solution.

Over the next weeks I will try to get the new backend into shape to replace the sesame2 backend. Hopefully by then distributions will have created packages for Virtuoso. If you want to give it a spin yourself without waiting or even to help me ;) this is what you need to do:

  • Get Virtuoso 5.0.10 from the Sourceforge download page
  • Install Virtuoso (obviously)
  • Start Virtuoso with a config file similar to the one below
  • Change the Nepomuk server config file (~/.kde4/share/config/nepomukserverrc) – add "Soprano Backend=virtuosobackend" to the Basic Settings section.
  • Restart Nepomuk

Now Nepomuk should convert all existing data to the new backend. This process can take a while. There might even be errors. But it lets you test the new features such as the integrated full-text indexing (which still has to be enabled manually).
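If you want to try that, too: as far as I know enabling it boils down to registering a full-text index rule with Virtuoso, e.g. via the isql tool that ships with it (treat this as a pointer rather than gospel and check the Virtuoso documentation):

isql 1111 dba dba exec="DB.DBA.RDF_OBJ_FT_RULE_ADD( null, null, 'index' );"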

And now the virtuoso.ini file:

[Database]
DatabaseFile = soprano-virtuoso.db
ErrorLogFile = soprano-virtuoso.log
TransactionFile = soprano-virtuoso.trx
xa_persistent_file = soprano-virtuoso.pxa
ErrorLogLevel = 7
FileExtend = 100
MaxCheckpointRemap = 1000
Striping = 0
TempStorage = TempDatabase

[TempDatabase]
DatabaseFile = soprano-virtuoso-temp.db
TransactionFile = soprano-virtuoso-temp.trx
MaxCheckpointRemap = 1000
Striping = 0

[Parameters]
LiteMode = 1
ServerPort = 1111
DisableUnixSocket = 0
O_DIRECT = 0
CaseMode = 1
CheckpointAuditTrail = 0
AllowOSCalls = 0
DirsAllowed = .
PrefixResultNames = 0
ServerThreads = 5 ; down from 10
CheckpointInterval = 10 ; down from 60
MaxDirtyBuffers = 50 ; down from 1200
SchedulerInterval = 5 ; down from 10
FreeTextBatchSize = 1000

[HTTPServer]
DavRoot = DAV
EnabledDavVSP = 0
HTTPProxyEnabled = 0
TempASPXDir = 0
Charset = UTF-8
ServerThreads = 2 ; down from 5
KeepAliveTimeout = 5 ; down from 10
HTTPThreadSize = 10000 ; down from 280000

[AutoRepair]
BadParentLinks = 0

[Client]
SQL_PREFETCH_ROWS = 10
SQL_PREFETCH_BYTES = 4096
SQL_QUERY_TIMEOUT = 0
SQL_TXN_TIMEOUT = 0

[VDB]
ArrayOptimization = 0
NumArrayParameters = 10
VDBDisconnectTimeout = 1000
KeepConnectionOnFixedThread = 0

[Replication]
ServerEnable = 0