Aggregating Nepomuk

Recently there have been some posts on Nepomuk in KDE. Tobias König blogged about how to Pimp my Nepomuk. He explains how for many users redland is still the default backend and how to change that. He gives the most important pointers on how to enable the java-based sesame2 Soprano backend. Thomas McGuire gives a very good introduction into what Soprano, Nepomuk, Strigi, and Akonadi are and how they relate. This was a much needed post. Thank you for that, Thomas! And finally mat69 gives his ideas on how to improve the desktop search experience with Nepomuk. He has some good ideas that should really be implemented.

Can somebody please tell me how to get 40 hours out of the work-day? That would really help! ;)

A Bit Of Nepomuk Goodness On Your Developer Fingertips

And yet another technical blog entry. This time it concerns the latest improvements in sopranocmd (Soprano >= 2.3.61). For starters there is an improved NRLModel which provides automatic query prefix expansion. What does that mean? Well, it means that debugging Nepomuk data is simpler now as you can simply use all ontologies stored in the Nepomuk database without defining their prefixes. With sopranocmd (I am using nepomukcmd, a little alias I introduce in the Nepomuk Tips and Tricks) this feature is enabled via the –nrl parameter. Thus, querying all tags becomes:

nepomukcmd --nrl query "select ?r where { ?r a nao:Tag . }"

The second new thing is an improved import command. Again enabled with the –nrl parameter it creates a new named graph of type nrl:KnowledgeBase and puts all new statements (which are not in a graph yet) into it. As described in Nepomuk Data Layout it also adds a metadata graph and the creation date. It actually makes use of NRLModel::createGraph.

The reason I did this was to be able to migrate the tmo:Task instances I had created on the laptop to my desktop machine. Just as an example I will show the procedure here:

First I export all the tasks and their properties on the laptop:

nepomukcmd --nrl export "describe ?r where { ?r a tmo:Task . }" \

And then on the desktop I simply import them into Nepomuk:

nepomukcmd --nrl import /tmp/task-dump.n4

Update: You can also use query prefixes for statement listing now. Thus the following is now possible:

nepomukcmd --nrl list "" a tmo:Task

(Even the “a” keyword now maps to rdf:type.)

What Nepomuk Can do and How You Should Use it (as a Developer)

Nepomuk has been around for quite a while but the functionality exposed in KDE 4.3 is still not that impressive. This does not mean that there does not exist cool stuff. It only means that there is not enough developer power to get it all stable and integrated perfectly. Let me give you an overview of what already exists in playground and how it can be used (and how you should use it).

The Basics

For starters there is the Nepomuk API in kdelibs which you should get familiar with.Most importantly (we will use it quite a lot later on) there is Nepomuk::Resource which gives access to arbitrary resources in Nepomuk.

Nepomuk::Resource file( myFilePath );
file.addTag( Nepomuk::Tag( “Fancy stuff” ) );
QString desc = file.description();
QList<Nepomuk::Tag> allTags = Nepomuk::Tag::allTags();

Resource allows simple manipulation of data in Nepomuk. Using some fancy cmake magic through the new NepomukAddOntologyClasses macro in kdelibs data manipulation gets even simpler. The second basic thing you should get familliar with is Soprano and SPARQL. As a quickstart the following code shows how I typically create queries using Soprano:

using namespace Soprano;

Model* model = Nepomuk::ResourceManager::instance()->mainModel();
QString query = QString( “prefix nao:%1 “
                         “select ?x where { “
                         “%2 nao:hasTag ?t . “
                         “?r nao:hasTag ?t . }” )
QueryResultIterator it
        = model->executeQuery( query, Query::QueryLanguageSparql );

As you can see there is always a lot of QString::arg involved to prevent hard-coding of URIs (again Soprano provides some cmake magic for generating Vocabulary namespaces).

These are the basics. Without these basics you cannot use Nepomuk.

Debugging Nepomuk Data

Now before we dive into the unstable, experimental, and really cool stuff let me mention sopranocmd.

sopranocmd is a command line tool that comes with Soprano and allows to perform virtually any operation possible on the Nepomuk RDF database. It has an exhaustive help output and you should use it to debug your data, test your queries and the like (if anyone is interested in creating a graphical version, please step up).

The Nepomuk database (hosting only a single Soprano model called “main”) can be accessed though D-Bus as follows:

sopranocmd --dbus org.kde.NepomukStorage --model main \
      query "select ?r where { ?r ?p ?o . }"

The Good Stuff

There is quite a lot of experimental stuff in the playground but I want to focus on the annotation framework and Scribo.

The central idea of the annotation framework is the annotation suggestion which is encapsulated in the Annotation class (Hint: run “make apidox” in the annotationplugin folder). Instead of the user manually annotating resources (adding tags or relating things to other things) the system proposes annotations which the user then simply acknowledges or discards. These Annotation instances are normally created by AnnotationPlugin instances (although it is perfectly possible to create them some other way) which are trigged through an AnnotationRequest.

Before I continue a short piece of code for the impatient:

Resource res = getResource();

AnnotationPluginWrapper* wrapper = new AnnotationPluginWrapper();
wrapper.setPlugins( AnnotationPluginFactory::instance()
   ->getPluginsSupportingAnnotationForResource( res.resourceUri() ) );
connect( wrapper, SIGNAL(newAnnotation(Nepomuk::Annotation*)),
         this, SLOT(addNewAnnotation(Nepomuk::Annotation*)) );
connect( wrapper, SIGNAL(finished()),
         this, SLOT(slotFinished()) );

AnnotationRequest req;
req.setResource( res );
req.setFilter( filter );
wrapper->getPossibleAnnotations( req );

The AnnotationPluginWrapper is just a convenience class which prevents us from connecting to each plugin separately. It reproduces the same signals the plugins emit.

The interesting part is the AnnotationRequest. At the moment (the framework is under development. This also means that your ideas, patches, and even refactoring actions are very welcome) it has three parameters, all of which are optional:

  1. A resource – The resource for which the annotation should be created. This parameter is a bit tricky as the Annotation::create method allows to create an annotation on an arbitrary resource but in some cases it makes perfect sense to only create annotation suggestions for only one resource.
  2. A filter string – A filter is supposed to be a short string entered by the user which triggers an auto-completion via annotations. Plugins should also take the resource into account if it is set.
  3. A text – An arbitrary long text which is to be analyzed by plugins. Plugins would typically extract keywords or concepts from it. Plugins should also take resource and filter into account if possible. This is where the Scribo system comes in (more later).

Plugins that I already created include very simple ones like the tag plugin which matches the filter to existing tag names and also excludes tags already set on the resource. Way more interesting are other plugins like the pimotype plugin which matches the filter to pimo types and proposed to use that type or the pimo relation plugin which allows to create relations via a very simple syntax: “author:trueg“. The latter will match author to existing properties and trueg to a value based on the property range. One step further goes the geonames annotation plugin which matches the filter or the resource label to cities or countries using the geonames web service. It will then propose to set a location or (in case the resource label was matched) to convert the resource into a city or country linking to the geonames resource.

A picture says more than a thousand words. Thus, here goes:


What do we see here? The user entered the text Paris in the AnnotationWidget (a class available in the framework) and the framework then created a set of suggested annotations. The most likely one is Paris, the city in France as sugested by the geonames plugin. The latter also proposes a few not so likely places. The pimotype plugin proposes to create a new type named Paris and the tag plugin proposes to create a new tag named Paris. Here I see room for improvement: if we can relate to the city Paris there is no need for the tag. Thus, some more sophisticated rating and comparision may be in order.

Now let us bring Scribo into play. Scribo is another framework in the playground which provides an API for text analysis and keyword extraction. It is tied into the annotation framework through a dedicated plugin which uses the TextAnnotation class to create annotations on specific text positions. The TextAnnotation class is supposed to be used to annotate text documents. It will create a new nfo:TextDocument and make it a nie:isPartOf the main document. Then the new resource is annotated according to the implementation.

The Scribo framework will extract keywords and entities from the text (specified via the AnnotationRequest text field) via plugins which will then be used to create annotation suggestions. There currently exist three plugins for Scribo: the datetime plugin extracts dates and times, the pimo plugin matches words in the text to things in the Nepomuk database, and the OpenCalais plugin will use the OpenCalais webservice to extract entities from the text.

You can try the Scribo framework by using the scriboshell which can be found in the playground, too:


Paste the text to analyze in the left view and press the “Start” button. The right panel will then show all found entities and keywords including the text position and relevance.

The other possibility is to directly use the resourceeditor which is part of the annotation framework and bundles all gui elements the latter has to offer in one widget. Call it on a text file and you will get a window similar to the following:


At the top you have the typical things: editable label and description, the rating, and the tags. Below that you have the exisiting properties and annotations. In the picture these are only properties extracted by Strigi. Then comes the interesting part: the suggestions. Here you can see three different Scribo plugins in action. First the pimo plugin matched the word “Brein” to an event I already had in my Nepomuk database. Then there is the OpenCalais plugin which extracted the “Commission of European Communities” (so far the plugin ignores the additional semantic information provided by OpenCalais) and proposes to tag the text with it.

The last suggested annotation that we can see is “Create Event“. This is a very interesting hack I did. The Scribo plugin detected the mentioning of a project, a date, and persons and thus, proposes to create an event which has as its topic the project and takes place at the extracted time. Since it is a hack created specifically for a demo its results will not be very great in many situations. But it shows the direction which I would like to take.

Below the suggestions you can see the AnnotationWidget again which allows to manually annotate the file.

How to Write an AnnotationPlugin

This is a Howto in three sentences: Derive from AnnotationPlugin and implement doGetPossibleAnnotations. In that method trigger the creation of annotations. Your annotations can be instances of SimpleAnnotation or be based on Annotation and implement at least doCreate, exists, and equals .

class MyAnnotationPlugin : pubic Nepomuk::AnnotationPlugin
    MyAnnotationPlugin(QObject* parent, const QVariantList&);
    void doGetPossibleAnnotations(const Nepomuk::AnnotationRequest&);

void MyAnnotationPlugin::doGetPossibleAnnotations(
      const Nepomuk::AnnotationRequest& request
    // MyFancyAnnotation can do all sorts of crazy things like creating
    // whole graphs of data or even openeing another GUI
    addNewAnnotation(new MyFancyAnnotation(request));

    // SimpleAnnotation can be used to create simple key/value pairs
    Nepomuk::Types::Property property(Soprano::Vocabulary::NAO::prefLabel());
    Nepomuk::SimpleAnnotation* anno = new Nepomuk::SimpleAnnotation();
    anno->setValue("Hello World");
    // currently only the comment is used in the existing GUIs
    anno->setComment("Set label to 'Hello World'");

    // tell the framework that we are done. All this could also
    // be async

And Now?

At the Nepomuk workshop Tom Albers already experimented with integrating the annotation suggestions into Mailody. It is rather simple to do that but the framework still needs polishing. More importantly, however, the created data needs to be presented to the user in a more appealing way. In short: I need help with all this!

Integrate it into your applications, improve it, come up with new ways of presenting the information, write new plugins. Jump on board of the semantic desktop train.

Thanks for reading.

Sharing my Brain – Another Result of the Nepomuk Workshop

If I learned anything at the Nepomuk workshop it is that too much information is just in my head and nowhere else. I tried to share it by writing API documentation and tutorials and blogs. But it never is enough. So today comes another dump from my brain: Nepomuk tips and tricks, a new chapter in the Nepomuk tutorial series. I hope it helps you to make more of the technologies Nepomuk provides.

Nepomuk And Some CMake Magic

And now for some technical build scripting. Some of you maybe know nepomuk-rcgen. It can be used to generate subclasses of Nepomuk::Resource which provide nice wrapper methods around Nepomuk::Resource::property and Nepomuk::Resource::setProperty. Although it has been in need of improvement for quite some time it can be very useful as code gets way more readable.

So far the cmake integration was, well, let’s face it, bad. It was the result of my first attempts at creating a bit more complex cmake macros. Funnily enough the kdepim developers did not find a real solution to the problem of creating source files at cmake-time either. They would have a long list of classes that would be generated in the CMakeLists.txt file.

Today I finally sat down and fixed it. The new macro (currently only available in playground due to the KDE 4.3 feature freeze) does a much better job. It creates the list of source and header files to be generated at cmake-time while the files themselves are generated at build-time. This way not all files need to be recompiled every time cmake is run. This was the biggest problem of the previous macro.

Usage is fairly simple and in align with known KDE macros such as kde4_add_ui_files:

     [ONTOLOGIES] <onto-file1> [<onto-file2> ...]
     [TEMPLATES <template1> [<template2> ...]]

A typical example looks as follows:

kde4_add_library(nie STATIC ${nie_SOURCES})
target_link_libraries(nie ${NEPOMUK_LIBRARIES})

I will probably also add a SERIALIZATION parameter so we do not have to rely on the crude auto-detection based on the file extension.

Another tool which is probably even less known is Soprano’s onto2vocabularyclass. It also generates code. (Actually not classes but namespaces. Thus the name is misleading. ) And is also uses an ontology file as input. The result is a namespace like Soprano::Vocabulary::NAO which gives access to static QUrl objects representing the classes and properties defined in the ontology. A very useful tool when one has to create many queries and does not want to hardcode all the RDF namespaces.

Just to show how one would typically use such a namespace here is a SPARQL query using Soprano:

QString query = QString("select ?r where { ?r a %1 . }")
      .arg( Soprano::Node::resourceToN3( Soprano::Vocabulary::NAO::Tag() ) );

And just like before I created a little cmake macro that creates these namespaces for you. It has a bit more parameters than the one above but is still quite simple:

     <namespace to use>
     [VISIBILITY <visibility-name>])

One simply has to specify the ontology file, the name of the C++ namespace (NAO in the example above), the parent namespace (Soprano::Vocabulary in the example above), the serialization of the ontology file, and optionally the visibility in case the namespace should be exported publicly (see the onto2vocabularyclass documentation for details).

Like before I will close with a little example taken from playground (simplified):

   VISIBILITY "nepomuk")
kde4_add_library(pimo SHARED ${pimo_LIB_SRCS})

Redland 1.0.9 breaks Nepomuk

This is how Soprano (and with it Nepomuk) goes down when run against Redland 1.0.9:

symbol lookup error: /usr/lib64/redland/ undefined symbol: librdf_storage_register_factory

Redland 1.0.8 works without problems. This is mainly for informational purposes and to propose to distributors to downgrade. I will also post a bug report with Redland and try to figure out what the problem is. But in the meantime it might make sense to not live on the edge for once. ;)

“Are We There Yet?” – The Long Road To a Stable Soprano Virtuoso Backend

The first time I blogged about the Virtuoso backend for Soprano testing it was still a bit bumpy: one had to manually start a server locally and then connect to it. Now the situation has changed. I just commited the changes that allow the Soprano Virtuoso backend to spawn a local instance of the Virtuoso server. Thus, using the Virtuoso backend in Nepomuk is now as simple as specifying it in the Nepomuk server configuration file ~/.kde/share/config/nepomukserverrc as mentioned before:

[Basic Settings]
Configured repositories=main
Soprano Backend=virtuoso
Start Nepomuk=true

(Notice that the backend name is now “virtuoso” without the “backend” suffix. Actually Soprano can now handle both for convinience.)

The backend is still not 100% stable. Some queries still fail due to encoding problems. which may include conversion of the data.

Apart from the spawning of a local instance the backend can now handle indices (for improved query speed) and the state of the full text index. Both are controlled through backend options as explained in the Soprano Virtuoso backend documentation.

Virtuoso Packaging

Now that is established let’s go to a few packaging questions. There were arguments that Virtuoso uses too much disk space. Well, fact is that the Virtuoso server as used by Soprano only needs a very small portion of the whole Virtuoso installation: the “virtuoso-t” binary and the “” ODBC driver. Combined they come to a size of roughly 9 MB. I think that is no big problem (as a comparison the whole Virtuoso installation is about 150 MB). :)

Packagers should split the Virtuoso package into small parts. I recommend to provide a package for the server binary, one for the ODBC drivers, one for the VAD files, and so on. Thus, the harddisk usage will be minimal.

Akonepomuk, Neponadi – Friendly Takeover or Real Love Marriage

Meeting with developers is always great. Not only does one get challanged and dragged into interesting discussions, often the result is an actual plan or a concrete idea. In this case my trip to Oslo and the vistit of the Trolltech (ah, sorry, Nokia) offices resulted in at least three results: 1. a lively discussion about the possibilities that the Semantic Desktop could offer (including a promise from a bunch of developers to ad wishes describing these ideas into the KDE bug database); 2. the plan for a Nepomuk developer sprint in the next months (more about that in the next days), 3. a very very fruitful discussion with Tobias König (currently doing an internship at Nokia) about the integration of Nepomuk into Akonadi or vice versa. Here I will present the results of the latter discussion.

The current situation

Currently (meaning: in KDE 4.2) both Akonadi and Nepomuk use their own databases: Akonadi starts its own mysql server and Nepomuk runs a java based RDF backend. Many people have problems with both mysql and java. I personally can understand the latter. In any case this separation of data means that at least parts of Akonadi data needs to be mirrored in Nepomuk. Otherwise we cannot search for PIM data, we cannot link to PIM items, and we cannot create relations between PIM items.

This mirroring of data is an ugly thing since it means that we convert the data in Akonadi (stored as VCards, ICal, or other formats) into RDF graphs. This is currently done by two Akonadi agents which can be found in the kdepim package. Whenever data in Akonadi changes, the data in Nepomuk has to be synced. I personally can see no advantage of this situation.

A possible solution worth discussing

The possible solution Tobias and myself came up with looks as follows:

Akonadi and Nepomuk will share one database – a Virtuoso SQL/RDF server. (For a discussion of the new Virtuoso support in Soprano see my last blog entry.) This is achieved as follows: the current database schema of Akonadi will be kept except for the parts tables which at the moment store the actual mime data of the items. Instead the item table will get a new column which points to an RDF resource in Virtuoso’s own RDF tables. The RDF resource will them represent the item in a real object-oriented form: contacts are encoded using the NCO Ontology, emails are encoded using the NMO Ontology, and so on. Akonadi will then use an RDF encoding (we propose turtle) for all data serialization. The advantages are numerous:

  1. All PIM data does exist as nicely encoded semantic data in Nepomuk which can directly be used to relate to and from.
  2. No syncing is necessary.
  3. Serialization plugins for Akonadi can be created automatically from an ontology using a code generator. Thus, an application developer who wants to store data in Akonadi only needs to create their own ontology. A task that is necessary for Nepomuk support anyway. And let’s face it: this is what each application should include in the future ;).
  4. Only one database server running: Virtuoso.
  5. Simpler mapping from database objects to convenience objects in C++: the data in Nepomuk is already represented using an object-oriented approach.

The Virtuoso server could be started using the same “process-singleton” approach libakonadi is currently using to start Akonadi if it is not running. This would keep both Nepomuk and Akonadi free of dependancies on each other.

Possible Problems

The one possible problem remaining might be performance: It stays to be tested if reading a vcard from SQL and parsing it is much faster than querying an NCO contact resource. Volker, we need your help. Plus: I see you guys at the KDE-PIM meeting in Berlin. Lots to do. ;)

A New Blog and The Possible End to the Java Dependancy in Nepomuk-KDE

I changed the blog system again. Why? It is rather simple: since the update of the blogging system of it is virtually unusable. All the nice features are gone and I am told, one needs to have an account to comment. I need something spiffy that works nicely. WordPress was recommended to me.

Now back to the real content: A new backend for Soprano which could finally let us drop the sesame2 backend which needs a JVM.

Today I committed the new Virtuoso Soprano backend. Virtuoso is a powerful SQL/RDF DB server created by OpenLink Software. OpenLink provides an open-source version of their database server released under the GPL. The official description from their homepage reads:

At core, Virtuoso is a high-performance object-relational SQL database. As a database, it provides transactions, a smart SQL compiler, powerful stored-procedure language with optional Java and .Net server-side hosting, hot backup, SQL-99 support and more. It has all major data-access interfaces, such as ODBC, JDBC, ADO .Net and OLE/DB.
OpenLink Virtuoso supports SPARQL embedded into SQL for querying RDF data stored in Virtuoso’s database. SPARQL benefits from low-level support in the engine itself, such as SPARQL-aware type-casting rules and a dedicated IRI data type. This is the newest and fastest developing area in Virtuoso.

Virtuoso not only frees us from the shackles of the java dependency. It also provides a bunch of features that Sesame2 does not. Most importantly Virtuoso features full text indexing which can be used within SPARQL queries:

select ?foo where { ?foo rdfs:label ?label . ?label bif:contains "bar" . }

At the moment (using the sesame2 backend) we need to make use of the CLucene based full-text-indexing layer which I implemented for Soprano. While this works very well, it forces us to split queries into fulltext and graph part and then merge the results. That tends to be slow.

Apart from the Virtuoso also allows to do quite a lot of SPARQL magic with nested queries or even embedded SQL. Plus it supports SPARUL, the Sparql Update Language which will make a lot of code simpler and faster.

The nice guys at OpenLink were very open to the idea of using their server in KDE. With the release of Virtuoso 5.0.10 they introduced a new lite mode which trims the server down to our needs. The memory usage goes down to a minimum: IMHO roughly 80M is acceptable. (Especially since the JVM easily goes up to 10 times that value!) That in combination with disabling pretty much all features except for SPARQL we have a decent desktop DB solution.

Over the next weeks I will try to get the new backend into shape to replace the sesame2 backend. Hopefully by then distributions will have created packages for Virtuoso. If you want to give it a spin yourself without waiting or even to help me ;) this is what you need to do:

  • Get Virtuoso 5.0.10 from the Sourceforge download page
  • Install Virtuoso (obviously)
  • Start Virtuoso with a config file similar to the one below
  • Change the Nepomuk server config file (~/.kde4/share/config/nepomukserverrc) – Add "Soprano Backend=virtuosobackend” to the Basic Settings section.
  • Restart Nepomuk

Now Nepomuk should convert all existing data to the new backend. This process can take a while. There might even be errors. But it allows to test the new features such as the integrated fulltext indexing (which still has to be enabled manually).

And now the virtuoso.ini file:

DatabaseFile = soprano-virtuoso.db
ErrorLogFile = soprano-virtuoso.log
TransactionFile = soprano-virtuoso.trx
xa_persistent_file = soprano-virtuoso.pxa
ErrorLogLevel = 7
FileExtend = 100
MaxCheckpointRemap = 1000
Striping = 0
TempStorage = TempDatabase

DatabaseFile = soprano-virtuoso-temp.db
TransactionFile = soprano-virtuoso-temp.trx
MaxCheckpointRemap = 1000
Striping = 0

LiteMode = 1
ServerPort = 1111
DisableUnixSocket = 0
CaseMode = 1
CheckpointAuditTrail = 0
AllowOSCalls = 0
DirsAllowed = .
PrefixResultNames = 0
ServerThreads = 5 ; down from 10
CheckpointInterval = 10 ; down from 60
MaxDirtyBuffers = 50 ; down from 1200
SchedulerInterval = 5 ; down from 10
FreeTextBatchSize = 1000

DavRoot = DAV
EnabledDavVSP = 0
HTTPProxyEnabled = 0
TempASPXDir = 0
Charset = UTF-8
ServerThreads = 2 ; down from 5
KeepAliveTimeout = 5 ; down from 10
HTTPThreadSize = 10000 ; down from 280000

BadParentLinks = 0


ArrayOptimization = 0
NumArrayParameters = 10
VDBDisconnectTimeout = 1000
KeepConnectionOnFixedThread = 0

ServerEnable = 0