Small Things are Happening…

I have to admit: I am a sucker for nice APIs. And, yes, I am sort of in love with some of my own creations. Well, at least until I find the flaws and cannot remove them due to binary compatibility issues (see Soprano). This may sound a bit egomaniac but let’s face it: we almost never get credit for good API or good API documentation. So we need to congratulate ourselves.

My new pet is the Nepomuk Query API. As its name says it can be used to query Nepomuk resources and sets out to replace as many hard coded SPARQL queries as possible. It started out rather simple: matching a set of resources with different types of terms. But then Dario Freddi and his relatively complex telepathy queries came along. So the challenge began. I tried to bend the existing API as much as possible to fit in the features he requested. One thing let to another, I suddenly found myself in need of optional terms and a few days later things were not as simple as they started out any more.

ComparisonTerm already was the most complex term in the query API. But that did not mean much. Basically you could set a property and a sub term and that was it. On Monday it became a bit more beastly. Now you can invert it, force the name of the variable used, give it a sort weight, change the sort order, and even set an aggregate function. And all that on only one type of term. At least I implemented optional terms separately.

To explain what all this is good for I will try to illustrate it with a few examples:

Say you want to find all tags a specifc file has (and ignore the fact that there is a Nepomuk::Resource::tags method). This was not possible before inverted comparison terms came along. Now all you have to do is:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTerm(myFile)
);
tagTerm.setInverted(true);
Nepomuk::Query::Query(tagTerm);

What happens is that subject and object change places in the ComparisonTerm, thus, the resulting SPARQL query looks something like the following:

select ?r where { <myFile> nao:hasTag ?r . }

Simple but effective and confusing. It gets better. Previously we only had the clumsy Query::addRequestProperty to get additional bindings from the query. It is very restricted as it only allows to query direct relations from the results. With ComparisonTerm::setVariableName we now have the generic counterpart. By setting a variable name this variable is included in the bindings of the final query and can be retrieved via Result::additionalBinding. This allows to retrieve any detail from any part of the query. Again we use the most simple example to illustrate:

Nepomuk::Query::ComparisonTerm labelTerm(
    Soprano::Vocabulary::NAO::prefLabel(),
    Nepomuk::Query::Term() );
labelTerm.setVariableName( "label" );
Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    labelTerm );
tagTerm.setInverted(true);
Nepomuk::Query::Query query( tagTerm );

This query lists all tags including their labels. Again the resulting SPARQL query would look something like the following:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . }

And silently I used another little gimmick that I introduced: ComparisonTerm can now handle invalid properties and invalid sub terms which will simply act as wildcards (or be represented by a variable in SPARQL terms).

Now on to the next feature: sort weights. The idea is simple: you can sort the search results using any value matched by a ComparisonTerm. So let’s extend the above query by sorting the tags according to their labels.

labelTerm.setSortWeight( 1 );

And the resulting SPARQL query will reflect the sorting:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . } order by ?label

Here I used a sort weight of 1 since I only have the one term that includes sorting. But in theory you can include any number of ComparisonTerms in the sorting. The higher the weight the more important the sort order.

We are nearly done. Only one feature is left: aggregate functions. The Virtuoso SPARQL extensions (and SPARQL 1.1, too) support aggregate functions like count or max. These are now supported in ComparisonTerm. They are only useful in combination with a forced variable name in which case they will be included in the additional bindings or with a sort weight. If we go back to our tags we could for example count the number of tags each file has attached:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
Nepomuk::Query::Query(tagTerm);

And the resulting SPARQL query will be along the lines of:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . }

Now we can of course sort by number of tags:

tagTerm.setSortWeight(1, Qt::DescendingOrder);

And we get:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } order by desc ?cnt

And just because it is fun, let us make the tagging optional so we also get files with zero tags (be aware that normally one should have at least one non-optional term in the query to get useful results. In this case we are on the safe side since we are using a FileQuery):

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
tagTerm.setSortWeight(1, Qt::DescendingOrder);
Nepomuk::Query::FileQuery(
    Nepomuk::Query::OptionalTerm::optionalizeTerm(tagTerm)
);

And with the SPARQL result of this beauty I finish my little session of self-congratulations:

select ?r count(?v1) as ?cnt where { ?r a nfo:FileDataObject . OPTIONAL { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } . } order by desc ?cnt

Virtuoso – Once More With Feeling

The Virtuoso backend for Soprano and, thus, Nepomuk can be seen as rather stable now. So now the big tests can begin as the goal is to make it the standard in KDE 4.4. Let me summarize the important information again:

Step 1

Get Virtuoso 5.0.12 from the Sourceforge download page. Virtuoso 6 is NOT supported. (not yet anyway)

Step 2

Hints for packagers: Soprano only needs two files: the virtuoso-t binary and the virtodbc_r(.so) ODBC driver. Everything else is optional. (For the self-compiling folks out there: –disable-all-vads is your friend.)

Step 3

Install libiodbc which is what the Soprano build will look for (Virtuoso is simply a run-time dependency.)

Step 4

Rebuild Soprano from current svn trunk (Remember: Redland is still mandatory. Its memory storage is used all over Nepomuk!)

Step 5

Edit ${KDEHOME}/share/config/nepomukserverrc with your favorite editor. In the “[Basic Settings]“ section add “Soprano Backend=virtuosobackend”. Do not touch the main repository settings!

Step 6

Restart Nepomuk. I propose the following procedure to gather debugging information in case something goes wrong:
Shutdown Nepomuk completely:

 # qdbus org.kde.NepomukServer /nepomukserver org.kde.NepomukServer.quit

Restart it by piping the output into a temporary file (bash syntax):

 # nepomukserver 2> /tmp/nepomuk.stderr

Step 7

Wait for Nepomuk to convert your data. If you are running KDE trunk you even get a nice progress bar in the notification area (BTW: does anyone know why it won’t show the title?)

And Now?

That is already it. Now you can enjoy the new Virtuoso backend.

The development has taken a long time. But I want to thank OpenLink and especially Patrick van Kleef who helped a lot by fixing the last little tidbits in Virtuoso 5 for my unit tests to pass. Next step is Virtuoso 6.

And Yet Another Post About Virtuoso

Today nearly all problems are solved. OpenLink provided a patch that makes inserting very large literals (more than 1 metabyte in size) lightning fast, even with a very low buffer count. Also I worked around the issue of URI encoding. Now the Soprano Virtuoso backend simply percent-encodes all non-unreserved characters and all reserved characters that are not used in their special meaning in URIs used in queries. Man, that is a mouth full. Well, it seems to work fine although I can always use more testing with weird file URLs (weird means containing weird characters like brackets and the likes). I also fixed some error handling bugs.

So what is left? Well, there are a few hacks in the Virtuoso backend which are rather ugly. One example is the detection of query result types. To determine if the result is boolean, bindings, or a graph it actually checks the name and number of result columns. Urgh! It would be nicer to check for the type of the result. Seems like graph results are BLOBs.

Anyway, enough for tonight. I am tired. Here is the patch to make Virtuoso not hang when Strigi adds nie:PlainTextContent literals of big files:

Index: sqlrcomp.c
===================================================================
RCS file: virtuoso-opensource/libsrc/Wi/sqlrcomp.c,v
retrieving revision 1.9
diff -u -r1.9 sqlrcomp.c
--- sqlrcomp.c  20 Aug 2009 17:47:22 -0000      1.9
+++ sqlrcomp.c  13 Oct 2009 16:11:49 -0000
@@ -65,7 +65,7 @@
 {
 va_list list;
 char temp[2000];
-  int ret;
+  int ret, rest_sz, copybytes;
 va_start (list, string);
 ret = vsnprintf (temp, sizeof (temp), string, list);
 #ifndef NDEBUG
@@ -75,11 +75,16 @@
 va_end (list);
 #ifndef NDEBUG
 if (*fill + strlen (temp) > len - 1)
-    GPF_T1 ("overflow in strncpy");
+    GPF_T1 ("overflow in memcpy");
 #endif
-  strncpy (&text[*fill], temp, len - *fill - 1);
+  rest_sz = (len - fill[0]);
+  if (ret >= rest_sz)
+    copybytes = ((rest_sz > 0) ? rest_sz : 0);
+  else
+    copybytes = ret+1;
+  memcpy (text+fill[0], temp, copybytes);
 text[len - 1] = 0;
-  *fill += (int) strlen (temp);
+  fill[0] += ret;
 }

Aggregating Nepomuk

Recently there have been some posts on Nepomuk in KDE. Tobias König blogged about how to Pimp my Nepomuk. He explains how for many users redland is still the default backend and how to change that. He gives the most important pointers on how to enable the java-based sesame2 Soprano backend. Thomas McGuire gives a very good introduction into what Soprano, Nepomuk, Strigi, and Akonadi are and how they relate. This was a much needed post. Thank you for that, Thomas! And finally mat69 gives his ideas on how to improve the desktop search experience with Nepomuk. He has some good ideas that should really be implemented.

Can somebody please tell me how to get 40 hours out of the work-day? That would really help! ;)

A Bit Of Nepomuk Goodness On Your Developer Fingertips

And yet another technical blog entry. This time it concerns the latest improvements in sopranocmd (Soprano >= 2.3.61). For starters there is an improved NRLModel which provides automatic query prefix expansion. What does that mean? Well, it means that debugging Nepomuk data is simpler now as you can simply use all ontologies stored in the Nepomuk database without defining their prefixes. With sopranocmd (I am using nepomukcmd, a little alias I introduce in the Nepomuk Tips and Tricks) this feature is enabled via the –nrl parameter. Thus, querying all tags becomes:

nepomukcmd --nrl query "select ?r where { ?r a nao:Tag . }"

The second new thing is an improved import command. Again enabled with the –nrl parameter it creates a new named graph of type nrl:KnowledgeBase and puts all new statements (which are not in a graph yet) into it. As described in Nepomuk Data Layout it also adds a metadata graph and the creation date. It actually makes use of NRLModel::createGraph.

The reason I did this was to be able to migrate the tmo:Task instances I had created on the laptop to my desktop machine. Just as an example I will show the procedure here:

First I export all the tasks and their properties on the laptop:

nepomukcmd --nrl export "describe ?r where { ?r a tmo:Task . }" \
     /tmp/task-dump.n4

And then on the desktop I simply import them into Nepomuk:

nepomukcmd --nrl import /tmp/task-dump.n4

Update: You can also use query prefixes for statement listing now. Thus the following is now possible:

nepomukcmd --nrl list "" a tmo:Task

(Even the “a” keyword now maps to rdf:type.)

What Nepomuk Can do and How You Should Use it (as a Developer)

Nepomuk has been around for quite a while but the functionality exposed in KDE 4.3 is still not that impressive. This does not mean that there does not exist cool stuff. It only means that there is not enough developer power to get it all stable and integrated perfectly. Let me give you an overview of what already exists in playground and how it can be used (and how you should use it).

The Basics

For starters there is the Nepomuk API in kdelibs which you should get familiar with.Most importantly (we will use it quite a lot later on) there is Nepomuk::Resource which gives access to arbitrary resources in Nepomuk.

Nepomuk::Resource file( myFilePath );
file.addTag( Nepomuk::Tag( “Fancy stuff” ) );
QString desc = file.description();
QList<Nepomuk::Tag> allTags = Nepomuk::Tag::allTags();

Resource allows simple manipulation of data in Nepomuk. Using some fancy cmake magic through the new NepomukAddOntologyClasses macro in kdelibs data manipulation gets even simpler. The second basic thing you should get familliar with is Soprano and SPARQL. As a quickstart the following code shows how I typically create queries using Soprano:

using namespace Soprano;

Model* model = Nepomuk::ResourceManager::instance()->mainModel();
QString query = QString( “prefix nao:%1 “
                         “select ?x where { “
                         “%2 nao:hasTag ?t . “
                         “?r nao:hasTag ?t . }” )
        .arg(Node::resourceToN3(Vocabulary::NAO::naoNamespace()))
        .arg(Node::resourceToN3(file.resourceUri()));
QueryResultIterator it
        = model->executeQuery( query, Query::QueryLanguageSparql );

As you can see there is always a lot of QString::arg involved to prevent hard-coding of URIs (again Soprano provides some cmake magic for generating Vocabulary namespaces).

These are the basics. Without these basics you cannot use Nepomuk.

Debugging Nepomuk Data

Now before we dive into the unstable, experimental, and really cool stuff let me mention sopranocmd.

sopranocmd is a command line tool that comes with Soprano and allows to perform virtually any operation possible on the Nepomuk RDF database. It has an exhaustive help output and you should use it to debug your data, test your queries and the like (if anyone is interested in creating a graphical version, please step up).

The Nepomuk database (hosting only a single Soprano model called “main”) can be accessed though D-Bus as follows:

sopranocmd --dbus org.kde.NepomukStorage --model main \
      query "select ?r where { ?r ?p ?o . }"

The Good Stuff

There is quite a lot of experimental stuff in the playground but I want to focus on the annotation framework and Scribo.

The central idea of the annotation framework is the annotation suggestion which is encapsulated in the Annotation class (Hint: run “make apidox” in the annotationplugin folder). Instead of the user manually annotating resources (adding tags or relating things to other things) the system proposes annotations which the user then simply acknowledges or discards. These Annotation instances are normally created by AnnotationPlugin instances (although it is perfectly possible to create them some other way) which are trigged through an AnnotationRequest.

Before I continue a short piece of code for the impatient:

Resource res = getResource();

AnnotationPluginWrapper* wrapper = new AnnotationPluginWrapper();
wrapper.setPlugins( AnnotationPluginFactory::instance()
   ->getPluginsSupportingAnnotationForResource( res.resourceUri() ) );
connect( wrapper, SIGNAL(newAnnotation(Nepomuk::Annotation*)),
         this, SLOT(addNewAnnotation(Nepomuk::Annotation*)) );
connect( wrapper, SIGNAL(finished()),
         this, SLOT(slotFinished()) );

AnnotationRequest req;
req.setResource( res );
req.setFilter( filter );
wrapper->getPossibleAnnotations( req );

The AnnotationPluginWrapper is just a convenience class which prevents us from connecting to each plugin separately. It reproduces the same signals the plugins emit.

The interesting part is the AnnotationRequest. At the moment (the framework is under development. This also means that your ideas, patches, and even refactoring actions are very welcome) it has three parameters, all of which are optional:

  1. A resource – The resource for which the annotation should be created. This parameter is a bit tricky as the Annotation::create method allows to create an annotation on an arbitrary resource but in some cases it makes perfect sense to only create annotation suggestions for only one resource.
  2. A filter string – A filter is supposed to be a short string entered by the user which triggers an auto-completion via annotations. Plugins should also take the resource into account if it is set.
  3. A text – An arbitrary long text which is to be analyzed by plugins. Plugins would typically extract keywords or concepts from it. Plugins should also take resource and filter into account if possible. This is where the Scribo system comes in (more later).

Plugins that I already created include very simple ones like the tag plugin which matches the filter to existing tag names and also excludes tags already set on the resource. Way more interesting are other plugins like the pimotype plugin which matches the filter to pimo types and proposed to use that type or the pimo relation plugin which allows to create relations via a very simple syntax: “author:trueg“. The latter will match author to existing properties and trueg to a value based on the property range. One step further goes the geonames annotation plugin which matches the filter or the resource label to cities or countries using the geonames web service. It will then propose to set a location or (in case the resource label was matched) to convert the resource into a city or country linking to the geonames resource.

A picture says more than a thousand words. Thus, here goes:

annotations-english

What do we see here? The user entered the text Paris in the AnnotationWidget (a class available in the framework) and the framework then created a set of suggested annotations. The most likely one is Paris, the city in France as sugested by the geonames plugin. The latter also proposes a few not so likely places. The pimotype plugin proposes to create a new type named Paris and the tag plugin proposes to create a new tag named Paris. Here I see room for improvement: if we can relate to the city Paris there is no need for the tag. Thus, some more sophisticated rating and comparision may be in order.

Now let us bring Scribo into play. Scribo is another framework in the playground which provides an API for text analysis and keyword extraction. It is tied into the annotation framework through a dedicated plugin which uses the TextAnnotation class to create annotations on specific text positions. The TextAnnotation class is supposed to be used to annotate text documents. It will create a new nfo:TextDocument and make it a nie:isPartOf the main document. Then the new resource is annotated according to the implementation.

The Scribo framework will extract keywords and entities from the text (specified via the AnnotationRequest text field) via plugins which will then be used to create annotation suggestions. There currently exist three plugins for Scribo: the datetime plugin extracts dates and times, the pimo plugin matches words in the text to things in the Nepomuk database, and the OpenCalais plugin will use the OpenCalais webservice to extract entities from the text.

You can try the Scribo framework by using the scriboshell which can be found in the playground, too:

scriboshell3

Paste the text to analyze in the left view and press the “Start” button. The right panel will then show all found entities and keywords including the text position and relevance.

The other possibility is to directly use the resourceeditor which is part of the annotation framework and bundles all gui elements the latter has to offer in one widget. Call it on a text file and you will get a window similar to the following:

resourceeditor

At the top you have the typical things: editable label and description, the rating, and the tags. Below that you have the exisiting properties and annotations. In the picture these are only properties extracted by Strigi. Then comes the interesting part: the suggestions. Here you can see three different Scribo plugins in action. First the pimo plugin matched the word “Brein” to an event I already had in my Nepomuk database. Then there is the OpenCalais plugin which extracted the “Commission of European Communities” (so far the plugin ignores the additional semantic information provided by OpenCalais) and proposes to tag the text with it.

The last suggested annotation that we can see is “Create Event“. This is a very interesting hack I did. The Scribo plugin detected the mentioning of a project, a date, and persons and thus, proposes to create an event which has as its topic the project and takes place at the extracted time. Since it is a hack created specifically for a demo its results will not be very great in many situations. But it shows the direction which I would like to take.

Below the suggestions you can see the AnnotationWidget again which allows to manually annotate the file.

How to Write an AnnotationPlugin

This is a Howto in three sentences: Derive from AnnotationPlugin and implement doGetPossibleAnnotations. In that method trigger the creation of annotations. Your annotations can be instances of SimpleAnnotation or be based on Annotation and implement at least doCreate, exists, and equals .

class MyAnnotationPlugin : pubic Nepomuk::AnnotationPlugin
{
public:
    MyAnnotationPlugin(QObject* parent, const QVariantList&);
protected:
    void doGetPossibleAnnotations(const Nepomuk::AnnotationRequest&);
};

void MyAnnotationPlugin::doGetPossibleAnnotations(
      const Nepomuk::AnnotationRequest& request
)
{
    // MyFancyAnnotation can do all sorts of crazy things like creating
    // whole graphs of data or even openeing another GUI
    addNewAnnotation(new MyFancyAnnotation(request));

    // SimpleAnnotation can be used to create simple key/value pairs
    Nepomuk::Types::Property property(Soprano::Vocabulary::NAO::prefLabel());
    Nepomuk::SimpleAnnotation* anno = new Nepomuk::SimpleAnnotation();
    anno->setProperty(property);
    anno->setValue("Hello World");
    // currently only the comment is used in the existing GUIs
    anno->setComment("Set label to 'Hello World'");
    addNewAnnotation(anno);

    // tell the framework that we are done. All this could also
    // be async
    emitFinsihed();
}

And Now?

At the Nepomuk workshop Tom Albers already experimented with integrating the annotation suggestions into Mailody. It is rather simple to do that but the framework still needs polishing. More importantly, however, the created data needs to be presented to the user in a more appealing way. In short: I need help with all this!

Integrate it into your applications, improve it, come up with new ways of presenting the information, write new plugins. Jump on board of the semantic desktop train.

Thanks for reading.

Sharing my Brain – Another Result of the Nepomuk Workshop

If I learned anything at the Nepomuk workshop it is that too much information is just in my head and nowhere else. I tried to share it by writing API documentation and tutorials and blogs. But it never is enough. So today comes another dump from my brain: Nepomuk tips and tricks, a new chapter in the Nepomuk tutorial series. I hope it helps you to make more of the technologies Nepomuk provides.