Small Things are Happening…

I have to admit: I am a sucker for nice APIs. And, yes, I am sort of in love with some of my own creations. Well, at least until I find the flaws and cannot remove them due to binary compatibility issues (see Soprano). This may sound a bit egomaniac but let’s face it: we almost never get credit for good API or good API documentation. So we need to congratulate ourselves.

My new pet is the Nepomuk Query API. As its name says it can be used to query Nepomuk resources and sets out to replace as many hard coded SPARQL queries as possible. It started out rather simple: matching a set of resources with different types of terms. But then Dario Freddi and his relatively complex telepathy queries came along. So the challenge began. I tried to bend the existing API as much as possible to fit in the features he requested. One thing let to another, I suddenly found myself in need of optional terms and a few days later things were not as simple as they started out any more.

ComparisonTerm already was the most complex term in the query API. But that did not mean much. Basically you could set a property and a sub term and that was it. On Monday it became a bit more beastly. Now you can invert it, force the name of the variable used, give it a sort weight, change the sort order, and even set an aggregate function. And all that on only one type of term. At least I implemented optional terms separately.

To explain what all this is good for I will try to illustrate it with a few examples:

Say you want to find all tags a specifc file has (and ignore the fact that there is a Nepomuk::Resource::tags method). This was not possible before inverted comparison terms came along. Now all you have to do is:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTerm(myFile)
);
tagTerm.setInverted(true);
Nepomuk::Query::Query(tagTerm);

What happens is that subject and object change places in the ComparisonTerm, thus, the resulting SPARQL query looks something like the following:

select ?r where { <myFile> nao:hasTag ?r . }

Simple but effective and confusing. It gets better. Previously we only had the clumsy Query::addRequestProperty to get additional bindings from the query. It is very restricted as it only allows to query direct relations from the results. With ComparisonTerm::setVariableName we now have the generic counterpart. By setting a variable name this variable is included in the bindings of the final query and can be retrieved via Result::additionalBinding. This allows to retrieve any detail from any part of the query. Again we use the most simple example to illustrate:

Nepomuk::Query::ComparisonTerm labelTerm(
    Soprano::Vocabulary::NAO::prefLabel(),
    Nepomuk::Query::Term() );
labelTerm.setVariableName( "label" );
Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    labelTerm );
tagTerm.setInverted(true);
Nepomuk::Query::Query query( tagTerm );

This query lists all tags including their labels. Again the resulting SPARQL query would look something like the following:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . }

And silently I used another little gimmick that I introduced: ComparisonTerm can now handle invalid properties and invalid sub terms which will simply act as wildcards (or be represented by a variable in SPARQL terms).

Now on to the next feature: sort weights. The idea is simple: you can sort the search results using any value matched by a ComparisonTerm. So let’s extend the above query by sorting the tags according to their labels.

labelTerm.setSortWeight( 1 );

And the resulting SPARQL query will reflect the sorting:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . } order by ?label

Here I used a sort weight of 1 since I only have the one term that includes sorting. But in theory you can include any number of ComparisonTerms in the sorting. The higher the weight the more important the sort order.

We are nearly done. Only one feature is left: aggregate functions. The Virtuoso SPARQL extensions (and SPARQL 1.1, too) support aggregate functions like count or max. These are now supported in ComparisonTerm. They are only useful in combination with a forced variable name in which case they will be included in the additional bindings or with a sort weight. If we go back to our tags we could for example count the number of tags each file has attached:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
Nepomuk::Query::Query(tagTerm);

And the resulting SPARQL query will be along the lines of:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . }

Now we can of course sort by number of tags:

tagTerm.setSortWeight(1, Qt::DescendingOrder);

And we get:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } order by desc ?cnt

And just because it is fun, let us make the tagging optional so we also get files with zero tags (be aware that normally one should have at least one non-optional term in the query to get useful results. In this case we are on the safe side since we are using a FileQuery):

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
tagTerm.setSortWeight(1, Qt::DescendingOrder);
Nepomuk::Query::FileQuery(
    Nepomuk::Query::OptionalTerm::optionalizeTerm(tagTerm)
);

And with the SPARQL result of this beauty I finish my little session of self-congratulations:

select ?r count(?v1) as ?cnt where { ?r a nfo:FileDataObject . OPTIONAL { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } . } order by desc ?cnt

Dangling Meta Data Graphs (Caution: Very Technical)

Nepomuk in KDE uses NRL – the Nepomuk Representation Language – especially the named graphs that it defines. Each triple that is stored in the Nepomuk database is stored in a named graph. We use this graph to attach meta data to the triples themselves. So far this is restricted to the creation date but in the future there will be more like the creator (for shared meta data) and the modification date (this is a bit tricky since technically triples are never modified, only added and deleted. But from a user’s point of view changing a rating means changing a triple.).

What did I say? “We attach meta data to the triples”? Well, to be exact we attach it to the graph which contains the triples. And since everything is triples (or quadruples since there is the named graph) the meta data is, too. And like every triple these also need to be put in a dedicated named graph – the meta data graph. Thus, each triple is contained in one graph and each graph has exactly one meta data graph.

So far so good. But what happens if we delete all triples in a graph? Well, the graph ceases to exist since graphs like resources in an RDF database do only exist due to the triples in which they occur.

And that is when it happens: dangling meta data graphs, i.e. meta data graphs that describe a graph which does no longer exist.

In theory Nepomuk could delete these automatically but I decided against that for performance reasons. It would have to check for dangling meta data graphs after each removal operation. So for now (until I come up with some database maintenance service) these graphs are just waste hanging around not bothering anyone (they are small).

In case you want to check how many of them you have in your database use the following command (see the Nepomuk Tips and Tricks for nepomukcmd):

nepomukcmd query 'select count(?mg) where { ?mg nrl:coreGraphMetadataFor ?g . OPTIONAL { graph ?g { ?s ?p ?o . } . } . FILTER(!BOUND(?s)) . }'

And to simply delete them use a bit of shell magic (the –foo is important since it removes any human-readability gimmicks from the otuput):

for a in `nepomukcmd --foo query 'select ?mg where { ?mg nrl:coreGraphMetadataFor ?g . OPTIONAL { graph ?g { ?s ?p ?o . } . } . FILTER(!BOUND(?s)) . }'`; do nepomukcmd rmgraph "$a"; done

Convenient Querying in libnepomuk

It has been a while since I blogged and a lot has happened. Not only did we have the second Nepomuk workshop on the “Open Social Semantic Desktop” which I did not report on yet, there is also a lot of stuff going on for KDE 4.4. And since I like blogging about technical stuff so much I will start with that. So here goes:

Queries in KDE 4.4 will be so much simpler (for the developer that is) since we now have the Nepomuk Query API!

Do you remember the days when you tried to write your own SPARQL queries and code looked like this:

Nepomuk::Tag myTag = getOurFancyTag();
QString query
   = QString("select distinct ?r where { ?r %1 %2 . }")
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
     .arg( Soprano::Node::resourceToN3(myTag.resourceUri()) );

Well, using the query API this looks a lot nicer:

Nepomuk::Query::ResourceTerm tagTerm(myTag);
Nepomuk::Query::ComparisonTerm term(Soprano::Vocabulary::NAO::hasTag(), tagTerm);
Nepomuk::Query::Query query(term);
QString queryString = query.toSparqlString();

As you can see you do not need to know any SPARQL anymore. But it gets better. The Query class is integrated with the Nepomuk Query Service via the QueryServiceClient class. This allows to simply let the query service do the querying and receive result change updates – in other words use live searches:

Nepomuk::Query::QueryServiceClient client;
connect(&client, SIGNAL(newEntries(QList<Nepomuk::Query::Result>)),
            this, SLOT(slotNewEntries(QList<Nepomuk::Query::Result>)));
client.query(query);

And now just handle the incoming results.

Maybe even nicer is the integration with KIO, i.e. the possibility to list search results as a virtual folder:

KDirModel* model = getFancyDirModel();
Nepomuk::Query::Query query = buildFancyQuery();
KUrl searchUrl = query.toSearchUrl();
model->dirLister()->open( searchUrl );

As you can see it is really simple to list results via KIO (BTW: this is what Dolphin does).

For more examples check the Nepomuk Query API documentation.

Just another way of browsing your files

The Zeitgeist guys created a fuse file system called zeitgeistfs. It is basically a calendar containing the files accessed at that specific date. So at the Akonadi meeting last weekend, having two hours to kill, I thought that should be doable with KIO. So two hours later (most of that time was spent twiddling with UDS entries) the timeline:/ KIO slave was up and running:

The code that actually does something is minimal: a bit of UDS entry creation for dates and a simple SPARQL query to forward to the nepomuksearch KIO slave. Yes, it is as easy as that since we can simply set the UDS_URL property of an item to a nepomuksearch URL and KIO will take care of the rest. Smooth. Thanks a lot David Faure. Once again you paved the way.

OK then. Try it if you like. This is just another example of what can be done with Nepomuk (no real semantics here though). The code is in the playground as always and is based on the current kdebase trunk. So with KDE 4.3 this baby won’t work. And of course as this is based on file meta data, the Nepomuk Strigi integration has to be enabled.

SPARQL is weird…

It really is. The following query is the only way I found to exclude folders when looking for files:

select ?r where {
   ?r a nfo:FileDataObject .
   OPTIONAL { ?r2 a nfo:Folder . FILTER(?r = ?r2) . } .
   FILTER( !BOUND(?r2) ) .
}

Now to me this looks really weird. And maybe I am simply not seeing the wood for all the trees…

And Yet Another Post About Virtuoso

Today nearly all problems are solved. OpenLink provided a patch that makes inserting very large literals (more than 1 metabyte in size) lightning fast, even with a very low buffer count. Also I worked around the issue of URI encoding. Now the Soprano Virtuoso backend simply percent-encodes all non-unreserved characters and all reserved characters that are not used in their special meaning in URIs used in queries. Man, that is a mouth full. Well, it seems to work fine although I can always use more testing with weird file URLs (weird means containing weird characters like brackets and the likes). I also fixed some error handling bugs.

So what is left? Well, there are a few hacks in the Virtuoso backend which are rather ugly. One example is the detection of query result types. To determine if the result is boolean, bindings, or a graph it actually checks the name and number of result columns. Urgh! It would be nicer to check for the type of the result. Seems like graph results are BLOBs.

Anyway, enough for tonight. I am tired. Here is the patch to make Virtuoso not hang when Strigi adds nie:PlainTextContent literals of big files:

Index: sqlrcomp.c
===================================================================
RCS file: virtuoso-opensource/libsrc/Wi/sqlrcomp.c,v
retrieving revision 1.9
diff -u -r1.9 sqlrcomp.c
--- sqlrcomp.c  20 Aug 2009 17:47:22 -0000      1.9
+++ sqlrcomp.c  13 Oct 2009 16:11:49 -0000
@@ -65,7 +65,7 @@
 {
 va_list list;
 char temp[2000];
-  int ret;
+  int ret, rest_sz, copybytes;
 va_start (list, string);
 ret = vsnprintf (temp, sizeof (temp), string, list);
 #ifndef NDEBUG
@@ -75,11 +75,16 @@
 va_end (list);
 #ifndef NDEBUG
 if (*fill + strlen (temp) > len - 1)
-    GPF_T1 ("overflow in strncpy");
+    GPF_T1 ("overflow in memcpy");
 #endif
-  strncpy (&text[*fill], temp, len - *fill - 1);
+  rest_sz = (len - fill[0]);
+  if (ret >= rest_sz)
+    copybytes = ((rest_sz > 0) ? rest_sz : 0);
+  else
+    copybytes = ret+1;
+  memcpy (text+fill[0], temp, copybytes);
 text[len - 1] = 0;
-  *fill += (int) strlen (temp);
+  fill[0] += ret;
 }

A Bit Of Nepomuk Goodness On Your Developer Fingertips

And yet another technical blog entry. This time it concerns the latest improvements in sopranocmd (Soprano >= 2.3.61). For starters there is an improved NRLModel which provides automatic query prefix expansion. What does that mean? Well, it means that debugging Nepomuk data is simpler now as you can simply use all ontologies stored in the Nepomuk database without defining their prefixes. With sopranocmd (I am using nepomukcmd, a little alias I introduce in the Nepomuk Tips and Tricks) this feature is enabled via the –nrl parameter. Thus, querying all tags becomes:

nepomukcmd --nrl query "select ?r where { ?r a nao:Tag . }"

The second new thing is an improved import command. Again enabled with the –nrl parameter it creates a new named graph of type nrl:KnowledgeBase and puts all new statements (which are not in a graph yet) into it. As described in Nepomuk Data Layout it also adds a metadata graph and the creation date. It actually makes use of NRLModel::createGraph.

The reason I did this was to be able to migrate the tmo:Task instances I had created on the laptop to my desktop machine. Just as an example I will show the procedure here:

First I export all the tasks and their properties on the laptop:

nepomukcmd --nrl export "describe ?r where { ?r a tmo:Task . }" \
     /tmp/task-dump.n4

And then on the desktop I simply import them into Nepomuk:

nepomukcmd --nrl import /tmp/task-dump.n4

Update: You can also use query prefixes for statement listing now. Thus the following is now possible:

nepomukcmd --nrl list "" a tmo:Task

(Even the “a” keyword now maps to rdf:type.)