Nepomuk Tasks: Let The Virtuoso Inferencing Begin

Only four days ago I started the experiment to fund specific Nepomuk tasks through donations. As with last year’s fundraiser I was uncertain whether it was a good idea. That changed when, only a few hours later, two tasks had already reached their donation goal. Again it became obvious that the work done here is appreciated and that the “open” in Open Source is understood for what it actually is.

So despite my wife not being overly happy about it I used the weekend to work on one of the tasks: Virtuoso inferencing.

Inference?

As a quick reminder: the inferencer automatically infers information from the data in the database. While Virtuoso can handle pretty much any inference rule you throw at it, we stick to the basics for now: if resource R1 is of type B and B derives from A, then R1 is also of type A. And: if R1 has property P1 with value “foobar” and P1 is derived from P2, then R1 also has property P2 with value “foobar”.

Crappy Inference

This is already very useful and even mandatory in many cases. Until now we used what we called “Crappy Inferencer 1 & 2”. The Crappy Inferencer 1 was based on work done in the original Nepomuk project: it simply inserted triples for all sub-class and sub-property relations. That way we could simulate real inference by querying for something like

select * where {
  ?r ?p "foobar" . 
  ?p rdfs:subPropertyOf rdfs:label .
}

and catch all sub-properties of rdfs:label like nao:prefLabel or nie:title. While this works, it means bad performance, additional storage, and additional maintenance.

The Crappy Inferencer 2 was even worse. It inserted rdf:type triples for all super-classes. This means that it looked at every added and removed triple to check whether it was an rdf:type triple. If so, it would add or remove the appropriate rdf:type triples for the super-types. That way we could do fast type queries without relying on the Crappy Inferencer 1 and its rdfs:subClassOf triples. But this meant even more maintenance and even more storage space wasted.

Introducing: Virtuoso Inference

So now we simply rely on Virtuoso to do all that and it does such a wonderful job. Thanks to Virtuoso graph groups we can keep our clean ontology separation (each ontology has its own graph) and still stick to a very simple extension of the queries:

DEFINE input:inference <nepomuk:/ontographgroup>
select * where {
  ?r rdfs:label "foobar" .
}

Brilliant. Of course there are still situations in which you do not want to use the inferencer. Imagine for example the listing of resource properties in the UI: with inference enabled it would be cluttered with all the inferred super-types and duplicated property values.

We do not want that. Inference is intended for the machine, not for the human, at least not like this. So, since back in the day I did not think of adding query flags to Soprano, I simply introduced a new virtual query language: SparqlNoInference.
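For client code this boils down to choosing the query language when running a query through Soprano. A rough sketch of what that might look like (model is assumed to be a Soprano::Model pointer connected to the Nepomuk storage, and the query is just an example):

// Selecting the virtual query language disables inference for this one query
Soprano::QueryResultIterator it = model->executeQuery(
    QLatin1String("select ?p ?o where { <nepomuk:/res/some-resource> ?p ?o . }"),
    Soprano::Query::QueryLanguageUser,
    QLatin1String("SparqlNoInference"));
while (it.next()) {
    // plain, non-inferred properties only
    kDebug() << it["p"] << it["o"];
}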

Resource Visibility

While I was at it I also improved the resource visibility support by simplifying it. We do not need any additional processing anymore. This again means less work on startup and with every triple manipulation command. Again we save space and increase performance. But it also means that resource visibility filtering no longer works the way it used to. Nepoogle, for example, will need to be adjusted to the new way of filtering. Instead of

?r nao:userVisible 1 .

we now need

FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] }
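For context, a complete query using the new filter could look like this (simply the inference example from above with the visibility filter added):

DEFINE input:inference <nepomuk:/ontographgroup>
select ?r where {
  ?r rdfs:label "foobar" .
  FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] } .
}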

Testing

The implementation is done. All that remains are the tests. I am already running all the patches but I still need to adjust some unit tests and maybe write new ones.

You can also test it. The code changes are, as always, spread over Soprano, kdelibs and kde-runtime. Both kdelibs and kde-runtime now contain a branch “nepomuk/virtuosoInference”. For Soprano you need git master.

Look for regressions of any kind so we can merge this as soon as possible. The goal is KDE 4.9.

Akonadi, Nepomuk, and A Lot Of CPU

One bug has been driving people crazy. This is more than understandable seeing that the bug was endless high CPU usage by Virtuoso, the database used in Nepomuk. Kolab Systems, the Free Software groupware company behind Kolab and a driving force behind Akonadi, sponsored me to look into the issue.

Finding the issue turned out to be a bit harder than I thought, coming up with a fix even more so. In the process I ended up improving the Akonadi Nepomuk Email indexer/feeder in several places. This, however useful and worthwhile, turned out to be unrelated to the high CPU usage. Virtuoso was not to blame either. In the end the real issue was solved by a little SPARQL query optimization.

Application developers working with Akonadi and Nepomuk might want to keep that in mind: the way you build your queries can have a dramatic impact on the performance of the whole system. So this is also where optimizations are likely to have a lot of impact in case people want to help improve things further. Discussing query design with the Nepomuk team or on the Virtuoso mailing list can go a long way here.

So thanks to the support from Kolab Systems, Virtuoso is no longer chewing so much CPU, and Akonadi Email indexing will work a lot smoother with KDE 4.8.2.

A Little Bit Of Query Optimization

Every once in a while I add another piece of query optimization code to the Nepomuk Query API. This time it was a direct result of my earlier TV Show handling. I simply thought that a query like “downton season=2 episode=2” took too long to complete.

Now in order to understand this you need to know that there is a rather simple QueryParser class which converts a query string like the one above into a Nepomuk::Query::Query, which is simply a collection of Nepomuk::Query::Term instances. A Query instance is then converted into a SPARQL query which can be handled by Virtuoso. This SPARQL query already contains a set of optimizations, some specific to Virtuoso, some specific to Nepomuk. Of course there is always room for improvement.
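In code the parsing step looks roughly like this (a simplified sketch without error handling):

// turn the user's query string into a Nepomuk::Query::Query
Nepomuk::Query::QueryParser parser;
Nepomuk::Query::Query query = parser.parse("downton season=2 episode=2");
// the Query is then converted into the SPARQL string below and handed to Virtuoso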

So let us get back to our query “downton season=2 episode=2” and look at the resulting SPARQL query string (I simplified the query a bit for readability. The important parts are still there):

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  {
    ?r nmm:hasEpisode ?v2 .
    ?v2 ?v3 "2"^^xsd:int .
    ?v3 rdfs:subPropertyOf rdfs:label .
  } UNION {
    ?r nmm:episodeNumber "2"^^xsd:int .
  } .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Like the user query, the SPARQL query has three main parts: the graph pattern checking the nmm:season, the graph pattern checking the episode, and the graph pattern checking the full text search term “downton”. The latter we can safely ignore in this case. It is always a UNION so that full text searches also include relations to tags and the like.

The interesting bit is the second UNION. The query parser matched the term “episode” to the properties nmm:hasEpisode and nmm:episodeNumber. At first glance this is fine since both contain the term “episode”. However, the property nmm:season which is used in the first non-optional graph pattern has a domain of nmm:TVShow. nmm:hasEpisode on the other hand has a domain of nmm:TVSeries. That means that the first pattern in the UNION can never match in combination with the first graph pattern since the domains are different.

The obvious optimization is to remove the first part of the UNION which yields a much simpler and way faster query:

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  ?r nmm:episodeNumber "2"^^xsd:int .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Well, sadly this is not generally true since resources can be double/triple/whateveriple-typed, meaning that in theory an nmm:TVShow could also have the type nmm:TVSeries. In this case that is obviously not likely, but there are many cases in which it does apply. Thus, this optimization cannot be applied to all queries. I will, however, include it in the parser where it is very likely that the user does not take double-typing into account.

If you have good examples that show why this optimization should not be included in the query parser by default please tell me so I can re-consider.

Now, after having written this and proof-reading the SPARQL query, I realize that this particular query could have been optimized in a much simpler way: the value “2” is obviously an integer, thus it can never match a non-literal as required by the nmm:hasEpisode property…

A Little Drier But Not That Dry: Extracting Websites From Nepomuk Resources

After writing about my TV Show Namer I want to get out some more ideas and examples before I retire as a full-time KDE developer in a few weeks.

The original idea for what I am about to present came a long while ago when I remembered that Vishesh gave me a link on IRC but I could not remember when exactly. So I figured that it would be nice to extract web links from Nepomuk resources to be able to query and browse them.

As always, what I figured would be a quick thing led me to a few bugs which I needed to fix before moving on. So all in all it took much longer than I had hoped. Anyway, the result is another small application called nepomukwebsiteextractor. It is a small tool without a UI which extracts websites from the given resource or file. If called without arguments it will query for resources which do not have any related websites yet and extract websites from them. Since it tries to fetch a title for each website this is a very slow procedure.

As before the storing to Nepomuk is the easy part. Getting the information is way harder:

using namespace Nepomuk;
using namespace Nepomuk::Vocabulary;

// create the main Website resource
NFO::Website website(url);
website.addType(NFO::WebDataObject());
QString title = fetchHtmlPageTitle(url);
if(!title.isEmpty()) {
  website.setTitle(title);
}

// create the domain website resource
KUrl domainUrl = extractDomain(url);
NFO::Website domainWebPage(domainUrl);
domainWebPage.addType(NFO::WebDataObject());
domainWebPage.addPart(website.uri());
title = fetchHtmlPageTitle(domainUrl);
if(!title.isEmpty()) {
  domainWebPage.setTitle(title);
}

// relate the two via the nie:isPartOf relation
website.addProperty(NIE::isPartOf(), domainUrl);
domainWebPage.addProperty(NIE::hasPart(), website.uri());

// funnily enough the domain is a sub-resource of the website
// this is so removing the website will also remove the domain
// as it is the one which triggered the domain resource's creation
website.addSubResource(domainUrl);

// save it all to Nepomuk
Nepomuk::storeResources(SimpleResourceGraph() << website << domainWebPage);
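The helpers fetchHtmlPageTitle and extractDomain are part of the extractor itself, not of any Nepomuk API. As a rough idea, extractDomain could be as simple as the following sketch:

// Reduce a URL to its domain part, e.g. http://www.kde.org/some/page
// becomes http://www.kde.org/ (sketch of the helper used above)
KUrl extractDomain(const KUrl& url)
{
    KUrl domain;
    domain.setScheme(url.scheme());
    domain.setHost(url.host());
    domain.setPath(QLatin1String("/"));
    return domain;
}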

Once done you will have thousands of nfo:Website resources in your Nepomuk database, each of which is related to its respective domain via nie:isPartOf (I am not entirely sure if this is perfectly sound but it is convenient as far as graph traversal goes). We can of course query those resources with nepomukshell (this is trivial but allows me to pimp up this blog post with a screenshot):

And of course Dolphin shows the extracted links in its meta-data panel:

I am not entirely sure how to usefully show this information to the user yet but it is already quite nice to navigate the sub-graph which has been created here.

Of course we could query all the resources which mention a link with domain http://www.kde.org:

select ?r where {
  ?r nie:links ?w .
  ?w a nfo:Website .
  ?w nie:isPartOf ?p .
  ?p nie:url <http://www.kde.org> .
}

Or the Nepomuk API version of the same:

using namespace Nepomuk::Query;
using namespace Nepomuk::Vocabulary;

Query query =
  ComparisonTerm(NIE::links(),
    ResourceTypeTerm(NFO::Website()) &&
    ComparisonTerm(NIE::isPartOf(),
      ResourceTerm(QUrl("http://www.kde.org"))
    )
  );

It gets even more interesting when combined with the nfo:Websites created by KParts when downloading files.

Well, now I have provided screenshots, code examples, and a link to a repository – I think it is all there – have fun.

Update: In the spirit of promoting the previously mentioned ResourceWatcher here is how the website extractor would monitor for new stuff to be extracted:

Nepomuk::ResourceWatcher* watcher = new Nepomuk::ResourceWatcher(this);
watcher->addProperty(NIE::plainTextContent());
connect(watcher, 
        SIGNAL(propertyAdded(Nepomuk::Resource,
                             Nepomuk::Types::Property,
                             QVariant)),
        this,
        SLOT(slotPropertyAdded(Nepomuk::Resource,
                               Nepomuk::Types::Property,
                               QVariant)));
watcher->start();

[...]

void slotPropertyAdded(const Nepomuk::Resource& res,
                       const Nepomuk::Types::Property&,
                       const QVariant& value) {
  if(!hasOneOfThoseXmlOrRdfMimeTypes(res)) {
    const QString text = value.toString();
    extractWebsites(res, text);
  }
}

Finding Duplicate Images Made Easy

It is a typical problem: we downloaded images from a camera, maybe did not delete them from the camera right away, then downloaded the same images again the next time, maybe created an album by copying images into sub-folders (without Nepomuk, Digikam can only do so much ;), and so on. Essentially there are a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where { 
  ?f1 a nexif:Photo . 
  ?f2 a nexif:Photo . 
  ?f1 nfo:hasHash ?h . 
  ?f2 nfo:hasHash ?h . 
  ?f1 nie:url ?u1 . 
  ?f2 nie:url ?u2 . 
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all nexif:Photo resources which have the same hash value but are not the same resource. This can of course be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that the photos were actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, with the introduction of the Data Management Service in KDE 4.7 such hash resources are merged into one. Thus the slightly simpler query above. Still, to be sure to also properly handle pre-KDE-4.7 data the above addition might be prudent.
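For reference, the pre-4.7-safe variant is simply the query from above with these hash patterns swapped in:

select distinct ?u1 ?u2 where {
  ?f1 a nexif:Photo .
  ?f2 a nexif:Photo .
  ?f1 nfo:hasHash ?h1 .
  ?f2 nfo:hasHash ?h2 .
  ?h1 nfo:hashValue ?h .
  ?h2 nfo:hashValue ?h .
  ?f1 nie:url ?u1 .
  ?f2 nie:url ?u2 .
  filter(?f1!=?f2) .
}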

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)

A Word (or Two) on Removable Storage Media Handling in Nepomuk

While fixing existing Nepomuk bugs and trying to close them as they come in, I also look into other things. Last week it was the improved file indexer scheduling and file modification handling. This week it is about another improvement in the handling of queries which involve removable media. Ignacio Serantes had already found one bug in the URL encoding before. This time he wanted to search through all mounted removable storage media and realized that he could not. I just fixed that. In order to understand how, we need to go into detail about how Nepomuk handles removable media.

Removable Storage Media in Nepomuk

Files on removable storage media are a problem when it comes to meta data stored in Nepomuk. As long as the medium is mounted we can simply identify the files through their local file path. But as soon as it is unmounted the paths are no longer valid. To make things worse, we could mount the medium at another mount point the next time, or mount another medium (which obviously does not contain the files in question) at the same mount point. So we need a way around that problem. Ever since 4.7 Nepomuk has had a rather fancy way of doing that.

Internally Nepomuk uses a stack of Soprano::FilterModels which perform several operations on the data that passes through them. One of these models is the RemovableStorageModel. This model does one thing: it converts the local file URLs of files and folders on removable media into mount-path-independent URLs and vice versa. Currently it supports removable disks like USB keys or external hard disks (any storage that has a UUID), optical media, NFS and Samba mounts. The nice thing about it is that this conversion happens transparently to the client. Thus, a client simply uses the local file URLs according to the current mount path and does not care about anything else. It will always get the correct results.

To understand this better we should look at an example. Imagine we have a USB key inserted with UUID “xyz” which is mounted at /media/disk. Now if we add information about a file /media/disk/myfile.txt to Nepomuk the following happens: The RemovableStorageModel will convert the URL file:///media/disk/myfile.txt into filex://xyz/myfile.txt. This is a custom URL scheme which consists of the device UUID and the relative path. When querying the file the model does the conversion in the other direction. So far so simple.

Queries are where it gets a little more complicated. Imagine we want to query all files in a certain directory on the removable medium (ideally the SPARQL would be hidden by the Nepomuk query API). We would then perform a query like the following simplified one.

select ?r where {
  ?r nie:isPartOf ?p . 
  ?p nie:url <file:///media/disk/somefolder> . }

If we passed this query on to Virtuoso we would not get any results since there is no resource with nie:url <file:///media/disk/somefolder>. So the RemovableStorageModel steps in again and does some query tweaking (rather primitive tweaking, seeing that we do not have a SPARQL parser in Nepomuk). The query is converted into

select ?r where {
  ?r nie:isPartOf ?p .
  ?p nie:url <filex://xyz/somefolder> . }

And suddenly we get the expected results.

Of course this is still rather simple. It gets more complicated when SPARQL REGEX filters are involved. Imagine we wanted to look for all files in some sub-tree on a removable medium. We would then use a query along the lines of the following:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/disk/somefolder/')) . }

As before passing this query directly on to Virtuoso would not yield any results. The RemovableStorageModel needs to do its magic first:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^filex://xyz/somefolder/')) . }

This is what the model did before Ignacio wanted to query all his removable media mounted somewhere under /media at once. Obviously he did something like:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/')) . }

The result, however, was empty. This is simply because there was no exact match to any mount path of any of the removable media and the RemovableStorageModel did not replace anything. The solution was to include additional filters for all the candidates in addition to the already existing filter. We need to keep the existing filter in case there is anything else under /media which is not a removable medium and, thus, has normal local file:/ URLs.

If we imagine that we have an additional mounted removable medium with UUID “foobar” then the query would be converted into something like the following.

select ?r where {
  ?r nie:url ?url .
  FILTER((REGEX(STR(?url), '^file:///media/') ||
          REGEX(STR(?url), '^filex://xyz/') || 
          REGEX(STR(?url), '^filex://foobar/'))) . }

This way we get the expected results. (The additional brackets are necessary in case the filter already contains more than one term.)

Well, I personally think this is a very clean solution where clients only have to consider filex:/ and its friends nfs:/, smb:/, and optical:/ if the media are not mounted. One way of handling that I already drafted a while back. But that will be perfected another day. ;)

For now let me, as always, close with the hint that development like this is still running on your donations.


Nice Things To Do With Nepomuk – Part Two

Yesterday I presented how to extract web links from resources (mostly files) in Nepomuk and store them as proper resources themselves.

Let us now take a look at the data we created. For this we will fire up NepSak, aka Nepomukshell, and use a bit of SPARQL for testing and debugging purposes (remember: when implementing stuff, stick to the query API instead of writing your own SPARQL queries). We start by listing all the nfo:Website resources there are:
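A query along these lines does it (a trivial sketch):

select ?r where {
  ?r a nfo:Website .
}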

There are a lot of websites there. This is not very helpful yet. So let’s throw the magic creation time into the mix:
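Something like this, sorting by the nao:created date (again just a sketch):

select ?r ?created where {
  ?r a nfo:Website .
  ?r nao:created ?created .
} order by desc(?created)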

This is already better. We now see all web links sorted by creation date. But what about the files we extracted them from? Let’s modify the query one last time to see those, too:
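Roughly like this, pulling in the files via the nie:links relation used yesterday (again a sketch):

select ?file ?r ?created where {
  ?r a nfo:Website .
  ?r nao:created ?created .
  ?file nie:links ?r .
} order by desc(?created)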

Now we see all the data the code from yesterday created: the files we extracted the web links from and the websites themselves. This information could now be used in some fancy GUI. Since I have not created a fancy GUI yet I will instead show what already works out of the box. We open Dolphin and hover over one of the files we extracted a website from:

The links are already properly displayed and even clickable. (When we click a link, though, we see a possibility for improvement: it does not open the link but a query looking for other files linking to this website.)

Well, this is all I will present today as I was sidetracked by my system breaking down on me again.

Small Things are Happening…

I have to admit: I am a sucker for nice APIs. And, yes, I am sort of in love with some of my own creations. Well, at least until I find the flaws and cannot remove them due to binary compatibility issues (see Soprano). This may sound a bit egomaniacal but let’s face it: we almost never get credit for good API or good API documentation. So we need to congratulate ourselves.

My new pet is the Nepomuk Query API. As its name says it can be used to query Nepomuk resources, and it sets out to replace as many hard-coded SPARQL queries as possible. It started out rather simple: matching a set of resources with different types of terms. But then Dario Freddi and his relatively complex Telepathy queries came along. So the challenge began. I tried to bend the existing API as much as possible to fit in the features he requested. One thing led to another, I suddenly found myself in need of optional terms, and a few days later things were not as simple as they had started out anymore.

ComparisonTerm was already the most complex term in the query API. But that did not mean much: basically you could set a property and a sub-term, and that was it. On Monday it became a bit more beastly. Now you can invert it, force the name of the variable used, give it a sort weight, change the sort order, and even set an aggregate function. And all that on only one type of term. At least I implemented optional terms separately.

To explain what all this is good for I will try to illustrate it with a few examples:

Say you want to find all tags a specific file has (and ignore the fact that there is a Nepomuk::Resource::tags method). This was not possible before inverted comparison terms came along. Now all you have to do is:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTerm(myFile)
);
tagTerm.setInverted(true);
Nepomuk::Query::Query query(tagTerm);

What happens is that subject and object change places in the ComparisonTerm, thus, the resulting SPARQL query looks something like the following:

select ?r where { <myFile> nao:hasTag ?r . }

Simple but effective and confusing. It gets better. Previously we only had the clumsy Query::addRequestProperty to get additional bindings from the query. It is very restricted as it only allows querying direct relations from the results. With ComparisonTerm::setVariableName we now have the generic counterpart. By setting a variable name, this variable is included in the bindings of the final query and can be retrieved via Result::additionalBinding. This allows retrieving any detail from any part of the query. Again we use the most simple example to illustrate:

Nepomuk::Query::ComparisonTerm labelTerm(
    Soprano::Vocabulary::NAO::prefLabel(),
    Nepomuk::Query::Term() );
labelTerm.setVariableName( "label" );
Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    labelTerm );
tagTerm.setInverted(true);
Nepomuk::Query::Query query( tagTerm );

This query lists all tags including their labels. Again the resulting SPARQL query would look something like the following:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . }

And silently I used another little gimmick that I introduced: ComparisonTerm can now handle invalid properties and invalid sub-terms, which will simply act as wildcards (or be represented by a variable in SPARQL terms).

Now on to the next feature: sort weights. The idea is simple: you can sort the search results using any value matched by a ComparisonTerm. So let’s extend the above query by sorting the tags according to their labels.

labelTerm.setSortWeight( 1 );

And the resulting SPARQL query will reflect the sorting:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . } order by ?label

Here I used a sort weight of 1 since I only have the one term that includes sorting. But in theory you can include any number of ComparisonTerms in the sorting. The higher the weight the more important the sort order.

We are nearly done. Only one feature is left: aggregate functions. The Virtuoso SPARQL extensions (and SPARQL 1.1, too) support aggregate functions like count or max. These are now supported in ComparisonTerm. They are only useful in combination with a forced variable name, in which case they will be included in the additional bindings, or with a sort weight. If we go back to our tags we could for example count the number of tags each file has attached:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
Nepomuk::Query::Query query(tagTerm);

And the resulting SPARQL query will be along the lines of:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . }

Now we can of course sort by number of tags:

tagTerm.setSortWeight(1, Qt::DescendingOrder);

And we get:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } order by desc(?cnt)

And just because it is fun, let us make the tagging optional so we also get files with zero tags (be aware that normally one should have at least one non-optional term in the query to get useful results. In this case we are on the safe side since we are using a FileQuery):

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
tagTerm.setSortWeight(1, Qt::DescendingOrder);
Nepomuk::Query::FileQuery query(
    Nepomuk::Query::OptionalTerm::optionalizeTerm(tagTerm)
);

And with the SPARQL result of this beauty I finish my little session of self-congratulations:

select ?r count(?v1) as ?cnt where { ?r a nfo:FileDataObject . OPTIONAL { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } . } order by desc(?cnt)

Dangling Meta Data Graphs (Caution: Very Technical)

Nepomuk in KDE uses NRL – the Nepomuk Representation Language – especially the named graphs that it defines. Each triple that is stored in the Nepomuk database is stored in a named graph. We use this graph to attach meta data to the triples themselves. So far this is restricted to the creation date, but in the future there will be more, like the creator (for shared meta data) and the modification date (this is a bit tricky since technically triples are never modified, only added and deleted; but from a user’s point of view changing a rating means changing a triple).

What did I say? “We attach meta data to the triples”? Well, to be exact we attach it to the graph which contains the triples. And since everything is triples (or quadruples since there is the named graph) the meta data is, too. And like every triple these also need to be put in a dedicated named graph – the meta data graph. Thus, each triple is contained in one graph and each graph has exactly one meta data graph.
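To illustrate the structure, here is a hand-written sketch in TriG-like notation; the graph names and the rating triple are made up, only nrl:coreGraphMetadataFor and nao:created are the actual properties involved:

# the data graph contains the actual triple
graph <nepomuk:/ctx/graph1> {
  <nepomuk:/res/file1> nao:numericRating "8"^^xsd:int .
}
# its meta data graph describes the data graph itself
graph <nepomuk:/ctx/graph1-metadata> {
  <nepomuk:/ctx/graph1-metadata> nrl:coreGraphMetadataFor <nepomuk:/ctx/graph1> .
  <nepomuk:/ctx/graph1> nao:created "2012-05-20T10:23:42Z"^^xsd:dateTime .
}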

So far so good. But what happens if we delete all triples in a graph? Well, the graph ceases to exist, since graphs, like resources in an RDF database, only exist through the triples in which they occur.

And that is when it happens: dangling meta data graphs, i.e. meta data graphs that describe a graph which does no longer exist.

In theory Nepomuk could delete these automatically but I decided against that for performance reasons: it would have to check for dangling meta data graphs after each removal operation. So for now (until I come up with some database maintenance service) these graphs are just waste hanging around, not bothering anyone (they are small).

In case you want to check how many of them you have in your database use the following command (see the Nepomuk Tips and Tricks for nepomukcmd):

nepomukcmd query 'select count(?mg) where { ?mg nrl:coreGraphMetadataFor ?g . OPTIONAL { graph ?g { ?s ?p ?o . } . } . FILTER(!BOUND(?s)) . }'

And to simply delete them use a bit of shell magic (the --foo is important since it removes any human-readability gimmicks from the output):

for a in `nepomukcmd --foo query 'select ?mg where { ?mg nrl:coreGraphMetadataFor ?g . OPTIONAL { graph ?g { ?s ?p ?o . } . } . FILTER(!BOUND(?s)) . }'`; do nepomukcmd rmgraph "$a"; done

Convenient Querying in libnepomuk

It has been a while since I blogged and a lot has happened. Not only did we have the second Nepomuk workshop on the “Open Social Semantic Desktop”, which I have not reported on yet, there is also a lot of stuff going on for KDE 4.4. And since I like blogging about technical stuff so much I will start with that. So here goes:

Queries in KDE 4.4 will be so much simpler (for the developer that is) since we now have the Nepomuk Query API!

Do you remember the days when you tried to write your own SPARQL queries and your code looked like this:

Nepomuk::Tag myTag = getOurFancyTag();
QString query
   = QString("select distinct ?r where { ?r %1 %2 . }")
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
     .arg( Soprano::Node::resourceToN3(myTag.resourceUri()) );

Well, using the query API this looks a lot nicer:

Nepomuk::Query::ResourceTerm tagTerm(myTag);
Nepomuk::Query::ComparisonTerm term(Soprano::Vocabulary::NAO::hasTag(), tagTerm);
Nepomuk::Query::Query query(term);
QString queryString = query.toSparqlString();

As you can see you do not need to know any SPARQL anymore. But it gets better. The Query class is integrated with the Nepomuk Query Service via the QueryServiceClient class. This allows you to simply let the query service do the querying and to receive updates when results change – in other words, to use live searches:

Nepomuk::Query::QueryServiceClient client;
connect(&client, SIGNAL(newEntries(QList<Nepomuk::Query::Result>)),
            this, SLOT(slotNewEntries(QList<Nepomuk::Query::Result>)));
client.query(query);

And now just handle the incoming results.
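A minimal slot could look like the following sketch (the class name is made up; it just dumps the resource URIs, a real application would update its model or view instead):

void MyQueryHandler::slotNewEntries(const QList<Nepomuk::Query::Result>& results)
{
    // each Result wraps the matching resource plus any additional bindings
    foreach(const Nepomuk::Query::Result& result, results) {
        kDebug() << result.resource().resourceUri();
    }
}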

Maybe even nicer is the integration with KIO, i.e. the possibility to list search results as a virtual folder:

KDirModel* model = getFancyDirModel();
Nepomuk::Query::Query query = buildFancyQuery();
KUrl searchUrl = query.toSearchUrl();
model->dirLister()->openUrl( searchUrl );

As you can see it is really simple to list results via KIO (BTW: this is what Dolphin does).

For more examples check the Nepomuk Query API documentation.