A Little Bit Of Query Optimization

Every once in a while I add another piece of query optimization code to the Nepomuk Query API. This time it was a direct result of my earlier TV Show handling. I simply thought that a query like “downton season=2 episode=2” takes too long to complete.

Now in order to understand this you need to know that there is a rather simple QueryParser class which converts a query string like the above into a Nepomuk::Query::Query which is simply a collection of Nepomuk::Query::Term instances. A Query instance is then converted into a SPARQL query which can be handled by Virtuoso. This SPARQL query already contains a set of optimizations, some specific to Virtuoso, some specific to Nepomuk. Of course there is always room for improvement.

So let us get back to our query “downton season=2 episode=2” and look at the resulting SPARQL query string (I simplified the query a bit for readability. The important parts are still there):

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  {
    ?r nmm:hasEpisode ?v2 .
    ?v2 ?v3 "2"^^xsd:int .
    ?v3 rdfs:subPropertyOf rdfs:label .
  } UNION {
    ?r nmm:episodeNumber "2"^^xsd:int .
  } .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Like the user query, the SPARQL query has three main parts: the graph pattern checking the nmm:season, the graph pattern checking the episode, and the graph pattern checking the full text search term “downton”. The latter we can safely ignore in this case. It is always a UNION so that full text searches also include relations to tags and the like.

The interesting bit is the UNION that checks the episode. The query parser matched the term “episode” to the properties nmm:hasEpisode and nmm:episodeNumber. At first glance this is fine since both contain the term “episode”. However, the property nmm:season, which is used in the first non-optional graph pattern, has a domain of nmm:TVShow. nmm:hasEpisode, on the other hand, has a domain of nmm:TVSeries. That means that the first pattern in the UNION can never match in combination with the first graph pattern since the domains are different.

The obvious optimization is to remove the first part of the UNION which yields a much simpler and way faster query:

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  ?r nmm:episodeNumber "2"^^xsd:int .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Well, sadly this is not generically true since resources can be double/triple/whateveriple-typed, meaning that in theory an nmm:TVShow could also have type nmm:TVSeries. In this case it is obviously not likely but there are many cases in which it in fact does apply. Thus, this optimization cannot be applied to all queries. I will, however, include it in the parser where it is very likely that the user does not take double-typing into account.
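To make the idea concrete, here is a minimal sketch in plain C++ of pruning UNION branches by property domain. It is not the actual QueryParser code: the UnionBranch struct, the pruneByDomain function, and the allowMultiTyping switch are hypothetical names, and the sketch compares domains by simple equality, ignoring subclass relations.

```cpp
#include <string>
#include <vector>

// One branch of a UNION, reduced to the property it matches on and that
// property's rdfs:domain (hypothetical, simplified representation).
struct UnionBranch {
    std::string property;
    std::string domain;
};

// Keeps only the branches whose domain equals the domain already required by
// a mandatory pattern on the same resource. With allowMultiTyping set, no
// branch is dropped, because a resource may carry several types at once.
std::vector<UnionBranch> pruneByDomain(const std::vector<UnionBranch>& branches,
                                       const std::string& requiredDomain,
                                       bool allowMultiTyping)
{
    if (allowMultiTyping)
        return branches;

    std::vector<UnionBranch> kept;
    for (const UnionBranch& branch : branches) {
        if (branch.domain == requiredDomain)
            kept.push_back(branch);
    }
    return kept;
}
```

For the query above, the required domain would be nmm:TVShow (from nmm:season), so only the nmm:episodeNumber branch survives unless multi-typing is allowed.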

If you have good examples that show why this optimization should not be included in the query parser by default please tell me so I can re-consider.

Now, after having written this and proof-reading the SPARQL query, I realize that this particular query could have been optimized in a much simpler way: the value “2” is obviously an integer value, thus it can never match a non-literal as required for the nmm:hasEpisode property…
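That simpler check could look roughly like the following sketch. It is an illustration only, not code from the query parser; PropertyInfo and candidatesForLiteral are made-up names, and the range information is assumed to be available as a simple flag.

```cpp
#include <string>
#include <vector>

// A candidate property for a user term (hypothetical representation).
struct PropertyInfo {
    std::string uri;
    bool literalRange;  // true if rdfs:range is a literal type such as xsd:int
};

// A literal value such as "2"^^xsd:int can only be the object of a property
// with a literal range, so candidates with a resource range are dropped.
std::vector<PropertyInfo> candidatesForLiteral(const std::vector<PropertyInfo>& candidates)
{
    std::vector<PropertyInfo> kept;
    for (const PropertyInfo& p : candidates) {
        if (p.literalRange)
            kept.push_back(p);
    }
    return kept;
}
```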

Symbolic Links in Nepomuk – A Solution

Until now symbolic links were not handled in Nepomuk. Today I committed the last patch for the new symlink support in Nepomuk. The solution I chose is not the theoretically perfect one. That would have taken way too much effort while introducing all kinds of possible bugs, regressions, API incompatibilities, and so on. But the solution is nice and clean and simple.

Essentially each direct symlink is indexed as a separate file using the content of its target file. (This is necessary since a direct symlink might have a different file name than the target file.) The interesting part is the handling of indirect symlinks. Indirect symlinks are files in a folder which is a symlink to another folder. An example:

/home/trueg/
|-- subdir/
   |-- thefile.txt
|-- link/ -> subdir/
   |-- thefile.txt

Here I have a folder “subdir” which contains a file “thefile.txt”. The folder “link” is a direct symlink to “subdir” whereas “link/thefile.txt” is an indirect symlink to “subdir/thefile.txt”.

Indirect symlinks are simply stored as alternative URLs on the target file resources using the kext:altUrl property. (The property is not defined in NIE since it is not theoretically sound with respect to the design of NIE. It needs to be considered a beautiful hack.)

The only situation in which the alternative URLs are actually needed is when searching in a specific folder. Imagine searching in “/home/trueg/link” only. Since there are no nie:url values which match that prefix we need to search the kext:altUrls, too.

The result of all this is that nearly no additional space is required except for the kext:altUrl properties, files are not indexed more than once, and files in symlinked folders are found in addition to “normal” files.
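The following sketch models the idea in plain C++ with strings instead of real URLs. It is an assumption-laden illustration, not the Nepomuk implementation: FileResource, indirectLinkPath, and isInFolder are hypothetical names, and folders are expected to be passed with a trailing slash so that “/home/trueg/link/” does not match “/home/trueg/linkother/”.

```cpp
#include <string>
#include <vector>

// A file resource with its primary URL and the alternative URLs created
// for indirect symlinks (hypothetical, simplified model).
struct FileResource {
    std::string url;                   // models nie:url
    std::vector<std::string> altUrls;  // models kext:altUrl
};

static bool hasPrefix(const std::string& s, const std::string& prefix)
{
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Rewrites a path below targetDir into the matching path below linkDir.
// Precondition: targetPath starts with targetDir.
std::string indirectLinkPath(const std::string& targetPath,
                             const std::string& targetDir,
                             const std::string& linkDir)
{
    return linkDir + targetPath.substr(targetDir.size());
}

// A folder-restricted search has to look at the primary URL and at all
// alternative URLs.
bool isInFolder(const FileResource& file, const std::string& folder)
{
    if (hasPrefix(file.url, folder))
        return true;
    for (const std::string& alt : file.altUrls) {
        if (hasPrefix(alt, folder))
            return true;
    }
    return false;
}
```

With the example tree above, the resource for subdir/thefile.txt would carry one alternative URL below link/ and would be found when searching in either folder.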

In my tests everything seems to work nicely but I urge you to test the nepomuk/symlinkHandling branches in kdelibs and kde-runtime and report any problems back to me. The more testing I get the quicker I can merge both into KDE 4.8.


Nepomuk 2.0 and the Data Management Service

During the development of Nepomuk in the last years we have gathered a lot of knowledge and ideas about how to integrate semantics and more specifically RDF into the desktop. This knowledge was spread over several components like services, libraries, and applications. Some of it was only found as convention, some of it documented, some only in our brains. But I guess this can be seen as normal in a project which treads on new territory and tries to invent new ways of handling information on our computers.

Thus, in January of this year Vishesh and I finally set out to define a new API that would gather all this knowledge and all the ideas in one service. This service was intended to enforce our idea of how data in the semantic desktop should be formed while at the same time providing clean and powerful methods to manipulate this data. On April 8th, after 155 commits in a separate repository, I finally merged the new data management service into kde-runtime. And after hinting at it several times it is high time that I explain its ideas and the way we implemented it.

DMS and named graphs

Before I go into details about the DMS API I would like to explain the way information is stored in Nepomuk. More specifically, which ontology entities are used (until now by convention) to describe resources and meta-data.

The following example shows the information created by the file indexer. It nicely shows how we encode the information in Nepomuk. It is written in the TriG serialization which is also used by the Shared-Desktop-Ontologies.

<nepomuk:/ctx/graph1> {
  <nepomuk:/res/file1>
    nao:created "2011-05-20T11:23:45Z"^^xsd:dateTime ;
    nao:lastModified "2011-05-20T11:23:45Z"^^xsd:dateTime ;

    nie:contentSize "1286"^^xsd:int ;
    nie:isPartOf <nepomuk:/res/80b4187c-9c40-4e98-9322-9ebcc10bd0bd> ;
    nie:lastModified "2010-12-14T14:49:49Z"^^xsd:dateTime ;
    nie:mimeType "text/plain"^^xsd:string ;
    nie:plainTextContent "[...]"^^xsd:string ;
    nie:url <file:///home/nepomuk/helloworld.txt> ;
    nfo:characterCount "1249"^^xsd:int ;
    nfo:fileName "helloworld.txt"^^xsd:string ;
    nfo:lineCount "37"^^xsd:int ;
    nfo:wordCount "126"^^xsd:int ;
    a nfo:PlainTextDocument, nfo:FileDataObject .
}
<nepomuk:/ctx/metadatagraph1> {
  <nepomuk:/ctx/metadatagraph1>
    a nrl:GraphMetadata ;
    nrl:coreGraphMetadataFor <nepomuk:/ctx/graph1> .

  <nepomuk:/ctx/graph1>
    a nrl:DiscardableInstanceBase ;
    nao:created "2011-05-04T09:46:11.724Z"^^xsd:dateTime ;
    nao:maintainedBy <nepomuk:/res/someapp> .
}

There are essentially three types of information in this example. If we look at the first graph nepomuk:/ctx/graph1 we see that it contains information about one resource nepomuk:/res/file1. This information is split into two blocks for visualization purposes. The first block contains two properties nao:created and nao:lastModified. We call this information the resource-meta-data. It refers to the Nepomuk resource in the database and states when it was created and modified. This information can only be changed by the data management service. In contrast to that, the second block contains the data in the resource. In this case we see file meta-data which in Nepomuk terms is just data. Here it is important to see the difference between nao:lastModified and nie:lastModified. The latter refers to the file on disk itself while the former refers to the Nepomuk resource representing the file on disk.

The second graph nepomuk:/ctx/metadatagraph1 only contains what we call graph-meta-data. A named graph (or context in Soprano terms) is a resource like anything else in Nepomuk. Thus, it has a type. Graphs containing graph-meta-data are always of type nrl:GraphMetadata and belong to exactly one non-graph-meta-data graph. The meta-data graph contains the type, the creation date, and the creating application of the actual graph. In Nepomuk we make the distinction between four types of graphs:

  • nrl:InstanceBase graphs contain normal data that has been created by applications or the user.
  • nrl:DiscardableInstanceBase graphs contain normal data that has been extracted from somewhere and can easily be recreated. This includes file or email indexing. Data in this type of graph does not need to be included in backups.
  • nrl:GraphMetadata graphs contain meta-data about other graphs.
  • nrl:Ontology graphs contain class and property definitions. Nepomuk does import all installed ontologies in its database. This is required for query parsing, data inference, and so on.

Whenever the information is changed or new information is added new graphs are created accordingly. Thus, Nepomuk always knows when which information was created by which application.

The Data Management API

Now that the data format is defined we can continue with the DMS API. The DMS provides one central multi-threaded DBus API and a KDE client library for convenience (this library currently lives in kde-runtime until it is stabilized to be moved to kdelibs or a dedicated repository once kdelibs are split). The API consists of two parts: 1. the simple API which is useful for scripts and simple applications and 2. the advanced API which holds most of the power required for systems like the file indexer or data sharing and syncing. Here I will show the client API as it uses proper Qt/KDE types.

The Simple API

The simple API contains the methods that directly spring to mind when thinking about such a service:

KJob* addProperty(const QList<QUrl>& resources,
                  const QUrl& property,
                  const QVariantList& values,
                  const KComponentData& component = KGlobal::mainComponent());
KJob* setProperty(const QList<QUrl>& resources,
                  const QUrl& property,
                  const QVariantList& values,
                  const KComponentData& component = KGlobal::mainComponent());
KJob* removeProperty(const QList<QUrl>& resources,
                     const QUrl& property,
                     const QVariantList& values,
                     const KComponentData& component = KGlobal::mainComponent());
KJob* removeProperties(const QList<QUrl>& resources,
                       const QList<QUrl>& properties,
                       const KComponentData& component = KGlobal::mainComponent());
CreateResourceJob* createResource(const QList<QUrl>& types,
                                  const QString& label,
                                  const QString& description,
                                  const KComponentData& component = KGlobal::mainComponent());
KJob* removeResources(const QList<QUrl>& resources,
                      RemovalFlags flags = NoRemovalFlags,
                      const KComponentData& component = KGlobal::mainComponent());

It has methods to set, add, and remove properties and to create and remove resources. Each method has an additional parameter to set the application that performs the modification. By default the main component of a KDE app is used. The removeResources method has a flags parameter which so far only has one flag: RemoveSubResources. I will get into sub resources another time though.

This part of the API is pretty straight-forward and rather self-explanatory. Every method will check property domains and ranges, enforce cardinalities, and reject any data that is not well-formed.
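To illustrate what enforcing cardinalities can mean for adding versus setting values, here is a toy model in plain C++. It does not use the DMS API: ResourceData, addValues, and setValues are hypothetical names, and the sketch assumes that setting replaces all existing values while adding appends to them.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Values = std::vector<std::string>;

// The values of one resource plus the known maximum cardinalities
// (a property without an entry is treated as unlimited).
struct ResourceData {
    std::map<std::string, std::size_t> maxCardinality;
    std::map<std::string, Values> values;
};

static bool withinCardinality(const ResourceData& data,
                              const std::string& property,
                              std::size_t count)
{
    const auto it = data.maxCardinality.find(property);
    return it == data.maxCardinality.end() || count <= it->second;
}

// Appends values; rejects the call if the result would exceed the cardinality.
bool addValues(ResourceData& data, const std::string& property, const Values& newValues)
{
    Values merged;
    const auto it = data.values.find(property);
    if (it != data.values.end())
        merged = it->second;
    merged.insert(merged.end(), newValues.begin(), newValues.end());
    if (!withinCardinality(data, property, merged.size()))
        return false;
    data.values[property] = merged;
    return true;
}

// Replaces all values; rejects the call if the new values alone are too many.
bool setValues(ResourceData& data, const std::string& property, const Values& newValues)
{
    if (!withinCardinality(data, property, newValues.size()))
        return false;
    data.values[property] = newValues;
    return true;
}
```

In this model a second value for a property limited to one value is rejected by addValues but accepted by setValues, since the latter overwrites the first value.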

The Advanced API

The more interesting part is the advanced API.

KJob* removeDataByApplication(const QList<QUrl>& resources,
                              RemovalFlags flags = NoRemovalFlags,
                              const KComponentData& component = KGlobal::mainComponent());
KJob* removeDataByApplication(RemovalFlags flags = NoRemovalFlags,
                              const KComponentData& component = KGlobal::mainComponent());
KJob* mergeResources(const QUrl& resource1,
                     const QUrl& resource2,
                     const KComponentData& component = KGlobal::mainComponent());
KJob* storeResources(const SimpleResourceGraph& resources,
                     const QHash<QUrl, QVariant>& additionalMetadata = QHash<QUrl, QVariant>(),
                     const KComponentData& component = KGlobal::mainComponent());
KJob* importResources(const KUrl& url,
                      Soprano::RdfSerialization serialization,
                      const QString& userSerialization = QString(),
                      const QHash<QUrl, QVariant>& additionalMetadata = QHash<QUrl, QVariant>(),
                      const KComponentData& component = KGlobal::mainComponent());
DescribeResourcesJob* describeResources(const QList<QUrl>& resources,
                                        bool includeSubResources);

The first two methods allow an application to only remove the information it created itself. This is for example very important for tools like the file indexer that need to update data without touching anything added by the user like tags or comments or relations to other resources.
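As a rough model of that behaviour, the sketch below keeps the creating application next to each statement and removes only the statements of one application. It is plain C++ and not the DMS implementation; Statement and removeByApplication are hypothetical names, and the real service tracks the application per graph rather than per statement.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One statement together with the application that created it
// (simplified: Nepomuk records this on the graph, see nao:maintainedBy).
struct Statement {
    std::string resource;
    std::string property;
    std::string value;
    std::string app;
};

// Removes all statements about `resource` that were created by `app` and
// returns how many were removed. Data from other applications stays untouched.
std::size_t removeByApplication(std::vector<Statement>& store,
                                const std::string& resource,
                                const std::string& app)
{
    const std::size_t oldSize = store.size();
    store.erase(std::remove_if(store.begin(), store.end(),
                               [&](const Statement& s) {
                                   return s.resource == resource && s.app == app;
                               }),
                store.end());
    return oldSize - store.size();
}
```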

The method mergeResources allows merging two resources. The result is that the properties and relations of the second resource will be moved to the first one, after which the second resource is deleted.
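A minimal in-memory sketch of such a merge might look as follows. It is plain C++ with hypothetical names (ResourceMap, mergeInto) and rests on one assumption not spelled out above: relations from other resources that point at the second resource are redirected to the first one.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// resource URI -> list of (property, value) pairs
using Properties = std::vector<std::pair<std::string, std::string>>;
using ResourceMap = std::map<std::string, Properties>;

// Moves everything from `second` to `first`, then deletes `second`.
void mergeInto(ResourceMap& store, const std::string& first, const std::string& second)
{
    if (first == second)
        return;
    const auto src = store.find(second);
    if (src == store.end())
        return;

    // Inserting into a std::map does not invalidate other iterators,
    // so `src` stays valid even if `first` is created here.
    Properties& dst = store[first];
    dst.insert(dst.end(), src->second.begin(), src->second.end());

    // Redirect relations that pointed at the second resource.
    for (auto& entry : store) {
        for (auto& propertyValue : entry.second) {
            if (propertyValue.second == second)
                propertyValue.second = first;
        }
    }
    store.erase(second);
}
```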

The most powerful method is without a doubt storeResources. It allows storing entire sets of resources in Nepomuk, letting it sync them with existing ones automatically. The SimpleResourceGraph which is used as input is basically just a set of resources which in turn consist of a set of properties and an optional URI. The DMS will look for already existing matching resources before storing them in the database. This also means that simple resources like emails and contacts are merged automatically. Clients like Akonadi do not need to perform their own resource resolution anymore. Another variant of the same method is importResources which is probably most useful for scripts as it allows reading the resources from a file rather than a C++ struct.
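How the DMS decides that two resources match is not described here, so the following is only a sketch of the general store-or-merge idea under a strong assumption: a single identifying property is enough to recognize an existing resource. StoredResource, storeOrMerge, and the generated URI scheme are all hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

using PropertyMap = std::map<std::string, std::string>;

struct StoredResource {
    std::string uri;
    PropertyMap properties;
};

// Looks for an existing resource with the same value for identifyingProperty.
// If one is found, the incoming properties are merged into it; otherwise a
// new resource is created. Returns the URI of the resource holding the data.
std::string storeOrMerge(std::vector<StoredResource>& db,
                         const PropertyMap& incoming,
                         const std::string& identifyingProperty)
{
    const auto key = incoming.find(identifyingProperty);
    if (key != incoming.end()) {
        for (StoredResource& existing : db) {
            const auto it = existing.properties.find(identifyingProperty);
            if (it != existing.properties.end() && it->second == key->second) {
                for (const auto& p : incoming)
                    existing.properties[p.first] = p.second;
                return existing.uri;
            }
        }
    }

    StoredResource created;
    created.uri = "nepomuk:/res/" + std::to_string(db.size() + 1);  // made-up URI scheme
    created.properties = incoming;
    db.push_back(created);
    return created.uri;
}
```

Storing the same contact twice, identified by its email address for example, would then yield one resource carrying the union of both property sets.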

Last but not least DMS has one single read-only method: describeResources. It returns all relevant information about the resources provided. This method will be used for meta-data sharing and syncing. While currently it only allows filtering sub-resources, it will be extended to also allow filtering by application and permissions.

Well, that is it for the DMS for now. The plan is to port all existing tools and apps to this new API. But do not fear. In most cases this does not mean any work for you as the existing Nepomuk API can be used as always. It will then internally perform calls to DMS. The plan is to make the old Soprano-based interface read-only by KDE 4.8.

Nice Things To Do With Nepomuk – Part One

The other day I needed to find a website. The only thing I could remember was that Vishesh gave me the link in IRC a few days back. So I had to grep through thousands of lines of IRC log which, quite frankly, sucks. Nepomuk should handle this. So what do we have to do to achieve that? Three things of which I will present the second thing first:

Extract web links from text in Nepomuk.

Why that? Well, to properly handle web links they need to be represented as Nepomuk resources and not just be some plain text excerpt in some text literal. Only then can we relate them to things, search them by type, and order them by access count, times, or whatever.

Let’s go then. First we query for all resources that might mention a web link in their text content (we restrict ourselves to nie:plainTextContent since that covers all files, emails, and so on):

ComparisonTerm linkTerm(NIE::plainTextContent(), 
    LiteralTerm(QLatin1String("http")));

We look for all resources that contain the string ‘http’ in their plain text content. We then force a variable name for the matched property to be able to access it in the results:

linkTerm.setVariableName(QLatin1String("text"));

We additionally exclude HTML and SVG files to avoid having too many useless links:

Term htmlExcludeTerm = !ComparisonTerm(NIE::mimeType(),
    LiteralTerm(QLatin1String("text/html")), ComparisonTerm::Equal);
Term svgExcludeTerm = !ComparisonTerm(NIE::mimeType(),
    LiteralTerm(QLatin1String("image/svg+xml")), ComparisonTerm::Equal);
Query query(linkTerm && htmlExcludeTerm && svgExcludeTerm);

Finally we request that Nepomuk returns two properties. We will see later on why we need those:

query.addRequestProperty(Query::RequestProperty(NIE::lastModified()));
query.addRequestProperty(Query::RequestProperty(NAO::created()));

And now all we have to do is to run this query via QueryServiceClient and connect to its newEntries signal to handle each result. In that slot we iterate over all new results and see if there are really useful links in there. For that we need a little QRegExp magic which is fairly unrelated to Nepomuk but interesting nonetheless:

QRegExp rx(QLatin1String("\\b(https?://[\\-a-z0-9+&@#/%?=~_\\|!:,.;]*[\\-a-z0-9+&@#/%=~_\\|])"));
rx.setCaseSensitivity(Qt::CaseInsensitive);

We will use this regular expression without comment and get back to our result. First we create a list to remember our website resources (we only do this to show how Nepomuk can handle lists later on):

QList<Nepomuk::Resource> websites;

We then iterate over all matches of the regular expression in the text:

const QString text = result.additionalBinding(QLatin1String("text")).toString();
int i = -1;
do {
    // continue searching right after the start of the previous match
    if((i = rx.indexIn(text, i+1)) >= 0) {
        const KUrl url = rx.cap(1);
        Nepomuk::Resource website(url);
        website.addType(NFO::Website());
        websites << website;
    }
} while(i >= 0);
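For readers without a Qt setup at hand, the same extraction can be tried with std::regex from the standard library. This is a swapped-in technique, not the code used above: extractLinks is a hypothetical helper, and the pattern is the QRegExp one rewritten for ECMAScript syntax (the hyphen moved to the front of each character class, the escaped pipe unescaped).

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns all http/https links found in the text. The last character of a
// link must not be punctuation such as '.' or ',' so that a trailing full
// stop of the surrounding sentence is not swallowed.
std::vector<std::string> extractLinks(const std::string& text)
{
    static const std::regex rx(
        R"(\b(https?://[-a-z0-9+&@#/%?=~_|!:,.;]*[-a-z0-9+&@#/%=~_|]))",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it)
        links.push_back((*it)[1].str());
    return links;
}
```

The two character classes differ on purpose: the first allows sentence punctuation inside a link, the second forbids it as the final character.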


Finally we actually relate the newly created website resources to the original resource using nie:links which is exactly the property we need:

result.resource().addProperty(NIE::links(), websites);

This could already be it. But there was one minor detail which we did not handle yet: the request properties we added to the query. The issue is rather simple: We create these website resources at a time that differs from the time we actually encountered them. Thus, to be able to sort web sites according to the time we used them last we need to change the creation date of the resources. For web links that were found in file contents this is the mtime (the best date we have). For anything else we use the creation time of the resource (the perfect fit here would be the creation time of the property which contains the link but that is for another day):

QDateTime creationDate;
if(result[NIE::lastModified()].isLiteral())
    creationDate = result[NIE::lastModified()].literal().toDateTime();
else if(result[NAO::created()].isLiteral())
    creationDate = result[NAO::created()].literal().toDateTime();
if(creationDate.isValid())
    website.setProperty(NAO::created(), creationDate);

Well, that’s it for today. Next time: great, now we have all these web sites but what do we do with them?

Call for Participation (Web and Scripting-Experts Wanted)

There is always work to do. This is true for every project. It is even more true for open-source projects. It is the truth in itself when it comes to the semantic desktop and Nepomuk. Getting people to help was never a strong point of mine. I think that is partly due to the fuzzy task descriptions. Well, today I try once again but with a slightly different scope. It is not about KDE or Nepomuk coding, this is about the work that needs to be done for the maintenance of the Nepomuk ontologies.

The ontologies have an eventful history. They started out as part of the Nepomuk research project. When that was over they lived on in the kdebase package. Then the OSCAF foundation was created with the goal of maintaining the ontologies. That did not really work out. Thus, we created the oscaf project on SourceForge, trying to do ontology development the open-source way. This sort of worked but communication with other projects was troublesome (the Tracker project still maintains its own fork of the ontologies). With the oscaf project the shared-desktop-ontologies project was born. Thus, we had a package named shared-desktop-ontologies in the oscaf project. Then we created the shared-desktop-ontologies project on freedesktop.org hoping that this in combination with a move to git would bring the Tracker guys back to the main development – at least in the same repository. Of course that did not happen either. So now we have the oscaf project on SourceForge, the shared-desktop-ontologies project on freedesktop.org, and to top it all off we have semanticdesktop.org/ontologies, which is used to host the ontologies semantic-web-style.

So much for the mess. If you are still reading that means that I did not scare you away and you might be a candidate to help us out of that mess.

This is what needs to be done – at least that is my current idea, if you bring better ideas – great:

  1. Create a simple website for the shared-desktop-ontologies project on freedesktop.org including links to semanticdesktop.org and the sdo package releases.
  2. Migrate the package releases of sdo from SourceForge to freedesktop.org. I suppose they can be put in some ftp folder and be linked in some download section on the new website.
  3. Set in place scripts that automatically update the ontology pages on semanticdesktop.org like the NIE page. This involves:
    1. Convert the existing HTML headers that we have for ontologies like NAO or NIE into docbook (html2docbook might help with the first conversion step)
    2. Write a script that parses the ontologies and creates docbook code with links to super and sub-properties/classes including links between the ontologies. The result should be something like the existing (but outdated) HTML pages.
    3. Write a script that converts the docbook to HTML and puts it onto semanticdesktop.org.
  4. If possible, somehow integrate the l10n script that Sebastien Renard wrote to allow translation of labels and comments (Sebastien or I can provide the script).

There you have it. Not a single line of C++ required and not really any ontology or RDF knowledge necessary. It would be grand to find someone willing to invest some time and effort into this allowing us to finally have up-to-date ontologies on semanticdesktop.org and a clean shared-desktop-ontologies portal.

Thanks for reading.

A Million Ways To Do It Wrong

It is a sad truth: when it comes to creating data for the Nepomuk semantic desktop there are a million ways to do it wrong and basically only one way to get it right. Typically people will choose from the first set of ways. While that is of course bad they are not to blame. Who wants to read page after page of documentation and reference guide? Who wants to dive into the depths of RDF and all that ontology stuff when they just need to store a note? (Yes, this blog was inspired by a real problem.) Nobody – that’s who! Thus, the Nepomuk API should do most of the work. Sadly it does not. It basically allows you to do everything. Resource::setProperty will happily use classes or even invalid URLs as properties without giving any feedback to the developer. Why is that? Well, I suppose there are at least three reasons: 1. Back in the day I figured people would almost always use the resource-generator to create their own convenience classes which handle the types and properties properly, 2. The Resource class is probably the oldest part of the whole Nepomuk stack, and 3. basic lack of time, drive and development power.

So what can we do about this situation? Vishesh and I have been discussing the idea of a central DBus API for Nepomuk data management a million times (as you can see today “a million” is my go-to expression when I want to say “a lot”). So far, however, we could not come up with a good API that solves all problems, is future-proof (to a certain extent), and performs well. That did not change. I still do not know the solution. But I have some ideas as to what the API should do for the user in terms of data integrity.

  1. Ensure that only valid existing properties are used and provide a good error message in case a class or an invalid URL or something non-existing is used instead. This would also mean that one could only use ontologies that have been imported into Nepomuk. But since the ontology loader already supports fetching ontologies from the internet this should not be a big problem.
  2. Ensure that the ranges of the properties are honoured. This is pretty straight-forward for literal ranges. In that case we could also do some fancy auto-conversion to simplify the usage but in essence it is easy. The case of non-literal ranges is a bit more tricky. Do we want to force proper types or do we assume that the object resource has the required type? I suppose flags would be of use:
    • ClosedWorld – It is required that the object resource has the type of the range. If it has not the call fails.
    • OpenWorld – The object resource will simply get the range type. This is not a problem since resources can be of several types.

    This would also mean that each property needs to have a properly defined range. AFAIK this is currently not the case for all NIE ontologies. I think it is time for ontology unit tests!

  3. Automatically handle pimo:Things to a certain extent: Here I could imagine that trying to add a PIMO property on a resource would automatically add it to the related pimo:Thing instead.
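The two range-checking flags from the second point could behave roughly like this. The sketch is plain C++ and purely illustrative: RangeMode, TypeSet, and checkRange are hypothetical names, no such API exists yet, and subclass inference is left out (a type only matches the range if it is exactly the range type).

```cpp
#include <set>
#include <string>

enum class RangeMode { ClosedWorld, OpenWorld };

using TypeSet = std::set<std::string>;

// Decides whether a resource with the given types may be used as the object
// of a property with the given range. In OpenWorld mode a missing range type
// is simply added to the object; in ClosedWorld mode the call fails.
bool checkRange(TypeSet& objectTypes, const std::string& range, RangeMode mode)
{
    if (objectTypes.count(range) > 0)
        return true;
    if (mode == RangeMode::ClosedWorld)
        return false;
    objectTypes.insert(range);
    return true;
}
```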

Moving this from the client library into a service would have other benefits, too.

  • The service could be used from other languages than C++ or even from applications not using KDE.
  • The service could perform optimizations when it comes to storing triples, updating resources, caching, you name it.
  • The service could provide change notifications which are much more useful than the Soprano::Model signals which are pretty useless.
  • The service could perform any number of integrity tests before executing the actual commands on the database, thus improving the quality of the data in Nepomuk altogether.

This blog entry is not about presenting the one solution to solve all our problems. It is merely a brain-dump, trying to share some of the random thoughts that go through my head when taking a walk in the woods. Nonetheless this is an issue that needs tackling at one point or another. In any case my ideas are saved for the ages. :)

Nepomuk Data Layout

I am happy to announce that I just finished writing an article about the data layout in Nepomuk. I think it is another must-read if you want to work with Nepomuk data, either querying it or creating data yourself. The article is another one in the series of techbase tutorials about Nepomuk. Well, this was short and fairly dry…


What Nepomuk Can do and How You Should Use it (as a Developer)

Nepomuk has been around for quite a while but the functionality exposed in KDE 4.3 is still not that impressive. This does not mean that there does not exist cool stuff. It only means that there is not enough developer power to get it all stable and integrated perfectly. Let me give you an overview of what already exists in playground and how it can be used (and how you should use it).

The Basics

For starters there is the Nepomuk API in kdelibs which you should get familiar with. Most importantly (we will use it quite a lot later on) there is Nepomuk::Resource which gives access to arbitrary resources in Nepomuk.

Nepomuk::Resource file( myFilePath );
file.addTag( Nepomuk::Tag( "Fancy stuff" ) );
QString desc = file.description();
QList<Nepomuk::Tag> allTags = Nepomuk::Tag::allTags();

Resource allows simple manipulation of data in Nepomuk. Using some fancy cmake magic through the new NepomukAddOntologyClasses macro in kdelibs, data manipulation gets even simpler. The second basic thing you should get familiar with is Soprano and SPARQL. As a quickstart the following code shows how I typically create queries using Soprano:

using namespace Soprano;

Model* model = Nepomuk::ResourceManager::instance()->mainModel();
QString query = QString( "prefix nao:%1 "
                         "select ?r where { "
                         "%2 nao:hasTag ?t . "
                         "?r nao:hasTag ?t . }" )
        .arg(Node::resourceToN3(Vocabulary::NAO::naoNamespace()))
        .arg(Node::resourceToN3(file.resourceUri()));
QueryResultIterator it
        = model->executeQuery( query, Query::QueryLanguageSparql );

As you can see there is always a lot of QString::arg involved to prevent hard-coding of URIs (again Soprano provides some cmake magic for generating Vocabulary namespaces).

These are the basics. Without these basics you cannot use Nepomuk.

Debugging Nepomuk Data

Now before we dive into the unstable, experimental, and really cool stuff let me mention sopranocmd.

sopranocmd is a command line tool that comes with Soprano and allows you to perform virtually any operation possible on the Nepomuk RDF database. It has an exhaustive help output and you should use it to debug your data, test your queries and the like (if anyone is interested in creating a graphical version, please step up).

The Nepomuk database (hosting only a single Soprano model called “main”) can be accessed through D-Bus as follows:

sopranocmd --dbus org.kde.NepomukStorage --model main \
      query "select ?r where { ?r ?p ?o . }"

The Good Stuff

There is quite a lot of experimental stuff in the playground but I want to focus on the annotation framework and Scribo.

The central idea of the annotation framework is the annotation suggestion which is encapsulated in the Annotation class (Hint: run “make apidox” in the annotationplugin folder). Instead of the user manually annotating resources (adding tags or relating things to other things) the system proposes annotations which the user then simply acknowledges or discards. These Annotation instances are normally created by AnnotationPlugin instances (although it is perfectly possible to create them some other way) which are triggered through an AnnotationRequest.

Before I continue a short piece of code for the impatient:

Resource res = getResource();

AnnotationPluginWrapper* wrapper = new AnnotationPluginWrapper();
wrapper->setPlugins( AnnotationPluginFactory::instance()
   ->getPluginsSupportingAnnotationForResource( res.resourceUri() ) );
connect( wrapper, SIGNAL(newAnnotation(Nepomuk::Annotation*)),
         this, SLOT(addNewAnnotation(Nepomuk::Annotation*)) );
connect( wrapper, SIGNAL(finished()),
         this, SLOT(slotFinished()) );

AnnotationRequest req;
req.setResource( res );
req.setFilter( filter );
wrapper->getPossibleAnnotations( req );

The AnnotationPluginWrapper is just a convenience class which prevents us from connecting to each plugin separately. It reproduces the same signals the plugins emit.

The interesting part is the AnnotationRequest. The framework is still under development (which also means that your ideas, patches, and even refactoring actions are very welcome), and at the moment the request has three parameters, all of which are optional:

  1. A resource – The resource for which the annotation should be created. This parameter is a bit tricky: the Annotation::create method can create an annotation on an arbitrary resource, but in some cases it makes perfect sense to create annotation suggestions for one specific resource only.
  2. A filter string – A filter is supposed to be a short string entered by the user which triggers an auto-completion via annotations. Plugins should also take the resource into account if it is set.
  3. A text – An arbitrarily long text which is to be analyzed by plugins. Plugins would typically extract keywords or concepts from it. Plugins should also take resource and filter into account if possible. This is where the Scribo system comes in (more later).

Plugins that I already created include very simple ones like the tag plugin which matches the filter to existing tag names and also excludes tags already set on the resource. Way more interesting are other plugins like the pimotype plugin which matches the filter to pimo types and proposes to use that type, or the pimo relation plugin which lets you create relations via a very simple syntax: “author:trueg“. The latter will match author to existing properties and trueg to a value based on the property range. The geonames annotation plugin goes one step further: it matches the filter or the resource label to cities or countries using the geonames web service. It will then propose to set a location or (in case the resource label was matched) to convert the resource into a city or country linking to the geonames resource.
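To make the “author:trueg“ syntax a bit more tangible, here is a minimal sketch of the very first step such a plugin has to take: splitting the filter into a property part and a value part. This is not the code of the pimo relation plugin. It uses plain standard C++ instead of the Qt and Nepomuk types, and the names RelationFilter and parseRelationFilter are made up for illustration. Matching the property part against existing properties and resolving the value against the property range would come afterwards.

```cpp
#include <optional>
#include <string>

// Hypothetical helper type: the two halves of a filter such as "author:trueg".
struct RelationFilter {
    std::string property; // text to be matched against existing property names
    std::string value;    // text to be matched against values in the property range
};

// Split the filter at the first ':'. Returns nothing if the filter does not
// look like "property:value" (no colon, or one side is empty).
std::optional<RelationFilter> parseRelationFilter(const std::string& filter)
{
    const std::string::size_type pos = filter.find(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= filter.size()) {
        return std::nullopt;
    }
    return RelationFilter{filter.substr(0, pos), filter.substr(pos + 1)};
}
```

A plain filter like “Paris“ yields no result here, so in this sketch a relation plugin would simply stay silent and leave the filter to the other plugins.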

A picture says more than a thousand words. Thus, here goes:

[Screenshot: annotations-english]

What do we see here? The user entered the text Paris in the AnnotationWidget (a class available in the framework) and the framework then created a set of suggested annotations. The most likely one is Paris, the city in France, as suggested by the geonames plugin. The latter also proposes a few not so likely places. The pimotype plugin proposes to create a new type named Paris and the tag plugin proposes to create a new tag named Paris. Here I see room for improvement: if we can relate to the city Paris there is no need for the tag. Thus, some more sophisticated rating and comparison may be in order.
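One possible shape of such a rating and comparison step is sketched below. It is only an illustration of the idea and not part of the framework: the Suggestion struct, its relevance field, and the rule itself (drop a tag suggestion if the geonames plugin proposed something with the same label, then sort by relevance) are assumptions, written in plain standard C++.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for an annotation suggestion.
struct Suggestion {
    std::string label;   // what the user sees, e.g. "Paris"
    std::string plugin;  // which plugin created it, e.g. "geonames" or "tag"
    double relevance;    // assumed score, higher means more likely
};

// Drop tag suggestions that are made redundant by a geonames suggestion
// with the same label, then order the rest by descending relevance.
std::vector<Suggestion> rankSuggestions(const std::vector<Suggestion>& all)
{
    std::vector<Suggestion> kept;
    for (const Suggestion& s : all) {
        const bool redundantTag = s.plugin == "tag" &&
            std::any_of(all.begin(), all.end(), [&s](const Suggestion& other) {
                return other.plugin == "geonames" && other.label == s.label;
            });
        if (!redundantTag) {
            kept.push_back(s);
        }
    }
    // stable_sort keeps the original order of equally relevant suggestions
    std::stable_sort(kept.begin(), kept.end(),
                     [](const Suggestion& a, const Suggestion& b) {
                         return a.relevance > b.relevance;
                     });
    return kept;
}
```

A real implementation would probably compare the resources the suggestions point to rather than their labels, but the label comparison is enough to show the intent.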

Now let us bring Scribo into play. Scribo is another framework in the playground which provides an API for text analysis and keyword extraction. It is tied into the annotation framework through a dedicated plugin which uses the TextAnnotation class to create annotations on specific text positions. The TextAnnotation class is supposed to be used to annotate text documents. It will create a new nfo:TextDocument and make it a nie:isPartOf the main document. Then the new resource is annotated according to the implementation.

The Scribo framework will extract keywords and entities from the text (specified via the AnnotationRequest text field) via plugins which will then be used to create annotation suggestions. There currently exist three plugins for Scribo: the datetime plugin extracts dates and times, the pimo plugin matches words in the text to things in the Nepomuk database, and the OpenCalais plugin will use the OpenCalais webservice to extract entities from the text.
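To give an idea of what such an extraction plugin does, here is a very reduced sketch in plain standard C++. It is not the Scribo datetime plugin: the names Occurrence and extractIsoDates are hypothetical, and it only recognizes dates written as YYYY-MM-DD. What it shows is the basic contract: take a text, return the matches together with their text positions.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: one match and where it was found in the text.
struct Occurrence {
    std::string text;
    std::size_t position; // offset of the first character
    std::size_t length;   // number of characters
};

// Find all substrings that look like an ISO date (YYYY-MM-DD).
// The digits are not validated, so "2009-99-99" would match as well.
std::vector<Occurrence> extractIsoDates(const std::string& text)
{
    static const std::regex isoDate(R"(\b\d{4}-\d{2}-\d{2}\b)");
    std::vector<Occurrence> found;
    for (std::sregex_iterator it(text.begin(), text.end(), isoDate), end;
         it != end; ++it) {
        found.push_back({it->str(),
                         static_cast<std::size_t>(it->position()),
                         static_cast<std::size_t>(it->length())});
    }
    return found;
}
```

The position and length are what allows an annotation to be attached to a specific part of the text instead of the whole document.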

You can try the Scribo framework by using the scriboshell which can be found in the playground, too:

[Screenshot: scriboshell3]

Paste the text to analyze in the left view and press the “Start” button. The right panel will then show all found entities and keywords including the text position and relevance.

The other possibility is to directly use the resourceeditor which is part of the annotation framework and bundles all GUI elements the latter has to offer in one widget. Call it on a text file and you will get a window similar to the following:

[Screenshot: resourceeditor]

At the top you have the typical things: editable label and description, the rating, and the tags. Below that you have the existing properties and annotations. In the picture these are only properties extracted by Strigi. Then comes the interesting part: the suggestions. Here you can see three different Scribo plugins in action. First the pimo plugin matched the word “Brein” to an event I already had in my Nepomuk database. Then there is the OpenCalais plugin which extracted the “Commission of European Communities” (so far the plugin ignores the additional semantic information provided by OpenCalais) and proposes to tag the text with it.

The last suggested annotation that we can see is “Create Event“. This is a very interesting hack I did. The Scribo plugin detected the mention of a project, a date, and persons and thus proposes to create an event which has the project as its topic and takes place at the extracted time. Since it is a hack created specifically for a demo its results will not be very great in many situations. But it shows the direction which I would like to take.
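The rule behind “Create Event“ can be sketched roughly like this. Again this is plain standard C++ with made-up names (EntityKind, Entity, EventSuggestion, suggestEvent) and not the code of the demo plugin; in particular, how the persons end up in the created event is not shown here.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical classification of what the text analysis found.
enum class EntityKind { Project, Date, Person, Other };

struct Entity {
    EntityKind kind;
    std::string text;
};

// Hypothetical description of the event that would be proposed.
struct EventSuggestion {
    std::string topic;                // the project mentioned in the text
    std::string time;                 // the extracted date
    std::vector<std::string> persons; // the persons mentioned in the text
};

// Propose an event only if a project, a date, and at least one person
// were found. The first project and the first date win.
std::optional<EventSuggestion> suggestEvent(const std::vector<Entity>& entities)
{
    EventSuggestion ev;
    for (const Entity& e : entities) {
        switch (e.kind) {
        case EntityKind::Project:
            if (ev.topic.empty()) ev.topic = e.text;
            break;
        case EntityKind::Date:
            if (ev.time.empty()) ev.time = e.text;
            break;
        case EntityKind::Person:
            ev.persons.push_back(e.text);
            break;
        case EntityKind::Other:
            break;
        }
    }
    if (ev.topic.empty() || ev.time.empty() || ev.persons.empty()) {
        return std::nullopt;
    }
    return ev;
}
```

The all-or-nothing check at the end is also why such a heuristic fails easily: a text that mentions a project and a date in completely unrelated sentences would still produce an event.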

Below the suggestions you can see the AnnotationWidget again which lets you annotate the file manually.

How to Write an AnnotationPlugin

This is a Howto in three sentences: Derive from AnnotationPlugin and implement doGetPossibleAnnotations. In that method trigger the creation of annotations. Your annotations can be instances of SimpleAnnotation or be based on Annotation and implement at least doCreate, exists, and equals.

class MyAnnotationPlugin : public Nepomuk::AnnotationPlugin
{
public:
    MyAnnotationPlugin(QObject* parent, const QVariantList&);
protected:
    void doGetPossibleAnnotations(const Nepomuk::AnnotationRequest&);
};

void MyAnnotationPlugin::doGetPossibleAnnotations(
      const Nepomuk::AnnotationRequest& request
)
{
    // MyFancyAnnotation can do all sorts of crazy things like creating
    // whole graphs of data or even opening another GUI
    addNewAnnotation(new MyFancyAnnotation(request));

    // SimpleAnnotation can be used to create simple key/value pairs
    Nepomuk::Types::Property property(Soprano::Vocabulary::NAO::prefLabel());
    Nepomuk::SimpleAnnotation* anno = new Nepomuk::SimpleAnnotation();
    anno->setProperty(property);
    anno->setValue("Hello World");
    // currently only the comment is used in the existing GUIs
    anno->setComment("Set label to 'Hello World'");
    addNewAnnotation(anno);

    // tell the framework that we are done. All this could also
    // be async
    emitFinished();
}

And Now?

At the Nepomuk workshop Tom Albers already experimented with integrating the annotation suggestions into Mailody. It is rather simple to do that but the framework still needs polishing. More importantly, however, the created data needs to be presented to the user in a more appealing way. In short: I need help with all this!

Integrate it into your applications, improve it, come up with new ways of presenting the information, write new plugins. Jump on board of the semantic desktop train.

Thanks for reading.

Xesam vs. Nepomuk – It is Settled

Update: Since there seems to be some confusion: This blog discusses the Xesam ontology, not the desktop search API!

For a long time now we had two desktop ontologies trying to solve the same problems: the Xesam ontology and NIE (Nepomuk Information Elements) both define classes and properties to describe desktop metadata ranging from id3 tags (artist, title, and so on) to emails and instant messaging. While the latter was developed as part of the Nepomuk project, the Xesam ontology was a by-product of the Xesam desktop search API project on freedesktop.org mainly designed by Evgeny (Phreedom) Egorochkin.

Now although I am obviously biased I cannot say if one was better than the other. What I can say is that NIE was more complete than Xesam. In any case, even during that time both ontologies benefited from each other. NIE contains ideas from Xesam and the other way around. So both ontologies were already pretty close. All that was left to do was a merger. And that is what happened now. Sort of.

With Nokia pushing Tracker and being very interested in semantic desktop technology, with KDE leaning towards Nepomuk anyway (ok, that is my fault), and with Evgeny, being the lead Xesam ontology developer, wanting this merge to happen sooner rather than later, NIE sort of won. Tracker now uses NIE and Strigi was also ported to NIE recently. Discussions on the xesam mailing list suggest that the Xesam ontology can be seen as deprecated now. Any advantages it has over NIE should be either merged into NIE or maintained as an extension to NIE.

I think this is a great development. And again this should not be seen as a defeat of Xesam. As I said: many ideas from Xesam already found their way into NIE and others will in the future. Evgeny and Philip Van Hoof (working on Tracker) already created extensions to at least NMO which should be merged soon.

But IMHO we now should make sure that we do not end up with multiple versions of NIE all over the place. Strigi’s source code now contains copies of NIE, kdebase already contains NIE (alongside other ontologies), and I figure that the same is true for Tracker. Thus, to avoid having one ontology installed in different versions and different places from different packages, I suggest that we create a new project which is only concerned with the desktop ontologies. It has already been done with mime types. So why not do the same with ontologies?

The OSCAF foundation, a result of the Nepomuk project, was supposed to maintain these ontologies. However, so far not much has happened. So we created a sourceforge project which could serve as the platform to collaboratively develop these desktop ontologies. The way I see it we could use a subversion repository to develop the ontologies and put the stable versions on the semanticdesktop.org server. In any case I think we need some platform for the ontologies. And just a wiki is not enough.

A short comment on the technical side: In kdebase I maintain the ontologies as a pair of trig source files and desktop files which describe the ontology (name, comment, namespace, path to source file, etc.). Maybe this could be a starting point to get to a common way of storing ontologies for the desktop.