Nepomuk 2.0 and the Data Management Service

During the development of Nepomuk in the last years we have gathered a lot of knowledge and ideas about how to integrate semantics and more specifically RDF into the desktop. This knowledge was spread over several components like services, libraries, and applications. Some of it was only found as convention, some of it documented, some only in our brains. But I guess this can be seen as normal in a project which treats on new territory, tries to invent new ways of handling information on our computers.

Thus, in January of this year Vishesh and myself finally set out to define a new API that would gather all this knowledge, all the ideas in one service. This service was intended to enforce our idea of how data in the semantic desktop should be formed while at the same time providing clean and powerful methods to manipulate this data. On April 8th, after 155 commits in a separate repository, I finally merged the new data management service into kde-runtime. And after hinting at it several times it is high time that I explain its ideas and the way we implemented it.

DMS and named graphs

Before I go into details about the DMS API I would like to explain the way information is stored in Nepomuk. More specifically, which ontology entities are used (until now by convention) to describe resources and meta-data.

The following example shows the information created by the file indexer. It nicely shows how we encode the information in Nepomuk. It is encoded using the Trig serialization which is also used by the Shared-Desktop-Ontologies.

<nepomuk:/ctx/graph1> {
  <nepomuk:/res/file1>
    nao:created “2011-05-20T11:23:45Z”^^xsd:dateTime ;
    nao:lastModified “2011-05-20T11:23:45Z”^^xsd:dateTime ;

    nie:contentSize "1286"^^xsd:int ;
    nie:isPartOf <nepomuk:/res/80b4187c-9c40-4e98-9322-9ebcc10bd0bd> ;
    nie:lastModified "2010-12-14T14:49:49Z"^^xsd:dateTime ;
    nie:mimeType "text/plain"^^xsd:string ;
    nie:plainTextContent "[...]"^^xsd:string ;
    nie:url <file:///home/nepomuk/helloworld.txt> ;
    nfo:characterCount "1249"^^xsd:int ;
    nfo:fileName "helloworld.txt"^^xsd:string ;
    nfo:lineCount "37"^^xsd:int ;
    nfo:wordCount "126"^^xsd:int ;
    a nfo:PlainTextDocument, nfo:FileDataObject .
}
<nepomuk:/ctx/metadatagraph1> {
  <nepomuk:/ctx/metadatagraph1>
    a nrl:GraphMetadata ;
    nrl:coreGraphMetadataFor <nepomuk:/ctx/graph1> .

  <nepomuk:/ctx/graph1>
    a nrl:DiscardableInstanceBase ;
    nao:created "2011-05-04T09:46:11.724Z"^^xsd:dateTime ;
    nao:maintainedBy <nepomuk:/res/someapp> .
}

There is essentially three types of information in this example. If we look at the first graph nepomuk:/ctx/graph1 we see that it contains information about one resource nepomuk:/res/file1. This information is split into two blocks for visualization purposes. The first block contains two properties nao:created and nao:lastModified. We call this information the resource-meta-data. It refers to the Nepomuk resource in the database and states when it was created and modified. This information can only be changed by the data management service. In contrast to that the second block contains the data in the resource. In this case we see file meta-data which in Nepomuk terms is just data. Here it is important to see the difference between nao:lastModified and nie:lastModified. The latter refers to the file on disk itself while the former refers to the Nepomuk resource representing the file on disk.

The second graph nepomuk:/ctx/metadatagraph1 only contains what we call graph-meta-data. A named graph (or context in Soprano terms) is a resource like anything else in Nepomuk. Thus, it has a type. Graphs containing graph-meta-data are always of type nrl:GraphMetadata and belong to exactly one non-graph-meta-data graph. The meta-data graph contains the type, the creation date, and the creating application of the actual graph. In Nepomuk we make the distinction between four types of graphs:

  • nrl:InstanceBase graphs contain normal data that has been created by applications or the user.
  • nrl:DiscardableInstanceBase graphs contain normal data that has been extracted from somewhere and can easily be recreated. This includes file or email indexing. Data in this type of graph does not need to be included in backups.
  • nrl:GraphMetadata graphs contain meta-data about other graphs.
  • nrl:Ontology graphs contain class and property definitions. Nepomuk does import all installed ontologies in its database. This is required for query parsing, data inference, and so on.

Whenever the information is changed or new information is added new graphs are created accordingly. Thus, Nepomuk always knows when which information was created by which application.

The Data Management API

Now that the data format is defined we can continue with the DMS API. The DMS provides one central multi-threaded DBus API and a KDE client library for convenience (this library currently lives in kde-runtime until it is stabilized to be moved to kdelibs or a dedicated repository once kdelibs are split). The API consists of two parts: 1. the simple API which is useful for scripts and simple applications and 2. the advanced API which holds most of the power required for systems like the file indexer or data sharing and syncing. Here I will show the client API as it uses proper Qt/KDE types.

The Simple API

The simple API contains the methods that directly spring to mind when thinking about such a service:

KJob* addProperty(const QList<QUrl>& resources,
                  const QUrl& property,
                  const QVariantList& values,
                  const KComponentData& component = KGlobal::mainComponent());
KJob* setProperty(const QList<QUrl>& resources,
                  const QUrl& property,
                  const QVariantList& values,
                  const KComponentData& component = KGlobal::mainComponent());
KJob* removeProperty(const QList<QUrl>& resources,
                     const QUrl& property,
                     const QVariantList& values,
                     const KComponentData& component = KGlobal::mainComponent());
KJob* removeProperties(const QList<QUrl>& resources,
                       const QList<QUrl>& properties,
                       const KComponentData& component = KGlobal::mainComponent());
CreateResourceJob* createResource(const QList<QUrl>& types,
                                  const QString& label,
                                  const QString& description,
                                  const KComponentData& component = KGlobal::mainComponent());
KJob* removeResources(const QList<QUrl>& resources,
                      RemovalFlags flags = NoRemovalFlags,
                      const KComponentData& component = KGlobal::mainComponent());

It has methods to set, add, and remove properties and to create and remove resources. Each method has an addition parameter to set the application that performs the modification. By default the main component of a KDE app is used. The removeResources method has a flags parameter which so far only has one flag: RemoveSubResources. I will get into sub resources another time though.

This part of the API is pretty straight-forward and rather self-explanatory. Every method will check property domains and ranges, enforce cardinalities, and reject any data that is not well-formed.

The Advanced API

The more interesting part is the advanced API.

KJob* removeDataByApplication(const QList<QUrl>& resources,
                              RemovalFlags flags = NoRemovalFlags,
                              const KComponentData& component = KGlobal::mainComponent());
KJob* removeDataByApplication(RemovalFlags flags = NoRemovalFlags,
                              const KComponentData& component = KGlobal::mainComponent());
KJob* mergeResources(const QUrl& resource1,
                     const QUrl& resource2,
                     const KComponentData& component = KGlobal::mainComponent());
KJob* storeResources(const SimpleResourceGraph& resources,
                     const QHash<QUrl, QVariant>& additionalMetadata = QHash<QUrl, QVariant>(),
                     const KComponentData& component = KGlobal::mainComponent());
KJob* importResources(const KUrl& url,
                      Soprano::RdfSerialization serialization,
                      const QString& userSerialization = QString(),
                      const QHash<QUrl, QVariant>& additionalMetadata = QHash<QUrl, QVariant>(),
                      const KComponentData& component = KGlobal::mainComponent());
DescribeResourcesJob* describeResources(const QList<QUrl>& resources,
                                        bool includeSubResources);

The first two methods allow an application to only remove the information it created itself. This is for example very important for tools like the file indexer that need to update data without touching anything added by the user like tags or comments or relations to other resources.

The method mergeResources allows to actually merge two resources. The result is that the properties and relations of the second resource will be moved to the first one, after which the second resource is deleted.

The most powerful method is without a doubt storeResources. It allows to store entire sets of resources in Nepomuk letting it sync them with existing ones automatically. The SimpleResourceGraph which is used as input is basically just a set of resources which in turn consist of a set of properties and an optional URI. The DMS will look for already existing matching resources before storing them in the database. This also means that simple resources like emails and contacts are merged automatically. Clients like Akonadi do not need to perform their own resource resolution anymore. Another variant of the same method is importResources which is probably most useful for scripts as it allows to read the resources from a file rather than a C++ struct.

Last but not least DMS has one single read-only method: describeResources. It returns all relevant information about the resources provided. This method will be used for meta-data sharing and syncing. While currently it only allows to filter sub-resources it will be extended to also allow to filter by application and permissions.

Well, that is it for the DMS for now. The plan is to port all existing tools and apps to this new API. But do not fear. In most cases this does not mean any work for you as the existing Nepomuk API can be used as always. It will then internally perform calls to DMS. The plan is to make the old Soprano-based interface read-only by KDE 4.8.

Dangling Meta Data Graphs (Caution: Very Technical)

Nepomuk in KDE uses NRL – the Nepomuk Representation Language – especially the named graphs that it defines. Each triple that is stored in the Nepomuk database is stored in a named graph. We use this graph to attach meta data to the triples themselves. So far this is restricted to the creation date but in the future there will be more like the creator (for shared meta data) and the modification date (this is a bit tricky since technically triples are never modified, only added and deleted. But from a user’s point of view changing a rating means changing a triple.).

What did I say? “We attach meta data to the triples”? Well, to be exact we attach it to the graph which contains the triples. And since everything is triples (or quadruples since there is the named graph) the meta data is, too. And like every triple these also need to be put in a dedicated named graph – the meta data graph. Thus, each triple is contained in one graph and each graph has exactly one meta data graph.

So far so good. But what happens if we delete all triples in a graph? Well, the graph ceases to exist since graphs like resources in an RDF database do only exist due to the triples in which they occur.

And that is when it happens: dangling meta data graphs, i.e. meta data graphs that describe a graph which does no longer exist.

In theory Nepomuk could delete these automatically but I decided against that for performance reasons. It would have to check for dangling meta data graphs after each removal operation. So for now (until I come up with some database maintenance service) these graphs are just waste hanging around not bothering anyone (they are small).

In case you want to check how many of them you have in your database use the following command (see the Nepomuk Tips and Tricks for nepomukcmd):

nepomukcmd query 'select count(?mg) where { ?mg nrl:coreGraphMetadataFor ?g . OPTIONAL { graph ?g { ?s ?p ?o . } . } . FILTER(!BOUND(?s)) . }'

And to simply delete them use a bit of shell magic (the –foo is important since it removes any human-readability gimmicks from the otuput):

for a in `nepomukcmd --foo query 'select ?mg where { ?mg nrl:coreGraphMetadataFor ?g . OPTIONAL { graph ?g { ?s ?p ?o . } . } . FILTER(!BOUND(?s)) . }'`; do nepomukcmd rmgraph "$a"; done