A Little Bit Of Query Optimization

Every once in a while I add another piece of query optimization code to the Nepomuk Query API. This time it was a direct result of my earlier TV Show handling. I simply thought that a query like “downton season=2 episode=2” takes too long to complete.

Now in order to understand this you need to know that there is a rather simple QueryParser class which converts a query string like the above into a Nepomuk::Query::Query which is simply a collection of Nepomuk::Query::Term instances. A Query instance is then converted into a SPARQL query which can be handled by Virtuoso. This SPARQL query already contains a set of optimizations, some specific to Virtuoso, some specific to Nepomuk. Of course there is always room for improvement.
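
To make that pipeline concrete, here is a minimal sketch using the Query API described above (error handling omitted; a sketch, not production code):

// Parse the user query string into a Query and look at the
// SPARQL that would be sent to Virtuoso.
Nepomuk::Query::Query query = Nepomuk::Query::QueryParser::parseQuery(
    QLatin1String("downton season=2 episode=2"));
qDebug() << query.toSparqlQuery();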

So let us get back to our query “downton season=2 episode=2” and look at the resulting SPARQL query string (I simplified the query a bit for readability. The important parts are still there):

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  {
    ?r nmm:hasEpisode ?v2 .
    ?v2 ?v3 "2"^^xsd:int .
    ?v3 rdfs:subPropertyOf rdfs:label .
  } UNION {
    ?r nmm:episodeNumber "2"^^xsd:int .
  } .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Like the user query, the SPARQL query has three main parts: the graph pattern checking the nmm:season, the graph pattern checking the episode, and the graph pattern checking the full-text search term “downton”. The latter we can safely ignore in this case; it is always a UNION so that full-text searches also include relations to tags and the like.

The interesting bit is the second UNION. The query parser matched the term “episode” to the properties nmm:hasEpisode and nmm:episodeNumber. At first glance this is fine since both contain the term “episode”. However, the property nmm:season, which is used in the first non-optional graph pattern, has a domain of nmm:TVShow. nmm:hasEpisode, on the other hand, has a domain of nmm:TVSeries. That means that the first pattern in the UNION can never match in combination with the first graph pattern since the domains are different.

The obvious optimization is to remove the first part of the UNION which yields a much simpler and way faster query:

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  ?r nmm:episodeNumber "2"^^xsd:int .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Well, sadly this is not generically true since resources can be double/triple/whateveriple-typed, meaning that in theory an nmm:TVShow could also have type nmm:TVSeries. In this particular case that is not likely, but there are many cases in which double-typing does in fact apply. Thus, this optimization cannot be applied to all queries. I will, however, include it in the parser, where it is very likely that the user does not take double-typing into account.

If you have good examples that show why this optimization should not be included in the query parser by default, please tell me so I can reconsider.

Now after having written this and proof-reading the SPARQL query I realize that this particular query could have been optimized in a much simpler way: the value “2” is obviously an integer, thus it can never match a non-literal as required by the nmm:hasEpisode property…

Symbolic Links in Nepomuk – A Solution

Until now symbolic links were not handled in Nepomuk. Today I committed the last patch for the new symlink support in Nepomuk. The solution I chose is not the theoretically perfect one. That would have taken way too much effort while introducing all kinds of possible bugs, regressions, API incompatibilities, and so on. But the solution is nice and clean and simple.

Essentially each direct symlink is indexed as a separate file using the content of its target file. (This is necessary since a direct symlink might have a different file name than the target file.) The interesting part is the indirect symlinks: files in a folder which is itself a symlink to another folder. An example:

/home/trueg/
|-- subdir/
|   |-- thefile.txt
|-- link/ -> subdir/
    |-- thefile.txt

Here I have a folder “subdir” which contains a file “thefile.txt”. The folder “link” is a direct symlink to “subdir” whereas “link/thefile.txt” is an indirect symlink to “subdir/thefile.txt”.

Indirect symlinks are simply stored as alternative URLs on the target file resources using the kext:altUrl property. (The property is not defined in NIE since it is not theoretically sound with respect to the design of NIE. It needs to be considered a beautiful hack.)

The only situation in which the alternative URLs are actually needed is when searching in a specific folder. Imagine searching in “/home/trueg/link” only. Since there are no nie:url values which match that prefix we need to search the kext:altUrl values, too.
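
To give an idea of what that means, here is a hand-written sketch of such a query (wrapped in C++, prefixes omitted; the query the search service actually generates will differ):

// Match resources whose nie:url or kext:altUrl starts with the folder prefix.
const QString folder = QLatin1String("file:///home/trueg/link");
const QString sparql = QString::fromLatin1(
    "select distinct ?r where { "
    "  { ?r nie:url ?u . } UNION { ?r kext:altUrl ?u . } . "
    "  FILTER(REGEX(STR(?u), '^%1')) . "
    "}").arg(folder);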

The result of all this is that nearly no additional space is required except for the kext:altUrl properties, files are not indexed more than once, and files in symlinked folders are found in addition to “normal” files.

In my tests everything seems to work nicely but I urge you to test the nepomuk/symlinkHandling branches in kdelibs and kde-runtime and report any problems back to me. The more testing I get the quicker I can merge both into KDE 4.8.

Lastly, the pledgie campaign is done but the search for funds goes on.

Nepomuk 2.0 and the Data Management Service

During the development of Nepomuk in the last years we have gathered a lot of knowledge and ideas about how to integrate semantics, and more specifically RDF, into the desktop. This knowledge was spread over several components like services, libraries, and applications. Some of it only existed as convention, some of it was documented, some lived only in our brains. But I guess this can be seen as normal in a project which treads on new territory and tries to invent new ways of handling information on our computers.

Thus, in January of this year Vishesh and I finally set out to define a new API that would gather all this knowledge and all these ideas in one service. This service was intended to enforce our idea of how data in the semantic desktop should be formed while at the same time providing clean and powerful methods to manipulate this data. On April 8th, after 155 commits in a separate repository, I finally merged the new data management service into kde-runtime. And after hinting at it several times it is high time that I explain its ideas and the way we implemented it.

DMS and named graphs

Before I go into details about the DMS API I would like to explain the way information is stored in Nepomuk. More specifically, which ontology entities are used (until now by convention) to describe resources and meta-data.

The following example shows the information created by the file indexer and nicely illustrates how we encode information in Nepomuk. It is written in the TriG serialization, which is also used by the Shared-Desktop-Ontologies.

<nepomuk:/ctx/graph1> {
  <nepomuk:/res/file1>
    nao:created "2011-05-20T11:23:45Z"^^xsd:dateTime ;
    nao:lastModified "2011-05-20T11:23:45Z"^^xsd:dateTime ;

    nie:contentSize "1286"^^xsd:int ;
    nie:isPartOf <nepomuk:/res/80b4187c-9c40-4e98-9322-9ebcc10bd0bd> ;
    nie:lastModified "2010-12-14T14:49:49Z"^^xsd:dateTime ;
    nie:mimeType "text/plain"^^xsd:string ;
    nie:plainTextContent "[...]"^^xsd:string ;
    nie:url <file:///home/nepomuk/helloworld.txt> ;
    nfo:characterCount "1249"^^xsd:int ;
    nfo:fileName "helloworld.txt"^^xsd:string ;
    nfo:lineCount "37"^^xsd:int ;
    nfo:wordCount "126"^^xsd:int ;
    a nfo:PlainTextDocument, nfo:FileDataObject .
}
<nepomuk:/ctx/metadatagraph1> {
  <nepomuk:/ctx/metadatagraph1>
    a nrl:GraphMetadata ;
    nrl:coreGraphMetadataFor <nepomuk:/ctx/graph1> .

  <nepomuk:/ctx/graph1>
    a nrl:DiscardableInstanceBase ;
    nao:created "2011-05-04T09:46:11.724Z"^^xsd:dateTime ;
    nao:maintainedBy <nepomuk:/res/someapp> .
}

There are essentially three types of information in this example. If we look at the first graph nepomuk:/ctx/graph1 we see that it contains information about one resource nepomuk:/res/file1. This information is split into two blocks for visualization purposes. The first block contains two properties, nao:created and nao:lastModified. We call this information the resource-meta-data. It refers to the Nepomuk resource in the database and states when it was created and modified. This information can only be changed by the data management service. In contrast to that, the second block contains the data in the resource. In this case we see file meta-data, which in Nepomuk terms is just data. Here it is important to see the difference between nao:lastModified and nie:lastModified. The latter refers to the file on disk itself while the former refers to the Nepomuk resource representing the file on disk.

The second graph nepomuk:/ctx/metadatagraph1 only contains what we call graph-meta-data. A named graph (or context in Soprano terms) is a resource like anything else in Nepomuk. Thus, it has a type. Graphs containing graph-meta-data are always of type nrl:GraphMetadata and belong to exactly one non-graph-meta-data graph. The meta-data graph contains the type, the creation date, and the creating application of the actual graph. In Nepomuk we make the distinction between four types of graphs:

  • nrl:InstanceBase graphs contain normal data that has been created by applications or the user.
  • nrl:DiscardableInstanceBase graphs contain normal data that has been extracted from somewhere and can easily be recreated. This includes file or email indexing. Data in this type of graph does not need to be included in backups.
  • nrl:GraphMetadata graphs contain meta-data about other graphs.
  • nrl:Ontology graphs contain class and property definitions. Nepomuk imports all installed ontologies into its database. This is required for query parsing, data inference, and so on.

Whenever information is changed or new information is added, new graphs are created accordingly. Thus, Nepomuk always knows when which information was created by which application.

The Data Management API

Now that the data format is defined we can continue with the DMS API. The DMS provides one central multi-threaded DBus API and, for convenience, a KDE client library (this library currently lives in kde-runtime until it has stabilized; it will then move to kdelibs or to a dedicated repository once kdelibs is split). The API consists of two parts: 1. the simple API, which is useful for scripts and simple applications, and 2. the advanced API, which holds most of the power required for systems like the file indexer or data sharing and syncing. Here I will show the client API as it uses proper Qt/KDE types.

The Simple API

The simple API contains the methods that directly spring to mind when thinking about such a service:

KJob* addProperty(const QList<QUrl>& resources,
                  const QUrl& property,
                  const QVariantList& values,
                  const KComponentData& component = KGlobal::mainComponent());
KJob* setProperty(const QList<QUrl>& resources,
                  const QUrl& property,
                  const QVariantList& values,
                  const KComponentData& component = KGlobal::mainComponent());
KJob* removeProperty(const QList<QUrl>& resources,
                     const QUrl& property,
                     const QVariantList& values,
                     const KComponentData& component = KGlobal::mainComponent());
KJob* removeProperties(const QList<QUrl>& resources,
                       const QList<QUrl>& properties,
                       const KComponentData& component = KGlobal::mainComponent());
CreateResourceJob* createResource(const QList<QUrl>& types,
                                  const QString& label,
                                  const QString& description,
                                  const KComponentData& component = KGlobal::mainComponent());
KJob* removeResources(const QList<QUrl>& resources,
                      RemovalFlags flags = NoRemovalFlags,
                      const KComponentData& component = KGlobal::mainComponent());

It has methods to set, add, and remove properties and to create and remove resources. Each method has an additional parameter to set the application that performs the modification. By default the main component of a KDE app is used. The removeResources method has a flags parameter which so far only has one flag: RemoveSubResources. I will get into sub resources another time though.

This part of the API is pretty straight-forward and rather self-explanatory. Every method will check property domains and ranges, enforce cardinalities, and reject any data that is not well-formed.
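
For illustration, a minimal usage sketch (assuming the functions above are free functions in the Nepomuk namespace; the resource URI is made up):

// Rate a file with 8 out of 10. DMS checks that nao:numericRating
// exists, that the value matches its range, and that cardinalities hold.
KJob* job = Nepomuk::addProperty(
    QList<QUrl>() << QUrl(QLatin1String("nepomuk:/res/file1")),
    Soprano::Vocabulary::NAO::numericRating(),
    QVariantList() << 8);
job->exec(); // synchronous for brevity; connect to KJob::result() in real code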

The Advanced API

The more interesting part is the advanced API.

KJob* removeDataByApplication(const QList<QUrl>& resources,
                              RemovalFlags flags = NoRemovalFlags,
                              const KComponentData& component = KGlobal::mainComponent());
KJob* removeDataByApplication(RemovalFlags flags = NoRemovalFlags,
                              const KComponentData& component = KGlobal::mainComponent());
KJob* mergeResources(const QUrl& resource1,
                     const QUrl& resource2,
                     const KComponentData& component = KGlobal::mainComponent());
KJob* storeResources(const SimpleResourceGraph& resources,
                     const QHash<QUrl, QVariant>& additionalMetadata = QHash<QUrl, QVariant>(),
                     const KComponentData& component = KGlobal::mainComponent());
KJob* importResources(const KUrl& url,
                      Soprano::RdfSerialization serialization,
                      const QString& userSerialization = QString(),
                      const QHash<QUrl, QVariant>& additionalMetadata = QHash<QUrl, QVariant>(),
                      const KComponentData& component = KGlobal::mainComponent());
DescribeResourcesJob* describeResources(const QList<QUrl>& resources,
                                        bool includeSubResources);

The first two methods allow an application to only remove the information it created itself. This is for example very important for tools like the file indexer that need to update data without touching anything added by the user like tags or comments or relations to other resources.
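
As an example, the file indexer could drop its own data about a removed file roughly like this (a hedged sketch; the flag value and component name are assumptions):

KJob* job = Nepomuk::removeDataByApplication(
    QList<QUrl>() << QUrl(QLatin1String("nepomuk:/res/file1")),
    Nepomuk::NoRemovalFlags,
    KComponentData("nepomukindexer"));
// Tags or comments the user attached to the file stay untouched.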

The method mergeResources actually merges two resources into one: the properties and relations of the second resource are moved to the first one, after which the second resource is deleted.

The most powerful method is without a doubt storeResources. It stores entire sets of resources in Nepomuk, letting the service sync them with existing ones automatically. The SimpleResourceGraph which is used as input is basically just a set of resources which in turn consist of a set of properties and an optional URI. The DMS will look for already existing matching resources before storing them in the database. This also means that simple resources like emails and contacts are merged automatically. Clients like Akonadi do not need to perform their own resource resolution anymore. Another variant of the same method is importResources, which is probably most useful for scripts as it reads the resources from a file rather than from C++ structures.
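
A sketch of how storeResources could be fed; the SimpleResource member functions are written from memory, so treat the details as assumptions:

// Describe a contact without knowing whether it already exists.
// DMS will merge it with a matching resource if one is found.
Nepomuk::SimpleResource contact;
contact.addProperty(Soprano::Vocabulary::RDF::type(),
                    Nepomuk::Vocabulary::NCO::Contact());
contact.addProperty(Nepomuk::Vocabulary::NCO::fullname(),
                    QLatin1String("Arthur Dent"));

Nepomuk::SimpleResourceGraph graph;
graph << contact;
KJob* job = Nepomuk::storeResources(graph);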

Last but not least DMS has one single read-only method: describeResources. It returns all relevant information about the resources provided. This method will be used for meta-data sharing and syncing. While it currently only allows filtering of sub-resources, it will be extended to also allow filtering by application and permissions.

Well, that is it for the DMS for now. The plan is to port all existing tools and apps to this new API. But do not fear. In most cases this does not mean any work for you as the existing Nepomuk API can be used as always. It will then internally perform calls to DMS. The plan is to make the old Soprano-based interface read-only by KDE 4.8.

Nice Things To Do With Nepomuk – Part One

The other day I needed to find a website. The only thing I could remember was that Vishesh gave me the link in IRC a few days back. So I had to grep through thousands of lines of IRC log which, quite frankly, sucks. Nepomuk should handle this. So what do we have to do to achieve that? Three things, of which I will present the second first:

Extract web links from text in Nepomuk.

Why that? Well, to properly handle web links they need to be represented as Nepomuk resources and not just be some plain text excerpt in some text literal. Only then can we relate them to things, search them by type, and order them by access count, times, or whatever.

Let’s go then. First we query for all resources that might mention a web link in their text content (we restrict ourselves to nie:plainTextContent since that covers all files, emails, and so on):

ComparisonTerm linkTerm(NIE::plainTextContent(), 
    LiteralTerm(QLatin1String("http")));

We look for all resources that contain the string ‘http’ in their plain text content. We then force a variable name for the matched property to be able to access it in the results:

linkTerm.setVariableName(QLatin1String("text"));

We additionally exclude HTML and SVG files to avoid having too many useless links:

Term htmlExcludeTerm = !ComparisonTerm(NIE::mimeType(),
    LiteralTerm(QLatin1String("text/html")), ComparisonTerm::Equal);
Term svgExcludeTerm = !ComparisonTerm(NIE::mimeType(),
    LiteralTerm(QLatin1String("image/svg+xml")), ComparisonTerm::Equal);
Query query(linkTerm && htmlExcludeTerm && svgExcludeTerm);

Finally we request that Nepomuk returns two properties. We will see later on why we need those:

query.addRequestProperty(Query::RequestProperty(NIE::lastModified()));
query.addRequestProperty(Query::RequestProperty(NAO::created()));
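
And now all we have to do is run this query via QueryServiceClient and connect to its newEntries signal to handle each result. A minimal sketch of that plumbing, assuming a handler object with a matching slot:

Nepomuk::Query::QueryServiceClient client;
QObject::connect(&client,
                 SIGNAL(newEntries(QList<Nepomuk::Query::Result>)),
                 handler,
                 SLOT(slotNewEntries(QList<Nepomuk::Query::Result>)));
client.query(query);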

In the slot connected to newEntries we iterate over all new results and see if there are really useful links in there. For that we need a little QRegExp magic which is fairly unrelated to Nepomuk but interesting nonetheless:

QRegExp rx(QLatin1String("\\b(https?://[\\-a-z0-9+&@#/%?=~_\\|!:,.;]*[\\-a-z0-9+&@#/%=~_\\|])"));
rx.setCaseSensitivity(Qt::CaseInsensitive);

We will use this regular expression without comment and get back to our result. First we create a list to remember our website resources (we only do this to show later on how Nepomuk can handle lists):

QList<Nepomuk::Resource> websites;

We then iterate over all matches of the regular expression in the text:

const QString text = result.additionalBinding(QLatin1String("text")).toString();
int i = -1;
do {
    if((i = rx.indexIn(text, i+1)) >= 0) {
        const KUrl url = rx.cap(1);
        Nepomuk::Resource website(url);
        website.addType(NFO::Website());
        websites << website;
    }
} while(i >= 0);

Finally we actually relate the newly created website resources to the original resource using nie:links which is exactly the property we need:

result.resource().addProperty(NIE::links(), websites);

This could already be it. But there is one minor detail which we have not handled yet: the request properties we added to the query. The issue is rather simple: we create these website resources at a time that differs from the time we actually encountered them. Thus, to be able to sort websites according to the time we used them last we need to change the creation date of the resources. For web links that were found in file contents this is the mtime (the best date we have). For anything else we use the creation time of the resource (the perfect fit here would be the creation time of the property which contains the link but that is for another day):

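// Note: this snippet runs inside the result loop above, once per created website.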
QDateTime creationDate;
if(result[NIE::lastModified()].isLiteral())
    creationDate = result[NIE::lastModified()].literal().toDateTime();
else if(result[NAO::created()].isLiteral())
    creationDate = result[NAO::created()].literal().toDateTime();
if(creationDate.isValid())
    website.setProperty(NAO::created(), creationDate);

Well, that’s it for today. Next time: great, now we have all these web sites but what do we do with them?

Call for Participation (Web and Scripting-Experts Wanted)

There is always work to do. This is true for every project. It is even more true for open-source projects. It is the truth in itself when it comes to the semantic desktop and Nepomuk. Getting people to help was never a strong point of mine. I think that is partly due to the fuzzy task descriptions. Well, today I try once again but with a slightly different scope. This is not about KDE or Nepomuk coding; it is about the work that needs to be done to maintain the Nepomuk ontologies.

The ontologies have an eventful history. They started out as part of the Nepomuk research project. When that was over they lived on in the kdebase package. Then the OSCAF foundation was created with the goal of maintaining the ontologies. That did not really work out. Thus, we created the oscaf project on SourceForge, trying to do ontology development the open-source way. This sort of worked, but communication with other projects was troublesome (the Tracker project still maintains its own fork of the ontologies). Within the oscaf project the shared-desktop-ontologies package was born. Then we created the shared-desktop-ontologies project on freedesktop.org, hoping that this in combination with a move to git would bring the Tracker guys back to the main development – at least into the same repository. Of course that did not happen either. So now we have the oscaf project on SourceForge, the shared-desktop-ontologies project on freedesktop.org, and to top it all off we have semanticdesktop.org/ontologies which is used to host the ontologies semantic-web-style.

So much for the mess. If you are still reading that means that I did not scare you away and you might be a candidate to help us out of that mess.

This is what needs to be done – at least that is my current idea; if you have better ideas, great:

  1. Create a simple website for the shared-desktop-ontologies project on freedesktop.org including links to semanticdesktop.org and the sdo package releases.
  2. Migrate the package releases of sdo from SourceForge to freedesktop.org. I suppose they can be put in some ftp folder and be linked in some download section on the new website.
  3. Set in place scripts that automatically update the ontology pages on semanticdesktop.org like the NIE page. This involves:
    1. Convert the existing HTML headers that we have for ontologies like NAO or NIE into docbook (html2docbook might help with the first conversion step)
    2. Write a script that parses the ontologies and creates docbook code with links to super and sub-properties/classes including links between the ontologies. The result should be something like the existing (but outdated) HTML pages.
    3. Write a script that converts the docbook to HTML and puts it onto semanticdesktop.org.
  4. If possible somehow integrate the l10n script that Sebastien Renard wrote to allow translation of labels and comments (Sebastien or I can provide the script).

There you have it. Not a single line of C++ required and not really any ontology or RDF knowledge necessary. It would be grand to find someone willing to invest some time and effort into this allowing us to finally have up-to-date ontologies on semanticdesktop.org and a clean shared-desktop-ontologies portal.

Thanks for reading.

A Million Ways To Do It Wrong

It is a sad truth: when it comes to creating data for the Nepomuk semantic desktop there are a million ways to do it wrong and basically only one way to get it right. Typically people will choose from the first set of ways. While that is of course bad, they are not to blame. Who wants to read page after page of documentation and reference guides? Who wants to dive into the depths of RDF and all that ontology stuff when they just need to store a note (yes, this blog was inspired by a real problem)? Nobody – that’s who! Thus, the Nepomuk API should do most of the work. Sadly it does not. It basically allows you to do everything. Resource::setProperty will happily use classes or even invalid URLs as properties without giving any feedback to the developer. Why is that? Well, I suppose there are at least three reasons: 1. Back in the day I figured people would almost always use the resource-generator to create their own convenience classes which handle the types and properties properly, 2. the Resource class is probably the oldest part of the whole Nepomuk stack, and 3. a basic lack of time, drive, and development power.

So what can we do about this situation? Vishesh and I have been discussing the idea of a central DBus API for Nepomuk data management a million times (as you can see, today “a million” is my go-to expression when I want to say “a lot”). So far, however, we could not come up with a good API that solves all problems, is future-proof (to a certain extent), and performs well. That has not changed. I still do not know the solution. But I have some ideas as to what the API should do for the user in terms of data integrity.

  1. Ensure that only valid existing properties are used and provide a good error message in case a class or an invalid URL or something non-existing is used instead. This would also mean that one could only use ontologies that have been imported into Nepomuk. But since the ontology loader already supports fetching ontologies from the internet this should not be a big problem.
  2. Ensure that the ranges of the properties are honoured. This is pretty straight-forward for literal ranges. In that case we could also do some fancy auto-conversion to simplify the usage but in essence it is easy. The case of non-literal ranges is a bit trickier. Do we want to force proper types or do we assume that the object resource has the required type? I suppose flags would be of use (a rough sketch follows after this list):
    • ClosedWorld – It is required that the object resource has the type of the range. If it has not the call fails.
    • OpenWorld – The object resource will simply get the range type. This is no problem since resources can be of several types.

    This would also mean that each property needs to have a properly defined range. AFAIK this is currently not the case for all NIE ontologies. I think it is time for ontology unit tests!

  3. Automatically handle pimo:Things to a certain extent: Here I could imagine that trying to add a PIMO property to a resource would automatically add it to the related pimo:Thing instead.
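
To make the flag idea from point 2 a bit more concrete, here is a purely illustrative sketch (all names are hypothetical):

enum RangeFlag { ClosedWorld, OpenWorld };

// Enforce a non-literal range on an object resource.
bool enforceRange(Nepomuk::Resource& object, const QUrl& rangeType, RangeFlag flag)
{
    if (object.types().contains(rangeType))
        return true;               // already carries the range type
    if (flag == OpenWorld) {
        object.addType(rangeType); // multi-typing is allowed in RDF
        return true;
    }
    return false;                  // ClosedWorld: reject the call
}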

Moving this from the client library into a service would have other benefits, too.

  • The service could be used from other languages than C++ or even from applications not using KDE.
  • The service could perform optimizations when it comes to storing triples, updating resources, caching, you name it.
  • The service could provide change notifications which are much more useful than the Soprano::Model signals which are pretty useless.
  • The service could perform any number of integrity tests before executing the actual commands on the database, thus improving the quality of the data in Nepomuk altogether.

This blog entry is not about presenting the one solution to solve all our problems. It is merely a brain-dump, trying to share some of the random thoughts that go through my head when taking a walk in the woods. Nonetheless this is an issue that needs tackling at one point or another. In any case my ideas are saved for the ages. :)

Nepomuk Data Layout

I am happy to announce that I just finished writing an article about the data layout in Nepomuk. I think it is another must-read if you want to work with Nepomuk data, either querying it or creating data yourself. The article is another one in the series of techbase tutorials about Nepomuk. Well, this was short and fairly dry…
