Debugging Nepomuk/Virtuoso’s CPU usage

Rabauke wrote a very good blog post about debugging Nepomuk and Virtuoso query performance on OpenSuse.

Also David Faure posted a Virtuoso patch on the Nepomuk mailing list which makes the Virtuoso status() command output the full queries instead of the truncated ones. I will try to get the latter into Virtuoso upstream, maybe with an additional parameter to the function.

Nepomuk Tasks: KActivityManager Crash

After a little silence during which I was occupied with Eastern and OpenLink related work I bring you news about the second Nepomuk task: the KActivityManager crash.

Ivan Cukic already “fixed” the bug by simply not using Nepomuk but an SQLite backend (at least that is how I understood it, correct me if I am wrong). However, I wanted to fix the root of the original problem.

Soprano provides the communication channel between Nepomuk and its clients. It is based on a very simple custom protocol going through a local socket. So far QLocalSocket, ie. Qt’s implementation was used. The problem with QLocalSocket is that it is a QObject. Thus, it cannot live in two threads at the same time. The hacky solution was to maintain one socket per thread. Sadly that resulted in complicated maintenance code which was impossible to get right. Hence crashes like #269573 or #283451 (basically any crash involving The Soprano::ClientConnection) were never fixed.

A few days ago I finally gave up and decided to get rid of QLocalSocket and replace it with my own implementation. The only problem is that in order to keep Windows compatibility I had to keep the old implementation around by adding quite a lot of #ifdefs.

And now I could use some testers for a Soprano client library that does only create a single connection to the server instead of one per thread. I already pushed the new code into Soprano’s git master. So all you need to do is run KDE on top of that.

Oh, and while at it I finally fixed the problem with re-connecting of clients. So now a restart of Nepomuk will no longer leave the clients with dangling connections, unable to perform queries. That fix, however, is in kdelibs.

Well, the day was long, I am tired, and this blog post feels a little boring. So before in addition to that it gets too long I will stop.

Nepomuk Tasks: Let The Virtuoso Inferencing Begin

Only four days ago I started the experiment to fund specific Nepomuk tasks through donations. Like with last year’s fundraiser I was uncertain if it was a good idea. That, however, changed when only a few hours later two tasks had already reached their donation goal. Again it became obvious that the work done here is appreciated and that the “open” in Open-Source is understood for what it actually is.

So despite my wife not being overly happy about it I used the weekend to work on one of the tasks: Virtuoso inferencing.

Inference?

As a quick reminder: the inferencer automatically infers information from the data in the database. While Virtuoso can handle pretty much any inference rule you throw at it we stick to the basics for now: if resource R1 is of type B and B derives from A then R1 is also of type A. And: if R1 has property P1 with value “foobar” and P1 is derived from P2 then R1 also has property P2 with value “foobar“.

Crappy Inference

This is already very useful and even mandatory in many cases. Until now we used what we called “crappy inferencing 1 & 2”. The Crappy inferencer 1 was based on work done in the original Nepomuk project and it simply inserted triples for all sub-class and sub-property relations. That way we could simulate real inference by querying for something like

select * where {
  ?r ?p "foobar" . 
  ?p rdfs:subPropertyOf rdfs:label .
}

and catch all sub-properties of rdfs:label like nao:prefLabel or nie:title. While this works it means bad performance, additional storage and additional maintenance.

The Crappy Inferencer 2 was even worse. It inserted rdf:type triples for all super-classes. This means that it would look at every added and removed triple to check if it was a rdf:type triple. If so it would add or remove the appropriate rdf:type triples for the super-types. That way we could do fast type queries without relying on the crappy inferencer 1 which relies on the rdfs:subClassOf method. But this meant even more maintenance and even more storage space wasted.

Introducing: Virtuoso Inference

So now we simply rely on Virtuoso to do all that and it does such a wonderful job. Thanks to Virtuoso graph groups we can keep our clean ontology separation (each ontology has its own graph) and still stick to a very simple extension of the queries:

DEFINE input:inference <nepomuk:/ontographgroup>
select * where {
  ?r rdfs:label "foobar" .
}

Brilliant. Of course there are still situations in which you do not want to use the inferencer. Imagine for example the listing of resource properties in the UI. This is what it would look like with inference:

We do not want that. Inference is intended for machine, not for the human, at least not like this. So since back in the day I did not think of adding query flags to Soprano I simply introduced a new virtual query language: SparqlNoInference.

Resource Visibility

While at it I also improved the resource visibility support by simplifying it. We do not need any additional processing anymore. This again means less work on startup and with every triple manipulation command. Again we save space and increase performance. But this also means that resource visibility filtering will not work as before anymore. Nepoogle for example will need adjustment to the new way of filtering. Instead of

?r nao:userVisible 1 .

we now need

FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] }

Testing

The implementation is done. All that rests are the tests. I am already running all the patches but I still need to adjust some unit tests and maybe write new ones.

You can also test it. The code changes are, as always, spread over Soprano, kdelibs and kde-runtime. Both kdelibs and kde-runtime now contain a branch “nepomuk/virtuosoInference”. For Soprano you need git master.

Look for regressions of any kind so we can merge this as soon as possible. The goal is KDE 4.9.

Akonadi, Nepomuk, and A Lot Of CPU

One Bug has been driving people crazy. This is more than understandable seeing that the bug was an endless high CPU usage by Virtuoso, the database used in Nepomuk. Kolab Systems, the Free Software groupware company behind Kolab, a driving force behind Akonadi, sponsored me to look into that issue.

Finding the issue turned out to be a bit harder than I thought, coming up with a fix even more so. In the process I ended up improving the Akonadi Nepomuk Email indexer/feeder in several places. This, however useful and worthwhile, turned out to be unrelated to the high CPU usage. Virtuoso was not to blame either. In the end the real issue was solved by a little SPARQL query optimization.

Application developers against Akonadi and Nepomuk might want to keep that in mind: The way you build your queries will have dramatic impact on the performance of the whole system. So this is also where opimizations are likely to have a lot of impact in case people want to help improve things further. Discussing query design with the Nepomuk team or on the Virtuoso mailing list can go a long way here.

So thanks to the support from Kolab Systems, Virtuoso is no longer chewing so much CPU, and Akonadi Email indexing will work a lot smoother with KDE 4.8.2.

Nepomuk Tasks – Sponsor a Bug or Feature

Thanks to a very successful fundraiser in 2011 I was able to continue working on Nepomuk and searching for new enterprise sponsoring. Sadly that search was not fruitful and in 2012 Nepomuk has become a hobby. Several people proposed to start another fundraiser or try to raise money on a monthly basis. I, however, will try to get sponsoring for specific bugs or features. Depending on their size the sponsoring goal will differ. This would allow me to keep working on Nepomuk as more than a hobby.

The Nepomuk Tasks page lists the current tasks that can be sponsored. Of course you can propose new tasks but I will try to keep the list of current tasks small. Donate to the tasks you would like to see finished, ignore the ones you do not deem important. I will simply remove tasks if there is no activity within a certain period of time. So please have a look at


The Nepomuk Tasks Page

Keeping Your GitHub Fork Up-To-Date

This is a quick tip: after forking a repository on GitHub sadly there is no button to easily fetch new commits from the original repository. So we have to do it manually. Luckily this is easy enough:

1. Add the original repository as a remote to your local clone:

$ git remote add openlink git://github.com/openlink/virtuoso-opensource.git

2. Now you can simply pull the changes from the original:

$ git checkout develop/6
$ git pull openlink develop/6

However, make sure you do not do a rebase pull with the pull as that would destroy any merges you did on your own fork.

3. Finally push them to your own fork:

$ git push origin develop/6

I am sure that for most git users this is trivial. I just wanted to clearly state that there is no difference when using GitHub. It does not provide us with any magic when it comes to fork updates.

Virtuoso Open-Source Moved to GitHub

Ever since 2006 OpenLink Software has provided its Open-Source version of Virtuoso (VOS), the high-performance SQL server with a powerful RDF/SPARQL data management layer on top.

So far the sources have been developed in an internal cvs repository which was published through the Virtuoso sourceforge pages.

As of March 21. OpenLink took the next step towards Open Development by moving to git as its version management system. The sources are now hosted in the VOS GitHub repository.

Like mentioned on the VOS git usage pages OpenLink now accepts GitHub pull requests and patches. Be sure to read the notes on git branching policy in VOS which are based on the git-flow approach by Vincent Driessen – which by the way is an interesting read independent of VOS.

Most importantly it is now a lot simpler to follow the development of Virtuoso Open-Source. Simply clone the git repository and switch to the appropriate develop branch:

$ git clone git://github.com/openlink/virtuoso-opensource.git
$ cd virtuoso-opensource
$ git checkout -t remotes/origin/develop/6

For details on the used branches see the already mentioned VOS git usage guide.

Refer to the VOS building instructions if the following is not enough for you:

$ ./autogen.sh
$ ./configure --prefix=/usr/local --with-layout=<LAYOUT>
$ make
$ make install

where <LAYOUT> is one of Gnu, Debian, Gentoo, Redhat, Freebsd, opt, Openlink. The latter two force the prefix.

You do not need to know RDF or FOAF to use WebID

After going into way too much detail about FOAF and how to create your own WebID manually the last time, today I will show you how easy it is when you have a system that supports WebID properly.

The system I am talking about is ODS – The OpenLink Data Spaces. Please do not be scared away by the UI which does not offer all the fanciness of today’s Web interfaces. The point with ODS is its backend (which BTW is entirely based on Virtuoso, including the serving of the web pages).

Getting your own WebID through ODS contains of only two steps: 1. create an ODS account, and 2. let ODS create the X.509 certificate for you. Now as mentioned before you need to trust OpenLink with your private key in this case. If you are not willing to do that you can setup your own instance of ODS – more about that another time.

Creating an ODS account

So start by navigating to the public instance of ODS in order to create an account. In the upper right corner you will find a little “Sign Up” button which will get you here:

As you can see there are several ways to sign up for an ODS account, one of which is of course WebID. Since you want to use ODS to create your WebID you need to use another means: plain old username+password or something OAuth-powered like LinkedIn.

Generating your X.509 certificate

Once you created the account and are logged into ODS, navigate to the profile manager via the little “edit” button right next to “Profile” in the upper left. You are now presented with lots of tabs allowing to change all details of your ODS account. In this case you are interested in the “Security->Certificate Generator” section.

Normally all required fields should already be filled. Now you only need to click the “submit certificate request” button and let ODS and your browser do the rest. If the certificate generation was successful you get a notice that it has been imported into your borwser’s key chain. This is what it looks like in Chrome:

To verify that your new certificate has been registered successfully check the “X.509 Certificates” tab which lists all certificates installed in ODS (the ones which include the private key):

Testing your shiny new WebID

Finally you can try your new WebID by logging out (“Logout” in the upper right corner) and signing in again, this time with your WebID:

Notice how your browser takes care of the login by itself as it has the certificate installed in its own key chain.

That is already it. You now have a fully functional and valid WebID without every worrying about RDF or FOAF or anything else besides clicking some buttons.

WebID – A Guide For The Clueless

Clueless – that is what I was a while ago regarding WebID. Since then I learned a lot. One of the things I learned was that apparently there is no easy hands-on guide to get started with WebID. This is what I will try to remedy here. So let us start with the basics:

What is WebID?

WebID is essentially two things: 1. a way to identify yourself and others in the semantic web of things, and 2. a unified password-less alternative to classical login credentials.

A WebID is essentially a URL pointing to a description of yourself (this is typically a FOAF file) combined with a self-signed X.509 certificate. X.509 certificates are those things used to verify the identity of web servers via SSL. Typically they are signed by big brother authorities like Verisign whose root certificates are hard-coded into all web browsers.

How Does WebID Work?

As mentioned before the WebID is a URL which points to a FOAF file. Now if you want to log into some site you simply provide the WebID which means to select a certificate from a list in your web browser. The server will then fetch the FOAF, extract the certificate’s public key from it (more about that later), and then ask you to prove your identity. Since you are the only one having the private key of the certificate that is easily done. And that’s already it. From a high level point of view it is very simple.

The WebID certificate selection dialog on Linux is ugly and shows way too many pointless details – better integration does exist. However, the point stands: it is easy to select the WebID to login with.

Of course this does not differ much from other private/public key systems yet. However, it gets really interesting when you use WebIDs to share information. Imagine your social platform allows you to setup fine grained ACLs based on WebIDs. This person can read that photo, this person can write to that document, and so on. These people do not even need to have accounts on the service in question. Using their WebIDs they will have access to exactly that information.

Is It Safe?

In order to ensure security only two things need to be made sure: 1. The FOAF file your WebID is pointing to should be under your control or that of a trustworthy entity, and 2. as it always has been: make sure nobody steals your private key.

And even if you loose your private key, disabling the WebID is as easy as removing the public key from your FOAF profile (more details following later). Even replacing the certificate with a new one will never invalidate your WebID since it stays an identifier for yourself in the semantic web, independent of the certificate.

Where Can I Learn More?

If you like to read specs check out the latest WebID specs by the W3C WebID community group. Join the mailing list, chat on IRC #foaf, and watch the video showing how simple WebID is for the end user.

How Can I Try Myself?

I will present two ways to play with WebID. The first way is as simple as creating an account at id.myopenlink.net and clicking a few buttons. However, I will leave that for the next blog entry. The second way is the one that leads to a better understanding of WebID and is the result of my struggle with the matter. Be aware, however, that the following howto does show how we can do manually what tools will eventually do for us.

Step 1: Create Your Very Own FOAF Profile

FOAF – Friend Of A Friend – is essentially an RDF vocabulary which allows you to describe your social web. Your WebID will eventually resolve to a foaf:Person representing yourself. So the first thing to do is to decide what your WebID looks like. For simplicity I will use my own as an example: http://www.trueg.de/people/sebastian#me. While Turtle would be a much more readable representation of your FOAF profile I will use RDF+XML instead to increase the probability of server support. Let us start with the basics: the foaf:Person.

<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://www.trueg.de/people/sebastian#me">
    <foaf:name>Sebastian Trueg</foaf:name>
  </foaf:Person>
</rdf:RDF>

This is a simple RDF document describing a resource of type foaf:Person with one property: foaf:name. It should be fairly self-explanatory. To this you may add all sorts of information like blogs, email addresses, nick names, personal details, whatever you want. Just be sure to remember that all of this will be public.

There is one very important distinction I want to stress since I missed that for a long time: the document is not the person, ie. the URL of the document needs to differ from the WebID URL (hence the ‘#me’ fragment). To stress this fact lets use some more FOAF:

<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:PersonalProfileDocument rdf:about="http://www.trueg.de/people/sebastian">
    <dc:title>Sebastian Trueg's FOAF Profile</dc:title>
    <foaf:maker rdf:resource="http://www.trueg.de/people/sebastian#me" />
    <foaf:primaryTopic rdf:resource="http://www.trueg.de/people/sebastian#me" />
  </foaf:PersonalProfileDocument>

  <foaf:Person rdf:about="http://www.trueg.de/people/sebastian#me">
    <foaf:name>Sebastian Trueg</foaf:name>
  </foaf:Person>
</rdf:RDF>

As you can see the document itself only describes the person but it is not the same resource. Failing the make this distinction will result in an invalid WebID.

Create The X.509 Certificate

Once your FOAF file is done you need to get your certificate. This could be done via OpenSSL manually but it is tedious and error-prone. Thus, we fire up our web-browser and let it do the work for us by relying on a certificate generator like webid.fcns.eu/certgen.php and the browser’s own key generator.

All you need is the WebID and your name. The rest is optional. The nice thing here is that the web service will generate the certificate but your browser will generate the key locally and never send the private key to the service. It is safe.

Once the certificate has been generated it is saved to the browser’s own certificate storage thingy.

Copy The Certificate Public Key Into Your FOAF Profile

Now that your certificate has been created you can look at it in the browser’s preferences. In Firefox it can be found via Advanced->View Certificates->Your Certificates. Find the “Subject’s Public Key’ in the details and copy it into your FOAF profile (remove any whitespace in the process):

<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:foaf="http://xmlns.com/foaf/0.1/"
 xmlns:cert="http://www.w3.org/ns/auth/cert#">
[...]
  <foaf:Person rdf:about="http://www.trueg.de/people/sebastian#me">
    <foaf:name>Sebastian Trueg</foaf:name>
    <cert:key>
      <cert:RSAPublicKey>
        <cert:modulus rdf:datatype="http://www.w3.org/2001/XMLSchema#hexBinary">a2425fd56d265a45690a36524db6ae290d347a1905429918c7eed70c5bfbf3ce07316563173628d6dfe7b98e1f054446cab7d878953d85d1b8d41b9ffbc983cda6a1daa951207e920205f7172c6f850a3c5d191d314624d984208b365412331d8c260c81813c54ae3b7f3eac6b5f3e152f2ffb6ac951bc0fb3e629171e5c3ded9fd8dcc6ca7e2313bb59186a78af44ee20c9fd4f70c4f443efcecfd75c7c7c19a54c2c749f804cff45cb78e811a6f0993d5da13ba67c426b028d204d908ea9e11794db80bbed569cc99676830db03df98a7462e089fe0e9d5a786ee4eb1ce227e2918a5bf071b4e5a2325b0c67e8b80096e23b58afe3144e5e6b76c9d2fb8e41</cert:modulus>
        <cert:exponent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">65537</cert:exponent>
      </cert:RSAPublicKey>
    </cert:key>
  </foaf:Person>
</rdf:RDF>

Having the public key in the FOAF profile is essential since that is where the servers you want to identify yourself to will read it from.

Upload The FOAF File to Your Server

Finally you need to upload the FOAF file to your web server and make it accessible. Since we are using a fancy WebID this requires some very basic Apaching through .htaccess in the ‘people’ folder which sets up some redirects:

# Turn off MultiViews
Options -MultiViews

# Directive to ensure *.rdf files served as appropriate content type,
# if not present in main apache config
AddType application/rdf+xml .rdf

# Rewrite engine setup
RewriteEngine On
RewriteBase /people

# Rewrite rule to serve HTML content from the vocabulary URI if requested
RewriteCond %{HTTP_ACCEPT} !application/rdf\+xml.*(text/html|application/xhtml\+xml)
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^([^/]+)$ $1/index.html [R=303]

# Rewrite rule to serve RDF/XML content from the vocabulary URI if requested
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^([^/]+)$ $1/foaf.rdf [R=303]

# Rewrite rule to serve the RDF+XML content from the vocabulary URI by default
RewriteRule ^([^/]+)$ $1/foaf.rdf [R=303]

Now your shiny new WebID is accessible by web services. Go ahead and verify your WebID on my-profile.eu and use it to sign into id.myopenlink.net.

Next up: more on id.myopenlink.net and possible usage of WebID for social capabilities in Nepomuk.

Nepomuk Gives Back Your CPU Cycles…

…at least partially. With the introduction of the Data Management Service we got a very powerful way to put your data into Nepomuk: storeResources (thank you Vishesh). I will not go into the details here but the important fact is that the file indexer, the Akonadi indexer, the TV namer, the movie integration, and maybe some more rely on this method.

Thus, it is obvious that improving its performance means improving the performance of the overall system in general.

Now like I said it is a very powerful method. Sadly this results in very complex code that is not easy to wrap your head around. So optimizing it is not trivial. I tried anyway and tackled one of the main parts of it: the resource identification. After playing around a little, trying different ways to break up the main query I got to a point where I am satisfied. And here is why:

Total time spent in the resource identification code for 196692 indexed Akonadi items.

Average time spent in the resource identification code for 196692 indexed Akonadi items.

Finally some nice bar graphs! So what does this tell us? Well, the most important bit is that with my patch we save roughly 3 hours when indexing the 196692 Akonadi items I used for testing. But maybe more importantly: if the identification is faster the whole indexing is smoother and eats less CPU since it is throttled and, thus, has shorter phases of high CPU usage.

I took the Akonadi indexing as an example since the difference here is impressive. The file indexing is also slightly improved but by far not as significant as the Akonadi indexing. I suppose this is due to the very simply identification that is required during file indexing (some very basic nco:Contact merging is all that is required most of the time).

So, this is what you will get with KDE 4.8.2. One of several going away presents…