Virtuoso 6.1.6 and KDE 4.9

Shortly after KDE 4.9 hit the net, Virtuoso 6.1.6 follows. Virtuoso 6.1.6 comes with a ton of fixes, improvements and optimizations, and updating is highly recommended for the best Nepomuk experience.

Virtuoso 6.1.6 has been tested by the Nepomuk team in cooperation with OpenLink Software before its release. It is the recommended release for Nepomuk. This is not only true for KDE 4.9 but for any version before it.

Get the sources while they are hot and build your packages.

Debugging Nepomuk/Virtuoso’s CPU usage

Rabauke wrote a very good blog post about debugging Nepomuk and Virtuoso query performance on OpenSuse.

David Faure also posted a Virtuoso patch on the Nepomuk mailing list which makes the Virtuoso status() command output the full queries instead of the truncated ones. I will try to get this patch into Virtuoso upstream, maybe with an additional parameter to the function.

Nepomuk Tasks: Let The Virtuoso Inferencing Begin

Only four days ago I started the experiment to fund specific Nepomuk tasks through donations. As with last year’s fundraiser I was uncertain if it was a good idea. That, however, changed when only a few hours later two tasks had already reached their donation goal. Again it became obvious that the work done here is appreciated and that the “open” in Open-Source is understood for what it actually is.

So despite my wife not being overly happy about it I used the weekend to work on one of the tasks: Virtuoso inferencing.

Inference?

As a quick reminder: the inferencer automatically infers information from the data in the database. While Virtuoso can handle pretty much any inference rule you throw at it we stick to the basics for now: if resource R1 is of type B and B derives from A then R1 is also of type A. And: if R1 has property P1 with value “foobar” and P1 is derived from P2 then R1 also has property P2 with value “foobar”.
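To make the two rules concrete, here is a minimal Python sketch. It is not how Virtuoso implements inference; the triples, the dictionaries and the function names are assumptions made up for illustration. It only shows what the rules produce for the R1/A/B and P1/P2 example above.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def closure(start: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return start plus everything reachable by following parent edges."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def infer(triples: Set[Triple],
          sub_class_of: Dict[str, Set[str]],
          sub_property_of: Dict[str, Set[str]]) -> Set[Triple]:
    """Apply the two basic rules: type inheritance and property inheritance."""
    result = set()
    for subject, predicate, value in triples:
        if predicate == "rdf:type":
            # R1 is of type B and B derives from A => R1 is also of type A.
            for cls in closure(value, sub_class_of):
                result.add((subject, "rdf:type", cls))
        else:
            # R1 has P1 "foobar" and P1 derives from P2 => R1 has P2 "foobar".
            for prop in closure(predicate, sub_property_of):
                result.add((subject, prop, value))
    return result


# Hypothetical data mirroring the example in the text.
stored = {("R1", "rdf:type", "B"), ("R1", "P1", "foobar")}
inferred = infer(stored, {"B": {"A"}}, {"P1": {"P2"}})
```

The sketch materializes the inferred triples only to make them visible; the point of the real inferencer is that they are derived at query time and never stored.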

Crappy Inference

This is already very useful and even mandatory in many cases. Until now we used what we called “crappy inferencing 1 & 2”. The Crappy inferencer 1 was based on work done in the original Nepomuk project and it simply inserted triples for all sub-class and sub-property relations. That way we could simulate real inference by querying for something like

select * where {
  ?r ?p "foobar" . 
  ?p rdfs:subPropertyOf rdfs:label .
}

and catch all sub-properties of rdfs:label like nao:prefLabel or nie:title. While this works it means bad performance, additional storage and additional maintenance.

The Crappy Inferencer 2 was even worse. It inserted rdf:type triples for all super-classes. This means that it would look at every added and removed triple to check if it was an rdf:type triple. If so it would add or remove the appropriate rdf:type triples for the super-types. That way we could do fast type queries without relying on the crappy inferencer 1 which relies on the rdfs:subClassOf method. But this meant even more maintenance and even more storage space wasted.

Introducing: Virtuoso Inference

So now we simply rely on Virtuoso to do all that and it does such a wonderful job. Thanks to Virtuoso graph groups we can keep our clean ontology separation (each ontology has its own graph) and still stick to a very simple extension of the queries:

DEFINE input:inference <nepomuk:/ontographgroup>
select * where {
  ?r rdfs:label "foobar" .
}

Brilliant. Of course there are still situations in which you do not want to use the inferencer. Imagine for example the listing of resource properties in the UI: with inference enabled it would also show every inferred property and type.

We do not want that. Inference is intended for the machine, not for the human, at least not like this. Since I did not think of adding query flags to Soprano back in the day, I simply introduced a new virtual query language: SparqlNoInference.

Resource Visibility

While at it I also improved the resource visibility support by simplifying it. We do not need any additional processing anymore. This again means less work on startup and with every triple manipulation command. Again we save space and increase performance. But this also means that resource visibility filtering will not work as before anymore. Nepoogle for example will need adjustment to the new way of filtering. Instead of

?r nao:userVisible 1 .

we now need

FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] }

Testing

The implementation is done. All that remains are the tests. I am already running all the patches but I still need to adjust some unit tests and maybe write new ones.

You can also test it. The code changes are, as always, spread over Soprano, kdelibs and kde-runtime. Both kdelibs and kde-runtime now contain a branch “nepomuk/virtuosoInference”. For Soprano you need git master.

Look for regressions of any kind so we can merge this as soon as possible. The goal is KDE 4.9.

Virtuoso Open-Source Moved to GitHub

Ever since 2006 OpenLink Software has provided its Open-Source version of Virtuoso (VOS), the high-performance SQL server with a powerful RDF/SPARQL data management layer on top.

So far the sources have been developed in an internal cvs repository which was published through the Virtuoso sourceforge pages.

On March 21 OpenLink took the next step towards Open Development by moving to git as its version management system. The sources are now hosted in the VOS GitHub repository.

As mentioned on the VOS git usage pages, OpenLink now accepts GitHub pull requests and patches. Be sure to read the notes on git branching policy in VOS which are based on the git-flow approach by Vincent Driessen, which by the way is an interesting read independent of VOS.

Most importantly it is now a lot simpler to follow the development of Virtuoso Open-Source. Simply clone the git repository and switch to the appropriate develop branch:

$ git clone git://github.com/openlink/virtuoso-opensource.git
$ cd virtuoso-opensource
$ git checkout -t remotes/origin/develop/6

For details on the branches used see the already mentioned VOS git usage guide.

Refer to the VOS building instructions if the following is not enough for you:

$ ./autogen.sh
$ ./configure --prefix=/usr/local --with-layout=<LAYOUT>
$ make
$ make install

where <LAYOUT> is one of Gnu, Debian, Gentoo, Redhat, Freebsd, opt, Openlink. The latter two force the prefix.

A Word (or Two) on Removable Storage Media Handling in Nepomuk

While fixing existing Nepomuk bugs and trying to close them as they come in I also look into other things. Last week it was the improved file indexer scheduling and file modification handling. This week it is about another improvement in the handling of queries which involve removable media. Ignacio Serantes already found one bug in the URL encoding before. This time he wanted to search through all mounted removable storage media and realized that he could not. I just fixed that. In order to understand how I did that we need to go into detail about how Nepomuk handles removable media.

Removable Storage Media in Nepomuk

Files on removable storage media are a problem when it comes to meta data stored in Nepomuk. As long as the medium is mounted we can simply identify the files through their local file path. But as soon as it is unmounted the paths are no longer valid. To make things worse we could mount the medium at another mount point the next time or mount another medium (which obviously does not contain the files in question) at the same mount point. So we need a way around that problem. Ever since 4.7 Nepomuk has a rather fancy way of doing that.

Internally Nepomuk uses a stack of Soprano::FilterModels which perform several operations on the data that passes through them. One of these models is the RemovableStorageModel. This model does one thing: it converts the local file URLs of files and folders on removable media into mount-path-independent URLs and vice versa. Currently it supports removable disks like USB keys or external hard disks (any storage that has a UUID), optical media, NFS and Samba mounts. The nice thing about it is that this conversion happens transparently to the client. Thus, a client simply uses the local file URLs according to the current mount path and does not care about anything else. It will always get the correct results.

To understand this better we should look at an example. Imagine we have a USB key inserted with UUID “xyz” which is mounted at /media/disk. Now if we add information about a file /media/disk/myfile.txt to Nepomuk the following happens: The RemovableStorageModel will convert the URL file:///media/disk/myfile.txt into filex://xyz/myfile.txt. This is a custom URL scheme which consists of the device UUID and the relative path. When querying the file the model does the conversion in the other direction. So far so simple.
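The conversion in both directions can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual RemovableStorageModel code: the function names and the `mounts` dictionary (mount path to device UUID) are assumptions, and percent-encoding is ignored entirely.

```python
from typing import Dict


def to_filex(url: str, mounts: Dict[str, str]) -> str:
    """Convert a local file URL into a mount-path-independent filex URL."""
    prefix = "file://"
    if not url.startswith(prefix):
        return url
    path = url[len(prefix):]
    # Try the longest mount path first so nested mounts resolve correctly.
    for mount_path in sorted(mounts, key=len, reverse=True):
        base = mount_path.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return "filex://" + mounts[mount_path] + path[len(base):]
    return url  # not on a known removable medium


def from_filex(url: str, mounts: Dict[str, str]) -> str:
    """Convert a filex URL back into a local file URL if the medium is mounted."""
    prefix = "filex://"
    if not url.startswith(prefix):
        return url
    uuid, _, relative = url[len(prefix):].partition("/")
    for mount_path, mount_uuid in mounts.items():
        if mount_uuid == uuid:
            base = "file://" + mount_path.rstrip("/")
            return base + "/" + relative if relative else base
    return url  # medium not mounted: keep the filex URL


# The example from the text: a USB key with UUID "xyz" mounted at /media/disk.
mounts = {"/media/disk": "xyz"}
stored_url = to_filex("file:///media/disk/myfile.txt", mounts)
```

If the same key is later mounted at a different path, only the `mounts` mapping changes; the stored filex URL stays valid.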

Queries are where it gets a little more complicated. Imagine we want to query all files in a certain directory on the removable medium (ideally the SPARQL would be hidden by the Nepomuk query API). We would then perform a query like the following simplified one.

select ?r where {
  ?r nie:isPartOf ?p . 
  ?p nie:url <file:///media/disk/somefolder> . }

If we passed this query on to Virtuoso unchanged we would not get any results since there is no resource with nie:url <file:///media/disk/somefolder>. So the RemovableStorageModel steps in again and does some query tweaking (rather primitive tweaking seeing that we do not have a SPARQL parser in Nepomuk). The query is converted into

select ?r where {
  ?r nie:isPartOf ?p .
  ?p nie:url <filex://xyz/somefolder> . }

And suddenly we get the expected results.

Of course this is still rather simple. It gets more complicated when SPARQL REGEX filters are involved. Imagine we wanted to look for all files in some sub-tree on a removable medium. We would then use a query along the lines of the following:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/disk/somefolder/')) . }

As before passing this query directly on to Virtuoso would not yield any results. The RemovableStorageModel needs to do its magic first:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^filex://xyz/somefolder/')) . }

This is what the model did before Ignacio wanted to query all his removable media mounted somewhere under /media at once. Obviously he did something like:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/')) . }

The result, however, was empty. This is simply because there was no exact match to any mount path of any of the removable media and RemovableStorageModel did not replace anything. The solution was to include additional filters for all the candidates in addition to the already existing filter. We need to keep the existing filter in case there is anything else under /media which is not a removable medium and, thus, has normal local file:/ URLs.

If we imagine that we have an additional mounted removable medium with UUID “foobar” then the query would be converted into something like the following.

select ?r where {
  ?r nie:url ?url .
  FILTER((REGEX(STR(?url), '^file:///media/') ||
          REGEX(STR(?url), '^filex://xyz/') || 
          REGEX(STR(?url), '^filex://foobar/'))) . }

This way we get the expected results. (The additional brackets are necessary in case the filter already contains more than one term.)
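The rewriting described above can be approximated with plain text substitution, in the same primitive spirit as the real thing. The following Python sketch is not the Nepomuk implementation; the regular expression, the function name and the `mounts` dictionary (mount path to UUID) are assumptions for illustration, and it only handles REGEX terms written exactly like the ones in the examples.

```python
import re
from typing import Dict

# Matches terms like: REGEX(STR(?url), '^file:///media/')
FILE_REGEX_TERM = re.compile(r"REGEX\(STR\((\?\w+)\),\s*'\^file://([^']*)'\)")


def rewrite_regex_filters(query: str, mounts: Dict[str, str]) -> str:
    """Rewrite file:// REGEX terms; mounts maps mount paths to device UUIDs."""

    def replace(match):
        variable, path = match.group(1), match.group(2)
        candidates = []
        for mount_path, uuid in sorted(mounts.items()):
            mount_dir = mount_path.rstrip("/") + "/"
            if path.startswith(mount_dir):
                # The pattern points inside one medium: a plain replacement is enough.
                rest = path[len(mount_dir):]
                return f"REGEX(STR({variable}), '^filex://{uuid}/{rest}')"
            if mount_dir.startswith(path):
                # The medium is mounted below the pattern: add it as a candidate.
                candidates.append(f"REGEX(STR({variable}), '^filex://{uuid}/')")
        if not candidates:
            return match.group(0)
        # Keep the original term for non-removable files and bracket the group.
        return "(" + " || ".join([match.group(0)] + candidates) + ")"

    return FILE_REGEX_TERM.sub(replace, query)


mounts = {"/media/disk": "xyz", "/media/hd": "foobar"}
query = ("select ?r where { ?r nie:url ?url . "
         "FILTER(REGEX(STR(?url), '^file:///media/')) . }")
rewritten = rewrite_regex_filters(query, mounts)
```

The extra pair of brackets around the joined terms is what keeps the rewritten group intact when the original FILTER already combines several terms.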

Well, I personally think this is a very clean solution where clients only have to consider filex:/ and its friends nfs:/, smb:/, and optical:/ if the media are not mounted. I already drafted one way of handling that a while back. But that will be perfected another day. ;)

For now let me, as always, close with the hint that development like this is still running on your donations:

Click here to lend your support to: Nepomuk - The semantic desktop on KDE and make a donation at www.pledgie.com !
Click here to donate to Nepomuk via Moneybookers

The Hunt For Nepomuk Bugs Continues

Let me open with a few stats just to brag:

  • Top bug killer on the last commit digest
  • Number of Nepomuk crash reports now below 100
  • Overall number of Nepomuk bugs down to 163 (this is actually not much, have a look at the related statistics)
  • I closed some serious bugs this week (details below)

If you want to track the progress you can use the following links to check from time to time:

Finally I want to present two fixes I did this last week just to show what kind of work needs to be done in order to fix problems in Nepomuk:

1. Bug 281136 – Nepomuk queries containing unicode characters fail

The problem presented itself as follows: whenever the user executed a query containing extended characters such as German umlauts, French accents, or for example any Russian character, the query would not return any results.

After some testing I realized that the queries simply failed when being delivered to Virtuoso because of Nepomuk’s automatic search excerpt extraction. It turned out that Virtuoso’s bif:search_excerpt method cannot handle wide characters which is exactly what it got. So I turned to the Virtuoso team for help and got a workaround which essentially means that we convert the wide characters to UTF-8. However, this results in stripped search excerpts so the story does not end yet. I am waiting for a better solution from the Virtuoso guys.

2. Nepomuk deletes annotations of files on removable media

This was a very interesting bug – to me at least. The problem was that Nepomuk would delete the manually added information like tags, ratings, relations to other files, and so on from files that are stored on an external hard disk.

Now to understand this problem better I have to explain a bit how Nepomuk handles external media: Nepomuk uses Soprano’s API to access RDF data. This is done through a whole stack of what we call models, each of which performs some operations on the data that passes through. One of these models handles external media. It converts each URL of a file on an external medium into a new URL which is independent of the medium’s mount point.

Imagine for example that the external hard disk with UUID “foobar” is mounted at /media/hd. Then a URL like file:///media/hd/myfile.txt is converted to filex://foobar/myfile.txt. That way Nepomuk will find the file again even when the disk is mounted at another path. This conversion happens transparently for all clients, meaning they only work with the local file:/ URLs. A nice side-effect is that when the disk is not mounted any code that performs clean-up like removing data for non-existing files will ignore those entries since they have no relation to the mount point.

On to the bug. Thankfully Ignacio Serantes realized that he only lost the information from files that had spaces in their names. That already pointed to a URL encoding problem. When we convert URIs from and to strings we use percent encoding. If all goes well this works fine. However, if we have a bug we might end up percent-encoding the percent-encoded URI. This was the case in the removable media handling of Nepomuk. When converting the internal filex:/ URL back to its file:// counterpart the percent encoding got borked. As a result the clean-up code would check for the existence of the wrong local URL and remove the related data. The fix involved some trickery with QUrl and KUrl and reminded me that unit tests involving URIs should always check for possible percent-encoding problems.
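The double-encoding effect is easy to reproduce outside of Qt. The following Python sketch uses the standard `urllib.parse` module rather than QUrl or KUrl, so it only illustrates the class of bug, not the actual Nepomuk code; the file name is made up.

```python
from urllib.parse import quote, unquote

# A hypothetical file on the external disk with a space in its name.
path = "/media/hd/my file.txt"

# Encoding once is what we want: the space becomes %20.
encoded_once = quote(path)

# Encoding the already encoded string again also encodes the percent sign.
encoded_twice = quote(encoded_once)

# Decoding once only undoes one layer, so the double-encoded variant
# resolves to a file name that does not exist on disk.
decoded_once = unquote(encoded_once)
decoded_twice_encoded = unquote(encoded_twice)
```

A unit test for URL handling should therefore include at least one name with a space or another character that needs encoding, and check that the round trip returns the original path.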

Well, the hunt for bugs is going on. In the meantime I am also still hunting for Nepomuk funding.


About Strigi, Soprano, Virtuoso, CLucene, and Libstreamanalyzer

There seems to be a lot of confusion about the parts that make up the Nepomuk infrastructure. Let me shed some light.

Soprano is the RDF data storage and parsing library used in Nepomuk. Soprano provides a plugin for Virtuoso which is mandatory and requires libiodbc. It does NOT work with unixODBC (It compiles but simply does not work due to some extensions in libiodbc required for RDF data handling). In addition to the Virtuoso plugin Nepomuk requires the Raptor parser plugin and the Redland storage plugin for ontology import.

CLucene is not required in Nepomuk anymore. It was used for full-text indexing in early versions of KDE but has been superseded by the full-text indexing functionality of Virtuoso. Consequently the Soprano clucene module is not required anymore and development has effectively been stopped. It will most likely not be part of Soprano 3 (unless someone interested steps up and does the required work).

Virtuoso is a full-blown SQL server with a powerful RDF layer on top. OpenLink, the company developing Virtuoso, maintains an open channel of communication to the Nepomuk developers and introduced a “lite” mode for us (please no comments on how it still is not “lite”). Virtuoso 6.1.3 is the current version. It has a unicode bug which can be fixed by applying the patch attached to KDE bug 271664. Virtuoso 6.1.4 will be released soon and contains several fixes to bugs reported by me. An update is highly recommended.

Libstreamanalyzer and libstreams are libraries which are part of the Strigi project. In addition the Strigi project contains strigidaemon, an alternative scheduler for indexing files which is based on CLucene and not used by Nepomuk. I asked the maintainer of Strigi once to split libstreams and libstreamanalyzer into their own independently released packages. He refused which is understandable seeing as he has little time for Strigi as it is. As a consequence I advise packagers to either use libstreamanalyzer from git master or the latest tag instead of using released tarballs.

I think that is all. If I missed something please comment and I will update the post.

Just in Time For KDE SC 4.4: Virtuoso 6.1.0

Finally all testing and bugfixing is finished. OpenLink has done an outstanding job with this new release of Virtuoso. Again my thanks go out to the Virtuoso development team and Patrick van Kleef who was my contact to smooth out the issues which prevented us from using Virtuoso 6 with Nepomuk.

So now is the time for distributions to package Virtuoso 6.1.0 and for you to update it on your own. But wait, there is one little detail: the database format changed significantly between Virtuoso 5 and 6. That is why I wrote a little conversion tool called Virtuosoconverter which takes care of this problem (Caution: the build system will download the Virtuoso 5.0.12 sources which are roughly 60MB). Usage is simple:

  1. Shut down Nepomuk
  2. Install Virtuoso 6.1.0
  3. Run the Converter
  4. Restart Nepomuk

Virtuoso 6 offers a wide range of features which are yet to be exposed through Nepomuk. The fun is only just starting!

Hints for Distributors:

  • You might want to run the converter in auto mode before starting Nepomuk.
  • If you do not like the build system downloading the Virtuoso 5 sources simply put them in the source tree. The build system will pick them up and use them instead of downloading.

Updates:

  • If you have old Virtuoso V5 data and do not run the converter after updating to Virtuoso V6 Nepomuk will not start.
  • The converter is the only way to convert the data to the new database format (except if you run some sql commands on the server manually)

Virtuoso – Once More With Feeling

The Virtuoso backend for Soprano and, thus, Nepomuk can be seen as rather stable now. So now the big tests can begin as the goal is to make it the standard in KDE 4.4. Let me summarize the important information again:

Step 1

Get Virtuoso 5.0.12 from the Sourceforge download page. Virtuoso 6 is NOT supported. (not yet anyway)

Step 2

Hints for packagers: Soprano only needs two files: the virtuoso-t binary and the virtodbc_r(.so) ODBC driver. Everything else is optional. (For the self-compiling folks out there: --disable-all-vads is your friend.)

Step 3

Install libiodbc which is what the Soprano build will look for (Virtuoso is simply a run-time dependency.)

Step 4

Rebuild Soprano from current svn trunk (Remember: Redland is still mandatory. Its memory storage is used all over Nepomuk!)

Step 5

Edit ${KDEHOME}/share/config/nepomukserverrc with your favorite editor. In the “[Basic Settings]” section add “Soprano Backend=virtuosobackend”. Do not touch the main repository settings!

Step 6

Restart Nepomuk. I propose the following procedure to gather debugging information in case something goes wrong:
Shut down Nepomuk completely:

 # qdbus org.kde.NepomukServer /nepomukserver org.kde.NepomukServer.quit

Restart it, redirecting its error output into a temporary file (bash syntax):

 # nepomukserver 2> /tmp/nepomuk.stderr

Step 7

Wait for Nepomuk to convert your data. If you are running KDE trunk you even get a nice progress bar in the notification area (BTW: does anyone know why it won’t show the title?)

And Now?

That is already it. Now you can enjoy the new Virtuoso backend.

The development has taken a long time. But I want to thank OpenLink and especially Patrick van Kleef who helped a lot by fixing the last little tidbits in Virtuoso 5 for my unit tests to pass. Next step is Virtuoso 6.

And Yet Another Post About Virtuoso

Today nearly all problems are solved. OpenLink provided a patch that makes inserting very large literals (more than 1 megabyte in size) lightning fast, even with a very low buffer count. Also I worked around the issue of URI encoding. Now the Soprano Virtuoso backend simply percent-encodes all non-unreserved characters and all reserved characters that are not used in their special meaning in URIs used in queries. Man, that is a mouthful. Well, it seems to work fine although I can always use more testing with weird file URLs (weird means containing weird characters like brackets and the like). I also fixed some error handling bugs.

So what is left? Well, there are a few hacks in the Virtuoso backend which are rather ugly. One example is the detection of query result types. To determine if the result is boolean, bindings, or a graph it actually checks the name and number of result columns. Urgh! It would be nicer to check for the type of the result. Seems like graph results are BLOBs.

Anyway, enough for tonight. I am tired. Here is the patch to make Virtuoso not hang when Strigi adds nie:PlainTextContent literals of big files:

Index: sqlrcomp.c
===================================================================
RCS file: virtuoso-opensource/libsrc/Wi/sqlrcomp.c,v
retrieving revision 1.9
diff -u -r1.9 sqlrcomp.c
--- sqlrcomp.c  20 Aug 2009 17:47:22 -0000      1.9
+++ sqlrcomp.c  13 Oct 2009 16:11:49 -0000
@@ -65,7 +65,7 @@
 {
 va_list list;
 char temp[2000];
-  int ret;
+  int ret, rest_sz, copybytes;
 va_start (list, string);
 ret = vsnprintf (temp, sizeof (temp), string, list);
 #ifndef NDEBUG
@@ -75,11 +75,16 @@
 va_end (list);
 #ifndef NDEBUG
 if (*fill + strlen (temp) > len - 1)
-    GPF_T1 ("overflow in strncpy");
+    GPF_T1 ("overflow in memcpy");
 #endif
-  strncpy (&text[*fill], temp, len - *fill - 1);
+  rest_sz = (len - fill[0]);
+  if (ret >= rest_sz)
+    copybytes = ((rest_sz > 0) ? rest_sz : 0);
+  else
+    copybytes = ret+1;
+  memcpy (text+fill[0], temp, copybytes);
 text[len - 1] = 0;
-  *fill += (int) strlen (temp);
+  fill[0] += ret;
 }