The Hunt For Nepomuk Bugs Continues

Let me open with a few stats just to brag:

  • Top bug killer on the last commit digest
  • Number of Nepomuk crash reports now below 100
  • Overall number of Nepomuk bugs down to 163 (this is actually not much, have a look at the related statistics)
  • I closed some serious bugs this week (details below)

If you want to track the progress you can use the following links to check from time to time:

Finally I want to present two fixes I made this past week, just to show what kind of work needs to be done in order to fix problems in Nepomuk:

1. Bug 281136 – Nepomuk queries containing unicode characters fail

The problem presented itself as follows: whenever the user executed a query containing non-ASCII characters such as German umlauts, French accents, or any Russian character, the query would not return any results.

After some testing I realized that the queries simply failed when being delivered to Virtuoso because of Nepomuk’s automatic search excerpt extraction. It turned out that Virtuoso’s bif:search_excerpt method cannot handle wide characters, which is exactly what it got. So I turned to the Virtuoso team for help and got a workaround, which essentially means that we convert the wide characters to UTF-8. However, this results in stripped search excerpts, so the story does not end here: I am waiting for a better solution from the Virtuoso guys.

2. Nepomuk deletes annotations of files on removable media

This was a very interesting bug – to me at least. The problem was that Nepomuk would delete the manually added information like tags, ratings, relations to other files, and so on from files that are stored on an external hard disk.

Now to understand this problem better I have to explain a bit how Nepomuk handles external media: Nepomuk uses Soprano’s API to access RDF data. This is done through a whole stack of what we call models, each of which performs some operations on the data that passes through. One of these models handles external media. It converts each URL of a file on an external medium into a new URL which is independent of the medium’s mount point.

Imagine for example that the external hard disk with UUID “foobar” is mounted at /media/hd. Then a URL like file:///media/hd/myfile.txt is converted to filex://foobar/myfile.txt. That way Nepomuk will find the file again even when the disk is mounted at another path. This conversion happens transparently for all clients, meaning they only work with the local file:/ URLs. A nice side-effect is that when the disk is not mounted any code that performs clean-up like removing data for non-existing files will ignore those entries since they have no relation to the mount point.
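To make the conversion described above concrete, here is a minimal Python sketch. It is not Nepomuk's actual code: the function names `to_filex` and `to_file` and the hard-coded `MOUNTS` table are assumptions made purely for illustration, and a real implementation would ask the system which devices are mounted where.

```python
from urllib.parse import urlsplit

# Hypothetical table mapping mount points to device UUIDs (illustration only).
MOUNTS = {"/media/hd": "foobar"}


def to_filex(url, mounts=MOUNTS):
    """Rewrite a file:// URL below a known mount point to its filex:// form."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        return url
    for mount_point, uuid in mounts.items():
        if parts.path == mount_point or parts.path.startswith(mount_point + "/"):
            # Keep the (still percent-encoded) path relative to the mount point.
            relative = parts.path[len(mount_point):]
            return "filex://" + uuid + relative
    return url  # not on removable media: leave untouched


def to_file(url, mounts=MOUNTS):
    """Rewrite a filex:// URL back to file:// if its device is mounted."""
    parts = urlsplit(url)
    if parts.scheme != "filex":
        return url
    for mount_point, uuid in mounts.items():
        if parts.netloc == uuid:
            return "file://" + mount_point + parts.path
    return url  # device not mounted: keep the mount-independent form
```

With the example disk, `to_filex("file:///media/hd/myfile.txt")` yields `filex://foobar/myfile.txt`. If the same disk later shows up at another path, `to_file` with the updated table produces the new local URL, and while the disk is unmounted the entry simply stays in its `filex://` form. Note that the sketch never decodes or re-encodes the path, which matters for the bug described next.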

On to the bug. Thankfully Ignacio Serantes realized that he only lost the information from files that had spaces in their names. That already pointed to a URL encoding problem. When we convert URIs from and to strings we use percent encoding. If all goes well this works fine. However, if we have a bug we might end up percent-encoding the percent-encoded URI. This was the case in the removable media handling of Nepomuk. When converting the internal filex:/ URL back to its file:// counterpart the percent encoding got borked. As a result the clean-up code would check for the existence of the wrong local URL and remove the related data. The fix involved some trickery with QUrl and KUrl and reminded me that unit tests involving URIs should always check for possible percent-encoding problems.
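The double-encoding effect is easy to reproduce outside of Nepomuk. The following Python sketch only illustrates the class of bug, not the actual QUrl/KUrl code; the helper names are made up for the example.

```python
from urllib.parse import quote, unquote


def path_to_url(path):
    """Correct conversion: percent-encode the path exactly once."""
    return "file://" + quote(path)


def buggy_path_to_url(path):
    """Buggy conversion: encodes an already percent-encoded path again."""
    return "file://" + quote(quote(path))


def url_to_path(url):
    """Decode a file:// URL back to a local path (one decoding step)."""
    return unquote(url[len("file://"):])


original = "/media/hd/my file.txt"

good = path_to_url(original)       # the space becomes %20
bad = buggy_path_to_url(original)  # the % of %20 becomes %25, giving %2520

# The correct URL round-trips; the double-encoded one decodes to a path
# that does not exist on disk, so clean-up code would treat it as deleted.
round_trip_ok = url_to_path(good) == original
round_trip_broken = url_to_path(bad) != original
```

A file name without spaces or other special characters survives both conversions unchanged, which is why only files with spaces in their names were affected. A unit test for URL handling should therefore always include such names.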

Well, the hunt for bugs is going on. In the meantime I am also still hunting for Nepomuk funding.


About Strigi, Soprano, Virtuoso, CLucene, and Libstreamanalyzer

There seems to be a lot of confusion about the parts that make up the Nepomuk infrastructure. Let me shed some light.

Soprano is the RDF data storage and parsing library used in Nepomuk. Soprano provides a plugin for Virtuoso which is mandatory and requires libiodbc. It does NOT work with unixODBC (It compiles but simply does not work due to some extensions in libiodbc required for RDF data handling). In addition to the Virtuoso plugin Nepomuk requires the Raptor parser plugin and the Redland storage plugin for ontology import.

CLucene is not required in Nepomuk anymore. It has been used for full-text indexing in early versions of KDE but is superseded by the full-text indexing functionality of Virtuoso. Consequently the Soprano clucene module is not required anymore and development has effectively been stopped. It will most likely not be part of Soprano 3 (unless someone interested steps up and does the required work).

Virtuoso is a full-blown SQL server with a powerful RDF layer on top. OpenLink, the company developing Virtuoso, maintains an open channel of communication to the Nepomuk developers and introduced a “lite” mode for us (please no comments on how it still is not “lite”). Virtuoso 6.1.3 is the current version. It has a unicode bug which can be fixed by applying the patch attached to KDE bug 271664. Virtuoso 6.1.4 will be released soon and contains several fixes to bugs reported by me. An update is highly recommended.

Libstreamanalyzer and libstreams are libraries which are part of the Strigi project. In addition the Strigi project contains strigidaemon, an alternative scheduler for indexing files which is based on CLucene and not used by Nepomuk. I asked the maintainer of Strigi once to split libstreams and libstreamanalyzer into their own independently released packages. He refused which is understandable seeing as he has little time for Strigi as it is. As a consequence I advise packagers to either use libstreamanalyzer from git master or the latest tag instead of using released tarballs.

I think that is all. If I missed something please comment and I will update the post.

Just in Time For KDE SC 4.4: Virtuoso 6.1.0

Finally all testing and bugfixing is finished. OpenLink has done an outstanding job with this new release of Virtuoso. Again my thanks go out to the Virtuoso development team and Patrick van Kleef, who was my contact for smoothing out the issues which prevented us from using Virtuoso 6 with Nepomuk.

So now is the time for distributions to package Virtuoso 6.1.0 and for you to update it on your own. But wait, there is one little detail: the database format changed significantly between Virtuoso 5 and 6. That is why I wrote a little conversion tool called Virtuosoconverter which takes care of this problem (Caution: the build system will download the Virtuoso 5.0.12 sources which are roughly 60MB). Usage is simple:

  1. Shut down Nepomuk
  2. Install Virtuoso 6.1.0
  3. Run the Converter
  4. Restart Nepomuk

Virtuoso 6 offers a wide range of features which are yet to be exposed through Nepomuk. The fun is only just starting!

Hints for Distributors:

  • You might want to run the converter in auto mode before starting Nepomuk.
  • If you do not like the build system downloading the Virtuoso 5 sources simply put them in the source tree. The build system will pick them up and use them instead of downloading.

Updates:

  • If you have old Virtuoso V5 data and do not run the converter after updating to Virtuoso V6 Nepomuk will not start.
  • The converter is the only way to convert the data to the new database format (except if you run some sql commands on the server manually)

Virtuoso – Once More With Feeling

The Virtuoso backend for Soprano and, thus, Nepomuk can be seen as rather stable now. So now the big tests can begin as the goal is to make it the standard in KDE 4.4. Let me summarize the important information again:

Step 1

Get Virtuoso 5.0.12 from the Sourceforge download page. Virtuoso 6 is NOT supported. (not yet anyway)

Step 2

Hints for packagers: Soprano only needs two files: the virtuoso-t binary and the virtodbc_r(.so) ODBC driver. Everything else is optional. (For the self-compiling folks out there: --disable-all-vads is your friend.)

Step 3

Install libiodbc which is what the Soprano build will look for (Virtuoso is simply a run-time dependency.)

Step 4

Rebuild Soprano from current svn trunk (Remember: Redland is still mandatory. Its memory storage is used all over Nepomuk!)

Step 5

Edit ${KDEHOME}/share/config/nepomukserverrc with your favorite editor. In the “[Basic Settings]” section add “Soprano Backend=virtuosobackend”. Do not touch the main repository settings!

Step 6

Restart Nepomuk. I propose the following procedure to gather debugging information in case something goes wrong:
Shut down Nepomuk completely:

 # qdbus org.kde.NepomukServer /nepomukserver org.kde.NepomukServer.quit

Restart it, redirecting its error output into a temporary file (bash syntax):

 # nepomukserver 2> /tmp/nepomuk.stderr

Step 7

Wait for Nepomuk to convert your data. If you are running KDE trunk you even get a nice progress bar in the notification area (BTW: does anyone know why it won’t show the title?)

And Now?

That is already it. Now you can enjoy the new Virtuoso backend.

The development has taken a long time. But I want to thank OpenLink and especially Patrick van Kleef who helped a lot by fixing the last little tidbits in Virtuoso 5 for my unit tests to pass. Next step is Virtuoso 6.

And Yet Another Post About Virtuoso

Today nearly all problems are solved. OpenLink provided a patch that makes inserting very large literals (more than 1 megabyte in size) lightning fast, even with a very low buffer count. Also I worked around the issue of URI encoding. Now the Soprano Virtuoso backend simply percent-encodes all non-unreserved characters and all reserved characters that are not used in their special meaning in URIs used in queries. Man, that is a mouthful. Well, it seems to work fine although I can always use more testing with weird file URLs (weird means containing weird characters like brackets and the like). I also fixed some error handling bugs.
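For readers who want to see what that encoding rule means in practice, here is a rough Python sketch for the file URL case. It is an approximation under my own assumptions, not the code in the Soprano backend: the function name is hypothetical, and it treats only the "/" path separators as reserved characters used in their special meaning.

```python
from urllib.parse import quote


def encode_file_url_for_query(path):
    """Build a file:// URL in which every path segment is percent-encoded.

    Unreserved characters (letters, digits, "-", ".", "_", "~") are kept.
    Everything else inside a segment is encoded, including reserved
    characters such as brackets, "#" and "?". The "/" separators are kept
    because they carry their special meaning here.
    """
    segments = path.split("/")
    return "file://" + "/".join(quote(segment, safe="") for segment in segments)


weird = encode_file_url_for_query("/home/user/weird [file] (1).txt")
# weird == "file:///home/user/weird%20%5Bfile%5D%20%281%29.txt"
```

Encoding segment by segment is the design choice that keeps the separators intact while still escaping a "?" or "#" that happens to be part of a file name.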

So what is left? Well, there are a few hacks in the Virtuoso backend which are rather ugly. One example is the detection of query result types. To determine if the result is boolean, bindings, or a graph it actually checks the name and number of result columns. Urgh! It would be nicer to check for the type of the result. Seems like graph results are BLOBs.

Anyway, enough for tonight. I am tired. Here is the patch to make Virtuoso not hang when Strigi adds nie:PlainTextContent literals of big files:

Index: sqlrcomp.c
===================================================================
RCS file: virtuoso-opensource/libsrc/Wi/sqlrcomp.c,v
retrieving revision 1.9
diff -u -r1.9 sqlrcomp.c
--- sqlrcomp.c  20 Aug 2009 17:47:22 -0000      1.9
+++ sqlrcomp.c  13 Oct 2009 16:11:49 -0000
@@ -65,7 +65,7 @@
 {
 va_list list;
 char temp[2000];
-  int ret;
+  int ret, rest_sz, copybytes;
 va_start (list, string);
 ret = vsnprintf (temp, sizeof (temp), string, list);
 #ifndef NDEBUG
@@ -75,11 +75,16 @@
 va_end (list);
 #ifndef NDEBUG
 if (*fill + strlen (temp) > len - 1)
-    GPF_T1 ("overflow in strncpy");
+    GPF_T1 ("overflow in memcpy");
 #endif
-  strncpy (&text[*fill], temp, len - *fill - 1);
+  rest_sz = (len - fill[0]);
+  if (ret >= rest_sz)
+    copybytes = ((rest_sz > 0) ? rest_sz : 0);
+  else
+    copybytes = ret+1;
+  memcpy (text+fill[0], temp, copybytes);
 text[len - 1] = 0;
-  *fill += (int) strlen (temp);
+  fill[0] += ret;
 }

Virtuoso – for real!


Soprano 2.3.63 – that is the magic version number you need to look out for.

And then once you have updated your kdebase copy to the latest trunk you run your favorite text editor on ~/.kde/share/config/nepomukserverrc. In there you set Soprano Backend=virtuosobackend in the [Basic Settings] section. After that you simply restart Nepomuk as described in the corresponding howto. You can also log out and log back in again, but then you won’t be able to provide bug reports that are as nice.

Once done Nepomuk will convert your database. This can take a loooong time if strigi is enabled. But it will finish. :)

BTW: You need a recent snapshot of Virtuoso 5.0.12 for this to work.