Virtuoso – Once More With Feeling

The Virtuoso backend for Soprano and, thus, Nepomuk can be considered rather stable now. So the big tests can begin, as the goal is to make it the default backend in KDE 4.4. Let me summarize the important information again:

Step 1

Get Virtuoso 5.0.12 from the Sourceforge download page. Virtuoso 6 is NOT supported (not yet anyway).

Step 2

Hints for packagers: Soprano only needs two files: the virtuoso-t binary and the virtodbc_r(.so) ODBC driver. Everything else is optional. (For the self-compiling folks out there: --disable-all-vads is your friend.)

Step 3

Install libiodbc, which is what the Soprano build will look for. (Virtuoso itself is simply a run-time dependency.)

Step 4

Rebuild Soprano from current svn trunk (Remember: Redland is still mandatory. Its memory storage is used all over Nepomuk!)

Step 5

Edit ${KDEHOME}/share/config/nepomukserverrc with your favorite editor. In the “[Basic Settings]” section add “Soprano Backend=virtuosobackend”. Do not touch the main repository settings!
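
If you would rather script the edit, here is a minimal sketch in Python. It assumes the file is plain "Key=Value" text under bracketed section headers, which is what the step above describes. The function name set_soprano_backend and the second section in the sample are made up for illustration and are not part of Nepomuk.

```python
import configparser
import io


def set_soprano_backend(rc_text: str, backend: str = "virtuosobackend") -> str:
    """Return rc_text with 'Soprano Backend' set in [Basic Settings].

    Hypothetical helper for illustration; all other sections are left as read.
    """
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep the key case exactly as written
    parser.read_string(rc_text)
    if not parser.has_section("Basic Settings"):
        parser.add_section("Basic Settings")
    parser["Basic Settings"]["Soprano Backend"] = backend
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Sample input; "[Some Other Section]" is a placeholder, not a real Nepomuk section.
sample = """[Basic Settings]
Configured repositories=main
Start Nepomuk=true

[Some Other Section]
Some Key=unchanged
"""

updated = set_soprano_backend(sample)
print(updated)
```

To apply it for real you would read the file, pass its content through the function and write the result back, ideally after making a backup copy.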

Step 6

Restart Nepomuk. I propose the following procedure to gather debugging information in case something goes wrong:
Shut down Nepomuk completely:

 # qdbus org.kde.NepomukServer /nepomukserver org.kde.NepomukServer.quit

Restart it, redirecting its error output into a temporary file (bash syntax):

 # nepomukserver 2> /tmp/nepomuk.stderr

Step 7

Wait for Nepomuk to convert your data. If you are running KDE trunk you even get a nice progress bar in the notification area (BTW: does anyone know why it won’t show the title?)

And Now?

That is already it. Now you can enjoy the new Virtuoso backend.

The development has taken a long time. But I want to thank OpenLink and especially Patrick van Kleef who helped a lot by fixing the last little tidbits in Virtuoso 5 for my unit tests to pass. Next step is Virtuoso 6.

And Yet Another Post About Virtuoso

Today nearly all problems are solved. OpenLink provided a patch that makes inserting very large literals (more than 1 megabyte in size) lightning fast, even with a very low buffer count. I also worked around the issue of URI encoding. The Soprano Virtuoso backend now simply percent-encodes, in all URIs used in queries, every character that is not unreserved as well as every reserved character that is not used in its special meaning. Man, that is a mouthful. Well, it seems to work fine, although I can always use more testing with weird file URLs (weird meaning that they contain characters like brackets and the like). I also fixed some error handling bugs.
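
To make the encoding rule a bit more concrete, here is a small Python sketch. It is not the backend's code (that is C++), only an illustration of the rule for the simple case of a local file URL, where the only reserved character kept in its special meaning is the "/" path separator. The function name encode_file_url is made up for this example.

```python
from urllib.parse import quote


def encode_file_url(path: str) -> str:
    """Percent-encode a local path for use as a file URL in a query.

    quote() never touches the unreserved characters (letters, digits,
    "-", ".", "_", "~"). "/" is listed as safe because it keeps its
    special meaning as the path separator. Everything else is encoded.
    """
    return "file://" + quote(path, safe="/")


print(encode_file_url("/home/user/notes [draft] (v2).txt"))
# prints file:///home/user/notes%20%5Bdraft%5D%20%28v2%29.txt
```

Brackets, parentheses and spaces are exactly the kind of "weird" characters mentioned above. Non-ASCII characters are encoded as their UTF-8 bytes.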

So what is left? Well, there are a few hacks in the Virtuoso backend which are rather ugly. One example is the detection of query result types. To determine if the result is boolean, bindings, or a graph it actually checks the name and number of result columns. Urgh! It would be nicer to check for the type of the result. Seems like graph results are BLOBs.

Anyway, enough for tonight. I am tired. Here is the patch to make Virtuoso not hang when Strigi adds nie:PlainTextContent literals of big files:

Index: sqlrcomp.c
===================================================================
RCS file: virtuoso-opensource/libsrc/Wi/sqlrcomp.c,v
retrieving revision 1.9
diff -u -r1.9 sqlrcomp.c
--- sqlrcomp.c  20 Aug 2009 17:47:22 -0000      1.9
+++ sqlrcomp.c  13 Oct 2009 16:11:49 -0000
@@ -65,7 +65,7 @@
 {
 va_list list;
 char temp[2000];
-  int ret;
+  int ret, rest_sz, copybytes;
 va_start (list, string);
 ret = vsnprintf (temp, sizeof (temp), string, list);
 #ifndef NDEBUG
@@ -75,11 +75,16 @@
 va_end (list);
 #ifndef NDEBUG
 if (*fill + strlen (temp) > len - 1)
-    GPF_T1 ("overflow in strncpy");
+    GPF_T1 ("overflow in memcpy");
 #endif
-  strncpy (&text[*fill], temp, len - *fill - 1);
+  rest_sz = (len - fill[0]);
+  if (ret >= rest_sz)
+    copybytes = ((rest_sz > 0) ? rest_sz : 0);
+  else
+    copybytes = ret+1;
+  memcpy (text+fill[0], temp, copybytes);
 text[len - 1] = 0;
-  *fill += (int) strlen (temp);
+  fill[0] += ret;
 }
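
The length logic of the patch can be modelled in a few lines of Python. This is only a sketch of the arithmetic on a bytearray, not the Virtuoso code: the function name append_bounded is made up, and the 2000 byte limit of the temporary buffer is ignored. The likely reason for the speedup is that strncpy pads the destination with zero bytes up to the given count, so the old code touched the whole rest of the buffer on every call, while the new code copies only the bytes it needs.

```python
def append_bounded(text: bytearray, fill: int, formatted: bytes) -> int:
    """Append formatted plus a NUL terminator to text at offset fill.

    Mirrors the copybytes computation of the patch and returns the new fill.
    """
    length = len(text)
    ret = len(formatted)             # what vsnprintf would have returned
    temp = formatted + b"\x00"       # C strings carry a terminating NUL
    rest_sz = length - fill          # room left in the destination
    if ret >= rest_sz:
        # Not enough room: copy what fits (nothing if already past the end).
        copybytes = rest_sz if rest_sz > 0 else 0
    else:
        # Enough room: copy the string including its terminator.
        copybytes = ret + 1
    text[fill:fill + copybytes] = temp[:copybytes]
    text[length - 1] = 0             # the buffer always ends with a NUL
    return fill + ret                # like the patch, advance by ret


buf = bytearray(16)
fill = append_bounded(buf, 0, b"select ")
fill = append_bounded(buf, fill, b"?r")
print(fill, bytes(buf[:fill]))
```

Note that fill advances by the full formatted length even when the text was cut off, just as in the patch.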

A Bit Of Nepomuk Goodness On Your Developer Fingertips

And yet another technical blog entry. This time it concerns the latest improvements in sopranocmd (Soprano >= 2.3.61). For starters there is an improved NRLModel which provides automatic query prefix expansion. What does that mean? Well, it means that debugging Nepomuk data is simpler now, as you can use all ontologies stored in the Nepomuk database without defining their prefixes. With sopranocmd (I am using nepomukcmd, a little alias I introduced in the Nepomuk Tips and Tricks) this feature is enabled via the --nrl parameter. Thus, querying all tags becomes:

nepomukcmd --nrl query "select ?r where { ?r a nao:Tag . }"
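
For those wondering what prefix expansion amounts to, here is a rough Python sketch of the idea. It is not how NRLModel is implemented: the prefix table is hard-coded here as an assumption (Nepomuk takes the prefixes from the ontologies stored in the database), the function name expand_prefixes is made up, and the naive pattern does not skip string literals or prefixes that are already declared.

```python
import re

# Assumed prefix table for illustration only.
KNOWN_PREFIXES = {
    "nao": "http://www.semanticdesktop.org/ontologies/2007/08/15/nao#",
    "tmo": "http://www.semanticdesktop.org/ontologies/2008/05/20/tmo#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

# A prefix is a name followed by ":" and a local name, not part of a longer
# word or of an IRI written in angle brackets.
PREFIX_USE = re.compile(r"(?<![\w<:/])([A-Za-z][\w-]*):(?=[A-Za-z_])")


def expand_prefixes(query: str, prefixes=KNOWN_PREFIXES) -> str:
    """Prepend a PREFIX declaration for every known prefix used in query."""
    used = []
    for name in PREFIX_USE.findall(query):
        if name in prefixes and name not in used:
            used.append(name)
    header = "".join(f"PREFIX {name}: <{prefixes[name]}>\n" for name in used)
    return header + query


print(expand_prefixes("select ?r where { ?r a nao:Tag . }"))
```

The printed query is the one from above with a PREFIX line for nao in front of it, which is roughly what you would otherwise have to type by hand.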

The second new thing is an improved import command. Again enabled with the --nrl parameter, it creates a new named graph of type nrl:KnowledgeBase and puts all new statements (which are not in a graph yet) into it. As described in Nepomuk Data Layout it also adds a metadata graph and the creation date. It actually makes use of NRLModel::createGraph.

The reason I did this was to be able to migrate the tmo:Task instances I had created on the laptop to my desktop machine. Just as an example I will show the procedure here:

First I export all the tasks and their properties on the laptop:

nepomukcmd --nrl export "describe ?r where { ?r a tmo:Task . }" \
     /tmp/task-dump.n4

And then on the desktop I simply import them into Nepomuk:

nepomukcmd --nrl import /tmp/task-dump.n4

Update: You can also use query prefixes for statement listing now. Thus the following is now possible:

nepomukcmd --nrl list "" a tmo:Task

(Even the “a” keyword now maps to rdf:type.)

Nepomuk And Some CMake Magic

And now for some technical build scripting. Some of you may know nepomuk-rcgen. It can be used to generate subclasses of Nepomuk::Resource which provide nice wrapper methods around Nepomuk::Resource::property and Nepomuk::Resource::setProperty. Although it has been in need of improvement for quite some time, it can be very useful as code gets way more readable.

So far the cmake integration was, well, let’s face it, bad. It was the result of my first attempts at creating a bit more complex cmake macros. Funnily enough the kdepim developers did not find a real solution to the problem of creating source files at cmake-time either. They would have a long list of classes that would be generated in the CMakeLists.txt file.

Today I finally sat down and fixed it. The new macro (currently only available in playground due to the KDE 4.3 feature freeze) does a much better job. It creates the list of source and header files to be generated at cmake-time while the files themselves are generated at build-time. This way not all files need to be recompiled every time cmake is run. This was the biggest problem of the previous macro.

Usage is fairly simple and in line with known KDE macros such as kde4_add_ui_files:

NEPOMUK_ADD_ONTOLOGY_CLASSES(<sources-var>
     [ONTOLOGIES] <onto-file1> [<onto-file2> ...]
     [TEMPLATES <template1> [<template2> ...]]
   )

A typical example looks as follows:

NEPOMUK_ADD_ONTOLOGY_CLASSES(
   nie_SOURCES
   ONTOLOGIES
   nie.rdfs
   nfo.rdfs
   nco.rdfs
 )
kde4_add_library(nie STATIC ${nie_SOURCES})
target_link_libraries(nie ${NEPOMUK_LIBRARIES})

I will probably also add a SERIALIZATION parameter so we do not have to rely on the crude auto-detection based on the file extension.

Another tool which is probably even less known is Soprano’s onto2vocabularyclass. It also generates code. (Actually not classes but namespaces. Thus the name is misleading.) And it also uses an ontology file as input. The result is a namespace like Soprano::Vocabulary::NAO which gives access to static QUrl objects representing the classes and properties defined in the ontology. A very useful tool when one has to create many queries and does not want to hardcode all the RDF namespaces.

Just to show how one would typically use such a namespace here is a SPARQL query using Soprano:

QString query = QString("select ?r where { ?r a %1 . }")
      .arg( Soprano::Node::resourceToN3( Soprano::Vocabulary::NAO::Tag() ) );

And just like before I created a little cmake macro that creates these namespaces for you. It has a few more parameters than the one above but is still quite simple:

kde4_add_ontology(<sources-var>
     <onto-file>
     <onto-name>
     <namespace to use>
     <serialization>
     [VISIBILITY <visibility-name>])

One simply has to specify the ontology file, the name of the C++ namespace (NAO in the example above), the parent namespace (Soprano::Vocabulary in the example above), the serialization of the ontology file, and optionally the visibility in case the namespace should be exported publicly (see the onto2vocabularyclass documentation for details).

Like before I will close with a little example taken from playground (simplified):

find_file(PIMO_TRIG_SOURCE
   pimo.trig)
kde4_add_ontology(pimo_LIB_SRCS
   ${PIMO_TRIG_SOURCE}
   "PIMO"
   "Nepomuk::Vocabulary"
   "trig"
   VISIBILITY "nepomuk")
kde4_add_library(pimo SHARED ${pimo_LIB_SRCS})

“Are We There Yet?” – The Long Road To a Stable Soprano Virtuoso Backend

The first time I blogged about the Virtuoso backend for Soprano, testing it was still a bit bumpy: one had to manually start a server locally and then connect to it. Now the situation has changed. I just committed the changes that allow the Soprano Virtuoso backend to spawn a local instance of the Virtuoso server. Thus, using the Virtuoso backend in Nepomuk is now as simple as specifying it in the Nepomuk server configuration file ~/.kde/share/config/nepomukserverrc as mentioned before:

[Basic Settings]
Configured repositories=main
Soprano Backend=virtuoso
Start Nepomuk=true

(Notice that the backend name is now “virtuoso” without the “backend” suffix. Actually Soprano can now handle both for convenience.)

The backend is still not 100% stable. Some queries still fail due to encoding problems, which may include conversion of the data.

Apart from the spawning of a local instance the backend can now handle indices (for improved query speed) and the state of the full text index. Both are controlled through backend options as explained in the Soprano Virtuoso backend documentation.

Virtuoso Packaging

Now that that is established, let’s move on to a few packaging questions. There were arguments that Virtuoso uses too much disk space. Well, the fact is that the Virtuoso server as used by Soprano only needs a very small portion of the whole Virtuoso installation: the “virtuoso-t” binary and the “virtodbc_r.so” ODBC driver. Combined they come to a size of roughly 9 MB. I think that is no big problem (as a comparison the whole Virtuoso installation is about 150 MB). :)

Packagers should split the Virtuoso package into small parts. I recommend providing one package for the server binary, one for the ODBC drivers, one for the VAD files, and so on. That way the hard disk usage will be minimal.

A New Blog and The Possible End to the Java Dependency in Nepomuk-KDE

I changed the blog system again. Why? It is rather simple: since the update of the blogging system of kdedevelopers.org it is virtually unusable. All the nice features are gone and I am told, one needs to have an account to comment. I need something spiffy that works nicely. WordPress was recommended to me.

Now back to the real content: A new backend for Soprano which could finally let us drop the sesame2 backend which needs a JVM.

Today I committed the new Virtuoso Soprano backend. Virtuoso is a powerful SQL/RDF DB server created by OpenLink Software. OpenLink provides an open-source version of their database server released under the GPL. The official description from their homepage reads:

At core, Virtuoso is a high-performance object-relational SQL database. As a database, it provides transactions, a smart SQL compiler, powerful stored-procedure language with optional Java and .Net server-side hosting, hot backup, SQL-99 support and more. It has all major data-access interfaces, such as ODBC, JDBC, ADO .Net and OLE/DB.
[...]
OpenLink Virtuoso supports SPARQL embedded into SQL for querying RDF data stored in Virtuoso’s database. SPARQL benefits from low-level support in the engine itself, such as SPARQL-aware type-casting rules and a dedicated IRI data type. This is the newest and fastest developing area in Virtuoso.

Virtuoso not only frees us from the shackles of the Java dependency. It also provides a bunch of features that Sesame2 does not. Most importantly Virtuoso features full text indexing which can be used within SPARQL queries:

select ?foo where { ?foo rdfs:label ?label . ?label bif:contains "bar" . }

At the moment (using the sesame2 backend) we need to make use of the CLucene based full-text-indexing layer which I implemented for Soprano. While this works very well, it forces us to split each query into a full-text part and a graph part and then merge the results. That tends to be slow.

Apart from that, Virtuoso also allows quite a lot of SPARQL magic with nested queries or even embedded SQL. Plus it supports SPARUL, the SPARQL Update Language, which will make a lot of code simpler and faster.

The nice guys at OpenLink were very open to the idea of using their server in KDE. With the release of Virtuoso 5.0.10 they introduced a new lite mode which trims the server down to our needs. The memory usage goes down to a minimum: IMHO roughly 80M is acceptable. (Especially since the JVM easily goes up to 10 times that value!) With that, in combination with disabling pretty much all features except for SPARQL, we have a decent desktop DB solution.

Over the next weeks I will try to get the new backend into shape to replace the sesame2 backend. Hopefully by then distributions will have created packages for Virtuoso. If you want to give it a spin yourself without waiting or even to help me ;) this is what you need to do:

  • Get Virtuoso 5.0.10 from the Sourceforge download page
  • Install Virtuoso (obviously)
  • Start Virtuoso with a config file similar to the one below
  • Change the Nepomuk server config file (~/.kde4/share/config/nepomukserverrc) – Add “Soprano Backend=virtuosobackend” to the Basic Settings section.
  • Restart Nepomuk

Now Nepomuk should convert all existing data to the new backend. This process can take a while. There might even be errors. But it lets you test the new features such as the integrated fulltext indexing (which still has to be enabled manually).

And now the virtuoso.ini file:

[Database]
DatabaseFile = soprano-virtuoso.db
ErrorLogFile = soprano-virtuoso.log
TransactionFile = soprano-virtuoso.trx
xa_persistent_file = soprano-virtuoso.pxa
ErrorLogLevel = 7
FileExtend = 100
MaxCheckpointRemap = 1000
Striping = 0
TempStorage = TempDatabase

[TempDatabase]
DatabaseFile = soprano-virtuoso-temp.db
TransactionFile = soprano-virtuoso-temp.trx
MaxCheckpointRemap = 1000
Striping = 0

[Parameters]
LiteMode = 1
ServerPort = 1111
DisableUnixSocket = 0
O_DIRECT = 0
CaseMode = 1
CheckpointAuditTrail = 0
AllowOSCalls = 0
DirsAllowed = .
PrefixResultNames = 0
ServerThreads = 5 ; down from 10
CheckpointInterval = 10 ; down from 60
MaxDirtyBuffers = 50 ; down from 1200
SchedulerInterval = 5 ; down from 10
FreeTextBatchSize = 1000

[HTTPServer]
DavRoot = DAV
EnabledDavVSP = 0
HTTPProxyEnabled = 0
TempASPXDir = 0
Charset = UTF-8
ServerThreads = 2 ; down from 5
KeepAliveTimeout = 5 ; down from 10
HTTPThreadSize = 10000 ; down from 280000

[AutoRepair]
BadParentLinks = 0

[Client]
SQL_PREFETCH_ROWS = 10
SQL_PREFETCH_BYTES = 4096
SQL_QUERY_TIMEOUT = 0
SQL_TXN_TIMEOUT = 0

[VDB]
ArrayOptimization = 0
NumArrayParameters = 10
VDBDisconnectTimeout = 1000
KeepConnectionOnFixedThread = 0

[Replication]
ServerEnable = 0
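
If you want to double-check such a file before starting the server, a few lines of Python are enough. This is only a sketch: the function name load_virtuoso_ini is made up, and it assumes that the "; down from ..." remarks are inline comments to be stripped. The fragment below repeats a few lines from the file above.

```python
import configparser


def load_virtuoso_ini(text: str) -> configparser.ConfigParser:
    """Parse virtuoso.ini style text, dropping '; ...' inline remarks."""
    parser = configparser.ConfigParser(
        interpolation=None,
        inline_comment_prefixes=(";",),  # strips remarks like "; down from 10"
    )
    parser.optionxform = str  # keep key case such as "LiteMode"
    parser.read_string(text)
    return parser


fragment = """[Parameters]
LiteMode = 1
ServerPort = 1111
ServerThreads = 5 ; down from 10

[HTTPServer]
ServerThreads = 2 ; down from 5
"""

ini = load_virtuoso_ini(fragment)
print("lite mode:", ini.getboolean("Parameters", "LiteMode"))
print("port:", ini.getint("Parameters", "ServerPort"))
print("server threads:", ini.getint("Parameters", "ServerThreads"))
```

A check like this makes it easy to spot a forgotten LiteMode line or a port that clashes with another Virtuoso instance.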