Debugging Nepomuk/Virtuoso’s CPU usage

Rabauke wrote a very good blog post about debugging Nepomuk and Virtuoso query performance on openSUSE.

David Faure also posted a Virtuoso patch on the Nepomuk mailing list which makes the Virtuoso status() command output the full queries instead of truncated ones. I will try to get the latter into Virtuoso upstream, maybe with an additional parameter to the function.

Nepomuk Tasks: KActivityManager Crash

After a little silence, during which I was occupied with Easter and OpenLink-related work, I bring you news about the second Nepomuk task: the KActivityManager crash.

Ivan Cukic already “fixed” the bug by simply not using Nepomuk but an SQLite backend (at least that is how I understood it, correct me if I am wrong). However, I wanted to fix the root of the original problem.

Soprano provides the communication channel between Nepomuk and its clients. It is based on a very simple custom protocol going through a local socket. So far QLocalSocket, i.e. Qt's implementation, was used. The problem with QLocalSocket is that it is a QObject and thus cannot live in two threads at the same time. The hacky solution was to maintain one socket per thread. Sadly that resulted in complicated maintenance code which was impossible to get right. Hence crashes like #269573 or #283451 (basically any crash involving Soprano::ClientConnection) were never fixed.

A few days ago I finally gave up and decided to get rid of QLocalSocket and replace it with my own implementation. The only problem: in order to keep Windows compatibility I had to keep the old implementation around, which meant adding quite a lot of #ifdefs.
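The core idea of the replacement is easy to sketch: one local socket shared by all threads and guarded by a mutex, which, unlike a QObject, has no thread affinity. The following is merely an illustration of that idea using POSIX APIs, not the actual Soprano code:

#include <QMutex>
#include <QMutexLocker>
#include <QByteArray>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

// Illustrative only: one shared connection, serialized by a mutex,
// instead of one QLocalSocket per thread.
class LocalSocket
{
public:
    LocalSocket() : m_fd( -1 ) {}
    ~LocalSocket() { if ( m_fd >= 0 ) ::close( m_fd ); }

    bool open( const char* path ) {
        QMutexLocker lock( &m_mutex );
        m_fd = ::socket( AF_UNIX, SOCK_STREAM, 0 );
        if ( m_fd < 0 )
            return false;
        sockaddr_un addr;
        std::memset( &addr, 0, sizeof( addr ) );
        addr.sun_family = AF_UNIX;
        std::strncpy( addr.sun_path, path, sizeof( addr.sun_path ) - 1 );
        return ::connect( m_fd, reinterpret_cast<sockaddr*>( &addr ), sizeof( addr ) ) == 0;
    }

    // Any thread may call this; the mutex keeps concurrent requests
    // from interleaving on the single connection.
    bool send( const QByteArray& data ) {
        QMutexLocker lock( &m_mutex );
        return ::write( m_fd, data.constData(), data.size() ) == data.size();
    }

private:
    int m_fd;
    QMutex m_mutex;
};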

And now I could use some testers for the new Soprano client library, which creates only a single connection to the server instead of one per thread. I already pushed the new code into Soprano's git master. So all you need to do is run KDE on top of that.

Oh, and while at it I finally fixed the problem with clients re-connecting. A restart of Nepomuk will no longer leave clients with dangling connections, unable to perform queries. That fix, however, is in kdelibs.

Well, the day was long, I am tired, and this blog post feels a little boring. So before it gets too long on top of that, I will stop.

Nepomuk Tasks: Let The Virtuoso Inferencing Begin

Only four days ago I started the experiment of funding specific Nepomuk tasks through donations. As with last year's fundraiser I was uncertain whether it was a good idea. That changed when, only a few hours later, two tasks had already reached their donation goal. Again it became obvious that the work done here is appreciated and that the "open" in Open Source is understood for what it actually is.

So despite my wife not being overly happy about it I used the weekend to work on one of the tasks: Virtuoso inferencing.

Inference?

As a quick reminder: the inferencer automatically derives new information from the data in the database. While Virtuoso can handle pretty much any inference rule you throw at it, we stick to the basics for now: if resource R1 is of type B and B derives from A, then R1 is also of type A. And: if R1 has property P1 with value "foobar" and P1 is derived from P2, then R1 also has property P2 with value "foobar".
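To make the property rule concrete: nao:prefLabel is declared a sub-property of rdfs:label, so once inference is active a label stored via nao:prefLabel can be found through rdfs:label directly:

# Ontology: nao:prefLabel rdfs:subPropertyOf rdfs:label .
# Data:     <r> nao:prefLabel "foobar" .
# With inference enabled this query now matches <r>, even though
# no rdfs:label triple was ever written:
select ?r where { ?r rdfs:label "foobar" . }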

Crappy Inference

This is already very useful and even mandatory in many cases. Until now we used what we called "Crappy Inferencer 1 & 2". Crappy Inferencer 1 was based on work done in the original Nepomuk project: it simply inserted triples for all sub-class and sub-property relations. That way we could simulate real inference by querying for something like

select * where {
  ?r ?p "foobar" . 
  ?p rdfs:subPropertyOf rdfs:label .
}

and catch all sub-properties of rdfs:label like nao:prefLabel or nie:title. While this works, it means poor performance, additional storage, and additional maintenance.

Crappy Inferencer 2 was even worse. It inserted rdf:type triples for all super-classes. This means that it would look at every added and removed triple to check whether it was an rdf:type triple; if so, it would add or remove the appropriate rdf:type triples for the super-types. That way we could do fast type queries without relying on Crappy Inferencer 1 and its rdfs:subClassOf method. But this meant even more maintenance and even more wasted storage space.

Introducing: Virtuoso Inference

So now we simply rely on Virtuoso to do all that and it does such a wonderful job. Thanks to Virtuoso graph groups we can keep our clean ontology separation (each ontology has its own graph) and still stick to a very simple extension of the queries:

DEFINE input:inference <nepomuk:/ontographgroup>
select * where {
  ?r rdfs:label "foobar" .
}

Brilliant. Of course there are still situations in which you do not want to use the inferencer. Imagine for example the listing of resource properties in the UI: with inference enabled it would be cluttered with all the inferred types and properties.

We do not want that. Inference is intended for the machine, not for the human; at least not like this. And since back in the day I did not think of adding query flags to Soprano, I simply introduced a new virtual query language: SparqlNoInference.
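On the Soprano level this can presumably be used through the user-defined query-language mechanism of Soprano::Model::executeQuery. A minimal sketch, where the exact spelling of the language name is an assumption:

#include <Soprano/Model>
#include <Soprano/QueryResultIterator>
#include <QDebug>

// model is an open Soprano::Model*. Run a query without inference by
// selecting the virtual query language instead of plain SPARQL.
Soprano::QueryResultIterator it =
    model->executeQuery( QLatin1String( "select * where { ?r rdfs:label \"foobar\" . }" ),
                         Soprano::Query::QueryLanguageUser,
                         QLatin1String( "SparqlNoInference" ) );
while ( it.next() )
    qDebug() << it.binding( "r" );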

Resource Visibility

While at it I also improved the resource visibility support by simplifying it. We do not need any additional processing anymore, which again means less work on startup and with every triple manipulation command. Again we save space and increase performance. But this also means that resource visibility filtering will no longer work as before. Nepoogle, for example, will need to be adjusted to the new way of filtering. Instead of

?r nao:userVisible 1 .

we now need

FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] }

Testing

The implementation is done. All that remains are the tests. I am already running all the patches but I still need to adjust some unit tests and maybe write new ones.

You can also test it. The code changes are, as always, spread over Soprano, kdelibs and kde-runtime. Both kdelibs and kde-runtime now contain a branch “nepomuk/virtuosoInference”. For Soprano you need git master.

Look for regressions of any kind so we can merge this as soon as possible. The goal is KDE 4.9.

About Strigi, Soprano, Virtuoso, CLucene, and Libstreamanalyzer

There seems to be a lot of confusion about the parts that make up the Nepomuk infrastructure. Let me shed some light.

Soprano is the RDF data storage and parsing library used in Nepomuk. Soprano provides a plugin for Virtuoso which is mandatory and requires libiodbc. It does NOT work with unixODBC (it compiles but simply does not work due to some extensions in libiodbc required for RDF data handling). In addition to the Virtuoso plugin, Nepomuk requires the Raptor parser plugin and the Redland storage plugin for ontology import.

CLucene is not required by Nepomuk anymore. It was used for full-text indexing in early versions of KDE but has been superseded by the full-text indexing functionality of Virtuoso. Consequently the Soprano CLucene module is not required anymore and its development has effectively been stopped. It will most likely not be part of Soprano 3 (unless someone interested steps up and does the required work).

Virtuoso is a full-blown SQL server with a powerful RDF layer on top. OpenLink, the company developing Virtuoso, maintains an open channel of communication with the Nepomuk developers and introduced a "lite" mode for us (please no comments on how it still is not "lite"). Virtuoso 6.1.3 is the current version. It has a Unicode bug which can be fixed by applying the patch attached to KDE bug 271664. Virtuoso 6.1.4 will be released soon and contains several fixes for bugs reported by me. An update is highly recommended.

Libstreamanalyzer and libstreams are libraries which are part of the Strigi project. In addition, the Strigi project contains strigidaemon, an alternative scheduler for indexing files which is based on CLucene and not used by Nepomuk. I once asked the maintainer of Strigi to split libstreams and libstreamanalyzer into their own independently released packages. He refused, which is understandable seeing as he has little time for Strigi as it is. As a consequence I advise packagers to use libstreamanalyzer from git master or the latest tag instead of the released tarballs.

I think that is all. If I missed something please comment and I will update the post.

Nepomuk – What Comes Next

After a very generous start to my fundraiser (thank you so much for your support) it is time I get into more detail about what you are actually supporting. Originally I wanted to do that by updating nepomuk.kde.org. I will still do that but it will take a little more time than anticipated. Thus, I will simply start with another blog post.

Well then: apart from cleaning out the bug database at bugs.kde.org (this will be a hard one), continuing to support app developers with Nepomuk integration, and maintaining the whole Nepomuk stack, Soprano, the Shared-Desktop-Ontologies, and some smaller Nepomuk-based applications, there are some very specific tasks I want to work on in the near future (in this case the near future roughly spans the next half year).

Semantic Saving and Loading of Documents

Pretty much forever we have managed documents in a very nerdy manner: the way they are stored on the local file system. We navigate physical folders, create complex hierarchies, get lost in them, recreate parts of them, never find our files again, and still keep on doing it.

The vision I have is that we do not think about folders at all anymore, since for me they are a restriction of the 3-dimensional world that has no place in a computer. A document in the real world can only be archived in a single folder; on the computer there is no such restriction. We can do whatever we want. Thus, the idea is to organize documents closer to the way our brain organizes information: based on context and familiar topics and relations.

This vision, however, is not feasible in the near future. There is simply too much legacy data and too many applications relying on the classical folder structure. Thus, the idea would be a hybrid approach which combines classical local folders with advanced semantic relations and meta-data. This is a development which I already started with fantastic input from the community.

The next steps include finishing the prototype and creating its counterpart, the file open dialog. This will be a very tough one, for which I will ask for your support again since that worked out so well with the save dialog.

Excerpts

A typical use case is bookmarking pages or copying specific parts of a document into some collage of snippets. However, as always, we lose the relation to the source. This is where Nepomuk will shine: instead of copying part of the document we simply define the excerpt (the portion the user is interested in: a marked section, a specific position in the document ranging up to its end, or part of an image) as a resource in Nepomuk which we can annotate like any other resource. This means that we can relate it to topics, people, projects, files, other snippets, and web pages, comment on it, and so on, all the while keeping the relation to the original document.

This allows for nice things like automatic collages (think of selecting all snippets which mention a certain topic or relate to a certain project and were created before some date, and merging them all into one view), simpler quoting of things you read before (since the relation to the original document is intact you have easy access to the details required for the quote; very interesting for academics), and a simple listing of all interesting quotes from documents by some person you like (an example query).

Sharing Nepomuk Data – Step 1

Whenever we create information we want to share it with others. Vishesh Handa already started a very ambitious project to support several types of data sharing through a plugin system. What I want to do first is much less but nonetheless interesting: sharing bits of Nepomuk data manually.

This means that you define the information you want to share and then simply export it into a file which you can send to someone else. They in turn can import this information into their own Nepomuk system. For starters there will be no tracking of the origin of the data, nor anything like keeping two ratings at the same time. That is for later.

This is a very simple first step towards sharing which should be fairly easy to implement, the GUI being the only really hard part. The Data Management Service already takes care of export and import for us.

Once this works, adding the same to e-mail sending or Telepathy communications will be very simple. In fact the Telepathy-KDE guys (namely Daniele E. Domenichelli aka Dr. Danz) have been interested in that for a long time. (I wish I were with you guys at Cambridge now!)

To this end I will probably finally get to work on Ginkgo, the generic Nepomuk resource management tool developed by Mandriva’s Stephane Lauriere.

For App Developers: Resource Watcher

For the longest time the only way of getting notified of changes in the Nepomuk database was via the very generic Soprano signals Model::statementsAdded and Model::statementsRemoved. Checking for specific changes meant inspecting every statement that was added or removed, or doing a pull each time one of those signals was emitted. An ugly and not very fast solution.

With the introduction of the Data Management Service this will finally change. We already have a draft API for Nepomuk::ResourceWatcher which allows clients to opt in to change notifications of different kinds: changes to specific resources, new resources of specific types, or changes to specific properties.

The initial API is there and already partially integrated with the Data Management Service. However, I would like to add some more nice features, like only watching for non-indexed data or excluding changes done by a specific application (useful for an app which makes changes itself and does not want to be bothered with notifications about them). Also, integration into the DMS needs to be finished, as not all features exposed in the API are supported yet.
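Based on that draft, client code could look roughly like the following sketch. Header location, method, and signal names follow the draft API and may still change, so treat the details as assumptions:

#include <Nepomuk/Resource>
#include <Soprano/Vocabulary/NAO>

// Watch a single resource for changes to one specific property.
Nepomuk::ResourceWatcher* watcher = new Nepomuk::ResourceWatcher( this );
watcher->addResource( Nepomuk::Resource( myFile ) );
watcher->addProperty( Soprano::Vocabulary::NAO::hasTag() );

connect( watcher, SIGNAL( propertyAdded( Nepomuk::Resource,
                                         Nepomuk::Types::Property,
                                         QVariant ) ),
         this, SLOT( slotTagAdded( Nepomuk::Resource,
                                   Nepomuk::Types::Property,
                                   QVariant ) ) );

watcher->start();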

The technical aspect: KDE frameworks

With KDE 5.0, kdelibs and kde-runtime will be split into smaller parts to make it simpler for application developers to depend on our powerful technologies. This also means a split for Nepomuk. I already started this split, but a lot more work needs to be done to make Nepomuk an independent part of the KDE frameworks family.

Part of this also involves getting rid of deprecated legacy API and improving API where we were previously restricted by binary-compatibility issues.

So this is it for now. Reading over it again I get the feeling that it might be too much already – especially since I am fairly certain that new things will pop up all over the place. Nonetheless I will try to stay the course for once. ;)

Thanks again for your support.


Nepomuk Frameworks – kdelibs 5.0: What To Do

Development of kdelibs 5.0 has begun in the frameworks branch of its git repository. The main goal for kdelibs 5.0 is that there will be no more kdelibs as it is now: kdelibs (and kde-runtime) will be split up into smaller pieces to lower the barrier for non-KDE developers to use parts of the power we as a KDE development community provide. The rough idea is that there will be three groups of libraries/frameworks:

  1. Tier 1: components which depend only on Qt and on no other lib/component from KDE.
  2. Tier 2: components which depend on Qt and on other libraries from Tier 1.
  3. Tier 3: components which may depend on anything.

This of course includes Nepomuk and is a great opportunity for us to reorganize code and get rid of deprecated junk we needed to keep around for binary compatibility.

To this end we had two meetings on IRC this week to discuss how we want to proceed for KDE frameworks. It was a lot of fun (at least for me), mostly because it was rather easy to reach good decisions and fun to radically kick out stuff that had been bugging us for a long time. The result is documented in a wiki page, but let me summarize the basics again:

We will have four new git repositories:

  1. nepomuk-core – The main Nepomuk repository which is required in any case and provides the basic Nepomuk functionality including storage, query, and the like. I already began work on this in a nepomuk-core scratch repo. This repository will contain:

    • Nepomukserver
    • nepomukservicestub
    • Extension ontologies
    • Core services: Storage/DMS, Query service, Filewatch service, File indexer
    • Core library: the current libnepomuk including Nepomuk::Resource and Nepomuk::Types, libnepomukquery, and libnepomukdatamanagement
  2. nepomuk-ui – A repository containing Nepomuk UI extensions. For starters this will contain:

    • SearchLineEdit and SearchWidget
    • KFileMetaDataWidget which is currently living in KIO
  3. nepomuk-kde-kio – A KIO Nepomuk extension repository which contains the KIO slaves we currently provide in kde-runtime:

    • nepomuksearch – General purpose queries
    • timeline – browse files by date
    • nepomuk – a simple kio slave which allows browsing of Nepomuk resources via HTML pages
  4. nepomuk-kde-config-ui – The repository for Nepomuk configuration extensions based on KDE technology. It contains:
    • nepomukcontroller
    • Nepomuk KCM

We feel this gives a clean separation, and we would actually urge packagers not to split those repositories up any further.

Apart from that we decided a few more things – which API to drop, some internals, how to get rid of Soprano::Model in the Nepomuk API altogether, and so on. Actually I am very happy to soon have dedicated Nepomuk repositories as that will make development easier.

And BTW: kde-runtime/nepomuk master is frozen for commits. Development already moved to the new nepomuk-core repository.

A Million Ways To Do It Wrong

It is a sad truth: when it comes to creating data for the Nepomuk semantic desktop there are a million ways to do it wrong and basically only one way to get it right. Typically people will choose from the first set of ways. While that is of course bad, they are not to blame. Who wants to read page after page of documentation and reference guide? Who wants to dive into the depths of RDF and all that ontology stuff when they just need to store a note (yes, this blog was inspired by a real problem)? Nobody, that's who! Thus, the Nepomuk API should do most of the work. Sadly it does not. It basically allows you to do everything: Resource::setProperty will happily use classes or even invalid URLs as properties without giving any feedback to the developer (see the snippet below). Why is that? Well, I suppose there are at least three reasons:

  1. Back in the day I figured people would almost always use the resource-generator to create their own convenience classes which handle the types and properties properly.
  2. The Resource class is probably the oldest part of the whole Nepomuk stack.
  3. A basic lack of time, drive, and development power.
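To illustrate the kind of silent misuse meant here, consider this hypothetical snippet: it passes a class URI where a property URI belongs, and nothing complains.

#include <Nepomuk/Resource>
#include <Nepomuk/Variant>
#include <Soprano/Vocabulary/NAO>

// Hypothetical misuse: NAO::Tag is a class, not a property, yet the
// call "succeeds" silently - no error, no warning.
Nepomuk::Resource note( QUrl( "nepomuk:/res/mynote" ) );
note.setProperty( Soprano::Vocabulary::NAO::Tag(),
                  Nepomuk::Variant( QString( "important" ) ) );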

So what can we do about this situation? Vishesh and I have been discussing the idea of a central D-Bus API for Nepomuk data management a million times (as you can see, today "a million" is my go-to expression when I want to say "a lot"). So far, however, we could not come up with a good API that solves all problems, is future-proof (to a certain extent), and performs well. That has not changed: I still do not know the solution. But I have some ideas as to what the API should do for the user in terms of data integrity.

  1. Ensure that only valid existing properties are used, and provide a good error message in case a class, an invalid URL, or something otherwise non-existing is used instead. This would also mean that one could only use ontologies that have been imported into Nepomuk. But since the ontology loader already supports fetching ontologies from the internet this should not be a big problem.
  2. Ensure that the ranges of the properties are honoured. This is pretty straightforward for literal ranges; in that case we could also do some fancy auto-conversion to simplify the usage, but in essence it is easy. The case of a non-literal range is a bit more tricky. Do we want to force proper types or do we assume that the object resource has the required type? I suppose flags would be of use (see the sketch after this list):
    • ClosedWorld – It is required that the object resource has the type of the range. If it does not, the call fails.
    • OpenWorld – The object resource will simply get the range type added. This is no problem since resources can be of several types.

    This would also mean that each property needs to have a properly defined range. AFAIK this is currently not the case for all NIE ontologies. I think it is time for ontology unit tests!

  3. Automatically handle pimo:Things to a certain extent: here I could imagine that trying to add a PIMO property to a resource would automatically add it to the related pimo:Thing instead.
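As a sketch of what the ClosedWorld/OpenWorld range check could look like, here is a hypothetical helper built on the existing Nepomuk::Types API. The function itself is an assumption, not existing DMS code:

#include <Nepomuk/Types/Property>
#include <Nepomuk/Types/Class>
#include <Nepomuk/Resource>

// Hypothetical range check for non-literal ranges, as discussed above.
bool ensurePropertyRange( const QUrl& propertyUri,
                          Nepomuk::Resource& object,
                          bool openWorld )
{
    Nepomuk::Types::Property property( propertyUri );
    Nepomuk::Types::Class range = property.range();
    if ( !range.isValid() )
        return true; // no range defined - nothing to enforce

    if ( object.hasType( range.uri() ) )
        return true; // object already has the required type

    if ( openWorld ) {
        // OpenWorld: simply add the range type to the object resource.
        object.addType( range.uri() );
        return true;
    }

    // ClosedWorld: the call fails.
    return false;
}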

Moving this from the client library into a service would have other benefits, too.

  • The service could be used from languages other than C++, or even from applications not using KDE.
  • The service could perform optimizations when it comes to storing triples, updating resources, caching, you name it.
  • The service could provide change notifications which are much more useful than the pretty useless Soprano::Model signals.
  • The service could perform any number of integrity tests before executing the actual commands on the database, thus improving the quality of the data in Nepomuk altogether.

This blog entry is not about presenting the one solution to solve all our problems. It is merely a brain-dump, trying to share some of the random thoughts that go through my head when taking a walk in the woods. Nonetheless this is an issue that needs tackling at one point or another. In any case my ideas are saved for the ages. :)

Small Things are Happening…

I have to admit: I am a sucker for nice APIs. And, yes, I am sort of in love with some of my own creations. Well, at least until I find the flaws and cannot remove them due to binary compatibility issues (see Soprano). This may sound a bit egomaniacal, but let's face it: we almost never get credit for good API or good API documentation. So we need to congratulate ourselves.

My new pet is the Nepomuk Query API. As its name says, it can be used to query Nepomuk resources, and it sets out to replace as many hard-coded SPARQL queries as possible. It started out rather simple: matching a set of resources with different types of terms. But then Dario Freddi and his relatively complex Telepathy queries came along. So the challenge began. I tried to bend the existing API as much as possible to fit in the features he requested. One thing led to another, I suddenly found myself in need of optional terms, and a few days later things were not as simple as they had started out anymore.

ComparisonTerm was already the most complex term in the query API. But that did not mean much: basically you could set a property and a sub-term, and that was it. On Monday it became a bit more beastly. Now you can invert it, force the name of the variable used, give it a sort weight, change the sort order, and even set an aggregate function. And all that on only one type of term. At least I implemented optional terms separately.

To explain what all this is good for I will try to illustrate it with a few examples:

Say you want to find all tags a specific file has (and ignore the fact that there is a Nepomuk::Resource::tags method). This was not possible before inverted comparison terms came along. Now all you have to do is:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTerm(myFile)
);
tagTerm.setInverted(true);
Nepomuk::Query::Query query(tagTerm);

What happens is that subject and object change places in the ComparisonTerm, thus, the resulting SPARQL query looks something like the following:

select ?r where { <myFile> nao:hasTag ?r . }

Simple but effective, if slightly confusing. It gets better. Previously we only had the clumsy Query::addRequestProperty to get additional bindings from the query. It is very restricted as it only allows querying direct relations of the results. With ComparisonTerm::setVariableName we now have the generic counterpart. By setting a variable name, that variable is included in the bindings of the final query and can be retrieved via Result::additionalBinding. This allows retrieving any detail from any part of the query. Again we use the simplest example to illustrate:

Nepomuk::Query::ComparisonTerm labelTerm(
    Soprano::Vocabulary::NAO::prefLabel(),
    Nepomuk::Query::Term() );
labelTerm.setVariableName( "label" );
Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    labelTerm );
tagTerm.setInverted(true);
Nepomuk::Query::Query query( tagTerm );

This query lists all tags including their labels. Again the resulting SPARQL query would look something like the following:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . }

And silently I used another little gimmick that I introduced: ComparisonTerm can now handle invalid properties and invalid sub-terms, which then simply act as wildcards (or are represented by a variable, in SPARQL terms).

Now on to the next feature: sort weights. The idea is simple: you can sort the search results using any value matched by a ComparisonTerm. So let’s extend the above query by sorting the tags according to their labels.

labelTerm.setSortWeight( 1 );

And the resulting SPARQL query will reflect the sorting:

select ?r ?label where { <myFile> nao:hasTag ?v1 . ?v1 nao:prefLabel ?label . } order by ?label

Here I used a sort weight of 1 since I only have the one term that includes sorting. But in theory you can include any number of ComparisonTerms in the sorting. The higher the weight the more important the sort order.

We are nearly done. Only one feature is left: aggregate functions. The Virtuoso SPARQL extensions (and SPARQL 1.1, too) support aggregate functions like count or max. These are now supported in ComparisonTerm. They are only useful in combination with a forced variable name (in which case they are included in the additional bindings) or with a sort weight. If we go back to our tags we could, for example, count the number of tags each file has attached:

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
Nepomuk::Query::Query query(tagTerm);

And the resulting SPARQL query will be along the lines of:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . }

Now we can of course sort by number of tags:

tagTerm.setSortWeight(1, Qt::DescendingOrder);

And we get:

select ?r count(?v1) as ?cnt where { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } order by desc(?cnt)

And just because it is fun, let us make the tagging optional so we also get files with zero tags (be aware that normally one should have at least one non-optional term in the query to get useful results; in this case we are on the safe side since we are using a FileQuery):

Nepomuk::Query::ComparisonTerm tagTerm(
    Soprano::Vocabulary::NAO::hasTag(),
    Nepomuk::Query::ResourceTypeTerm(Soprano::Vocabulary::NAO::Tag())
);
tagTerm.setAggregateFunction(
    Nepomuk::Query::ComparisonTerm::Count
);
tagTerm.setVariableName("cnt");
tagTerm.setSortWeight(1, Qt::DescendingOrder);
Nepomuk::Query::FileQuery query(
    Nepomuk::Query::OptionalTerm::optionalizeTerm(tagTerm)
);

And with the SPARQL result of this beauty I finish my little session of self-congratulations:

select ?r count(?v1) as ?cnt where { ?r a nfo:FileDataObject . OPTIONAL { ?r nao:hasTag ?v1 . ?v1 a nao:Tag . } . } order by desc(?cnt)

Virtuoso – Once More With Feeling

The Virtuoso backend for Soprano and, thus, Nepomuk can now be considered rather stable. So the big tests can begin, as the goal is to make it the standard backend in KDE 4.4. Let me summarize the important information again:

Step 1

Get Virtuoso 5.0.12 from the SourceForge download page. Virtuoso 6 is NOT supported (not yet, anyway).

Step 2

Hints for packagers: Soprano only needs two files: the virtuoso-t binary and the virtodbc_r(.so) ODBC driver. Everything else is optional. (For the self-compiling folks out there: --disable-all-vads is your friend.)

Step 3

Install libiodbc, which is what the Soprano build will look for (Virtuoso is simply a run-time dependency).

Step 4

Rebuild Soprano from current svn trunk (Remember: Redland is still mandatory. Its memory storage is used all over Nepomuk!)

Step 5

Edit ${KDEHOME}/share/config/nepomukserverrc with your favorite editor. In the “[Basic Settings]” section add “Soprano Backend=virtuosobackend”. Do not touch the main repository settings!
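The relevant part of nepomukserverrc then looks like this:

[Basic Settings]
Soprano Backend=virtuosobackend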

Step 6

Restart Nepomuk. I propose the following procedure to gather debugging information in case something goes wrong:
First, shut down Nepomuk completely:

 # qdbus org.kde.NepomukServer /nepomukserver org.kde.NepomukServer.quit

Restart it by piping the output into a temporary file (bash syntax):

 # nepomukserver 2> /tmp/nepomuk.stderr

Step 7

Wait for Nepomuk to convert your data. If you are running KDE trunk you even get a nice progress bar in the notification area. (BTW: does anyone know why it won't show the title?)

And Now?

That is already it. Now you can enjoy the new Virtuoso backend.

The development has taken a long time. But I want to thank OpenLink, and especially Patrick van Kleef, who helped a lot by fixing the last little tidbits in Virtuoso 5 to make my unit tests pass. Next step: Virtuoso 6.

And Yet Another Post About Virtuoso

Today nearly all problems are solved. OpenLink provided a patch that makes inserting very large literals (more than 1 megabyte in size) lightning fast, even with a very low buffer count. I also worked around the issue of URI encoding: the Soprano Virtuoso backend now simply percent-encodes all non-unreserved characters, and all reserved characters that are not used in their special meaning, in URIs used in queries. Man, that is a mouthful. Well, it seems to work fine, although I can always use more testing with weird file URLs (weird meaning containing characters like brackets and the like). I also fixed some error handling bugs.
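To give a rough idea of what that encoding rule means in practice, here is a small illustration using Qt's QByteArray::toPercentEncoding. The exact character sets the backend uses are not reproduced here, so treat this as an approximation:

#include <QByteArray>
#include <QDebug>

int main()
{
    // A "weird" file path containing brackets and parentheses.
    const QByteArray path( "/home/me/weird [file] (1).txt" );

    // Keep '/' in its special URI meaning, percent-encode everything
    // else that is not an unreserved character.
    const QByteArray encoded = path.toPercentEncoding( "/" );

    qDebug() << encoded; // "/home/me/weird%20%5Bfile%5D%20%281%29.txt"
    return 0;
}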

So what is left? Well, there are a few hacks in the Virtuoso backend which are rather ugly. One example is the detection of query result types: to determine whether a result is boolean, bindings, or a graph, the backend actually checks the name and number of result columns. Urgh! It would be nicer to check the type of the result. It seems graph results are BLOBs.

Anyway, enough for tonight. I am tired. Here is the patch that stops Virtuoso from hanging when Strigi adds nie:PlainTextContent literals of big files:

Index: sqlrcomp.c
===================================================================
RCS file: virtuoso-opensource/libsrc/Wi/sqlrcomp.c,v
retrieving revision 1.9
diff -u -r1.9 sqlrcomp.c
--- sqlrcomp.c  20 Aug 2009 17:47:22 -0000      1.9
+++ sqlrcomp.c  13 Oct 2009 16:11:49 -0000
@@ -65,7 +65,7 @@
 {
 va_list list;
 char temp[2000];
-  int ret;
+  int ret, rest_sz, copybytes;
 va_start (list, string);
 ret = vsnprintf (temp, sizeof (temp), string, list);
 #ifndef NDEBUG
@@ -75,11 +75,16 @@
 va_end (list);
 #ifndef NDEBUG
 if (*fill + strlen (temp) > len - 1)
-    GPF_T1 ("overflow in strncpy");
+    GPF_T1 ("overflow in memcpy");
 #endif
-  strncpy (&text[*fill], temp, len - *fill - 1);
+  rest_sz = (len - fill[0]);
+  if (ret >= rest_sz)
+    copybytes = ((rest_sz > 0) ? rest_sz : 0);
+  else
+    copybytes = ret+1;
+  memcpy (text+fill[0], temp, copybytes);
 text[len - 1] = 0;
-  *fill += (int) strlen (temp);
+  fill[0] += ret;
 }