Nepomuk Tasks: Let The Virtuoso Inferencing Begin


Only four days ago I started the experiment to fund specific Nepomuk tasks through donations. As with last year’s fundraiser, I was uncertain whether it was a good idea. That, however, changed when, only a few hours later, two tasks had already reached their donation goal. Again it became obvious that the work done here is appreciated and that the “open” in Open-Source is understood for what it actually is.

So despite my wife not being overly happy about it I used the weekend to work on one of the tasks: Virtuoso inferencing.

Inference?

As a quick reminder: the inferencer automatically infers information from the data in the database. While Virtuoso can handle pretty much any inference rule you throw at it, we stick to the basics for now: if resource R1 is of type B and B derives from A, then R1 is also of type A. And: if R1 has property P1 with value “foobar” and P1 is derived from P2, then R1 also has property P2 with value “foobar”.
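For the curious, here is a minimal Python sketch of those two rules over an in-memory set of triples. It only illustrates the logic and says nothing about how Virtuoso implements it; the function names and the example: resources are made up for this illustration.

```python
RDF_TYPE = "rdf:type"
SUB_CLASS = "rdfs:subClassOf"
SUB_PROPERTY = "rdfs:subPropertyOf"


def ancestors(triples, start, relation):
    """Return start plus everything reachable from it via relation."""
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == relation and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


def infer(triples):
    """Apply the two rules from the text and return all resulting triples."""
    result = set(triples)
    for s, p, o in triples:
        if p == RDF_TYPE:
            # Rule 1: R1 is of type B and B derives from A => R1 is of type A.
            for super_class in ancestors(triples, o, SUB_CLASS):
                result.add((s, RDF_TYPE, super_class))
        elif p not in (SUB_CLASS, SUB_PROPERTY):
            # Rule 2: R1 has P1 "v" and P1 derives from P2 => R1 has P2 "v".
            for super_property in ancestors(triples, p, SUB_PROPERTY):
                result.add((s, super_property, o))
    return result


# Hypothetical example data mirroring the names used in the text.
triples = {
    ("example:B", SUB_CLASS, "example:A"),
    ("example:P1", SUB_PROPERTY, "example:P2"),
    ("example:R1", RDF_TYPE, "example:B"),
    ("example:R1", "example:P1", "foobar"),
}
inferred = infer(triples)
```

The inferred set contains the four original triples plus the two derived ones: R1 is also of type A, and R1 also has P2 with value “foobar”.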

Crappy Inference

This is already very useful and even mandatory in many cases. Until now we used what we called “crappy inferencing 1 & 2”. The Crappy Inferencer 1 was based on work done in the original Nepomuk project and it simply inserted triples for all sub-class and sub-property relations. That way we could simulate real inference by querying for something like

select * where {
  ?r ?p "foobar" . 
  ?p rdfs:subPropertyOf rdfs:label .
}

and catch all sub-properties of rdfs:label like nao:prefLabel or nie:title. While this works it means bad performance, additional storage and additional maintenance.
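To make the cost visible, here is a rough Python sketch of that approach, assuming a plain in-memory set of triples. It is not the original Nepomuk code; the helper names and the example: properties are hypothetical.

```python
SUB_PROPERTY = "rdfs:subPropertyOf"


def materialize_sub_properties(triples):
    """Store a subPropertyOf triple for every direct and indirect super-property."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for s, p, o in closed if p == SUB_PROPERTY]
        for sub, mid in links:
            for mid2, sup in links:
                if mid == mid2 and (sub, SUB_PROPERTY, sup) not in closed:
                    closed.add((sub, SUB_PROPERTY, sup))
                    changed = True
    return closed


def query_by_super_property(triples, super_property, value):
    """Mimic: select ?r where { ?r ?p value . ?p rdfs:subPropertyOf super_property . }"""
    sub_properties = {s for s, p, o in triples
                      if p == SUB_PROPERTY and o == super_property}
    return {s for s, p, o in triples if p in sub_properties and o == value}


# Hypothetical two-level property hierarchy below rdfs:label.
store = materialize_sub_properties({
    ("example:title", SUB_PROPERTY, "example:prefLabel"),
    ("example:prefLabel", SUB_PROPERTY, "rdfs:label"),
    ("example:doc", "example:title", "foobar"),
})
matches = query_by_super_property(store, "rdfs:label", "foobar")
```

The store ends up with one extra triple (example:title as a direct sub-property of rdfs:label), and every query pays for the additional join.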

The Crappy Inferencer 2 was even worse. It inserted rdf:type triples for all super-classes. This means that it would look at every added and removed triple to check if it was an rdf:type triple. If so, it would add or remove the appropriate rdf:type triples for the super-types. That way we could do fast type queries without relying on the Crappy Inferencer 1, which relies on the rdfs:subClassOf method. But this meant even more maintenance and even more storage space wasted.
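The bookkeeping this requires can be sketched in Python as a toy store that hooks into every add and remove. This is an illustration under my own assumptions, not the actual implementation; the class name and the super_classes lookup table are hypothetical.

```python
RDF_TYPE = "rdf:type"


class TypeMaterializingStore:
    """Toy store that writes rdf:type triples for all super-classes."""

    def __init__(self, super_classes):
        # Hypothetical lookup: class -> all direct and indirect super-classes.
        self.super_classes = super_classes
        self.explicit = set()  # triples added by the caller
        self.triples = set()   # explicit triples plus materialized types

    def _add_types(self, resource, cls):
        self.triples.add((resource, RDF_TYPE, cls))
        for sup in self.super_classes.get(cls, ()):
            self.triples.add((resource, RDF_TYPE, sup))

    def add(self, s, p, o):
        self.explicit.add((s, p, o))
        self.triples.add((s, p, o))
        if p == RDF_TYPE:  # every added triple has to be inspected
            self._add_types(s, o)

    def remove(self, s, p, o):
        self.explicit.discard((s, p, o))
        self.triples.discard((s, p, o))
        if p == RDF_TYPE:  # ... and so does every removed one
            # Rebuild the type triples of s from the types that remain.
            self.triples = {t for t in self.triples
                            if not (t[0] == s and t[1] == RDF_TYPE)}
            for subject, predicate, cls in list(self.explicit):
                if subject == s and predicate == RDF_TYPE:
                    self._add_types(s, cls)


store = TypeMaterializingStore({"example:B": ["example:A"]})
store.add("example:R1", RDF_TYPE, "example:B")
types_after_add = {o for s, p, o in store.triples if p == RDF_TYPE}
store.remove("example:R1", RDF_TYPE, "example:B")
types_after_remove = {o for s, p, o in store.triples if p == RDF_TYPE}
```

Even in this toy version, removing a type means rebuilding the type triples of the resource, which is exactly the kind of maintenance we want to get rid of.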

Introducing: Virtuoso Inference

So now we simply rely on Virtuoso to do all that and it does such a wonderful job. Thanks to Virtuoso graph groups we can keep our clean ontology separation (each ontology has its own graph) and still stick to a very simple extension of the queries:

DEFINE input:inference <nepomuk:/ontographgroup>
select * where {
  ?r rdfs:label "foobar" .
}

Brilliant. Of course there are still situations in which you do not want to use the inferencer. Imagine, for example, the listing of resource properties in the UI. With inference, that listing would also contain every inferred super-type and super-property value next to the ones that were actually stored.

We do not want that. Inference is intended for machines, not for humans, at least not like this. So, since back in the day I did not think of adding query flags to Soprano, I simply introduced a new virtual query language: SparqlNoInference.
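Conceptually, the difference between the two query languages boils down to whether the inference pragma is put in front of the query. A minimal Python sketch of that idea follows; prepare_query and its inference flag are hypothetical and this is not Soprano's code, only the pragma string is taken from the query above.

```python
INFERENCE_PRAGMA = "DEFINE input:inference <nepomuk:/ontographgroup>"


def prepare_query(sparql, inference=True):
    """Prefix the query with the inference pragma unless it is switched off."""
    if inference:
        return INFERENCE_PRAGMA + "\n" + sparql
    return sparql


query = 'select * where { ?r rdfs:label "foobar" . }'
with_inference = prepare_query(query)
without_inference = prepare_query(query, inference=False)
```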

Resource Visibility

While at it, I also improved the resource visibility support by simplifying it. We do not need any additional processing anymore. This again means less work on startup and with every triple manipulation command. Again we save space and increase performance. But this also means that resource visibility filtering will no longer work as before. Nepoogle, for example, will need to be adjusted to the new way of filtering. Instead of

?r nao:userVisible 1 .

we now need

FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] }
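In other words, the new filter asks whether a resource has at least one type that is marked as user-visible. A small Python sketch of that check, assuming all relevant rdf:type triples are already present in an in-memory set (the function name and the example: resources are hypothetical):

```python
RDF_TYPE = "rdf:type"
USER_VISIBLE = "nao:userVisible"


def visible_resources(triples):
    """Return resources that have at least one type marked as user-visible."""
    visible_types = {s for s, p, o in triples
                     if p == USER_VISIBLE and o is True}
    return {s for s, p, o in triples
            if p == RDF_TYPE and o in visible_types}


# Hypothetical data: one visible type, one hidden type.
triples = {
    ("example:Document", USER_VISIBLE, True),
    ("example:IndexGraph", USER_VISIBLE, False),
    ("example:doc1", RDF_TYPE, "example:Document"),
    ("example:graph1", RDF_TYPE, "example:IndexGraph"),
}
visible = visible_resources(triples)
```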

Testing

The implementation is done. All that remains are the tests. I am already running all the patches but I still need to adjust some unit tests and maybe write new ones.

You can also test it. The code changes are, as always, spread over Soprano, kdelibs and kde-runtime. Both kdelibs and kde-runtime now contain a branch “nepomuk/virtuosoInference”. For Soprano you need git master.

Look for regressions of any kind so we can merge this as soon as possible. The goal is KDE 4.9.

8 thoughts on “Nepomuk Tasks: Let The Virtuoso Inferencing Begin”

  1. So I need to detect the KDE version and use one method or the other. Or, even better, would it be possible to detect the installed Soprano version?

    I wanted to let you know about some severe performance issues I noticed when I was using “true”^^xsd:boolean. I think the Virtuoso optimizer sometimes does not do the conversion before building the query and instead converts every row, with a severe impact on performance if the stored value is 1 and not True. Obviously this is only a theory and I could be wrong, and the issue may even have been solved in the latest versions of Virtuoso, but you only need a couple of minutes to check whether I’m wrong, so I prefer to tell you about it :).

    • Virtuoso does not support boolean. Thus, Soprano uses a fake custom datatype with values “true” and “false” (no idea why I did not choose 1 and 0 back when I implemented it). Thus, it always handles string values, which obviously cannot be compared to integers. This in turn means that you can never query boolean values stored as int and the fake type at the same time.
      With this in mind I have to admit that I do not really understand the problem you are describing.

      • Sorry for my Engrish :(.

        If you are using “True”^^xsd:boolean but have stored the number 1 in your database, a conversion is required to compare “True”, the value used in the query, with 1, the value stored in the database.

        If the optimizer is smart enough, it detects that this is a constant value, and the pseudo-code would be something like:

        In this case only one conversion is required to create the query. But if the optimizer is not smart, a conversion is required for every comparison:

        This could be a problem with joins, where it is easy to have millions of comparisons. The same example in C++ code, probably with bugs ;).

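        #include <string>

        // Converts the string again on every iteration.
        long sumConvertingEachTime(long iterations) {
            const std::string str = "1";
            long sum = 0;
            for (long i = 0; i < iterations; ++i) {
                sum += std::stoi(str);
            }
            return sum;
        }

        is slower than

        // Converts the string once, before the loop.
        long sumConvertingOnce(long iterations) {
            const std::string str = "1";
            const int intStr = std::stoi(str);
            long sum = 0;
            for (long i = 0; i < iterations; ++i) {
                sum += intStr;
            }
            return sum;
        }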

        • Damn, this stupid WordPress removed some code :P. Well, I think that the C++ version is enough ;).

        • I think I get it. I don’t think this is an issue though since Virtuoso does not support boolean. Thus, if we use boolean in the query it will be converted to some fake type by Soprano. Virtuoso will not be able to compare that to integer values in the store. It might if I had used “0” and “1” instead of “false” and “true” when implementing it years ago.

  2. This is completely off topic, but I think the Nepomuk team absolutely NEEDS to watch this video. Basically, this is Microsoft showcasing what is possible with their (now dead) WinFS. If we consider WinFS as Nepomuk done at the filesystem level (the idea is the same, the implementation has similarities, Microsoft considered of course using their SQL server instead of Virtuoso, there are ontologies, there is even a NepSaK equivalent called Microsoft StoreSpy), then the possibilities are endless.

    [youtube http://www.youtube.com/watch?v=lvsxPNt-_B0&w=420&h=315]
