Protecting And Sharing Linked Data With Virtuoso

Disclaimer: Many of the features presented here are rather new and cannot be found in the open-source version of Virtuoso.

Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.

Virtuoso is actually a quadruple store, which means each triple lives in a named graph. In Virtuoso, named graphs can be public or private (in reality it is a bit more complex than that, but this view is sufficient for our purposes). Public graphs are readable and writable by anyone who has permission to read or write in general; private graphs are readable and writable only by administrators and by those who have been granted named-graph permissions. The latter case is what interests us today.

We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:

[Screenshot: SPARQL insert query and its result in the Virtuoso SPARQL endpoint]
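The insert performed in the endpoint might look like the following sketch. Only the graph name urn:trueg:demo is taken from the article; the triples themselves are purely illustrative:

```sparql
INSERT INTO GRAPH <urn:trueg:demo> {
  <urn:trueg:demo:res1> a rdfs:Resource ;
    rdfs:label "Example resource" .
}
```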

This graph is now public and can be queried by anyone. Since we want to make it private, we need to switch to a SQL session, as this step is typically performed by an application rather than manually:

$ isql-v localhost:1112 dba dba
Connected to OpenLink Virtuoso
Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo');

Done. -- 2 msec.

Now our new named graph urn:trueg:demo is private and its contents cannot be seen by anyone. We can easily test this by logging out and trying to query the graph:

[Screenshot: the same SPARQL query now returns an empty result]
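The query behind those screenshots is simply the obvious one, run without authentication; it now yields an empty result:

```sparql
select * where {
  graph <urn:trueg:demo> { ?s ?p ?o . }
}
```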

But now we want to share the contents of this named graph with someone. As before, we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso's RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format, so we describe the ACL rule that shares the contents of the named graph as follows:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
<#rule> a acl:Authorization ;
  rdfs:label "Share Demo Graph with trueg's LinkedIn account" ;
  acl:agent <http://www.linkedin.com/in/trueg> ;
  acl:accessTo <urn:trueg:demo> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:PrivateGraphs .

Virtuoso makes use of the ACL ontology proposed by the W3C and extends it with several custom classes and properties in the OpenLink ACL Ontology. Most of this little Turtle snippet should be obvious: we create an Authorization resource which grants Read access to urn:trueg:demo for agent http://www.linkedin.com/in/trueg. The only tricky part is the scope. Virtuoso has the concept of ACL scopes, which group rules by their resource type. In this case the scope is private graphs; another typical scope would be DAV resources.

Given that the file rule.ttl contains the above rule, we can post it via the RESTful ACL API:

$ curl -X POST --data-binary @rule.ttl -H "Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules

As a result we get the full rule resource including additional properties added by the API.

Finally, we log in using my LinkedIn identity and are granted read access to the graph:

[Screenshots: SPARQL endpoint login via LinkedIn, followed by the query results for the now-shared graph]

We see all the original triples in the private graph. And as before with DAV resources, no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc. But those are topics for another day.

Technical Footnote

Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs .
};

This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
<#rule> a acl:Authorization ;
  rdfs:label "Allow Anyone to SPARQL Read" ;
  acl:agentClass foaf:Agent ;
  acl:accessTo <urn:virtuoso:access:sparql> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:Query .

I will explain these technical concepts in more detail in another article.

Virtuoso 6.1.6 and KDE 4.9

Shortly after KDE 4.9 hits the net, Virtuoso 6.1.6 follows. Virtuoso 6.1.6 comes with a ton of fixes, improvements, and optimizations, and updating is highly recommended for the best Nepomuk experience.

Virtuoso 6.1.6 has been tested by the Nepomuk team in cooperation with OpenLink Software before its release. It is the recommended release for Nepomuk. This is not only true for KDE 4.9 but for any version before it.

Get the sources while they are hot and build your packages.

Nepomuk Tasks: Let The Virtuoso Inferencing Begin

Only four days ago I started the experiment to fund specific Nepomuk tasks through donations. As with last year’s fundraiser, I was uncertain whether it was a good idea. That changed when, only a few hours later, two tasks had already reached their donation goal. Again it became obvious that the work done here is appreciated and that the “open” in Open Source is understood for what it actually is.

So despite my wife not being overly happy about it I used the weekend to work on one of the tasks: Virtuoso inferencing.

Inference?

As a quick reminder: the inferencer automatically infers information from the data in the database. While Virtuoso can handle pretty much any inference rule you throw at it, we stick to the basics for now: if resource R1 is of type B and B derives from A, then R1 is also of type A. And: if R1 has property P1 with value “foobar” and P1 is derived from P2, then R1 also has property P2 with value “foobar”.
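In triple form, the two rules look like this (resource, class, and property names are illustrative):

```turtle
# Given these asserted triples...
<urn:ex:R1> a <urn:ex:B> .
<urn:ex:B> rdfs:subClassOf <urn:ex:A> .
# ...the inferencer also makes this triple visible to queries:
<urn:ex:R1> a <urn:ex:A> .

# Likewise, given:
<urn:ex:R1> <urn:ex:P1> "foobar" .
<urn:ex:P1> rdfs:subPropertyOf <urn:ex:P2> .
# ...a query for <urn:ex:P2> "foobar" also matches <urn:ex:R1>.
```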

Crappy Inference

This is already very useful and even mandatory in many cases. Until now we used what we called “crappy inferencing 1 & 2”. Crappy Inferencer 1 was based on work done in the original Nepomuk project: it simply inserted triples for all sub-class and sub-property relations. That way we could simulate real inference by querying for something like

select * where {
  ?r ?p "foobar" . 
  ?p rdfs:subPropertyOf rdfs:label .
}

and catch all sub-properties of rdfs:label, like nao:prefLabel or nie:title. While this works, it means bad performance, additional storage, and additional maintenance.

Crappy Inferencer 2 was even worse. It inserted rdf:type triples for all super-classes: it would look at every added and removed triple to check whether it was an rdf:type triple, and if so it would add or remove the appropriate rdf:type triples for the super-types. That way we could do fast type queries without relying on Crappy Inferencer 1 and its rdfs:subClassOf method. But this meant even more maintenance and even more wasted storage space.

Introducing: Virtuoso Inference

So now we simply rely on Virtuoso to do all that and it does such a wonderful job. Thanks to Virtuoso graph groups we can keep our clean ontology separation (each ontology has its own graph) and still stick to a very simple extension of the queries:
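Setting up such a graph group uses the same SQL API we saw earlier for private graphs. A sketch, assuming illustrative graph names (the exact signature of DB.DBA.RDF_GRAPH_GROUP_CREATE may differ between Virtuoso versions):

```sql
-- Create the group, then add one member graph per ontology
-- (group and graph names here are illustrative).
SQL> DB.DBA.RDF_GRAPH_GROUP_CREATE ('nepomuk:/ontographgroup', 1);
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('nepomuk:/ontographgroup', 'nepomuk:/ontology/nao');
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('nepomuk:/ontographgroup', 'nepomuk:/ontology/nie');
```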

DEFINE input:inference <nepomuk:/ontographgroup>
select * where {
  ?r rdfs:label "foobar" .
}

Brilliant. Of course there are still situations in which you do not want to use the inferencer. Imagine, for example, the listing of resource properties in the UI: with inference enabled it would be cluttered with all the inferred super-types and super-properties. We do not want that. Inference is intended for machines, not for humans, at least not in this raw form. So, since back in the day I did not think of adding query flags to Soprano, I simply introduced a new virtual query language: SparqlNoInference.

Resource Visibility

While I was at it, I also improved the resource visibility support by simplifying it. We do not need any additional processing anymore, which means less work on startup and with every triple manipulation command. Again we save space and increase performance. But it also means that resource visibility filtering no longer works as before. Nepoogle, for example, will need to be adjusted to the new way of filtering. Instead of

?r nao:userVisible 1 .

we now need

FILTER EXISTS { ?r a [ nao:userVisible "true"^^xsd:boolean ] }

Testing

The implementation is done. All that remains are the tests. I am already running all the patches, but I still need to adjust some unit tests and maybe write new ones.

You can also test it. The code changes are, as always, spread over Soprano, kdelibs and kde-runtime. Both kdelibs and kde-runtime now contain a branch “nepomuk/virtuosoInference”. For Soprano you need git master.

Look for regressions of any kind so we can merge this as soon as possible. The goal is KDE 4.9.

Virtuoso Open-Source Moved to GitHub

Ever since 2006 OpenLink Software has provided its Open-Source version of Virtuoso (VOS), the high-performance SQL server with a powerful RDF/SPARQL data management layer on top.

So far the sources have been developed in an internal CVS repository which was published through the Virtuoso SourceForge pages.

As of March 21, OpenLink took the next step towards open development by moving to Git as its version management system. The sources are now hosted in the VOS GitHub repository.

As mentioned on the VOS Git usage pages, OpenLink now accepts GitHub pull requests and patches. Be sure to read the notes on the Git branching policy in VOS, which is based on the git-flow approach by Vincent Driessen – an interesting read independent of VOS, by the way.

Most importantly it is now a lot simpler to follow the development of Virtuoso Open-Source. Simply clone the git repository and switch to the appropriate develop branch:

$ git clone git://github.com/openlink/virtuoso-opensource.git
$ cd virtuoso-opensource
$ git checkout -t remotes/origin/develop/6

For details on the used branches see the already mentioned VOS git usage guide.

Refer to the VOS building instructions if the following is not enough for you:

$ ./autogen.sh
$ ./configure --prefix=/usr/local --with-layout=<LAYOUT>
$ make
$ make install

where <LAYOUT> is one of Gnu, Debian, Gentoo, Redhat, Freebsd, opt, Openlink. The latter two force the prefix.

A Word (or Two) on Removable Storage Media Handling in Nepomuk

While fixing existing Nepomuk bugs and trying to close them as they come in, I also look into other things. Last week it was improved file indexer scheduling and file modification handling. This week it is about another improvement in the handling of queries which involve removable media. Ignacio Serantes had already found one bug in the URL encoding before. This time he wanted to search through all mounted removable storage media and realized that he could not. I just fixed that. To understand how, we need to go into detail about how Nepomuk handles removable media.

Removable Storage Media in Nepomuk

Files on removable storage media are a problem when it comes to metadata stored in Nepomuk. As long as the medium is mounted we can simply identify files through their local file paths. But as soon as it is unmounted the paths are no longer valid. To make things worse, the medium could be mounted at another mount point next time, or another medium (which obviously does not contain the files in question) could be mounted at the same mount point. So we need a way around that problem. Ever since 4.7, Nepomuk has had a rather fancy way of doing that.

Internally Nepomuk uses a stack of Soprano::FilterModels which perform several operations on the data that passes through them. One of these models is the RemovableStorageModel. This model does one thing: it converts the local file URLs of files and folders on removable media into mount-path-independent URLs and vice versa. Currently it supports removable disks like USB keys or external hard disks (any storage that has a UUID), optical media, NFS and Samba mounts. The nice thing about it is that this conversion happens transparently to the client. Thus, a client simply uses the local file URLs according to the current mount path and does not care about anything else. It will always get the correct results.

To understand this better we should look at an example. Imagine we have a USB key inserted with UUID “xyz” which is mounted at /media/disk. Now if we add information about a file /media/disk/myfile.txt to Nepomuk the following happens: The RemovableStorageModel will convert the URL file:///media/disk/myfile.txt into filex://xyz/myfile.txt. This is a custom URL scheme which consists of the device UUID and the relative path. When querying the file the model does the conversion in the other direction. So far so simple.
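The mapping itself is straightforward string manipulation. Here is a minimal sketch in Python, using the mount path and UUID from the example; this is illustrative only, not Nepomuk's actual C++ implementation:

```python
# Sketch of the RemovableStorageModel URL mapping (illustrative, not
# the actual Nepomuk/Soprano code).

def to_filex(local_url: str, mount_path: str, uuid: str) -> str:
    """Map a mounted local file URL to a mount-independent filex:// URL."""
    prefix = "file://" + mount_path
    if not local_url.startswith(prefix):
        raise ValueError("file is not on this medium")
    return "filex://" + uuid + local_url[len(prefix):]

def to_local(filex_url: str, mount_path: str, uuid: str) -> str:
    """Map a filex:// URL back to the medium's current mount path."""
    prefix = "filex://" + uuid
    if not filex_url.startswith(prefix):
        raise ValueError("URL does not belong to this medium")
    return "file://" + mount_path + filex_url[len(prefix):]

print(to_filex("file:///media/disk/myfile.txt", "/media/disk", "xyz"))
# filex://xyz/myfile.txt
```

The key design point is that the relative path stays untouched, so the stored URL survives any change of mount point.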

Queries are where it gets a little more complicated. Imagine we want to query all files in a certain directory on the removable medium (ideally the SPARQL would be hidden by the Nepomuk query API). We would then perform a query like the following simplified one.

select ?r where {
  ?r nie:isPartOf ?p . 
  ?p nie:url <file:///media/disk/somefolder> . }

If we passed this query on to Virtuoso we would not get any results, since there is no resource with nie:url <file:///media/disk/somefolder>. So the RemovableStorageModel steps in again and does some query tweaking (rather primitive tweaking, seeing that we do not have a SPARQL parser in Nepomuk). The query is converted into

select ?r where {
  ?r nie:isPartOf ?p .
  ?p nie:url <filex://xyz/somefolder> . }

And suddenly we get the expected results.

Of course this is still rather simple. It gets more complicated when SPARQL REGEX filters are involved. Imagine we wanted to look for all files in some sub-tree on a removable medium. We would then use a query along the lines of the following:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/disk/somefolder/')) . }

As before, passing this query directly on to Virtuoso would not yield any results. The RemovableStorageModel needs to do its magic first:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^filex://xyz/somefolder/')) . }

This is what the model did before Ignacio wanted to query all his removable media mounted somewhere under /media at once. Obviously he did something like:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/')) . }

The result, however, was empty. This is simply because there was no exact match to any mount path of any of the removable media, so the RemovableStorageModel did not replace anything. The solution was to include additional filters for all the candidates in addition to the already existing filter. We need to keep the existing filter in case there is anything else under /media which is not a removable medium and thus has normal local file:/ URLs.

If we imagine that we have an additional mounted removable medium with UUID “foobar”, then the query would be converted into something like the following.

select ?r where {
  ?r nie:url ?url .
  FILTER((REGEX(STR(?url), '^file:///media/') ||
          REGEX(STR(?url), '^filex://xyz/') || 
          REGEX(STR(?url), '^filex://foobar/'))) . }

This way we get the expected results. (The additional brackets are necessary in case the filter already contains more than one term.)
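The rewriting step above can be sketched as follows. This is a simplified Python illustration of the idea, not the actual model code (which works on the raw query string without a full SPARQL parser):

```python
def expand_regex_filter(pattern: str, media: dict) -> str:
    """Build the expanded FILTER for a '^file://...' REGEX prefix pattern.

    media maps mount paths to device UUIDs, e.g. {"/media/disk": "xyz"}.
    The original pattern is kept (for non-removable paths under the same
    prefix) and one filex:// alternative is added per matching medium.
    """
    terms = ["REGEX(STR(?url), '%s')" % pattern]
    for mount_path, uuid in media.items():
        local_prefix = "^file://" + mount_path + "/"
        # The medium is a candidate if its mount path lies under the
        # pattern, or the pattern points below the mount path.
        if local_prefix.startswith(pattern) or pattern.startswith(local_prefix):
            terms.append("REGEX(STR(?url), '^filex://%s/')" % uuid)
    return "FILTER((%s))" % " || ".join(terms)

print(expand_regex_filter("^file:///media/",
                          {"/media/disk": "xyz", "/media/key": "foobar"}))
```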

Well, I personally think this is a very clean solution in which clients only have to consider filex:/ and its friends nfs:/, smb:/, and optical:/ if the media are not mounted. I already drafted one way of handling that a while back. But that will be perfected another day. ;)

For now let me, as always, close with the hint that development like this is still running on your donations:

Click here to lend your support to: Nepomuk - The semantic desktop on KDE and make a donation at www.pledgie.com !
Click here to donate to Nepomuk via Moneybookers

The Hunt For Nepomuk Bugs Continues

Let me open with a few stats just to brag:

  • Top bug killer on the last commit digest
  • Number of Nepomuk crash reports now below 100
  • Overall number of Nepomuk bugs down to 163 (this is actually not much, have a look at the related statistics)
  • I closed some serious bugs this week (details below)

If you want to track the progress you can use the following links to check from time to time:

Finally, I want to present two fixes from last week, just to show what kind of work needs to be done to fix problems in Nepomuk:

1. Bug 281136 – Nepomuk queries containing unicode characters fail

The problem presented itself as follows: whenever the user executed a query containing extended characters such as German umlauts, French accents, or, say, any Russian characters, the query would not return any results.

After some testing I realized that the queries simply failed when being delivered to Virtuoso because of Nepomuk’s automatic search excerpt extraction. It turned out that Virtuoso’s bif:search_excerpt method cannot handle wide characters, which is exactly what it got. So I turned to the Virtuoso team for help and got a workaround which essentially means that we convert the wide characters to UTF-8. However, this results in stripped search excerpts, so the story does not end here – I am waiting for a better solution from the Virtuoso guys.

2. Nepomuk deletes annotations of files on removable media

This was a very interesting bug – to me at least. The problem was that Nepomuk would delete the manually added information like tags, ratings, relations to other files, and so on from files that are stored on an external hard disk.

Now, to understand this problem better, I have to explain a bit how Nepomuk handles external media. Nepomuk uses Soprano’s API to access RDF data. This is done through a whole stack of what we call models, each of which performs some operations on the data that passes through. One of these models handles external media: it converts each URL of a file on an external medium into a new URL which is independent of the medium’s mount point.

Imagine for example that the external hard disk with UUID “foobar” is mounted at /media/hd. Then a URL like file:///media/hd/myfile.txt is converted to filex://foobar/myfile.txt. That way Nepomuk will find the file again even when the disk is mounted at another path. This conversion happens transparently for all clients, meaning they only work with the local file:/ URLs. A nice side-effect is that when the disk is not mounted any code that performs clean-up like removing data for non-existing files will ignore those entries since they have no relation to the mount point.

On to the bug. Thankfully, Ignacio Serantes realized that he only lost the information from files that had spaces in their names. That already pointed to a URL encoding problem. When we convert URIs from and to strings, we use percent encoding. If all goes well this works fine. However, with a bug we might end up percent-encoding an already percent-encoded URI. This was the case in the removable media handling of Nepomuk. When converting the internal filex:/ URL back to its file:/ counterpart, the percent encoding got borked. As a result, the clean-up code would check for the existence of the wrong local URL and remove the related data. The fix involved some trickery with QUrl and KUrl, and reminded me that unit tests involving URIs should always check for possible percent-encoding problems.
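The double-encoding effect is easy to reproduce with Python's standard library (illustrative only; the actual bug lived in the Qt/KDE QUrl and KUrl handling):

```python
from urllib.parse import quote

path = "/media/hd/my file.txt"

once = quote(path)   # '/media/hd/my%20file.txt'   - correct encoding
twice = quote(once)  # '/media/hd/my%2520file.txt' - the '%' of '%20' got
                     #   encoded again, yielding a path that exists nowhere
print(once)
print(twice)
```

A clean-up routine that checks file existence against the doubly-encoded path will never find the file and will happily delete its metadata, which is exactly what happened.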

Well, the hunt for bugs is going on. In the meantime I am also still hunting for Nepomuk funding.

Click here to lend your support to: Nepomuk - The semantic desktop on KDE and make a donation at www.pledgie.com !
Click here to donate to Nepomuk via Moneybookers