A Million Ways To Do It Wrong


It is a sad truth: when it comes to creating data for the Nepomuk semantic desktop there are a million ways to do it wrong and basically only one way to get it right. Typically people will choose from the first set of ways. While that is of course bad, they are not to blame. Who wants to read page after page of documentation and reference guides? Who wants to dive into the depths of RDF and all that ontology stuff when they just need to store a note (yes, this blog was inspired by a real problem)? Nobody – that’s who! Thus, the Nepomuk API should do most of the work. Sadly it does not. It basically allows you to do everything: Resource::setProperty will happily use classes or even invalid URLs as properties without giving any feedback to the developer. Why is that? Well, I suppose there are at least three reasons:

  1. Back in the day I figured people would almost always use the resource-generator to create their own convenience classes which handle the types and properties properly.
  2. The Resource class is probably the oldest part of the whole Nepomuk stack.
  3. A basic lack of time, drive and development power.
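To make this concrete, here is a minimal sketch of the kind of silent misuse I mean, written against the KDE4-era client API (Nepomuk::Resource, Nepomuk::Variant and the Soprano NAO vocabulary). Treat it as an illustration rather than a recommendation:

```cpp
#include <QUrl>
#include <Nepomuk/Resource>
#include <Nepomuk/Variant>
#include <Soprano/Vocabulary/NAO>

void misuseExample()
{
    Nepomuk::Resource res(QUrl("file:///home/user/note.txt"));

    // Wrong: NAO::Tag() is a *class*, not a property, yet setProperty
    // accepts it silently -- no error, no warning, no feedback.
    res.setProperty(Soprano::Vocabulary::NAO::Tag(),
                    Nepomuk::Variant(QString("important")));

    // Equally wrong: a plain invalid URL used as a property.
    res.setProperty(QUrl("not-a-property"), Nepomuk::Variant(42));
}
```

Both calls succeed from the developer's point of view, and the junk ends up in the store.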

So what can we do about this situation? Vishesh and I have been discussing the idea of a central DBus API for Nepomuk data management a million times (as you can see, today “a million” is my go-to expression when I want to say “a lot”). So far, however, we could not come up with a good API that solves all problems, is future-proof (to a certain extent), and performs well. That has not changed. I still do not know the solution. But I have some ideas as to what the API should do for the user in terms of data integrity.

  1. Ensure that only valid, existing properties are used, and provide a good error message in case a class, an invalid URL, or something non-existent is used instead. This would also mean that one could only use ontologies that have been imported into Nepomuk. But since the ontology loader already supports fetching ontologies from the internet, this should not be a big problem.
  2. Ensure that the ranges of the properties are honoured. This is pretty straightforward for literal ranges. In that case we could also do some fancy auto-conversion to simplify the usage, but in essence it is easy. The case of non-literal ranges is a bit more tricky. Do we want to force proper types, or do we assume that the object resource has the required type? I suppose flags would be of use (a rough sketch of such an API follows this list):
    • ClosedWorld – It is required that the object resource has the type of the range. If it does not, the call fails.
    • OpenWorld – The object resource will simply get the range type added. This is not a problem since resources can be of several types.

    This would also mean that each property needs to have a properly defined range. AFAIK this is currently not the case for all NIE ontologies. I think it is time for ontology unit tests!

  3. Automatically handle pimo:Things to a certain extent: Here I could imagine that trying to add a PIMO property to a resource would automatically add it to the related pimo:Thing instead.
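To make the flags idea from point 2 a bit more tangible, here is a purely hypothetical C++ sketch. None of these names exist in Nepomuk today; it only illustrates the shape a validating call could take:

```cpp
// Purely hypothetical -- nothing here exists in Nepomuk today.
#include <QUrl>
#include <QString>
#include <QVariant>

enum RangeHandling {
    ClosedWorld, // object resource must already be of the range type, else fail
    OpenWorld    // object resource simply gets the range type added
};

struct Error {
    bool ok;
    QString message; // a good, human-readable error message (see point 1)
};

// Imagined entry point of the data management API. It would check that
// 'property' is a known rdf:Property from an imported ontology, verify
// (or add, depending on 'mode') the range type of 'value', and only then
// touch the database.
Error setProperty(const QUrl& resource, const QUrl& property,
                  const QVariant& value, RangeHandling mode = ClosedWorld);
```

The important part is that the call can fail with a descriptive error instead of silently writing junk into the store.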

Moving this from the client library into a service would have other benefits, too.

  • The service could be used from languages other than C++, or even from applications not using KDE at all (see the hypothetical DBus snippet after this list).
  • The service could perform optimizations when it comes to storing triples, updating resources, caching, you name it.
  • The service could provide change notifications which are much more useful than the pretty useless Soprano::Model signals.
  • The service could perform any number of integrity tests before executing the actual commands on the database, thus improving the quality of the data in Nepomuk altogether.
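Just to illustrate what that could feel like from the outside, here is a hypothetical snippet of a client talking to such a service over DBus. The service name, object path and method are invented for the sake of the example:

```cpp
#include <QDBusInterface>
#include <QDBusReply>
#include <QString>

bool tagFile(const QString& fileUrl, const QString& tag)
{
    // Service name, path and interface are assumptions -- no such
    // DBus service exists yet.
    QDBusInterface dms(QLatin1String("org.kde.nepomuk.DataManagement"),
                       QLatin1String("/datamanagement"),
                       QLatin1String("org.kde.nepomuk.DataManagement"));

    // The service would validate 'tag' and the file resource before
    // touching the database, and report a proper error otherwise.
    QDBusReply<void> reply = dms.call(QLatin1String("addTag"), fileUrl, tag);
    return reply.isValid();
}
```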

This blog entry is not about presenting the one solution to solve all our problems. It is merely a brain-dump, trying to share some of the random thoughts that go through my head when taking a walk in the woods. Nonetheless, this is an issue that needs tackling at one point or another. In any case, my ideas are saved for the ages. :)


9 thoughts on “A Million Ways To Do It Wrong”

  1. Who wants to read page after page of documentation and reference guide?

    That leads to the question: “Who will write that documentation and keep it up-to-date with the Nepomuk development pace?” Currently the ontology documentation is in a kinda poor state.

  2. Not sure if this is what you are talking about, but just the other day I was looking for some easy-to-use DBus magic letting me read and set tags of files from a bash script. I couldn’t find any. Is there a way to do this, even if not so straightforward? And if there is, how?

    • Actually this is exactly what I am talking about. There is no such API. Creating a tag from the command line is far from trivial at the moment, as you would need to do everything the client library does yourself. Obviously that is not a great solution. So I would say: tell me your needs and I will use them as input for the API.

      • Well, the use case is actually rather simple: I have files which are in some way encoded/encrypted/compressed. When downloading them, KGet lets me automatically assign tags. Now I have to decode the file, in this case with a Gtk application which will probably never ever get Nepomuk support. This means that all semantic information is lost on the decoded file. So my idea was to wrap a bash script around it which reads the tags from the original file and applies them to the decoded file when finished. That’s it :)

        More generally, this would also apply to other information attached to a file, like the download URL, comments or whatever.

        • This is an interesting use case. In a perfect world you would simply state that the decompressed file has been derived from the downloaded one and then everything would work. Sadly we are not there yet. Thus, we need to have the API as mentioned. Maybe a simple copyAnnotations(URL, URL) would be a good candidate…
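          Just to sketch the idea: copyAnnotations does not exist, but with today’s client library its core could look roughly like this (limited to tags and the comment here):

          ```cpp
          #include <QUrl>
          #include <Nepomuk/Resource>
          #include <Nepomuk/Tag>

          // Hypothetical helper -- roughly what copyAnnotations(src, dst)
          // could do internally with the existing client library.
          void copyAnnotations(const QUrl& srcUrl, const QUrl& dstUrl)
          {
              Nepomuk::Resource src(srcUrl);
              Nepomuk::Resource dst(dstUrl);

              // Transfer all tags from the original file to the decoded one.
              foreach (const Nepomuk::Tag& tag, src.tags())
                  dst.addTag(tag);

              // Transfer the free-text comment, if any.
              if (!src.description().isEmpty())
                  dst.setDescription(src.description());
          }
          ```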

  3. I could be wrong, but this post gives me the impression that you might be misunderstanding the purpose of RDF ontologies. Ontologies are not like XML Schemas or SGML DTDs; ontologies describe implications, not constraints. If you read the specifications for RDFS and OWL, you’ll see they use the language of logical assertion (“is”, “are”, “implies”, “entails”), not conformance (“must”, “should”, “may”). Ontologies do not limit what you can put in RDF; they supply the RDF server with the general knowledge to enable it to reason out absent information and identify logically inconsistent information, the same way a human can. However, just as with human reasoning (and contrary to program logic), inconsistent information in an RDF graph is not an error, merely a paradox or a lie.

    If an ontology (or, for that matter, any triples asserted anywhere in the graph) says that {tag:master.ryukage@gmail.com,2010:foo} is an rdf:Property with rdfs:domain {tag:master.ryukage@gmail.com,2010:Bar} and rdfs:range {tag:master.ryukage@gmail.com,2010:Quux}, and I assert the triple {{http://example.com/frob},{tag:master.ryukage@gmail.com,2010:foo},{http://example.com/frobnitz}}, then I am also asserting by implication the triples {{http://example.com/frob},{rdf:type},{tag:master.ryukage@gmail.com,2010:Bar}} and {{http://example.com/frobnitz},{rdf:type},{tag:master.ryukage@gmail.com,2010:Quux}}. It doesn’t matter if the latter two triples already exist in the graph or not: asserting the first causes the RDF server to infer the other two. If this conflicts with some other assertion, it may simply be a paradox, or it may mean something has been asserted somewhere that is untrue. Either way, something being paradoxical or untrue does not make anything invalid.

    Since you mention literals specifically, you may be falling prey to a logical error that unfortunately crept into the RDF spec itself. The spec distinguishes between resources (URIs) and literals (strings), and treats them as separate, mutually exclusive data types. This is an artifact of the RDF/XML serialization syntax, and really shouldn’t have been put in the abstract model. There’s nothing in the conceptual model that prevents a literal from also being a resource; the difference is only in how the node is referenced in a serialization syntax or API: resources are nodes referenced by name (i.e. their URI), literals are nodes referenced by identity (i.e. their content). Unfortunately, RDF/XML cannot handle referencing the same node in both ways, so the spec writers made the unfortunate decision to give RDF two primitive data types instead of the single type it logically should have had. How you deal with this issue is up to you, but while devising your solution you should remember that the resource/literal distinction is an implementation flaw, not inherent to the conceptual model.

    • Your reasoning is sound and holds perfectly in the rather theoretical, research-driven world of the semantic web. On the semantic desktop, however, it does not hold anymore. We are trying to provide APIs that the RDF-unaware developer can use, and UIs that present really useful data rather than abstract, weird-looking triples like DBpedia and friends. We need to enforce a lot more restrictions if we want to make the system usable and useful.

  4. I have the following use case. I have a decade or so of research on a variety of topics. I’ve used an XML document to manage the list of topics, which has grown to some 800 nodes over time. I would love to be able to tag/classify some subset of local files, URLs, casual notes, locally cached copies of Web pages… with the topic taxonomy and then browse the content in terms of it. While I’m dreaming, I’d love to be able to turn the whole lot into a docbook. Is managing content in terms of a managed topic taxonomy one of the use cases for, say, PIMO?
