About Strigi, Soprano, Virtuoso, CLucene, and Libstreamanalyzer

There seems to be a lot of confusion about the parts that make up the Nepomuk infrastructure. Let me shed some light.

Soprano is the RDF data storage and parsing library used in Nepomuk. Soprano provides a plugin for Virtuoso, which is mandatory and requires libiodbc. It does NOT work with unixODBC: it compiles but simply does not work, because RDF data handling requires some extensions in libiodbc. In addition to the Virtuoso plugin, Nepomuk requires the Raptor parser plugin and the Redland storage plugin for ontology import.

CLucene is not required in Nepomuk anymore. It was used for full-text indexing in early versions of KDE but has been superseded by the full-text indexing functionality of Virtuoso. Consequently, the Soprano clucene module is not required anymore either, and its development has effectively stopped. It will most likely not be part of Soprano 3 (unless someone interested steps up and does the required work).

Virtuoso is a full-blown SQL server with a powerful RDF layer on top. OpenLink, the company developing Virtuoso, maintains an open channel of communication with the Nepomuk developers and introduced a “lite” mode for us (please no comments on how it still is not “lite”). Virtuoso 6.1.3 is the current version. It has a Unicode bug which can be fixed by applying the patch attached to KDE bug 271664. Virtuoso 6.1.4 will be released soon and contains several fixes for bugs I reported. An update is highly recommended.

Libstreamanalyzer and libstreams are libraries which are part of the Strigi project. In addition, the Strigi project contains strigidaemon, an alternative scheduler for indexing files which is based on CLucene and not used by Nepomuk. I once asked the maintainer of Strigi to split libstreams and libstreamanalyzer into their own independently released packages. He refused, which is understandable seeing as he has little time for Strigi as it is. As a consequence, I advise packagers to use libstreamanalyzer from either git master or the latest tag instead of using released tarballs.

I think that is all. If I missed something please comment and I will update the post.

A Summer 2010 Full of Nepomuk Code

Another year, another Google Summer of Code, another 4 (yes, four!) semantic desktop projects. It is amazing. After two very successful projects in 2009 we now take it one step further with three Nepomuk projects and one Strigi project. Without further ado I give you the Nepomuk Google Summer of Code 2010 projects:

Metadata Backup, Sync and Sharing by Vishesh Handa

Ever since we started to create meta data on the desktop (by this I mean tags, ratings, and relations between resources that cannot be recreated easily) we have also had the need to back up and sync this data. So far this area is lacking in Nepomuk. Vishesh sets out to change this situation and develop ways to sync meta data between different clients (imagine syncing your laptop with the desktop computer or the phone) or simply to back it up. This does not simply mean coding some backup GUI; it actually includes changes on the ontology level (the data level). When syncing data between two clients (or between a client and a backup; the principle is the same) the two most complicated matters are: 1. identifying the resources which need to be merged on both ends and 2. deciding which data needs to be removed and which needs to be added.
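To make the second point a bit more concrete, here is a minimal sketch of the "what to add, what to remove" step. It is not Vishesh's design and it does not use any Nepomuk API. It assumes that statements are plain (subject, predicate, object) tuples and that the first problem is already solved, i.e. both sides use the same identifier for the same resource. The function name, the file URL and the simplified tag value are made up for illustration.

```python
# Hypothetical sketch, not Nepomuk code: statements are plain
# (subject, predicate, object) tuples, and both sides are assumed to
# already use the same identifier for the same resource.

def diff_statements(source, target):
    """Return (to_add, to_remove): the changes that turn target into source."""
    source, target = set(source), set(target)
    to_add = source - target      # statements only the source has
    to_remove = target - source   # statements only the target has
    return to_add, to_remove


# Illustrative data: the rating was changed on the laptop after the backup.
laptop = {
    ("file:///home/me/report.pdf", "nao:hasTag", "work"),
    ("file:///home/me/report.pdf", "nao:numericRating", "8"),
}
backup = {
    ("file:///home/me/report.pdf", "nao:hasTag", "work"),
    ("file:///home/me/report.pdf", "nao:numericRating", "5"),
}

to_add, to_remove = diff_statements(laptop, backup)
print("add:", sorted(to_add))
print("remove:", sorted(to_remove))
```

A plain set difference like this treats one side as the truth. A real two-way sync also has to decide which side wins when both have changed, which is exactly why the project is more than a GUI.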

Well, suffice it to say that Vishesh has an ambitious project ahead of him. But looking at his enthusiasm and his early involvement in KDE (he is already committing one patch after the other) I am very confident that he will succeed.

Web Metadata Extractor Framework and Service by Artem Serebriyskiy

In Nepomuk we use the Strigi system to extract meta data from files and store it in the Nepomuk database, allowing the user to search files based on their meta data. This is very useful. However, there are certain types of files that provide little or no meta data at all. Typical examples are video files. It would be very interesting to be able to search for video files by title, actors, directors, or release year. All this information is available on the Internet. So why not make use of it?

This is exactly what Artem’s project is about: extracting meta data from the web and associating it with local files. Of course he will implement this as a Nepomuk service that provides a plugin system allowing for different types of extractors and that handles uncertainties and duplicate information as smoothly as possible. Look out for more cool information at your fingertips.

Nepomuk Dedicated Desktop Search GUI by Oszkar Ambrus

Let’s face it: today desktop search is still the number one use case for Nepomuk (although it was not the original motivation, but that is another story). So having a good and convenient user interface is essential for the success of the system. We have several interfaces in KDE including the search bar in Dolphin and the search runner. But all are lacking in at least two main areas: 1. the query building: so far one has to know a lot about the underlying data structures to write powerful queries; and 2. the presentation of the search results: currently the results are presented like any other folder, leaving out interesting information like a hit score or details on why the result was returned. (Actually there is a number three which I hope Oszkar will have the time to attack: since we have more than file results we need a good way to open and present these resources.)

Oszkar sets out to improve this situation and create reusable components to let the user create powerful queries without much knowledge of the data and to present the results in a convenient way. An important project that will undoubtedly yield great results.

Strigi: Stream Analyzer based on Data Structure Descriptions

Jos was kind enough to write a paragraph on the Strigi project:

Yet another project has been granted. Yulia Medvedeva will work on a new type of file analyzer for Strigi. The goal of the project is to write the structure of files down in a grammar file and generate code from the grammar or parse the grammar at runtime. Writing analyzers usually involves quite a bit of repetitive, error-prone code. It also requires knowledge of C++. By writing the format in a grammar language, coding errors are avoided. In addition, the independence from any programming language allows the grammars to be shared with other projects.
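To give a rough idea of what "describe the structure, do not code it" can mean, here is a small sketch. It is not the grammar language of the project and has nothing to do with the Strigi API. The header layout, the field names and the analyze function are invented for illustration. The description is plain data which a generic routine interprets at runtime.

```python
import struct

# Hypothetical description of a made-up file header. Each entry is
# (field name, struct format code). This is NOT a Strigi grammar.
HEADER_DESCRIPTION = [
    ("magic", "4s"),   # 4 raw bytes
    ("version", "H"),  # unsigned 16-bit integer
    ("width", "I"),    # unsigned 32-bit integer
    ("height", "I"),   # unsigned 32-bit integer
]


def analyze(data, description, byte_order="<"):
    """Interpret the description at runtime and return the named fields."""
    fmt = byte_order + "".join(code for _, code in description)
    size = struct.calcsize(fmt)
    if len(data) < size:
        raise ValueError("data too short for the described header")
    values = struct.unpack(fmt, data[:size])
    return dict(zip((name for name, _ in description), values))


# Build a sample header in memory and analyze it.
sample = struct.pack("<4sHII", b"DEMO", 1, 640, 480)
fields = analyze(sample, HEADER_DESCRIPTION)
print(fields)
```

Supporting another format then means writing another description rather than another analyzer, and since the description is only data it could in principle be read by tools written in any language.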

Well, that is it for the four projects that should give Nepomuk a good push forward. I am very happy about the selection and have to say thank you to Google and the rest of the KDE mentor team for giving us this much support. It will be legendary!