A Fun Release: Nepomuk TV Namer 0.2

As requested I prepared a release of the TV Show managing thingy I implemented. You can download it from download.kde.org mirrors at unstable/nepomuk/nepomuktvnamer-0.2.0.tar.bz2.

The nepomuktvnamer 0.2.0 is a little more polished than the original version and comes with a nice service menu extension that lets you manually start the fetching of TV Show information on folders or video files. This is important since the service only reacts to new videos, so you need to start the initial information fetching manually on your TV Show folder.

The tvnamer has two requirements in addition to the typical KDE ones:

  • LibTvdb – a Qt-based library which provides asynchronous access to TV series information from thetvdb.com via a very simple interface. Its use in the Nepomuk TV namer should be obvious.
  • Shared-Desktop-Ontologies 0.9.0 – The recently released new version of SDO provides the required nfo:depiction property used by the tvnamer to store banners.

I also recommend applying the kdelibs patch I mentioned earlier to actually see the TV Show banners. Have fun with it – maybe someone will even package it.

Just For The Fun Of It: Browsing Music With Nepomuk

Since implementing the TV Show KIO slave was that easy I decided I could do the same for music – just to show how simple it can be. There are a few more lines but that is only because I added browsing by album, artist, and genre. So there are a lot of if/else constructs. Anyway, here goes:

Browsing music by artist is easy. As you can see I also implemented a preview generator plugin the same way I did for the TV Shows. The only problem is that there is no tool yet that automatically fetches those images. Thus, I had to do it manually for one example which looks somewhat like this:

qdbus org.kde.NepomukStorage \
  /datamanagement \
  org.kde.nepomuk.DataManagement.addProperty \
  "nepomuk:/res/0152825f-5c49-4ca8-aa0a-23fc9a1305f1" \
  "nfo:depiction" \
  "/home/trueg/atb2.jpg" \
  "shell"

This is part of the fancy Data Management API which allows me to add the file atb2.jpg as a nfo:depiction of the nco:Contact resource identifying the artist ATB.

Anyway, entering the artist themselves and what lies beyond:

(Again I had to fetch the cover art manually. I did not want to implement my own cover art retrieval tool and I found the Amarok code not to be very reusable. Again maybe someone wants to take up this task?)

Finally we end up in the album tracks. Sadly dragging an album to a media player playlist does not work yet. I am not quite sure how to fix that.

Last but not least a quick look at browsing by genre:

This was fun. But before I go to bed let me share with you the very simple code which is responsible for the nice previews (abbreviated of course):

bool MusicThumbCreator::create(const QString &path,
                               int w, int h,
                               QImage &img)
{
  KUrl url(path);
  QStringList pathTokens
      = url.path().split('/', QString::SkipEmptyParts);
  if(pathTokens.count() < 2) {
    return false;
  }

  // there are only two cases for us: artists and albums
  if(pathTokens[pathTokens.count()-2] == QLatin1String("artists") ||
     pathTokens[pathTokens.count()-2] == QLatin1String("albums")) {
    const QUrl uri = recoverUriFromUrlToken(pathTokens.last());
    // we just query the first depiction there is
    Soprano::QueryResultIterator it
       = Nepomuk::ResourceManager::instance()->mainModel()
         ->executeQuery(
              QString::fromLatin1("select ?u where { "
                                  "%1 nfo:depiction [ nie:url ?u ] . "
                                  "} LIMIT 1")
              .arg(Soprano::Node::resourceToN3(uri)),
              Soprano::Query::QueryLanguageSparql);
    if(it.next()) {
      // only report success if the image could actually be loaded
      return img.load(it["u"].uri().toLocalFile());
    }
  }

  return false;
}

The rest of the code can be found in the nepomuk-audio-kio-slave scratch repository. Maybe at some point I could just throw all of those things into some “Nepomuk KIO extensions” package… oh, well, off to bed now…

More Fun With TV Shows

After fetching all the details about TV Shows from thetvdb.com I went back to my favorite way of browsing things: KIO slaves. So without further ado let me introduce the tvshow:/ KIO slave:

So the root folder lists all TV Series. As you can see the previews are messed up aspect-ratio-wise. If anyone has an idea of how to improve that without patching KIcon or KIO or caching my own thumbnails in some tmp folder please tell me.

Entering the season listing…

And finally the episodes. And just because it is fun here is one more:

Why do this? Well, nepomuksearch cannot create sub-folders (yet) and this only has about 120 relevant lines of code, most of which is used up by the three queries it creates.

To try it simply update your git clone of the nepomuktvnamer and have fun.

A Little Bit Of Query Optimization

Every once in a while I add another piece of query optimization code to the Nepomuk Query API. This time it was a direct result of my earlier TV Show handling. I simply thought that a query like “downton season=2 episode=2” takes too long to complete.

Now in order to understand this you need to know that there is a rather simple QueryParser class which converts a query string like the above into a Nepomuk::Query::Query which is simply a collection of Nepomuk::Query::Term instances. A Query instance is then converted into a SPARQL query which can be handled by Virtuoso. This SPARQL query already contains a set of optimizations, some specific to Virtuoso, some specific to Nepomuk. Of course there is always room for improvement.
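To make the first step a bit more tangible, here is a tiny standalone sketch of how a user query string could be split into plain full text terms and field comparisons. It is not the actual QueryParser code: the names UserTerm and splitUserQuery are made up for this illustration, and the real parser does a lot more (matching field names to properties, typing values, and so on).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed term. An empty field means
// "plain full text term", otherwise it is a "field=value" comparison.
struct UserTerm {
    std::string field;
    std::string value;
};

// Split a query like "downton season=2 episode=2" on whitespace and
// separate "field=value" / "field:value" tokens from plain words.
std::vector<UserTerm> splitUserQuery(const std::string& query)
{
    std::vector<UserTerm> terms;
    std::istringstream stream(query);
    std::string token;
    while (stream >> token) {
        const std::string::size_type separator = token.find_first_of("=:");
        UserTerm term;
        if (separator == std::string::npos || separator == 0) {
            // no field name in front of a separator: treat as full text
            term.value = token;
        } else {
            term.field = token.substr(0, separator);
            term.value = token.substr(separator + 1);
        }
        terms.push_back(term);
    }
    return terms;
}
```

For "downton season=2 episode=2" this yields one full text term and two comparisons, which is the shape the SPARQL query below reflects.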

So let us get back to our query “downton season=2 episode=2” and look at the resulting SPARQL query string (I simplified the query a bit for readability. The important parts are still there):

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  {
    ?r nmm:hasEpisode ?v2 .
    ?v2 ?v3 "2"^^xsd:int .
    ?v3 rdfs:subPropertyOf rdfs:label .
  } UNION {
    ?r nmm:episodeNumber "2"^^xsd:int .
  } .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Like the user query the SPARQL query has three main parts: the graph pattern checking the nmm:season, the graph pattern checking the episode and the graph pattern checking the full text search term “downton“. The latter we can safely ignore in this case. It is always a UNION so full text searches also include relations to tags and the like.

The interesting bit is the second UNION. The query parser matched the term “episode” to properties nmm:hasEpisode and nmm:episodeNumber. At first glance this is fine since both contain the term “episode“. However, the property nmm:season which is used in the first non-optional graph-pattern has a domain of nmm:TVShow. nmm:hasEpisode on the other hand has a domain of nmm:TVSeries. That means that the first pattern in the UNION can never match in combination with the first graph pattern since the domains are different.

The obvious optimization is to remove the first part of the UNION which yields a much simpler and way faster query:

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  ?r nmm:episodeNumber "2"^^xsd:int .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Well, sadly this is not generically true since resources can be double/triple/whateveriple-typed, meaning that in theory an nmm:TVShow could also have type nmm:TVSeries. In this case it is obviously not likely but there are many cases in which it in fact does apply. Thus, this optimization cannot be applied to all queries. I will, however, include it in the parser where it is very likely that the user does not take double-typing into account.
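The pruning idea itself fits into a few lines. The following is only a standalone sketch and not the code in the Nepomuk Query API: UnionBranch, propertyDomains and pruneByDomain are names invented for this example, the domain table is a hand-written excerpt, and sub-class relations as well as the double-typing caveat just mentioned are ignored on purpose.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One alternative of a UNION, reduced to the property it puts on ?r.
struct UnionBranch {
    std::string property;
};

// Hypothetical, hand-written excerpt of the ontology: property -> rdfs:domain.
std::map<std::string, std::string> propertyDomains()
{
    std::map<std::string, std::string> domains;
    domains["nmm:season"] = "nmm:TVShow";
    domains["nmm:episodeNumber"] = "nmm:TVShow";
    domains["nmm:hasEpisode"] = "nmm:TVSeries";
    return domains;
}

// Drop all UNION branches whose property has a different domain than the
// property of a mandatory pattern on the same variable.
std::vector<UnionBranch> pruneByDomain(const std::string& mandatoryProperty,
                                       const std::vector<UnionBranch>& branches)
{
    const std::map<std::string, std::string> domains = propertyDomains();
    const std::map<std::string, std::string>::const_iterator required
        = domains.find(mandatoryProperty);
    if (required == domains.end()) {
        return branches; // unknown domain: leave the query untouched
    }
    std::vector<UnionBranch> kept;
    for (const UnionBranch& branch : branches) {
        const std::map<std::string, std::string>::const_iterator domain
            = domains.find(branch.property);
        // Branches with an unknown domain are kept to stay on the safe side.
        if (domain == domains.end() || domain->second == required->second) {
            kept.push_back(branch);
        }
    }
    return kept;
}
```

With nmm:season as the mandatory property only the nmm:episodeNumber branch survives, which corresponds to the simplified query above.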

If you have good examples that show why this optimization should not be included in the query parser by default please tell me so I can re-consider.

Now after having written this and proof-reading the SPARQL query I realize that this particular query could have been optimized in a much simpler way: the value “2” is obviously an integer value, thus it can never match a non-literal as required by the nmm:hasEpisode property…

A Little Drier But Not That Dry: Extracting Websites From Nepomuk Resources

After writing about my TV Show Namer I want to get out some more ideas and examples before I retire as a full-time KDE developer in a few weeks.

The original idea for what I am about to present came a long while ago when I remembered that Vishesh gave me a link on IRC but I could not remember when exactly. So I figured that it would be nice to extract web links from Nepomuk resources to be able to query and browse them.

As always what I figured would be a quick thing led me to a few bugs which I needed to fix before moving on. So all in all it took much longer than I had hoped. Anyway, the result is another small application called nepomukwebsiteextractor. It is a small tool without a UI which will extract websites from the given resource or file. If called without arguments it will query for resources which do not have any related websites and extract websites from them. Since it tries to fetch a title for each website this is a very slow procedure.
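For illustration, the "getting the information" part could look roughly like the following standalone sketch. It is not the code of the nepomukwebsiteextractor: extractLinks and domainOf are hypothetical helpers written for this example (domainOf plays the role of the extractDomain call used further down), and the link pattern is deliberately simple.

```cpp
#include <cassert>
#include <regex>
#include <set>
#include <string>

// Collect the distinct http(s) links found in a piece of plain text.
std::set<std::string> extractLinks(const std::string& text)
{
    // Simple pattern: a link ends at whitespace, quotes or brackets.
    static const std::regex linkPattern(R"re(https?://[^\s"'<>()]+)re");
    std::set<std::string> links;
    for (std::sregex_iterator it(text.begin(), text.end(), linkPattern), end;
         it != end; ++it) {
        std::string link = it->str();
        // Strip trailing punctuation which usually belongs to the sentence.
        while (!link.empty()
               && std::string(".,;:!?").find(link[link.size() - 1]) != std::string::npos) {
            link.erase(link.size() - 1);
        }
        links.insert(link);
    }
    return links;
}

// Reduce "http://www.kde.org/some/page" to "http://www.kde.org".
std::string domainOf(const std::string& link)
{
    const std::string::size_type schemeEnd = link.find("://");
    if (schemeEnd == std::string::npos) {
        return std::string();
    }
    const std::string::size_type hostEnd = link.find_first_of("/?#", schemeEnd + 3);
    return link.substr(0, hostEnd);
}
```

Fetching the page titles is left out here since that is the slow, network-bound part.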

As before the storing to Nepomuk is the easy part. Getting the information is way harder:

using namespace Nepomuk;
using namespace Nepomuk::Vocabulary;

// create the main Website resource
NFO::Website website(url);
website.addType(NFO::WebDataObject());
QString title = fetchHtmlPageTitle(url);
if(!title.isEmpty()) {
  website.setTitle(title);
}

// create the domain website resource
KUrl domainUrl = extractDomain(url);
NFO::Website domainWebPage(domainUrl);
domainWebPage.addType(NFO::WebDataObject());
domainWebPage.addPart(website.uri());
title = fetchHtmlPageTitle(domainUrl);
if(!title.isEmpty()) {
  domainWebPage.setTitle(title);
}

// relate the two via the nie:isPartOf relation
website.addProperty(NIE::isPartOf(), domainUrl);
domainWebPage.addProperty(NIE::hasPart(), website.uri());

// funnily enough the domain is a sub-resource of the website
// this is so removing the website will also remove the domain
// as it is the one which triggered the domain resource's creation
website.addSubResource(domainUrl);

// save it all to Nepomuk
Nepomuk::storeResources(SimpleResourceGraph() << website << domainWebPage);

Once done you will have thousands of nfo:Website resources in your Nepomuk database, each of which is related to its respective domain via nie:isPartOf (I am not entirely sure if this is perfectly sound but it is convenient as far as graph traversal goes). We can of course query those resources with nepomukshell (this is trivial but allows me to pimp up this blog post with a screenshot):

And of course Dolphin shows the extracted links in its meta-data panel:

I am not entirely sure how to usefully show this information to the user yet but it is already quite nice to navigate the sub-graph which has been created here.

Of course we could query all the resources which mention a link with domain http://www.kde.org:

select ?r where {
  ?r nie:links ?w .
  ?w a nfo:Website .
  ?w nie:isPartOf ?p .
  ?p nie:url <http://www.kde.org> .
}

Or the Nepomuk API version of the same:

using namespace Nepomuk::Query;
using namespace Nepomuk::Vocabulary;

Query query =
  ComparisonTerm(NIE::links(),
    ResourceTypeTerm(NFO::Website()) &&
    ComparisonTerm(NIE::isPartOf(),
      ResourceTerm(QUrl("http://www.kde.org"))
    )
  );

It gets even more interesting when combined with the nfo:Websites created by KParts when downloading files.

Well, now I provided screenshots, code examples, and a link to a repository – I think it is all there – have fun.

Update: In the spirit of promoting the previously mentioned ResourceWatcher here is how the website extractor would monitor for new stuff to be extracted:

Nepomuk::ResourceWatcher* watcher = new Nepomuk::ResourceWatcher(this);
watcher->addProperty(NIE::plainTextContent());
connect(watcher, 
        SIGNAL(propertyAdded(Nepomuk::Resource,
                             Nepomuk::Types::Property,
                             QVariant)),
        this,
        SLOT(slotPropertyAdded(Nepomuk::Resource,
                               Nepomuk::Types::Property,
                               QVariant)));
watcher->start();

[...]

void slotPropertyAdded(const Nepomuk::Resource& res,
                       const Nepomuk::Types::Property&,
                       const QVariant& value) {
  if(!hasOneOfThoseXmlOrRdfMimeTypes(res)) {
    const QString text = value.toString();
    extractWebsites(res, text);
  }
}

Something Way Less Dry: TV Shows

After my rather boring blog about change notifications I will now write about something that I have wanted ever since I started developing Nepomuk. But only now has Nepomuk reached a point where it provides all the necessary pieces. I am talking about TV Show management – obviously I mean the rips from the DVD boxes I own.

So what about it? Well, I wrote a little tool called nepomuktvnamer (inspired by the great python tool tvnamer) which works a bit like our nepomukindexer except that it does not extract meta-data from the file but tries to fetch information about TV Shows from thetvdb.com. You can run the tool on a single file or recursively on a whole directory. It will then use a set of regular expressions (based on the ones from tvnamer) to analyze the file names and extract the show title, season and episode numbers.
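To give an idea of what that file name analysis looks like, here is a small standalone sketch. The two patterns are simplified examples of my own and not the expressions used by tvnamer or nepomuktvnamer, and EpisodeInfo and parseEpisodeFileName are names made up for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// What we hope to get out of a name like "White.Collar.S02E01.avi".
struct EpisodeInfo {
    std::string show;
    int season;
    int episode;
};

// Try a few common naming schemes in order. Returns false if none matches.
bool parseEpisodeFileName(const std::string& fileName, EpisodeInfo& info)
{
    static const std::regex patterns[] = {
        // "Show.Name.S02E01.ext" or "Show Name - s02e01.ext"
        std::regex(R"re(^(.+?)[ ._-]+[Ss](\d{1,2})[Ee](\d{1,3}).*$)re"),
        // "Show.Name.2x01.ext" or "Show Name - 2x01.ext"
        std::regex(R"re(^(.+?)[ ._-]+(\d{1,2})[Xx](\d{1,3}).*$)re"),
    };
    for (const std::regex& pattern : patterns) {
        std::smatch match;
        if (!std::regex_match(fileName, match, pattern)) {
            continue;
        }
        info.show = match[1].str();
        // Dots and underscores in the title are usually stand-ins for spaces.
        for (char& c : info.show) {
            if (c == '.' || c == '_') {
                c = ' ';
            }
        }
        info.season = std::stoi(match[2].str());
        info.episode = std::stoi(match[3].str());
        return true;
    }
    return false;
}
```

The extracted title is then what gets looked up on thetvdb.com, while season and episode numbers help to narrow down the matches.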

The nepomuktvnamer will ask the user in case multiple matches have been found and cannot be filtered according to season and episode numbers.

It will then save that information into Nepomuk through our powerful Data Management API. The code looks roughly as follows, ignoring the code that stores actors, banners and the like:

const Tvdb::Series series = getSeriesForName(name);
Nepomuk::NMM::TVSeries seriesRes;
seriesRes.setTitle(series.name());
seriesRes.addDescription(series.overview());

Nepomuk::NMM::TVShow episodeRes(url);
episodeRes.setEpisodeNumber(episode);
episodeRes.setSeason(season);
episodeRes.setTitle(series[season][episode].name());
episodeRes.setSynopsis(series[season][episode].overview());
episodeRes.setReleaseDate(QDateTime(series[season][episode].firstAired(), QTime(), Qt::UTC));
episodeRes.setGenres(series.genres());

seriesRes.addEpisode(episodeRes.uri());
episodeRes.setSeries(seriesRes.uri());

Nepomuk::SimpleResourceGraph graph;
graph << episodeRes << seriesRes;
Nepomuk::storeResources(graph, Nepomuk::IdentifyNew, Nepomuk::OverwriteProperties);

(This code uses my very own LibTvdb which is essentially a Qt’ish wrapper around the thetvdb.com API.)

The result of this can be seen in Dolphin:

Here we see the actors, the series, the synopsis and so on. Clicking on an actor will bring up all they played in, clicking on the series will bring up all the episodes from that series, and so on.

Now let us have a look at the series itself using my beefed up version of the Nepomuk KIO slave:

As we can see the nepomuktvnamer also fetched a banner which is stored as nfo:depiction. (This is why you need the git master version of shared-desktop-ontologies to compile nepomuktvnamer. Oh, and also nepomuktvnamer is linked against libnepomukcore from nepomuk-core instead of libnepomuk. So you either have to install nepomuk-core, which can be a bit tricky, or quickly change the CMakeLists.txt to link to libnepomuk instead.)

We can of course also query the newly created information. Simple queries in Dolphin could be “series:Sherlock” or “sherlock season=1”. Well, things to play with.

I also created the smallest Nepomuk service to date: the nepomuktvnamerservice uses the ResourceWatcher to listen for newly created nfo:Video resources and simply calls the nepomuktvnamer on the related file.

Last but not least the git repository contains a python script which checks for each existing series if a new episode has been aired. The output looks a bit like this:

White Collar - New episode "Withdrawal" (02x01) first aired 13 July 2010.
Freaks and Geeks - No new episode found.
The Mentalist - Upcoming episode "Red is the New Black" (04x13) will air 02 February 2012.

Now obviously this is more a task for a Plasma applet. So if anyone out there is interested in doing that – please go ahead. I think it could be a cool thing. One basically only has to update whenever a new nmm:TVShow is created or when the new day dawns.
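The check itself is not much work. The following standalone sketch shows the kind of logic such an applet would need. It is not a port of the python script: Episode and describeNextEpisode are names invented for this example, the episode list is assumed to have been fetched from thetvdb.com already, and dates are kept as plain ISO strings.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

struct Episode {
    int season;
    int number;
    std::string title;
    std::string firstAired; // ISO date "YYYY-MM-DD", compares correctly as text
};

// True if the episode comes after the given season/episode number.
static bool isLater(const Episode& episode, int season, int number)
{
    return episode.season > season
        || (episode.season == season && episode.number > number);
}

// Build a status line for the first episode following the last one we know.
std::string describeNextEpisode(const std::string& show,
                                const std::vector<Episode>& episodes,
                                int lastSeason, int lastNumber,
                                const std::string& today)
{
    const Episode* next = nullptr;
    for (const Episode& candidate : episodes) {
        if (!isLater(candidate, lastSeason, lastNumber)) {
            continue;
        }
        // Keep the earliest of all episodes that follow the last known one.
        if (!next || isLater(*next, candidate.season, candidate.number)) {
            next = &candidate;
        }
    }
    if (!next) {
        return show + " - No new episode found.";
    }
    char code[32];
    std::snprintf(code, sizeof(code), "%02dx%02d", next->season, next->number);
    const bool aired = next->firstAired <= today;
    return show + " - " + (aired ? "New" : "Upcoming") + " episode \""
        + next->title + "\" (" + code + ") "
        + (aired ? "first aired " : "will air ") + next->firstAired + ".";
}
```

An applet would simply call something like this for each nmm:TVSeries, once per day and whenever a new nmm:TVShow shows up.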

And the cherry on top is of course Bangarang:

Something Dry: Change Notifications

Ignoring the fact that I did not blog in nearly two months I will simply get some developer information out there. Getting notified about changes in the Nepomuk database has always been a problem. All we had for a long time were the ugly statementAdded and statementRemoved signals from Soprano which, when actually used, would slow down the whole system as one would have to check each single statement for the information one needed.

Thus, with the introduction of the Data Management Service a while back we also gave birth to the ResourceWatcher which can be used to watch resources, properties, and types for changes. The concept is simple. Just create an instance of the watcher and tell it which resources or which types of resources you want to watch for changes. In addition you can restrict it to specific properties. Then you get nice signals which inform you about the changes when they happen.

Nepomuk::ResourceWatcher *watcher = new Nepomuk::ResourceWatcher(this);
watcher->addType(NCO::Contact());
connect(watcher, SIGNAL(resourceCreated(Nepomuk::Resource, QList<QUrl>)),
        this, SLOT(slotCreated(Nepomuk::Resource, QList<QUrl>)));
watcher->start();

The problem with this has been that it only works with data manipulation which happens through the Data Management Service and libnepomuk did not use that for a long time. Now we finally fixed that (sadly I did not manage to push it in time for 4.8 but it will be in 4.8.1) and the change notifications become really useful. I also implemented a bunch of unit tests and made sure the most important types of notifications actually work.

So all in all an important step for developers using Nepomuk which was overdue.

Symbolic Links in Nepomuk – A Solution

Until now symbolic links were not handled in Nepomuk. Today I committed the last patch for the new symlink support in Nepomuk. The solution I chose is not the theoretically perfect one. That would have taken way too much effort while introducing all kinds of possible bugs, regressions, API incompatibilities, and so on. But the solution is nice and clean and simple.

Essentially each direct symlink is indexed as a separate file using the content of its target file. (This is necessary since a direct symlink might have a different file name than the target file.) The interesting part is the indirect symlinks. Indirect symlinks are files in a folder which is a symlink to another folder. An example:

/home/trueg/
|-- subdir/
   |-- thefile.txt
|-- link/ -> subdir/
   |-- thefile.txt

Here I have a folder “subdir” which contains a file “thefile.txt”. The folder “link” is a direct symlink to “subdir” whereas “link/thefile.txt” is an indirect symlink to “subdir/thefile.txt”.

Indirect symlinks are simply stored as alternative URLs on the target file resources using the kext:altUrl property. (The property is not defined in NIE since it is not theoretically sound with respect to the design of NIE. It needs to be considered a beautiful hack.)

The only situation in which the alternative URLs are actually needed is when searching in a specific folder. Imagine searching in “/home/trueg/link” only. Since there are no nie:url values which match that prefix we need to search the kext:altUrls, too.
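In other words a folder-restricted search has to look at both properties. Here is a tiny standalone sketch of that check, using plain paths instead of URLs. FileResource, isInFolder and searchInFolder are names made up for this example and have nothing to do with the actual implementation in kdelibs and kde-runtime.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A file resource with its real location and its symlinked aliases.
struct FileResource {
    std::string url;                  // stands in for nie:url
    std::vector<std::string> altUrls; // stands in for the kext:altUrl values
};

// True if path lies below folder (and is not just a sibling with the same prefix).
bool isInFolder(const std::string& path, const std::string& folder)
{
    if (folder.empty()) {
        return false;
    }
    const std::string prefix
        = folder[folder.size() - 1] == '/' ? folder : folder + "/";
    return path.compare(0, prefix.size(), prefix) == 0;
}

// Return the paths under which the given files are visible inside folder,
// checking the primary URL first and then all alternative URLs.
std::vector<std::string> searchInFolder(const std::vector<FileResource>& files,
                                        const std::string& folder)
{
    std::vector<std::string> hits;
    for (const FileResource& file : files) {
        if (isInFolder(file.url, folder)) {
            hits.push_back(file.url);
        }
        for (const std::string& altUrl : file.altUrls) {
            if (isInFolder(altUrl, folder)) {
                hits.push_back(altUrl);
            }
        }
    }
    return hits;
}
```

Searching in "/home/trueg/link" then finds thefile.txt through its alternative URL even though the file itself was only indexed once.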

The result of all this is that nearly no additional space is required except for the kext:altUrl properties, files are not indexed more than once, and files in symlinked folders are found in addition to “normal” files.

In my tests everything seems to work nicely but I urge you to test the nepomuk/symlinkHandling branches in kdelibs and kde-runtime and report any problems back to me. The more testing I get the quicker I can merge both into KDE 4.8.

Lastly the pledgie campaign is done but the search for funds goes on:

Finding Duplicate Images Made Easy

It is a typical problem: we downloaded images from a camera, maybe did not delete them from the camera instantly, then downloaded the same images again next time, maybe created an album by copying images into sub-folders (without Nepomuk Digikam can only do so much ;), and so on. Essentially there are a lot of duplicate photos lying around.

But never fear. Just let Nepomuk index all of them and then gather all the duplicates via:

select distinct ?u1 ?u2 where { 
  ?f1 a nexif:Photo . 
  ?f2 a nexif:Photo . 
  ?f1 nfo:hasHash ?h . 
  ?f2 nfo:hasHash ?h . 
  ?f1 nie:url ?u1 . 
  ?f2 nie:url ?u2 . 
  filter(?f1!=?f2) .
}

Quick explanation: the query selects all nexif:Photo resources which have the same hash value but are not the same. This of course can be tweaked by adding something like

?f1 nfo:fileName ?fn .
?f2 nfo:fileName ?fn .

to make sure that we only catch the ones that we downloaded more than once. Or we add

?f1 nie:contentCreated ?cc .
?f2 nie:contentCreated ?cc .

to ensure that the photo was actually taken at the same time – although I suppose the probability that two different photos have the same hash value is rather small.

Maybe one last little detail. In theory it would be more correct to do the following:

?f1 nfo:hasHash ?h1 .
?f2 nfo:hasHash ?h2 .
?h1 nfo:hashValue ?h .
?h2 nfo:hashValue ?h .

However, with the introduction of the Data Management Service in KDE 4.7 similar hash resources are merged into one. Thus, the slightly simpler query above. Still, to be sure to also properly handle pre-KDE-4.7 data the above addition might be prudent.

Of course this should be hidden in some application which does the work for you. The point is that Nepomuk has a lot of power that only reveals itself at second glance. :)
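As a rough idea of what such an application would do with the data: the query returns every duplicate pair (in both directions), so grouping by hash is the more convenient form for presenting results. The following standalone sketch does that grouping on plain structs. Photo and findDuplicates are names invented for this example and the photo list is assumed to come from a Nepomuk query.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for a nexif:Photo with its nie:url and hash value.
struct Photo {
    std::string url;
    std::string hash;
};

// Group photos by hash and return only the groups with more than one file.
std::vector<std::vector<std::string> > findDuplicates(const std::vector<Photo>& photos)
{
    std::map<std::string, std::vector<std::string> > byHash;
    for (const Photo& photo : photos) {
        byHash[photo.hash].push_back(photo.url);
    }
    std::vector<std::vector<std::string> > groups;
    for (const auto& entry : byHash) {
        if (entry.second.size() > 1) {
            groups.push_back(entry.second);
        }
    }
    return groups;
}
```

Each returned group is one photo with all its copies, which is what a user would want to see before deciding which files to delete.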