A Little Drier But Not That Dry: Extracting Websites From Nepomuk Resources


After writing about my TV Show Namer I want to get out some more ideas and examples before I retire as a full-time KDE developer in a few weeks.

The original idea for what I am about to present came a long while ago when I remembered that Vishesh gave me a link on IRC but I could not remember when exactly. So I figured that it would be nice to extract web links from Nepomuk resources to be able to query and browse them.

As always, what I figured would be a quick thing led me to a few bugs which I needed to fix before moving on, so all in all it took much longer than I had hoped. Anyway, the result is another small application called nepomukwebsiteextractor. It is a small tool without a UI which will extract websites from the given resource or file. If called without arguments it will query for resources which do not have any related websites and extract websites from them. Since it tries to fetch a title for each website this is a very slow procedure.
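To give an idea of what "extracting websites" from a resource's text can look like, here is a minimal sketch in plain standard C++. It is not the code from nepomukwebsiteextractor: the helper name extractUrls and the regular expression are assumptions made for illustration, and the pattern is deliberately naive.

```cpp
#include <regex>
#include <set>
#include <string>

// Hypothetical helper (not taken from nepomukwebsiteextractor): collect the
// distinct http/https URLs that appear in a piece of plain text.
std::set<std::string> extractUrls(const std::string& text)
{
    // Deliberately simple pattern: the scheme, then everything up to
    // whitespace or a character that usually delimits a URL in running text.
    static const std::regex urlPattern(R"re(https?://[^\s<>"')]+)re");

    std::set<std::string> urls;
    for (std::sregex_iterator it(text.begin(), text.end(), urlPattern), end;
         it != end; ++it) {
        std::string url = it->str();
        // Strip trailing punctuation that belongs to the sentence, not the URL.
        const std::string punctuation = ".,;:!?";
        while (!url.empty() && punctuation.find(url.back()) != std::string::npos) {
            url.pop_back();
        }
        if (!url.empty()) {
            urls.insert(url);
        }
    }
    return urls;
}
```

Using a set means a link that is mentioned several times in the same text only shows up once, which matters when every hit triggers a slow title fetch.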

As before the storing to Nepomuk is the easy part. Getting the information is way harder:

using namespace Nepomuk;
using namespace Nepomuk::Vocabulary;

// create the main Website resource
NFO::Website website(url);
website.addType(NFO::WebDataObject());
QString title = fetchHtmlPageTitle(url);
if(!title.isEmpty()) {
  website.setTitle(title);
}

// create the domain website resource
KUrl domainUrl = extractDomain(url);
NFO::Website domainWebPage(domainUrl);
domainWebPage.addType(NFO::WebDataObject());
domainWebPage.addPart(website.uri());
title = fetchHtmlPageTitle(domainUrl);
if(!title.isEmpty()) {
  domainWebPage.setTitle(title);
}

// relate the two via the nie:isPartOf relation
website.addProperty(NIE::isPartOf(), domainUrl);
domainWebPage.addProperty(NIE::hasPart(), website.uri());

// funnily enough the domain is a sub-resource of the website
// this is so removing the website will also remove the domain
// as it is the one which triggered the domain resource's creation
website.addSubResource(domainUrl);

// save it all to Nepomuk
Nepomuk::storeResources(SimpleResourceGraph() << website << domainWebPage);
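The snippet above relies on two helpers which are not shown, extractDomain() and fetchHtmlPageTitle(). As a rough sketch of what they need to do, here are two stand-ins in plain standard C++. Both names (extractDomainSketch, extractHtmlTitleSketch) and both regular expressions are assumptions for illustration, not the tool's actual implementation. The title helper only covers the parsing half, working on HTML that has already been downloaded, and it does not decode HTML entities.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for extractDomain(): reduce a URL to scheme + host.
// Returns an empty string if the input does not look like an http(s) URL.
std::string extractDomainSketch(const std::string& url)
{
    static const std::regex hostPattern(R"re(^(https?://[^/?#]+))re",
                                        std::regex::icase);
    std::smatch match;
    if (std::regex_search(url, match, hostPattern)) {
        return match[1].str();
    }
    return std::string();
}

// Hypothetical stand-in for the parsing half of fetchHtmlPageTitle():
// return the trimmed text of the first <title> element, or an empty
// string if there is none.
std::string extractHtmlTitleSketch(const std::string& html)
{
    static const std::regex titlePattern(R"re(<title[^>]*>([^<]*)</title>)re",
                                         std::regex::icase);
    std::smatch match;
    if (!std::regex_search(html, match, titlePattern)) {
        return std::string();
    }
    const std::string title = match[1].str();

    // Trim surrounding whitespace, which is common around page titles.
    const char* whitespace = " \t\r\n";
    const std::string::size_type first = title.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return std::string();
    }
    const std::string::size_type last = title.find_last_not_of(whitespace);
    return title.substr(first, last - first + 1);
}
```

An empty return value maps directly onto the title.isEmpty() checks above: no title found simply means no title is set on the resource.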

Once done you will have thousands of nfo:Website resources in your Nepomuk database, each of which is related to its respective domain via nie:isPartOf (I am not entirely sure if this is perfectly sound but it is convenient as far as graph traversal goes). We can of course query those resources with nepomukshell (this is trivial but allows me to pimp up this blog post with a screenshot):

And of course Dolphin shows the extracted links in its meta-data panel:

I am not entirely sure how to usefully show this information to the user yet but it is already quite nice to navigate the sub-graph which has been created here.

Of course we could query all the resources which mention a link with domain http://www.kde.org:

select ?r where {
  ?r nie:links ?w .
  ?w a nfo:Website .
  ?w nie:isPartOf ?p .
  ?p nie:url <http://www.kde.org> .
}

Or the Nepomuk API version of the same:

using namespace Nepomuk::Query;
using namespace Nepomuk::Vocabulary;

Query query =
  ComparisonTerm(NIE::links(),
    ResourceTypeTerm(NFO::Website()) &&
    ComparisonTerm(NIE::isPartOf(),
      ResourceTerm(QUrl("http://www.kde.org"))
    )
  );

It gets even more interesting when combined with the nfo:Websites created by KParts when downloading files.

Well, now I provided screenshots, code examples, and a link to a repository – I think it is all there – have fun.

Update: In the spirit of promoting the previously mentioned ResourceWatcher here is how the website extractor would monitor for new stuff to be extracted:

Nepomuk::ResourceWatcher* watcher = new Nepomuk::ResourceWatcher(this);
watcher->addProperty(NIE::plainTextContent());
connect(watcher, 
        SIGNAL(propertyAdded(Nepomuk::Resource,
                             Nepomuk::Types::Property,
                             QVariant)),
        this,
        SLOT(slotPropertyAdded(Nepomuk::Resource,
                               Nepomuk::Types::Property,
                               QVariant)));
watcher->start();

[...]

void slotPropertyAdded(const Nepomuk::Resource& res,
                       const Nepomuk::Types::Property&,
                       const QVariant& value) {
  if(!hasOneOfThoseXmlOrRdfMimeTypes(res)) {
    const QString text = value.toString();
    extractWebsites(res, text);
  }
}
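
The hasOneOfThoseXmlOrRdfMimeTypes() check is not shown above. A minimal sketch of such a filter, in plain standard C++ and working on a MIME type string rather than a Nepomuk::Resource, could look like the following. The function name isXmlOrRdfMimeType and the exact list of MIME types are assumptions; presumably the point of the check is that XML and RDF files are full of namespace URLs which are not links anybody wants to browse.

```cpp
#include <string>

// Hypothetical predicate: true for MIME types that are XML or RDF
// serializations. The list below is an assumption, not the tool's own.
bool isXmlOrRdfMimeType(const std::string& mimeType)
{
    auto endsWith = [](const std::string& s, const std::string& suffix) {
        return s.size() >= suffix.size()
            && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
    };
    return mimeType == "text/xml"
        || mimeType == "application/xml"
        || mimeType == "text/turtle"
        || endsWith(mimeType, "+xml");   // e.g. application/rdf+xml
}
```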

18 thoughts on “A Little Drier But Not That Dry: Extracting Websites From Nepomuk Resources”

  1. This is the error displayed when I tried to compile this project.

    /home/ignacio/devel/nepomuk/git/nepomukwebsiteextractor/htmlentities.c:1:22: fatal error: entities.h: No such file or directory
