A Little Bit Of Query Optimization


Every once in a while I add another piece of query optimization code to the Nepomuk Query API. This time it was a direct result of my earlier TV Show handling. I simply thought that a query like “downton season=2 episode=2” takes too long to complete.

Now in order to understand this you need to know that there is a rather simple QueryParser class which converts a query string like the above into a Nepomuk::Query::Query which is simply a collection of Nepomuk::Query::Term instances. A Query instance is then converted into a SPARQL query which can be handled by Virtuoso. This SPARQL query already contains a set of optimizations, some specific to Virtuoso, some specific to Nepomuk. Of course there is always room for improvement.

So let us get back to our query “downton season=2 episode=2” and look at the resulting SPARQL query string (I simplified the query a bit for readability. The important parts are still there):

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  {
    ?r nmm:hasEpisode ?v2 .
    ?v2 ?v3 "2"^^xsd:int .
    ?v3 rdfs:subPropertyOf rdfs:label .
  } UNION {
    ?r nmm:episodeNumber "2"^^xsd:int .
  } .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Like the user query the SPARQL query has three main parts: the graph pattern checking the nmm:season, the graph pattern checking the episode and the graph pattern checking the full text search term “downton“. The latter we can safely ignore in this case. It is always a UNION so full text searches also include relations to tags and the like.

The interesting bit is the second UNION. The query parser matched the term “episode” to properties nmm:hasEpisode and nmm:episodeNumber. On first glance this is fine since both contain the term “episode“. However, the property nmm:season which is used in the first non-optional graph-pattern has a domain of nmm:TVShow. nmm:hasEpisode on the other hand has a domain of nmm:TVSeries. That means that the first pattern in the UNION can never match in combination with the first graph pattern since the domains are different.

The obvious optimization is to remove the first part of the UNION which yields a much simpler and way faster query:

select distinct ?r where {
  ?r nmm:season "2"^^xsd:int .
  ?r nmm:episodeNumber "2"^^xsd:int .
  {
    ?r ?v4 ?v6 .
    FILTER(bif:contains(?v6, "'downton'")) .
  } UNION {
    ?r ?v4 ?v7 .
    ?v7 ?v5 ?v6 .
    ?v5 rdfs:subPropertyOf rdfs:label .
    FILTER(bif:contains(?v6, "'downton'")) .
  } .
}

Well, sadly this is not generically true since resources can be double/triple/whateveriple-typed, meaning that in theory an nmm:TVShow could also have type nmm:TVSeries. In this case it is obviously not likely but there are many cases in which it in fact does apply. Thus, this optimization cannot be applied to all queries. I will, however, include it in the parser where it is very likely that the user does not take double-typing into account.

If you have good examples that show why this optimization should not be included in the query parser by default please tell me so I can re-consider.

Now after having written this and proof-reading the SPARQL query I realize that this particular query could have been optimized in a much simpler way: the value “2” is obviously an integer value, thus it can never match to a non-literal like required for the nmm:hasEpisode property…

16 thoughts on “A Little Bit Of Query Optimization

  1. Trueg, Thanks for those last posts, they are a must-read for anyone that wants to have a full searchable desktop. I found myself trying to write nepomuk queryes on the KIO directly because the latest Dolphin Query UI was a bit limited, and sttumbled upon an very sad thing: I didn’t know how to write nepomuk queries =)
    This really, really helps, Thanks.

  2. You are really filtering for episode number in a hard way :).

    This kind of entries are really interesting. When I add support to simple query API in nepoogle I can see how you are generating your queries so I can compare with mine.

    My approach is different and a bit slower because at the time when I wrote the engine I don’t know anything about rdfs:subPropertyOf and, on the other side, I’m not using bif:contains() because I had many troubles using it with unicode. I’m trying to write a new engine but I’m really busy actually and current one works reliable for me.

    The query generated by my engine, is the next, please consider that the search was developed when virtuoso has several encoding bugs ;) so I need special cases to handle unicode properly:

    SELECT DISTINCT ?x0 AS ?id
    WHERE {
    ?x0 nao:userVisible 1 .

    FILTER(
    (bif:exists ((
    SELECT *
    WHERE {
    { ?x0 nao:description ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
    UNION
    { ?x0 nao:identifier ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
    UNION
    { ?x0 nie:url ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
    UNION
    { ?x0 nao:hasTag ?x1 . ?x1 nao:identifier ?x2 . FILTER(REGEX(?x2, “downton”^^xsd:string, ‘i’)) }
    UNION
    { ?x0 nco:fullname ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
    UNION
    { ?x0 nie:title ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
    }
    )))
    &&
    (bif:exists ((
    SELECT *
    WHERE {
    { ?x0 nmm:season ?x1 . FILTER(?x1 = 2) }
    }
    )))
    &&
    (bif:exists ((
    SELECT *
    WHERE {
    { ?x0 nmm:episodeNumber ?x1 . FILTER(?x1 = 2) }
    }
    )))
    )

    }
    ORDER BY

    • Ok, wordpress removed some text from the query. I’m trying again.

      SELECT DISTINCT ?x0 AS ?id ORDER_FIELDS
      WHERE {

      ?x0 nao:userVisible 1 .

      OPTIONALS_FOR_ORDER_FIELDS

      FILTER(
      (bif:exists ((
      SELECT *
      WHERE {
      { ?x0 nao:description ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
      UNION
      { ?x0 nao:identifier ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
      UNION
      { ?x0 nie:url ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
      UNION
      { ?x0 nao:hasTag ?x1 . ?x1 nao:identifier ?x2 . FILTER(REGEX(?x2, “downton”^^xsd:string, ‘i’)) }
      UNION
      { ?x0 nco:fullname ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
      UNION
      { ?x0 nie:title ?x1 . FILTER(REGEX(?x1, “downton”^^xsd:string, ‘i’)) }
      }
      )))
      &&
      (bif:exists ((
      SELECT *
      WHERE {
      { ?x0 nmm:season ?x1 . FILTER(?x1 = 2) }
      }
      )))
      &&
      (bif:exists ((
      SELECT *
      WHERE {
      { ?x0 nmm:episodeNumber ?x1 . FILTER(?x1 = 2) }
      }
      )))
      )

      }

      ORDER BY ORDER_FIELDS

    • What do you mean when you say that I am filtering episode number the hard way? Is it because your bif:exists only requires one match while my simple graph pattern basically looks for all possible matches?

      • I wrote about your optimization:

        {
        ?r nmm:hasEpisode ?v2 .
        ?v2 ?v3 “2”^^xsd:int .
        ?v3 rdfs:subPropertyOf rdfs:label .
        } UNION {
        ?r nmm:episodeNumber “2”^^xsd:int .
        } .

        in the optimized query

        ?r nmm:episodeNumber “2”^^xsd:int .

        It’s a really big improvement :).

  3. Hi Trueg, for your example can I do just sparql like that?

    select distinct ?r where {
    ?r nmm:season “2”^^xsd:int .
    ?r nmm:episodeNumber “2”^^xsd:int .
    {
    ?r ?v4 ?v6 .
    ?v4 rdfs:subPropertyOf rdfs:label .
    }
    FILTER ((bif:strstr(str(?v4), “downtown”)) > -1)
    }

      • ok, I wrote it wrong =S the correct var is ?v6

        I’m using bif:strstr because and bif:contains not working here =D

        So another question, this query is little bit faster that the query with union?

            • Sure, this query using regex is working:
              ————————————–
              SELECT DISTINCT ?x0 AS ?id
              WHERE {
              ?x0 nao:userVisible 1 .
              FILTER(
              (bif:exists ((
              SELECT *
              WHERE {
              { ?x0 nie:url ?x1 . FILTER(REGEX(?x1, “downton”, “i”)) }
              }
              )))
              )
              }
              —————————————-
              But the next, that must be equivalent, don’t:
              —————————————-
              SELECT DISTINCT ?x0 AS ?id
              WHERE {
              ?x0 nao:userVisible 1 .
              FILTER(
              (bif:exists ((
              SELECT *
              WHERE {
              { ?x0 nie:url ?x1 . FILTER(bif:contains(?x1, “‘downton'”)) }
              }
              )))
              )
              }
              —————————————–

            • Yes, I read about this in Virtuoso help but the problem is that I don’t know which ontologies are full indexed and which not and, a nepoogle user probably has minor knowledge than me, so I need a method that works always with all ontologies, full indexed or not. Regex() is slow than bif:contains() but works always and for my particular case is the best choice :).

              I know that I could write code to handle this exceptions but the result is a code more complex. I don’t do this in my job, until there will be no other alternative, so I do even less when I write code for fun ;).

        • Of course using a union will always be a strain on performance. But your query is not generically applicable as it will never find resources by content, only resources which are related to other resources with a label matching the term.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s