A Word (or Two) on Removable Storage Media Handling in Nepomuk

While fixing existing Nepomuk bugs and trying to close them as they come in, I also look into other things. Last week it was the improved file indexer scheduling and file modification handling. This week it is another improvement in the handling of queries that involve removable media. Ignacio Serantes had already found one bug in the URL encoding. This time he wanted to search through all mounted removable storage media and realized that he could not. I just fixed that. To understand how, we need to go into detail about how Nepomuk handles removable media.

Removable Storage Media in Nepomuk

Files on removable storage media are a problem when it comes to metadata stored in Nepomuk. As long as the medium is mounted, we can simply identify the files through their local file path. But as soon as it is unmounted, the paths are no longer valid. To make things worse, we could mount the medium at another mount point the next time, or mount another medium (which obviously does not contain the files in question) at the same mount point. So we need a way around that problem. Ever since 4.7 Nepomuk has had a rather fancy way of doing that.

Internally Nepomuk uses a stack of Soprano::FilterModels which perform several operations on the data that passes through them. One of these models is the RemovableStorageModel. This model does one thing: it converts the local file URLs of files and folders on removable media into mount-path-independent URLs and vice versa. Currently it supports removable disks like USB keys or external hard disks (any storage that has a UUID), optical media, NFS and Samba mounts. The nice thing about it is that this conversion happens transparently to the client. Thus, a client simply uses the local file URLs according to the current mount path and does not care about anything else. It will always get the correct results.

To understand this better we should look at an example. Imagine we have a USB key inserted with UUID “xyz” which is mounted at /media/disk. Now if we add information about a file /media/disk/myfile.txt to Nepomuk the following happens: The RemovableStorageModel will convert the URL file:///media/disk/myfile.txt into filex://xyz/myfile.txt. This is a custom URL scheme which consists of the device UUID and the relative path. When querying the file the model does the conversion in the other direction. So far so simple.
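The two-way conversion can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual C++ code of the RemovableStorageModel: the MOUNTS table and the function names to_filex and from_filex are made up for this example, and percent-encoding of paths is ignored.

```python
from urllib.parse import urlsplit

# Hypothetical mount table: device UUID -> current mount path.
# The real model would ask the system for this; here it is hard-coded.
MOUNTS = {"xyz": "/media/disk"}


def to_filex(url, mounts=MOUNTS):
    """Convert a local file URL into a mount-path-independent filex URL."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        return url
    for uuid, mount_path in mounts.items():
        prefix = mount_path.rstrip("/") + "/"
        if parts.path.startswith(prefix):
            # Keep only the path relative to the mount point.
            return "filex://" + uuid + "/" + parts.path[len(prefix):]
    return url  # not on a known removable medium, leave untouched


def from_filex(url, mounts=MOUNTS):
    """Convert a filex URL back into a local file URL, if the medium is mounted."""
    parts = urlsplit(url)
    if parts.scheme != "filex" or parts.netloc not in mounts:
        return url  # unknown or unmounted medium, leave untouched
    return "file://" + mounts[parts.netloc].rstrip("/") + parts.path


print(to_filex("file:///media/disk/myfile.txt"))
print(from_filex("filex://xyz/myfile.txt"))
```

Note that a filex URL whose medium is not in the table is returned unchanged, which mirrors the fact that the path cannot be resolved while the medium is unmounted.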

Queries are where it gets a little more complicated. Imagine we want to query all files in a certain directory on the removable medium (ideally the SPARQL would be hidden by the Nepomuk query API). We would then perform a query like the following simplified one.

select ?r where {
  ?r nie:isPartOf ?p . 
  ?p nie:url <file:///media/disk/somefolder> . }

If we passed this query on to Virtuoso unchanged, we would not get any results since there is no resource with nie:url <file:///media/disk/somefolder>. So the RemovableStorageModel steps in again and does some query tweaking (rather primitive tweaking, seeing that we do not have a SPARQL parser in Nepomuk). The query is converted into

select ?r where {
  ?r nie:isPartOf ?p .
  ?p nie:url <filex://xyz/somefolder> . }

And suddenly we get the expected results.
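To give an idea of what such parser-free tweaking could look like, here is a small Python sketch that replaces file URLs in angle brackets by plain pattern matching. It is an assumption-laden illustration and not the code used in Nepomuk: MOUNTS, FILE_URI and rewrite_uris are hypothetical names, and mapping the mount point itself to a trailing "/" is my own choice for the sketch.

```python
import re

# Hypothetical mount table (UUID -> mount path); an assumption for this sketch.
MOUNTS = {"xyz": "/media/disk"}

# Matches a URI reference such as <file:///media/disk/somefolder>.
FILE_URI = re.compile(r"<file://(/[^>]*)>")


def rewrite_uris(query, mounts=MOUNTS):
    """Replace local file URLs on removable media with filex URLs."""
    def replace(match):
        path = match.group(1)
        for uuid, mount_path in mounts.items():
            root = mount_path.rstrip("/")
            if path == root or path.startswith(root + "/"):
                # Assumption: the mount point itself maps to "/".
                rest = path[len(root):] or "/"
                return "<filex://" + uuid + rest + ">"
        return match.group(0)  # not on a removable medium, keep as is

    return FILE_URI.sub(replace, query)


query = """select ?r where {
  ?r nie:isPartOf ?p .
  ?p nie:url <file:///media/disk/somefolder> . }"""
print(rewrite_uris(query))
```

URLs that do not live below a known mount path pass through untouched, so queries on normal local files are not affected.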

Of course this is still rather simple. It gets more complicated when SPARQL REGEX filters are involved. Imagine we wanted to look for all files in some sub-tree on a removable medium. We would then use a query along the lines of the following:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/disk/somefolder/')) . }

As before, passing this query directly on to Virtuoso would not yield any results. The RemovableStorageModel needs to do its magic first:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^filex://xyz/somefolder/')) . }

This is what the model did before Ignacio wanted to query all his removable media mounted somewhere under /media at once. Obviously he did something like:

select ?r where {
  ?r nie:url ?url .
  FILTER(REGEX(STR(?url), '^file:///media/')) . }

The result, however, was empty. This is simply because the filter did not exactly match the mount path of any of the removable media, so the RemovableStorageModel did not replace anything. The solution was to add filters for all the candidate media alongside the already existing filter. We need to keep the existing filter in case there is anything else under /media which is not a removable medium and, thus, has normal local file:/ URLs.

If we imagine that we have an additional mounted removable medium with UUID “foobar” then the query would be converted into something like the following.

select ?r where {
  ?r nie:url ?url .
  FILTER((REGEX(STR(?url), '^file:///media/') ||
          REGEX(STR(?url), '^filex://xyz/') || 
          REGEX(STR(?url), '^filex://foobar/'))) . }

This way we get the expected results. (The additional brackets are necessary in case the filter already contains more than one term.)
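The candidate selection can be sketched in Python as follows. Again, this is only a rough illustration under assumptions, not the real implementation: the names MOUNTS, expand_prefix and build_filter are invented, the mount path /media/stick for the second medium is made up, and the pattern is treated as a literal path prefix without any regex metacharacters.

```python
# Hypothetical mount table (UUID -> mount path); /media/stick is made up.
MOUNTS = {"xyz": "/media/disk", "foobar": "/media/stick"}


def expand_prefix(prefix, mounts=MOUNTS):
    """Return the regex prefixes that together cover the given file:// prefix."""
    if not prefix.startswith("^file://"):
        return [prefix]
    path = prefix[len("^file://"):]

    # Case 1: the prefix lies on a removable medium, so it is replaced.
    for uuid, mount_path in mounts.items():
        root = mount_path.rstrip("/") + "/"
        if path.startswith(root):
            return ["^filex://" + uuid + "/" + path[len(root):]]

    # Case 2: media are mounted somewhere below the prefix. Keep the
    # original term and add one candidate per matching medium.
    result = [prefix]
    for uuid, mount_path in mounts.items():
        if (mount_path.rstrip("/") + "/").startswith(path):
            result.append("^filex://" + uuid + "/")
    return result


def build_filter(prefix, var="?url", mounts=MOUNTS):
    """Build a SPARQL FILTER that ORs all candidate prefixes together."""
    terms = ["REGEX(STR(%s), '%s')" % (var, p)
             for p in expand_prefix(prefix, mounts)]
    # The extra pair of brackets keeps the terms together as one unit.
    return "FILTER((" + " || ".join(terms) + ")) ."


print(build_filter("^file:///media/"))
print(build_filter("^file:///media/disk/somefolder/"))
```

The first call produces the three-term filter, the second one the single filex term from the earlier sub-tree example.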

Well, I personally think this is a very clean solution: clients only have to deal with filex:/ and its friends nfs:/, smb:/, and optical:/ when the media are not mounted. I already drafted one way of handling that case a while back. But that will be perfected another day. ;)

For now let me, as always, close with the hint that development like this is still running on your donations.

7 thoughts on “A Word (or Two) on Removable Storage Media Handling in Nepomuk”

  1. Now that it is working, it has been shown that it was a good solution and is a good base, too, for a future replication subsystem.

    I keep looking for errors, but it seems that there isn’t any serious bug in external disk management :).

  2. I would really love to see Nepomuk work with an NFS plugin in a similar way to Amarok’s one. And in my best dreams there is also an Akonadi external MySQL interface… I have a NAS with a MySQL database, and it works very nicely with Amarok!

  3. Hey Mr. Trueg, nepomuk/strigi isn’t indexing files on my NTFS drive in Kubuntu 11.10. /etc/fstab mounts it by UUID under /mnt and then I have symlinks to it in my home directory. I tell System Settings > Desktop Search > Desktop index folders > Customize index folders… to index certain subdirs of those symlinks, yet Dolphin doesn’t find matches in them. A mounted NTFS drive doesn’t seem like removable media, but just in case I changed Removable media handling to “Ask individually when newly mounted”, and I still get no indexing of its files. ??

    I followed the debugging suggestions in http://kdeatopensuse.wordpress.com/2011/11/09/debugging-nepomukvirtuosos-cpu-usage/ but I don’t see any obvious output explaining “can’t index NTFS” or “inotify not available, ignoring volume”, or whatever the problem is.

    It would be nice if Dolphin’s Find [From here] warned you “This directory is not indexed in Desktop Search”.

    Is this a Nepomuk or strigi issue? Should I ask somewhere else for assistance?
    (I donated 10 euros, thanks for your work!)

      • Thanks for responding. I made some test .txt files, and I see lots of metadata for both indexed and unindexed files: Type, Size, Characters, Comment, Lines, Modified, Rating, Tags, Words. The only extra metadata for text files in the index is “Has hash” and “Created at”; are those the indication something is in Nepomuk? Is that documented somewhere?

        You’re right, my problem is symbolic links. I told Customize index folders… to index the actual directories under /mnt as well, and now it finds the files! lxr.kde.org reveals the source code
        !fileInfo.isSymLink() &&
        in indexscheduler.cpp (in *four* different copies of this file?), so it’s intentionally bailing on symlinks. This should be documented, but where? Userbase’s Nepomuk page? I’d like to help! This also seems like a bug: System Settings’ Customize index folders… should a) disallow checkmarks on symlinks; b) display symlinks in italics; c) have a tooltip “This is a symbolic link, and is not followed by the index. Instead select the target of the symbolic link if you want it to be indexed.”; d) not expand symlinks; instead, maybe clicking on a symlink could jump to the target of the symlink in the directory tree. I’m unclear if all this only applies to symlinked directories, or to symlinked files as well.

        Again, thanks! Is there a better mailing list, (e.g. strigi-user@lists.sourceforge.net ?) for this sort of Q&A and bug reporting? I have more comments and questions.
