Portable Meta-Information Yet Again (Only this time there is code!)


There has been quite some discussion about portable meta data lately. David Nolden blogged about the usage of sidecar files, Jos van den Oever answered, and many, many people commented. So I felt compelled to give my 2 cents as well. The only difference is: I wrote some code. To be precise, I wrote a Nepomuk service.

But first things first: what is my opinion on the matter? Obviously I think sidecar files are no solution. For one, they only provide a means to store simple key/value pairs (which would throw us back into the meta data stone age) and no way to link between files. Secondly, Nepomuk aims to store way more than just file meta data. Nepomuk also stores meta data about emails, about persons, about projects; it even stores resources that do not exist anywhere else (yes, the idea is that you create entities in Nepomuk and only there; an example would be the city Berlin, which has a link to its Wikipedia page). Last but not least, we want to query the vast graph of meta data, which is impossible using sidecar files (but Jos and others already mentioned that).

Thus, we need a different solution. We need to maintain the central Nepomuk store but it alone does not seem to solve the problem. So I came up with an idea (only to be pointed to the Metadata on Removable Devices on gnome.live idea a few hours later. Looks very similar indeed).

Basically instead of having sidecar files all around, each removable storage has one meta data file (stored in some obscure location like .cache/metadata/nepomuk.turtle). This file contains all meta data related to the files stored on the device (excluding the information which can be extracted by strigi) and the very basic information about resources the files are related to (if for example a file is related to a project we only store the project’s type and its label). File paths are saved relative to the storage root. Example:

<file:/Pictures/IMG_0012.jpg>
    a nfo:FileDataObject ;
    nfo:fileUrl <file:/Pictures/IMG_0012.jpg> ;
    nao:hasTag <nepomuk:/tags/Summer08> ;
    nao:relatedTo <nepomuk:/KDE> .

<nepomuk:/KDE>
    a pimo:Project ;
    nao:prefLabel "KDE" .

<nepomuk:/tags/Summer08>
    a nao:Tag ;
    nao:prefLabel "Summer 08" .

Once the device is mounted, this information is imported into the local Nepomuk store (using a temporary graph), and relative URLs are replaced with absolute ones according to the mount point. This makes it possible to search these files like any other local files. If the meta data changes, it needs to be written back to the cache file, with absolute URLs converted back to relative ones. This way the data can simply be reused on any Nepomuk-enabled system.
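
The URL rewriting described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code of the actual service: the function names and the mount point /media/usbstick are made up, and a real implementation would operate on the statements in the Nepomuk store rather than on plain strings.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, unquote, urlsplit


def to_absolute(relative_url: str, mount_point: str) -> str:
    """Turn a storage-relative file URL into an absolute one below mount_point."""
    rel_path = unquote(urlsplit(relative_url).path).lstrip("/")
    abs_path = PurePosixPath(mount_point) / rel_path
    return "file://" + quote(str(abs_path))


def to_relative(absolute_url: str, mount_point: str) -> str:
    """Inverse of to_absolute; raises ValueError if the file is not on the device."""
    abs_path = PurePosixPath(unquote(urlsplit(absolute_url).path))
    rel_path = abs_path.relative_to(mount_point)
    return "file:/" + quote(str(rel_path))


# Hypothetical mount point, used for illustration only.
mounted = to_absolute("file:/Pictures/IMG_0012.jpg", "/media/usbstick")
stored = to_relative(mounted, "/media/usbstick")
```

On import every file URL from the cache file would go through something like to_absolute, and on write-back through to_relative, so the same cache file stays valid wherever the device is mounted.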

Now the service I implemented (which to date can be found in the Nepomuk playground) does the importing automatically. However, writing the data back has to be triggered manually. Integration with KIO would be necessary here.

IMHO this is already a nice start. However, a few things are still not solved:

  • As mentioned, data is not written back automatically. KIO should somehow trigger that.
  • libnepomuk is not aware of removable devices yet, and thus the data will be written twice: once in the local store and once in the cache file. I see two solutions: 1. delete all traces of the meta data locally and only keep the cache file, or 2. use relative URLs locally, too, and link the file resources to some volume resource that describes the removable storage. The latter solution would also make it possible to find files that are on storages not currently mounted.
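
To make the second solution a bit more concrete, here is a rough Python sketch of the idea. Everything in it is an assumption for illustration: the FileResource class, the volume identifier "usb-1234" and the mounted_volumes mapping do not exist in libnepomuk.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileResource:
    volume_id: str      # hypothetical identifier of the removable storage
    relative_path: str  # path relative to the storage root


def resolve(resource: FileResource, mounted_volumes: Dict[str, str]) -> Optional[str]:
    """Return the absolute path if the volume is mounted, otherwise None."""
    mount_point = mounted_volumes.get(resource.volume_id)
    if mount_point is None:
        return None  # the file is known, but its storage is not plugged in
    return mount_point.rstrip("/") + "/" + resource.relative_path.lstrip("/")


picture = FileResource("usb-1234", "Pictures/IMG_0012.jpg")
online = resolve(picture, {"usb-1234": "/media/usbstick"})
offline = resolve(picture, {})
```

The point of the design is that the resource stays in the store (and in search results) even when resolve returns None; only opening the file requires the volume to be mounted.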

Well, that is it for now I think. The service is there, it works and at least gives an idea of a solution. If anyone is up to the task to perfect it, please step up. :)

27 thoughts on “Portable Meta-Information Yet Again (Only this time there is code!)”

  1. Hey, great that you also care about this issue. This is a nice first step. But this approach still has the problem that it cannot deal with copied/renamed directories/files when the application doing the copying or renaming does not explicitly care about the metadata.

    Couldn’t the sidecar files be in .rdf format, and thus allow storing anything in it, and that information would be collected by strigi?

  2. This is very, very cool. Could it be used to tag a remote/external file as being the backup of a local one and trigger an update as needed? A nepomuk-based backup system sounds like THE killer feature.

  3. (I know we already discussed this general issue, but still)

    Even that only works if the user uses only KIO (or its GNOME equivalent) for file operations.

    Nobody relies on KIO all the time. Maybe some people rely on it 99% of the time but even so, what if a “mv” operation suddenly breaks all the assumptions of the database?

    Your proposal will certainly help with other issues, but as long as that one isn’t addressed, I don’t think that many people will be able to rely too much on metadata that could be broken by a simple “mv”.

    So I really think that we need:
    1) either a whole new filesystem (Reiser had plans for that)
    2) or a meta-filesystem that wraps around any filesystem and adds metadata. Frankly I think that’s much better than 1 because people will want to use different actual filesystems.
    3) or, as you once suggested, a patch to glibc, assuming that glibc is the lowest common denominator of all file operations on files that may have metadata

    In any case I think that the real solution to this problem lies on a much much lower level than what KDE touches…

  4. How about filesystems with permissions? Would you locate the topmost ancestor directory of each file where the user has read/write permissions and write the metadata for that file in a store below that? Or how would you do it?

    • I did not think about that. My focus was on simple storages such as USB sticks and the like.
      I think removable storages with permissions are something for advanced users anyway. So they could be forced to make the cache path writable for now.

  5. I think the only real solution is to have this at the filesystem level, like resource forks or xattr or something. For this to really work, it has to be resilient under mv, etc.

  6. (1) semantic information, “nepomuk stuff”, isn’t just about file metadata but:

    (2) a lot of it is. or is at some point tied to files

    (3) nepomuk and KDE are not going to break their support of a whole lot of filesystems so any solution needs to work with everything

    (4) injecting functionality into or wrapping filesystems to influence the behaviour of applications outside the scope of NEPOMUK/KDE is not practical.

    (5) Embedding extra information directly into files in a generic way is not practical.

    (6) Broken databases are evil. If there is any way to help NEPOMUK salvage as much information as possible when it finds itself in an inconsistent state, it should be considered.

    (7) Broken databases are evil. simple things like “mv” shouldn’t break shit.

    Using a UUID for files held in the extended attributes may help with the robustness problem of a central database that is not all-seeing: If a file is not found where it is expected then the filesystem can be scanned for that UUID. If a second file with a registered UUID is found it can be given a new UUID and treated appropriately as a copy/derivative of the first (for example it might make sense to inherit the attributes of the first if they do not fundamentally relate to a unique file).

    Extended info is (pretty much) universally supported in hard-disk filesystems. There are places where it isn’t supported and this would need to be implemented as a “where available” feature.

    This does not really overlap what you talk about above but rather, complements it.

    I don’t know whether there is any good platform independent (in “platform” I include filesystem) way of asking whether extended attributes are available and querying/manipulating them…

    Just a thought from a non-expert.

    • You are right. The only real “solution” I see at the moment is to use as many “hacks” as possible to track the information. This includes xattr and maybe checksums and the like.

  7. oh…yes (aside from the fact that my suggestion is probably fundamentally flawed), there are still gaps that are not covered by metadata store on removable devices + UUID in extended info but together I think they cover 95% of the weak points of the current central database model.

    maybe ;¬/

  8. @Benoit
    that shouldn’t be a problem if you are monitoring the directories in question with [i|d]notify, the issue is really unmounting from the command line, I guess we might need a new kernel call, that says “hey I’m going away” instead of file/directory changed.

  9. @maninalift you don’t address the issue of finding the moved file. what happens when I do a command line “mv” then a nepomuk search that returns that file in the result set? if I double click on it, the place it linked to(in the fs) is gone.

  10. > The latter solution would also allow to find files that are on storages not currently mounted.

    This is something I would love to see. For awhile now I’ve wanted to be able to browse removable storage as though it were local… it would make it so much easier to find the data I have all over the place.

  11. I think you guys will never come to an agreement because all of you are right from different points of view. I would like to see Nepomuk supporting all possible ways to store metadata at the same time, like plugins, and time would show which one (or which ones) is more viable. Actually it already has at least two metadata sources – strigi and user-entered tags – so why not add the others as plugins (metadata-based filesystem, filesystem wrapper, sidecar files)?

  12. maninalift: Extended attributes are not supported on NFS mounts on many distributions. That is the huge weak point there from what I can see.

    Personally I’d love for UUID and a large numeric version-id (as opposed to just using the mtime>other.mtime test) to be tracked by the kernel filesystem itself, but that ain’t gunna happen.

    In libferris I have a few “smushing” commands which allow me to do this sort of reconnection thing. If mv or cp are used to move the data itself, a quick run of smush on the destination relinks the metadata based on user selected heuristics. Such heuristics could be EA-UUID or fall back to matching relative paths so a tree that is copied gets redetected.

    Note that smushing can also be done automatically. But trying to do it automatically and writing a detached metadata file for removable media in a generic way is “tricky”.

  13. Why not store multiple pointers to the file?
    – inode
    – uuid in extended attributes
    – data hash

    This increases the chance that nepomuk can relocate the file. This problem might not be 100% solvable. If I copy files from NTFS to ext3 I also lose information. Bummer. But it would at least work in 90% of the cases already.
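
A minimal Python sketch of the "multiple pointers" idea, under stated assumptions: the attribute name user.nepomuk.uuid and all function names are invented for illustration, os.getxattr is only available on Linux, and inode numbers are only meaningful within one filesystem.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Iterable, Optional

XATTR_NAME = "user.nepomuk.uuid"  # hypothetical attribute name


@dataclass(frozen=True)
class Fingerprint:
    inode: int
    uuid: Optional[str]
    sha256: str


def read_uuid(path: str) -> Optional[str]:
    """Read the UUID from the extended attributes, or None if unavailable."""
    try:
        return os.getxattr(path, XATTR_NAME).decode()
    except (OSError, AttributeError):
        # No such attribute, xattrs unsupported, or a non-Linux platform.
        return None


def fingerprint(path: str) -> Fingerprint:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return Fingerprint(os.stat(path).st_ino, read_uuid(path), digest)


def relocate(known: Fingerprint, candidates: Iterable[str]) -> Optional[str]:
    """Try UUID first, then inode, then content hash; None if nothing matches."""
    prints = [(path, fingerprint(path)) for path in candidates]
    for path, fp in prints:
        if known.uuid is not None and fp.uuid == known.uuid:
            return path
    for path, fp in prints:
        if fp.inode == known.inode:
            return path
    for path, fp in prints:
        if fp.sha256 == known.sha256:
            return path
    return None
```

A plain "mv" on the same filesystem keeps the inode, a copy keeps the content hash, and the UUID survives wherever extended attributes are preserved, so each pointer covers a case the others miss.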

  14. @Ben Martin “Such heuristics could be EA-UUID or fall back to matching relative paths so a tree that is copied gets redetected.”

    This is just the sort of thing I was thinking… though I didn’t go into further heuristics it was at the back of my mind. It is not perfect but it is a lot better than ignoring the issue and more practical than anything else.

    NFS: perhaps if the behaviour of metadata on NFS starts to be out of step with local disks it will add some pressure to sort out extended attributes.

    Never heard of libferris before… having a look…

  15. p.s. @Ben M , @Trueg

    I’m interested, do you two communicate / cooperate much?

    libferris NEPOMUK = profit

    ;¬/

  16. my angle-bracket-arrows were treated as HTML tags and stripped from my comment, it should have read

    “libferris –(friends with)– NEPOMUK = profit”

  17. What is wrong with extended attributes or sidecar files in the case that eattrs aren’t supported? (this is 2009 not 2003, every filesystem supports them except for vfat).

  19. It seems a lot of the problems here is about how to make sure that the Nepomuk db is really in sync with all the files listed, both on removable drives and the local one. Apparently this just can’t be solved on the KDE level. (Even on the kernel level this could be tricky, cf. NFS drives over unstable connections).

    How about just giving up the idea that the db is perfect, and store data about removable drives on the main local hard drive? Of course, then when a drive was replugged, the nepomuk service would have to scan it, and see if it matched something already in the db. In most cases this would work quickly and reliably by comparing type, size, directory structure, md5sums etc, and it would also work for indexing files on non-writable media. Besides, it wouldn’t clutter my (or other people’s) drives with all kinds of stuff. As an extra feature, you could allow search for files that weren’t mounted, and get the name of the drive they were on.

    A big question is how much reindexing is needed once a drive’s identity is (preliminarily) established. System load or bandwidth consumption could quickly become unpleasant. Of course, this would be a problem anyway, even with metadata stored on the thumbdrive, as my thumb drive (or whatever) would regularly be updated on other non-Nepomuk-aware computers. The decision would be greatly supported by knowing about the types, change patterns and access speeds of different removable drives. I don’t know if KDE already stores data like this. Otherwise this service would have to do its own timing of typical operations.

    For the benefit of programs using the service, it could return several levels of results: “Preliminary” (based on current db status) and “Checked” (read the file to see if it’s as expected). This would allow programs to choose between speed and accuracy, or possibly update the view as checked results come in. One could argue about whether to allow applications to request re-indexing of an entire directory tree, the API could be extended along the way I guess.
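
The two result levels could be sketched like this in Python. It is only an illustration under assumptions: the index is modelled as a plain dict from relative path to SHA-256 digest, and the function name query is made up.

```python
import hashlib
import os
from typing import Dict, Iterator, Tuple


def query(index: Dict[str, str], root: str) -> Iterator[Tuple[str, str]]:
    """Yield fast "preliminary" hits from the index, then "checked" ones.

    index maps a path relative to root to the expected SHA-256 hex digest.
    """
    # Level 1: answer straight from the database, without touching the drive.
    for rel_path in index:
        yield ("preliminary", rel_path)
    # Level 2: read each file and confirm that it is still as expected.
    for rel_path, expected in index.items():
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            continue
        with open(full_path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == expected:
                yield ("checked", rel_path)
```

Because it is a generator, a program that only wants speed can stop consuming after the preliminary results, while one that wants accuracy keeps reading and updates its view as checked results come in.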

    Of course security should be considered. This approach potentially keeps data on all files ever mounted on the system – when I unplug my drive I generally expect that the system no longer has access to very much information about it. I don’t feel like borrowing a friend’s thumb drive to transfer a few files, and then later accidentally discovering metadata about his “/thumbdrive/a/b/c/d/midgetporn” folder on my system. I just don’t want to know ;-) Of course whenever you plug a drive into a computer it will get access to all files, but this breaks the assumptions one would usually make about a responsible desktop system like KDE. One would also have to take care in the procedure for establishing the identity of an untrusted remote drive: Don’t query the existence of specific files or folders, as that could be used by the remote impostor to deduce that you had had access to such files. That should be easy to work around though.

    Either way, there would be a bit of leg work and experience needed in order to optimize the re-indexing heuristics, but with a bit of work I think this would result in a system that was (by design) fast and accurate 95% of the time, or (by run-time choice) either 99.8% fast or 99.8% accurate.

    Now please tell me why I’m wrong!

    • The answer is always the same: Nepomuk is not about files, they are just one use case. If you store metadata on the removable drives themselves you lose all the power of the RDF graph: the data is not linked anymore, there is no way of linking files with other information such as persons or projects and the like.
      So storing the data in some other place is not an option.
