Proposal for metadata interoperability on OS X

  • A lot of discussion seems to be going on in various application user
    forums about exchanging metadata between applications. Apple has
    provided parts of a possible solution in the latest versions of Mac
    OS X, but nothing that can be seen as the final verdict.

    First some background: with the advent of Spotlight, application
    developers make a lot of metadata available through Spotlight
    importers. It is relatively easy to extract this metadata from your
    own data files and hand it over to Spotlight. Apple has also put a
    lot of work into providing a long list of keywords that these
    importers can use to store this data in the Spotlight database.

    There are, however, problems for applications that primarily use
    third-party file formats (PDF, for example) and want to add
    application-specific metadata to these files. In most cases you
    cannot add it without some cumbersome workarounds.

    Another problem arises for applications that want to import metadata
    from third parties when the file format is difficult to parse.

    One mechanism that is currently used by some applications to resolve
    this is the use of Finder/Spotlight comments. Elaborate formatting
    options try to make some order out of what is essentially a free-form
    text string. Moreover, this text string is under user control and can
    be changed by her at any point in time, thereby destroying potentially
    vital data.

    What is needed is a mechanism that allows application developers to
    add metadata to files without having to touch the actual file data.
    In the classic Mac OS days resource forks were used for this purpose,
    but these caused problems with foreign filesystems. In OS X (since
    Tiger) there is a useful mechanism called extended attributes (xattr)
    that allows metadata to be tacked onto files. And since Leopard
    there is a way to preserve these while using the Cocoa or Unix file
    manipulation classes and functions, even when storing files on
    non-HFS disks.

    What is missing is a standardized way to set and interpret the
    metadata. What I'm proposing is to take what works in the Spotlight
    indexing mechanism, i.e. a dictionary of standard keywords with
    arbitrary values, and use it on top of the extended attributes. This
    would allow for transparent transfer of metadata between applications,
    yet retain the use of Spotlight-based keyword searching. This would
    even work with extra keywords that might have been defined for
    Spotlight, because the file type and application are known, so these
    could be loaded from the application's bundle dictionary.

    An example Objective-C class that implements part of this is Uli
    Kusterer's UKXattrMetadataStore class, which can be found at
    <http://codebeach.org/code/show/15>. Missing from it are some error
    checking with regard to the limitations of xattr and a way to map
    keywords to localized descriptions (these can be found in
    third-party Spotlight schemas but seem to be hidden for the Apple
    keywords).

    I'm looking to start a discussion on this list that can be of benefit
    to all of us and hopefully Apple will take notice and may take our
    ideas to heart while they're working on 10.6.

    Annard Brouwer
    (contractor for DEVONtechnologies LLC and therefore very much involved
    in this subject at the moment)
  • On Jul 8, 2008, at 3:52 AM, <ab_lists...> wrote:

    > I'm looking to start a discussion on this list that can be of
    > benefit to all of us and hopefully Apple will take notice and may
    > take our ideas to heart while they're working on 10.6.

    While the discussion would be interesting, I'll note two things.

    One, the Spotlight list is probably more relevant than this one.

    Two, the list isn't the way to provide feedback like this to Apple. It
    says so right in the guidelines. You can almost guarantee that it
    won't be picked up from the list. Filing enhancement requests is the
    only way to bring this stuff to Apple's attention.
  • I'm following up on my own post because I received two comments that
    made me realise I should clarify why I want this discussion here on
    this list.

    1. "You should file a bug report with Apple."

    This is the intention (and I know that at least one person has filed a
    request already), but I would like to have a hashed-out plan that
    covers the things we discuss here so that few ambiguities are left.
    Secondly, we at DEVONtechnologies need a working solution sooner
    rather than later. I was hoping that several third parties might be
    able to agree on something that we can work with in the Leopard time
    frame.

    2. "You should discuss this on the Spotlight list."

    I thought about it, but for me this isn't directly about Spotlight. I
    want to use some of the infrastructure that Spotlight brings but in a
    way it's about application interoperability. I want to access metadata
    from other applications' files and provide metadata to them and not
    through Spotlight queries. This may be too self-centered but we can
    always move to the Spotlight list if people here disagree with me.

    Thanks,
    Annard
  • There exists a great way for all applications to share 'Address Book'
    data on OS X. The same cannot be said for user-entered metadata.
    Things such as tags, URLs, etc. have no conventions for
    interoperability.

    Right now the only 'standard' user-entered metadata are the Finder
    comments and the Finder labels, both of which date from 199x in System
    7 or thereabouts.

    It seems that a fairly straightforward agreement on some extended
    attribute keys and values would be in order.

    An example seems a better way to describe what I am talking about
    (all names and numbers are fanciful):

    User-entered tags:
    --------------------------
    stored: under the keyword kXATTR_UserTags
    value: NSArray of NSStrings. No hierarchy, maximum tag length 100
    chars, maximum number of tags 100. Guidelines: tags should be entered
    by the user, and not be things like GUIDs, paths, etc.

    URLs:
    --------------------------
    stored: under the keyword kXATTR_URL
    value: NSArray of NSDictionaries; each dict has a url field, a name
    field, etc.
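
    The two attribute shapes in the example above can be sketched in a
    few lines of Python. This is only an illustration: the key names and
    limits are the fanciful ones from this post, and the helper names
    validate_tags and validate_urls are hypothetical, not part of any
    existing API.

```python
# Hypothetical attribute names from the example above; the limits are the
# "fanciful" numbers from the post, not an agreed standard.
USER_TAGS_KEY = "kXATTR_UserTags"
URLS_KEY = "kXATTR_URL"
MAX_TAG_LENGTH = 100
MAX_TAG_COUNT = 100


def validate_tags(tags):
    """Return a list of problems found in a proposed tag list."""
    problems = []
    if len(tags) > MAX_TAG_COUNT:
        problems.append(f"too many tags: {len(tags)} > {MAX_TAG_COUNT}")
    for tag in tags:
        if not isinstance(tag, str):
            problems.append(f"tag is not a string: {tag!r}")
        elif len(tag) > MAX_TAG_LENGTH:
            problems.append(f"tag too long ({len(tag)} chars): {tag[:20]}...")
    return problems


def validate_urls(entries):
    """Check that each entry is a dict with string 'url' and 'name' fields."""
    problems = []
    for i, entry in enumerate(entries):
        for field in ("url", "name"):
            if not isinstance(entry.get(field), str):
                problems.append(f"entry {i} is missing a string '{field}' field")
    return problems


# One file's metadata, keyed by attribute name.
metadata = {
    USER_TAGS_KEY: ["invoice", "2008", "tax"],
    URLS_KEY: [{"url": "http://example.com/", "name": "Example"}],
}
print(validate_tags(metadata[USER_TAGS_KEY]))  # prints []
print(validate_urls(metadata[URLS_KEY]))       # prints []
```

    Returning a list of problems rather than raising lets a caller decide
    whether to reject the data or repair it.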

    It would be really nice if some of these attributes made their way
    into the Spotlight database, to facilitate searching.

    This really has nothing to do with Apple, unless they are planning on
    adding some standardized extended attributes along these lines for
    Snow Leopard. Even if Apple does add a few 'official' xattrs on
    files, it seems likely that several interested developers will need
    to agree on a richer set of attributes. We need more than an API to
    set xattrs; we need to agree on names and formats.

    --Tom Andersen
    Ironic Software

  • On Tue, Jul 8, 2008 at 10:45 AM, Tom Andersen <knobsturner...> wrote:
    > User Entered tags:
    > --------------------------
    > stored: under the keyword: kXATTR_UserTags
    > value: NSArray of NSStrings. No hierarchy, maximum tag length 100  chars,
    > maximum number of tags 100, Guidelines: tags should be entered by the user,
    > and not be things like GUIDs, paths, etc.

    You want all of this in one extended attribute? Even assuming no
    overhead for NSArray or NSStrings, wouldn't 100 chars/tag * 100 tags
    already be about 2-3 times the maximum amount of storage currently
    allowed in one generic extended attribute? As I understand the maximum
    is roughly 4K, with the ResourceFork being a special case clearly.
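
    The arithmetic behind that estimate, as a small Python sketch. The
    4096-byte limit is an assumption taken from the "roughly 4K" figure
    in this message, and the calculation is the best case: one byte per
    character and no container overhead.

```python
# Assumption: roughly 4 KB per attribute, the figure quoted in the thread.
ASSUMED_XATTR_LIMIT = 4096

MAX_TAGS = 100
MAX_TAG_CHARS = 100

# Best case: one byte per character, no separators or container overhead.
raw_bytes = MAX_TAGS * MAX_TAG_CHARS
ratio = raw_bytes / ASSUMED_XATTR_LIMIT

# How many maximum-length tags would fit under the assumed limit.
tags_that_fit = ASSUMED_XATTR_LIMIT // MAX_TAG_CHARS

print(raw_bytes)        # 10000
print(round(ratio, 2))  # 2.44
print(tags_that_fit)    # 40
```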
  • As I said, all numbers are fanciful. I just wanted to get a discussion
    started.

    From our point of view I would prefer a much, much smaller limit.
    There need to be limits so that we can design an interface around
    these attributes, and apparently also for technical reasons related
    to the size of an attribute. I think that the way to go about this is
    to talk this back and forth here for a bit, then have someone go and
    write some code to implement the guidelines and turn them into a nice
    Cocoa API. For instance, one thing that I run into with tags is that
    some programs or users enter (say 6) tags as a comma-delimited string,
    and this gets stored as a single string in an array of length one.
    It's not impossible to catch this sort of thing and deal with it, but
    if it were all in one open-source class, it would be a lot easier and
    more consistent.
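
    A minimal Python sketch of the kind of clean-up such a shared class
    might do for the comma-delimited case. The function name
    normalize_tags is hypothetical. It assumes that a comma never
    legitimately appears inside a tag and that tags differing only in
    case are duplicates; both are choices a real guideline would have to
    settle.

```python
def normalize_tags(raw_tags):
    """Split comma-delimited entries, trim whitespace, drop empties and duplicates."""
    seen = set()
    result = []
    for entry in raw_tags:
        # An entry such as "a, b,c" is treated as three separate tags.
        for tag in entry.split(","):
            tag = tag.strip()
            key = tag.casefold()  # case-insensitive duplicate check (assumption)
            if tag and key not in seen:
                seen.add(key)
                result.append(tag)
    return result


print(normalize_tags(["work, urgent,todo", "Work"]))
# prints ['work', 'urgent', 'todo']
```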

    Suggested Fields
    ---------------
    Tags
    Authors
    URLs
    Character encoding for text files
    Checksum

    + application defined?

    It seems like a small, well-defined set of attributes would be
    better than a larger set, at least as a first step.

    Where are the maximum xattr lengths documented? I read on a wiki page
    that HFS+ limits xattrs to 'one b-tree node', which I take to be 4 KB.
    http://en.wikipedia.org/wiki/Extended_file_attributes

    --Tom

    On 8-Jul-08, at 10:59 AM, Mac QA wrote:
    > On Tue, Jul 8, 2008 at 10:45 AM, Tom Andersen <knobsturner...>
    > wrote:
    >> User Entered tags:
    >> --------------------------
    >> stored: under the keyword: kXATTR_UserTags
    >> value: NSArray of NSStrings. No hierarchy, maximum tag length 100
    >> chars,
    >> maximum number of tags 100, Guidelines: tags should be entered by
    >> the user,
    >> and not be things like GUIDs, paths, etc.
    >
    > You want all of this in one extended attribute? Even assuming no
    > overhead for NSArray or NSStrings, wouldn't 100 chars/tag * 100 tags
    > already be about 2-3 times the maximum amount of storage currently
    > allowed in one generic extended attribute? As I understand the maximum
    > is roughly 4K, with the ResourceFork being a special case clearly.
  • On Tue, Jul 8, 2008 at 4:20 PM, Tom Andersen <knobsturner...> wrote:

    > I think that
    > the way to go about this is to talk this back and forth here for a bit, then
    > have someone go and write some code to implement the guidelines and turn
    > them into a nice cocoa api. For instance, one thing that I run into with
    > tags is that some programs/users enter (say 6) tags as a comma delimited
    > string, and this gets stored as a single string in an array of length one.
    > It's not impossible to catch this sort of thing and deal with it, but if it
    > were all in one open sourced class, it would be a lot easier and more
    > consistent.
    >
    > Suggested Fields
    > ---------------
    > Tags

    It's worth noting that there is already something of a de facto tag
    structure in SpotMeta
    (http://www.fluffy.co.uk/spotmeta/spotmeta_org.html).

    SpotMeta hasn't been updated for Leopard (the COM swizzling stuff has
    stopped working) but it's under the GPL.

    Hamish
  • On 08 Jul 2008, at 17:43, Hamish Allan wrote:
    >
    > It's worth noting that there is already something of a de facto tag
    > structure in SpotMeta
    > (http://www.fluffy.co.uk/spotmeta/spotmeta_org.html).
    >
    > SpotMeta hasn't been updated for Leopard (the COM swizzling stuff has
    > stopped working) but it's under the GPL.
    >

    I don't like encoding datatypes in keywords (it reminds me of very
    heated discussions with certain DB admins in my WebObjects days). I
    was thinking of a set of defined keywords and using a plist
    (traditional or XML) to encode the values. That's a lot more
    flexible. And if you use a schema, you will also have localised names
    for the keywords. That can be handy if you need to populate a user
    interface that displays attributes that are out of your control.

    Annard
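
    The plist idea can be sketched with Python's standard plistlib
    module. This only shows the encoding step, not the xattr call itself.
    The names encode_value and decode_value are hypothetical, and the
    4096-byte limit is the unverified "roughly 4K" figure mentioned
    earlier in the thread.

```python
import plistlib

# Assumption: the "roughly 4K" per-attribute limit mentioned in the thread.
ASSUMED_XATTR_LIMIT = 4096


def encode_value(value, fmt=plistlib.FMT_XML):
    """Serialize a value as a plist and refuse anything over the assumed limit."""
    data = plistlib.dumps(value, fmt=fmt)
    if len(data) > ASSUMED_XATTR_LIMIT:
        raise ValueError(f"encoded value is {len(data)} bytes, over the assumed limit")
    return data


def decode_value(data):
    """Parse a plist blob back into a value (the format is auto-detected)."""
    return plistlib.loads(data)


# The value keeps its type (here a list of strings) without any type
# information being encoded in the keyword itself.
authors = ["Annard Brouwer", "Tom Andersen"]
xml_blob = encode_value(authors, plistlib.FMT_XML)
binary_blob = encode_value(authors, plistlib.FMT_BINARY)

print(decode_value(xml_blob) == authors)     # True
print(decode_value(binary_blob) == authors)  # True
print(len(xml_blob) > len(binary_blob))      # True
```

    With a tight size limit the binary format leaves more room for data,
    while the XML format is easier to inspect by hand.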