Automatic for the People

My audiobook ended while I was driving into the office. With 30 more minutes of back roads to navigate, I opted to listen to the radio. I have a few different stations programmed into my radio, as my epic commute takes me in and out of the range of several. Clicking until I hit a live station, I was immediately intrigued by an accented voice discussing the history of the Nobel Prize. I glanced up to where, on another day, my satellite radio receiver would sit, to find out who was speaking. Alas, it was analog. So I had to wait until the end of the Democracy Now! program to learn that it was Peter Zander, curator of the Nobel Museum in Stockholm, Sweden.

Woe is analog. I want my metadata! I have grown spoiled by having the ability to answer (at least 10 times on the way to preschool) the question hailing from the backseat: "Mommy, what’s this song called?" When I want to know the name of a band, there’s no need to wait until I have access to a search engine; there it is, neatly scrolling by. Tune in to an interview midway? It’s easy to see who is talking to whom, so I can tell at a glance whether I might care.

With digital cable, myriad information about TV programming is a click or two away. With video on demand, content is neatly segmented by channel, movie type, theme, and other categories. With this very article, which I started and then returned to a few days later, I looked at the document’s properties to see when I began so I could efficiently search (narrowed by date) for the spelling of Zander’s name.

Metadata informs so much of what we take for granted in information access. Yet, with no dot-oh hyperbole attached to it, it suffers from a lack of chic. Metadata has been around almost as long as I have, emerging as MARC from a Library of Congress-led initiative. The Dublin Core Metadata Initiative originated in 1995, and the W3C published a specification for RDF’s data model and XML syntax as a recommendation in 1999. I know you just dozed off. This stuff is so not hot.

Luckily, a lot of it gets created without us having to think much about it because, despite its usefulness, metadata isn’t something most people want to think about. It is also lucky that there are some very smart people out there who do want to think about it, such as those at the Metadata Research Center (MRC) at the University of North Carolina–Chapel Hill and at the HyperMedia and DataBases Research Group of the Katholieke Universiteit Leuven in Belgium. Both study automatic metadata generation: The MRC primarily focuses on harvesting data for library categorization. The Leuven project is broader and certainly not the most current research on the topic. However, those involved clearly express the issues in play: "We cannot (solely) rely on humans for metadata creation: Humans don’t scale and humans are not perfect. More importantly, producing metadata is not exactly fun!"

I couldn’t agree more. The only thing less fun than making metadata may well be reading about it. That said, the most recent related research seems to revolve around the semantic web, which has a much higher hipness factor (as these things go). An extension of this concept that is also getting some press is semantic publishing.

The first of the two different approaches to semantic publishing involves producing information using semantic web languages such as RDF and OWL. The second approach requires that publishers use markup languages such as RDFa and microformats to embed formal metadata in documents. Oh no, not metadata.

I like the microformats concept because it repurposes existing XHTML and HTML tags to convey metadata. Though web content can already be "automatically processed," that processing hasn’t been particularly useful because traditional markup tags describe how information is structured and displayed, not what it means. Microformats bridge the gap by attaching semantics to the markup, which allows for the extraction of things such as events, contact information, and relationships.

As their proponents put it, "Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards." The approach is limited, but importantly, it doesn’t require everyone to change their behavior. Microformats represent an effort to make data easier to publish in a standardized way, so that it is optimized for indexing and searching and so that users can do things such as download a contact’s information or add an event to a calendar from a website. Thus, content that already exists in HTML formats can be enriched to take on semantic characteristics without a radical retrofit.
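To make the idea a little more concrete, here is a minimal sketch, in Python, of how a program might pull contact details out of ordinary HTML that has been marked up with hCard-style class names. The class names (vcard, fn, org, email, tel) come from the hCard microformat; everything else, including the sample markup, the person in it, and the names HCardParser and extract_hcards, is hypothetical and invented for illustration. It reads only the visible text of each field, assumes cards are not nested, and is nowhere near a complete microformats parser.

```python
from html.parser import HTMLParser

# Hypothetical sample: plain HTML whose class attributes follow hCard naming.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Press</span>,
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>
</div>
"""

# A small subset of hCard property names, chosen for this sketch.
HCARD_FIELDS = {"fn", "org", "email", "tel"}
# Void elements never get a closing tag, so they are ignored when tracking depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardParser(HTMLParser):
    """Collects the text of hCard fields found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.cards = []      # finished cards, one dict per vcard element
        self._card = None    # the card currently being built, if any
        self._depth = 0      # open-tag depth inside the current vcard
        self._field = None   # the field whose text is being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self._card is None:
            if "vcard" in classes:
                self._card = {}
                self._depth = 1
            return
        self._depth += 1
        matched = classes & HCARD_FIELDS
        if matched:
            self._field = sorted(matched)[0]
            self._card.setdefault(self._field, "")

    def handle_data(self, data):
        if self._card is not None and self._field:
            self._card[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._card is None:
            return
        self._field = None
        self._depth -= 1
        if self._depth == 0:  # the vcard element itself just closed
            self.cards.append(self._card)
            self._card = None


def extract_hcards(html):
    """Return a list of dicts, one per vcard element found in the HTML."""
    parser = HCardParser()
    parser.feed(html)
    return parser.cards


if __name__ == "__main__":
    print(extract_hcards(SAMPLE_HTML))
```

The point of the sketch is that the page itself did not have to change shape: the same div, span, and a tags a browser already renders carry the extra meaning in their class attributes, and a program that knows the naming convention can read them back out.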

Users increasingly expect to reap all the benefits of metadata and the sorts of informed and interactive experiences it enables. However, having tried to go back through 4 years of digital pictures and tag them, I know that the labor-intensive approach is likely to be undertaken only by rare metadata devotees or by those with the most to gain, entertainment providers and STM publishers being two prime examples. For the rest of us, automatic metadata or at least metadata-made-easy may be the best route to provide some of the benefits we all want to enjoy.