To Metadata or Not To Metadata

Page 3 of 3

Information Infrastructure
Much of the discussion about metadata has really been about one metadata field: keywords. This is probably because keywords are perhaps the most difficult metadata field to implement, or at least to implement well enough to get the results people expect. For example, when selecting keywords for a document, should you select words that are frequently found in the document, terms that are unique to the document, or terms that try to express the "aboutness" of the document? There are arguments for all three, but none is completely compelling, nor does any uniformly produce good results. What is worse, the choice is usually left to the individual author, which means an unorganized mix of answers and results.

So does this mean that keywords are of no real value? No, it means that they have to be approached from an infrastructural perspective. And the essential first step to producing good keywords is to develop a controlled vocabulary or, even better, a taxonomy-based set of controlled vocabularies with which to populate the keyword field. As we have seen, asking authors to create good keywords simply does not work very well, but asking them to select the right keywords from a predetermined list is much easier and produces better and more consistent results.
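
To make the "select from a predetermined list" idea concrete, here is a minimal Python sketch. The vocabulary, its synonyms, and the function name are all hypothetical illustrations, not taken from any particular product; a real controlled vocabulary would normally be maintained outside the code.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
# These terms are illustrative assumptions only.
CONTROLLED_VOCABULARY = {
    "metadata": {"meta data", "meta-data"},
    "taxonomy": {"taxonomies", "classification scheme"},
    "search": {"searching", "information retrieval"},
}


def normalize_keywords(proposed):
    """Map author-proposed keywords onto preferred vocabulary terms.

    Returns (accepted, rejected): accepted is a sorted list of preferred
    terms; rejected lists the proposals that match nothing in the vocabulary.
    """
    # Build a lookup from every preferred term and variant to its preferred term.
    lookup = {}
    for preferred, variants in CONTROLLED_VOCABULARY.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant] = preferred

    accepted, rejected = set(), []
    for keyword in proposed:
        key = keyword.strip().lower()
        if key in lookup:
            accepted.add(lookup[key])
        else:
            rejected.append(keyword)
    return sorted(accepted), rejected


accepted, rejected = normalize_keywords(["Meta-Data", "taxonomies", "widgets"])
print(accepted)  # preferred terms that would populate the keyword field
print(rejected)  # proposals to send back to the author or vocabulary owner
```

The point of the design is that authors never write into the keyword field directly: whatever they propose is either mapped to a preferred term or flagged for review.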

In addition to doing keywords better, it is important to realize that there is more to metadata than just keywords. Other metadata fields can often produce high value and can be much easier and cheaper to produce. It is important to focus on achieving value from all fields, such as titles, descriptions, publisher, author, and the like. And often even more valuable are fields like audience and DocumentObjectType (with values like an FAQ document, a policy document, and so on).

While software that claims to solve all your metadata needs is still illusory, there are a number of products, like Entopia's K-Bus, that can generate a great deal of very useful metadata and thus reduce the overall cost of metadata projects, as well as support the development of sophisticated search applications.

Infrastructure Context
Treating metadata as a fundamental component of an organization's intellectual infrastructure, rather than simply as keywords used to influence relevance ranking, supports a wide range of valuable applications that go beyond simple search and, at the same time, enhances the search experience itself.

One such application was the faceted metadata display presented at the DCMI Workshop by Marti Hearst, associate professor at SIMS (School of Information Management and Systems at Berkeley). In this application—Flamenco—search results are mapped to a large number of facets, which basically function as a well-structured set of advanced, tightly defined searches. This allows the user to select likely areas from which to browse to the document they seek. The well-defined facets like Products, Geography, Health Effects, and Document Characteristics work much better at limiting the results in meaningful ways than the usual mixed, broad categories you find in browse applications.
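
As a rough illustration of how results can be mapped to facets, here is a minimal Python sketch. It is not Flamenco's implementation: the sample documents, their facet values, and the function names are assumptions made up for this example, and only the facet names Products and Geography come from the description above.

```python
from collections import Counter

# Hypothetical search results, each already tagged with facet metadata.
RESULTS = [
    {"title": "Doc A", "Products": "Solvents", "Geography": "Europe"},
    {"title": "Doc B", "Products": "Solvents", "Geography": "Asia"},
    {"title": "Doc C", "Products": "Adhesives", "Geography": "Europe"},
]
FACETS = ("Products", "Geography")


def facet_counts(results, facets=FACETS):
    """Count how many results fall under each value of each facet."""
    counts = {facet: Counter() for facet in facets}
    for doc in results:
        for facet in facets:
            value = doc.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts


def refine(results, **selected):
    """Keep only the results that match every selected facet value."""
    return [
        doc for doc in results
        if all(doc.get(facet) == value for facet, value in selected.items())
    ]


for facet, counter in facet_counts(RESULTS).items():
    print(facet, dict(counter))
print([doc["title"] for doc in refine(RESULTS, Products="Solvents")])
```

Each count shown next to a facet value is, in effect, a precomputed advanced search that the user can select instead of having to compose it.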

Research at SIMS with the Flamenco Search Interface Project has shown that even though the display is complex, users find it quite easy to master. We all know that advanced search using metadata fields doesn't work, for the simple reason that users won't do it. Advanced searching is an advanced skill that most users (outside of the library) don't have. But as Igor Perisic, chief scientist of Entopia, put it, "give them a simple, empty search box and then add structure to the results," which is what Entopia's K-Bus does. Just as selecting keywords from a list works better than making up your own keywords, so selecting from the results of multiple advanced searches works better than making up your own.

What If I Can't Get There From Here?
Rosenfeld pointed out that in his experience, not many organizations are willing to commit to such a huge undertaking as developing a corporate taxonomy, an enterprise-wide metadata standard, and associated controlled vocabularies, and then implementing that standard in 100,000 documents or more; integrating that metadata into an entire range of projects and technologies like search, content management, portals, and the like; and, at the same time, creating a complete metadata or knowledge architecture team to manage the whole thing.

As Rosenfeld so aptly expressed it, "It is a worthy pursuit, but we can start with other easier, low-hanging fruit, before taking on the huge honking thing like an enterprise thesaurus."

So how should one proceed on a practical level?

In my experience (and Rosenfeld agrees), the best results start with creating the overall infrastructure vision, including metadata standards. While actually implementing this vision can be expensive (though likely not as expensive as not doing it), creating the vision itself is a relatively small project. What the strategic vision provides, however, is the right context within which to implement and justify any piecemeal or smaller projects, so that you avoid reinventing the wheel for each information project and can use each project as a foundation for the next.

The next essential step is to create a team. It need not be large, nor must it be full-time and dedicated. It can be a virtual team made up of members from the library, IT, business partners, and so on. What is essential, however, is that the team has some sort of official recognition, including incorporating members' central team functions into their job descriptions and reward structure. Another early step could be a content management initiative, undertaken before the initial reaction to your new portal project changes from rave reviews to user complaints of still not being able to find anything.

As for metadata itself, I would recommend that you not start with keywords if you don't have the resources to develop them with controlled vocabularies. Instead, focus on getting value from other metadata fields. Another option is to buy and customize an existing taxonomy and/or vocabulary. Finally, don't focus on trying to tweak relevancy rankings with keywords; instead, try approaches such as best bet metadata (where someone designates a document as the most likely target when a user types in a particular search term), browse or dynamic classification, or faceted metadata interfaces.
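
Best bet metadata is simple enough to sketch in a few lines of Python. The table entry, the document path, and the function name below are hypothetical, and the ranked list stands in for whatever your search engine returns.

```python
# Hypothetical best-bet table: normalized search term -> designated document.
BEST_BETS = {
    "vacation policy": "hr/policies/vacation.html",
}


def search_with_best_bets(query, ranked_results, best_bets=BEST_BETS):
    """Put the designated best-bet document, if any, ahead of the ranked results."""
    # Normalize case and whitespace so "Vacation  Policy" matches the table.
    key = " ".join(query.lower().split())
    best = best_bets.get(key)
    if best is None:
        return list(ranked_results)
    # Promote the best bet and drop its duplicate from the ranked list.
    return [best] + [doc for doc in ranked_results if doc != best]


engine_results = ["intranet/news.html", "hr/policies/vacation.html"]
print(search_with_best_bets("Vacation  Policy", engine_results))
```

Note that this leaves the relevance ranking itself untouched; a person's judgment about the most likely target is simply layered on top of it.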

There are many other tips and techniques for implementing a full-scale, enterprise-wide infrastructure solution to metadata, but covering them would take too long, and they will vary from organization to organization. So let me sum up the approach with a slogan and a question: Think Big, Start Small, Scale Fast.

You wouldn't think of running a company without organizing your employees, so why do you think you can create access to information without organizing that information?


