Can XML Drive Taxonomies and Categorization?

Page 1 of 2

XML might seem like it is everywhere, but it is not, at least not yet.

If you google "XML," you do get a stunning 20.5 million hits, which is about four times as many as "Britney," but (sensibly) half as many as "God." So I guess XML falls short of omnipresence. Still, the prevalence of XML has made it a too-ready answer to seemingly every question about information technology in general and content management in particular. The assumption seems to be that, no matter the requirement or problem, XML is the answer.

When it comes to a specific question, such as, "How does XML support taxonomy development and usage?" the answers that come forth can be far-flung, disparate, and confusing. The waters get muddied further if you tune into all the announcements from the vendors, industry groups, and analysts who have opinions about such matters. What you might call a taxonomy, a vendor might call a thesaurus, or vice versa. What one vendor accomplishes automatically, the next one might accomplish with some manual intervention, and so on.

The very word "taxonomy" is a bit of a landmine. Information specialists and taxonomists understand taxonomies as mechanisms for organizing content into logical groupings, but some of the taxonomy and search vendors who support different "taxonomies" seem to use the word loosely for true taxonomies, keyword lists, and various kinds of thesauruses. Moreover, many of the vendor offerings are broader than taxonomy-building itself. Even a relatively specialized vendor such as Autonomy combines content categorization with other core services such as content distribution and personalization. Verity combines its leading search engine, linguistic tools, and other features into a single product line, K2 Enterprise, while content management vendor Documentum includes taxonomy tools in Content Intelligence Services, an add-on to its core platform.

How Analysts Define It
Documentum's focus on "content intelligence" coincides nicely with how some of the industry analysts tend to view taxonomy software as part of a larger set of tools. Forrester Research includes taxonomy vendors as part of a larger market focused on what they call "Intelligent Content Services." Forrester analyst Laura Ramos says this larger market "includes technology that not only categorizes content into taxonomies, but also looks for relationships outside the taxonomy and related to the context in which the user operates." This would include technologies like pattern recognition, collaborative filtering, recommendation engines, visualization, and text mining.

The overlap between taxonomy and other technology makes sense to Ramos, as she finds that taxonomy efforts are typically part of some larger effort. "They are often part of another technology initiative including document and Web content management," said Ramos. "Taxonomies can leverage metadata and common descriptions to associate text information with structured data in the developing area of content-centric applications."

Ramos also correctly points out that taxonomies predate the many technologies that are being used today to manage content. "Some large organizations," noted Ramos, "like those in the pharmaceutical, healthcare, and aerospace industries, established taxonomies prior to the advent of computers (and more recently the Web) to create a common language and reference points for research." As a result, taxonomy efforts today combine both human analysis and the computer tools that can support such analysis.

What Practitioners Say
Information specialists and others who build taxonomies emphasize the need to combine human analysis with the tools and technology that can support such analysis. "There is no substitute for the manual work and thinking that goes into the start of a project," said Tina Baker, president and CEO of The Cadence Group, an Atlanta information management company. "No matter which tools will be used—and whether XML is part of the project or not—organizations still need to look at their requirements for acquiring and accessing information."

Still, XML tools and content can speed the process of implementation. For example, consider the taxonomies and thesauruses available in XML format. While The Cadence Group has created and customized a number of taxonomies for clients across a broad range of industries, one key focus has been medical content. For this, Cadence taxonomists and information professionals rely heavily on MeSH (Medical Subject Headings), the controlled thesaurus from the National Library of Medicine. "We have used MeSH as a basis for a number of taxonomies," explained Baker. Cadence maintains its own version of MeSH in-house and keeps it in sync with changes as the Library releases them.

Using such public resources is often a good starting point, but practitioners emphasize that the real work of making a taxonomy successful for users is understanding an organization's requirements for using content. Such requirements analysis can take the form of a "content audit," said Seth Earley, president and founder of Massachusetts-based Earley and Associates. Earley often works with clients in creating taxonomies for large content sets and knows that a useful starting point can come from a number of existing tools and sources. "You can look at navigation schemes that others are using, existing indices, and table-of-content style sources," said Earley. "But whatever you use as your starting point still needs to be tuned to the needs of the given organization and application."

XML and Your Content
So where, specifically, does XML fit into the process? The answer is different for every organization and application, of course, but XML is likely to appear in one or more of three places in a taxonomy project: in the content itself, in the metadata for the content, and in the taxonomy files, thesaurus(es), or keyword list(s).

Some of the content itself may exist as XML-tagged documents. Organizations are still very likely to have all kinds of heterogeneous content, stored in a variety of systems across the enterprise. Ron Kolb, director of technology and strategy for Autonomy, points out that the best technologies "need to deal equally with well-structured and unstructured content." Autonomy views XML as an emerging lingua franca for systems to ingest documents for indexing and classifying. According to Kolb, "Autonomy's processes for ingesting, storing, and delivering data are all XML-based now." A system like Autonomy's IDOL Server indexes documents that are already tagged in XML, and can automatically insert XML tags in other documents. These tags can then be used during the indexing and classification phases to support the taxonomy.
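To make the idea of automatically inserted tags concrete, here is a minimal Python sketch. It is not how Autonomy's IDOL Server works; it simply wraps known terms in XML elements so that a later indexing step could read them. The term list, the `<doc>` and `<term>` element names, and the function names are all hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical term-to-category map; a real project would draw on its taxonomy.
TERM_CATEGORIES = {
    "aspirin": "Drugs",
    "hypertension": "Conditions",
}

def tag_terms(text, term_categories=TERM_CATEGORIES):
    """Wrap each known term in a <term category="..."> element under a <doc> root."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in term_categories) + r")\b",
        re.IGNORECASE,
    )
    doc = ET.Element("doc")
    last = None  # most recently added <term> element, if any
    pos = 0
    for match in pattern.finditer(text):
        between = text[pos:match.start()]
        if last is None:
            doc.text = (doc.text or "") + between   # text before the first tag
        else:
            last.tail = (last.tail or "") + between  # text after the previous tag
        category = term_categories[match.group(1).lower()]
        last = ET.SubElement(doc, "term", category=category)
        last.text = match.group(1)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        doc.text = (doc.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return doc

def categories_in(doc):
    """Return the sorted set of categories found in a tagged document."""
    return sorted({t.get("category") for t in doc.iter("term")})

tagged = tag_terms("The memo mentions aspirin and hypertension.")
print(ET.tostring(tagged, encoding="unicode"))
print(categories_in(tagged))
```

Once the tags are in place, a classifier only needs to read the `category` attributes rather than re-scan the raw text, which is the sense in which the tags "support the taxonomy."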

Some of the metadata for the content may exist as XML-tagged text. Documents that are stored in a document management system, for example, are often stored with metadata. Increasingly, this metadata is available as XML-tagged data. Indeed, even the current version of Microsoft Office stores the "Properties" tab as XML, and the upcoming version of Microsoft Word will be able to store the entire document as XML. The categorization tools and search engines can use this metadata when indexing the documents, and the fielded metadata can be mapped to terms in the taxonomy. Thus, Microsoft Word documents that include title and keyword metadata properties can be more accurately indexed and organized.
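The mapping from fielded metadata to taxonomy terms can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the `<properties>`, `<title>`, and `<keywords>` element names are not the actual Microsoft Office schema, and the keyword-to-node table stands in for whatever taxonomy an organization maintains.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; element names are illustrative only.
METADATA_XML = """
<properties>
  <title>Quarterly Cardiology Review</title>
  <keywords>cardiology; hypertension; budget</keywords>
</properties>
"""

# Hypothetical mapping from keyword values to taxonomy nodes.
KEYWORD_TO_NODE = {
    "cardiology": "Medicine/Cardiology",
    "hypertension": "Medicine/Cardiology/Hypertension",
}

def taxonomy_nodes(metadata_xml, keyword_to_node=KEYWORD_TO_NODE):
    """Return (matched taxonomy nodes, unmatched keywords) for one record."""
    root = ET.fromstring(metadata_xml)
    raw = root.findtext("keywords", default="")
    # Split the semicolon-delimited keyword field and normalize case.
    keywords = [k.strip().lower() for k in raw.split(";") if k.strip()]
    matched = [keyword_to_node[k] for k in keywords if k in keyword_to_node]
    unmatched = [k for k in keywords if k not in keyword_to_node]
    return matched, unmatched

matched, unmatched = taxonomy_nodes(METADATA_XML)
print(matched)
print(unmatched)
```

Returning the unmatched keywords alongside the matches is a deliberate choice: those leftovers are candidates for human review, in keeping with the practitioners' point that tools support manual analysis rather than replace it.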
