Can XML Drive Taxonomies and Categorization?

Page 2 of 2


The taxonomy, thesaurus, or keyword list(s) may exist as XML. A number of "public" thesauruses, metadata specifications, and keyword lists are available in XML format (and some are still in HTML and in XML's predecessor language, SGML). These include MeSH, the IEEE Metadata Elements and Structure, and the thesaurus supporting GLIN, the Global Legal Information Network. Such resources can be at least a starting point for organizations beginning a taxonomy project. Significantly, several of the taxonomy and categorization vendors are able to import and use these resources, and are also able to export and distribute their internal taxonomies in an XML structure that can be shared with other software applications. Verity, for example, can import and export XML taxonomies, according to Andrew D. Feit, senior vice president of marketing, while Documentum supports importing and exporting taxonomies through a tool it has developed called Taxonomy Exchange Format (TEF).
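To make the import and export idea concrete, here is a minimal Python sketch of reading a taxonomy from XML and writing it back out. The markup is an assumption made up for illustration: the `taxonomy` and `category` elements and the `name` attribute do not come from MeSH, GLIN, TEF, or any vendor's format, and the helper names `category_paths` and `build_taxonomy` are hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy markup. The element and attribute names are invented
# for this sketch and do not follow MeSH, GLIN, TEF, or any vendor schema.
SAMPLE = """<taxonomy name="Legal">
  <category name="Law">
    <category name="Contracts"/>
    <category name="Intellectual Property">
      <category name="Patents"/>
    </category>
  </category>
</taxonomy>"""


def category_paths(element, prefix=()):
    """Import direction: yield each category as a tuple of names, root first."""
    for child in element.findall("category"):
        path = prefix + (child.get("name"),)
        yield path
        yield from category_paths(child, path)


def build_taxonomy(name, paths):
    """Export direction: rebuild an XML tree from a collection of paths."""
    root = ET.Element("taxonomy", {"name": name})
    nodes = {(): root}
    # Sorting puts every parent path ahead of its children.
    for path in sorted(paths):
        parent = nodes[path[:-1]]
        nodes[path] = ET.SubElement(parent, "category", {"name": path[-1]})
    return root


root = ET.fromstring(SAMPLE)
paths = list(category_paths(root))
exported = ET.tostring(build_taxonomy(root.get("name"), paths), encoding="unicode")
```

Flattening the hierarchy into paths is only one possible internal representation. A real exchange format would also need to carry synonyms, related terms, and identifiers, which this sketch leaves out.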

So XML plays several roles in taxonomy development and in the classification of content. In some cases, the content itself—or its metadata—contains structural and semantic tags that can directly or indirectly support categorization and indexes. In other cases, less structured content is scanned and automatically categorized by intelligent engines such as those from Verity and Autonomy, with XML tagging added by the intelligent engines. And the taxonomies themselves, while often the result of both human and automated processes, are now often managed and integrated as XML documents. XML is now a key part of the infrastructure on which content is managed, indexed, and classified. As a result, XML in many ways supports the development and maintenance of taxonomies that are being built and exploited to improve search, access, and navigation of content.

Content management technology is almost de rigueur now in medium and large organizations, and with it have come the problems of information overload. As a result, taxonomy development is now viewed as a core business issue. As practitioners like The Cadence Group's Baker explain, organizations are well aware of how much time is spent locating information. Making the process work better directly affects both the bottom and top lines. To this end, taxonomy tools represent one area of software that continues to show significant growth. Nathaniel Palmer, vice president and chief analyst of Boston's Delphi Group, expects the market for taxonomy tools "to grow to $386M in 2004 from its current level of $270M for 2003 (and up from $228M in 2002)."

Significantly, Delphi's numbers represent spending on both the software itself and the accompanying professional services. In other words, there is still plenty of work to do, even if your content already exists in XML. As one industry executive noted, "XML is going to help, and will be part of infrastructure, but the same hard work needs to be done."

Sidebar: Verity, Autonomy, and Documentum
The vendors that support the taxonomy development process use a variety of processes and tools, but with some common approaches and concepts. Each of the three vendors discussed here (Autonomy, Verity, and Documentum) uses some kind of content categorization technology that organizes content and documents into categories. In each case, the categorization tool works on a training set of documents; typically, the bigger the training set, the more precise the results.

Beyond that, each technology takes its own highly engineered and often specialized approach. Autonomy relies heavily on what it calls "Advanced Probabilistic Content Modeling," a technique that combines Bayesian and information processing algorithms. Verity's core engine uses a "Logistic Regression Classifier." Documentum has its own approaches, but is also quick to point out that it partners with both Verity and Autonomy and works well with both.
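The general idea of learning categories from a training set can be illustrated with a toy example. The Python sketch below is a plain multinomial naive Bayes classifier with add-one smoothing, chosen only because it is a simple Bayesian technique. It is not Autonomy's or Verity's algorithm, and the function names, categories, and training documents are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(training_set):
    """Count documents and words per category from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in training_set:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # prior from category frequency
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[category][word]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set; a real one would hold many documents per category.
TRAINING_SET = [
    ("patent filing and trademark dispute", "legal"),
    ("contract law and court ruling", "legal"),
    ("clinical trial of a new drug", "medical"),
    ("patient diagnosis and drug dosage", "medical"),
]
model = train(TRAINING_SET)
```

With this model, `classify(model, "court ruling on a patent")` picks the `legal` category because the known words in the query appear only in the legal training documents. With four training documents the estimates are fragile, which is the practical reason a larger training set tends to give more precise results.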

Happily, the level of XML support seems to be fairly consistent among the major vendors. They all provide a means for working with XML content and metadata, and they all seem to be able to import XML taxonomies and thesauruses.

However, everyone's content, infrastructure, and requirements will differ. As a programming colleague of mine likes to say, your mileage may vary. When looking at these tools, it may be worthwhile to have a demonstration of the engines working with your content and your taxonomy, if possible. Some organizations go to great lengths to test the various engines against each other. While you may or may not want to engage in such an extended "shoot out," you should learn enough about the product to ensure that it will support your content and your taxonomies.

Companies Featured in this Article
The Cadence Group
The Delphi Group
Earley and Associates
Forrester Research
Global Legal Information Network
IEEE Metadata Elements and Structure www.imsproject.org/metadata/mdbestv1p1.html
