Why Taxonomies Need XML

Page 2 of 3


Fear of Baggage Handling 
Despite the advantages of capturing the structure of a taxonomy, most organizations are still hesitant to invest the resources needed to do so. Creating the tags adds a lot of baggage to the vocabulary. After all, even the official XML specification states that "terseness in XML markup is of minimal importance." A single-word term can suddenly explode into a dozen or more lines of XML depending on the complexity of the schema. It is understandable that people are reluctant to justify this to management when simply getting resources to create the taxonomy in the first place is often a hard sell.

Unfortunately, this is a case of "pay me now or pay me later." When every application has its own way of reading a taxonomy file and every taxonomy has its own way of organizing terms, each tool has to be taught separately how to interpret each taxonomy. This means writing adapters, and with the number of available taxonomies growing rapidly, it is keeping many a developer gainfully employed. "Adapters are easy to write and in many cases they are the right approach," said Colby Dyess, senior product manager for information access solution provider Endeca. "However, XML has a number of advantages specifically when you're trying to perform integration tasks." Endeca's co-founder Pete Bell agrees. "While Endeca is neutral towards how a taxonomy is represented, the plumbing should conform to some standard and XML is perfect. It's nice and abstract, making it fairly easy to translate existing sources. You do the work once to create an adapter and then programmatically convert to XML."

Without XML, you would be required to create a specific adapter for every pairing of taxonomy and application that needs to interact. If you have four tools that need to exchange a single taxonomy with one another, you must create and maintain six adapters, one for each pair of tools. This number grows quadratically, not linearly, as the number of tools and taxonomies in use increases. If six partners each want to exchange their own taxonomy with the other five members of the group, 30 one-way transformations are involved.
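The adapter arithmetic can be checked with a few lines of Python. This is only a sketch of the counting argument: the function names are made up for illustration, and the counts assume that a system never needs an adapter to itself. The last function shows the alternative the article argues for, where each system converts to and from one shared XML format.

```python
def pairwise_adapters(n: int) -> int:
    """Two-way adapters needed when every pair of systems is wired directly."""
    return n * (n - 1) // 2


def one_way_transformations(n: int) -> int:
    """One-way transformations when each of n parties translates to each other party."""
    return n * (n - 1)


def shared_format_adapters(n: int) -> int:
    """Adapters needed when every system only converts to and from one common format."""
    return n


if __name__ == "__main__":
    for n in (4, 6, 12):
        print(
            f"{n} systems: "
            f"{pairwise_adapters(n)} pairwise adapters, "
            f"{one_way_transformations(n)} one-way transformations, "
            f"{shared_format_adapters(n)} adapters with a shared format"
        )
```

For four tools this gives six pairwise adapters; for six partners it gives 30 one-way transformations, against six adapters if everyone agrees on a common format.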

XML can simplify the process by leveraging XSLT as a commonly understood translation mechanism, but even so there are still challenges. Dyess explains, "It's not very difficult to map from XML to XML, but what gets really complex is in understanding the relationships between different kinds of XML. I have some vendors that conform to a standard XML-based format, but I have others, some mom and pops, that use their own proprietary XML. You need something that is able to understand the relationship, how you would map those two together. Whatever your systems are where you want to push the taxonomy, they are not going to natively understand unless perhaps it understands the standards, then absolutely." Those standards are beginning to emerge with Zthes leading the pack. 
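To make the mapping problem concrete, here is a minimal sketch of translating one vendor's proprietary XML into a common term-record layout. Several things here are assumptions: the vendor format (`vocab`, `node`, `id`, `label`) is invented, the target element names are only loosely modeled on Zthes-style term records and have not been validated against the actual Zthes schema, and the translation is written with Python's standard library rather than XSLT simply to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical proprietary vendor format: hierarchy is implied by nesting.
VENDOR_XML = """
<vocab>
  <node id="1" label="Vehicles">
    <node id="2" label="Cars"/>
    <node id="3" label="Trucks"/>
  </node>
</vocab>
"""


def to_common_format(vendor_xml: str) -> ET.Element:
    """Flatten nested vendor nodes into term records with explicit BT/NT relations."""
    source = ET.fromstring(vendor_xml)
    target = ET.Element("Zthes")  # assumed root name for the common format

    def add_relation(term: ET.Element, rel_type: str, other: ET.Element) -> None:
        # Record a broader-term (BT) or narrower-term (NT) link to another node.
        rel = ET.SubElement(term, "relation")
        ET.SubElement(rel, "relationType").text = rel_type
        ET.SubElement(rel, "termId").text = other.get("id")
        ET.SubElement(rel, "termName").text = other.get("label")

    def emit(node: ET.Element, parent) -> None:
        # One flat <term> record per vendor node, then recurse into children.
        term = ET.SubElement(target, "term")
        ET.SubElement(term, "termId").text = node.get("id")
        ET.SubElement(term, "termName").text = node.get("label")
        if parent is not None:
            add_relation(term, "BT", parent)
        children = node.findall("node")
        for child in children:
            add_relation(term, "NT", child)
        for child in children:
            emit(child, node)

    for top in source.findall("node"):
        emit(top, None)
    return target


if __name__ == "__main__":
    common = to_common_format(VENDOR_XML)
    print(ET.tostring(common, encoding="unicode"))
```

The point Dyess makes shows up in the `emit` function: the hard part is not rewriting tags but deciding that nesting in one format means a broader/narrower relation in the other. That knowledge has to be written down once per proprietary format.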

Standards for describing, exchanging, and navigating controlled vocabularies have been available for a couple of decades in the form of Z39.19 and Z39.50. Both are ANSI/NISO standards (the Library of Congress serves as the maintenance agency for Z39.50), and until recently their use has been mostly confined to large libraries that need to ship big vocabularies back and forth. Both are pre-web technologies and their implementation in commercial systems has been inconsistent at best. A single query can return wildly different results even from among systems that embrace the standards. Zthes provides a common implementation of Z39.50 and Z39.19 as an XML schema where all elements are universally defined. This eliminates the need for custom implementations and idiosyncratic interpretations of queries and taxonomies. As a result, Zthes has opened adoption of the standards to the masses. As vendors, system implementors, and integrators adopt the standard, the need for custom transformations and schemas is rapidly diminishing. If the vocabulary is structured and encoded according to the standard, you can pass a taxonomy from one system that speaks Zthes to another and the receiving system will know how to deal with it.

Standardization is lowering the barrier to adopting XML for taxonomies and most published vocabularies are moving in that direction. Taxonomy Warehouse, an online clearinghouse for controlled vocabularies, has adopted Zthes as one of three standard representations for its available taxonomies. Perhaps an even more significant driving force, however, is the changing nature of taxonomies. 

Taxonomies Evolve
When taxonomies first began moving from the library into the enterprise, they brought along their rigid, hierarchical view of the world. Organizations would go to great lengths to create the one true taxonomy that represented every aspect of their business, and the result could easily fit into an Excel spreadsheet. People are beginning to realize that this approach no longer works. Dave Clarke, global taxonomy director for Factiva, has noticed the shift. He says, "People are beyond simple folder structure. In a spreadsheet taxonomy, what you're looking at is a two-dimensional array, just up and down coordinates. If all you are building is a strict hierarchical taxonomy then it can be represented in just two dimensions. However, organizations today are getting much more sophisticated than two dimensional representations of the world. The world just doesn't fit into two dimensions at all."

The way organizations are moving beyond a two-dimensional worldview is by turning to faceted classification. Facets are mutually exclusive properties or characteristics of a specific subject that can be combined to describe something of interest. Rather than needing to develop an exhaustive collection of pigeonholes where everything can be slotted into a single correct location, facets allow you to build up descriptions from different properties as needed. Facets are at the heart of systems from Autonomy, Siderean, and Endeca for good reason. According to Bell, "The faceted approach used by Endeca allows multiple, orthogonal authorities about the same subject to live side by side in peaceful coexistence. It is up to the user to decide which authority to choose from." Bell points out that the benefits of applying facets extend to both information consumers and creators. "Facets help users find what they are looking for far more easily, while helping content owners manage content far more easily," he said. "It lets you make do with exponentially fewer nodes in your taxonomies."
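A small sketch may help show what building up descriptions from different properties looks like in practice. The catalog, the facet names (`topic`, `region`, `format`), and both helper functions are hypothetical; this is not how Endeca or any other product named here implements faceted navigation.

```python
from typing import Dict, List

# Hypothetical catalog: each item carries one value per facet instead of
# living in a single slot of one large hierarchy.
ITEMS: List[Dict[str, str]] = [
    {"name": "Report A", "topic": "energy", "region": "Europe", "format": "PDF"},
    {"name": "Report B", "topic": "energy", "region": "Asia", "format": "HTML"},
    {"name": "Report C", "topic": "finance", "region": "Europe", "format": "PDF"},
]


def select(items: List[Dict[str, str]], **facets: str) -> List[Dict[str, str]]:
    """Return the items that match every requested facet value."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in facets.items())
    ]


def refinements(items: List[Dict[str, str]], facet: str) -> Dict[str, int]:
    """Count how many items fall under each value of one facet."""
    counts: Dict[str, int] = {}
    for item in items:
        counts[item[facet]] = counts.get(item[facet], 0) + 1
    return counts


if __name__ == "__main__":
    european = select(ITEMS, region="Europe")
    print([item["name"] for item in european])
    print(refinements(european, "topic"))
```

The user can start from any facet and narrow by any other, in any order, which is what lets several independent ways of organizing the same content coexist.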

The rule of thumb for facets, according to Joseph Busch of Taxonomy Strategies, is that four facets of 10 nodes each have the same discriminatory power as one taxonomy of 10,000 nodes. This discriminatory power can be extended at any time by adding additional facets, moving the taxonomy from two dimensions to n dimensions. This power comes at a cost. Each facet is a taxonomy in miniature and must be managed as such. Again, XML encoding is the key. "Once you go n-dimensional, you can't express that by a traditional coordinate method," points out Factiva's Clarke. "But certainly in an XML schema you can start expressing that multi-dimensional, highly expressive, semantically rich set of relationships."
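The rule of thumb is simple multiplication, and a short sketch makes the trade-off visible. The function name is invented for illustration; the only assumption is that any value from one facet can be combined with any value from another.

```python
from math import prod
from typing import Sequence, Tuple


def facet_power(facet_sizes: Sequence[int]) -> Tuple[int, int]:
    """Return (distinct combinations, nodes to maintain) for a set of facets."""
    combinations = prod(facet_sizes)  # one value chosen from each facet
    nodes_to_maintain = sum(facet_sizes)  # each facet is managed separately
    return combinations, nodes_to_maintain


if __name__ == "__main__":
    for sizes in ([10, 10, 10, 10], [10, 10, 10, 10, 10]):
        combos, nodes = facet_power(sizes)
        print(f"{len(sizes)} facets: {combos} combinations from {nodes} nodes")
```

Four facets of 10 nodes yield 10 × 10 × 10 × 10 = 10,000 combinations while only 40 nodes need maintaining, and a fifth facet of 10 nodes raises the combinations to 100,000 for the cost of 10 more nodes.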

Factiva's customers have begun to take advantage of this approach. "All these things can be combined together into a multi-dimensional view of how to get at the information," Clarke said. "A lot of large enterprises—be they government agencies, private enterprises, or anything else—are finding that to provide richer information access you need to go n-dimensional in terms of how you describe the world." That said, Clarke does not see facets as the final answer. "Facets are one way, but we should be thinking about making the transition from taxonomies to ontologies, where there is extremely rich semantic expression to define the relationships between different concepts, people, events, places, and so forth. Faceted classification is a precursor to ontology, which is going to be a much richer semantic representation than faceted classification ever attempted." 
