Taxonomies and Topic Maps: Categorization Steps Forward

Page 2 of 3


Time To Do Something
Around the time of Berners-Lee's presentation at XML 2000, Cambridge, MA-based Forrester Research published a report with the pithy title, "Must Search Stink?" Forrester more or less answered the question by saying, "Not if you begin to implement search technology better, as well as develop some best practices." As Forrester pointed out in a related report, "Managing Content Hypergrowth" (January 2001), "Mushrooming online assets force both contributors and end-users to wade through even deeper content haystacks for the 'needles' they want, dramatically increasing the likelihood of confusion and frustration." And this is quickly becoming everyone's problem, since the amount of online content continues to grow while the effective use of search technology continues to lag.

Ellerbeck suggests that you develop a taxonomy with the help of experts, apply it consistently, and create feedback mechanisms so that you can continuously improve the data. Content management vendor Eprise makes such tagging a central component of its recommended best practices. According to Hank Barnes, vice president of strategy for Eprise, "A key aspect of making content more effective is metatags for classification. These tags enable content users to more easily find relevant information and to get more in-depth information on specific subjects." Barnes notes that Eprise uses these types of tags to dynamically locate information in response to user actions, such as following a certain path through a Web site. Adds Barnes, "Often, this approach of content delivery based on classification is much more effective than full-text or general-purpose searching."

Categorization has advantages beyond the core systems of search and content management. Orlando-based DigitalOwl develops and markets solutions for content syndication and Digital Rights Management (DRM). According to DigitalOwl president and CEO Kirstie Chadwick, "Content that is tagged with relevant keywords is ideally suited to be marketed and distributed through a broad range of distribution channels."

According to Chadwick, DigitalOwl provides tools that simplify the classification and tagging of content for distribution and marketing through digital channels. Once content is tagged and classified, DigitalOwl's KineticEdge technology can automatically drive highly relevant content items directly to the desktops of corporate end-users who have expressed a need for specific topics or areas of interest.

Of course, the idea of well-tagged content is something that information professionals know well, and Web publishers rely on some widely used processes to apply categories. In HTML-tagged content, category is typically indicated in the values of the metatags in the page header. Savvy Web publishers pay careful attention to how they populate the description metatag and, more significantly, the keywords metatag. These two metatags have much to do with how the HTML-tagged page is indexed by the various search engines.
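
As a rough illustration of where that category information lives in a page, here is a minimal Python sketch that reads the description and keywords metatags with the standard library's html.parser module. The sample page, the MetaTagReader class, and the read_categories function are hypothetical names invented for this example. It makes no claim about how any particular search engine actually indexes pages.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collect name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical page, invented for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Quarterly Widget Report</title>
<meta name="description" content="Sales figures for widgets, third quarter.">
<meta name="keywords" content="widgets, sales, quarterly report">
</head><body><p>...</p></body></html>"""


def read_categories(html_text):
    """Return the page description and its list of keywords."""
    reader = MetaTagReader()
    reader.feed(html_text)
    raw_keywords = reader.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return reader.meta.get("description", ""), keywords


description, keywords = read_categories(SAMPLE_PAGE)
print(description)
print(keywords)
```

The keywords arrive as a single comma-separated string, so the sketch splits and trims them before treating them as category labels.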

Yet, despite the understanding of how categorization can aid in retrieval, it has not been used to its full advantage. Within many companies and organizations, there has traditionally been resistance to categorizing large volumes of data, as "hand categorization" has been viewed as human-intensive and unscalable, and automated categorization techniques are viewed as less effective.

Technology Finally Catching Up?
Categorization technology seems finally to be overcoming the conventional wisdom. This is partly because the tools seem to be improving, and also because the tools allow for the kinds of user intervention that improve the technology's results. Ideally, the technology allows for the user to create the high-level categories and hierarchies, and then the tools are used to tag individual documents. As more and more content is tagged, the user can intervene to shape the categories and refine how documents are being tagged. This kind of iteration and continuous improvement leads to the best results in a cost-effective manner.

Two leading providers of categorization tools are Semio and Inxight. Semio and Inxight are both perhaps better known for their visual navigation tools, Semio Map and Inxight's Star Tree Studio. In fact, under the flashy visual tools, both companies rely on a core of sophisticated linguistic software that each has been developing for years. Contemporary search engines are supported by a variety of linguistic approaches that have become de rigueur: conjugation, including inflection and uninflection; at least rudimentary noun-phrase analysis; and spelling correction. Inxight and Semio add some of the newer approaches that are less widely available and perhaps in some cases less proven, including tagging for categories.

Underneath any of these tools are two things: databases of words, and software to help interpret them. Clearly, the better the software, the better the resulting tool, but the underlying database is perhaps just as important. When you begin to enter areas such as categorization and summarization, where the software is trying to divine the meaning of the text, the words begin to have many facets: multiple meanings, and varying meanings in different contexts. Linguists offer many examples: "leaves" as the plural for "leaf," as well as a form of the verb "leave"; the word "mole" as both a thing that burrows through the ground and a spy that goes underground, as well as that mark on your arm. The database of words needs to support the software in its efforts to interpret these words and their many facets.

Linguistic technology continues to advance for many reasons: computers are faster, and both disk and RAM costs continue to drop. But the databases also improve. As research has progressed, so has the availability of tagged corpus material to work with. And, on a more practical level, once a group of words has been captured and codified, adding to that database becomes easier. This is one business, and one process, that benefits from critical mass.

To better understand this idea of critical mass, consider Inxight Categorizer, which employs a process of "categorization by example." In this model, the publisher hand-tags a set of "training" documents for different categories. The publisher then begins the more automated process of comparing a new document with this collection of manually coded documents. Using Inxight's linguistic analysis technologies, the Categorizer selects similar documents from the training set and infers the probable coding for the new document from these examples. Over time, the savvy user can refine the training sets to get increasingly accurate results.
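
The general idea of categorization by example can be sketched in a few lines of Python. This is not Inxight's algorithm, which the article does not describe in detail. It is a simple stand-in that uses nearest-neighbor voting over bag-of-words vectors, and the training documents, category names, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector of lowercased words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(new_text, training_set, k=3):
    """Infer a category from the k most similar hand-tagged documents."""
    new_vec = vectorize(new_text)
    scored = sorted(
        ((cosine(new_vec, vectorize(text)), category)
         for text, category in training_set),
        reverse=True,
    )
    votes = Counter()
    for similarity, category in scored[:k]:
        votes[category] += similarity  # closer examples count for more
    if not votes or max(votes.values()) == 0:
        return None  # nothing in the training set resembles the document
    return votes.most_common(1)[0][0]


# Hypothetical hand-tagged "training" documents.
TRAINING_SET = [
    ("The court ruled on the patent appeal and the judge dismissed the claim", "legal"),
    ("The attorney filed a motion with the court before trial", "legal"),
    ("Quarterly revenue rose as the company reported higher profit", "finance"),
    ("Investors sold shares after the company cut its profit forecast", "finance"),
]

print(categorize("The judge asked the attorney to refile the motion", TRAINING_SET))
```

Refining the training set, as described above, corresponds here to adding or correcting entries in TRAINING_SET. A real tool would also apply linguistic analysis, such as stemming and noun-phrase detection, that this sketch omits.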

Autonomy's similarly named Categorizer also employs a method of categorization by example, and includes an XML tagging function that can automatically add XML tagging to the data. For a vendor like Autonomy, the value-add comes in the automation. It positions its tool as a key component in eliminating "costly and time-consuming" manual tagging of individual documents, allowing the Web publishers to concentrate on the user-facing hierarchy and high-level topics. Autonomy reasons that it is the high-level topics that ultimately must resonate with the end-users. The technology should help Web publishers assign documents automatically to the topics. If a Web publisher decides to reorganize or realign the user-facing topics, the underlying data should also change easily.
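
To make the separation between user-facing topics and underlying document tags concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree module. The element names, attribute names, and topic map are hypothetical and are not Autonomy's schema or API. The point is only that when the topics live in a separate mapping, a reorganization changes the mapping and not the per-document categories.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from document-level categories to user-facing topics.
TOPIC_MAP = {"patents": "Legal", "litigation": "Legal", "earnings": "Business"}


def tag_document(doc_id, text, categories, topic_map):
    """Wrap a document in XML, labeling each category with its current topic."""
    doc = ET.Element("document", id=doc_id)
    for category in categories:
        element = ET.SubElement(
            doc, "category", topic=topic_map.get(category, "Uncategorized")
        )
        element.text = category
    ET.SubElement(doc, "body").text = text
    return ET.tostring(doc, encoding="unicode")


xml_before = tag_document(
    "doc-1", "The court ruled on the patent appeal.", ["patents"], TOPIC_MAP
)
print(xml_before)

# Reorganizing the user-facing topics touches only the mapping;
# the document keeps the same underlying category.
REORGANIZED_MAP = dict(TOPIC_MAP, patents="Intellectual Property")
xml_after = tag_document(
    "doc-1", "The court ruled on the patent appeal.", ["patents"], REORGANIZED_MAP
)
print(xml_after)
```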

Semio's offering in the categorization space, Tagger, supports a wide variety of data sources, including data stored in relational databases. Its graphical user interface gives the user powerful tools for working on the categories and hierarchies, which are then easy to populate automatically. Significantly, LexisNexis has licensed the Semio technology to include in LexisNexis Portal, its core offering to law firms. The LexisNexis Portal is an integrated, customized desktop solution that allows law firms flexible access to the rich LexisNexis database. Semio Tagger will be available as an optional, add-on component to the LexisNexis Portal.

According to Michele Vivona, vice president of Large Law Market Planning for LexisNexis, "Semio Tagger creates customized, browsable category structures for Web portals, giving users better and faster access to the information they need. It also allows LexisNexis Portal customers the ability to create and implement customized, automated text categorization and browsing capabilities as part of a complete Web portal solution."