Auto-Categorization: Coming to a Library or Intranet Near You!

Page 2 of 2

Nov 01, 2002

November 2002 Issue

I'd like to look at a special area that is difficult but potentially very rewarding: the corporate intranet. It is difficult because all the things that make newsfeeds work so well when pushed through an auto-categorizer are missing on almost all corporate intranets.

In intranets, the content is written by a really wild mix of writers, some good, some bad, and some downright scary. Some of the content is pure literature that unfortunately sits right next to an accounting document, which sits next to a scientific research paper. Some of the content has good titles, some has very bad titles, and in some unfortunate cases every page on the site bears the same title. Some of the documents are a single page of HTML, some are book-length PDF documents, and some have only a paragraph of content but contain links to all sorts of other pages or sites.

In addition, the economics aren't quite right for today's intranets to deploy auto-categorization. Most corporate intranets, no matter how big, don't have thousands of new pages being published every day. The one exception, an intranet that posts newsfeeds, still doesn't quite justify auto-categorization, because the newsfeeds arrive with their own specific categorization, which then needs to be integrated with the categorization schema or taxonomy of the rest of the intranet.

Nevertheless, I think that ultimately the corporate intranet will see the most lucrative employment of auto-categorization software. Certainly the need is great, although different, and there is a very large number of corporate intranets, which makes the challenge worthwhile. One trend that should emphasize the need for categorization tools is the current drive to "portalize" intranets. Portal software is itself an attempt to solve the infoglut problem, but without a good taxonomic foundation, portals too often end up as very expensive replacements for bookmarks.

First and foremost, auto-categorization cannot completely replace a librarian or information architect although it can make them more productive, save them time, and produce a better end-product. The software itself, without some human rules-based categorization, cannot currently achieve more than about 90% accuracy—which sounds pretty good until you realize that one out of every ten documents listed in the results of a search or browse interface will be wrong. And more importantly, it will be wrong in inexplicable ways—ways that will cause users to lose confidence in the system.
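To put the 90% figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that a results page lists ten documents and that each one is miscategorized independently of the others; neither assumption comes from any vendor's data, and the function names are my own.

```python
def expected_wrong(accuracy: float, page_size: int) -> float:
    """Average number of miscategorized documents on one results page."""
    return (1.0 - accuracy) * page_size


def chance_of_any_wrong(accuracy: float, page_size: int) -> float:
    """Probability that a page contains at least one miscategorized document,
    assuming errors are independent (an illustrative simplification)."""
    return 1.0 - accuracy ** page_size


# Compare the 90% ceiling with a higher accuracy on a ten-item page.
for accuracy in (0.90, 0.99):
    print(
        f"accuracy={accuracy:.0%}: "
        f"expected wrong per page={expected_wrong(accuracy, 10):.2f}, "
        f"chance of at least one wrong={chance_of_any_wrong(accuracy, 10):.1%}"
    )
```

Under these assumptions, at 90% accuracy roughly two out of three ten-item pages show the user at least one wrong document, which is why the number is less comfortable than it first sounds.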

While it is much faster than a human categorizer and doesn't require vacation days and medical benefits, auto-categorization is still simply not as good as a human categorizer. It can't understand the subtleties of meaning like a human can, and it can't summarize like a human, because it doesn't understand things like implicit meaning in a document and because it doesn't bring the meaningful contexts that humans bring to the task of categorization. One thing that early AI efforts taught us is that while speed is significant, speed alone cannot make up for a lack of understanding of meaning.

It is still very difficult to accurately evaluate this type of software or even know what to look for, what is important, and what is hype. The answers will vary from situation to situation, but there are a few things to keep in mind.

First, be very wary of the results of bake-offs that purport to prove that one product is more accurate. I've seen results that show that Mohomine beats Inxight, but Inxight beats Inktomi, but Inktomi beats Mohomine, and then in a last-minute, come-from-behind victory, Inxight beats Mohomine and Mohomine beats everyone. In other words, something is fishy. What it comes down to is that there is no clear method of comparing results; too much depends on the specific content that is chosen as the test material, the editors or information architects who administer the test, and perhaps also the time of day or phases of the moon.

Also, be very wary of vendors that tell you that their software really is automatic and you can just open the box, install the software, point it at your content, and out pops your corporate Yahoo, ready to solve your information needs. It is true that the software is getting better and that with the addition of features like Applied Semantics' Ontology, H5Technologies' Subject Matter Framework, or TopicalNet's 1-million-node Internet-derived taxonomy, the software can do a pretty good job of assigning documents to a general category without any human intervention. Nevertheless, even with these world-knowledge starting points, there is still a lot of work for information professionals to do.

In fact, as the early success stories show, it is in conjunction with information professionals that you find the most successful implementations. For example, according to Gary Szukalski of Verity, the company was able to achieve 99% accuracy by incorporating editor-derived rules with its automatic categorization in a recent Gale Group project. Working well with humans then becomes an important area to investigate in your evaluation.
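One way to picture editor-derived rules working alongside automatic categorization is the small Python sketch below. It is not how Verity or any other product works; the rules, the category names, and the toy keyword-counting categorizer are all hypothetical stand-ins chosen to show the idea that a human-written rule, when it matches, takes precedence over the software's guess.

```python
from typing import Callable, Optional, Tuple

# Hypothetical editor-written rules: if the phrase appears, the category is decided.
EDITOR_RULES = [
    ("vacation policy", "HR/Vacation Policy"),
    ("expense report", "Finance/Expenses"),
]

# Hypothetical keyword lists used by the toy automatic categorizer.
AUTO_KEYWORDS = {
    "Science/Botany": {"plant", "leaf", "green", "grow"},
    "HR/Vacation Policy": {"vacation", "leave", "holiday"},
}


def apply_editor_rules(text: str) -> Optional[str]:
    """Return the category of the first matching editor rule, if any."""
    lowered = text.lower()
    for phrase, category in EDITOR_RULES:
        if phrase in lowered:
            return category
    return None


def toy_auto_categorizer(text: str) -> str:
    """Stand-in for real software: pick the category sharing the most keywords."""
    words = set(text.lower().split())
    return max(AUTO_KEYWORDS, key=lambda category: len(words & AUTO_KEYWORDS[category]))


def categorize(
    text: str, auto: Callable[[str], str] = toy_auto_categorizer
) -> Tuple[str, str]:
    """Editor rules win when they match; otherwise fall back to the software."""
    rule_hit = apply_editor_rules(text)
    if rule_hit is not None:
        return rule_hit, "rule"
    return auto(text), "auto"


doc = "Vacation policy update: time off requests grow in the green summer months"
print(toy_auto_categorizer(doc))  # the software's guess on its own
print(categorize(doc))            # the guess after editor rules are applied
```

On its own, the toy categorizer is fooled by "grow" and "green" and files the vacation memo under botany; the editor's rule catches it. A real deployment would have far richer rules and a far better categorizer, but the division of labor is the same.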

Another feature to evaluate is how well the software supports a distributed workflow model, one that lets subject matter experts and authors supply provisional categorization and metadata and then routes their work to a central team of editors or information architects who, with the aid of features like auto-summarization, can quickly say "good job" or "I don't think so" (or perhaps the more politically prudent, "May I respectfully suggest that your document might better belong in the HR vacation policy category, not in the large green things that grow category.").
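The distributed workflow just described can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Submission` fields, the `ReviewQueue` class, and especially the "summarizer," which simply takes the first sentence and is only a crude placeholder for real auto-summarization.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Submission:
    title: str
    body: str
    proposed_category: str  # provisional category chosen by the author or expert
    summary: str = ""
    status: str = "pending"
    final_category: str = ""


def first_sentence_summary(body: str, max_chars: int = 80) -> str:
    """Crude placeholder for auto-summarization: the first sentence, truncated."""
    return body.split(".")[0].strip()[:max_chars]


class ReviewQueue:
    """Routes provisionally categorized documents to the central editorial team."""

    def __init__(self) -> None:
        self._pending: Deque[Submission] = deque()

    def submit(self, submission: Submission) -> None:
        submission.summary = first_sentence_summary(submission.body)
        self._pending.append(submission)

    def next_for_review(self) -> Optional[Submission]:
        return self._pending.popleft() if self._pending else None


def review(submission: Submission, approve: bool, corrected_category: str = "") -> Submission:
    """Editor's decision: "good job" keeps the proposal, otherwise recategorize."""
    if approve:
        submission.status = "approved"
        submission.final_category = submission.proposed_category
    else:
        if not corrected_category:
            raise ValueError("a corrected category is required when not approving")
        submission.status = "recategorized"
        submission.final_category = corrected_category
    return submission


queue = ReviewQueue()
queue.submit(Submission(
    title="Vacation Policy",
    body="Employees accrue vacation monthly. More details follow.",
    proposed_category="Large Green Things That Grow",
))
item = queue.next_for_review()
review(item, approve=False, corrected_category="HR/Vacation Policy")
print(item.summary, "->", item.final_category)
```

The point of the sketch is the shape of the process, not the code itself: authors propose, the software summarizes, and an editor makes the final call quickly.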

Oh, I hope so! Very dangerous! I hope it gets more and more dangerous as it develops and we find more interesting uses for it. But not necessarily dangerous in the ways you might think.

First, it is not particularly dangerous in terms of job security for information professionals, unless someone makes a big mistake. The danger is someone with a background more on the computer side of information science deciding that this new software really is automatic and that means they can get rid of those pesky librarians. This is a possibility, but the good news is that it will tend to be a self-correcting mistake. When users start to howl about their automatically categorized content as loudly as they do now about search, the "software is all we need" crowd should get the message.

Rather than a danger to information professionals, auto-categorization can, in fact, not only enhance their ability to solve users' information problems but may even elevate their status to something closer to the level it should be. Not only will librarians and information architects produce more, and more economically, but they will have expensive software associated with the task and, as we all know, in today's corporations, unless there is expensive software involved, no one will think you're valuable.

Well, OK, maybe that's a bit overstated, but auto-categorization software has the potential to highlight what should already be clear—that the information professional is engaged in a fundamental infrastructure activity. Information professionals are or should be involved in the creation and maintenance of the intellectual infrastructure of their organization. While technology and organizational infrastructures have received more attention and resources, some of the imbalance could be righted through the intelligent utilization and integration of new software, new methods of working with both content providers and content consumers, and new ways of presenting information.

So, in conclusion, I think it's likely that auto-categorization will ultimately enhance both the power and the prestige of the information professional. Until it gets so good, so intelligent, so insightful, that it takes over completely. Of course, by then, we will have either merged humanity with machines and created a new cyborg race or all humanity will have retired to a life of the pursuit of idle pleasure and machines will have inherited the earth.

I wouldn't worry about it, yet.
