Auto-Categorization: Coming to a Library or Intranet Near You!

Page 1 of 2

Nov 01, 2002

November 2002 Issue

Automatic-categorization: it's coming, it could be big, and it could hurt. One thing, though: it's not actually automatic. I have yet to see a product that did not need, or was not improved by, human intervention. But other than that, almost everything they are saying about it is true.

In this article, I will offer answers to a few simple questions that need to be asked when exploring this new type of software: What is it? Why is it suddenly everywhere? What can it do? What can't it do? How should information professionals approach it? And is it dangerous?

Simply put, automatic-categorization is a new type of software that assigns documents to subject matter categories using a wide variety of techniques. These techniques include statistical Bayesian analysis of the patterns of words in the document, clustering of sets of documents based on similarities, support vector machines that represent every word and its frequency with a vector, neural networks, sophisticated linguistic inferences, the use of pre-existing sets of categories, and seeding categories with keywords.
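To make the first technique on that list a little more concrete, here is a minimal sketch of Bayesian word-pattern analysis (a naive Bayes classifier). It is an illustration only, not how any particular vendor's product works; the function names, the two categories, and the four training sentences are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest log-probability for the text."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Start from the prior: how common the category is in the training data.
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Invented training data: a real system would need far more examples.
training = [
    ("quarterly earnings rose as revenue beat forecasts", "finance"),
    ("the bank raised interest rates on loans", "finance"),
    ("the team won the match with a late goal", "sports"),
    ("the coach praised the players after the game", "sports"),
]
word_counts, doc_counts = train(training)
print(classify("revenue and interest rates", word_counts, doc_counts))  # finance
print(classify("players scored a goal", word_counts, doc_counts))       # sports
```

Note that the classifier only ever sees word counts. Nothing in it understands what a bank or a goal is, which is the sense in which this kind of software does not categorize the way humans do.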

From this list of techniques, it seems pretty clear that none of this type of software categorizes the same way humans do. And that is both its strength and its weakness.

To further complicate things, there are also a very large number of companies offering their version of this new software and, of course, most claim that their approach is the best, the fastest, and the smartest. They are all wrong. And they are all right. In other words, the whole product space is wide open with no clear leader and no clear correct or best approach. This was my conclusion after evaluating about 20 offerings, and it is a conclusion that a recent Delphi Group study echoes. Thus, the information professional's task of evaluating the promise and pitfalls of auto-categorization becomes a difficult one indeed.

An informal survey puts the number of categorization companies at nearly 50, with more and more search and content management companies scrambling to incorporate categorization into their products. Why has there been such an explosion of companies in the last two years? It seems to me that the answer is twofold: First, the development of new techniques in recent years has allowed companies without major resources to create versions of categorization software that work as well as or better than existing software from the early leaders like Autonomy, and they are offering them at a fraction of the price.

However, the second and more compelling reason that auto-categorization has taken off in the last year or so is actually much simpler: Searching stinks, so users can't find anything. Categorizing content enables browse or search/browse functionality and users prefer browsing and browsing is more successful than a simple keyword search and facilitates knowledge discovery and even after years of trying to teach users how to do advanced searching, they still won't.

Oh, and one more factor in this list of ands: Companies don't want to pay librarians to categorize their content because they think it's too expensive. They are wrong, at least when you factor in the time employees waste trying in vain to find that document that they must have in order to answer that customer's question, without which the customer will scram and go with a competitor who had the answer instead. Despite that, many companies still won't pay for humans to categorize their content, but they are more likely to pay anywhere from 250K to 750K for software that frequently does a less effective job.

The first and best thing auto-categorization software can do is to very quickly scan every word in a document and analyze the frequencies of patterns of words and, based on a comparison with an existing taxonomy, assign the document to a particular category in the taxonomy.
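As a rough illustration of that comparison step, the sketch below scores a document's word frequencies against each category of a small taxonomy using cosine similarity and picks the best match. The taxonomy, its seed keywords, the threshold value, and the function names are all hypothetical; commercial products use much richer category profiles than a short keyword list.

```python
import math
import re
from collections import Counter

# Hypothetical taxonomy: each category is seeded with a few keywords.
TAXONOMY = {
    "Legal": ["contract", "liability", "court", "patent"],
    "Pharmaceutical": ["drug", "trial", "dosage", "clinical"],
    "Finance": ["revenue", "earnings", "shares", "market"],
}


def term_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_category(text, taxonomy=TAXONOMY, threshold=0.1):
    """Return the best-matching category, or None if nothing matches well."""
    doc = term_frequencies(text)
    scores = {name: cosine(doc, Counter(words)) for name, words in taxonomy.items()}
    best = max(scores, key=scores.get)
    # Below the threshold, leave the document for a human to file.
    return best if scores[best] >= threshold else None


print(assign_category("The clinical trial tested a new drug dosage in adults"))
print(assign_category("Lunch menu for Friday"))
```

The threshold is where the human comes back in: documents that match no category well are better routed to a person than forced into the nearest bucket.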

Some other things that are being done with this software are "clustering" or "taxonomy building," in which the software is simply pointed at a collection of documents, say 10,000 to 100,000, and it searches through all the combinations of words to find clumps or clusters of documents that appear to belong together.
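A toy version of clustering can be sketched in a few lines. The example below is an assumption-laden simplification: it measures similarity as word overlap (Jaccard similarity) and uses a greedy single pass, whereas real products use more sophisticated measures and algorithms. The stopword list, threshold, and sample documents are invented for illustration.

```python
import re

# A tiny, hypothetical stopword list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "for", "in", "on", "and", "to", "is"}


def word_set(text):
    """Return the set of non-stopword words in the text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs, threshold=0.2):
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster holds its combined words and member indexes
    for index, text in enumerate(docs):
        words = word_set(text)
        best, best_score = None, threshold
        for candidate in clusters:
            score = jaccard(words, candidate["words"])
            if score >= best_score:
                best, best_score = candidate, score
        if best is None:
            clusters.append({"words": set(words), "members": [index]})
        else:
            best["words"] |= words
            best["members"].append(index)
    return [c["members"] for c in clusters]


docs = [
    "Patent court ruling on drug patent",
    "Court ruling in patent dispute",
    "Quarterly earnings and revenue report",
    "Revenue and earnings forecast for the quarter",
]
print(cluster(docs))  # [[0, 1], [2, 3]]
```

The clusters come back as groups of document numbers with no names attached. Deciding what each clump should be called, and whether it deserves to be a category at all, is still work for a person.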

Another capability that can be found in some of the software is the ability to create an automatic summary of a document. Of course, it's not really a summary—certainly not in the way a human creates a summary. Rather, the software scans through the document and tries to find important sentences using rules like: the first sentence of the first paragraph is often important.
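A minimal sketch of this kind of rule-based "summary" might look like the following. It assumes just two rules: sentences containing frequently used words score higher, and the very first sentence gets a bonus. The bonus value, stopword list, and sample text are invented, and actual products apply many more rules than these.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "was", "in", "how"}


def content_words(text):
    """Lowercase words in the text, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def summarize(text, max_sentences=2, lead_bonus=1.0):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))
    scored = []
    for position, sentence in enumerate(sentences):
        tokens = content_words(sentence)
        # Average document-wide frequency of the sentence's words.
        score = sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0
        if position == 0:
            # Rule from the text: the first sentence is often important.
            score += lead_bonus
        scored.append((score, position))
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    return [sentences[pos] for _, pos in sorted(top, key=lambda pair: pair[1])]


text = (
    "Taxonomy software sorts documents quickly. "
    "The weather was nice. "
    "Editors still review how the software sorts documents."
)
for sentence in summarize(text):
    print(sentence)
```

The output is a selection of existing sentences, not new prose, which is why it falls short of what a human would call a summary.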

Another feature of auto-categorization is metadata generation. The idea is that the software categorizes the document and then searches for keywords that are related to the category. This can be useful even if the suggested metadata isn't simply accepted as is, since, according to Gil Elbaz, CTO for Applied Semantics, authors or editors work better selecting from an existing set of keywords than when starting fresh with a blank field.

A closely related feature offered by some companies is noun phrase extraction or, as one company, Inxight, calls it, "a thing finder." This list of noun phrases can be used to generate a catalog of entities covered by the collection. One example might be to generate a list of company names and then use that list to scan other or new documents to determine which documents deal with which particular companies.
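The second half of that example, scanning documents against an existing list of names, is simple enough to sketch. The code below assumes the list of company names has already been produced (the extraction step itself is much harder and is not shown). The company names, document IDs, and function name are made up for illustration.

```python
import re
from collections import defaultdict


def index_by_entity(entities, documents):
    """Map each entity name to the IDs of the documents that mention it."""
    patterns = {
        # Match the whole name, ignoring case, not as part of a longer word.
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in entities
    }
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for name, pattern in patterns.items():
            if pattern.search(text):
                index[name].append(doc_id)
    return dict(index)


# Hypothetical company names and documents.
companies = ["Acme Corp", "Globex"]
documents = {
    "memo-1": "Acme Corp signed a deal with Globex.",
    "memo-2": "Globex opened a new office.",
    "memo-3": "No companies are mentioned here.",
}
print(index_by_entity(companies, documents))
```

Exact matching like this misses variants such as abbreviations or misspellings, which is one reason the commercial tools layer linguistic analysis on top.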

Auto-categorization software had its start in the news and content provider arena and it is there that it still finds its most successful and developed application. The reason is clear: It is an environment in which you generate thousands or tens of thousands of documents a day that you need to categorize and one very clear advantage that this software has is speed. Also, the material is written by professionals who know how to write good titles and opening paragraphs that the software can use to categorize the material, and there are already a number of editors who not only know the subject matter and vocabulary, but also have had experience in categorizing similar content.

A new and intriguing market is in the intelligence industry. Companies like Stratify, H5Technologies, and Inktomi (which recently acquired Quiver) are all active in this area. They, like the publishing industry, have huge volumes of content to categorize. However, according to Lewis Shepard, head of business development for H5Technologies, the intelligence industry differs in two ways: it needs a finer granularity of categorization, and it needs to categorize content not just at the document level, but at the paragraph level. This requires a level of sophistication beyond the early simple Bayesian statistics.

Basically, whatever you use categorization for now can be improved, enhanced, and made more economical by the addition of auto-categorization software. The highest value areas, though, are still those where there is a large influx of documents, preferably well written by professionals, that need to be categorized into a fairly shallow or general taxonomy, or those where the content has a highly developed and specialized vocabulary, as in the pharmaceutical or legal industry.

