I seem to have spent most of my life, and certainly the three decades of my professional life, searching for things. To be more specific, searching for information. It started at university, when as a chemistry major the only way to survive was to be familiar with the Chemical Abstracts index structure. When I graduated in 1970, the use of computers for information retrieval was just on the horizon, but in my first job I was using 10,000-hole optical coincidence cards to locate documents in a very large internal collection that had been built up in the research laboratory where I was working.
By 1980, search technology was able to search not just the text of an abstract of a scientific article, but the text of news stories (the New York Times Information Bank) and laws and case reports (the Lexis service from Mead Data Central). These were services providing access to the published literature, but there were also a number of search software applications sold to handle corporate document collections, such as IBM STAIRS, Battelle Laboratories' BASIS, and the U.K. Atomic Energy Research Establishment's STATUS. And there were many more. Some of these were subsequently adapted for use with CD-ROM products in the 1990s.
With the arrival of the World Wide Web, the need for search engines suddenly became a top priority, and the search wars between AltaVista, Lycos, Excite, Northern Light, and Google have been very well documented. Meanwhile, the need to search very large collections of corporate documents was continuing to stimulate the development of products such as Verity and Autonomy. Until the end of the 1990s there were three search worlds: the external Web world; the large scientific, medical, and legal database services; and the world of corporate documentation. And then along came intranets.
Intranets and Search Technologies
Intranets present some difficult challenges for the user. In an ideal world, the navigation should be so intuitive that only on rare occasions should it be necessary to use the search function. However, we do not live in an ideal world, and all too often intranet managers put in a "powerful" search engine to overcome the problems of inadequate metadata and classification. Two problems then arise. The first is that the intranet manager (or more often the CIO) finds that there is no significant improvement, because the search engine is hunting through documents in which there is no consistency in the use of language. The second is that the search engine is being used by staff who are accustomed to searching the Web with Google, and so assume that a single word in the search box is all that is needed, only to be faced with a thousand undifferentiated hits.
Over the last year or so, intranet managers have become considerably more aware of the benefits of good content management software, but the message about the importance of good search software has still to be heard, let alone understood.
In a column of this length, there is not enough space to go into all the complexities of search technology, but I hope I can shed some light on a rather dark corner of most intranets. The process of "search" is actually a conflation of three individual steps. The first is helping the user to frame the enquiry in a way that expresses exactly what he or she is looking for, which can then be handled by the search engine itself. An engineer may start out by wanting to find a titanium alloy that is resistant to strong solutions of sulfuric acid at 300ºF. Just searching for "titanium alloys" and "corrosion resistance" could return a very large result set, each item of which would then have to be examined individually. Of course, some documents may refer to sulphuric acid (the U.K. spelling) and report the corrosion results in Centigrade. Will the search engine be able to cope with these variations?
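To make the engineer's problem concrete, here is a minimal sketch of query-time variant expansion, the kind of thing a good engine does behind the scenes. The variant table and the unit conversion are purely illustrative; real engines rely on much larger thesauri and normalization rules.

```python
# Illustrative sketch only: a tiny spelling-variant table and a unit
# conversion, standing in for a search engine's query normalization.

SPELLING_VARIANTS = {
    "sulfuric": ["sulphuric"],   # U.S. vs. U.K. spelling
    "aluminum": ["aluminium"],
}

def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit value so documents using Centigrade can match."""
    return (f - 32) * 5 / 9

def expand_query(terms):
    """Return the original terms plus any known spelling variants."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SPELLING_VARIANTS.get(term.lower(), []))
    return expanded

print(expand_query(["titanium", "sulfuric"]))
# The engineer's 300ºF is roughly 148.9ºC:
print(round(fahrenheit_to_celsius(300), 1))
```

A search engine that silently applies expansions like these will find the U.K.-spelled documents; one that does not will simply miss them, and the user will never know.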
The second step is the way in which this query is then run against the collection of documents and other material. This is where life gets interesting in the intranet world, as there may be a need to search both internal and external sites with the same query. Last year, I was working on an intranet project in the U.S. for an organization in which the published versions of its reports were only on the public Web site, which used a different version of the search engine from the one used for the intranet. Staff had to remember to search both repositories using different search sites and terminology.
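What that organization needed is usually called federated search: one query fanned out to every repository, with the result lists merged. The sketch below shows the idea under stated assumptions; the two repository functions are hypothetical stand-ins for two real engines.

```python
# Hypothetical federated search sketch: the two search functions stand in
# for an intranet engine and a public Web site engine, each returning
# (document, score) pairs.

def search_intranet(query):
    # Stand-in for the intranet's search engine.
    return [("intranet/report-draft.doc", 0.9), ("intranet/memo.doc", 0.4)]

def search_public_site(query):
    # Stand-in for the public Web site's (different) search engine.
    return [("www/annual-report.html", 0.7)]

def federated_search(query):
    """Run the same query against both repositories and merge by score."""
    merged = search_intranet(query) + search_public_site(query)
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

for doc, score in federated_search("titanium alloys"):
    print(f"{score:.1f}  {doc}")
```

The awkward part in practice is the merge: scores from two different engines are not directly comparable, which is exactly why asking staff to run two separate searches, or bolting the engines together naively, both give poor results.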
The final step is the presentation of the results to the user, and this is where things start to get very interesting. Two performance parameters are relevance (measured as precision) and recall. Precision is the percentage of retrieved documents that are actually relevant; recall is the percentage of the relevant documents in the collection that have been retrieved. All too often a user is so pleased to get any results that he or she never wonders what has not been retrieved.
Relevance itself is a very difficult concept. Various search engines use different algorithms to "calculate" relevance, often presenting it as a percentage. One of the reasons for the success of Google has been the quality (usually!) of the relevance, because of its use of reverse-citation analysis: the PageRank algorithm, which treats a link to a page as a citation of it. When we look through the list of documents or sites returned from a search engine there are always a few errors, called "false drops." When we can work out to our satisfaction how the false drop occurred (perhaps a misspelled word), then we feel we can trust the search engine. The minute we fail to work out how the result set was generated, our level of trust in the search engine falls off dramatically.
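The reverse-citation idea can be sketched in a few lines of power iteration: a page is important if important pages link to it. The three-page link graph below is purely illustrative, and this is a textbook simplification, not Google's production system.

```python
# Minimal PageRank by power iteration over a tiny illustrative link graph.
# links maps each page to the list of pages it links to.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start with equal rank
    for _ in range(iterations):
        # Each page keeps a small base rank, then receives a share of the
        # rank of every page that links to it.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```

Page C, cited by both A and B, ends up with the highest rank, which matches the intuition that being widely cited is a proxy for relevance.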
So You Want To Buy A Search Engine?
If you think that making a decision on which content management software product to buy is difficult, then the search engine decision is several orders of magnitude more difficult. There are fewer search products than there are CMS products, but the variety is so much greater. As well as the core search technology, there is a range of products offering natural language processing of the initial query (e.g., Albert and Lexiquest). Add to that products that generate taxonomies to aid the search process (e.g. Quiver, Semio, ActiveNavigation). Then include an ever-increasing range of visual outputs of search results (Inxight and Antarctic.ca). The possible combinations are almost endless, but the key question is, how do you know which will work best for your intranet?
Making the Comparisons
A number of companies provide analyses of CMS products, such as Doculabs and Ovum. At least here the problem can be partially reduced to a feature and functionality comparison. With search engines, the problem is to evaluate the performance of the search engine on your own intranet. Using just a sample of documents is not really a sensible test. Would you just drive a car you were thinking of buying around the garage car lot? For some time, the academic community has been able to use the TREC databases, set up by the National Institute of Standards and Technology (http://trec.nist.gov), which are primarily designed to enable fundamental work on text retrieval to be carried out on a comparative basis. Some commercial vendors have used these collections, and in the proceedings of the Ninth Text Retrieval Conference (TREC-9, 2000), there is a paper from Hummingbird on the performance of the Fulcrum software (http://trec.nist.gov/pubs/trec9/t9_proceedings.html).
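The measure most often reported in TREC-style evaluations is average precision, which rewards an engine for placing relevant documents high in the ranked list, not merely for retrieving them somewhere. A short sketch, with invented document identifiers and relevance judgments:

```python
# Average precision for one query: compute precision at each rank where a
# relevant document appears, then average over all relevant documents.

def average_precision(ranked_results, relevant):
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

ranked = ["d3", "d9", "d1", "d8", "d5"]   # engine's ranked output
judged_relevant = ["d1", "d3", "d5"]      # assessors' judgments

print(round(average_precision(ranked, judged_relevant), 3))
```

Averaging this figure over a large set of queries (the mean average precision) is what allows two engines to be compared on the same collection, which is exactly the comparison the TREC exercises are designed to support.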
Although valuable, using a standard test collection is not the ultimate answer. The process of selecting a search vendor needs more care than any other element of intranet technology, and certainly you will benefit from the advice of staff (librarians and information managers) or consultants who understand the fundamentals of search processes. The installation will also not be an overnight step, and if you can go from the realization that you need something better than Microsoft Site Server to a fully functioning search environment in less than six to nine months, then you probably deserve a place in the Search Procurement Hall of Fame. How much is this going to cost? The average contract value for Autonomy is $360,000.
In this column, I have been concerned with text retrieval. In many organizations, there is a pressing need to search through diagrams, pictures, audio and video collections, and much more. Solutions are available for some of these, and much more is on the way. The year 2001 was for me the year that intranet managers started to recognize the benefits of content management, but getting content into an intranet is one thing, and finding it again is quite another.
If you want to learn more about search technology then look for the Search Engine Meeting in San Francisco on April 15-16, 2002; www.infonortics.com/searchengines/index.html.