I make no apology for returning to the subject of search technologies and products so soon after covering some aspects of the topic in my April column. Attending the Eighth Infonortics Search Engine Meeting that took place in Boston in early April has prompted my return. Over one hundred delegates participated in the meeting, with papers from academic researchers, vendors (with only minimal product placement), and corporate users. The papers can be downloaded from: www.infonortics.com/searchengines/sh03/03pro.html. Raul Valdes-Perez of Vivisimo gave one of the papers, and he referred to the problems and costs of "information overlook." We are all familiar with information overload, the result of information being pushed to our desktop. Information overlook refers to the situation whereby the search process fails to discover relevant information. It must be there, but we can't find it!
A theme of several papers at the conference was the difference between retrieving information from unstructured content (mainly text) and from structured content that is held in relational databases. Sue Feldman (International Data Corporation) set out the similarities and differences (mainly differences!) between the two technologies. There is going to be a requirement for convergence in the near future to handle enquiries such as, "I am looking for trendy Italian shoes costing less than $100." The request for shoes under $100 is met easily by a relational database, which can do a great job of range searching on prices. But the "trendy Italian" is an unstructured enquiry because there are many synonyms for trendy and, as for Italian, does the enquirer want shoes made in Italy, or shoes that have an Italian design or an Italian brand? Translate this to an enterprise environment and then you have queries such as, "How many clients do we have in the Midwest that spend more than $500K?" Just what does the enquirer mean by Midwest?
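To make the shoe example concrete, here is a minimal sketch of how the two halves of such an enquiry might be combined. It is an illustration under stated assumptions, not a description of any vendor's product: the `Shoe` record, the `CATALOG` data, the `TRENDY_SYNONYMS` list, and the function names are all hypothetical. The price test stands in for the relational range search, and a crude synonym match over free text stands in for the unstructured side. Note that it does nothing to resolve what the enquirer means by "Italian".

```python
from dataclasses import dataclass

# Hypothetical synonym list; a real system would draw on a thesaurus or taxonomy.
TRENDY_SYNONYMS = {"trendy", "fashionable", "stylish", "chic"}


@dataclass
class Shoe:
    name: str
    price: float       # structured field, suited to range searching
    description: str   # unstructured free text


def matches_structured(shoe, max_price):
    """Range test of the kind a relational database handles easily."""
    return shoe.price < max_price


def matches_unstructured(shoe, terms, nationality):
    """Crude text match: any synonym present, plus the nationality word."""
    words = set(shoe.description.lower().split())
    return bool(words & terms) and nationality in words


def search(catalog, max_price, terms, nationality):
    """Return names of shoes that satisfy both halves of the enquiry."""
    return [
        shoe.name
        for shoe in catalog
        if matches_structured(shoe, max_price)
        and matches_unstructured(shoe, terms, nationality)
    ]


# Invented sample data, for illustration only.
CATALOG = [
    Shoe("Loafer A", 89.0, "stylish italian leather loafer"),
    Shoe("Loafer B", 140.0, "chic italian suede loafer"),
    Shoe("Boot C", 75.0, "rugged hiking boot"),
    Shoe("Sneaker D", 60.0, "fashionable sneaker with italian design"),
]

if __name__ == "__main__":
    print(search(CATALOG, 100, TRENDY_SYNONYMS, "italian"))
```

The price filter is exact, while the text match depends entirely on which synonyms someone thought to list. A shoe described as "hip" would be overlooked, which is precisely the problem.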
In his presentation, Verity CTO Prabhakar Raghavan ruefully remarked that the total market value of the unstructured retrieval business was probably around $500 million, whereas the RDBMS business was worth perhaps a hundred times more. This is despite the fact that most experts agree that the volume of unstructured information in an organization is not only substantially greater than that of structured information, but is also growing more rapidly. The problems of information overlook are, therefore, likely to become substantially worse, not better.
There are a number of resources available that facilitate the comparison of CMS software, notably CMSWatch (www.cmswatch.com). Until recently, finding comparable information on search engines was virtually impossible, despite the efforts of Avi Rappoport and her Search Tools site (www.searchtools.com). The problem is made worse by the fact that search engines need to be considered within the context of a larger group of companies that provide taxonomy, categorization, and visualization functionality. Perhaps we need a new industry sector category of Information Discovery Software!
The challenge has been taken up by the Swedish consulting company Infosphere (www.infosphere.se) that published a 110-page report in March 2003 entitled: "Unstructured Information Management—An Overview of the Enterprise Search, Text Analysis, and Visualization Market." After an excellent overview of market needs and technology options, the core of the report is a highly structured comparative analysis of the offerings from forty vendors, many of them new to me. As well as the comparative analysis, the key features of the products of each vendor are neatly summarized with a concise summary of the business prospects of each company. It is clear that the authors of the report, Magnus Stensmo and Mikael Thorson, know what they are writing about, but given their background in IBM's research community that is not surprising. The cost of a single copy of the report at the time of writing is $325, and anyone with an interest in search technologies should click on the site and order a copy today. To complement the report, the company has also set up a Weblog at: www.unstruct.org, which comments on news and developments in this field.
Another issue raised at the Infonortics conference was the need to understand how users search. The dictum "You have 12 minutes before a user gives up" was proposed and generally accepted. There has not been a great deal of research into this area, which makes the publication of a paper on "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems" in the electronic journal Information Research (http://informationr.net/ir/8-3/paper152.html) especially timely. As I remarked in April, the sooner the balance is restored between content contribution and information retrieval the better, but I fear that by the time the next Search Engine Meeting takes place progress will still have been minimal, and information overlook will have reached endemic levels.