Antique Search Appliance Roadshow



About 50 years ago I fell in love with chemistry. I could think of no more interesting subject and spent 3 very enjoyable years at the University of Southampton. It was a time when the department was home to some exceedingly able researchers, and when I crept into the research meetings I heard phrases such as, "We are beginning to think ..." and "It's starting to look as though ..." as new techniques were being developed just down the corridor in the undergraduate labs.

As an information scientist, and not as a chemist, I've done a lot of work in the pharmaceutical sector, but it is only over the last few years, working with the e-delivery team at the Royal Society of Chemistry, that I have really come back to this beloved subject ... and found it almost unrecognizable! Techniques that were emerging in the 1970s are now commonplace and, indeed, almost antique.

There is a tendency in enterprise search, especially among IT managers, to assume that the current generation of search applications is now so powerful that they will never need to buy another one. If only life were that simple. The reality is that enterprise search applications are already running out of power as the volume of information continues to increase. There are still so many challenging problems in search that it is hard to know where to start. Federated search might be a good place. Any enterprise collection of content will be made up of multiple repositories, and users cannot be expected (though they often are!) to know in which repository the information they need resides.

The challenges are difficult enough when dealing with just text, without adding in business intelligence and other database applications and access to external business and technical information resources. Users do not complain because they do not know what is possible. Even more important, they don't know what will be possible in the near future as new search algorithms from new players in the market offer better solutions.

Another significant problem in search is that of synonyms and related terms. Of course, in theory, you could build vast directories, but the effort to keep them current would be enormous, and the latency of the lookup would also be significant. Coming to a desktop near you before long will be search tools based on topic modeling, which use Bayesian statistics and machine learning to infer the relationship between topics in a document. What is fascinating about this technique is that it dates from the development of latent semantic indexing (LSI) in the late 1980s, which was then refined to give us probabilistic latent semantic indexing (PLSI) a decade later. PLSI is the core technology used by Recommind. Now the buzz is about latent Dirichlet allocation (LDA), which itself has formed the basis for correlated topic models (CTM) and dynamic topic modeling (DTM). Incidentally, Dirichlet was a German mathematician who died more than 150 years ago, which is an interesting reflection on the longevity of mathematical techniques.

I think that's enough tech-speak for now. The point I am making is this: There is a significant amount of research being undertaken to find relevant information in very large collections of documents. Much of this research is being funded and implemented by national security agencies, which might inhibit the speed with which it becomes commercially available, but a check through Google Scholar will show that research teams at Google and Microsoft are doing a lot of work in this area.

Another search challenge is in assessing search engine performance and, in particular, search recall. Precision is a measure of the percentage of retrieved documents that are relevant, and that is relatively easy to determine. Recall is a measure of the percentage of all relevant documents that are retrieved, and that in theory requires knowledge of how many relevant documents there are in a collection. One approach is to use a test collection, but that is not a real-world option. There is now a lot of interest in using crowd-sourcing techniques to assess recall, in particular the Mechanical Turk service developed by Amazon.com (www.mturk.com). Again, this technique is still in the experimental stage, but it could be of significant value both to search vendors seeking to improve search performance and to organizations wishing to compare search applications.
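
For readers who like to see the arithmetic, the two measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name precision_recall and the document IDs are made up for the example, and it assumes you already hold a complete list of the relevant documents, which is exactly the thing that is hard to obtain outside a test collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned (hypothetical data).
    relevant:  IDs judged relevant for the query; assumed to be
               fully known, which is rarely true in practice.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned

    # Precision: share of what was retrieved that is relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of everything relevant that was retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 documents returned, 5 relevant in the collection, 2 in common.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7", "d8", "d9"]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.40
```

Precision needs judgments only on the handful of documents returned, whereas recall needs the full relevant list, and that gap is what the crowd-sourcing experiments are trying to fill.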

To get some indication of the range of current information retrieval research, go to the ACM SIGIR site at www.sigir.org. Much of this research could be commercially available in the next 3 to 5 years. How could you make use of it, assuming that your current search vendor is able to take advantage of these significant advances in search effectiveness? Certainly a member of your search support team should be tracking and evaluating information retrieval research. In 10 years' time I'm certain that today's search technology will look very antiquated.