The Search Continues: Three Publishers’ Site Search Solutions

Page 1 of 4

Put an infinite number of journalists at an infinite number of keyboards and, sooner or later, they'll generate something that you need to read. But how will you find it?

Your local search engine may work well enough for tracking down a Web site. But suppose you've been publishing an encyclopedia for five generations or a newspaper for 300 years. Your readers want a search that avoids 5,367 off-target hits at one extreme and a "0 Results Found" message at the other, yet lets them follow any given thread back to result 1,750 as easily as they can walk into a mall and find the right bookstore, the right department, and the right shelf.

Convera's RetrievalWare brochure aptly defines two key criteria for a successful search: recall, which measures how well the system finds all the relevant documents in a database (it answers the question, "How much is out there?"), and precision, which measures the system's ability to return only relevant documents (it answers the question, "Is what I found relevant to what I'm looking for?").
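The two metrics can be sketched in a few lines. This is a minimal illustration, not RetrievalWare's implementation; the document IDs and relevance judgments are invented for the example.

```python
# A minimal sketch of recall and precision, assuming we already know
# which documents in the database are truly relevant to a query.

def recall(retrieved, relevant):
    """Fraction of all relevant documents the search actually found."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

# The engine returned four hits; only three of the five truly
# relevant articles were among them.
hits = ["a12", "a31", "a44", "a90"]
truly_relevant = ["a12", "a31", "a44", "a57", "a63"]

print(recall(hits, truly_relevant))     # 3 of 5 relevant found -> 0.6
print(precision(hits, truly_relevant))  # 3 of 4 hits relevant  -> 0.75
```

Note the tension the article describes: returning every document in the database maximizes recall but ruins precision (the 5,367 off-target hits), while returning nothing "wrong" by returning nothing at all does the reverse.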

A good research tool must be more robust than a word processor or free Web search. These elementary tools generally require an exact word match to bring up a "hit" (possible match). A good professional tool, by contrast, relies on a rich semantic database that recognizes all of the variants a researcher might use (President Harry Truman, Truman, or "Give 'em Hell Harry") and knows whether a particular variant is tightly or loosely related to the search term. It also needs a good list of idioms ("high blood pressure" must equal one token) and a good list of synonyms (so that a search for "high blood pressure" would return articles on "hypertension"). Finally, it must be able to strip out "noise words" (such as and, the, and what is) and then generate a meaningful relevancy score.
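The noise-word, idiom, and synonym steps above can be sketched as a toy pipeline. The word lists and synonym table here are invented for illustration; a production semantic database would be far larger and would weight each relationship.

```python
# Hypothetical, simplified query pipeline: lowercase the query, keep
# known idioms whole as single tokens, drop noise words, and expand
# the remaining terms through a small synonym table.

NOISE_WORDS = {"and", "the", "what", "is", "a", "of"}

SYNONYMS = {
    "high blood pressure": {"hypertension"},   # idiom = one token
    "hypertension": {"high blood pressure"},
}

def tokenize(query):
    """Split a query into tokens, keeping known idioms whole."""
    q = query.lower()
    tokens = []
    for idiom in SYNONYMS:
        if idiom in q:
            tokens.append(idiom)
            q = q.replace(idiom, " ")
    tokens += [w for w in q.split() if w not in NOISE_WORDS]
    return tokens

def expand(tokens):
    """Add synonyms so one phrasing also matches the other."""
    expanded = set(tokens)
    for t in tokens:
        expanded |= SYNONYMS.get(t, set())
    return expanded

print(expand(tokenize("what is high blood pressure")))
# -> the idiom itself plus its synonym "hypertension"
```

So the noise words "what" and "is" vanish, the three-word idiom survives as one token, and the expanded query would now retrieve articles that say only "hypertension."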

To do so, a professional-grade search engine puts your search string in context (concept searching), so that it can evaluate the true meaning of your request, not just the words you used. For example, when you search for "army tanks," a concept search considers the words that appear around the ones you entered, gives a higher weight to articles about the military or vehicles, and ranks those articles at the top of the "hit" list.

Therefore, a good concept search depends on a set of semantic relationships—an understanding of which words are related to others. Furthermore, the engine should recognize that certain businesses and professions rely on idioms, and that those idioms take on extra importance in articles for and about those professions.

Initially, a search might address the universe of all possible options in the publisher's data. The software should then dynamically guide the searcher to refine the search by asking contextually relevant questions.

Of course, these criteria apply only to documents already in electronic format. How does one store, organize, and retrieve 300 years of newspaper articles? Ideally, you would simply scan your collected content and add a few keywords for searching. In reality, though, for the information to be useful, scholars need full-text searching, and Optical Character Recognition (OCR) is far from perfect. You can either hire a score of humans to clean up scanned documents or find a search engine that tolerates OCR errors.

Most search engines overcome OCR errors through "pattern matching." Since scanned documents, like newspaper photographs, begin as patterns of dots and blank spaces, pattern matching compares patterns rather than matching words. (OCR tries to recognize those patterns and convert them into letters; pattern matching simply ranks the similarity of the patterns.) Therefore, even a mistake in the first letter of a word does not defeat the match.
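The same tolerance can be demonstrated at the character level with a simple bigram-overlap measure. This is one generic way to rank pattern resemblance, not the proprietary method any of these vendors uses; the OCR misread "Lruman" is an invented example.

```python
# Dice similarity over character bigrams: rank how alike two strings'
# letter patterns are instead of demanding an exact word match, so an
# OCR misread like "Lruman" still scores close to "Truman".

def bigrams(word):
    """Set of two-character substrings of a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def similarity(a, b):
    """Dice coefficient over bigrams: 1.0 means identical patterns."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(similarity("Truman", "Truman"))   # identical -> 1.0
print(similarity("Truman", "Lruman"))   # bad first letter -> still 0.8
print(similarity("Truman", "Lincoln"))  # unrelated word -> 0.0
```

Because only one of the six bigrams is disturbed by the misread first letter, "Lruman" still outranks every unrelated word when searching for Truman, which is exactly the resilience the article attributes to pattern matching.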

The three case studies that follow represent three distinct approaches to providing sophisticated searches at three publishers' Web sites.
