Searching Between the Covers: Convera and Encyclopedia Britannica
Despite its original mandate to serve a broad range of consumers, Encyclopedia Britannica went online in 1994 as a subscription service available only to universities. Gradually, the company opened the site to other institutions and to consumers. However, because consumers and researchers have different needs, it soon set up a separate consumer site.
In 1999, the prevailing Web model was to offer free content that attracted traffic and then to advertise to those who came. That changed when the ad market on the Web collapsed. As a result, Britannica's site currently operates on two levels: any visitor can access the site, including the beginnings of all articles, the Merriam-Webster collegiate dictionary, and articles from many popular and professional magazines, but Britannica now charges a subscription fee to read full articles. That makes search critical, because customers are not going to pay for what they can't find.
"An encyclopedia article offers special challenges for a search engine," according to Tom Lang, Britannica's executive director for product technology. "An article may be about a specific topic, but that topic, itself, might not be mentioned inside the article. For example, an article on chocolate might cover cocoa leaves, caffeine, even Hershey, Pennsylvania, while it only mentions the word chocolate in the title. Similarly, an article on President John Kennedy may be titled KENNEDY, John Fitzgerald, but then use Title Variance, referring to the subject as Mr. Kennedy, JFK, Jack, or the President. Also, inside the article, the content tends to be both complex and structured. So, when we looked at search engines, our primary criteria were flexibility and the ability to fine-tune a search." The tool that provides Encyclopedia Britannica with this level of flexibility is Convera's RetrievalWare.
To begin with, RetrievalWare offers several levels of content security. This allows Britannica to make one set of data open to anyone while restricting additional "levels" to paid subscribers; RetrievalWare returns only the documents that a given user is allowed to see.
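The idea of returning only what a user is cleared to see can be sketched in a few lines of Python. This is an illustration under stated assumptions, not RetrievalWare's actual mechanism: the two tiers (PUBLIC, SUBSCRIBER), the Document class, and the visible_results function are all hypothetical names invented for the example.

```python
from dataclasses import dataclass

# Hypothetical access tiers; the article says only that several "levels" exist.
PUBLIC, SUBSCRIBER = 0, 1


@dataclass(frozen=True)
class Document:
    doc_id: int
    title: str
    level: int  # minimum access level required to see this document


def visible_results(hits, user_level):
    """Return only the hits the user is cleared to see, preserving rank order."""
    return [doc for doc in hits if doc.level <= user_level]


# Invented sample hits, already ranked by a search engine.
hits = [
    Document(101, "Chocolate (article preview)", PUBLIC),
    Document(102, "Chocolate (full article)", SUBSCRIBER),
    Document(103, "Cacao (article preview)", PUBLIC),
]

print([d.doc_id for d in visible_results(hits, PUBLIC)])      # [101, 103]
print([d.doc_id for d in visible_results(hits, SUBSCRIBER)])  # [101, 102, 103]
```

Filtering inside the search step, rather than after the page is built, means a non-subscriber never receives a restricted document in the first place.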
Although Convera offers several attractive front-end templates, Encyclopedia Britannica felt that it needed a site based on its own architecture. Thus, the company chose to employ only the RetrievalWare engine to identify relevant articles. The engine retrieves the ID numbers of all relevant articles and passes those numbers to businessware that performs lookups against EB's Oracle database and displays the articles for the user.
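That division of labor, in which the engine returns only ID numbers and separate software fetches the articles, might look roughly like the following Python sketch. Everything here is assumed for illustration: search_engine is a hypothetical stand-in for the engine, an in-memory SQLite database substitutes for the Oracle database, and the articles table and its columns are invented.

```python
import sqlite3


def search_engine(query):
    """Hypothetical stand-in for the search engine: returns ranked article IDs only."""
    index = {"chocolate": [7, 3], "kennedy": [12]}
    return index.get(query.lower(), [])


def fetch_articles(conn, ids):
    """Look up article rows by ID, keeping the engine's ranking order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM articles WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database does not guarantee row order, so re-apply the ranking here.
    return [by_id[i] for i in ids if i in by_id]


# SQLite in memory stands in for the real article database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(3, "Cacao"), (7, "Chocolate"), (12, "KENNEDY, John Fitzgerald")],
)

results = fetch_articles(conn, search_engine("chocolate"))
print(results)  # [(7, 'Chocolate'), (3, 'Cacao')]
```

Keeping the engine's output down to IDs lets the site own its presentation layer, since the search component never needs to know how an article is stored or displayed.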
"We're very happy with the way the site has developed," says Tom Panelas, Britannica's director of corporate communications. "We've signed over 50,000 paid subscribers since we initiated the paying model, in July 2001. When we decided to introduce additional databases, such as magazines, our Student Encyclopedia Britannica, and the Concise Encyclopedia Britannica, it went wonderfully. RetrievalWare allowed us to offer new and different 'concept sources' [databases] with a minimum of effort and very little developmental programming."
"Britannica came to us," says John Henry Gross, Convera's senior manager of product marketing, "because a typical keyword search only produces the documents that you know exist. So anyone using the online encyclopedia would have to search using precisely the terms that appear in an article. For example, if you want medical information, typing 'newest high blood pressure medication info' would retrieve 'the newest high blood pressure medication,' but not 'the latest hypertension drug.'"
Not only must a good search engine recall the relevant documents, it must also be precise enough to ensure that the best 20 to 30 documents appear at the top of the list. "A basic text search ranks relevance only using the words that the searcher types in," Gross says; "it can't evaluate words with a conceptually similar meaning. RetrievalWare uses a semantic-based 'concept search' for both wider recall and greater precision."
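Recall and precision have standard definitions in information retrieval: recall is the share of all relevant documents that the engine returns, and precision is the share of returned documents that are relevant. A small Python sketch makes the distinction concrete; the function name, the document IDs, and the cutoff of three results are invented for illustration.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Compute precision and recall over the top k results of a ranked list."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Invented example: five ranked results, three truly relevant documents.
ranked = [4, 9, 1, 7, 3]
relevant = {4, 1, 8}

p, r = precision_recall_at_k(ranked, relevant, k=3)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

Here two of the top three results are relevant, and two of the three relevant documents were found; document 8 was never retrieved, which is the kind of miss a wider-recall search aims to reduce.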
A concept search relies on a strong set of semantic relationships, that is, an understanding of which words are related to others. Also, since idioms take on extra importance in articles for and about most professions, a concept search requires any number of industry-specific "cartridges" to supplement the search engine's general vocabulary.
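The difference between keyword matching and concept matching can be illustrated with a toy Python sketch built around the blood pressure query quoted earlier. It is a deliberately simplified illustration, not a description of how RetrievalWare works internally: the GENERAL vocabulary, the MEDICAL_CARTRIDGE dictionary, the concept labels, and all function names are hypothetical, and the phrase matching is plain substring search.

```python
# Hypothetical general vocabulary: words that share a concept label.
GENERAL = {"newest": "NEW", "latest": "NEW"}

# Hypothetical industry-specific "cartridge" adding medical terms.
MEDICAL_CARTRIDGE = {
    "high blood pressure": "HYPERTENSION",
    "hypertension": "HYPERTENSION",
    "medication": "DRUG",
    "drug": "DRUG",
}


def to_concepts(text, *vocabularies):
    """Map text to a set of concept labels, matching longest phrases first."""
    merged = {}
    for vocab in vocabularies:
        merged.update(vocab)
    remaining = text.lower()
    concepts = set()
    for phrase in sorted(merged, key=len, reverse=True):
        if phrase in remaining:  # naive substring match, enough for a sketch
            concepts.add(merged[phrase])
            remaining = remaining.replace(phrase, " ")
    return concepts


def keyword_match(query, doc):
    """True only if every query word literally appears in the document."""
    return set(query.lower().split()) <= set(doc.lower().split())


def concept_match(query, doc, *vocabularies):
    """True if every concept found in the query also appears in the document."""
    q = to_concepts(query, *vocabularies)
    return bool(q) and q <= to_concepts(doc, *vocabularies)


query = "newest high blood pressure medication info"
doc = "the latest hypertension drug"

print(keyword_match(query, doc))                              # False
print(concept_match(query, doc, GENERAL, MEDICAL_CARTRIDGE))  # True
```

With only the general vocabulary loaded, the query reduces to the single concept NEW; the medical cartridge is what lets the sketch recognize that "high blood pressure" and "hypertension" name the same thing.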
Convera's solution is also very scalable. "An encyclopedia can require terabytes of data," says Gross, "and RetrievalWare is one of the only search engines that can handle that large a range. It can also search across Encyclopedia Britannica's multiple file types and media types, including text, audio, video, and images."
"Our goal," says Panelas, "was to bring together a number of research sources that used to be spread out in many places. When I researched a term paper in college, I began with the encyclopedia, and then went to the card catalog to track down related articles in magazines or on microfilm, and then perhaps I hit the 'stacks' to browse. The Internet makes it possible to put all the sources that belong together in one place. So," he concludes, "Britannica is helping to realize one of the great dreams of the Internet."