NewspaperARCHIVE: A Case of Broken News

Page 1 of 3


BEST PRACTICES SERIES

Founded in 1999 by parent company Heritage Microfilm, NewspaperARCHIVE is the world's largest online archive of historical and contemporary newspapers. With an archive reaching back to 1700, the service provides a fully searchable, graphical, and textual database of more than 4,500 newspapers from 1,100 cities. Users can search by names, keywords, or dates, giving amateur and professional researchers alike instant access to news reports, obituaries, birth announcements, sports coverage, and other useful content.

www.newspaperarchive.com

BUSINESS CHALLENGE
With more than 100 million pages averaging about 6,000 words each, NewspaperARCHIVE places heavy demands on its database and search systems. The service is also constantly expanding and updating its database, which means that a solution that works today might easily prove inadequate tomorrow. When problems began cropping up with its old search solution, NewspaperARCHIVE realized it needed a platform robust enough to handle the company's vast database of content while still giving users rapid access to search results. The platform also had to let the company easily update its database to reflect new acquisitions and changing agreements with publishers.

VENDOR OF CHOICE
Exalead is a developer of enterprise search solutions and web search technologies based in France. The company was founded in 2000 by two former AltaVista employees whose initial goal was to create a Google-like search engine for Europe. Its focus quickly turned to enterprise search. In addition to its flagship Exalead CloudView search platform, the company offers unified personal search in the form of Exalead Desktop, as well as SaaS enterprise search through Exalead On Demand. In June, Exalead announced that it had been acquired by product lifecycle management developer and existing business partner Dassault Systèmes.

www.exalead.com; www.3ds.com


THE PROBLEM IN-DEPTH
Unlike many enterprise databases that are used primarily for internal record keeping and data management, NewspaperARCHIVE's database serves customers as well as internal users. Since the service is targeted at individual researchers (such as genealogists and history buffs) as well as larger institutions, poor service and slow searches can lead directly to lost business. As NewspaperARCHIVE continued to grow, the company found that Autonomy, its previous search solution, was struggling to keep up with the volume of data and frequent updates that the service required. Derek Fiscus, NewspaperARCHIVE's director of technology, knew he had a real problem on his hands.

"Because we deal with small clients, they expect every document to be searchable. And it got to the point where I would present a document to our last product and it would say it made it searchable, but it really didn't," says Fiscus. "So we had to constantly audit it."

NewspaperARCHIVE's Autonomy-based implementation was also becoming difficult to update, which impacted the service's ability to grow. "Because we just OCR [optical character recognition] the text off microfilms and newspaper pages, sometimes [the result] can be really, really, really poor. So you get a lot of words that really aren't words," says Fiscus. To compensate for this, NewspaperARCHIVE had to create search dictionaries. Unfortunately, that gave rise to yet another problem. "If we added in content and needed to expand our dictionary, not only did we have to fix all these documents that we were presenting (120 million), we also had to recall them all," says Fiscus.

Increasing the size of the search dictionaries solved one problem but created another: reduced search speed. "The previous product couldn't ingest all these words and provide search that was reasonable to what the public expects, which is as fast as Google," explains Fiscus.

NewspaperARCHIVE's final problem was one that will sound familiar to anyone who has managed a major information project: cost. Tasked with handling the company's 1.5 terabytes of searchable data, the company's old platform was gobbling up resources. "It only allowed us to have about 5.7 million documents per server," says Fiscus. "It became an issue of how much rackspace and power you were going to use."

Confronted with all these issues, Fiscus decided that enough was enough. "Taking into consideration what my customers were saying, I made the decision that looking for a new search engine was the best option," he says.
