Preserving seemingly ephemeral web content is a daunting task, made harder because pages change constantly and appear and disappear with great frequency; simply collecting URLs isn't enough to keep tabs on valuable content. To help make digital preservation possible, Internet Archive, a San Francisco-based nonprofit, has led a charge to effectively capture and store web content.
Since its inception ten years ago, Internet Archive has focused on ensuring the availability and accessibility of internet content by building an internet library that permanently stores digital content for anyone to view at any time. Beyond the content it preserves on its own, last year Internet Archive launched Archive-It, a service for organizations seeking an easier way to archive valuable web content. The project recently released Archive-It 2 in its continued effort to archive the web.
"It's a fallacy that if something is on the web, it will stay there," says Kristine Hanna, director of web archiving services for Internet Archive. "It's not like a piece of paper you put in a file folder and it will be there forever. There's an urgent need for people to understand that the web is who we are. It's our culture and our social fabric, and we don't want to lose any of it."
At present, Internet Archive's complete library contains 65 billion archived web pages, along with books, moving images, and software (about 40,000 items in each category). To archive material, Internet Archive uses a web crawler that scans the web for documents created during a specific time period; the documents are then catalogued and placed on the organization's servers. The content is stored in repositories around the world, in San Francisco, Egypt, Amsterdam, and France.
In mid-2005, Internet Archive launched the beta version of Archive-It, a web-based subscription service that helps "memory institutions" create and archive their own web collections, with two main benefits. First, these institutions are able to preserve the web content they care about. Second, their collections are available for public viewing on the Internet Archive site, allowing the nonprofit to grow its own holdings at the same time. Archive-It 1 officially launched in January, followed by version 1.5 in May and version 2 in late July.
Hanna says Archive-It was designed mainly for institutions (state archives, state libraries, and university libraries) that have a mandate to archive their web content and that lack the resources (staff, budget, and technical capabilities) to do so. "We are collaborating with institutions to save material that normally wouldn't be, that we probably wouldn't save on our own, that they couldn't save on their own," Hanna says. "We're joining forces to make sure that all of this knowledge is not lost."
To begin creating a collection, subscribers select as many as 300 websites associated with a particular topic, and Archive-It can be programmed to crawl those sites as often as desired (daily, weekly, or quarterly). Once a collection is archived, subscribers can search the captured pages, either by text or by URL, and each page looks exactly as it did when it was captured from the web. Searches can be conducted on either the Internet Archive or Archive-It sites, using a variety of criteria, including subject, date, relevance, institution, and collection; advanced options include the ability to search between dates.
Version 2 of Archive-It offers several features not available in previous editions. Subscribers can now conduct test crawls, which show the type of web material that would populate a specific collection before anything is archived permanently. A new metadata search capability allows metadata to be included in the text searches of materials in a collection. Archive-It Pro enables subscribers to cap how many web documents are collected from a given website, and to block collection of materials from specified sites.
The collections created by subscribers cover a wide range of subject matter. The North Carolina state government, for example, has built a collection of web pages from various state boards, commissions, and agencies, while Indiana University set out to archive all of the university's web pages. "They're not sure what's going to be of value later," says Hanna. "They are able to capture everything now." Subscribers can view and download reports on the status of their crawls. Archive-It is sold by annual subscription; the most popular package is priced at $10,000 and allows subscribers to build three collections containing up to 10 million URLs.
Internet Archive plans to expand the reach of the Archive-It service by targeting smaller entities, such as independent researchers, local libraries, and small non-governmental organizations, with a lower-priced version. As Hanna says, "The Internet Archive's universal approach to the dissemination and access of information is embodied in its Archive-It service that anybody or any organization can use."