Keep and Use Your Content Forever


Most of us, even we pack rats, must deal with the practical limits of magnetic and physical storage space. Like it or not, we have to be selective about what we keep and what we delete. On the corporate side, the threat of litigation can be an incentive to toss content as soon as possible so that it cannot become the target of discovery in a lawsuit, yet there are also requirements that some things be retained.

At the other extreme, there are those who want to keep all authoritative content for the foreseeable future and (here's the tricky part) ensure that others can view and use it. If you want to preserve family pictures or documents for future generations (setting aside the question of selecting storage media), which formats are most likely to ensure that those generations can view or listen to what you preserve? PDF, JPEG, MPEG, XHTML, or XML (and which schema)?

Now suppose you have millions of documents to preserve, hundreds more are pouring in every day, you must manage 600 to 1,500 print-related projects daily, and you are required to deliver both print and authenticated electronic copies for future use. This torrent of documents comes to you in virtually any format. With the increasing use of XML, your plans must support XML-based office documents from OpenOffice, StarOffice, and, soon, MS Office 2007. While you're at it, you might as well take advantage of XML and not just deliver documents but repurpose them, providing tailored, relevant renditions to your customers. Lastly, you had better protect this content from natural and other disasters. Does your brain hurt yet?

There is a Federal agency that has had to face this very problem and, surprisingly, is addressing it with remarkable speed. The 145-year-old U.S. Government Printing Office is transforming itself from one of the world's largest printing services into one that can preserve and deliver millions of electronic documents for the foreseeable future, and it is doing so with minimal funding. Yes, this is a story that will actually make you feel good about government at work. It also provides valuable lessons, though the scale of the GPO's endeavor likely dwarfs yours and mine.

The GPO's project, called Future Digital System (or FDsys), began with strategic planning in July 2004 and produced a strategic vision for the twenty-first century. This vision lays out a plan to provide printing and electronic delivery services to the three branches of the federal government, to 1,250 Federal Depository libraries (which provide protection from disastrous losses), and to the general public. FDsys is divided into six phases, is currently midway through phase 4 (implementation planning), and expects a full system implementation in October 2007.

I have been communicating with Mike Wash, the GPO's Chief Technical Officer, over the past few months to get a better handle on FDsys. My questions ranged from typical IT concerns to more general issues. The short timeframe for completing the project suggests minimizing the number of vendor solutions and emphasizing integration, that is, applications built to work together out of the box. However, FDsys must deliver its services for the foreseeable future, which suggests picking open-source and best-of-breed solutions. What was Wash's approach? "GPO is taking a best of breed approach to acquiring and integrating the technology components that will comprise FDsys…FDsys is a standards-based system," he says.

Then there is the issue of document formats and how a GPO customer should supply content so that FDsys can import, use, and transform it. GPO currently receives submissions in a surprisingly small number of formats, but that will change as content technologies evolve. GPO's answer is a standards-based packaging format for submissions. "FDsys architecture is based on the Open Archival Information System (OAIS) model, which develops the concept of submission, archival, and dissemination packages," Wash says. The OAIS submission information package (SIP) concept divides a submission into three categories of information. First there is the content information itself: the digital objects and the way they are represented. The other two categories are information about how the submission is packaged and a descriptive segment.
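For readers who think in code, the three-part split described above can be pictured as a simple data structure. What follows is only a minimal sketch in Python, not GPO's schema or any official OAIS encoding; the names DigitalObject, SubmissionPackage, and missing_parts, along with the sample field values, are assumptions invented here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DigitalObject:
    """One submitted file plus a label for how it is represented (hypothetical)."""
    filename: str
    representation: str  # a format label, for example "XML" or "PDF"


@dataclass
class SubmissionPackage:
    """Hypothetical model of the three categories of information in a submission."""
    content: List[DigitalObject]   # the content information itself
    packaging: Dict[str, str]      # how the submission is packaged
    description: Dict[str, str]    # the descriptive segment

    def missing_parts(self) -> List[str]:
        """Return the names of any categories that were left empty."""
        parts = {
            "content": self.content,
            "packaging": self.packaging,
            "description": self.description,
        }
        return [name for name, value in parts.items() if not value]


# Illustrative values only; none of these come from FDsys.
sip = SubmissionPackage(
    content=[DigitalObject("report.xml", "XML")],
    packaging={"layout": "single directory"},
    description={"title": "Sample report"},
)
empty = SubmissionPackage(content=[], packaging={}, description={})
```

The point of keeping the three categories separate is that an intake step could reject or flag a submission before any format conversion begins, for example by checking that missing_parts() comes back empty.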

The GPO will share its experiences with the rest of us, too. You can visit my blog and the FDsys blog for all the details. As Wash says, "The FDsys team has committed to full transparency of our system development process. We would be happy to share lessons learned with other organizations at the appropriate time." And there's a lot to learn.