Future-proof Your Records

A subtle shift is occurring in the way we value and manage our office content: the text documents, spreadsheets, and presentations that constitute 80% of the investment we all make in mainstream office work. Today we face heavy legal pressure to follow mandated schedules that keep documents as long as the law requires, but no longer. On the flip side, practices are emerging to selectively destroy the many documents we need not keep at all. Destruction provides a measure of protection from widely cast subpoena nets.

It is very difficult to manage formerly paper documents as they morph into electronic records. Identifying which records to retain, and then applying the appropriate retention periods to satisfy the alphabet soup of legislation mandating retention (think: SOX, HIPAA, etc.), is no small feat. Some document management systems treat a document's "creation date" as the date you put it into the system, not the date you created it, which shifts every deadline counted from that date. Record management problems multiply when documents reside everywhere from Windows' "My Documents" to file stores, email systems, intranets, and disciplined content management repositories. A complementary requirement is to find quickly what you have retained. After a legitimate legal request for documents, courts are generally unforgiving about assertions that you can't quickly find all documents relevant to the request.
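To see why the "creation date" mismatch matters, here is a minimal Python sketch. The retention schedule, record types, and dates are all invented for illustration and are not legal guidance; the point is only that counting from the ingest date instead of the authoring date moves the destruction date.

```python
from datetime import date

# Hypothetical retention schedule, in years. Real periods come from counsel
# and the applicable regulations, not from this sketch.
RETENTION_YEARS = {"invoice": 7, "memo": 3}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def destroy_after(record_type: str, created: date) -> date:
    """Earliest date a record may be destroyed, counted from `created`."""
    return add_years(created, RETENTION_YEARS[record_type])


authored = date(2004, 2, 29)  # when the author actually wrote the memo
ingested = date(2006, 5, 15)  # when it was loaded into the repository

from_authored = destroy_after("memo", authored)
from_ingested = destroy_after("memo", ingested)
gap_days = (from_ingested - from_authored).days

print(from_authored, from_ingested, gap_days)
```

With these made-up dates, a system that counts from the ingest date would hold the memo more than two years longer than the schedule intends.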

The record management problem is magnified by the records' internal structures. Analysts often call office documents "unstructured," not because they have no structure (if that were so, they would be useless), but because analyzing that structure programmatically is such a challenge. We usually create documents in an undisciplined way, perhaps through inconsistent use of styles (if authors use styles at all). Then we often store those documents in vendor-proprietary formats, such as Microsoft Word's binary .doc format. These internal formats began giving way to XML with Microsoft's WordML, and the trend is mainstream now with OpenOffice, StarOffice 8, and the upcoming Office 2007 suite.

Internal formats usually do little to help you comply with retention rules, with two exceptions: when you need to find all relevant documents quickly, and when you are compelled or choose to keep a document forever. It is much easier to restrict a search to phrases in document headings or table captions, for example, than to cast a broad net for any occurrence of those phrases. Because XML-based documents mark off every document element with its own tag, such targeted searches are easier than they are in proprietary formats.
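A heading-only search of this kind might look like the following Python sketch. The sample markup is an invented, heavily simplified stand-in for a document's content; it borrows the OpenDocument text namespace, where headings are `text:h` elements and paragraphs are `text:p` elements. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument content.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Invented, simplified sample; a real document carries far more markup.
SAMPLE = f"""<doc xmlns:text="{TEXT_NS}">
  <text:h text:outline-level="1">Retention Schedule</text:h>
  <text:p>This memo does not set a retention schedule.</text:p>
  <text:h text:outline-level="1">Contacts</text:h>
  <text:p>Call Millie about the retention schedule.</text:p>
</doc>"""


def headings_matching(xml_text: str, phrase: str) -> list:
    """Return the text of every heading that contains `phrase`."""
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    hits = []
    for heading in root.iter(f"{{{TEXT_NS}}}h"):
        text = "".join(heading.itertext())
        if needle in text.lower():
            hits.append(text)
    return hits


def broad_match_count(xml_text: str, phrase: str) -> int:
    """Count every element whose own text contains `phrase` (the broad net)."""
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    return sum(1 for el in root.iter() if el.text and needle in el.text.lower())


print(headings_matching(SAMPLE, "retention schedule"))
print(broad_match_count(SAMPLE, "retention schedule"))
```

In this sample the broad search matches three elements, while the heading-only search returns the single heading that is actually about the retention schedule.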

Proprietary formats hamper document longevity, while XML extends it. If you don't have the exact version of PowerPoint that Millie used when she created that slide presentation, you will probably have trouble viewing it as it was originally delivered. You can always render the presentation in PDF, but that won't preserve its animations or layers. In other words, you can't be sure you can view that presentation in its original form, never mind re-use it.

Over the last year, it has also become more common to use XSL to render the XML inside office documents, which makes having the right document viewer a more manageable problem. If the XML structures have been peer-reviewed and standardized, as is the case with OpenOffice and StarOffice 8 (built on OpenOffice), you get more enterprise leverage for extracting value from your records than a proprietary XML structure can provide. Microsoft's XML schema (WordML) expresses Word's native format as XML, but neither you nor I got to vote on the features in Word's tables, for example. Formats whose XML structure went through open public review gave us all a chance to weigh in on and improve their features.

If longevity and faster searching of document records were the only benefits of the shift to XML, they would be enough to encourage its use. Yet XML also provides other advantages. XML by nature promotes modularity, and these new document formats package content separately from graphics and metadata. This separation can automatically provide a measure of protection, for example, from corrupt graphics. If you open a document containing an invalid graphic, the word processor doesn't crash; it simply can't display the graphic.
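One way to picture that separation, as a rough Python sketch: the package below is built in memory and only loosely modeled on the OpenDocument layout, in which a zip archive holds parts such as content.xml, meta.xml, and a Pictures/ folder. The element names inside the parts and both function names are invented for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_sample_package() -> bytes:
    """Build a toy zip package: text, metadata, and a deliberately corrupt image."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("content.xml", "<doc><p>Quarterly results</p></doc>")
        pkg.writestr("meta.xml", "<meta><title>Q3 report</title></meta>")
        pkg.writestr("Pictures/chart.png", b"not a real image")
    return buf.getvalue()


def read_text_parts(package: bytes) -> dict:
    """Read text and metadata without ever decoding the picture parts."""
    with zipfile.ZipFile(io.BytesIO(package)) as pkg:
        body = ET.fromstring(pkg.read("content.xml"))
        meta = ET.fromstring(pkg.read("meta.xml"))
        pictures = [n for n in pkg.namelist() if n.startswith("Pictures/")]
    return {
        "title": meta.findtext("title"),
        "paragraphs": [p.text for p in body.iter("p")],
        "pictures": pictures,
    }


parts = read_text_parts(build_sample_package())
print(parts)
```

Because the text and metadata live in their own parts, the bogus image bytes are listed but never parsed, so they cannot interfere with reading the rest of the document.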

And what if you want to do more than merely satisfy record-retention requirements? What if you'd like to extract additional value from those files? This too is getting easier with XQuery, which lets you search and transform collections of documents with standard, database-like queries. XML truly does enhance the value of office documents, and protects their future.