Word to XML

Jan 09, 2007

The holy grail of structured content is a tool that lets authors write their comfortable and familiar unstructured content, but which then auto-magically converts their text to structured XML when it is saved. But skeptics cite the old maxim "garbage in, garbage out" here. If every document is arbitrarily different, they say, there is no way it can be exported to useful XML.

Another magical creator of structure resides in the document conversion service that promises to extract the structure from existing legacy documents, even some written long before the idea of structure became an integral part of standards like SGML and XML. Since document conversion services have been around for many years, they seem to know how to cast the right magical spells. What is the sorcerer's secret of their business success?

It turns out that the trick is the same thing that enables some authoring tools to convert Word documents to useful XML--a little bit of implicit structure.

Structured Migration
The legacy document converters, who are most often trying to migrate old content into a new content management system (CMS), get lucky when the content was prepared using consistent templates, so every document has the same combination of headings and paragraphs. It's not uncommon for companies to have thousands of text files all with the same basic information, whether based on a real template or just always filled out in the same way because of standard company policies and business procedures.

A document conversion service analyzes the content to make sense of these repeated structures, then writes scripts to parse the original documents. Starting with small samples from a large collection, it refines the scripts until they convert documents with a high success rate, say 95%, and leaves the rest for hand correction. For example, specialists like InTech Solutions take collections of FrameMaker documents and develop scripts that convert them to structured FrameMaker and XML.
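To make the idea concrete, here is a minimal sketch of such a conversion script in Python. It assumes a hypothetical template in which every document consists of "Heading: text" blocks separated by blank lines; the heading names, the element names, and the functions convert and convert_batch are all invented for illustration and are not taken from any vendor's tooling. Documents that do not match the template are set aside for hand correction rather than guessed at.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template: every document uses these headings, in this order.
EXPECTED_HEADINGS = ["Title", "Summary", "Procedure"]


def convert(text):
    """Return an XML element for a document that follows the template,
    or None if it deviates and should be corrected by hand."""
    sections = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        heading, separator, body = block.partition(":")
        if not separator:
            return None  # no "Heading:" prefix, so no implicit structure
        # Collapse line breaks and runs of spaces inside the body text.
        sections.append((heading.strip(), " ".join(body.split())))

    if [heading for heading, _ in sections] != EXPECTED_HEADINGS:
        return None  # headings missing, extra, or out of order

    root = ET.Element("document")
    for heading, body in sections:
        ET.SubElement(root, heading.lower()).text = body
    return root


def convert_batch(docs):
    """Split a {name: text} collection into converted XML and rejects."""
    converted, rejected = [], []
    for name, text in docs.items():
        element = convert(text)
        if element is None:
            rejected.append(name)
        else:
            converted.append((name, ET.tostring(element, encoding="unicode")))
    return converted, rejected
```

Running convert_batch over a sample and looking at the size of the rejected list is one simple way to measure the success rate before refining the rules and trying again.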

Any organization that wants to manage its content more easily in the future should be looking at ways to add that structure explicitly. In particular, organizations that want to repurpose or reuse elements of their content in different places, possibly publishing to many channels such as the web, PDFs, and mobile devices, should be helping authors who must write in Microsoft Word to create XML.

Creating Structure
An easy way to begin, one that content contributors may be comfortable with, is consistent use of the same Word templates for the same document type. Of course, they would then still be free to deviate from the template, which will cause problems down the road.

So there is a new class of tools that look exactly like Microsoft Word, but which can force your authors to create perfectly structured documents. By perfectly structured, we mean that when exported to XML, the document can be validated against a DTD (document type definition) or an XML Schema definition (XSD).
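As a rough illustration of what validation checks, the Python sketch below tests each element's children against a content model, much as a DTD declaration like <!ELEMENT procedure (title, step+)> would. It is a simplified stand-in, not a real DTD or XSD validator: the element names, the CONTENT_MODELS table, and the validate function are hypothetical, and attributes and text content are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over the
# sequence of child element names (each name followed by one space).
CONTENT_MODELS = {
    "procedure": r"title (step )+",  # one title, then one or more steps
    "title": r"",                    # no child elements allowed
    "step": r"",                     # no child elements allowed
}


def validate(xml_text):
    """Return a list of structural errors; an empty list means the
    document conforms to the content models above."""
    errors = []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            errors.append(f"undeclared element <{element.tag}>")
            continue
        # Build the child sequence, e.g. "title step step ".
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            found = children.strip() or "none"
            errors.append(f"<{element.tag}> has invalid children: {found}")
    return errors
```

In production a real validating parser would do this job against the actual DTD or XSD; the sketch only shows why a document with a missing title or a stray element fails.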

Microsoft has provided an API that allows developers to customize Word. They can selectively disable Word's menus to allow only those options that are valid at a given point in the document (context-sensitive controls).
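The decision behind such context-sensitive controls can be sketched in a few lines of Python. This is only the lookup logic, under assumed rules; it does not use or represent Word's actual API, and the ALLOWED_NEXT table, the element names, and the enabled_options function are hypothetical.

```python
# Hypothetical rules: for each parent element, which child elements may
# follow the most recently inserted child (None = nothing inserted yet).
ALLOWED_NEXT = {
    "procedure": {None: ["title"], "title": ["step"], "step": ["step"]},
    "step": {None: ["para"], "para": ["para", "note"], "note": ["para"]},
}


def enabled_options(parent, existing_children):
    """Return the element types an author may insert next at this point.
    Anything not returned would be greyed out in the editor's menus."""
    previous = existing_children[-1] if existing_children else None
    return ALLOWED_NEXT.get(parent, {}).get(previous, [])
```

For example, enabled_options("procedure", []) offers only a title, so an author cannot start a procedure with a step.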

In last June's EContent we reviewed many XML editors that create valid XML documents. Their WYSIWYG editors all try to look like Word. But for many writers, they are simply not the real thing. If your writers won't accept substitutes, take a look at the tools that convert standard Word to XML. And look also at the more expensive solutions that have customized Word so it produces structured content.

Accept No Substitute
I have identified a dozen or so Word to XML tools at www.cmsreview.com/XML/WordXDirectory.html. Since my ECXtra columns are online, we can provide you with links to the more interesting tools, like Information Mapping's ContentMapper and In.vision Research's Xpress Author.

You should also be looking soon at the new 2007 Microsoft Word, which claims better XML conversion than ever. But we found that Word continues to generate bloated XML (as it does HTML), probably because it is adding tags to cover every possible document element and style. The specialized tools create much cleaner XML.

For those of you wanting to convert large numbers of legacy documents to XML, we also surveyed the major content migration tool vendors at www.cmsreview.com/Tools/Migration. One of these vendors, Vamosa, offers a free download of its Content Analysis tool, limited to use with one thousand URLs (which sounds like a lot to me). Besides analyzing your content to discover any hidden structure, the Content Migrator part will help you to get your content ready for import into a new CMS.

So whether you want to discover the implicit structure in your old documents or get your Word authors to add it eXplicitly in their future content creation, sophisticated XML (eXtensible Markup Language) tools are out there today to help you.