In all the discussion of the Web, we tend to get lost in the buzz and the technical detail. There's plenty to distract us: the sheer numbers and the rapid growth, the boom-and-bust economics of dot-com companies, the flashy new technologies. It's easy to forget that the Web, on some level, isn't about solving whole new problems of computing. It is about solving some longstanding ones in an open and scalable fashion, where the sources of the data are readily available and not centralized.
Take the problem of search. There are so many technologies, and so many approaches to the problem, that we may forget at times that the ultimate goal of search technology is simple: providing users with quick access to meaningful information. In the recent history of content management, much has been made of XML and the related standards, streaming media and Content Distribution Networks (CDNs), and the steady increases in performance through broadband options such as DSL and cable modems. Yet one of the biggest challenges the Web faces is one of its oldest, and indeed is a problem that predates the Web itself. How do average users find the information they need amidst a flood of irrelevant matter? And how do they do this quickly, easily, and consistently?
According to some, the path to improved information retrieval on the Web lies in intelligently applied taxonomies. In this view, content needs to be more accurately identified by category in such a way that search engines and other navigational aids can be better tuned to help the user. As content moves increasingly to the Web, these data sources need to benefit from technologies and techniques that allow people to view, navigate, and search data by broadly understood categories.
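As a rough illustration of that idea, the sketch below shows how category labels might let a search narrow an ambiguous query. It is a minimal toy, not any particular search engine's method: the `Document` class, the `search` function, the sample documents, and the category names are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A hypothetical content item tagged with taxonomy categories."""
    title: str
    text: str
    categories: set[str] = field(default_factory=set)


def search(docs, query, category=None):
    """Return titles of documents whose text contains every query word.

    If a category is given, only documents tagged with it are considered.
    Matching is plain case-insensitive substring matching (an assumption
    made to keep the sketch short).
    """
    words = query.lower().split()
    results = []
    for doc in docs:
        # Skip documents outside the requested category, if one was given.
        if category is not None and category not in doc.categories:
            continue
        haystack = doc.text.lower()
        if all(word in haystack for word in words):
            results.append(doc.title)
    return results


# Invented sample data: one ambiguous word, three unrelated subjects.
DOCS = [
    Document("Jaguar XK review", "The new jaguar is a fast car.", {"Automobiles"}),
    Document("Big cats of the Amazon", "The jaguar hunts at night.", {"Wildlife"}),
    Document("Jacksonville season preview", "The Jaguar offense struggled.", {"Sports"}),
]

if __name__ == "__main__":
    print(search(DOCS, "jaguar"))                       # all three titles
    print(search(DOCS, "jaguar", category="Wildlife"))  # only the wildlife article
```

Without a category, the query matches every document that mentions the word; with one, the user sees only the material in the subject area they care about. A real system would need a far richer vocabulary and matching strategy than this sketch assumes.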
Happily, categorization technologies seem to have matured to the point where they can be useful to more and more publishers. Increasingly, Web publishers are investing in both the technologies to categorize content and the labor needed to implement them. And looming on the horizon are "topic maps," an intriguing approach to tagging data for categories, especially for collections of data as opposed to single documents.
This Does Not Have To Be Expensive
The process of categorizing data need not be either expensive or overly complex. In recent correspondence on the email list xml-dev, Carol Ellerbeck, a taxonomy expert with Harvard Business School's Baker Library and formerly of Lycos, made this very point. Responding to a writer who suggested that one needed to be "king of the world" and have "an unlimited budget" to create effective taxonomies, Ellerbeck wrote, "If you 'were king of the world'…you would not need 'an unlimited budget'...just a modest one, to have experts build your taxonomy/domain vocabularies. I say this as a taxonomist who has been in the vocabulary trenches with electronic information for years. Automation is wonderful (and I would say, even essential), but start with not just humans (albeit smart humans), start with humans who have some expertise, and you will accomplish your goal faster, with fewer people, more efficiently, and have a more solid foundation to build on."
The same point was made recently by none other than the father of the Web, Tim Berners-Lee. This past December, Berners-Lee, director of the World Wide Web Consortium (W3C), spoke in the Knowledge Management track at the XML 2000 conference in Washington, DC. In a wide-ranging and fast-moving presentation, he outlined the current Web infrastructure, current standardization efforts at the W3C, and the efforts and improvements needed to arrive at a "Semantic Web." For Berners-Lee, something has semantics when it "can be processed and understood" by a computer, in the way a bill can be processed by a software package such as Quicken. Getting to that level of semantics in a broad, open, and public infrastructure such as the Web is easier said than done, of course. For Berners-Lee, it involves the entire existing infrastructure, including XML, namespaces, and XML Schemas, plus a suite of new things. These new things include agreed-upon means of sharing and distributing application logic, and new layers that provide both proof (who you are, who the other party is) and trust. Together, these will provide a complete Semantic Web.
A great deal of Berners-Lee's discussion had to do with the theoretical difficulties of shared application logic and other esoteric detail. But at one point in the talk, Berners-Lee matter-of-factly stated that the question of taxonomies was a simple one and relatively easy and inexpensive to solve. And while he stopped short of endorsing topic maps or any other particular approach, he made clear that some such approach was necessary and should be used to unify disparate efforts now underway.