One Million and Counting



As you walk up Walton Street in Oxford, England, the road bears slightly to the left, and a large 19th-century building comes into view: the headquarters of the Oxford University Press (OUP). OUP is the largest university press in the world, dating its origins from about 1480. In 1983, I arrived at this building carrying a Texas Instruments Silent 700 printing terminal. It printed on thermal paper and had two rubber ears on the top into which a telephone handset could be inserted to link the terminal to the public telephone network. It was a beautiful April morning, and carrying this terminal into a circa 1830 building seemed rather inappropriate.

At that time, I was heading up the initial attempts by Reed Publishing to develop electronic publishing products. Reed owned International Computaprint Corp. (ICC), based in Fort Washington, Pa., which specialized in keyboarding and printing directories. We had been working with IBM and the University of Waterloo, Canada, on the New Oxford English Dictionary (NOED) project, which was to create a digital version of the Oxford English Dictionary (OED). The proof of concept was to digitize one of the supplements to the first edition, starting at the letter S. Once the digitization and indexing had been completed, Hans Nickel, the founder and CEO of ICC, and I were to demonstrate what we had achieved to the NOED project team, led by Tim Benbow and Edmund Weiner. Many members of the team of lexicographers were skeptical of the project’s value, and there was a mixture of expectation and indifference around the table.

The OED seeks to provide not only a definitive definition of a word but also a record of when the word was first used, with examples of subsequent use that may have modified the definition. All these examples were contained on about 4 million slips of paper. We set up a connection from the terminal (at 300 baud) to the computer in Fort Washington. I recall the first question, which came from one of the more skeptical lexicographers, who wanted to know how many words in the OED originated in The Times (London) newspaper. Because all the text had been marked up in Standard Generalized Markup Language (SGML; a forerunner of XML), we could identify the source and provide not only a count but also a printout of the results (albeit a very slow task). There was a short period of silence, and then these distinguished scholars realized the potential of information retrieval. They also recognized that it was not going to put them out of a job but would enable them to improve the value of the product. A host of queries were proposed, and the session only came to an end when we ran out of thermal paper.

The NOED project was an enormous success, not only for the OUP but also for the University of Waterloo, as the project team became the Open Text Corp. IBM used the knowledge gained from the project in the development of its search technology. For me, it was also a day of discovery about the power of search to reveal new relationships between items of information.

But there were some other lessons to be learned. The first of these was the value of metadata structure in searching. Because of the way the individual elements of the entries had been marked up in SGML, searching for words that had first been used by, say, Charles Dickens could be efficiently executed. The second lesson was gained in listening to the members of the project team from IBM and the University of Waterloo as they talked about the importance of computers being able to understand the structure of sentences, work that would lead to the development of semantic search technologies. (If you want to read more about the OUP and the OED, the Wikipedia entries are excellent. Also read James Gleick’s view on the subject at http://around.com/oed.html.)
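To make the first of those lessons concrete, here is a minimal sketch in Python of how markup turns a question such as "which words were first used in a given source?" into a simple structured query. It is an illustration under stated assumptions, not a reconstruction of the NOED system: the element names (entry, headword, quotation, source), the year attribute, and the sample entries are all invented placeholders, and XML stands in for the SGML used on the project.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: the element names and the entries are invented for
# illustration. They are not the NOED project's SGML tag set or real OED data.
SAMPLE = """
<dictionary>
  <entry>
    <headword>sample-a</headword>
    <quotation year="1850"><source>Source X</source></quotation>
    <quotation year="1900"><source>Source Y</source></quotation>
  </entry>
  <entry>
    <headword>sample-b</headword>
    <quotation year="1875"><source>Source Y</source></quotation>
  </entry>
  <entry>
    <headword>sample-c</headword>
    <quotation year="1860"><source>Source X</source></quotation>
    <quotation year="1840"><source>Source Z</source></quotation>
  </entry>
</dictionary>
"""


def first_sources(xml_text):
    """Map each headword to the source of its earliest dated quotation."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("entry"):
        quotations = entry.findall("quotation")
        if not quotations:
            continue  # an entry without quotations has no first use to report
        earliest = min(quotations, key=lambda q: int(q.get("year")))
        result[entry.findtext("headword")] = earliest.findtext("source")
    return result


def headwords_first_used_in(xml_text, source):
    """Return, in alphabetical order, the headwords first recorded in `source`."""
    return sorted(w for w, s in first_sources(xml_text).items() if s == source)


def count_by_first_source(xml_text):
    """Count how many headwords each source contributed as a first use."""
    return Counter(first_sources(xml_text).values())


if __name__ == "__main__":
    print(headwords_first_used_in(SAMPLE, "Source X"))
    print(count_by_first_source(SAMPLE))
```

The point of the sketch is that the query never has to guess which part of an entry names the source. Because the source is tagged as its own element, the count and the listing both fall out of the structure, which is what plain full-text matching on a name could not guarantee.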

Search is all about the meaning of words, and we need to take this into account in developing search technology. There are now more than a million words in the English language. But when I see people searching, I am often intrigued by their lack of knowledge of synonyms, in particular when trying to get the best out of a search application. We’ve been working on this since the early 1960s, and many of the vendors in the EC100 this year have innovative solutions. There is still much to be done by the search industry to make information retrieval success less dependent on the literacy and subject knowledge of the searcher.