Losing What Counts: The Swamping Phenomenon
By Walt Crawford - June 2004 Issue, Posted Jun 23, 2004

Remember when Amazon first introduced "Search in the Book," and some people looking for specific books suddenly found the process much more difficult? That was swamping. It was entirely predictable. It also might have been avoidable. Improved performance of late suggests that it was at least partially curable, at least in terms of which books appear first.


You should be aware of swamping, since it can affect almost any search process. When swamping occurs, the stuff that counts can disappear under a flood of other vaguely similar stuff. You're no longer looking for a needle in a haystack; you're looking for a dried stalk of fescue in a haystack.

The Amazon Case
For Amazon, the numbers are simple enough. Bibliographic records (catalog records) for books and similar items average 30 to 35 significant words per title across the author, title, and subject entries. That average is based on 1986-1988 studies using a sample of 600,000 contemporary cataloging records; there's little reason to believe the numbers have changed much. Bibliographic data is notoriously lean (not a lot per record) but deep (well-tagged and applied using consistent rules).

Based on the number of pages and books in the Search in the Book database and the typical number of words per book page (around 300 to 350), it appears that the average book in the database has 70,000 words. In other words, a typical book has at least 2,000 times as many words in its full text as it does in the significant bibliographic fields.

Thus, if Search in the Book represented 10% of the books available through Amazon, words coming from pages would appear 200 times as often as words coming from authors, titles, and subjects. Full text swamps the bibliographic records. Even if full text were available for only 1% of the books, it would swamp text from the titles, authors, and subjects by a 20 to 1 ratio.
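The arithmetic behind those ratios can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name and parameters are made up for the sketch, and the defaults (70,000 words per book, 35 significant words per record) are the rough averages quoted above, not measured values.

```python
def page_to_record_word_ratio(full_text_share, words_per_book=70_000, words_per_record=35):
    """Estimate how many full-text words exist per bibliographic word.

    full_text_share is the fraction of titles (0 to 1) whose pages are indexed.
    Every title is assumed to have one bibliographic record.
    """
    full_text_words_per_title = full_text_share * words_per_book
    return full_text_words_per_title / words_per_record


# The two cases discussed above: 10% and 1% of titles with searchable pages.
for share in (0.10, 0.01):
    ratio = page_to_record_word_ratio(share)
    print(f"{share:.0%} of titles with full text: about {ratio:.0f} to 1")
```

Using 30 words per record instead of 35 pushes both ratios somewhat higher, so the figures in the text are, if anything, conservative.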

If you're looking for obscure terms, all that extra text is great. If you know what you're looking for, or if you only know a few title or author words that are within the common English vocabulary, you're in trouble.
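A toy example may make the effect concrete. The three records below are invented for illustration (they are not real books, and this is not how Amazon or any catalog actually implements search); the sketch simply contrasts a title-only search with a search across every field, including full text.

```python
# Invented toy records; real catalogs are vastly larger.
BOOKS = [
    {"title": "Quiet Harbor", "author": "A. Writer",
     "full_text": "the boats returned at dusk"},
    {"title": "Coastal Walks", "author": "B. Rambler",
     "full_text": "a quiet path leads down to the harbor"},
    {"title": "Sea Stories", "author": "C. Teller",
     "full_text": "the harbor was quiet before the storm"},
]


def words(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def search(query, fields):
    """Return titles of books where every query word occurs in the given fields."""
    wanted = words(query)
    hits = []
    for book in BOOKS:
        haystack = set()
        for field in fields:
            haystack |= words(book[field])
        if wanted <= haystack:
            hits.append(book["title"])
    return hits


fielded = search("quiet harbor", fields=["title"])
everything = search("quiet harbor", fields=["title", "author", "full_text"])
print("title only:", fielded)
print("all fields:", everything)
```

The title-only search returns just the book the searcher wanted; the all-fields search returns every book that happens to mention both words anywhere, which is swamping in miniature.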

Real Examples
To test-drive book search, I tried some real-world examples: two authors (Reva Basch and Rachel Jones) with books reviewed in the October 2003 issue of Online, and eight books reviewed in the October 2003 American Libraries. How would Amazon, Barnesandnoble.com (B&N), and Google, using the default single search box in all cases, compare with a books database vastly larger than Amazon's or B&N's but offering fielded searches by default? I used the RLG Union Catalog (around 45 million books, "RUC" for brevity) because I can search it for free (I'd guess WorldCat would yield similar results). I did limit Amazon and B&N searches to books.

As an author search, Reva Basch yielded 35 books in the RUC; Rachel Jones turned up five different authors totaling 29 books. At B&N, Basch showed 24 items, Jones 56. Results in Amazon were plausible for Basch: 77, presumably including most of her books and some relevant in-the-book results for a fairly unusual name. Rachel Jones, on the other hand, showed up 9,777 times in Amazon, an essentially useless result. Google? 6,720 for Reva Basch, enough to make Googling someone fairly difficult, but a lot better than the 1.52 million for Rachel Jones.

Four of the eight books checked were about poker and gambling: Double Down by Frederick and Steven Barthelme, Poker Face by Katy Lederer, Poker Nation by Andy Bellin, and Positively Fifth Street by James McManus. All have subtitles, but I believe most users would search on the main title.

What about more specialized books, such as four books on librarianship? Take Lobbying for Libraries by Bernadine E. Abbott-Hoduski, Education for Cataloging edited by Janet Swan Hill, Books in Bloom by Kimberly K. Faurot, and Booktalker's Bible by Chapple Langemack. The RUC yielded one result each (except three for Books in Bloom). B&N was nowhere near as specific, with 24, 546, 1,058, and 116 respectively. Amazon was perfect on Booktalker's Bible with a single result—but with 12,532, 51,383, and 32,415 results, the other three were hard to cope with. (Books in Bloom didn't show up on the first page of results for those words, thanks to books by Harold Bloom.) Google turned up 85,000, 166,000, 1.35 million, and a surprisingly low 226.

I could provide thousands of examples: the RUC has six books with the title World Folks; Amazon shows 55,734, and 333 if you know enough to put quotes around the phrase to achieve an exact match. Balder: 6 books in the RUC, but 805 listings at Amazon—which, remember, has no more than one-tenth as many books in all. Paul Nash as an author resulted in 17 books in the RUC, 9,925 listings in Amazon (162 as a phrase).

All Content Copyright © 1998 - 2010, Online: a Division of Information Today Inc.