Web Content Extraction: A WhizBang! Approach


If you know nothing else about Bob Sherwin, co-founder, president, and CEO of WhizBang! Labs (www.whizbang.com), you've got to recognize his flair for naming companies and products. Not only did he call his data extraction company WhizBang! Labs, but the first product out the door from the Labs was a job-hunting site named FlipDog (www.flipdog.com). The WhizBang! name choice verged on the accidental. "When we were in meetings discussing the concepts that became the company, we had two fancy names and a third, WhizBang!, because we thought it was such incredible, whizbang technology. By the time we actually formed the company, the WhizBang! name was the one that stuck."

The extraction technology that Sherwin considers so whizbang wonderful consists of a unique approach to scouring the Web for current, very specific forms of information. FlipDog, for example, checks company Web sites for hyperlinks to pages that list job opportunities. It then crawls to the deeper page and, using the WhizBang! Extraction Framework, extracts the key elements of the postings, such as job title, name of employer, job category, and job function. Click on a job and you are transferred to the company Web site to view the job description as it appears there.
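To make the FlipDog workflow concrete, here is a minimal sketch of the idea in Python: follow links on a company home page whose anchor text suggests a careers section, then pull a couple of fields from the deeper page. The keyword list and the "first heading is the job title" heuristic are illustrative assumptions, not WhizBang!'s actual, patented extraction rules.

    # Sketch only: crawl from a company home page to a likely job-listings page
    # and extract a few fields. Heuristics here are assumptions for illustration.
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    JOB_LINK_WORDS = ("careers", "jobs", "employment", "openings")  # assumed cues

    def find_job_pages(company_url):
        """Return URLs on a company home page that look like job listings."""
        soup = BeautifulSoup(requests.get(company_url, timeout=10).text, "html.parser")
        return [
            urljoin(company_url, a["href"])
            for a in soup.find_all("a", href=True)
            if any(word in a.get_text().lower() for word in JOB_LINK_WORDS)
        ]

    def extract_posting(job_url):
        """Crude field extraction: treat the first heading as the job title."""
        soup = BeautifulSoup(requests.get(job_url, timeout=10).text, "html.parser")
        heading = soup.find(["h1", "h2"])
        return {
            "url": job_url,
            "job_title": heading.get_text(strip=True) if heading else None,
        }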

To build the original FlipDog database, WhizBang! crawled 10 million sites. The software now automatically crawls some 50,000 Web sites a week for updating purposes. Why only 50,000? Because not all of the original 10 million sites had open positions on them and because the software now knows exactly where to go on pages it's crawled before. Note, however, that WhizBang! doesn't make value judgments. If a job has been on the Web site for six months, it stays in the database, whether or not it's actually been filled, although there is a notation for "Last Updated," so you know how long it's been posted. "There are two reasons why jobs stay on a Web site," comments Sherwin. "One is that no one updated the site and the data is truly out-of-date. The other is that the company has a continuing need for that type of employee. Our software can't tell the difference and I'm not sure a human looking at the page could tell the difference either."

Defining the Crawlspace
The crawling process consists of four steps. First, there's the crawl itself. Then the software classifies, extracts, and compiles information. For each application, the software must be trained. The machine learning employed by WhizBang! Labs is patented and proprietary, but Sherwin gave some hints about how it works. Humans find examples of both positive and negative Web pages, ones that contain the desired information and ones that do not. The software analyzes these pages and, through pattern recognition algorithms that augment Bayesian and nearest neighbor algorithms, teaches itself to recognize other pages that fit the desired profile. Some of these patterns would doubtless be words. If you're training the software to find job descriptions, some obvious words to look for would be job, position, and resume, along with certain generic job names and perhaps a mechanism for submitting a resume electronically.
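The train-from-examples idea can be illustrated with a generic text classifier. The toy scikit-learn sketch below labels a few pages as positive (contains job postings) or negative and fits a simple Bayesian model; it stands in for, and is not, WhizBang!'s patented learning method, and the sample pages are invented.

    # Toy illustration: learn to recognize job-listing pages from labeled examples.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    pages = [
        "Open positions: software engineer. Submit your resume online.",  # positive
        "Careers at Acme. Job openings in sales and engineering.",        # positive
        "About us: our history, mission, and board of directors.",        # negative
        "Press release: Acme announces quarterly earnings.",              # negative
    ]
    labels = [1, 1, 0, 0]  # 1 = contains job postings, 0 = does not

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(pages, labels)

    # Words like "openings" and "resume" push an unseen page toward the positive class.
    print(classifier.predict(["Current openings: apply with your resume"]))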

There are other elements that aren't words. Take a project WhizBang! is working on for Dun & Bradstreet. In addition to crawling corporate Web sites to extract updated information about companies that are listed, or should be listed, in D&B directories—a procedure that is vastly more efficient than the telephone surveys D&B is famous for—D&B wanted to identify those with ecommerce capabilities. What on a Web site would indicate online shopping? How about a shopping cart? A payment mechanism? A product catalog?
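A rule-of-thumb version of that ecommerce check might simply scan a page for the kinds of signals Sherwin lists. The indicator patterns and the two-signal threshold below are assumptions for illustration, not D&B's or WhizBang!'s actual criteria.

    # Sketch: flag pages that show ecommerce signals (cart, payment, catalog).
    import re

    ECOMMERCE_SIGNALS = {
        "shopping_cart": re.compile(r"add to cart|shopping cart|view cart", re.I),
        "payment": re.compile(r"checkout|credit card|visa|mastercard|paypal", re.I),
        "catalog": re.compile(r"product catalog|browse products|sku", re.I),
    }

    def ecommerce_indicators(page_text):
        """Report which ecommerce indicators appear in the page text."""
        hits = {name: bool(pat.search(page_text)) for name, pat in ECOMMERCE_SIGNALS.items()}
        hits["looks_like_ecommerce"] = sum(hits.values()) >= 2  # simple threshold
        return hits

    print(ecommerce_indicators("Browse products, add to cart, then proceed to checkout."))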

Sometimes you could be looking at data extracted by WhizBang! and not even know it. Suppose you want a list of distance education classes. You go to America's Learning Exchange (www.alx.org), with its 6,500 training providers and 350,000 programs, seminars, and courses; select your subject, state, and delivery method; and retrieve a list of relevant educational opportunities. It doesn't say it's powered by WhizBang!, but it is.

A recently announced agreement with LexisNexis will see WhizBang!'s data extraction technology put to work for the Directory of Corporate Affiliations (www.corporateaffiliations.com) and the Advertising Red Books directories (www.redbooks.com). Updates will begin to appear in the LexisNexis data during the first quarter of 2002. For DCA, the software is being trained to recognize changes in corporate affiliations: companies bought and sold, divisions and affiliates restructured or renamed, and any other alterations in corporate family structures. For Red Books, it will find changes in advertising agencies' accounts, personnel, and fields of specialization. In addition to making the data more timely, WhizBang! can suggest new entries in the databases, add industrial coding, and note changes in product lines. Unfortunately, the information won't be date stamped, so you won't know how timely it is. This is interesting because the data as it appears on Dialog (File 513) is date stamped. It would seem trivial for LexisNexis to ask WhizBang! to include this feature, since it already works on FlipDog.

A related component of WhizBang!'s data extraction technology is its ability to transform legacy data into structured XML databases. This is most commonly used within a company to standardize disparate pieces of information. Again, the software is trained to recognize similarities, segment documents into fields, extract the associated data, and either create XML files or load the data into a database. It's another instance of building a dynamic content database by extracting facts from unstructured text.
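The field-to-XML step is straightforward once a record has been segmented. The sketch below assumes a colon-delimited legacy record purely for illustration; the real value of the trained software is in learning where the field boundaries are in far messier text.

    # Sketch: turn a segmented legacy record into an XML element per field.
    import xml.etree.ElementTree as ET

    legacy_record = """Name: Acme Widgets Inc.
    Address: 100 Main St, Pittsburgh, PA
    Phone: 412-555-0100"""

    record = ET.Element("company")
    for line in legacy_record.splitlines():
        field, _, value = line.partition(":")
        child = ET.SubElement(record, field.strip().lower())
        child.text = value.strip()

    print(ET.tostring(record, encoding="unicode"))
    # <company><name>Acme Widgets Inc.</name><address>...</address><phone>...</phone></company>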

What's Ahead
Until this year, WhizBang! has concentrated on customized data extraction solutions, but Sherwin wants to move to more off-the-shelf products. "We're looking at database creation engines. These could be databases of people, corporations, education, ecommerce, or careers. We train the software on the specific data extraction elements required and populate a relational database. We're also planning a sales lead generation product. This would allow salespeople to identify potential customers. It would include names and addresses of contacts and, more important, it would be up to date. If a salesperson finds a particularly good lead, the software would provide a 'more like this' feature. Another product we have in mind is a URL-finding mechanism. This would be a high-volume operation, not a Whois search."

Another Sherwin idea is the custom content creator. "Just think about entering a new market. All of a sudden you need to know everything about scuba diving or gardening. WhizBang! would let you enter the name of the market and the software would already be trained on the types of pages to look for that would be of interest to a business person." There are other applications of WhizBang!'s data extraction technology that Sherwin lauds. "We could tell you anyone who is the customer of your competitor or who is in partnership with a company."

He's particularly excited about the sales lead generator. "Finding leads is the holy grail. It's the food and shelter level of Maslow's hierarchy." He's also talking with other content vendors besides LexisNexis, which he won't identify, about the value-added services WhizBang! can provide to traditional information gatherers, aggregators, and resellers. There's no doubt we'll hear more about WhizBang!'s extraction technology in econtent contexts. The only question is what exotic names Sherwin will apply to the products.