Multimedia Search Matures . . . But Not Without Growing Pains

Page 1 of 3

The web was once all about text and the occasional picture, but broadband begat bandwidth, and bandwidth spawned a boom in multimedia content; video and music sites like YouTube became wildly popular and were bought and sold for huge sums of money. Yet in spite of all this rich content, the major search engines remain mostly text-based, relying on titles, tags, and metadata when searching for multimedia content.

The big free search engines lack a way to search inside multimedia content, a capability that full-text search has offered for text for a decade. To delve deeper into multimedia, you need to use a specialized search tool from the likes of Blinkx, Nexidia, Podzinger, or TVEyes. As one analyst describes it, multimedia search is still in the crawling stage, but sometime in the not-too-distant future, it’s going to grow up and take off fast.

From Genesis to Revelation
In the beginning, there was text search, and we had some keywords and a title, and we thought it was good. Long ago, text search matured to search every word in a document (full-text search), but according to Suranga Chandratillake, CTO and founder of the video search technology company Blinkx, until recently, search tools also focused exclusively on text. He says, “Search engine technology has always been built around the idea that you are looking for text. In that sense it’s self-descriptive because computers can read text and you know what words are there and therefore you have an idea of what’s relevant. Obviously there is a lot of detail around what is more relevant, but at least you can make pretty good decisions.”

Chris Sherman, executive editor of the site Search Engine Land, who has followed search since its early years, says most multimedia search has advanced only to the text search evolutionary equivalent of 1993 or 1994, when it looked at titles, links, and keywords. According to Sherman, “A lot of these multimedia sites that people call search sites really aren’t. They use things like tags and other types of text information to figure out what’s out there and refer people to multimedia files. They are not ‘full’ search like we understand full-text document search.”

The challenge, Chandratillake says, is moving beyond text search because video and audio may have some text available that could be searched, but that won’t give you the full sense of the contents. “Now, because of the way the web is, there will usually be some associated text. There will be files, metadata, and text on the web page around the video or image and that’s how the majority of multimedia search has occurred in the past.” He points out that if you do a video or image search on Google or AOL, it’s not really searching the video itself or the image itself. He says, “It’s looking at stuff like: Based on the fact that this image seems to be tagged with the words ‘George’ and ‘Bush,’ it’s probably a photo of the president and so on.”
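The metadata-driven approach Chandratillake describes can be reduced to a toy model: the engine never opens the media file at all, it just indexes the tags and title text found around it. The sketch below is purely illustrative; the item fields and function names are invented and do not correspond to any real engine's internals.

```python
# Toy model of metadata-based multimedia "search": only the text around
# the media (title, tags) is indexed, never the media file itself.

def build_index(items):
    """Map each word in an item's title/tags to the media URLs carrying it."""
    index = {}
    for item in items:
        words = item["title"].lower().split() + [t.lower() for t in item["tags"]]
        for word in words:
            index.setdefault(word, set()).add(item["url"])
    return index

def search(index, query):
    """Return URLs whose surrounding metadata contains every query word."""
    sets = [index.get(w.lower(), set()) for w in query.split()]
    return set.intersection(*sets) if sets else set()

items = [
    {"url": "vid1.flv", "title": "President speaks", "tags": ["George", "Bush"]},
    {"url": "img7.jpg", "title": "White House lawn", "tags": ["Bush", "garden"]},
]
index = build_index(items)
```

Note what the model cannot do: a query matches only if someone happened to write the right words next to the file, which is exactly the limitation the specialized engines below try to escape.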

One other key issue, according to Sherman, is the subtle elements of multimedia content, which humans understand intuitively but today’s technology cannot. “When you have streaming content,” says Sherman, “you have a lot of information contained that’s non-textual, like body language or inflection. Is somebody being sarcastic when they say something or are they being straightforward? It’s still very difficult for a computer to understand that.”

Are We There Yet?
We are not at the final destination by a long shot, but multimedia search is beginning to move beyond simple text, even if Sherman thinks we still have a long way to go. He says, “If you look at true multimedia, like video, audio, or other types of streaming content, really nobody is doing search in the sense of, ‘Let’s get in and really understand this content, analyze it, and make it searchable.’ Some companies are doing simple things like speech detection analysis or looking at waveforms in audio and doing similarity analysis, but it’s still basic and doesn’t work consistently well.”

However, some companies are beginning to move forward with interesting approaches. Chandratillake says Blinkx is using speech recognition technology in conjunction with visual analysis of videos to expose what’s inside the video or audio. “Our obsession is how much can you get the computer to actually understand the video itself or, in our case, the audio as well. To that end, we specialize in technology that not only finds the video that’s out there on the web and reads text around it, but watches the video and also listens,” he says.

Gary Price, who runs his own website and is also the director of online information sources at Ask, sees TVEyes and Nexidia as two companies pushing multimedia search forward. “Right now, some of the technology I’ve seen, whether it be what TVEyes or Nexidia is doing, is really what I call the state of the art: being able to work with [multiple] languages and breaking down the text into phonetic sounds, so it’s much more accurate and cost-effective in terms of computing cycles and it’s much faster.” He adds that a phonetic approach is also more effective than pure speech recognition, which tries to resolve a word against a word in a dictionary and is much slower.

Drew Langham, SVP of media at Nexidia, views this as a great advantage. “We take any recorded audio or video source and break it into a purely phonetic index. So what happens is we create this index, depending on the processor you are using, at about three-hundred-and-forty times real time. This means one hour of media gets processed in about twelve seconds and rendered searchable. When someone wants to search for a piece of information, they type a text query. It gets parsed to the phonetic equivalent and matched to the exact point in the audio or video where it was said.”
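Langham's pipeline can be caricatured in a few lines: index the audio as a time-stamped phoneme stream, convert the text query into phonemes, and scan for the matching subsequence. Everything below is a hand-made stand-in; a real phonetic engine derives phonemes from acoustic models, not from the tiny lookup table assumed here.

```python
# Toy model of a phonetic index (illustrative only). The "phoneme stream"
# pretends an acoustic model has already decoded the audio into
# (phoneme, seconds-into-media) pairs; the pronunciation table is invented.

PRONUNCIATIONS = {            # hypothetical grapheme-to-phoneme table
    "data": ["D", "EY", "T", "AH"],
    "search": ["S", "ER", "CH"],
}

phoneme_stream = [
    ("HH", 0.0), ("EH", 0.1), ("L", 0.2), ("OW", 0.3),
    ("D", 4.0), ("EY", 4.1), ("T", 4.2), ("AH", 4.3),
]

def query_to_phonemes(text):
    """Parse a text query into its phonetic equivalent."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes

def find_in_stream(stream, phonemes):
    """Return the timestamp where the phoneme subsequence starts, or None."""
    symbols = [p for p, _ in stream]
    for i in range(len(symbols) - len(phonemes) + 1):
        if symbols[i:i + len(phonemes)] == phonemes:
            return stream[i][1]
    return None
```

The payoff is the return value: not just "this file matches," but the exact offset in the media where the phrase was spoken, which is what lets a phonetic engine jump a listener straight to the moment of the mention.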

TVEyes monitors broadcasts for its subscriber clients using a hybrid of the phonetic and dictionary approach. David Ives, the company’s CEO, says TVEyes’ approach depends on the language being spoken. He admits there are limits to speech recognition technology, but says it gets you further than looking at tags or other text information. “We look at the audio track as the primary means of determining the content and context of what rich media is about. In a consumer-generated video like on YouTube, with lots of background music or a visual stunt, our approach would not be effective, but the vast majority of content has spoken content and we offer an economical way of determining the content,” he says.

Clients report that TVEyes is an effective way to monitor mentions in broadcasts, a task that would be all but impossible without this technology. Ellen Davis, senior director of strategic communications at the National Retail Federation, uses TVEyes to monitor news about any of its members or the organization itself and says it’s particularly useful during the holidays when mentions skyrocket. She manages this by getting a daily digest of all mentions by email. What’s more, she can access broadcasts and email meaningful snippets to her boss. “With this service we can very easily splice a three-minute segment and email it to the CEO so she can watch it on her computer.”

Podzinger takes yet another approach by using voice recognition technology developed for the U.S. government by parent company BBN. Alex Laats, Podzinger’s CEO, says that by using this approach and then categorizing the audio and video files by subject, Podzinger makes it easier to pinpoint the desired file and the desired element within the file. “The basics of what we do is making audio and video searchable using full-text search and we also do natural language processing in order to topic-classify it and to make the content more readily accessible,” Laats says. This means that if you are looking for a mention of Alex Rodriguez, the New York Yankees’ third baseman, you can enter his name in the search box, or you could search for “sports” and “baseball” to find him. Once found, Podzinger not only displays the relevant podcast or video, it shows you exactly where in the broadcast the mention occurs, sparing you the time it would take to listen to the whole thing.
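Podzinger's combination, as Laats describes it, pairs a searchable transcript with topic classification. A minimal sketch of that idea, assuming a recognizer has already produced a transcript with per-word timestamps: full-text search jumps to the exact moment of the mention, while a crude keyword lexicon stands in for the real natural language processing the company uses.

```python
# Illustrative sketch: a speech-recognition transcript with per-word
# timestamps supports both full-text search (jump to the mention) and a
# crude keyword-based topic label (a stand-in for real NLP classification).

transcript = [  # (word, seconds-into-broadcast), as if from a recognizer
    ("alex", 12.0), ("rodriguez", 12.4), ("homered", 13.0),
    ("for", 13.4), ("the", 13.5), ("yankees", 13.7),
]

TOPIC_KEYWORDS = {  # hypothetical topic lexicon
    "baseball": {"yankees", "homered", "inning"},
    "politics": {"senate", "election", "vote"},
}

def find_mention(transcript, phrase):
    """Return the timestamp where the phrase begins, or None."""
    words = [w for w, _ in transcript]
    target = phrase.lower().split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return transcript[i][1]
    return None

def classify(transcript):
    """Label the clip with every topic whose keywords appear in it."""
    words = {w for w, _ in transcript}
    return sorted(t for t, kw in TOPIC_KEYWORDS.items() if words & kw)
```

This mirrors the two routes to Rodriguez the article describes: searching his name lands on the timestamp directly, while browsing by "baseball" finds the clip through its topic label.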
