Search in Any Language



BEST PRACTICES SERIES

Note: In print, this appeared as part of our Special Search Focus in which we took an extensive look at the trends in search today, including the sudden flurry of desktop search tool announcements and their viability in the marketplace, the state of multilingual and local search, and a round-up and discussion of a variety of multimedia search tools. Links to other articles are below.


"Search," as Steve Cohen, EVP and VP of products at Basis Technology, explains, "is made up of two stages: indexing and retrieval." Monolingual search is relatively straightforward, but things get much more complex when you start offering search in more than one language. Because the Web is increasingly multilingual, the need for robust search in a wide variety of languages is growing. Accordingly, the major search engines Google, Yahoo!, and MSN Search all offer multilingual search to varying degrees. Google is the polyglot of the group, supporting more than 100 languages. MSN Search offers the fewest, but this may be due to its relative newness to the search game.
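To make Cohen's two stages concrete, here is a minimal sketch in Python, assuming a toy in-memory collection and naive whitespace tokenization. The function names `build_index` and `retrieve` and the sample documents are illustrative assumptions, not the design of any engine mentioned in this article.

```python
from collections import defaultdict


def build_index(docs):
    """Indexing stage: map each lowercased token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def retrieve(index, query):
    """Retrieval stage: return the ids of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents in two languages.
docs = {
    1: "multilingual search on the web",
    2: "desktop search tools",
    3: "recherche multilingue sur le web",
}
index = build_index(docs)
print(retrieve(index, "web"))
```

Note that whitespace tokenization only works for languages that separate words with spaces, which is exactly where multilingual search starts to get complicated.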

All three use Basis Technology's Rosette Linguistics Platform, a complex program that is, according to Cohen, "designed to be used as a library for search engines and to enhance search." The Rosette Linguistics Platform is especially good at solving the problem presented by the lack of spaces between words in Asian languages. Because such text has no explicit word boundaries, all of it, not just certain words, must be indexed. Google, Yahoo!, and MSN Search all offer search in at least one Asian language.
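The article does not describe how Rosette segments such text. As a rough illustration of the underlying problem only, the sketch below uses overlapping character bigrams, a common fallback when no word segmenter is available. The function names and the sample sentence are assumptions made for this example.

```python
def char_bigrams(text):
    """Return overlapping two-character tokens from text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def contains_phrase(document, query):
    """True if every bigram of the query also occurs among the document's bigrams."""
    query_grams = char_bigrams(query)
    if not query_grams:
        return False
    doc_grams = set(char_bigrams(document))
    return all(gram in doc_grams for gram in query_grams)


# Hypothetical sample: "I live in Tokyo Metropolis."
document = "東京都に住んでいます"
print(char_bigrams("東京都"))
print(contains_phrase(document, "東京"))  # query: Tokyo
print(contains_phrase(document, "京都"))  # query: Kyoto
```

The second query matches even though the sentence is about Tokyo, not Kyoto, because the characters for "Kyoto" happen to straddle the boundary inside "Tokyo Metropolis." False hits of this kind are one reason engines turn to linguistic analysis rather than relying on character n-grams alone.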

Multilingual Web search starts with the simple interface we are all familiar with: the generic search box. You type in a word in any supported language and wait for the results to appear. But, as anyone who uses these tools extensively knows, there are ways to refine your search; the same holds true for multilingual search. Depending on the engine, language options are tucked away under different headings that come to light after a little exploratory clicking: Exalead (advanced search), MSN Search (search settings), Yahoo! (advanced), and Google (language tools) all let you limit search results by language and by country. This proves handy if you happen to be searching in a language that is spoken in a number of countries or if you're interested in finding sites hosted in a specific country but written in a non-dominant language.

Multilingual search isn't, however, limited to the major Web search engines. Recommind and Endeca are two companies that offer innovative multilingual search solutions to their clients for both internal and external use. Recommind's Mindserver Platform uses proprietary PLSA (probabilistic latent semantic analysis) technology "to enable the highest-accuracy search and categorization available on the market," according to CEO Richard Tennant. "PLSA is also the technology that enables our system's language, and importantly, our domain-independence."

Mindserver supports both multilingual and cross-lingual search because "it develops concept models that view each word as a token, without caring whether the word is an English or a French word," Tennant explains. What is important is the relation between the keyword and the document, not the language. The system understands, so to speak, that one word can have more than one meaning no matter what language the searcher is using. Although cross-lingual search does require that both languages queried be present in some of the documents, this aspect of Mindserver is particularly helpful for academic organizations or companies that use specific technical vocabularies.

Endeca supports more than 200 languages and can be used just as easily for monolingual or multilingual search. Craig VerColen, the company's public relations manager, says that "Endeca uses a meta-relational index that both looks for instances and proximity of keywords and leverages relationships between information based on shared characteristics. This allows for Guided Navigation, a next-gen capability that allows users to hone searches and explore related content by displaying logical and valid next steps or follow-on questions in the form of navigational links." Because the options presented by Guided Navigation are based on the data set being searched, language takes a back seat; search refinement is simply a matter of clicking on the preferred option. The final query results should be pertinent and in the searcher's chosen language.
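Endeca's meta-relational index is proprietary and not detailed here. The general idea of deriving next-step options from the data set itself can be sketched as simple faceted filtering. Everything in this Python example, including the `refinements` function, the record fields, and the sample records, is a hypothetical illustration rather than Endeca's implementation.

```python
from collections import Counter


def refinements(records, selected):
    """Return records matching the selected facet values, plus counts of the remaining options."""
    matches = [
        r for r in records
        if all(r.get(key) == value for key, value in selected.items())
    ]
    options = {}
    for record in matches:
        for key, value in record.items():
            # Skip the display field and any facet the user has already chosen.
            if key == "title" or key in selected:
                continue
            options.setdefault(key, Counter())[value] += 1
    return matches, options


# Hypothetical records; "language" and "topic" act as facets.
records = [
    {"title": "Desktop search roundup", "language": "en", "topic": "desktop"},
    {"title": "Recherche locale", "language": "fr", "topic": "local"},
    {"title": "Local search guide", "language": "en", "topic": "local"},
    {"title": "Outils multimédias", "language": "fr", "topic": "multimedia"},
]

matches, options = refinements(records, {"language": "fr"})
print([r["title"] for r in matches])
print(options)
```

Because the options are computed only from records that still match, every link offered to the user leads to at least one result, which is the sense in which the next steps are "valid."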

Endeca does not currently support cross-lingual search. Like Cohen, VerColen sees only limited applications for this option at the moment, and market demand makes this type of search less of a priority. VerColen feels that because of the lack of multilingual searchers and strong translation options, adoption of this type of search will continue to lag behind monolingual and multilingual options. Cohen, on the other hand, believes that by coupling a cross-lingual search engine with a machine translation program, you could then "use the system to discover the documents that are worthy of translation." In any case, as the topography of the Web and its markets evolves, the need for multilingual and cross-lingual search will grow accordingly.


Related articles, which ran as part of the May 2005 Special Search Focus

Search Tools Converge on the Desktop, Ron Miller
Searching for Multimedia Tools, Paula J. Hane
Local Search Brings Results Home, Ron Miller