SAMAR Project: Mapping Arabic Language to Aid News Searchers

The French government and international news agency Agence France-Presse (AFP) have teamed up to spearhead a consortium of digital media content producers and publishers aiming to find a high-quality semantic search strategy for AFP's Arabic-language news, audio, and video multimedia content, a solution that creators hope could serve as a model platform for Arabic news organizations around the globe.

The recently unveiled SAMAR Project is a government-funded multimedia content enrichment project from Cap Digital, a French business consortium made up of 500 digital content industry companies. The organization said its latest project will address some of the thorniest issues facing AFP as it attempts to expand access and search capabilities in its multilingual news portal, particularly when it comes to Arabic-language content.

Attempting to search Arabic content presents some unique challenges to content producers, according to Cap Digital. Arabic's complex linguistic patterns, and the scarcity of advanced technologies suited to them, make Arabic-language content (particularly multimedia items such as video and audio files) difficult to track down or to connect with related items in Arabic or other languages, says Charles Huot, COO and co-founder of Cap Digital member TEMIS, which makes text analytics and mining software designed to accommodate a full range of languages.

"Arabic language structure is extremely complex, and current technologies don't allow for optimal semantic tagging," Huot says. "It's also difficult to connect Arabic content to information in other languages. A semantic analysis became necessary to index Arabic-language content and make it accessible and findable through online search."

AFP is among many international news outlets eyeing opportunities to expand its news operations into Arabic-speaking countries, especially in North Africa, where information industries are still nascent. Within North Africa, online content production is low, with news agencies such as AFP accounting for nearly 40% of all Arabic-language content authored, according to Cap Digital. But the web is creating more demand for Arabic-language news from around the world.

As the SAMAR Project takes on AFP's extensive content collection, members say they will focus on three particularly thorny problems: how to transcribe Arabic vowels in search terms, how to translate speech to text in a way that accounts for the nuances of a multitude of Arabic dialects, and how to cross-match and relate proper noun search terms in French and Arabic, such as the names of people, places, and companies.
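
To make the first of these problems concrete: Arabic short vowels are usually written as optional marks above or below the consonants, so the same word can appear with or without them in a headline and in a search box. The Python sketch below illustrates one common, simple workaround, which is to strip those marks from both the query and the text before comparing them. It is an illustration only, not a description of how SAMAR or Luxid handles vowels; the function names `strip_vowel_marks` and `query_matches` are hypothetical, and real Arabic search normally needs far more than this (morphology, dialects, spelling variants).

```python
import re

# Arabic vowel and related marks in the Unicode Arabic block:
# U+064B..U+0652 covers tanwin, fatha, damma, kasra, shadda, and sukun;
# U+0670 is the superscript alef. Assumption: these are the only marks
# we want to ignore for matching purposes.
VOWEL_MARKS = re.compile("[\u064B-\u0652\u0670]")


def strip_vowel_marks(text: str) -> str:
    """Return text with Arabic vowel marks removed (hypothetical helper)."""
    return VOWEL_MARKS.sub("", text)


def query_matches(query: str, document: str) -> bool:
    """Check whether query occurs in document, ignoring vowel marks."""
    return strip_vowel_marks(query) in strip_vowel_marks(document)


if __name__ == "__main__":
    # The name "Muhammad" written with vowel marks and without them.
    vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
    plain = "\u0645\u062D\u0645\u062F"
    print(query_matches(vocalized, plain))
```

Stripping marks trades precision for recall: words that differ only in their vowels become indistinguishable, which is one reason the problem is harder than this sketch suggests.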

Among the organizations taking part in the SAMAR Project are AFP, which provides the content; Vecsys and Vecsys Research, which provide speech-to-text conversion and expertise in literary and dialectal Arabic-language processing; multimedia content management specialist Nuxeo; cross-lingual search experts Antidot; ontology and taxonomy management provider Mondeca; CNRS LLACAN and INALCO CERMOM, experts in the Arabic language; and automated translators LIMSI and GREYC.

TEMIS, the consortium's ninth member, provides knowledge extraction and information analysis and discovery with its text analytics enterprise solution Luxid. According to TEMIS, the SAMAR Project uses the Luxid for Content Enrichment tool to "understand" Arabic-language syntax, extracting terms, topics, facts, and relationships from extensive resources by applying domain- and language-specific annotators to the relevant content.

TEMIS first discussed the project with AFP's Medialab in 2007, and then tapped its partners in Cap Digital to take part. While funding initially fell short, TEMIS and AFP reproposed the project in 2008, raised funds from regional and national French governmental organizations, and began planning out SAMAR's early phases in June 2009.

TEMIS's Huot says Luxid is well-suited to the demands of SAMAR Project researchers, applying lessons learned from years of experience mapping Arabic-language content in order to design more efficient and powerful annotators.

"The SAMAR project has been a fascinating technical and commercial challenge with the end product serving business, publishing, and political needs," Huot says. "The platform represents a unique source of strategic information for companies looking to expand operations into the promising Middle East and North Africa markets, and Arabic media expect to use this platform, to meet their needs for organizing and enriching information production in order to take part in [the] worldwide publishing information race."

By reaching across the Mediterranean, the French government also thinks that the SAMAR Project could help bridge the cultural and linguistic gap between European Union (EU) countries to the north and North African countries, Huot says. This is a part of French President Nicolas Sarkozy's vision for the Union for the Mediterranean, a nascent organization designed to boost cooperation among local and regional authorities on issues affecting EU and non-EU countries bordering the Mediterranean Sea.