Nov 18, 2013
Sponsored Guest Commentary
Semantic Analysis of Unstructured Information in the Age of Big Data
Big data is a heavily hyped subject. The term itself is rather misleading, because it seems to refer to data quantity only and ignores the heterogeneous nature of data. Accordingly, the general definition of “big data” suggested by Gartner also considers the complexity and variability of data, including the relevance of highly heterogeneous and unstructured information and the various ways in which it can be analyzed.
In particular, beyond the growing network of machines and equipment, social interactions are a further source of big data: With the exchange of messages in a wide variety of formats, on various platforms, and through multiple channels, large amounts of data are generated that demand analysis and interpretation. Emails and documents remain important drivers for data growth, compounded by video content and social media.
This data is, by nature, highly unstructured and based largely on natural language – either directly or, as in the case of audio and video content, after transcription or similar pre-processing steps. Direct analysis using traditional methods, such as data mining or business intelligence, is not possible for data of this nature. Instead, content-based indexing and processing is required.
Semantic and language processing methods are used to extract relevant information from unstructured data streams and texts, identify structures, and create links both within the data and to other data sources. To a certain extent, the primary goal is to make “Business Intelligence on text” a reality, and that demands innovative technologies.
The following examples illustrate this:
- Entries posted in blogs and forums are indexed, problem descriptions and symptoms are analyzed, and product and component descriptions are extracted, in order to obtain more detailed information from customer interactions.
- When analyzing social media data, texts written by users must be analyzed and structured, taking specific jargon or slang into consideration, to assess the overall mood (“sentiment analysis”).
- In e-mails and documents, semantic relations and connections are identified, and links to other information, such as CRM data or product catalogs, are established.
- Large video archives are evaluated by applying transcription and semantic methods, entities are extracted, and issues are analyzed.
All of these scenarios can be found in the “big data” world and depend on the use of semantic and linguistic technologies, like those provided by Empolis.
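To make the social media scenario above a little more concrete, here is a minimal Python sketch of slang-aware sentiment scoring. It is only an illustration under simple assumptions, not a description of how any particular product works: the slang table, the sentiment lexicon, and the function names `normalize` and `sentiment` are all hypothetical and invented for this example. Real systems rely on far larger linguistic resources and on more than word counting.

```python
import re

# Hypothetical slang table and sentiment lexicon, invented for illustration only.
SLANG = {"gr8": "great", "luv": "love", "h8": "hate", "meh": "mediocre"}
LEXICON = {
    "great": 1, "love": 1, "good": 1,
    "hate": -1, "broken": -1, "mediocre": -1,
}


def normalize(text):
    """Lowercase the text, split it into tokens, and map slang to standard words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [SLANG.get(token, token) for token in tokens]


def sentiment(text):
    """Sum the lexicon scores of all tokens and return a coarse mood label."""
    score = sum(LEXICON.get(token, 0) for token in normalize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for post in ["luv the new phone, camera is gr8",
                 "h8 this update, screen is broken",
                 "the parcel arrived on Monday"]:
        print(post, "->", sentiment(post))
```

The point of the sketch is the order of the steps: slang is mapped to standard vocabulary first, so that the scoring step only has to know standard words.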
In spite of all the technology, at the end of the day, we should let the machines do what they do best, and allow humans to do what they do best. Consider, for example, huge commercial criminal cases in which 200 to 300 attorneys plow through stacks of documents to extract the most important pieces of information and mark relevant facts, in order to provide lead counsel with the condensed information needed for a strategy. The attorneys working their way through the material are carrying out the menial task of “pre-selecting” information. In situations like these, it makes more sense for machines, with semantic and linguistic technologies, to do the work. Not only would time and money be saved on manpower, but results would also be achieved much faster. Let the machines do the tedious work and humans do the thinking.