Making Sense of Big Social Data

You see references to "Big Data" everywhere. Even on the personal side, we all share a common sense of information overload. We don't (can't) back up all our information (personal, business, shared, and social) because it is stored in so many places. We often can't find critical pieces of information. And that's just our own. Expand the scope to massive corporate data sets, and the problem becomes mind-numbing. We also know that we aren't getting all the value that's available in this data storm. Not getting sufficient value from that information may be the biggest problem of all.

What is Big Data? How does social media such as Twitter fit into Big Data? How do you search and use it? I have always viewed information holistically, as a rainbow spectrum. Highly structured database information is at the infrared end. Loosely structured PDF and Office documents are at the opposite, ultraviolet end. This end of the spectrum is often called "unstructured," yet it is anything but. It is simply harder to analyze than the red end. A 2011 IDC report says that 90% of Big Data is "unstructured," including social content. Between the color extremes is a range of very structured documents (think XML), along with Twitter feeds and other items that fall between the loosely structured end and highly structured databases.

Exclusively focusing on the red side of information (as in typical Business Intelligence [BI] systems) leads to color-blindness to the rest of the spectrum. This BI myopia misses such valuable information as social commentary, whether likes, hashtags, or reviews. There are strategies for deriving value from each end of the spectrum and, increasingly, via the XML middle.

In May, I spent 3 days at MarkLogic World's Big Data Big Ideas conference to understand how MarkLogic views Big Data and its strategies for leveraging it. MarkLogic's flagship product is MarkLogic Server, an XML server that emphasizes managing, searching, and analyzing content across the information spectrum. Server converts content from both extremes of the information rainbow into XML, providing a consistent and rich view. Server isn't a database. It isn't a conventional document management system. It isn't a conventional search system. Yet, in a sense, Server is all three, leveraging the many mature XML specifications that are becoming better understood. Indeed, MarkLogic Server developers had best become fluent in XQuery, even though the product will also support SQL.

I am not a developer and have only a basic XML competency. However, during that conference I developed a search application using MarkLogic Express, a freely available light version of Server. I fed it mostly PDF press files to see how it processed loosely structured information and made it searchable. MarkLogic Express converted my PDF files into XML that captured built-in PDF attributes such as title and author. Moreover, it was able to recognize and tag content chunks such as paragraphs. The insights I gained from Express and from the conference itself helped me appreciate the impact and implications of Big Data and how to search and understand it.
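To make the idea of wrapping a document's attributes and paragraphs in XML more concrete, here is a minimal sketch in Python using only the standard library. It is not MarkLogic's conversion pipeline or schema: the element names (document, meta, title, author, body, p), the helper functions, and the sample text are all hypothetical, chosen only to illustrate how tagged paragraphs become searchable once they sit in an XML tree.

```python
import xml.etree.ElementTree as ET


def document_to_xml(title, author, paragraphs):
    """Wrap extracted metadata and paragraph text in a small XML tree.

    The element names are illustrative assumptions, not a real product schema.
    """
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "meta")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "author").text = author
    body = ET.SubElement(doc, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return doc


def search_paragraphs(doc, term):
    """Return the text of every <p> element containing term (case-insensitive)."""
    term = term.lower()
    return [p.text for p in doc.iter("p") if p.text and term in p.text.lower()]


# Made-up sample content standing in for text extracted from a PDF.
sample = document_to_xml(
    "Example Press Release",
    "Jane Doe",
    [
        "Acme announces a new product.",
        "The product ships next quarter.",
        "Contact the press office for details.",
    ],
)
hits = search_paragraphs(sample, "product")
```

Because the metadata and the paragraphs live in one tree, the same structure can answer both "who wrote this?" and "which paragraphs mention this term?", which is the kind of consistent view the XML approach is meant to provide.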

The conference presentations emphasized the three V's mantra that is quickly being adopted by all Big Data vendors: Variety, Velocity, and Volume. One of my favorite Server demos used social analytics to find new marketing approaches for Gatorade. The demo showed a real-time dashboard analysis of commentary, including Twitter feeds, that mentioned both Gatorade and "flu." The result was a real-time U.S. map highlighting where these two terms were highly correlated. That is, consumers with the flu were finding Gatorade to be helpful, suggesting new marketing themes. Then the demonstrator tweeted a comment about Gatorade and flu, and within moments the map was updated. Given that Twitter is getting about 20 terabytes of data each day, this is a clear example of Big Data analytics in real time.
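The core of a map like that can be approximated as a count, per state, of posts that mention both terms. The sketch below is a deliberately simple Python illustration, not the demo's actual implementation: the function name, the (state, text) input shape, and the sample posts are all invented, and it uses naive substring matching rather than real text analysis or a statistical correlation measure.

```python
from collections import Counter


def cooccurrence_by_state(posts, term_a, term_b):
    """Count posts per state whose text contains both terms.

    posts is an iterable of (state, text) pairs; the terms should be lowercase.
    Matching is plain substring search, so "flu" would also match "fluent".
    """
    counts = Counter()
    for state, text in posts:
        lowered = text.lower()
        if term_a in lowered and term_b in lowered:
            counts[state] += 1
    return counts


# Invented sample posts, tagged with a hypothetical state of origin.
posts = [
    ("TX", "Down with the flu, living on Gatorade"),
    ("TX", "Gatorade after the game"),
    ("OH", "Flu season: Gatorade helps"),
    ("OH", "gatorade and the flu again"),
    ("CA", "Got my flu shot today"),
]
counts = cooccurrence_by_state(posts, "gatorade", "flu")
```

Updating the counts as each new post arrives, rather than recomputing over a stored batch, is what would turn a tally like this into the kind of live map the demo showed.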

To remain relevant, traditional BI applications must work with the whole information spectrum, including social media. These analytics must be able to deal with all three Big Data dimensions: variety (of formats), velocity (the lightning-quick speed of incoming data), and volume (multiple terabytes in real time). And XML may play an increasing role. Preparing for this approach means cultivating a deep competency in XML, including W3C standards such as XQuery. Of course, other vendor products that are not XML-based are quickly developing similar ways of analyzing Big Data, social and otherwise. Still, MarkLogic's use of XML seems like a good bridge between the extremes of the information spectrum.