In the mid-1990s, I bought an Iomega Zip drive whose disks boasted 100MB of storage each. A disk was the size of a small pancake. The other popular offline storage medium, a 3.5" diskette, held 1.44MB. Each Zip disk packed the equivalent of almost 70 diskettes. Who would ever need more than a few Zip disks?
Fast-forward to today, when discussions of data in general, and Big Data in particular, are frequently framed in petabytes. A petabyte is a "1" followed by 15 zeros, or about 10 million Zip disks. Worse yet, the total keeps growing exponentially: a 2011 McKinsey Global Institute study predicted that the volume of data will grow 40% per year. That is a doubling of data roughly every 2 years. Exponential volume growth can't continue forever (we'll run out of atoms), but in the meantime we are increasingly overwhelmed by the data and struggle to make sense of it.
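The arithmetic behind these comparisons is easy to check. The short sketch below assumes decimal units (1MB = 10^6 bytes, 1PB = 10^15 bytes) and steady compound growth of 40% per year. The variable names are mine, chosen only for illustration.

```python
import math

# Assumed decimal units: 1MB = 10**6 bytes, 1PB = 10**15 bytes.
ZIP_DISK_BYTES = 100 * 10**6      # 100MB Zip disk
DISKETTE_BYTES = 1.44 * 10**6     # nominal 1.44MB 3.5" diskette
PETABYTE_BYTES = 10**15           # a "1" followed by 15 zeros

# How many diskettes fit on one Zip disk, and Zip disks in a petabyte?
diskettes_per_zip = ZIP_DISK_BYTES / DISKETTE_BYTES
zips_per_petabyte = PETABYTE_BYTES // ZIP_DISK_BYTES

# Compound growth at 40% per year (assumed constant for this sketch).
ANNUAL_GROWTH = 0.40
growth_after_two_years = (1 + ANNUAL_GROWTH) ** 2
doubling_years = math.log(2) / math.log(1 + ANNUAL_GROWTH)

print(f"Diskettes per Zip disk: {diskettes_per_zip:.1f}")
print(f"Zip disks per petabyte: {zips_per_petabyte:,}")
print(f"Growth factor after 2 years: {growth_after_two_years:.2f}")
print(f"Doubling time in years: {doubling_years:.2f}")
```

At 40% per year the volume grows by a factor of 1.96 in two years, so "doubling every 2 years" is a close approximation and not an exact figure.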
And volume is not the only consideration. There are two more V's: variety and velocity. The types of data keep changing, and data's arrival speed keeps increasing.
How do we analyze and use all this data streaming toward us? Is the volume so large that we should toss out traditional approaches to viewing and searching information, such as thesauruses, taxonomies, and concept maps? Or do we still need to consider both the medium and the meaning? I've always felt that Aristotle's approach would work for managing information, and were he here today he might say we need both. He believed that all reality consisted of matter and form, and that a substance combines both. All the building materials (matter) for a house are not themselves a dwelling. They become a house only when built to a specific set of plans (the form). Big Data is the matter. How can anyone determine the form in this mashup of tweets, geolocation and instrument data, photo metadata, office documents, structured data, and the like?
The three V's have spawned new technologies such as Hadoop, the Apache open source software framework, and MapReduce, a processing model developed by Google and now available in an open source implementation distributed by Apache. These corral the data, but they don't provide powerful means to analyze it.
Does the exponentially growing magnitude of all three V's require a different analysis approach? To gain insight into this, I discussed these issues with John Felahi and Steven Toole of Content Analyst Co.
They maintain that traditional search tools (and the Aristotelian approach) work well when you know exactly what you are looking for, such as a document by a specific author or with a specific title. An alternative is the Bayesian approach, a search technology that uses probability to infer meaning statistically. In their view, this technique has its own Achilles' heel: language dependency, which creates the need for expensive prepackaged thesauri. Moreover, the precision of Bayesian methods can degrade as data volumes increase.
The alternative, Felahi and Toole assert, is radically different and is based on a mathematical technology they call Latent Semantic Analysis, aka LSA (Content Analyst is the original patent holder for this technology). Content Analyst has built a product on LSA called CAAT, essentially a text analytics add-on to other search systems, whether open source or vendor-proprietary. According to Content Analyst, CAAT can turn content in any language or format into geometry, up to hundreds of dimensions. It looks for patterns that do not require knowing the language, yet (Content Analyst claims) it can produce precise, consistent categorization. Because it does not need language-specific vocabularies, CAAT skirts the issue of different languages, and Content Analyst says it can still serve up dynamic clusters of information, concept-based categorization, meaningful summaries, and effective conceptual search.
I did not see a demo (CAAT is a toolkit that Content Analyst's technology partners build into products) but did look at some partner references. What I saw suggests CAAT tackles a wide variety of text analysis problems. I will follow up with an on-site visit.
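To make the idea of "turning content into geometry" a little more concrete, here is a minimal sketch of textbook LSA: count terms per document, then use a truncated singular value decomposition to place each document in a low-dimensional space. This is not CAAT and makes no claim about how Content Analyst implements its product. The toy corpus, the choice of two dimensions, and every name in the code are assumptions made purely for illustration. It relies on NumPy.

```python
import numpy as np

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "disk drive storage",
    "storage disk capacity",
    "tweet photo metadata",
    "photo tweet geolocation",
]

# Build a term-document count matrix (rows = terms, columns = documents).
vocab = sorted({word for doc in docs for word in doc.split()})
row_of = {word: i for i, word in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for col, doc in enumerate(docs):
    for word in doc.split():
        counts[row_of[word], col] += 1

# Truncated SVD: keep only the k strongest dimensions.
# Real systems use far more dimensions; k = 2 suits this tiny corpus.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (s[:k, None] * Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


same_topic = cosine(doc_vectors[0], doc_vectors[1])   # both about storage
cross_topic = cosine(doc_vectors[0], doc_vectors[2])  # storage vs. photos
print(f"same topic:  {same_topic:.2f}")
print(f"cross topic: {cross_topic:.2f}")
```

Nothing in the code knows what any word means or which language it is in. The two storage documents end up close together, and far from the photo documents, purely because of which terms occur together. That is the sense in which a technique like this needs no thesaurus.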
Is it time to switch from Aristotle to Marshall McLuhan's "medium is the message"? Does understanding Big Data require a fundamentally different approach? Or should we look for a middle ground, expressed by Thomas Aquinas, an avid scholar of Aristotle's work: Truth lies between the extremes. Big Data won't be wished away. We probably must expand our analytics toolkit. This means keeping traditional approaches for data sets that are still manageable with them, while staying open to algorithmic approaches such as LSA when volume or velocity overwhelms us.