How Big Data Analytics Can Change the Publishing Paradigm

When David Feinleib, a California-based venture capitalist turned Forbes tech blogger, began to wonder what to post and when to post it, he had an arsenal of contacts in the Big Data world to help him figure it out. "These companies can all do these incredible things with their technology, but it's hard to get specific use cases," that is, examples of a technology applied to day-to-day business decisions. So Feinleib set out to find a company that would run analytics on his blog and give him and the Forbes staff some useful input. He found Datameer, Inc., a Hadoop-based analytics software provider founded in 2009 that was nimble enough to do it.

Joe Nicholson, the VP of marketing at Datameer, explains that Big Data and analytics have exploded in the past 5 years. "Business intelligence is not new. What changed is the rise of what we call unstructured data." Today, companies want to track things like commentary around a problem, employee output such as PowerPoint presentations, customer paths taken through a website, emails, comments posted on websites and collaborative tools, and the omnipresent tweets. Needless to say, it's a lot to grasp and, moreover, to store.

"We interact a good deal more than we transact," says Nicholson, adding, "[U]nstructured data constitutes four times the volume of transaction data." IDC's "Enterprise Disk Storage Consumption Model" report, released in fall 2011, concurs: It predicts that unstructured data will grow at a 61.8% CAGR, compared to 23.7% for transactional, or block, data.

This rise of the content-driven enterprise means that companies will spend $22.5 billion by 2014 to deploy 67,145 petabytes of file-based storage capacity. And the question, posed by IDC, is not whether this is too much but rather, "Are we deploying enough capacity?"
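To see why those IDC growth rates dwarf storage budgets so quickly, it helps to compound them. The sketch below uses the 61.8% and 23.7% CAGR figures cited above; the 5-year horizon is an illustrative assumption, not a figure from the report.

```python
# Compound the IDC-cited CAGRs to compare growth multiples.
# The rates come from the article; the 5-year horizon is assumed.

def compound(cagr: float, years: int) -> float:
    """Return the total growth multiple after `years` at a given CAGR."""
    return (1 + cagr) ** years

unstructured = compound(0.618, 5)   # roughly 11x in 5 years
transactional = compound(0.237, 5)  # roughly 2.9x in 5 years
print(f"Unstructured: {unstructured:.1f}x, transactional: {transactional:.1f}x")
```

At those rates, unstructured data multiplies about four times faster than transactional data over the same period, which is why the capacity question keeps coming up.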

Historically, the motivation to store all this data had a lot to do with regulatory compliance. However, no regulation that I know of requires a company to report how many times its article on waistlines for pants was liked. Storing this behavioral, and somewhat momentary, content feels a bit like Auntie Mame's attic full of fun party dresses from yesteryear. After all, web-derived content left unanalyzed is basically old news. Does anyone ever need to know the top search terms used on their website 10 years ago? Perhaps the best defense is a good offense: Analyze it today and make nimble business decisions this week, or at least this month.

Datameer was willing to take part in this experiment just to see whether Forbes would find it useful. For 1 month in the summer of 2012, the publicly available data displayed with each post was analyzed, including date, time, page views, tweets, headlines, and full text. The Datameer tool can be downloaded and run on a local computer against local data, or set up for a bigger installation as part of a large data farm. The interface relies on an Excel-style spreadsheet to define what you want to analyze; you then use the tool to create data visualizations. Splunk, Inc., another analytics company, uses a very similar interface, so the approach seems to be popular.
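The kind of aggregation described above can be sketched in a few lines of code. This is a hypothetical illustration with made-up post records and field names; Datameer's actual tool exposes this workflow through its spreadsheet-style interface rather than through code.

```python
# Minimal sketch: aggregate page views by the weekday each post was
# published, the sort of question the Forbes experiment asked.
# Post records below are hypothetical, not the actual Forbes data.
from collections import defaultdict
from datetime import datetime

posts = [  # (publish timestamp, page views) -- illustrative values
    ("2012-07-02 12:05", 1400),  # a Monday
    ("2012-07-04 09:30", 2100),  # a Wednesday
    ("2012-07-05 12:15", 1900),  # a Thursday
    ("2012-07-07 08:00", 1600),  # a Saturday
]

views_by_day = defaultdict(int)
for stamp, views in posts:
    day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").strftime("%A")
    views_by_day[day] += views

for day, views in sorted(views_by_day.items(), key=lambda kv: -kv[1]):
    print(f"{day}: {views} views")
```

The same grouping logic, pointed at publish hour instead of weekday, would surface the lunchtime and early morning readership spikes discussed below.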

The results showed that most content is published on Wednesday or Thursday, but more content is read on Monday, and a surprising amount is read on Saturday. The term associated with the most page views was "cloud," not "Big Data." In terms of readership, just around lunchtime on the East Coast is the best time to publish your content, with another spike of early morning readers at 7 a.m.

As Feinleib points out, "Obviously, quality, topic and author play a big role in the visibility of content." For Feinleib, the takeaway from this ad hoc investigation is that in addition to the quality and the marketing of your content, "[Y]ou should be investing in experimenting with how and when [your] content is delivered and measuring your results."

Many media companies are engaging in analytics today to varying degrees. NPR, for instance, uses Splunk for web intelligence, according to Splunk's website. It tracks real-time visitor metrics, performs ad hoc analyses, and views historical trends. It has built custom dashboards to understand how customers interact with NPR's web-based text and audio content in order to refine its programming, digital strategy, and fundraising.

So watch this space. Better yet, find a way to take advantage of it. The Big Data industry on its own is worth more than $100 billion. It's growing at almost 10% a year, which is roughly twice as fast as the software business as a whole, according to The Economist, so it's not going anywhere.
