Drowning in Data and Searching for Answers

Jul 01, 2014

The night before I wrote this column, I watched a classic British spy thriller. One of the old-guard spies complained that they had more information coming in than ever before, but nobody knew how to make sense of it all. In an age of Big Data, any profession that deals with information faces this problem, journalism included.

When Edward Snowden revealed the extent of the National Security Agency's (NSA) surveillance, he also exposed a classic dilemma for people who deal in data and information. Those who handle data can make all kinds of connections on paper, but it doesn't always mean they are connecting the dots correctly. Eventually, you need to go out on the street and figure out if what you're seeing in the data is actually happening.

To paraphrase Peter Gabriel in "Games Without Frontiers," Hans may play with Lotte and Lotte with Jane, Jane with Willy, but what does it all mean? The NSA faces a similar problem with metadata. It may see the connections among a group of associates through cellphone metadata, but then what does it actually do with that information?

As we rely more on data and we wait for tools to help us sort it out without the help of data scientists, we have to look not just at the data, but at the story behind the data, because data is just a means to an end. In the aforementioned spy thriller, the lead character bemoaned that people didn't take old-fashioned investigating and analysis seriously. People still need to collect information, make connections, and build an understanding of what's going on. Computers can help with that, but they are likely telling only part of the story.

As an MIT professor pointed out last year during a talk on Big Data at the MIT Sloan CIO Symposium, correlation doesn't always equal causation, and he told an amusing story to illustrate this. The data found that people who own ashtrays are more likely to get lung cancer. Therefore, the reasoning goes, if we remove ashtrays from people's homes and public places, we can reduce lung cancer rates. The data in this case didn't tell the whole story, and as we increasingly rely on data in the future, we want to keep this cautionary tale in mind.
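For readers who want to see how the ashtray trap arises, here is a minimal Python sketch. It is not the professor's data or method; it is a toy simulation in which every probability is a made-up assumption. Smoking is the hidden factor that drives both ashtray ownership and lung cancer, while ashtrays themselves have no effect at all.

```python
import random

# All probabilities below are hypothetical, chosen only for illustration.
P_SMOKER = 0.20               # share of people who smoke
P_ASHTRAY_IF_SMOKER = 0.80    # smokers usually own an ashtray
P_ASHTRAY_IF_NOT = 0.05       # non-smokers rarely do
P_CANCER_IF_SMOKER = 0.15     # cancer risk depends on smoking only
P_CANCER_IF_NOT = 0.01


def simulate(n=100_000, seed=0):
    """Return a list of (smoker, owns_ashtray, cancer) tuples."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        smoker = rng.random() < P_SMOKER
        owns_ashtray = rng.random() < (
            P_ASHTRAY_IF_SMOKER if smoker else P_ASHTRAY_IF_NOT
        )
        # Note: ashtray ownership never enters the cancer calculation.
        cancer = rng.random() < (
            P_CANCER_IF_SMOKER if smoker else P_CANCER_IF_NOT
        )
        people.append((smoker, owns_ashtray, cancer))
    return people


def cancer_rate(people, keep):
    """Cancer rate among the people for whom keep(person) is true."""
    group = [p for p in people if keep(p)]
    return sum(p[2] for p in group) / len(group)


people = simulate()

# The naive comparison: ashtray owners versus everyone else.
rate_ashtray = cancer_rate(people, lambda p: p[1])
rate_no_ashtray = cancer_rate(people, lambda p: not p[1])

# The same comparison, but only among smokers.
rate_smoker_ashtray = cancer_rate(people, lambda p: p[0] and p[1])
rate_smoker_no_ashtray = cancer_rate(people, lambda p: p[0] and not p[1])

print(f"Ashtray owners:           {rate_ashtray:.3f}")
print(f"No ashtray:               {rate_no_ashtray:.3f}")
print(f"Smokers with ashtray:     {rate_smoker_ashtray:.3f}")
print(f"Smokers without ashtray:  {rate_smoker_no_ashtray:.3f}")
```

In this toy world, ashtray owners show a much higher cancer rate than non-owners, yet once you compare smokers with smokers, the gap all but disappears. The ashtray was standing in for the question nobody asked.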

We can't look at the data and assume we have the whole story. We might even have a great story, but we still have to do the old-fashioned investigation and analysis that the veteran spy was talking about. We still have to get out of the office and ask questions. We can't simply sit at our computers with our data and assume we know something conclusively. If we do, we may end up with our own version of "Study Finds Ashtrays Cause Lung Cancer."

It's also important to understand that in spite of the pitfalls we might find in analyzing data without context, data really matters and we can learn things we never imagined. MIT professor Andrew McAfee has been advising businesses on the value of Big Data for several years, and he says that what they usually want to know is where the value in their data is.

McAfee tells a story about his MIT colleagues who were able to predict the housing market based on data from Google searches better than the National Association of REALTORS (NAR). How did they do this? They theorized that if they looked at Google search data in a given housing market for searches about houses, schools, neighborhoods, and the kinds of things people would research before buying a house, then they could predict the housing market. It turns out they could, with 23.6% greater accuracy than NAR could with all its data.

There is clearly a double lesson to be learned here. You can use data to find new and better ways of making predictions, but sometimes you can use different data, even data that seems more comprehensive, and not do as well.

Certainly, we have a long way to go in terms of analysis across all disciplines, but we are entering an age of data. As journalists, we have to learn to use that information wisely and well; as we do, we can't forget that our job is still to investigate, analyze, and question. If we don't take that responsibility seriously, we could end up with a pile of data and dangerously wrong conclusions.