Big Data Is Back From the Dead

Given their reliance on statistics, data scientists were stunned when Donald Trump won the U.S. presidential election. Did the faulty numbers and broken algorithms signal the death of Big Data? Or did they underscore the fact that this is a nascent field?

The Monday-morning quarterbacking continues, but certain factors may lead to improved practices and a more conscientious climate. At a minimum, the need for higher-quality data, sharper code, and more contextual, nuanced analysis is apparent.

The question remains: What went wrong? Data projects typically consist of the following five steps:

  1. Understand the business problem—what key questions can Big Data answer?
  2. Determine impact measurements—what data is needed?
  3. Discover available data—find the most credible, accessible, and economical data.
  4. Formulate hypotheses.
  5. Communicate the results.

Sound polling data was needed. Unfortunately, it had several flaws. Respondents skewed the numbers in several ways: some said they would vote but didn't or changed their minds, some did not want to admit they were voting for a particular candidate, and some encountered long lines at polling places and left without voting.

Same-day voter registration also played a role, since the sample for most polls is registered voters. One of the 11 states that have same-day voter registration is Wisconsin, where polls leading up to the election had Hillary Clinton handily beating Trump. Did this option galvanize Trump supporters to register on Election Day? It likely played a part, as Trump's margin of victory in Wisconsin was less than 1%.

In “Why Pollsters Were Completely and Utterly Wrong,” Dan Cassino writes, “Caller ID, more than any other single factor, means that fewer Americans pick up the phone when a pollster calls.” This limited the samples to those who were either unaware the caller was a pollster or were aware of it and wanted to be polled, making the samples less likely to be random. 

These voter groups may not seem substantial enough to make a significant difference, but in a very close race, they can have a decisive impact. In any data project, the data should be assessed to determine whether it is of high quality. "Assess Whether You Have a Data Quality Problem," by Thomas C. Redman, offers the following suggestions for identifying problematic data:

  1. Pull the last 100 datapoints as a sample. Highlight 10–15 critical attributes in each.
  2. Assign 2–3 project team members to analyze each record and document errors.
  3. Total the number of perfect records and calculate the accuracy percentage of the sample as a whole.
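Redman's spot check boils down to simple arithmetic: count the records with no errors in any critical attribute, then divide by the sample size. A minimal sketch in Python (the record fields and the "missing value counts as an error" rule here are hypothetical stand-ins for whatever error-flagging a real review team would do):

```python
def accuracy_percentage(records, critical_attrs):
    """Redman-style spot check: a record is 'perfect' only if
    every critical attribute passes its validity test.
    Here the (assumed) test is simply 'not missing or blank'."""
    perfect = 0
    for record in records:
        if all(record.get(attr) not in (None, "") for attr in critical_attrs):
            perfect += 1
    return 100.0 * perfect / len(records)

# Hypothetical sample: the last 100 records would go here; 4 shown.
sample = [
    {"name": "A. Smith", "zip": "53703", "phone": "608-555-0100"},
    {"name": "B. Jones", "zip": "",      "phone": "608-555-0101"},  # missing zip
    {"name": "C. Lee",   "zip": "53211", "phone": None},            # missing phone
    {"name": "D. Kim",   "zip": "53202", "phone": "414-555-0102"},
]
print(accuracy_percentage(sample, ["name", "zip", "phone"]))  # 50.0
```

In practice, the "validity test" would be whatever errors the two or three reviewers documented per record; the point is that the final number is a straight fraction of perfect records.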

Once the data is deemed acceptable, it is imperative that the programming algorithm work as expected. In the case of the election, the models' overreliance on, or disregard for, factors such as incumbency advantage, ruling-party fatigue, early voting, problematic state polling data (which was, in some cases, faulty, scant, or outdated), and October surprises (such as the FBI director's letter to Congress and rumors of Russian hacking) significantly contributed to erroneous forecasts. Also, "small changes can cause big changes," according to Pradeep Mutalik of the Yale Center for Medical Informatics. With some voting models off by 15% to 20%, Mutalik compared election prediction to weather forecasting, noting that it is difficult to predict the weather more than 10 days out, according to The New York Times' "How Data Failed Us in Calling an Election."

Data best practices are needed. Are your results what you thought they would be? Look for datapoints that run counter to your expectations. What does your data show, and what might it not show? Are there any dramatic outlying datapoints? If so, investigate the surrounding datapoints. Are there any possibilities that might account for a sudden change? Also, enlist the help of colleagues in double-checking the results. 
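One mechanical way to surface "dramatic outlying datapoints" worth investigating is a simple z-score screen, flagging values far from the sample mean. A minimal sketch; the daily poll margins below are invented for illustration, and the 2-standard-deviation threshold is a common but arbitrary choice:

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population
    standard deviations from the sample mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical; nothing to flag
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily poll margins (candidate A minus candidate B, in points):
margins = [5.1, 4.8, 5.3, 4.9, 5.0, -6.0, 5.2]
print(flag_outliers(margins))  # [-6.0]
```

A flagged point is not automatically wrong; as the best practices above suggest, it is a prompt to examine the surrounding datapoints and ask what might account for the sudden change before trusting or discarding it.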

Data may be the new oil, with the algorithm as the refinery, but the raw product needs to go through many other critically important processing stages before going to market.  
