The Exaggerated Promise of So-Called Unbiased Data Mining
Nobel laureate Richard Feynman once asked his Caltech students to calculate the probability that, if he walked outside the classroom, the first car in the parking lot would have a specific license plate, say 6ZNA74. Assuming each number and letter is equally likely and chosen independently, the students estimated the probability to be less than 1 in 17 million. When the students finished their calculations, Feynman revealed that the correct probability was 1: He had seen this license plate on his way into class. Something extremely unlikely is not unlikely at all if it has already happened.
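The students' arithmetic is easy to reproduce. Here is a minimal sketch, assuming a plate format of one digit, three letters, and two digits (as in 6ZNA74), with every character equally likely:

```python
# Probability that a random plate matches one specific plate, assuming
# the digit-letter-letter-letter-digit-digit format of 6ZNA74.
plates = 10 * 26**3 * 10**2   # 17,576,000 possible plates
print(1 / plates)             # about 5.7e-8, i.e., less than 1 in 17 million
```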
The Feynman trap—ransacking data for patterns without any preconceived idea of what one is looking for—is the Achilles heel of studies based on data mining. Finding something unusual or surprising after it has already occurred is neither unusual nor surprising. Patterns are sure to be found, and are likely to be misleading, absurd, or worse.
In his best-selling 2001 book Good to Great, Jim Collins compared 11 companies that had outperformed the overall stock market over the previous 40 years to 11 companies that hadn’t. He identified five distinguishing traits that the successful companies had in common. "We did not begin this project with a theory to test or prove," Collins boasted. "We sought to build a theory from the ground up, derived directly from the evidence."
He stepped into the Feynman trap. When we look back in time at any group of companies, the best or the worst, we can always find some common characteristics, so finding them proves nothing at all. Following the publication of Good to Great, the performance of Collins’ magnificent 11 stocks has been distinctly mediocre: Five stocks have done better than the overall stock market, while six have done worse.
In 2011, Google created an artificial intelligence program called Google Flu that used search queries to predict flu outbreaks. Google’s data-mining program looked at 50 million search queries and identified the 45 that were most closely correlated with the incidence of flu. It's yet another example of the data-mining trap: A valid study would specify the keywords in advance. After issuing its report, Google Flu overestimated the number of flu cases for 100 of the next 108 weeks, by an average of nearly 100 percent. Google Flu no longer makes flu predictions.
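To see why screening millions of candidate predictors backfires, consider a toy simulation (invented numbers, not Google's data or method). Every series below is pure noise, yet the best-fitting one looks impressive on past data and useless on new data:

```python
# Toy illustration of the Google Flu failure mode: screen thousands of
# unrelated series for the best fit to past data, then watch the "winner"
# fail on new data. Everything here is random noise by construction.
import numpy as np

rng = np.random.default_rng(3)
flu = rng.normal(size=104)               # two years of weekly fake "flu" counts
queries = rng.normal(size=(5000, 104))   # 5,000 unrelated search-query series

train = slice(0, 52)                     # fit on the first year only
corrs = [abs(np.corrcoef(flu[train], q[train])[0, 1]) for q in queries]
best = queries[int(np.argmax(corrs))]    # the best-correlated query

print(np.corrcoef(flu[:52], best[:52])[0, 1])   # strong in-sample "fit"
print(np.corrcoef(flu[52:], best[52:])[0, 1])   # near zero out of sample
```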
An internet marketer thought it could boost revenue by changing its traditional blue webpage color to a different color. After several weeks of tests, the company found a statistically significant result: apparently England loves teal. By trying several alternative colors in a hundred or so countries, the company guaranteed that it would find a revenue increase for some color in some country, but it had no idea ahead of time whether teal would sell more in England. As it turned out, when England’s webpage color was changed to teal, revenue fell.
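A quick simulation shows how easily this happens. The numbers here are made up (five colors, 100 countries, no true effect anywhere), yet roughly 5 percent of the comparisons come out "statistically significant" anyway:

```python
# Simulated color test: page color has no effect on revenue, yet testing
# many country/color combinations still produces "significant" results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_countries, n_colors, n_users = 100, 5, 200

false_positives = 0
for _ in range(n_countries * n_colors):
    blue = rng.normal(0, 1, n_users)      # revenue with the blue page
    other = rng.normal(0, 1, n_users)     # revenue with a new color (no real effect)
    _, p = stats.ttest_ind(blue, other)   # two-sample t-test
    if p < 0.05:
        false_positives += 1

print(false_positives, "of", n_countries * n_colors)   # about 25 of 500 tests
```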
A standard neuroscience experiment involves showing a volunteer in an MRI machine various images and asking questions about the images. The measurements are noisy, picking up magnetic signals from the environment and from variations in the density of fatty tissue in different parts of the brain. Sometimes they miss brain activity; sometimes they suggest activity where there is none.
A Dartmouth graduate student used an MRI machine to study the brain activity of a salmon as it was shown photographs and asked questions. The most interesting thing about the study was not that a salmon was studied, but that the salmon was dead. Yep, a dead salmon purchased at a local market was put into the MRI machine, and patterns were duly discovered. There were inevitably patterns, and they were invariably meaningless.
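The dead-salmon result is easy to reproduce in miniature. In this sketch (with invented numbers, not the Dartmouth data), every "voxel" is pure noise, yet dozens still clear a naive significance threshold:

```python
# Pure-noise "brain scan": test tens of thousands of voxels independently
# and some will always cross a naive significance threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_voxels, n_scans = 60_000, 30
noise = rng.normal(0, 1, (n_voxels, n_scans))   # no signal anywhere

t, p = stats.ttest_1samp(noise, popmean=0, axis=1)
print((p < 0.001).sum())   # roughly 60 "active" voxels in a signal-free scan
```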
In 2018, a Yale economics professor and a graduate student calculated correlations between daily changes in Bitcoin prices and hundreds of other financial variables. They found that Bitcoin prices were positively correlated with stock returns in the consumer goods and health care industries, and negatively correlated with stock returns in the fabricated products and metal mining industries. “We don’t give explanations,” the professor said. “We just document this behavior.” In other words, they may as well have looked at correlations of Bitcoin prices with hundreds of lists of telephone numbers and reported the highest correlations.
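The telephone-book point can be checked directly. In this sketch, the "Bitcoin" series and all 500 candidate series are independent random noise, yet screening for the best match reliably turns up a respectable-looking correlation:

```python
# Screen hundreds of unrelated series against a target and the best
# correlation found is sizable by construction, not because it means anything.
import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=250)              # a year of fake daily "Bitcoin" returns
candidates = rng.normal(size=(500, 250))   # 500 unrelated random series

corrs = [np.corrcoef(target, c)[0, 1] for c in candidates]
print(max(corrs, key=abs))   # typically |r| around 0.2 despite no true relation
```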
The director of Cornell University’s Food and Brand Lab authored (or coauthored) more than 200 peer-reviewed papers and wrote two popular books, which were translated into more than 25 languages.
In a 2016 blog post titled “The Grad Student Who Never Said No,” he wrote about a PhD student who had been given data collected at an all-you-can-eat Italian buffet.
Email correspondence surfaced in which the professor advised the graduate student to separate the diners into “males, females, lunch goers, dinner goers, people sitting alone, people eating with groups of 2, people eating in groups of 2+, people who order alcohol, people who order soft drinks, people who sit close to buffet, people who sit far away, and so on…” Then she could look at different ways in which these subgroups might differ: “# pieces of pizza, # trips, fill level of plate, did they get dessert, did they order a drink, and so on…”
He concluded that she should “work hard, squeeze some blood out of this rock.” By never saying no, the student got four papers (now known as the “pizza papers”) published with the Cornell professor as a coauthor. The most famous paper reported that men eat 93 percent more pizza when they eat with women. It did not end well. In September 2018, a Cornell faculty committee concluded that he had “committed academic misconduct in his research.” He resigned, effective the following June.
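The arithmetic behind such fishing expeditions is unforgiving. Using the categories from the email as a rough count (the exact numbers here are illustrative), spurious "findings" are built in:

```python
# Why subgroup fishing always "works": many subgroups times many outcomes
# means several spurious p < .05 results are expected even from random data.
subgroups = 12   # males, females, lunch goers, dinner goers, ... (per the email)
outcomes = 5     # pizza slices, trips, plate fill, dessert, drinks
tests = subgroups * outcomes

print(tests, "tests,", tests * 0.05, "expected false positives at p < .05")
```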
Good research begins with a clear idea of what one is looking for and expects to find. Data mining just looks for patterns and inevitably finds some.
The problem has become endemic because powerful computers are so good at plundering Big Data. Data miners have found correlations between Twitter words or Google search queries and criminal activity, heart attacks, stock prices, election outcomes, Bitcoin prices, and soccer matches. You might think I am making these examples up. I am not.
There are even stronger correlations with purely random numbers. It is Big Data hubris to think that data-mined correlations must be meaningful. Finding an unusual pattern in Big Data is no more convincing (or useful) than finding an unusual license plate outside Feynman's classroom.