During one of the “Amrut mancha” sessions, a close buddy of mine, in a profound moment of enlightenment, said, “In the recently concluded IPL10, there is a linkage between RCB’s pathetic performance and the owner, Mr. Mallya, hiding somewhere in London.” Wow! Well, there is a correlation, but…what the heck?
Causality means A causes B, whereas correlation means that A and B tend to be observed at the same time. These are very different things. Whether correlation is “good enough” to act upon, without knowing the cause, depends entirely on the problem being solved and the risk of being wrong. This post is all about correlation vs causation, and the mad rush to engineer answers from vast amounts of data without understanding the why.
During my early consulting days, I was taught to seriously follow the “5 Whys” approach to reason through anything, and that grooming certainly helped me identify root causes better. Big data, on the other hand, isn’t based on reasoning, but simply on correlation. Big data is NOT about WHY, but about WHAT! A classic example is Google Flu Trends. Back in 2009, when the H1N1 crisis occurred globally, Google’s system proved to be a more useful and timely indicator than government reports. So, what did Google do differently?
Google took the 50 million most common search terms that users had typed and compared them with data on past occurrences of seasonal flu. Google wanted to capture what users search for when they are affected by flu, such as “medicine for cough and cold”. All their algorithm did was look for correlations between the frequency of certain search queries and the spread of flu over time and space. This example is not backed by a reason (WHY), but it surely is a clever application of math to a vast volume of data to surface the trends (WHAT).
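To make the idea concrete, here is a minimal sketch of that kind of correlation screening. It is not Google’s actual method: the weekly numbers, the second query term, and the helper names `pearson` and `rank_queries` are all invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly counts, invented for illustration only.
flu_cases = [120, 150, 310, 560, 720, 640, 380, 200]
query_counts = {
    "medicine for cough and cold": [900, 1100, 2300, 4100, 5200, 4700, 2900, 1500],
    "cricket scores": [3000, 2800, 3100, 2950, 3050, 2900, 3200, 3000],
}

def rank_queries(queries, target):
    """Sort queries by how strongly their frequency tracks the target series."""
    scored = [(pearson(counts, target), term) for term, counts in queries.items()]
    return sorted(scored, reverse=True)

ranking = rank_queries(query_counts, flu_cases)
for r, term in ranking:
    print(f"{term}: r = {r:.2f}")
```

Notice that nothing in this code knows what flu is. It only ranks series by how closely they move together, which is exactly the WHAT without the WHY.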
Does this mean that, given enough statistical evidence, it’s no longer necessary to understand why things happen, and that we only need to know which things happen together? Who knows why people do what they do? Instead of going down the rabbit hole trying to find the WHY, the argument goes, we can simply accept that people do what they do; if we can track and measure it with unprecedented fidelity and enough data, the numbers will speak for themselves. These arguments are not mine; there are several points of view and articles highlighting this alternative school of thought. Buoyed by the success of the Math Companies (read: Google, Amazon, Facebook, LinkedIn, and the like), businesses are seriously questioning the age-old scientific methods that are cumbersome and lengthy, albeit rightly designed to uncover the WHY!
The scientific method is built around testable hypotheses. Scientists are trained to recognize that correlation is not causation, and that no conclusions should be drawn simply on the basis of a correlation between X and Y (it could just be a coincidence). Instead, one must understand the underlying mechanisms that connect the two. According to the alternative school of thought, this approach to science (hypothesize, model, test) is becoming obsolete in the face of massive data. The new mantra is that, in a big data world, we should not be fixated on causality; instead, we should be discovering patterns and correlations in data as novel insights.
In the last 2 months, I have heard from numerous CAOs/CDOs that the business is asking tough questions and demanding answers right away. Business leaders do not have time to follow the scientific approach; they are much more appreciative if zillions of correlations can be delivered to them on a regular basis.
Given the technological advances and maturity of big data analytics platforms, establishing correlations across disparate data sets is the easiest part. The scary part is not knowing the cause. Why? Because there are a lot of small data problems that occur in big data, and these problems don’t disappear just because you’ve got lots of data.
Instead of calling it “big data”, I would prefer calling it “found data”: the digital exhaust. Think about it: our communication, leisure and commerce have moved to the internet, and the internet has moved into our phones, cars, homes, and even the glasses we wear. Our lives are being recorded and quantified in an unprecedented way. Google Flu Trends was built on found data, and it’s this sort of data that one needs to be careful about. After reliably providing a swift and accurate account of flu outbreaks for several winters, Google Flu Trends lost its predictive power. While Google’s model pointed to a severe outbreak, the slow-and-steady data from the CDC (Centers for Disease Control and Prevention) showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.
The problem was that Google’s engineers weren’t trying to figure out what caused what; they were merely finding statistical patterns in the data. They cared about correlation rather than causation. Figuring out what causes what is hard, whereas figuring out what is correlated with what is much cheaper and easier.
There are many reasons to be excited about the broader opportunities offered by the ease with which we can gather and analyze vast data sets. However, if you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. Another much-talked-about definition of a big data set is “N = All”: we no longer have to sample, because we have the entire population. And when “N = All”, there is indeed no issue of sampling bias, because the sample includes everyone. But is “N = All” really a good description of most of the found data sets we are considering? Probably not.
Take potholes on roads as an example. One of the popular radio stations in Bangalore urged its listeners to use a smartphone app to take pictures of potholes so that the civic authorities could act on them. As citizens of Bangalore downloaded the app, drove around, and uploaded pictures, the civic authorities suddenly had in their hands an informative data exhaust. It addressed a problem (sending inspectors to survey roads and manually record their condition) in a way that would have been inconceivable a few years ago, and without much involvement from the civic authority. Brilliant! Wait, don’t get ecstatic so fast. What the smartphone app and the citizen journalism really produced was a map of potholes that systematically favored the most heavily commuted roads used by IT professionals, who owned smartphones and were suffering because the potholes lengthened their commutes. What about the other roads, which these IT professionals do not use but which have serious pothole problems? “N = All” literally means every pothole on every road, reported from every enabled phone by every citizen. That is not the same thing as recording the potholes on a few heavily commuted roads reported by a few categories of commuters.
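A small simulation shows how this kind of bias creeps in. This is only a sketch under made-up assumptions: a hypothetical city of 100 roads, each with exactly 20 potholes, where the chance that a pothole gets reported depends only on who drives on that road. The road names and reporting probabilities are invented.

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical city: every road has the same number of potholes,
# but roads differ in how likely a pothole is to be reported.
roads = [
    {"name": f"road_{i}", "potholes": 20,
     "report_prob": 0.9 if i < 10 else 0.05}  # first 10 are busy IT corridors
    for i in range(100)
]

def collect_reports(roads):
    """Each pothole is reported only with its road's reporting probability."""
    return {
        r["name"]: sum(random.random() < r["report_prob"]
                       for _ in range(r["potholes"]))
        for r in roads
    }

reports = collect_reports(roads)
busy = sum(reports[f"road_{i}"] for i in range(10))
quiet = sum(reports[f"road_{i}"] for i in range(10, 100))
total_busy, total_quiet = 10 * 20, 90 * 20
print(f"busy roads:  {busy}/{total_busy} potholes reported")
print(f"quiet roads: {quiet}/{total_quiet} potholes reported")
```

By construction every road is equally bad, yet the reported data covers the busy corridors far more completely than the quiet roads. The data set is large, but it is nowhere near “N = All”.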
There must always be a question about who and what is missing, especially with a messy pile of found data. N = All is often an assumption, rather than a fact about the data.
In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative statement highlighting a serious issue. One of the key ideas behind Ioannidis’s work is the “multiple-comparisons problem”. When examining a pattern in data, scientists are trained to ask whether such a pattern might have emerged by chance (basically, questioning the WHY). If it is unlikely that the observed pattern could have emerged at random, then that pattern is considered “statistically significant”. The multiple-comparisons problem arises when you start looking at many possible patterns. This problem is more serious in found data sets, where just by applying sophisticated algorithms, one can generate vastly more possible patterns than there are data points to compare. Thus, without careful analysis and questioning the WHY, it is inevitable that you will end up with a dismal ratio of genuine to spurious patterns (of signal to noise). The idea that “with enough data, the numbers speak for themselves” seems hopelessly naïve to me, especially in the context of found data sets, where spurious patterns vastly outnumber genuine discoveries.
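The multiple-comparisons problem is easy to reproduce. The sketch below compares one random series against 1,000 other random series; every number is pure noise, so any “significant” correlation is spurious by construction. The threshold of 0.632 is roughly the textbook critical value of |r| for a two-sided 5% test with 10 data points, and the sizes chosen are arbitrary assumptions.

```python
import random
from math import sqrt

random.seed(42)  # fixed seed so the illustration is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

n_points = 10        # observations per series
n_candidates = 1000  # unrelated series we test against the target

target = [random.gauss(0, 1) for _ in range(n_points)]
candidates = [[random.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_candidates)]

threshold = 0.632  # approx. |r| needed for p < 0.05 (two-sided) with 10 points
hits = [c for c in candidates if abs(pearson(c, target)) > threshold]
print(f"{len(hits)} of {n_candidates} pure-noise series look 'significant'")
```

A 5% false-positive rate sounds small until you run a thousand tests: you should expect dozens of convincing-looking patterns with nothing behind them.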
Fundamentally, asking whether correlation is enough is asking the wrong question. In my view, the key question should be: “Good that you have so many correlations, but can you take action on the basis of these findings?” A BCG article nicely outlines a systematic approach to this question, based on two factors:
- Confidence that the correlation will reliably recur in the future. The higher that confidence level, the more reasonable it is to take action in response.
- The tradeoff between the risk and reward of acting. If the risk of acting and being wrong is extremely high, acting on even a strong correlation may be a grave mistake.
For example, if you combine data from a supermarket’s loyalty card program with auto claims information, you will find interesting correlations. The data might show you that people who buy meat and milk are good car insurance risks, while people who buy pasta and spirits are poor risks. Though this statistical relationship could be an indicator of risky behaviors (driving under the influence of spirits, for example), there are a number of other possible reasons for the finding.
Now that you have found these interesting correlations, you can do one of two things:
- Target insurance marketing to loyalty card holders in the low-risk group.
- Price car insurance based on these buying patterns.
The latter approach, however, could lead to a brand-damaging backlash should the practice be exposed. Thus, without additional confidence in the finding, the former approach is preferable.
Basically, it comes down to the problem at hand. If the goal is to answer “what is happening?”, then all you need is to expose the data to powerful algorithms and, as if by magic, you will find numerous trends. However, if the goal is to understand “why”, then you’ll need to go beyond correlation to get at causation. It’s great to know WHAT shoppers are doing, but understanding WHY they are doing it puts you in a much more powerful position, from which you can influence shopping behavior.
Enough about the importance of causation. As a data scientist, what best practices can you follow to reduce the likelihood of accepting spurious statistical correlations as facts? Here are some useful tips in that regard:
- Ensemble Learning: Do not base your analysis on a single model. Instead, use multiple independent models that draw on the same data set but are trained on different samples, employ different algorithms, and use different variables, and see whether they converge on a common statistical pattern. If they do, you can have greater confidence that the correlations they reveal are robust and not artifacts of a single model, though convergence alone does not prove causation.
- A/B Testing: Similar in spirit to ensemble learning but different in implementation. You develop alternative models in which some variables differ while others are held constant, and evaluate which model best predicts the dependent variable of interest. This is also known as the “champion” and “challenger” approach: after successive runs of the models on newer data sets, you eventually converge on the set of variables with the highest predictive value.
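One narrow piece of these tips, re-checking a pattern on different samples of the same data, can be sketched with bootstrap resampling. This is a simplified stand-in, not a full ensemble or champion/challenger setup: the data is synthetic, and `bootstrap_interval` is a hypothetical helper written for this illustration.

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the illustration is repeatable

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0

def bootstrap_interval(xs, ys, n_resamples=500, lo=0.05, hi=0.95):
    """Range of correlations seen when the pairs are resampled with replacement."""
    pairs = list(zip(xs, ys))
    results = []
    for _ in range(n_resamples):
        sample = [random.choice(pairs) for _ in pairs]
        sx, sy = zip(*sample)
        results.append(pearson(sx, sy))
    results.sort()
    return results[int(lo * n_resamples)], results[int(hi * n_resamples)]

# Synthetic data: y_linked depends on x by construction, y_noise does not.
x = [random.gauss(0, 1) for _ in range(200)]
y_linked = [0.5 * xi + random.gauss(0, 1) for xi in x]
y_noise = [random.gauss(0, 1) for _ in x]

linked_lo, linked_hi = bootstrap_interval(x, y_linked)
noise_lo, noise_hi = bootstrap_interval(x, y_noise)
print(f"linked: correlation stays within [{linked_lo:.2f}, {linked_hi:.2f}]")
print(f"noise:  correlation stays within [{noise_lo:.2f}, {noise_hi:.2f}]")
```

A pattern whose interval stays well away from zero across resamples is at least stable; one whose interval straddles zero deserves suspicion. Neither outcome, on its own, says anything about cause.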
When applied consistently in your work as a data scientist, these approaches can help ensure that the patterns you’re revealing actually reflect the real-world behavior of the domain you are trying to model. Without these in your toolkit, you can’t be confident that the correlations you’re seeing won’t vanish the next time you run your statistical model.
Lastly, if you are interested in learning more about spurious correlations, visit “Spurious Correlations”, a website set up by Harvard student Tyler Vigen, where you will find plenty of amusing examples showing that just because two trends seem to fluctuate in tandem, it does not necessarily mean that the correlation has any meaningful significance.
More from Soumendra Mohanty