Friday, January 03, 2020

Correlation is not Causation #Analytics

A mantra that data analysts and data scientists learn very early on is "Correlation is not causation". The strength of a correlation is usually measured using Pearson's or Spearman's correlation coefficient, both of which take values between -1 and +1. These measures tell us only how strongly, and in which direction, two variables move together. Even if we get a value as high as 0.9 (a strong positive correlation), we still cannot say that a change in one variable causes a change in the other. Causation is not established.
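
To make the two coefficients concrete, here is a minimal sketch in plain Python. The helper names (`pearson`, `ranks`, `spearman`) and the data are made up for illustration; in practice you would more likely reach for a statistics library.

```python
import math

def pearson(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson(ranks(x), ranks(y))

# Made-up data: y always increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Spearman's coefficient comes out at 1 here because the ordering of the two variables matches perfectly, while Pearson's is a little lower because the relationship is not linear. Neither number says anything about which variable, if either, drives the other.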

For any two correlated events A and B, the following four relationships are possible:

  1. A causes B
  2. B causes A
  3. A and B are consequences of a common cause, but do not cause each other
  4. There is no connection between A and B; the correlation is coincidental
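
The third case is the one that catches people out most often, so here is a small simulation of it. This is only a sketch under invented assumptions: the variable `common`, the noise levels and the sample size are all made up, and A and B are built so that neither influences the other.

```python
import math
import random

def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical common cause (assumption: standard normal values).
common = [rng.gauss(0, 1) for _ in range(n)]

# A and B each depend on the common cause plus their own independent noise.
# Neither A nor B appears in the other's formula.
a = [c + rng.gauss(0, 0.5) for c in common]
b = [c + rng.gauss(0, 0.5) for c in common]

r_ab = pearson(a, b)

# Subtract the common cause (its true coefficient is 1 by construction)
# and correlate what is left over.
a_rest = [ai - c for ai, c in zip(a, common)]
b_rest = [bi - c for bi, c in zip(b, common)]
r_rest = pearson(a_rest, b_rest)

print("Correlation of A and B:           ", round(r_ab, 2))
print("After removing the common cause:  ", round(r_rest, 2))
```

A and B show a strong correlation, yet once the common cause is accounted for, the leftover parts are essentially uncorrelated. With real data the common cause is usually unknown or unmeasured, which is exactly why the correlation alone cannot settle the question.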

So what should we do? If a correlation is established, further investigation is needed to see whether there is also a causal relationship. Ideally this means a controlled study in the form of an experiment. For example, as you drink more coffee, the number of hours you stay awake increases (see a great list of Common Correlations here). An experiment to test whether this relationship is causal would be easy to set up: randomly assign volunteers to drink different amounts of coffee (measured with the same cup size) and time how long they stay awake. It would be important to have a control group who do not drink any coffee. If the groups that drink more coffee stay awake longer than the control group, the experiment provides strong evidence of a causal relationship between drinking coffee and staying awake.
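
The design of such an experiment can be sketched in a few lines of Python. Everything numeric below is an assumption made up for the illustration: 200 volunteers, four dose levels, and a simulated outcome in which each cup adds half an hour awake. Real measurements would replace the hypothetical `simulated_hours_awake` function.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Random assignment: shuffle the volunteers, then deal them into groups.
volunteers = list(range(200))
rng.shuffle(volunteers)

doses = [0, 1, 2, 3]  # cups of coffee; 0 cups is the control group
groups = {dose: volunteers[i::len(doses)] for i, dose in enumerate(doses)}

def simulated_hours_awake(cups):
    """Invented outcome model, for illustration only:
    16 hours baseline, +0.5 hours per cup, plus random variation."""
    return 16 + 0.5 * cups + rng.gauss(0, 1)

# "Run" the experiment and record each volunteer's hours awake.
results = {
    dose: [simulated_hours_awake(dose) for _ in members]
    for dose, members in groups.items()
}

group_means = {dose: mean(hours) for dose, hours in results.items()}
for dose in doses:
    print(f"{dose} cups: mean hours awake = {group_means[dose]:.2f}")
```

Because the volunteers are assigned to groups at random, things like age or sleeping habits should be spread roughly evenly across the groups, so a difference between the control group and the coffee groups can be attributed to the coffee rather than to a hidden common cause.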

Image source: https://www.explainxkcd.com/wiki/index.php/552:_Correlation

Statistics is not an exact science, mostly because we are dealing with samples instead of populations. While we can be 95% or 99% confident of a correct result, we can never be 100% confident; there is always uncertainty. A correlation between two variables carries the same uncertainty, because it too is usually calculated from samples. Be careful with experimental design, as any bias or non-random sampling will compromise your research work.
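
One way to put a number on that uncertainty is a confidence interval around the sample correlation. The sketch below uses the Fisher z-transformation, a standard approximation that assumes the two variables are roughly bivariate normal. The function name and the sample size of 30 are assumptions chosen for illustration; the 0.9 is the example value used earlier in this post.

```python
import math
from statistics import NormalDist

def correlation_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson correlation r
    measured on n pairs, using the Fisher z-transformation."""
    z = math.atanh(r)                # transform r to an approximately normal scale
    std_error = 1 / math.sqrt(n - 3)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = math.tanh(z - critical * std_error)
    upper = math.tanh(z + critical * std_error)
    return lower, upper

# Hypothetical sample: r = 0.9 measured on 30 pairs of observations.
low_95, high_95 = correlation_confidence_interval(0.9, 30, confidence=0.95)
low_99, high_99 = correlation_confidence_interval(0.9, 30, confidence=0.99)

print(f"95% interval: {low_95:.2f} to {high_95:.2f}")
print(f"99% interval: {low_99:.2f} to {high_99:.2f}")
```

The 99% interval is wider than the 95% one: the more confident we want to be, the wider the range of values we have to allow for. A larger sample narrows both intervals, but neither ever shrinks to a single point.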
