A mantra that data analysts and scientists learn very early on is "Correlation is not causation". The strength of a correlation is usually measured with Pearson's or Spearman's correlation coefficient, which takes values between -1 and +1. These measures tell us only how strongly, and in which direction, two variables are related. Even if we get a value as high as 0.9 (a strong positive correlation), we still cannot say that a change in one variable causes a change in the other. Causation is not established.
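To make the two coefficients concrete, here is a minimal sketch that computes both using only the Python standard library. The helper names (`pearson`, `ranks`, `spearman`) and the small coffee data set are made up for illustration. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks of the data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented illustration data, not real measurements.
cups = [0, 1, 2, 3, 4, 5]
hours_awake = [14, 15, 15.5, 17, 17.5, 19]

print(f"Pearson:  {pearson(cups, hours_awake):.3f}")
print(f"Spearman: {spearman(cups, hours_awake):.3f}")
```

Both values come out close to +1 for this data, yet neither number says anything about which variable, if either, drives the other.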
For any two correlated events A and B, the following four relationships are possible:
- A causes B
- B causes A
- A and B are consequences of a common cause, but do not cause each other
- There is no connection between A and B; the correlation is coincidental
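The third case, a common cause, can be illustrated with a small simulation. This is only a sketch under invented assumptions: a hidden variable `z` drives both `a` and `b`, which never influence each other. With the noise levels chosen here the theoretical correlation between `a` and `b` is 0.8, and it should drop to roughly zero once the contribution of `z` is subtracted out.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    var_x = sum((p - mean_x) ** 2 for p in x)
    var_y = sum((q - mean_y) ** 2 for q in y)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical common cause z; a and b each depend on z but not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
a = [zi + rng.gauss(0, 0.5) for zi in z]
b = [zi + rng.gauss(0, 0.5) for zi in z]

r_ab = pearson(a, b)

# Subtract the part explained by z; what remains is independent noise.
a_rest = [ai - zi for ai, zi in zip(a, z)]
b_rest = [bi - zi for bi, zi in zip(b, z)]
r_rest = pearson(a_rest, b_rest)

print(f"corr(a, b)               = {r_ab:.2f}")
print(f"corr(a, b) with z removed = {r_rest:.2f}")
```

In real data the common cause is rarely known or measured, which is why a strong coefficient on its own cannot distinguish this case from the first two.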
So what should we do? If a correlation is established, further investigation is needed to see whether there is also a causal relationship. To do this we need a controlled study in the form of an experiment. For example, as you drink more coffee, the number of hours you stay awake increases (see a great list of Common Correlations here). An experiment to test whether this relationship is causal would be easy to set up: get volunteers to drink different amounts of coffee (measured with the same cup size) and time how long they stay awake. It would be important to include a control group who do not drink any coffee, and to assign volunteers to the groups at random. Such an experiment should provide strong evidence on whether drinking coffee causes people to stay awake longer.
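The design of the coffee experiment can be sketched in code. Everything numeric below is an assumption made up for the simulation (200 volunteers, a baseline of 16 hours awake, an effect of 0.8 hours per cup); it is not real data. The point is the structure: volunteers are shuffled before being split into groups, and the 0-cup group acts as the control.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Random assignment: shuffle the volunteer IDs, then deal them into groups.
volunteers = list(range(200))
rng.shuffle(volunteers)
doses = [0, 1, 2, 3]  # cups of coffee; 0 is the control group
groups = {dose: volunteers[i::len(doses)] for i, dose in enumerate(doses)}

ASSUMED_BASELINE_HOURS = 16.0   # invented for the simulation
ASSUMED_EFFECT_PER_CUP = 0.8    # invented for the simulation

def simulate_hours_awake(cups):
    """Stand-in for a real measurement: baseline + assumed effect + noise."""
    return ASSUMED_BASELINE_HOURS + ASSUMED_EFFECT_PER_CUP * cups + rng.gauss(0, 1)

results = {
    dose: [simulate_hours_awake(dose) for _ in members]
    for dose, members in groups.items()
}
means = {dose: statistics.mean(hours) for dose, hours in results.items()}

for dose in doses:
    print(f"{dose} cups: n={len(results[dose])}, mean hours awake={means[dose]:.2f}")
print(f"Difference, 3 cups vs control: {means[3] - means[0]:.2f} hours")
```

Because assignment is random, other influences on sleep should be spread evenly across the groups on average, so a difference between the group means can be attributed to the coffee rather than to who chose to drink it.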
Image source: https://www.explainxkcd.com/wiki/index.php/552:_Correlation
Statistics is not an exact science, mostly because we work with samples rather than whole populations. While we can be 95% or 99% confident of a result, we can never be 100% confident: there is always uncertainty. Comparing two variables involves the same uncertainty, as we are usually dealing with samples there too. Be careful with experimental design, as any bias or non-random sampling will compromise your research.
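As a rough illustration of the 95% versus 99% point, the sketch below computes two confidence intervals for the mean of one simulated sample. The sample is invented, the function name `mean_confidence_interval` is hypothetical, and the interval uses a normal approximation for simplicity (for small samples a t-distribution would be the more usual choice).

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Invented sample: 40 simulated "hours awake" measurements.
sample = [rng.gauss(16, 1.5) for _ in range(40)]

def mean_confidence_interval(data, level):
    """Normal-approximation confidence interval for the mean of data."""
    mean = statistics.mean(data)
    std_err = statistics.stdev(data) / math.sqrt(len(data))
    # Two-sided critical value, e.g. about 1.96 for level=0.95.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_err, mean + z * std_err

ci95 = mean_confidence_interval(sample, 0.95)
ci99 = mean_confidence_interval(sample, 0.99)

print(f"95% interval: {ci95[0]:.2f} to {ci95[1]:.2f}")
print(f"99% interval: {ci99[0]:.2f} to {ci99[1]:.2f}")
```

Asking for more confidence widens the interval, and no finite sample gives an interval that corresponds to 100% confidence.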