An interesting set of blog posts that I've been reading recently is hosted on Data Science Central and is written by Stephanie Glen. Her recent post "Statistics for Data Science in One Picture" captures the essence of Statistics in an easy-to-understand graphic that should help students grasp this wide-ranging subject. I intend to refer to this graphic in my next set of Introduction to Statistics notes. Students reading this post should check out the Data Science Central blog.
Glen's post got me thinking about which parts of Statistics are essential for a Data Scientist to know. Her graphic below covers basic probability and statistics - but does not mention actual tests like ANOVA and Chi-Square, which are behind everything you see on this chart:
Image source: Data Science Central (by kind permission of Stephanie Glen).
When I finished my postgraduate studies (in 1987!), I remember saying to myself, "Thank goodness I will never have to perform multivariate analysis or an ANOVA again." Little did I know at the time that 25 years later I would be teaching two modules on Statistics, and that this subject would become a hugely enjoyable part of my academic life. Statistics is the Science of Data, and if we are to be analytical and accurate in our data analysis - the study of statistics must form part of our training.
I doubt that many data analysis reports being written today contain the results of a t test (which compares the means of two normally distributed data sets) or a one-way ANOVA (Analysis of Variance, which compares the means of three or more samples). But they will contain charts and data tables so that links, trends, patterns, and relationships can be identified and analysed. Dashboards can summarize huge amounts of data in a small space, but I've never seen one that displays a p value.
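For readers who have not met these two tests, here is a minimal sketch of the statistics behind them, written in Python using only the standard library. The function names (`welch_t`, `one_way_anova_f`) and the data are my own made-up illustrations, not anything taken from Glen's chart. The t statistic here is Welch's version, and the ANOVA is the simplest one-way form that compares three or more samples. Turning either statistic into a p value needs the t or F distribution, which a statistics package would normally supply.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def one_way_anova_f(*groups):
    """F statistic: between-group variance divided by within-group variance."""
    k = len(groups)                      # number of samples being compared
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up measurements for three hypothetical groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [6.2, 6.8, 6.1, 7.0, 6.5]
group_c = [5.9, 6.0, 5.7, 6.3, 6.1]

t_stat = welch_t(group_a, group_b)
f_stat = one_way_anova_f(group_a, group_b, group_c)
print(f"t statistic (a vs b): {t_stat:.3f}")
print(f"F statistic (a, b, c): {f_stat:.3f}")
```

A large absolute t value, or a large F value, suggests that the difference between the sample means is big compared with the spread inside the samples.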
However, if you want to classify data, make predictions and recommendations through machine learning - then you have to start with Bayesian statistics as Glen suggests in her chart. If you want to decide whether to include or exclude outlying data - then you have to understand central tendency and probability distributions. If you want to search for clusters or groups of data - then you have to study methods such as PCA (Principal Component Analysis) and correlation. If you want to take the guesswork out of data analysis - then you have to perform statistical tests and understand p values.
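Two of the points above can be sketched in a few lines of Python, again using only the standard library. This is an illustration under my own assumptions rather than a recommended workflow: the helper names and the data are invented, the outlier check uses Tukey's 1.5 x IQR rule (one common convention among several), and the p value comes from a permutation test, which I have swapped in for a classical t test because it needs no distribution tables.

```python
import random
from statistics import mean, quantiles


def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Estimate a two-sided p value for a difference in means by shuffling labels."""
    rng = random.Random(seed)            # fixed seed so the result is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up data: one reading is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print("Possible outliers:", iqr_outliers(readings))

# Made-up data: two samples whose means look different.
sample_a = [5.1, 4.9, 5.6, 5.8, 5.0]
sample_b = [6.2, 6.8, 6.1, 7.0, 6.5]
p_value = permutation_p_value(sample_a, sample_b)
print(f"Estimated p value: {p_value:.4f}")
```

The p value estimates how often a difference at least as large as the observed one would appear if the group labels were meaningless. A small value is evidence that the difference is not just chance.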
In short, if you want to be a great data scientist - you have to study statistics!