Students in my Statistics classes often ask me why they are learning about t-Tests, ANOVA, correlation, and regression in our Data Analytics programmes. In the first of two posts, I'll try to answer this question.
First - let's take a look at where statistics fits into data science/analytics. Bob Hayes (@bobehayes), writing in Business 2 Community, asks the question "Statistics: Is This Big Data’s Biggest Hurdle?". Here he places statistics as one of three "primary pillars of the field of data science". The other two are domain knowledge and computer science skills.
Clearly, context is important in any analysis, so domain knowledge helps us to ask the right questions and to make sense and use of the data. The computer science skills help us to prepare data, sort it, and set it up for analysis. Statistics gives us answers to our questions.
The simplest form of statistics is "Descriptive Statistics". As the name suggests, this helps us to make basic descriptions of our data, such as the average (the most commonly used statistic), the maximum value, the minimum value, and the range, as well as the variance (for me, possibly the most important statistic of them all). Take a look at the following simple data set showing IQ scores for 30 people, and descriptive statistics for these data:
The descriptive statistics are generated in Excel using the Data Analysis tool in the Analysis ToolPak add-in. Suddenly a set of potentially meaningless numbers has value. We now know the highest value, the lowest, the average (mean), the mode (most commonly occurring value), and how much the data vary (variance/standard deviation) - we can tell much more about these data than a simple column of figures can tell us. We now have meaning using statistics. This is data analysis.
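For readers who would rather work in code than in Excel, the same kind of summary can be sketched in Python using only the standard library. This is a minimal illustration, not the method used to produce the figures above: the ten scores in iq_scores are invented for the example (they are not the 30 IQ scores in this post), and the describe function is a hypothetical helper name. It reports the sample (n - 1) variance and standard deviation.

```python
import statistics

# Hypothetical IQ scores, invented purely for illustration.
# They are NOT the 30 scores from the data set shown in this post.
iq_scores = [100, 95, 110, 105, 100, 90, 115, 100, 85, 120]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # most commonly occurring value
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "sample_variance": statistics.variance(values),  # divides by n - 1
        "sample_std_dev": statistics.stdev(values),      # square root of the above
    }


summary = describe(iq_scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Swapping your own column of numbers into iq_scores gives the equivalent summary for any data set. If you want the population versions instead, the same module offers statistics.pvariance and statistics.pstdev.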