Wednesday, September 21, 2016

Some things you might not know about data. #Analytics #HDSDA #102

Statistics can be a confusing subject at the best of times - there are so many different formulas and test statistics (t, F, r, H, s, U, etc.) that it can be difficult for students to figure out what's what, and what each one is used for. The simplest of all descriptive statistics is the average. Everyone knows that to calculate the average of a series of numbers, all you have to do is add them up and divide by the number of values. Or is it?

Averages are measures of central tendency, and there are in fact three flavours of average: the mean, the median, and the mode. Each of these gives us related, but different, descriptive information about data. Most of the time when we talk about an average value we are in fact talking about the mean. Microsoft Excel confuses this a bit further - if you want to use Excel to calculate the mean, you use the "average()" function.

Let's take a look at a simple set of data showing the annual salaries of five people:

Name    Salary
John    €25,000
Mary    €27,500
Mike   €112,000
Jane    €34,500
Hugh    €48,750

To calculate the mean, simply add up the salaries and divide by the number of people:

(€25,000 + €27,500 + €112,000 + €34,500 + €48,750) / 5 = €49,550

So, the mean salary is €49,550. While this is a useful metric for these data, you will have noticed that one of the values (Mike/€112,000) is way bigger than all the others, which tells us that our data are skewed (skewness is another descriptive statistic). Our mean value is greater than four out of the five values given, and consequently has limited value as a measure of central tendency in describing this dataset.
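For readers who prefer code to a calculator, here is a minimal Python sketch of the same calculation. It assumes Python's standard `statistics` module; the variable names are just for illustration.

```python
from statistics import mean

# Annual salaries (in euro) from the table above
salaries = {
    "John": 25_000,
    "Mary": 27_500,
    "Mike": 112_000,
    "Jane": 34_500,
    "Hugh": 48_750,
}

# Add up the values and divide by how many there are
mean_salary = mean(list(salaries.values()))
print(mean_salary)

# Who earns less than the mean? With skewed data, most people do
below_mean = [name for name, salary in salaries.items() if salary < mean_salary]
print(below_mean)
```

Four of the five names end up in `below_mean`, which is the same point made above: one large value drags the mean away from where most of the data sit.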

The median is also an average, but of a different kind. It is the mid-point of a set of scores. To determine the median we simply rank the scores and pick out the middle value. Let's do this for the salary data above:

Name    Salary
Mike   €112,000
Hugh    €48,750
Jane    €34,500
Mary    €27,500
John    €25,000

Jane's salary of €34,500 is the median - a very different statistic compared to our mean value of €49,550. The median is much less sensitive to extreme values, and when you have extreme values the median is a better representation of central tendency than the mean. For example, in our data above - if Mike's salary were reduced to €50,000, the median would still be €34,500 (Jane), but the mean value would change (to €37,150). If there is an even number of values, take the mean of the two middle values to determine the median.
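The same comparison can be sketched in a few lines of Python, again assuming the standard `statistics` module. The four-person list at the end is a made-up illustration (the original data with Mike left out) to show the even-count rule; it is not part of the dataset above.

```python
from statistics import mean, median

# Original salaries: John, Mary, Mike, Jane, Hugh
salaries = [25_000, 27_500, 112_000, 34_500, 48_750]
original_median = median(salaries)   # middle value once sorted
original_mean = mean(salaries)

# Same data, but with Mike's salary reduced to 50,000
adjusted = [25_000, 27_500, 50_000, 34_500, 48_750]
adjusted_median = median(adjusted)   # unchanged by the smaller extreme value
adjusted_mean = mean(adjusted)       # moves a lot

print(original_median, original_mean)
print(adjusted_median, adjusted_mean)

# Illustration only: with an even number of values (Mike removed),
# the median is the mean of the two middle values, 27,500 and 34,500
four_people = [25_000, 27_500, 34_500, 48_750]
even_median = median(four_people)
print(even_median)
```

Note that `median()` sorts the values for you, so the order in the list does not matter.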

Finally, the mode is a third useful average - this is the value that occurs most frequently in our data. If you examine the salary data above, each value occurs only once, so there is no mode. So let's take a look at a different set of values:

Sample data: 6, 5, 7, 4, 6, 8, 7, 7, 3, 7

Here you can see that the value "7" occurs four times in this data set - therefore the mode is 7 (for these data the mean = 6, and the median = 6.5). As data analysts we need to be sure of our terminology and distinguish between the mean and an average. We also need to describe data with more than just the mean, because it can be misleading if the data are skewed (as in the salary example above).
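As a quick check of these three figures, here is a short Python sketch. It assumes the standard `statistics` and `collections` modules; counting with `Counter` is just one convenient way to find the most frequent value.

```python
from collections import Counter
from statistics import mean, median, mode

sample = [6, 5, 7, 4, 6, 8, 7, 7, 3, 7]

# Count how often each value occurs, then take the most frequent one
counts = Counter(sample)
most_common_value, occurrences = counts.most_common(1)[0]
print(most_common_value, occurrences)

sample_mode = mode(sample)
sample_mean = mean(sample)
sample_median = median(sample)
print(sample_mode, sample_mean, sample_median)

# For the salary data every value occurs exactly once, so there is no mode
salaries = [25_000, 27_500, 112_000, 34_500, 48_750]
highest_count = max(Counter(salaries).values())
print(highest_count)
```

When `highest_count` is 1, no value repeats and it makes little sense to report a mode at all.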

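If the mean, the median, and the mode are close together in value, this suggests that the data are roughly symmetric around the mean. Such data are often approximately normally distributed and would appear like a bell curve in a frequency histogram, although it is worth plotting the histogram to check rather than assuming it.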

If you would like to learn how to calculate the mean, median, and mode using Excel - check out my YouTube video below:
