## Wednesday, June 10, 2015

### The Perfect Norm - Basic Stats for Data Analysts #BigData #HDSDA

In dealing with large amounts of data there are hundreds, if not thousands, of different statistical tests that can be applied, and they can yield many different results. Understanding statistics is not easy, and even though I teach basic descriptive and inferential statistics, I know only the tip of the iceberg when it comes to statistical methods. Many tests assume that the population data are distributed normally so that we can make inferences based on samples. This is important basic information for a data analyst to understand.

In class I use some classic data of 100 house fly wing length measurements (Sokal and Hunter, 1955) to illustrate what a normal distribution looks like. These data are sometimes used to show an almost perfect normal distribution. Usually we look at a frequency distribution histogram, and if it looks like a bell-shaped (or Gaussian) curve, we assume normality. Here's the histogram for the fly wing length data plotted in Excel:
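For readers without Excel, a rough text version of the same kind of frequency histogram can be sketched in Python. The frequency table below is an assumption: it is my reconstruction of the wing length counts as they are commonly reproduced, not something copied from this post, and `text_histogram` is just an illustrative helper name.

```python
# Frequency table (wing length -> number of flies).
# ASSUMPTION: reconstructed version of the Sokal and Hunter (1955) data;
# check it against the original source before relying on it.
WING_LENGTH_COUNTS = {
    36: 1, 37: 1, 38: 2, 39: 2, 40: 4, 41: 6, 42: 7, 43: 8, 44: 9, 45: 10,
    46: 10, 47: 9, 48: 8, 49: 7, 50: 6, 51: 4, 52: 2, 53: 2, 54: 1, 55: 1,
}


def text_histogram(counts):
    """Return one line per value, with a bar of '#' as long as its count."""
    return [
        f"{value:>3} | {'#' * n} ({n})"
        for value, n in sorted(counts.items())
    ]


histogram_lines = text_histogram(WING_LENGTH_COUNTS)
print("\n".join(histogram_lines))
```

The bars rise steadily to a peak in the middle and fall away symmetrically on both sides, which is the bell shape we look for.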

Another simple way to check for normality is to see how close the Mean, Median, and Mode are. In the above data they are 45.5, 45.5, and 45.0 respectively, which is very close. There are also statistical tests that you can perform on the data to test for normality; the Shapiro-Wilk and Kolmogorov-Smirnov tests are the most commonly used. Other measures such as Skewness and Kurtosis are also good indicators of normality (values near zero are consistent with a normal distribution). Excel can produce several descriptive statistics:

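| Statistic | Value |
|---|---|
| Mean | 45.5 |
| Standard Error | 0.39 |
| Median | 45.5 |
| Mode | 45 |
| Standard Deviation | 3.92 |
| Sample Variance | 15.36 |
| Kurtosis | -0.29 |
| Skewness | 0 |
| Range | 19 |
| Minimum | 36 |
| Maximum | 55 |
| Sum | 4550 |
| Count | 100 |
| Confidence Level (95.0%) | 0.78 |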
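The same descriptives can be approximated outside Excel. The sketch below uses only the Python standard library and makes two assumptions: the frequency table is my reconstruction of the wing length data (not copied from this post), and the `excel_style_skew` and `excel_style_kurt` helpers are hypothetical names for the sample-adjusted skewness and excess kurtosis formulas that Excel documents for its `SKEW` and `KURT` functions.

```python
import math
import statistics

# ASSUMPTION: reconstructed frequency table (wing length -> count) for the
# Sokal and Hunter (1955) data; verify against the original before reuse.
WING_LENGTH_COUNTS = {
    36: 1, 37: 1, 38: 2, 39: 2, 40: 4, 41: 6, 42: 7, 43: 8, 44: 9, 45: 10,
    46: 10, 47: 9, 48: 8, 49: 7, 50: 6, 51: 4, 52: 2, 53: 2, 54: 1, 55: 1,
}

# Expand the table into one value per fly, in ascending order.
wing_lengths = [
    value for value, n in sorted(WING_LENGTH_COUNTS.items()) for _ in range(n)
]

count = len(wing_lengths)
mean = statistics.mean(wing_lengths)
median = statistics.median(wing_lengths)
sd = statistics.stdev(wing_lengths)           # sample standard deviation
variance = statistics.variance(wing_lengths)  # sample variance
std_error = sd / math.sqrt(count)

# List every value that shares the highest count, so ties are visible.
top_count = max(WING_LENGTH_COUNTS.values())
modes = [v for v, c in sorted(WING_LENGTH_COUNTS.items()) if c == top_count]


def excel_style_skew(data):
    """Sample-adjusted skewness: n/((n-1)(n-2)) * sum(z**3)."""
    n, m, s = len(data), statistics.mean(data), statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)


def excel_style_kurt(data):
    """Sample-adjusted excess kurtosis (0 for a normal distribution)."""
    n, m, s = len(data), statistics.mean(data), statistics.stdev(data)
    z4 = sum(((x - m) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - correction


skewness = excel_style_skew(wing_lengths)
kurtosis = excel_style_kurt(wing_lengths)

print(f"Mean {mean}, Median {median}, Mode(s) {modes}")
print(f"SD {sd:.2f}, Variance {variance:.2f}, Std Error {std_error:.2f}")
print(f"Skewness {skewness:.2f}, Kurtosis {kurtosis:.2f}")
print(f"Range {max(wing_lengths) - min(wing_lengths)}, Sum {sum(wing_lengths)}")
```

With this reconstructed table, 45 and 46 tie for the highest count. Excel's single Mode value of 45 would then simply be one of the two tied values, which is worth remembering when comparing Mean, Median, and Mode.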

For me, the above is a good starting point for examining any set of data. You can learn a lot from it!

Reference:
Sokal, R.R. and P.E. Hunter. 1955. A morphometric analysis of DDT-resistant and non-resistant housefly strains. Ann. Entomol. Soc. Amer. 48: 499-507.