Click here for an explanation of the animation below

Today I had an interesting exchange regarding tests of normality when teaching introductory statistics.

I have a dilemma. I do not place a great deal of stress on tests of normality when I teach Master’s students, although I mention that they are often used. However, introductory statistics texts give tests of normality a lot more attention, presumably to ensure that students are aware that normality is an important assumption for many statistical procedures. I’m all in favour of testing assumptions. But do students really know what assumptions they are testing?

I have to teach introductory statistics without confusing students or sending mixed messages. It is therefore quite a delicate matter that needs clarity.

In fact none of these statements are accurate (and it’s Smirno**ff** you are thinking of!). My own preference is to try to teach students to understand why any underlying population or sampling frame might not be normal. They should also intuitively understand how the procedure used for sampling from the population may influence the properties of the sample drawn from it.

These properties are then treated as expected before beginning any field work. Any data transformation or use of non-parametric tests is pre-planned as part of the formal protocol designed for data collection and analysis.

I really do not like any post-hoc alterations to a planned work scheme after the data are collected. At best they waste time, at worst they lead students to think that the data themselves are somehow “invalid” and thus unpublishable.

I therefore quite strongly dislike including post hoc tests of normality within the workflow of the analysis as a knee-jerk procedure with a yes/no answer. This certainly does not mean that I tell students to assume that all the pre-planned analyses are necessarily valid, nor to accept that inference on the mean can be conducted without checking assumptions.

The alternative to automated tests of normality is to make sure that students always visualise the distribution of their data fully in order to understand why any assumption of normality may be wrong. I also try to encourage students to understand how and why data transformations might work. Again this is usually most helpful before the data are collected, but it is also a way to deal with major surprises.
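As a rough illustration of why a transformation might work, here is a minimal Python sketch using only the standard library. The log-normal population, its parameters and the sample size are assumptions chosen purely for illustration. They do not come from any real data set. The sketch compares the sample skewness of right-skewed values before and after taking logs.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (assumed log-normal, not real data).
rng = random.Random(42)
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]

# Taking logs of log-normal values gives values that are normal by construction.
logged = [math.log(v) for v in raw]

print("skewness of raw values:   ", round(skewness(raw), 2))
print("skewness of logged values:", round(skewness(logged), 2))
```

The point is not the numbers themselves. If the process generating the data is expected to be multiplicative, the log transformation can be written into the protocol before any data are collected.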

Here again is the link to the pdf document I wrote that suggests a possible answer to the poll.

Click on the link above, as it is easier to include PDFs in WordPress this way.

And here is a quick test of any interpretation of the results of a KS test of normality.

Just to summarise the well-known reason to avoid testing for normality: if you draw a very large sample from a slightly non-normal population, the test tends to provide low p-values. You should presumably reject the null hypothesis that the data could have come from a normal population, and according to a strict interpretation you then can’t use your planned analysis as it would be “invalid”!

However, if you draw small samples from very non-normal populations (as shown in the pdf) you will not reject the null hypothesis as often, even though the methodology will provide misleading inference.
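The contrast between these two situations can be checked with a small simulation. What follows is only a sketch under stated assumptions, not the code behind the pdf or the animation. The two populations (a standardised gamma with shape 50 as the "slightly non-normal" case and a shifted exponential as the "very non-normal" case), the sample sizes and the number of replicates are all my own choices. Both populations have mean 0 and standard deviation 1 by construction, so the KS test is run against a fully specified N(0, 1), which avoids the separate issue of estimating the mean and standard deviation from the sample. The p-value uses the asymptotic Kolmogorov series, which is only approximate for small samples. In practice a library routine such as `scipy.stats.kstest` would normally be used instead of the hand-written functions.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample):
    """Largest gap between the empirical CDF and the N(0, 1) CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n):
    """Asymptotic Kolmogorov p-value; only approximate for small n."""
    lam = math.sqrt(n) * d
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def slightly_skewed(rng):
    """Assumed population 1: standardised gamma(shape=50), mild right skew."""
    return (rng.gammavariate(50.0, 1.0) - 50.0) / math.sqrt(50.0)


def very_skewed(rng):
    """Assumed population 2: exponential shifted to mean 0, strong right skew."""
    return rng.expovariate(1.0) - 1.0


def rejection_rate(draw, n, reps, rng, alpha=0.05):
    """Share of replicate samples of size n for which the KS test rejects."""
    rejected = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        if ks_pvalue(ks_statistic(sample), n) < alpha:
            rejected += 1
    return rejected / reps


rng = random.Random(2024)
large_rate = rejection_rate(slightly_skewed, n=40000, reps=10, rng=rng)
small_rate = rejection_rate(very_skewed, n=10, reps=500, rng=rng)
print("large samples, slightly non-normal: rejection rate", large_rate)
print("small samples, very non-normal:     rejection rate", small_rate)
```

Under these assumptions the large samples from the mildly skewed population should be rejected almost every time, while the small samples from the strongly skewed population should be rejected far less often. Changing the sample sizes or the shape parameter is a quick way to see how much the yes/no answer depends on n rather than on how non-normal the population is.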

If you ever want to read a reader’s feedback 🙂 , I rate this post for 4/5. Detailed info, but I have to go to that damn msn to find the missed parts. Thank you, anyway!

I have a problem understanding my datasets. I have tested the variables (temperature and humidity) using the Kolmogorov-Smirnov test (datasets of more than 2500 values) and it came out that the data were not normally distributed, with a p-value < .05. Therefore non-parametric tests should be carried out. What worries me is that non-parametric tests are meant for nominal and ordinal measures, but the tested variables are in fact ratio values. Do you have any suggestions regarding my problem? Thanks

If the response consists of ratios in the sense of proportions, then the values cannot be normally distributed on a priori grounds. They are bounded by zero and one. If the range of values falls around 0.5 and the variability is not too large then they might be approximately normal, but that is the best you can hope for. The whole point of the post was to point out that KS tests should not be used uncritically. The KS test (and other similar procedures) often provides an unhelpful answer to the wrong question. If you have a lot of data then KS tests will almost always be significant, but that does not necessarily rule out the use of procedures based on normal inference if the distribution does not in fact depart radically from normal. Look at the data or the residuals using histograms and QQ plots. Non-parametric tests can be uninformative, and are rarely the best choice for large data sets. It sounds as if you should be looking at something along the lines of beta regression as a model.
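To make the suggestion about QQ plots a little more concrete, here is a minimal Python sketch using only the standard library. It computes the points that a normal QQ plot would display and summarises how straight they are with a correlation coefficient. The simulated data, the plotting positions (i + 0.5)/n and the use of a correlation as a summary are my own assumptions for illustration. They are not the commenter's temperature and humidity data, and looking at the plot itself remains more informative than any single number.

```python
import math
import random
from statistics import NormalDist


def qq_points(sample):
    """Pairs (theoretical normal quantile, observed value) for a QQ plot."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def qq_correlation(sample):
    """Correlation of the QQ points; values near 1 mean a nearly straight line."""
    points = qq_points(sample)
    return pearson([p[0] for p in points], [p[1] for p in points])


# Hypothetical data: a mildly skewed and a strongly skewed sample.
rng = random.Random(7)
mild = [rng.gammavariate(50.0, 1.0) for _ in range(20000)]
strong = [rng.expovariate(1.0) for _ in range(20000)]

print("QQ correlation, mildly skewed sample:  ", round(qq_correlation(mild), 4))
print("QQ correlation, strongly skewed sample:", round(qq_correlation(strong), 4))
```

With a sample this large a KS test would be expected to flag the mildly skewed data as non-normal, yet its QQ points should lie very close to a straight line. The strongly skewed sample should bend away from the line much more clearly. The pairs returned by `qq_points` can be passed to any plotting tool to draw the plot itself.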