Duncan Golicher’s weblog

Research, scripts and life in Chiapas

Posts Tagged ‘inference’

Kolmogorov-Smirnov tests of normality


[Animation: ks]

Today I had an interesting exchange regarding tests of normality when teaching introductory statistics.

I have a dilemma. I do not place a great deal of stress on tests of normality when I teach Master’s students, although I mention that they are often used. However, introductory statistics texts give tests of normality a lot more attention, presumably to make sure that students are aware that normality is an important assumption of many statistical procedures. I’m all in favour of testing assumptions. But do students really know which assumptions they are testing?
I have to teach introductory statistics without confusing students or sending mixed messages, so this is quite a delicate matter that needs clarity.

In fact none of these statements is accurate (and it’s Smirnoff you are thinking of!). My own preference is to try to teach students to understand why any underlying population or sampling frame might not be normal. They should also understand intuitively how the procedure used for sampling from the population may influence the properties of the sample drawn from it.

These properties should then be anticipated before any field work begins. All data transformations and uses of non-parametric tests are pre-planned as part of the formal protocol designed for data collection and analysis.

I really do not like post hoc alterations to a planned work scheme after the data have been collected. At best they waste time; at worst they lead students to think that the data themselves are somehow “invalid” and thus unpublishable.

I therefore quite strongly dislike including post hoc tests of normality in the analysis workflow as a knee-jerk procedure with a yes/no answer. This certainly does not mean that I tell students to assume that all the preplanned analyses are necessarily valid, or that inference on the mean can be conducted without checking assumptions.

The alternative to automated tests of normality is to make sure that students always visualise the full distribution of their data, so that they understand why an assumption of normality may be wrong. I also try to encourage students to understand how and why data transformations might work. Again, this is usually most helpful before the data are collected, but it is also a way to deal with major surprises.
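As a minimal sketch of this kind of visual check (not taken from the PDF below), the R code draws a hypothetical right-skewed sample, assumed here to be lognormal, and compares histograms and normal Q-Q plots before and after a log transformation. The sample size, seed and the `skewness` helper are all assumptions made for illustration; `skewness` is a small function written here, not part of base R.

```r
# Hypothetical data: a right-skewed sample, assumed lognormal for illustration
set.seed(1)
x <- rlnorm(200, meanlog = 0, sdlog = 0.8)

# Simple moment-based skewness estimate (helper written for this example)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Look at the whole distribution, raw and log-transformed
op <- par(mfrow = c(2, 2))
hist(x, main = "Raw data", xlab = "x")
qqnorm(x, main = "Normal Q-Q plot: raw data")
qqline(x)
hist(log(x), main = "Log-transformed data", xlab = "log(x)")
qqnorm(log(x), main = "Normal Q-Q plot: log(x)")
qqline(log(x))
par(op)

# A numerical summary to go alongside the plots
round(c(raw = skewness(x), logged = skewness(log(x))), 2)
```

The plots, rather than a single p-value, show where and how the data depart from normality, and whether a transformation planned in advance is likely to help.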

Here again is the link to the PDF document I wrote that suggests a possible answer to the poll.

ksdemo2

Click on the link above; it is easier to include PDFs in WordPress this way.

And here is a quick test of how you would interpret the results of a KS test of normality.

To summarise the well-known reason to avoid testing for normality: if you draw a very large sample from a slightly non-normal population, the test tends to produce low p-values. You should then presumably reject the null hypothesis that the data could have come from a normal population and, according to a strict interpretation, you can’t use your planned analysis because it would be “invalid”!

However, if you draw small samples from very non-normal populations (as shown in the PDF), you will not reject the null hypothesis as often, even though the planned methodology can then provide misleading inference.

ksdemo3
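Both situations can be simulated in R. This is a minimal sketch, not the code behind the PDFs: the populations (a t distribution with 10 df standing in for "slightly non-normal", an exponential for "very non-normal"), the sample sizes, the seed and the number of replicates are all assumptions chosen for illustration. The normal parameters are fixed in advance rather than estimated from each sample, because the p-values of the standard KS test assume a fully specified distribution.

```r
set.seed(42)

# Case 1: very large samples from a slightly non-normal population.
# Assumed population: Student's t with 10 df, compared with the normal
# distribution that has the same mean (0) and sd (sqrt(10/8)).
df_t <- 10
sd_t <- sqrt(df_t / (df_t - 2))
p_large <- replicate(20, {
  x <- rt(1e5, df = df_t)
  ks.test(x, "pnorm", mean = 0, sd = sd_t)$p.value
})

# Case 2: small samples from a very non-normal population.
# Assumed population: exponential with rate 1 (mean 1, sd 1), compared
# with a normal distribution with mean 1 and sd 1.
sims <- replicate(2000, {
  x <- rexp(10, rate = 1)
  ci <- t.test(x)$conf.int  # nominal 95% t interval for the mean
  c(ks_p = ks.test(x, "pnorm", mean = 1, sd = 1)$p.value,
    covered = as.numeric(ci[1] <= 1 && 1 <= ci[2]))
})

# Proportion of samples where normality is "rejected" at 0.05, and how often
# the t interval actually contains the true mean in the small exponential samples
c(reject_large_slight = mean(p_large < 0.05),
  reject_small_severe = mean(sims["ks_p", ] < 0.05),
  t_interval_coverage = mean(sims["covered", ]))
```

With these settings the KS test should reject normality in nearly all of the huge t samples but only occasionally in the small exponential samples, while the nominal 95% t interval for the exponential mean tends to cover the true mean less than 95% of the time.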

Screening using comparisons between p-values?


There was an interesting exchange on the R-help list yesterday. A researcher proposed screening a large number of genes for a “significant” effect on survival time by running a large number of univariate significance tests. Several people on the list thought this really wasn’t a good idea. I have included the original messages from the list at the foot of this post.

The perhaps rather counterintuitive point is that, under the null hypothesis, a p value of 0.05 is just as likely as a p value of 0.5.

How can this be so? Put this way it does sound rather odd. But the p values produced under the null hypothesis are themselves random variables. They are uniformly distributed between 0 and 1. Why should any particular p value be more likely than any other? It is not. Data produced under the null hypothesis can range from the highly probable to the highly improbable without changing the fact that the null hypothesis was used in the data producing process. Watch this very simple simulation closely if you don’t follow this.

R will produce a vector of 10 numbers taken from a normal distribution with mean zero and sd 1 using the following command.

> rnorm(10,0,1)
[1] -0.3736018 -0.2327996 -0.8154836 0.3663073 1.0702547 1.3302237
[7] 1.3972863 1.2029137 -0.9293702 0.6351127

Each time R does this the sample mean will not be exactly zero, even though the numbers are drawn from a population with mean zero. A null hypothesis test gives the probability of obtaining data at least as extreme as these if the true population mean is zero (which here we know it is). In R this can be done by fitting a linear model with only an intercept and testing the significance of the intercept:

summary(lm(rnorm(10,0,1)~1))

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06043    0.33829   0.179    0.862

Residual standard error: 1.07 on 9 degrees of freedom

So this time I am not tempted to reject the null hypothesis.
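As a side note, the intercept-only model is the same test as a one-sample t test of the hypothesis that the mean is zero, so `t.test()` can be used instead of `lm()`. The sketch below checks that the two give the same p-value on one simulated sample; the seed is arbitrary.

```r
set.seed(123)
x <- rnorm(10, mean = 0, sd = 1)

# p-value for the intercept in an intercept-only linear model
p_lm <- summary(lm(x ~ 1))$coefficients[1, "Pr(>|t|)"]

# p-value from a one-sample t test of H0: mean = 0
p_t <- t.test(x, mu = 0)$p.value

c(lm = p_lm, t.test = p_t)
```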

Let’s do the same one thousand times and look at the results.

samples<-replicate(1000,rnorm(10,0,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

[Figure: histogram of test.results, p values from 1000 samples with true mean 0 (fiig22.png)]

This distribution of test results for the simulated data is, of course, quite obvious when you think about it. About one in twenty of the results are less than 0.05. Because any value between 0 and 1 is equally likely, about 19 out of 20 test results fall above this cut-off. This is explicit in the definition of a significance test.
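A slightly larger version of the simulation makes the uniformity easier to check numerically. This sketch uses `t.test()` in place of the intercept-only `lm()` (they give the same p-value), with an arbitrary seed and number of replicates. It compares the share of p values below 0.05 with the share in another interval of the same width, 0.50 to 0.55, which is the point made earlier about 0.05 being no more likely than 0.5.

```r
set.seed(2008)
# p values from 2000 one-sample t tests on data simulated under the null
p_null <- replicate(2000, t.test(rnorm(10, mean = 0, sd = 1))$p.value)

# Two intervals of the same width (0.05) should catch similar shares
c(below_0.05 = mean(p_null < 0.05),
  from_0.50_to_0.55 = mean(p_null >= 0.50 & p_null < 0.55))

# Counts in ten equal-width bins should all be near 2000 / 10 = 200
table(cut(p_null, breaks = seq(0, 1, by = 0.1)))
```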

If the true mean for the simulations is not zero, the distribution of the p values changes: small p values become more common.

samples<-replicate(1000,rnorm(10,0.5,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

[Figure: histogram of test.results, p values from 1000 samples with true mean 0.5 (fiig23.png)]
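How often a real effect of this size produces p < 0.05 can be checked against theory. This sketch compares the simulated proportion with the power returned by `power.t.test()` for a one-sample, two-sided test with n = 10 and a true mean of 0.5 (sd 1); the seed and number of replicates are arbitrary. With these settings most p values still fall above 0.05 even though the effect is real.

```r
set.seed(99)
# p values from 2000 one-sample t tests when the true mean is 0.5
p_alt <- replicate(2000, t.test(rnorm(10, mean = 0.5, sd = 1))$p.value)

sim_power <- mean(p_alt < 0.05)

# Theoretical power of the same test (strict = TRUE counts both tails)
theory_power <- power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05,
                             type = "one.sample", strict = TRUE)$power

c(simulated = sim_power, theoretical = theory_power)
```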

The researcher’s intention is to screen for genes that are more likely to have an effect, so the interest lies in comparisons between p-values. If none of the genes screened has any effect, the technique is misleading even if a Bonferroni or any other correction is applied, as only false positives can ever be found. If some of the genes do have a (small) effect, there is no reason to believe that all those with an effect will produce p-values of less than 0.05. The actual results could be a mixture of the two histograms.

[Figure: histogram of p values from a mixture of the two situations (fiig24.png)]
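A mixture like this can be simulated directly. The numbers in this sketch are hypothetical: 1000 "genes", of which 100 are assumed to have a small real effect (a mean shift of 0.5 sd with n = 10 per gene) and the rest none, each tested with a one-sample t test rather than the survival models in the original question. Counting what a 0.05 cutoff selects shows both the false positives and the real effects that are missed.

```r
set.seed(2024)
n_genes <- 1000
n_real  <- 100                                   # assumed number of genes with an effect
effect  <- c(rep(0.5, n_real), rep(0, n_genes - n_real))
is_real <- effect != 0

# One p value per hypothetical gene
p <- sapply(effect, function(mu) t.test(rnorm(10, mean = mu, sd = 1))$p.value)

hist(p, breaks = 20, main = "Mixture of null and non-null p values")

selected <- p < 0.05
c(selected            = sum(selected),
  true_positives      = sum(selected & is_real),
  false_positives     = sum(selected & !is_real),
  real_effects_missed = sum(!selected & is_real))
```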

If all the genes have the same effect, then preferring those with low p-values to those with high p-values would clearly be a mistake. If a mixture occurs then, as Duncan Murdoch points out, the best that can be achieved is some guidance regarding the direction of future work. The procedure is clearly fraught with dangers, especially if there is no clear a priori reason to believe that some genes are more likely than others to have an effect.

I am concerned that in a comparable situation in ecology, which is typically an observational science, a researcher could be tempted to go too far and report “significant” effects as if they had been fully confirmed by this sort of analysis.

>> > Hi Eleni,
>> >
>> > The problem of this approach is easily explained: Under the Null
>> > hypothesis, the P values
>> > of a significance test are random variables, uniformly distributed in
>> > the interval [0, 1]. It
>> > is easily seen that the lowest of these P values is not any 'better'
>> > than the highest of the
>> > P values.
>> >
>> > Best wishes,
>> >
>> > Matthias
>> >
>>
>> Correct me if I'm wrong, but isn't that the point? I assume that the
>> hypothesis is that one or more of these genes are true predictors,
>> i.e. for these genes the p-value should be significant. For all the
>> other genes, the p-value is uniformly distributed. Using a
>> significance level of 0.01, and an a priori knowledge that there are
>> significant genes, you will end up with on the order of 20 genes, some
>> of which are the "true" predictors, and the rest being false
>> positives. this set of 20 genes can then be further analysed. A much
>> smaller and easier problem to solve, no?
>>
>>
>> /Gustaf
> 
> Sorry, it should say 200 genes instead of 20.
> 
I agree with your general point, but want to make one small quibble:
the choice of 0.01 as a cutoff depends pretty strongly on the
distribution of the p-value under the alternative.  With a small sample
size and/or a small effect size, that may miss the majority of the true
predictors.  You may need it to be 0.1 or higher to catch most of them,
and then you'll have 10 times as many false positives to wade through
(but still 10 times fewer than you started with, so your main point
still holds).

Duncan Murdoch
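Duncan Murdoch’s point about the cutoff can be put into rough numbers. This sketch is deterministic and rests on several assumptions: 20,000 genes in total (consistent with Gustaf’s corrected figure of about 200 genes at a 0.01 cutoff), a hypothetical 100 genes with a small real effect, and one-sample t tests with n = 10 and a standardized effect of 0.5 standing in for the survival models. `screen` is a helper written for this example.

```r
n_genes <- 20000  # assumed total number of genes screened
n_real  <- 100    # hypothetical number of genes with a real effect
n_null  <- n_genes - n_real

# Expected screening outcome for a given p value cutoff
screen <- function(cutoff) {
  pw <- power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = cutoff,
                     type = "one.sample", strict = TRUE)$power
  c(cutoff              = cutoff,
    expected_true_found = n_real * pw,
    expected_false_pos  = n_null * cutoff,
    expected_selected   = n_real * pw + n_null * cutoff)
}

r01 <- screen(0.01)
r10 <- screen(0.1)
rbind(r01, r10)
```

Under these assumptions, moving from 0.01 to 0.1 multiplies the expected false positives by ten while catching considerably more of the real effects, and the selected set is still only around a tenth of the genes screened.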

Written by Duncan Golicher

February 14, 2008 at 7:44 pm
