Screening using comparisons between p-values?

There was an interesting exchange on the R-help list yesterday. A researcher proposed to screen a large number of genes for a “significant” effect on survival time using a large number of univariate significance tests. Several people on the list thought it really wasn’t such a good idea. I have included the original communications on the list at the foot of this message.

The perhaps rather counterintuitive point is that, under the null hypothesis, a p-value of 0.05 is just as likely as a p-value of 0.5.

How can this be so? Put this way it does sound rather odd. But p-values produced under the null hypothesis are themselves random variables, uniformly distributed between 0 and 1. Why should any particular p-value be more likely than any other? It is not. Data produced under the null hypothesis can range from the highly probable to the highly improbable without changing the fact that the null hypothesis was used in the data-producing process. Watch this very simple simulation closely if you don’t follow this.

R will produce a vector of 10 numbers taken from a normal distribution with mean zero and sd 1 using the following command.

> rnorm(10,0,1)
[1] -0.3736018 -0.2327996 -0.8154836 0.3663073 1.0702547 1.3302237
[7] 1.3972863 1.2029137 -0.9293702 0.6351127

Every time R does this the mean will not be exactly zero, but the numbers are drawn from a population whose mean is zero. A null hypothesis test can be used to ask how likely data like these would be if the true population mean were zero (which we know it is). This can be done in R by fitting a linear model with only an intercept and testing the significance of the intercept.

> summary(lm(rnorm(10,0,1)~1))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06043    0.33829   0.179    0.862

Residual standard error: 1.07 on 9 degrees of freedom

So this time I am not tempted to reject the null hypothesis.

Let’s do the same thing one thousand times and look at the results.

# simulate 1,000 samples of size 10 from a standard normal (true mean zero)
samples <- replicate(1000, rnorm(10, 0, 1))
# extract the p-value on the intercept from each intercept-only model
test.results <- apply(samples, 2, function(x) summary(lm(x ~ 1))$coefficients[4])
hist(test.results)

[Figure: histogram of the 1,000 simulated p-values, roughly uniform between 0 and 1]

Now this distribution of the test results on the simulated data is, of course, quite obvious when you think about it. About one in twenty of these particular results fall below 0.05. Any value between 0 and 1 is equally likely, so roughly 19 out of 20 test results fall above this cut-off. This is explicit in the definition of a significance test.
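We can check this directly by counting the proportion of simulated p-values that fall below the cut-off; the exact figure will wobble from run to run, but it should land close to one in twenty.

# proportion of null p-values below 0.05; expect a value near 0.05
mean(test.results < 0.05)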

If the true mean for the simulations is not zero then the distribution of the p values will change.

# the same simulation, but now the true mean is 0.5 rather than zero
samples <- replicate(1000, rnorm(10, 0.5, 1))
test.results <- apply(samples, 2, function(x) summary(lm(x ~ 1))$coefficients[4])
hist(test.results)

[Figure: histogram of the 1,000 p-values when the true mean is 0.5; the distribution is now piled up towards zero]
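The proportion of these p-values falling below 0.05 is simply the power of the test, and with only ten observations and a true mean of 0.5 that power is modest. Since the intercept-only model is equivalent to a one-sample t-test, power.t.test from the stats package gives the theoretical figure, which should be close to the simulated proportion:

# empirical proportion below the cut-off under the alternative
mean(test.results < 0.05)

# theoretical power of a one-sample t-test, n = 10, true mean 0.5, sd 1
# (about 0.29, so most of the true effects are missed)
power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05, type = "one.sample")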

The researcher’s intention is to screen for genes that are more likely to have an effect, so the interest lies in comparisons between p-values. If none of the genes screened has any effect at all then the technique is misleading, even if a Bonferroni or any other correction is applied, as only false positives will ever be found. If some of the genes do have a (small) effect there is no reason to believe that all those with an effect will produce p-values of less than 0.05. The actual results could be a mixture of the two histograms, as sketched below.

[Figure: histogram of p-values from a mixture of genes with and without a true effect]
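To make the mixture concrete, here is a minimal sketch of that situation. The proportions and the effect size are assumptions chosen purely for illustration: say 900 genes with no effect at all and 100 genes with a true mean of 0.5.

# hypothetical mixture: 900 null genes and 100 genes with a small true effect
null.samples <- replicate(900, rnorm(10, 0, 1))
effect.samples <- replicate(100, rnorm(10, 0.5, 1))
samples <- cbind(null.samples, effect.samples)
p.values <- apply(samples, 2, function(x) summary(lm(x ~ 1))$coefficients[4])
hist(p.values)

# of the genes passing the 0.05 cut-off, how many are truly null?
has.effect <- rep(c(FALSE, TRUE), c(900, 100))
table(selected = p.values < 0.05, truth = has.effect)

With these assumed numbers the null genes contribute about 45 false positives (5% of 900) while the genes with a real effect contribute only about 29 true positives, so more than half of the “hits” are spurious.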

If all the genes have the same effect then expressing a preference for those with low p-values over those with high p-values would clearly be a mistake. If a mixture occurs then, as Duncan Murdoch points out, the best that can be achieved is some guidance regarding the direction of future work. The procedure is clearly fraught with dangers, especially if there is no clear a priori reason to believe which genes are more likely to have an effect.

I am concerned that in a comparable situation in the typically observational science of ecology a researcher could be tempted to go too far and report “significant” effects as if they had been fully confirmed by this sort of analysis.

>> > Hi Eleni,
>> >
>> > The problem of this approach is easily explained: Under the Null
>> > hypothesis, the P values
>> > of a significance test are random variables, uniformly distributed in
>> > the interval [0, 1]. It
>> > is easily seen that the lowest of these P values is not any 'better'
>> > than the highest of the
>> > P values.
>> >
>> > Best wishes,
>> >
>> > Matthias
>> >
>>
>> Correct me if I'm wrong, but isn't that the point? I assume that the
>> hypothesis is that one or more of these genes are true predictors,
>> i.e. for these genes the p-value should be significant. For all the
>> other genes, the p-value is uniformly distributed. Using a
>> significance level of 0.01, and an a priori knowledge that there are
>> significant genes, you will end up with on the order of 20 genes, some
>> of which are the "true" predictors, and the rest being false
>> positives. This set of 20 genes can then be further analysed. A much
>> smaller and easier problem to solve, no?
>>
>>
>> /Gustaf
> 
> Sorry, it should say 200 genes instead of 20.
> 
I agree with your general point, but want to make one small quibble:
the choice of 0.01 as a cutoff depends pretty strongly on the
distribution of the p-value under the alternative.  With a small sample
size and/or a small effect size, that may miss the majority of the true
predictors.  You may need it to be 0.1 or higher to catch most of them,
and then you'll have 10 times as many false positives to wade through
(but still 10 times fewer than you started with, so your main point
still holds).

Duncan Murdoch
