Screening using comparisons between p-values?
There was an interesting exchange on the R-help list yesterday. A researcher proposed to screen a large number of genes for a “significant” effect on survival time using a large number of univariate significance tests. Several people on the list thought it really wasn’t such a good idea. I have included the original communications on the list at the foot of this message.
The perhaps rather counterintuitive point is that under the null hypothesis a p value of 0.05 is just as likely as a p value of 0.5.
How can this be so? Put this way it does sound rather odd. But the p values produced under the null hypothesis are themselves random variables. They are uniformly distributed between 0 and 1. Why should any particular p value be more likely than any other? It is not. Data produced under the null hypothesis can range from the highly probable to the highly improbable without changing the fact that the null hypothesis was used in the data producing process. Watch this very simple simulation closely if you don’t follow this.
R will produce a vector of 10 numbers taken from a normal distribution with mean zero and sd 1 using the following command.
> rnorm(10,0,1)
[1] -0.3736018 -0.2327996 -0.8154836 0.3663073 1.0702547 1.3302237
[7] 1.3972863 1.2029137 -0.9293702 0.6351127
Every time R does this the sample mean will not be exactly zero, but the numbers are taken from a population with mean zero. A significance test gives the probability of getting a sample mean at least this far from zero if the true population mean is zero (which we know it is). This can be done in R by fitting a linear model with only an intercept and testing for significance of the intercept.
summary(lm(rnorm(10,0,1)~1))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.06043 0.33829 0.179 0.862
Residual standard error: 1.07 on 9 degrees of freedom
So this time I am not tempted to reject the null hypothesis.
Let’s do the same one thousand times and look at the results.
samples<-replicate(1000,rnorm(10,0,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

Now this distribution for the test results on the simulated data is, of course, quite obvious when you think about it. About one in twenty of these particular results are less than 0.05. Any value between 0 and 1 is equally likely, so about 19 out of 20 test results fall above this cut-off. This is explicit in the definition of a significance test.
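The one-in-twenty figure can be checked directly rather than read off the histogram. The following is only a sketch: the seed and the object name p.null are my own assumptions, and t.test() is used in place of the intercept-only lm() because, for a single sample tested against a mean of zero, it gives the same p value.

```r
set.seed(1)  # assumed seed, used only so that the run can be repeated

# 1000 samples of size 10 drawn under the null hypothesis (true mean zero)
samples <- replicate(1000, rnorm(10, 0, 1))

# One-sample t test on each column; same p value as the lm() intercept test
p.null <- apply(samples, 2, function(x) t.test(x, mu = 0)$p.value)

# Proportion of "significant" results: should be close to 0.05
mean(p.null < 0.05)

# Counts in ten equal-width bins: each should hold roughly 100 p values
table(cut(p.null, breaks = seq(0, 1, by = 0.1)))
```

The exact counts will move around from run to run, but no bin should stand out from the others.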
If the true mean for the simulations is not zero then the distribution of the p values will change.
samples<-replicate(1000,rnorm(10,0.5,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)
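How often a true effect of this size gives a p value below 0.05 is the power of the test. As a rough check (a sketch under my own assumptions about the seed and object names), the simulated proportion can be compared with the value that power.t.test() from the stats package calculates for a one-sample test.

```r
set.seed(2)  # assumed seed, for repeatability only

# 1000 samples of size 10 with a true mean of 0.5 (half a standard deviation)
samples.alt <- replicate(1000, rnorm(10, 0.5, 1))
p.alt <- apply(samples.alt, 2, function(x) t.test(x, mu = 0)$p.value)

# Proportion of simulations in which the effect is "detected" at 0.05
sim.power <- mean(p.alt < 0.05)

# Theoretical power for the same design
theory.power <- power.t.test(n = 10, delta = 0.5, sd = 1,
                             sig.level = 0.05, type = "one.sample")$power

c(simulated = sim.power, theoretical = theory.power)
```

With only ten observations the power is well under one half, so most real effects of this size would not reach the 0.05 threshold.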

The researcher’s intention is to screen for genes that are more likely to have an effect. Thus the interest lies in comparisons between p-values. If all the genes screened have no effect at all then the technique is misleading, even if a Bonferroni or any other correction is applied, as only false positives will ever be found. If some of the genes do have a (small) effect there is no reason to believe that all those with an effect will provide p-values of less than 0.05. The actual results could be a mixture between the two histograms.
If all the genes have the same effect then expressing a preference for those with low p-values as compared to high p-values would clearly be a mistake. If a mixture occurs then, as Duncan Murdoch points out, the best that can be achieved is some guidance regarding the direction of future work. The procedure is clearly fraught with dangers. It is especially dangerous if there is no clear a priori reason to believe which genes would be more likely to have an effect.
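The mixture case can be sketched with a small simulation. Every number here is an assumption of mine rather than something taken from the original question: 20,000 genes (the figure implied by "200 genes" at a cutoff of 0.01 in the replies below), of which a hypothetical 100 have a true effect of half a standard deviation, each measured on 10 observations. The t statistics are calculated directly for speed; they are the same as those from a one-sample t test.

```r
set.seed(42)      # assumed seed
n.genes <- 20000  # assumption: implied by 200 false positives at p < 0.01
n.true  <- 100    # hypothetical number of genes with a real effect
effect  <- 0.5    # hypothetical effect size, in standard deviations
n.obs   <- 10     # hypothetical sample size per gene

# True mean for each gene: the first n.true have an effect, the rest none
mu <- c(rep(effect, n.true), rep(0, n.genes - n.true))
has.effect <- mu != 0

# One column of observations per gene
x <- matrix(rnorm(n.obs * n.genes, mean = rep(mu, each = n.obs)), nrow = n.obs)

# One-sample t statistics and two-sided p values, column by column
t.stat <- colMeans(x) / (apply(x, 2, sd) / sqrt(n.obs))
p <- 2 * pt(-abs(t.stat), df = n.obs - 1)

# Count true and false "hits" at a given cutoff
screen <- function(cutoff) {
  c(cutoff = cutoff,
    true.hits = sum(p < cutoff & has.effect),
    false.hits = sum(p < cutoff & !has.effect))
}

# Cutoffs of 0.01 and 0.1, and a Bonferroni-corrected 0.05
results <- rbind(screen(0.01), screen(0.1), screen(0.05 / n.genes))
results
```

Under these assumptions the false hits should heavily outnumber the true ones at both uncorrected cutoffs, and raising the cutoff to catch more real effects brings roughly ten times as many false positives with it.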
I am concerned that in a comparable situation in the typically observational science of ecology a researcher could be tempted to go too far and mention “significant” effects as if they have been fully confirmed by this sort of analysis.
Hi Eleni, The problem of this approach is easily explained: Under the Null hypothesis, the P values of a significance test are random variables, uniformly distributed in the interval [0, 1]. It is easily seen that the lowest of these P values is not any 'better' than the highest of the P values. Best wishes, Matthias
Correct me if I'm wrong, but isn't that the point? I assume that the hypothesis is that one or more of these genes are true predictors, i.e. for these genes the p-value should be significant. For all the other genes, the p-value is uniformly distributed. Using a significance level of 0.01, and an a priori knowledge that there are significant genes, you will end up with on the order of 20 genes, some of which are the "true" predictors, and the rest being false positives. This set of 20 genes can then be further analysed. A much smaller and easier problem to solve, no? /Gustaf
Sorry, it should say 200 genes instead of 20.
I agree with your general point, but want to make one small quibble: the choice of 0.01 as a cutoff depends pretty strongly on the distribution of the p-value under the alternative. With a small sample size and/or a small effect size, that may miss the majority of the true predictors. You may need it to be 0.1 or higher to catch most of them, and then you'll have 10 times as many false positives to wade through (but still 10 times fewer than you started with, so your main point still holds). Duncan Murdoch
Rationality and the lottery
The BBC web site today contained what appears to me to be a misrepresentation of decision theory. The argument goes…
“Should you invest £2 a day or use it to buy lottery tickets?
Maths makes the decision obvious. Suppose you invest two quid every day at the reasonable rate of 10%. It will take you almost exactly 50 years to accumulate £1m. To earn this same £1m in the National Lottery, you would (on average) have to match five numbers and a bonus ball, at odds of 2,330,635-to-1.
If you spent two quid a day for 50 years you would total just over 36,500 tickets and would thus have only a 1-in-63 chance of making that million pounds. However, the available image of immediate wealth subverts this rationality.”
Is this right? Is it “obvious”, as the author claims? No, it is not. It is far from obvious.
The calculation of compound interest is correct, although banks do not normally compound interest on a daily basis and 10% is rather optimistic. You can check by simulating the arrangement day by day with a simple R function.
f<-function(ndays=365*50,interest=0.1,value=2){
a<-numeric(ndays)
a[1]<-value
for (i in 2:(ndays)){
a[i]<-a[i-1]+value
a[i]<-a[i]+(interest/365)*a[i]}
a}
par(bg=grey(0.92))
plot(f(value=2),type="l",lwd=2,col="red",xlab="Number of days",ylab="Accumulated value")
grid(col=1)
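The article's "almost exactly 50 years" can be checked against the same day-by-day scheme. This is only a sketch: the function name savings and the other object names are mine, and the closed-form expression is my own algebra for this particular scheme (deposit first, then add a day's interest), not a formula taken from the article.

```r
# Daily deposits with daily compounding, as in the simulation above
savings <- function(ndays = 365 * 50, interest = 0.1, value = 2) {
  r <- interest / 365
  balance <- numeric(ndays)
  balance[1] <- value
  for (i in 2:ndays) {
    # Add the day's deposit, then a day's interest on the new balance
    balance[i] <- (balance[i - 1] + value) * (1 + r)
  }
  balance
}

bal <- savings()
final <- bal[length(bal)]

# Number of years until the balance first reaches one million
years.to.million <- which(bal >= 1e6)[1] / 365

# Closed form for the final balance under the same scheme
r <- 0.1 / 365
n <- 365 * 50
closed <- 2 * (1 + r)^(n - 1) + 2 * (1 + r) * ((1 + r)^(n - 1) - 1) / r

c(final = final, closed.form = closed, years.to.million = years.to.million)
```

The loop and the closed form should agree to within rounding error, and the million should be reached a little before the fifty years are up.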
The money in the bank grows healthily towards the one million target. So what is wrong with the argument? The author claims that the odds of winning a million on the lottery are 2,330,635:1. This is not a fair bet, but it is not such a bad one either. You have just under one chance in two million of winning the one million on offer. The expected value of your one pound ticket is the chances of winning (admittedly very small indeed) multiplied by the sum that would be won (and of course this is very large).
1/2330636*1000000
[1] 0.4290674
So the expected value of your ticket is about 43p. You have superficially wasted 57p.
The story about all the interest you would get by investing the money is a misleading red herring. If you took the conclusion of a 1:63 ratio between saving and gambling seriously it would persuade you not to buy a single lottery ticket even if the odds of winning were to become more favourable than one in two million and the expected value of a ticket came to exceed its price. Decisions between retaining a small sum with certainty and risking a big one do always involve subjective judgement, but few would not consider the lottery worth a shot if the prize of 1 million could be won at odds of (say) 200,000:1. The author of the article would (on this erroneous logic) still be convinced that it is better to put the money in the bank.
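To put numbers on this, the expected value of a one pound ticket can be written as a function of the odds. The function name ticket.ev is my own, and the 200,000:1 figure is the purely hypothetical one used above.

```r
# Expected value of a ticket: prize multiplied by the chance of winning.
# Odds of "x to 1 against" correspond to one chance in x + 1.
ticket.ev <- function(odds.against, prize = 1e6) {
  prize / (odds.against + 1)
}

ev.actual <- ticket.ev(2330635)  # the odds quoted in the article
ev.better <- ticket.ev(200000)   # the hypothetical, more generous odds

c(actual = ev.actual, hypothetical = ev.better)
```

At the quoted odds the ticket is worth about 43p, while at 200,000:1 it would be worth about five pounds, several times its price. A comparison with the savings account that ignores this cannot be the whole story.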
The formula for compound interest can be written as an R function in terms of the principal (p), the number of periods in a year that interest is paid (q), the interest rate (i) and the number of years (n).
f1<-function(p=1,i=0.1,q=365,n=1)p*(1+(i/q))^(n*q)
So using this function, let’s think this all through calmly. If you were to win the lottery tomorrow and do the same with the money as you would have done with the two pounds you spent on the ticket, i.e. invest it at a compound interest of 10%, you would be colossally wealthy in fifty years’ time. Using the same interest rate calculation that the author assumed you would have over 148,000,000.
f1(p=1000000,n=50)
[1] 148311560
On the other hand, if you were to win your million exactly fifty years from now you would just have your million at the end of the period. This would coincide with what you would have gained from saving.
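The same function shows what a million is worth at the end of the fifty years for wins at different times in between. This is a sketch using the author's assumed 10% rate; the choice of ten-year steps and the object names are mine.

```r
# Compound interest function, as defined above
f1 <- function(p = 1, i = 0.1, q = 365, n = 1) p * (1 + (i / q))^(n * q)

# Year in which the million is won, counting from today
win.year <- seq(0, 50, by = 10)

# Value at year 50 if the win is invested for the remaining years
value.at.50 <- f1(p = 1000000, n = 50 - win.year)

data.frame(win.year, value.at.50 = round(value.at.50))
```

The value falls steadily as the win is delayed and only comes down to the bare million for a win on the very last day.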
So to reiterate, all wins before the final date are worth more than the saved money in the bank at the end of fifty years. The earlier you win the better. The only addition I have made to the author’s own argument is to assume (quite fairly) that lottery winnings also gain compound interest. The comparison the author makes between the frugal saver and the lottery player is quite unfair. It uses only the absolute minimum that a lottery win would be worth as the baseline for comparison. The expected lottery winnings at the end of fifty years are quite clearly worth very much more than one million. In fact under this model, it is easy to show that the expected amount is exactly 0.4290674 times the money that would be in the bank if you had not played the lottery, providing comparable assumptions are made regarding the use of the money.
plot(f(value=2),type="l",lwd=2,col="red",xlab="Number of days",ylab="Accumulated value")
lines(f(value=2*0.4290674),type="l",lwd=2,col="blue")
The relationship between the money paid and its expected value (in purely monetary terms) doesn’t change. The ratio between the red line (saver) and the blue line (expected value from playing the lottery and investing the proceeds) stays the same. Lottery players have (on average) an expected value of around 43% that of the savers. They are worse off, but nowhere near as irrational as the article suggests.
But we can go a step further with the argument. As you think it through and apply common sense it gets better and better for the lottery player. You clearly wouldn’t ever dream of actually investing a million you won tomorrow in order to have megabucks in fifty years’ time. A small fortune is worth much more to you now than an unspendable fortune in the future. In fact, to you, a million now is almost certainly worth much more than the 148 million it could become, given the positive, life-enhancing potential of a single million. After the first million the next 147 are increasingly irrelevant to your happiness. This could be written using a function that converts money into happiness. This is a curve that reaches some sort of asymptote. The absolute level of the asymptote varies between individuals, but the shape is fixed, even for Bill Gates.
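One possible form for such a function, purely as an illustration, is a simple saturating curve. Both the functional form and the half.sat parameter (the sum at which half of the maximum happiness is reached, set here at an arbitrary 250,000) are hypothetical choices of mine; nothing in the argument depends on them beyond the curve levelling off.

```r
# Hypothetical money-to-happiness curve that rises towards an asymptote of 1.
# half.sat is the (arbitrary) sum that gives half of the maximum.
happiness <- function(money, half.sat = 250000) {
  money / (money + half.sat)
}

h.one.million <- happiness(1e6)    # a million now
h.148.million <- happiness(148e6)  # the invested fortune fifty years on

c(one.million = h.one.million,
  the.148.million = h.148.million,
  ratio = h.148.million / h.one.million)
```

On this particular curve the first million already delivers four fifths of the maximum, so 148 times the money buys only about a quarter more happiness.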
At the same time the savings should be devalued by the probability of dying before they can be used, the bank suffering a fate worse than Northern Rock, a meteorite strike, or the consequences of catastrophic global warming among a multitude of other scenarios. Depending on just how much all these trade offs come in at (which again is a rather subjective matter), your lottery ticket could easily turn out to be worth more to you than the pound you paid for it.
The original article states:
If you spent two quid a day for 50 years you would total just over 36,500 tickets and would thus have only a 1-in-63 chance of making that million pounds.
A 1-in-63 chance during a lifetime doesn’t sound so unlikely!
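The article's 1-in-63 figure can be reproduced, approximately, from its own numbers. The sketch below assumes two one pound tickets a day and independent draws, and takes one chance in 2,330,636 per ticket; the object names are mine.

```r
p.win <- 1 / 2330636          # chance of the million on a single ticket
n.tickets <- 2 * 365 * 50     # two tickets a day for fifty years

# Expected number of wins, which is what the article's figure is based on
expected.wins <- n.tickets * p.win

# Probability of at least one win over the whole period
p.at.least.one <- 1 - (1 - p.win)^n.tickets

c(tickets = n.tickets,
  one.in.by.expectation = 1 / expected.wins,
  one.in.at.least.one = 1 / p.at.least.one)
```

Both versions come out at roughly one chance in 64, so the quoted 1-in-63 is about right.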
It can in fact be perfectly rational to buy a lottery ticket. Which is why so many rational people do.
End line: Why should this interest a forest ecologist? Because it might explain why sustainable forestry is so difficult!
