Duncan Golicher’s weblog

Floods and forest cover 2: Ecological inference

Posted in Evidence and Ecology, Probability and statistics by Duncan Golicher on April 3rd, 2008

I have been building up carefully to making a specific comment on Bradshaw et al (2007) because the subject that the paper addresses is too important to risk mistakes.

http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2486.2007.01446.x

The first steps I took yesterday were to look at the data available to the authors. In addition to discovering that the data on the floods themselves were of rather variable quality and lacking in geographical resolution I was surprised by some of the values for recent deforestation in the dataset (from http://www.wri.org) included in the study. In this case I was unable to locate the original data on this web site. However I reformated the data in the paper itself into a GIS and made it available through this weblog.

Despite what looks like exaggeration of the extent of Mexican deforestation and thus probably some other countries, my main issue is not with details regarding numbers within the dataset itself. It is with the underlying logic of the whole analysis. This paper seems to be a classic (almost embarrassing), case of the so called “ecological fallacy” at work. This problem is so well known I was actually quite surprised to see a work of this sort published in a journal with an impact factor of 4.3. It will no doubt be widely cited, particularly by authors interested in strengthening arguments in favour of payments for ecological services. This is likely to have a positive impact on forest conservation. The paper is therefore “politically correct” and could be considered a helpful contribution. However the underlying science is certainly open to criticism.

The term “ecological inference” strangely enough does not come from the discipline of ecology. Ecologists do not often use the term in its original sense. It is much more widely used by social scientists. I will set the scene starting with two classic examples from the social sciences cited by the statistician David Freedman (the tech report from which I took the examples is available here freedman549.pdf)

In 19th century Europe, suicide rates apparently were higher in countries that were more heavily Protestant. The inference could be drawn that suicide is promoted by Protestantism itself. Death rates from breast cancer are higher in countries where fat is a larger component of the diet. Fat intake therefore could cause breast cancer. More recently, floods cause more damage in countries that have high rates of recent deforestation, therefore deforestation causes floods.

These are all “ecological inferences,” that is, inferences about individual behavior drawn from data about aggregates. The ecological fallacy arises when it is believed that characteristics of aggregated data extend to the units from which they have been compiled. In the social sciences this can lead to incorrect inference being drawn regarding the characteristics of the individuals themselves. Individuals do of course live within countries. Floods do take place in defined political units (although lack of exclusivity is an additional issue in this case). However the factors that influence the probability of committing suicide, suffering from cancer or experiencing severe flooding are experienced at the level of the individual or the subregion affected by the flood event. Protestant countries in the nineteenth century were clearly different from the Catholic countries in many ways besides religion (confounding). The original data did not associate individual suicides to any particular religious faith directly (aggregation).

The first problem, that of confounding, must be dealt with in any observational study. But the second problem, that exposure and response are measured only for aggregates rather than for individual, is specific to so called “ecological” studies (the word here being used in the social sciences sense). If there is no confounding, the expected difference between effects for groups and effects for individuals is the “aggregation bias”. In real studies with aggregated data there is usually some confounding and some aggregation bias. Sometimes the message extracted from the aggregated data coincides well with that obtained by an analysis of its components. There is no imperative that fallacious conclusions are always going to be drawn. For example RA Fisher quite wrongly argued against the causal link between smoking and lung cancer due to his feeling that inference was impossible due to confounding (I am not sure how far aggregation was an issue in this particular case). In contrast, in the case of the link between fat intake and breast cancer, more recent studies using data from individuals did raise questions regarding the validity of conclusions drawn from aggregated data.

When data is aggregated in any way at all, details that can help to ensure accurate inference will always be lost. The common reason for this in the social sciences is a bureaucratic imperative to compile tables of regional statistics and a common requirement for confidentiality at the individual level. Election results are a classic example.

In my own experience I have often found that appropriate detail gets lost simply because a student naively aggregates data with the intention of producing a dataset that is easy to handle in a spreadsheet. This can be particularly frustrating if it is done before the data are captured in digital form. There is then no way back. Modern analytical techniques using contemporary statistical environments such as R can handle vast numbers of data points with great ease and speed. Raw data can read be strait into R from online repositories and aggregated according to the needs of a specific analysis with a couple of lines. Mixed effects (hierarchical) models and data mining techniques work with data labelled by individual or unit. There is no longer any reason to continue to repeat avoidable aggregation errors. This is a suitable point to provide a link to a previous post on data handling (in Spanish).

In the final posting on this thread I will finally deal with the specific content of the article in more detail.

Fuelwood model

Posted in Evidence and Ecology, Probability and statistics, R scripts by Duncan Golicher on March 12th, 2008

fuelwood-model1.png

fuelwood-model2.pngfuelwood-model3.png

This morning I designed a simple simulation model in R with a small tcltk interface based on the concepts in the previous two posts. The aim is to allow a prediction of the probable future yield of fuelwood from plots with uncertain planting or survival densities, uncertain diameter distributions and uncertain parameters regarding diameter-height relationships and wood densities.

The model can be run by sourcing the file below into R (install tcltk first). Again this is work in progress and may change as colleagues provide improvement through feedback and new data. The aim is to provide a useful tool for prediction in a highly uncertain environment through giving a probability that a fuelwood or carbon sequestration project’s aims could be met based on known or assumed planting densities and growth rates.

biomasa1.doc

Permanent forest plots 2.

Posted in Climate change, Evidence and Ecology, Probability and statistics, R scripts, Uncategorized by Duncan Golicher on March 11th, 2008

After thinking a little more about the problem of the allometric equation in the paper that I discussed yesterday, I realised I had a partial solution already to hand. (Again please read this as a marker for an idea that could be developed further. This is one of the motivations for using an informal web log instead of leaving ideas as notes on my own hard disk.)

In order to work with a fairly simple individual based forest model I have implemented many different allometric equations in R over time. My favourites for use in data poor areas are still those derived originally by Botkin and Shugart from way back in the late seventies (see for example Shugart 1984). I have used them in both my thesis and a later implementation of the same model. Results from this are included in a recent book chapter. I aim to include the entire R code for this model on this site some point soon (after tidying up some elements and documenting).

The nice thing about the parameterisations used in these early models is that they were explicitly designed to be run with a minimal amount of knowledge. Superficially that doesn’t sound too encouraging. Shouldn’t we try to improve a model by adding more explicit detail? Not necessarily. After all purely empirical allometric equations assume no knowledge at all, apart from that provided through the data. That is their strength and weakness. In order to be used correctly you have to build them afresh for each area being studied. However if equations can be constrained by some very easily obtained field observations a sensible adjustment can be made for the conditions of each plot without going through the costly process of measuring everything afresh. The results certainly won’t be precise, but that is the point. Uncertainty can be included in the calculations through Monte Carlo simulation. The true value should then fall somewhere in the predicted range.

Botkin’s equations can turn measured diameter at breast height into estimated height and therefore estimated tree volume for any species based simply on an estimate of the maximum diameter the species can obtain and its maximum diameter. It is fundamentally a three parameter quadratic model constrained to take realistic values. As maximum height is quite site specific and can be estimated through rapidly made observations the equations are potentially very useful in the tropics where actual measurements of tree height are hard to obtain, or in cases of models of newly established forests when final height may have to be estimated from some knowledge of site quality. Notice that the equation discussed yesterday must be based implicitly on some sort of rule regarding tree height and form factor. You simply can’t get around this. The equation is a statistical black box that just takes in diameter and churns out above ground biomass. The fact that the equation doesn’t explicitly state that height has been estimated doesn’t make it robust. You simply can’t turn diameter into volume and thus biomass without making some sort of assumption regarding the height and shape of a tree, even if it is hidden within an impenetrable calculation. This is a very common failing of empirical equations and a good reason to generally mistrust their uncritical application (This is often the result of a misplaced appeal to authority, an empirical equation has been published by an important research group so it is eminently citable. Yes fine and good, but the equation still only applies to their particular data set.)

So here are the set of simple equations Botkin suggested. I have prefixed each function with f. to avoid confusing a function placed in the global environment in R with a variable.

(All the code shown below can be downloaded in this script … permanentplots2.doc)

f.B2<-function(Hmax,Dmax)2 * ((Hmax - 137) / Dmax)
f.B3<-function(Hmax,Dmax) (Hmax - 137) / Dmax ^ 2
f.BA<-function(Diameter)pi * (Diameter / 2) ^ 2
f.Height<-function(Diameter,Hmax,Dmax)137 + f.B2(Hmax,Dmax) * Diameter - f.B3(Hmax,Dmax) * Diameter ^ 2
f.AGB<-function(D,Hmax=2000,Dmax=100,FormFactor=0.8,r=0.5)r*FormFactor * f.BA(D) * f.Height(D,Hmax,Dmax)/1000

Let’s see how the equation for height works in practice. Assume maximum diameter is 120 cm, but forests (or species) vary in the maximum height they can reach. Note that as this is a quadratic (i.e. parabola) maximum diameter must coincide with the maximum in the data or the curve turns downwards.

D<-10:120

plot(D,f.Height(D,Dmax=120,Hmax=3000)/100,type=”l”,lwd=2,col=”green”,ylab=”Height (m)”,xlab=”Diameter”,ylim=c(0,30))
lines(D,f.Height(D,Dmax=120,Hmax=2000)/100,col=”blue”,lwd=2)
lines(D,f.Height(D,Dmax=120,Hmax=1000)/100,col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”High (30m)”,”Medium (20m)”,”Low (10m)”),col=c(”green”,”blue”,”red”)
)

permanent-plots4.png

Now let’s compare with the formulas used in the paper and included in the previous post and multiply appropriately by basal area and form to produce volume.

plot(D,wet(0.5,D=D),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass (kg)”,xlab=”Diameter”)
lines(D,moist(0.5,D=D),col=”blue”,lwd=2)
lines(D,dry(0.5,D=D),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))

points(D,f.AGB(D=D,Hmax=3500,Dmax=120),col=”blue”)
points(D,f.AGB(D=D,Hmax=2000,Dmax=120),col=”green”)
points(D,f.AGB(D=D,Hmax=1500,Dmax=120),col=”red”)
legend(”bottomright”,pch=1,legend=c(”Hmax=20m”,”Hmax=35m”,”Hmax=15m”),

col=c(”green”,”blue”,”red”))

permanent-plots6.png

So the curves are very comparable. The second set can be produced by making a single reasonable assumption regarding maximum height. In fact the equation for above ground biomass has four parameters. Maximum height, maximum diameter, “form factor” and wood density. The form factor is a term used in forestry and here it might best be expressed as “branchiness”. Due to the way tree growth occurs (the “pipestem”model) total above ground volume can never be expected to exceed the volume of a cylinder. Foresters are only interested in the trunk, but a factor ranging between 0.6 and 1 can be applied quite reasonably to biomass as a whole. Here I have set the value at 0.8. This reproduces quite well the original allometric equations. We can add in uncertainty regarding this parameter in the analysis.

This is now the point at which the equations become useful. If a lot of knowledge is available, then separate parameters can be assigned for each species. Maximum height is quite easy to get.Simply by making this justifiable assumption accuracy should improve markedly over the “one size fits all” assumption that was implicit in the paper under discussion. A nice element is that the quadratic is constrained so the meaningless estimates that can sometimes arise when using polynomials outside their range of applicability will never occur. With less information functional groupings can be used. In the worst case scenario the same equation can be used for all trees in the forest, as was apparently done with the empirical allometric equation in the paper (with the exception of including a species specific term for wood density).

Using R it is very easy to include uncertainty regarding these parameters. As has been pointed out in the previous post, this is likely to affect the conclusions drawn much more than within sample variability. The point is that bootstrapping on parameters of a polynomial allometric equation with no real life interpretation would be very dangerous. The curve could go in many unexpected directions. In contrast bootstrapping on an equation with direct interpretation is reasonable. Lets use a boxplot as a quick way of looking at the results. Lets assume a great deal of uncertainty in Hmax (20m with a sd of 3m) and some uncertainty in the form factor (0.8, sd=0.1). It is easy to extend the uncertainty to all four parameters.

boot.strap<-replicate(10000,f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1)) )
boot.strap<-data.frame(rep(10:120,times=10000),as.vector(boot.strap))
names(boot.strap)<-c(”D”,”V”)
boxplot(V~D,data=boot.strap,range=0.1,outline=F,col=2)

permanent-plots7.png

So we can see how uncertainty in total above ground biomass increases with the diameter of the tree. Finally lets see how this could be applied to the simulated data. If we just want an uncertain estimate of total above ground biomass by applying the bootstrapped conversion formula to a single large plot we can do it like this.

D<-rlnorm(100000,2.2,0.8 )
D<-D[D>10]
D<-D[D<120]

boot.strap<-replicate(1000,sum(f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1))) )
quantile(boot.strap/1000000,c(0.05,0.5,0.95))

5% 50% 95%
6.97 9.75 13.04

plot(density(boot.strap/1000000,bw=0.7),xlab=”Total above ground biomass kg x 10^6″)
abline(v=quantile(boot.strap/1000000,c(0.05,0.5,0.95)),col=c(”blue”,”black”,”red”))

permanent-plots8.png

The extension of the technique to be used on the difference between two measurement periods is trivial if the data is available. More accuracy is obtained by narrowing the range of parameters by using a separate parameter for each species, so there is clearly a cost of ignorance at work here.. This can even be quantified (cf with this post) The method can clearly also be applied to estimating future biomass yield in the case of plots being planted for fuelwood or biofuels, based on making some very simple but justifiable assumptions regarding growth form, wood density and planting density. Because a measure of uncertainty is produced this is very useful in decision making.

biomasa.doc

A simple illustration of an ecological lottery model in R

Posted in Evidence and Ecology, Natural history, Probability and statistics, R scripts by Duncan Golicher on March 7th, 2008

A colleague contacted me yesterday with some interesting modelling work based on the neutral community model of Stephen Hubbell.

Hubbell’s work fascinated me when it was first published. I have always assumed that it is easily misunderstood. If Hubbell is taken to imply that all species are equal, then the work appears to be eroding the basic subject matter of ecology. Ecologists tend to look for informative differences between species. However this is not exactly what the theory is based on. It is rather more subtle. The theory concerns the equality of individuals (in this sense it almost has political implications). Hubbell writes.

“The theory treats organisms in a community as essentially identical in their per capita probabilities of giving birth, dying, migrating, and speciating. This neutrality is defined at the individual level, not the species level. All that is required is that all individuals of every species obey exactly the same rules of ecological engagement. So, for example, if all individuals and species enjoy a frequency-dependent advantage in per capita birth rate when rare, and if this per capita advantage is exactly the same for each and every individual of a species of equivalent abundance then such a theory would qualify as a bona fide neutral theory by the present definition.”

I have illustrated this idea for students with a very small simulation in R . R code, like matlab or octave, is very terse and efficient so a couple of lines can do a fair amount of work. Lets assume a very simple reproductive rule applies to all individuals. The rule is that all have the same discrete generation time, in other words they all survive one time step of a simulation. At the end of the time step each individual produces two offspring. The lottery then applies. The offspring are randomly mixed and placed back on the finite space occupied by the previous generation. So half find a home and half “die”. The process is then repeated. This was exactly how I first implemented the model. I then re-implemented the idea in fewer lines by making the probability of selection depend on the proportion of individuals in each species, which amounts to the same thing.

This is all the code needed to run the model. (Try this if the quotation marks are not right lottery.doc)


nspecies<-8
gridsize<-20
time<-100

RUN<-function(){
X<<-matrix(0,nspecies,time)
palette(rainbow(nspecies))

mat<-sample(1:nspecies,gridsize*gridsize,replace=T)
X[,1]<-table(mat)
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))

for (i in 2:time){
a<-X[,i-1]
mat<-sample(1:nspecies,gridsize*gridsize,prob=a,replace=T)
X[,i]<-table(factor(mat,levels=1:nspecies))
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
gc()
}

matplot(t(X),type=”l”,lwd=2,col=rainbow(nspecies))
}

#However to get a little tcltk interface add this

############################################

library(tcltk)

Run<-function(…){
nspecies<<-as.numeric(tclvalue(”nspecies”))
gridsize<<-as.numeric(tclvalue(”gridsize”))
time<<-as.numeric(tclvalue(”time”))
RUN()}

tt <-tktoplevel()
d1<-tklabel(tt,text=”Numero de species”)
d2<-tklabel(tt,text=”Tamaño del grid”)
d3<-tklabel(tt,text=”Tiempo”)

s1 <- tkscale(tt,from=0, to=20, variable=”nspecies”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s2 <- tkscale(tt, from=2, to=100, variable=”gridsize”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s3 <- tkscale(tt, from=10, to=500, variable=”time”,
showvalue=TRUE, resolution=10, orient=”horiz”)

but <- tkbutton(tt, text=”Run”,command=Run)

tkgrid(d1,s1)
tkgrid(d2,s2)
tkgrid(d3,s3)
tkgrid(but)

The model is to be played with (just download R from CRAN and paste all the code into the console), rather than watched but here are some example runs on You Tube,

The moral of the story is that under a neutral model some species start a random walk to extinction if co-existing in a small space with other species. However we cannot tell which will dominate and which become extinct from their characteristics, because all individuals, regardless of species are governed by exactly the same rules. The fate of species is determined by the cumulative luck or misfortune of the individuals that have been given the label. The larger the grid the more species that can be supported for a longer time, but extinction risk is always present if a long losing streak hits. The element of Hubbell’s theory that this model does not reproduce well is the log-normal (-ish) relative abundance distribution of the surviving species, although come to think of it I have not tried running the model with a really large number of species and a really large grid. R runs out of colours for the illustration, but this is not important.

Many interesting questions remain. Could a simulation model based on this sort of idea be used to predict extinction over a real landscape? Models of this sort seem far too abstract to provide real insight into the real world. But in some sense the process that Hubbell describes in a more formal mathematical way in his book probably does apply. But, how can these sort of ideas be communicated accurately to prevent them from provoking the sort of unproductive SLOSS debate that has often resulted from theory being pushed too far?

Publication bias

Posted in Current affairs, Evidence and Ecology, Probability and statistics, Uncategorized by Duncan Golicher on March 1st, 2008

One of the stories in the news in the UK this week was the publication of a study by Irving Kirsh from the University of Hull and various colleagues on the efficacy of anti depressant drugs, such as the well known selective serotonin uptake inhibitor, Prozac. (101371_journalpmed0050045-l.pdf) . The researchers gained access to previously unpublished studies. This led them to some interesting conclusions. Although meta-analyses of antidepressant medications report modest benefits over placebo treatment, when the unpublished trial data are included, the benefit falls below accepted criteria for clinical significance. There is a link to a file of a BBC radio podcast on the subject here. There are also some earlier studies by the same group that tell a similar story 1171.pdf

It is easy to see this as a simple example of drug companies being cynically economical with the truth. However even drug companies stand to lose out in the long run if they peddle merchandise that doesn’t do what it says on the can. A more important aspect of the study is its broader implications for the scientific publication process in general. The message is that referees should not reject a study for publication merely on the grounds that “significant” results were not produced. Observational studies in ecology are particularly difficult and statistical significance testing is often quite inappropriate and misleading. I will return to this theme in more detail. Meanwhile here is a link to the web site of the Journal of Negative Results Ecology and Evolutionary Biology.

Another, slightly more subtle lesson to be drawn from this study concerns the general importance of data sharing and data pooling. Real insight into how a process works in an applied context often needs a lot of data. In this case, once enough data had been assembled the researchers not only were able to ask the question “is there a response” but could go further and ask about the shape of the response. The question was meaningful as its answer suggested where the benefit of the drugs effects could lie. This is a critical element of statistical analysis that introductory courses on statistics often overlook. Often the most meaningful questions concern issues such as “is the inclusion of a quadratic term supported by the data” rather than a test of the null hypothesis of no effect. Such questions are often unanswerable by a single study. This is another reason why researchers and organizations that do not make raw data available hold back science.

kirsh-et-al.png

Kirsch I, Deacon BJ, Huedo-Medina TB, Scoboria A, Moore TJ, et
al. (2008 ) Initial severity and antidepressant benefits: A meta-analysis of data submitted to theFood and DrAdministration. PLoSMed 5(2): e45. doi:10.1371/journal. pmed.0050045

Screening using comparisons between p-values?

Posted in Probability and statistics, R scripts by Duncan Golicher on February 14th, 2008

There was an interesting exchange on the R-help list yesterday. A researcher proposed to screen a large number of genes for a “significant” effect on survival time using a large number of univariate significance tests. Several people on the list thought it really wasn’t such a good idea. I have included the original communications on the list at the foot of this message.

The perhaps rather counterintutive point is that under the null hypothesis a p value of 0.05 is just as likely as a p value of 0.5.

How can this be so? Put this way it does sound rather odd. But the p values produced under the null hypothesis are themselves random variables. They are uniformly distributed between 0 and 1. Why should any particular p value be more likely than any other? It is not. Data produced under the null hypothesis can range from the highly probable to the highly improbable without changing the fact that the null hypothesis was used in the data producing process. Watch this very simple simulation closely if you don’t follow this.

R will produce a vector of 10 numbers taken from a normal distribution with mean zero and sd 1 using the following command.

> rnorm(10,0,1)
[1] -0.3736018 -0.2327996 -0.8154836 0.3663073 1.0702547 1.3302237
[7] 1.3972863 1.2029137 -0.9293702 0.6351127

Every time R does this the mean will not be exactly zero, but the numbers are taken from a population with mean zero. A null hypothesis test can be used to test the probability of getting these data if the true population mean is zero (which we know it is). This can be done in R by fitting a linear model with only an intercept and testing for significance of the intercept

summary(lm(rnorm(10,0,1)~1))

Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.06043 0.33829 0.179 0.862

Residual standard error: 1.07 on 9 degrees of freedom

So this time I am not tempted to reject the null hypothesis.

Let’s do the same one thousand times and look at the results.

samples<-replicate(1000,rnorm(10,0,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

fiig22.png

Now this distribution for the test results on the simulated data is, of course, quite obvious when you think about it. About one in twenty of these particular results are less than 0.05 . Any value between 0 and 1 is equally likely, making 19 out of 20 test results fall above this cut off. This is explicit in the definition of a significance test.

If the true mean for the simulations is not zero then the distribution of the p values will change.

samples<-replicate(1000,rnorm(10,0.5,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

fiig23.png

The researcher’s intention is to screen for genes that are more likely to have an effect. Thus the interest lies in comparisons between p-values. If all the genes screened have no effect at all then the technique is misleading, even if Bonferoni or any other corrections is applied, as only false positives will ever be found. If some of the genes do have a (small) effect there is no reason to believe that all those with an effect will provide p-values of less than 0.05. The actual results could be a mixture between the two histograms.

fiig24.png

If all the genes have the same effect then expressing a preference for those with low p-values as compared to high p-values would clearly be a mistake. If a mixture occurs then, as Duncan Murdoch points out, the best that can be achieved is some guidance regarding the direction of future work. The procedure is clearly fraught with dangers. It is especially dangerous if there is no clear a-priori reason to believe which genes would be more likely to have an effect.

I am concerned that in comparable situation in the typically observational science of ecology a researcher could be tempted to go too far and mention “significant” effects as if they have been fully confirmed by this sort of analysis.

>> > Hi Eleni,
>> >
>> > The problem of this approach is easily explained: Under the Null
>> > hypothesis, the P values
>> > of a significance test are random variables, uniformly distributed in
>> > the interval [0, 1]. It
>> > is easily seen that the lowest of these P values is not any 'better'
>> > than the highest of the
>> > P values.
>> >
>> > Best wishes,
>> >
>> > Matthias
>> >
>>
>> Correct me if I'm wrong, but isn't that the point? I assume that the
>> hypothesis is that one or more of these genes are true predictors,
>> i.e. for these genes the p-value should be significant. For all the
>> other genes, the p-value is uniformly distributed. Using a
>> significance level of 0.01, and an a priori knowledge that there are
>> significant genes, you will end up with on the order of 20 genes, some
>> of which are the "true" predictors, and the rest being false
>> positives. this set of 20 genes can then be further analysed. A much
>> smaller and easier problem to solve, no?
>>
>>
>> /Gustaf
> 
> Sorry, it should say 200 genes instead of 20.
> 
I agree with your general point, but want to make one small quibble:
the choice of 0.01 as a cutoff depends pretty strongly on the
distribution of the p-value under the alternative.  With a small sample
size and/or a small effect size, that may miss the majority of the true
predictors.  You may need it to be 0.1 or higher to catch most of them,
and then you'll have 10 times as many false positives to wade through
(but still 10 times fewer than you started with, so your main point
still holds).

Duncan Murdoch

Rationality and the lottery

Posted in Personal and family, Probability and statistics, R scripts by Duncan Golicher on February 11th, 2008

The BBC web site today contained what appears to me to be a misrepresentation of decision theory. The argument goes…

“Should you invest £2 a day or use it to buy lottery tickets?

Maths makes the decision obvious. Suppose you invest two quid every day at the reasonable rate of 10%. It will take you almost exactly 50 years to accumulate £1m. To earn this same £1m in the National Lottery, you would (on average) have to match five numbers and a bonus ball, at odds of 2,330,635-to-1.

If you spent two quid a day for 50 years you would total just over 36,500 tickets and would thus have only a 1-in-63 chance of making that million pounds. However, the available image of immediate wealth subverts this rationality.”

Is this right. Is it “obvious” as the author claims. No it is not. It is far from obvious.

The calculation of compound interest is correct, although banks do not normally compound interest on a daily basis and 10% is rather optimistic. You can check by simulating the arrangement as an R function using numerical integration.

f<-function(ndays=365*50,interest=0.1,value=2){
a<-numeric(ndays)
a[1]<-value
for (i in 2:(ndays)){
a[i]<-a[i-1]+value
a[i]<-a[i]+(interest/365)*a[i]}
a}

par(bg=grey(0.92))
plot(f(v=2),type=”l”,lwd=2,col=”red”,xlab=”Number of days”,ylab=”Accumulated value”)
grid(col=1)

fiig16.png

The money in the bank grows healthily towards the one million target. So what is wrong with the argument? The author claims that the odds of winning a million on the lottery are 2,330,635:1. This is not a fair bet, but it is not such a bad one either. You have just under one chance in two million of winning the one million on offer. The expected value of your one pound ticket is the chances of winning (admittedly very small indeed) multiplied by the sum that would be won (and of course this is very large).
1/2330636* 1000000= 0.4290674

So the expected value of your ticket is about 43p. You have superficially wasted 57p.

The story about all the interest you would get by investing the money is a misleading red herring. If you took the conclusion of a 1:63 ratio between saving and gambling seriously it would persuade you not to buy a single lottery ticket even if the odds on winning were to become more favourable than one in a two million and bettered the value of the prize. Decisions between retaining a small sum with certainty and risking a big one do always involve subjective judgement, but few would not consider the lottery worth a shot if the prize of 1 million could be won at odds of (say) 200,000:1. The author of the article would (on this erroneous logic) still be convinced that it is better to put the money in the bank.

The formula for compound interest can be written as an R function in terms of the principal (p) the number of periods in a year that interest is paid (q), interest rate (i) and number of years(n)

f1<-function(p=1,i=0.1,q=365,n=1)p*(1+(i/q))^(n*q)

So using this function, lets think this all through calmly. If you were to win the lottery tomorrow and do the same with the money as you would have done with the two pounds you spent on the ticket, i.e. invest it at a compound interest of 10% you would be colossally wealthy in fifty years time. Using the same interest rate calculation that the author assumed you would have over 148,000,000.

f1(p=1000000,n=50)
[1] 148311560

On the other hand, if you were to win your million exactly fifty years from now you would just have your million at the end of the period. This would coincide with what you would have gained from saving.

So to reiterate, all wins before the final date are worth more than the saved money in the bank at the end of fifty years, The earlier you win the better. The only addition I have made to the authors’ own argument is to assume (quite fairly) that lottery winnings also gain compound interest. The comparison the author makes between the frugal saver and the lottery player is quite unfair. It uses only the absolute minimum that a lottery win would be worth as the baseline for comparison. The expected lottery winnings at the end of fifty years are quite clearly worth very much more than one million. In fact under this model, it is easy to show that the expected amount is exactly 0.4290674 times the money that would be in the bank if you had not played the lottery, providing comparable assumptions are made regarding the use of the money.

plot(f(v=2),type=”l”,lwd=2,col=”red”,xlab=”Number of days”,ylab=”Accumulated value”)

lines(f(v=2*0.4290674),type=”l”,lwd=2,col=”blue”)

fiig18.png

The differences between the money paid and the in expected value (in purely monetary terms) doesn’t change.The ratio between the red line (saver) and the blue line (expected value from playing the lottery and investing the proceeds) stays the same. Lottery players have (on average) an expected value of around 43% that of the savers. They are worse off, but nowhere near as irrational as the article suggests..

But we can go a step further with the argument. As you think it through and apply common sense it gets better and better for the lottery player. You clearly wouldn’t ever dream of actually investing a million you won tomorrow in order to have megabucks in fifty years time. A small fortune is worth much more to you now than an unspendable fortune in the future. In fact, to you, it is almost certainly worth much more than 148 times its future value, given the positive, life enhancing, potential of a single million. After the first million the next 147 are increasingly irrelevant to your happiness. This could be written using a function that converts money into happiness. This is a curve that reaches some sort of asymptote. The absolute level of the asymptote varies between individuals, but the shape is fixed, even for Bill Gates.

fiig17.png

At the same time the savings should be devalued by the probability of dying before they can be used, the bank suffering a fate worse than Northern Rock, a meteorite strike, or the consequences of catastrophic global warming among a multitude of other scenarios. Depending on just how much all these trade offs come in at (which again is a rather subjective matter), your lottery ticket could easily turn out to be worth more to you than the pound you paid for it.

The original article states ..

If you spent two quid a day for 50 years you would total just over 36,500 tickets and would thus have only a 1-in-63 chance of making that million pounds.

A 1-in-63 chance during a lifetime doesn’t sound so unlikely!
It can in fact be perfectly rational to buy a lottery ticket. Which is why so many rational people do.

End line: Why should this interest a forest ecologist? Because it might explain why sustainable forestry is so difficult!