A simple illustration of an ecological lottery model in R
A colleague contacted me yesterday with some interesting modelling work based on the neutral community model of Stephen Hubbell.
Hubbell’s work fascinated me when it was first published. I have always assumed that it is easily misunderstood. If Hubbell is taken to imply that all species are equal, then the work appears to be eroding the basic subject matter of ecology. Ecologists tend to look for informative differences between species. However this is not exactly what the theory is based on. It is rather more subtle. The theory concerns the equality of individuals (in this sense it almost has political implications). Hubbell writes.
“The theory treats organisms in a community as essentially identical in their per capita probabilities of giving birth, dying, migrating, and speciating. This neutrality is defined at the individual level, not the species level. All that is required is that all individuals of every species obey exactly the same rules of ecological engagement. So, for example, if all individuals and species enjoy a frequency-dependent advantage in per capita birth rate when rare, and if this per capita advantage is exactly the same for each and every individual of a species of equivalent abundance then such a theory would qualify as a bona fide neutral theory by the present definition.”
I have illustrated this idea for students with a very small simulation in R . R code, like matlab or octave, is very terse and efficient so a couple of lines can do a fair amount of work. Lets assume a very simple reproductive rule applies to all individuals. The rule is that all have the same discrete generation time, in other words they all survive one time step of a simulation. At the end of the time step each individual produces two offspring. The lottery then applies. The offspring are randomly mixed and placed back on the finite space occupied by the previous generation. So half find a home and half “die”. The process is then repeated. This was exactly how I first implemented the model. I then re-implemented the idea in fewer lines by making the probability of selection depend on the proportion of individuals in each species, which amounts to the same thing.
This is all the code needed to run the model. (Try this if the quotation marks are not right lottery.doc)
nspecies<-8
gridsize<-20
time<-100
RUN<-function(){
X<<-matrix(0,nspecies,time)
palette(rainbow(nspecies))
mat<-sample(1:nspecies,gridsize*gridsize,replace=T)
X[,1]<-table(mat)
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
for (i in 2:time){
a<-X[,i-1]
mat<-sample(1:nspecies,gridsize*gridsize,prob=a,replace=T)
X[,i]<-table(factor(mat,levels=1:nspecies))
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
gc()
}
matplot(t(X),type=”l”,lwd=2,col=rainbow(nspecies))
}
#However to get a little tcltk interface add this
############################################
library(tcltk)
Run<-function(…){
nspecies<<-as.numeric(tclvalue(”nspecies”))
gridsize<<-as.numeric(tclvalue(”gridsize”))
time<<-as.numeric(tclvalue(”time”))
RUN()}
tt <-tktoplevel()
d1<-tklabel(tt,text=”Numero de species”)
d2<-tklabel(tt,text=”Tamaño del grid”)
d3<-tklabel(tt,text=”Tiempo”)
s1 <- tkscale(tt,from=0, to=20, variable=”nspecies”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s2 <- tkscale(tt, from=2, to=100, variable=”gridsize”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s3 <- tkscale(tt, from=10, to=500, variable=”time”,
showvalue=TRUE, resolution=10, orient=”horiz”)
but <- tkbutton(tt, text=”Run”,command=Run)
tkgrid(d1,s1)
tkgrid(d2,s2)
tkgrid(d3,s3)
tkgrid(but)
The model is to be played with (just download R from CRAN and paste all the code into the console), rather than watched but here are some example runs on You Tube,
The moral of the story is that under a neutral model some species start a random walk to extinction if co-existing in a small space with other species. However we cannot tell which will dominate and which become extinct from their characteristics, because all individuals, regardless of species are governed by exactly the same rules. The fate of species is determined by the cumulative luck or misfortune of the individuals that have been given the label. The larger the grid the more species that can be supported for a longer time, but extinction risk is always present if a long losing streak hits. The element of Hubbell’s theory that this model does not reproduce well is the log-normal (-ish) relative abundance distribution of the surviving species, although come to think of it I have not tried running the model with a really large number of species and a really large grid. R runs out of colours for the illustration, but this is not important.
Many interesting questions remain. Could a simulation model based on this sort of idea be used to predict extinction over a real landscape? Models of this sort seem far too abstract to provide real insight into the real world. But in some sense the process that Hubbell describes in a more formal mathematical way in his book probably does apply. But, how can these sort of ideas be communicated accurately to prevent them from provoking the sort of unproductive SLOSS debate that has often resulted from theory being pushed too far?
Publication bias
One of the stories in the news in the UK this week was the publication of a study by Irving Kirsh from the University of Hull and various colleagues on the efficacy of anti depressant drugs, such as the well known selective serotonin uptake inhibitor, Prozac. (101371_journalpmed0050045-l.pdf) . The researchers gained access to previously unpublished studies. This led them to some interesting conclusions. Although meta-analyses of antidepressant medications report modest benefits over placebo treatment, when the unpublished trial data are included, the benefit falls below accepted criteria for clinical significance. There is a link to a file of a BBC radio podcast on the subject here. There are also some earlier studies by the same group that tell a similar story 1171.pdf
It is easy to see this as a simple example of drug companies being cynically economical with the truth. However even drug companies stand to lose out in the long run if they peddle merchandise that doesn’t do what it says on the can. A more important aspect of the study is its broader implications for the scientific publication process in general. The message is that referees should not reject a study for publication merely on the grounds that “significant” results were not produced. Observational studies in ecology are particularly difficult and statistical significance testing is often quite inappropriate and misleading. I will return to this theme in more detail. Meanwhile here is a link to the web site of the Journal of Negative Results Ecology and Evolutionary Biology.
Another, slightly more subtle lesson to be drawn from this study concerns the general importance of data sharing and data pooling. Real insight into how a process works in an applied context often needs a lot of data. In this case, once enough data had been assembled the researchers not only were able to ask the question “is there a response” but could go further and ask about the shape of the response. The question was meaningful as its answer suggested where the benefit of the drugs effects could lie. This is a critical element of statistical analysis that introductory courses on statistics often overlook. Often the most meaningful questions concern issues such as “is the inclusion of a quadratic term supported by the data” rather than a test of the null hypothesis of no effect. Such questions are often unanswerable by a single study. This is another reason why researchers and organizations that do not make raw data available hold back science.
Kirsch I, Deacon BJ, Huedo-Medina TB, Scoboria A, Moore TJ, et
al. (2008 ) Initial severity and antidepressant benefits: A meta-analysis of data submitted to theFood and DrAdministration. PLoSMed 5(2): e45. doi:10.1371/journal. pmed.0050045
Screening using comparisons between p-values?
There was an interesting exchange on the R-help list yesterday. A researcher proposed to screen a large number of genes for a “significant” effect on survival time using a large number of univariate significance tests. Several people on the list thought it really wasn’t such a good idea. I have included the original communications on the list at the foot of this message.
The perhaps rather counterintutive point is that under the null hypothesis a p value of 0.05 is just as likely as a p value of 0.5.
How can this be so? Put this way it does sound rather odd. But the p values produced under the null hypothesis are themselves random variables. They are uniformly distributed between 0 and 1. Why should any particular p value be more likely than any other? It is not. Data produced under the null hypothesis can range from the highly probable to the highly improbable without changing the fact that the null hypothesis was used in the data producing process. Watch this very simple simulation closely if you don’t follow this.
R will produce a vector of 10 numbers taken from a normal distribution with mean zero and sd 1 using the following command.
> rnorm(10,0,1)
[1] -0.3736018 -0.2327996 -0.8154836 0.3663073 1.0702547 1.3302237
[7] 1.3972863 1.2029137 -0.9293702 0.6351127
Every time R does this the mean will not be exactly zero, but the numbers are taken from a population with mean zero. A null hypothesis test can be used to test the probability of getting these data if the true population mean is zero (which we know it is). This can be done in R by fitting a linear model with only an intercept and testing for significance of the intercept
summary(lm(rnorm(10,0,1)~1))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.06043 0.33829 0.179 0.862
Residual standard error: 1.07 on 9 degrees of freedom
So this time I am not tempted to reject the null hypothesis.
Let’s do the same one thousand times and look at the results.
samples<-replicate(1000,rnorm(10,0,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

Now this distribution for the test results on the simulated data is, of course, quite obvious when you think about it. About one in twenty of these particular results are less than 0.05 . Any value between 0 and 1 is equally likely, making 19 out of 20 test results fall above this cut off. This is explicit in the definition of a significance test.
If the true mean for the simulations is not zero then the distribution of the p values will change.
samples<-replicate(1000,rnorm(10,0.5,1))
test.results<-apply(samples,2,function(x)summary(lm(x~1))$coefficients[4])
hist(test.results)

The researcher’s intention is to screen for genes that are more likely to have an effect. Thus the interest lies in comparisons between p-values. If all the genes screened have no effect at all then the technique is misleading, even if Bonferoni or any other corrections is applied, as only false positives will ever be found. If some of the genes do have a (small) effect there is no reason to believe that all those with an effect will provide p-values of less than 0.05. The actual results could be a mixture between the two histograms.
If all the genes have the same effect then expressing a preference for those with low p-values as compared to high p-values would clearly be a mistake. If a mixture occurs then, as Duncan Murdoch points out, the best that can be achieved is some guidance regarding the direction of future work. The procedure is clearly fraught with dangers. It is especially dangerous if there is no clear a-priori reason to believe which genes would be more likely to have an effect.
I am concerned that in comparable situation in the typically observational science of ecology a researcher could be tempted to go too far and mention “significant” effects as if they have been fully confirmed by this sort of analysis.
Hi Eleni, The problem of this approach is easily explained: Under the Null hypothesis, the P values of a significance test are random variables, uniformly distributed in the interval [0, 1]. It is easily seen that the lowest of these P values is not any 'better' than the highest of the P values. Best wishes, MatthiasCorrect me if I'm wrong, but isn't that the point? I assume that the hypothesis is that one or more of these genes are true predictors, i.e. for these genes the p-value should be significant. For all the other genes, the p-value is uniformly distributed. Using a significance level of 0.01, and an a priori knowledge that there are significant genes, you will end up with on the order of 20 genes, some of which are the "true" predictors, and the rest being false positives. this set of 20 genes can then be further analysed. A much smaller and easier problem to solve, no? /GustafSorry, it should say 200 genes instead of 20.
I agree with your general point, but want to make one small quibble: the choice of 0.01 as a cutoff depends pretty strongly on the distribution of the p-value under the alternative. With a small sample size and/or a small effect size, that may miss the majority of the true predictors. You may need it to be 0.1 or higher to catch most of them, and then you'll have 10 times as many false positives to wade through (but still 10 times fewer than you started with, so your main point still holds). Duncan Murdoch
Research interests
I have a broad range of research interests focussed largely but not exclusively on tropical forest ecology. I am currently mainly concerned with the potential effects climate change on the distribution and abundance of tropical organisms. This on-going research involves developing and testing methods for effectively modelling tree species distributions at a regional scale.
My 2001 doctoral thesis entitled “The dynamics of disturbed Mexican Pine Oak Forest: A modelling approach” combined individual based computer modelling at the University of Edinburgh with field research in Chiapas in order to further understanding of the complex successional patterns of highland forests. Some of the main findings have been summarised in a recent book chapter (see publications)
In order to generalise understanding to a wider regional scale I became interested in the use of Geographical Information Systems, image classification and spatial modelling. This led to a productive, ongoing collaboration with my colleague and friend Luis Cayuela, currently employed at the University of Alcala, Madrid.
During the time I have spent in Mexico I have come to understand that ecological research in the tropics has to address many common challenges related to data quality and quantity. A theme to my work has thus been the use of contemporary computer modelling and statistics in order to extract inference from “difficult” data sets. This has led to an interest in the use of Bayesian methods in the search for more robust forms of inference under uncertainty.
I am responsible for a course the use of computer simulation as a research tool at post graduate level. I also gain a great deal of satisfaction from participating in more applied research and courses on Ecological Restoration and Conservation.
Publications
The linked file can be imported into endnote or reference manager. golicher.doc
Golicher D.J., Cayuela L., Alkemade J.R.M., González-Espinosa M. and Ramírez-Marcial N. 2007.
Applying climatically associated species pools to the modelling of compositional change in
tropical montane forests. Global Ecology and Biogeography 10.1111/j.1466-823.
Golicher J.D., O’Hara R.B., Ruíz-Montoya L., Cayuela L. 2006. Lifting a veil on diversity: a
bayesian approach to fitting relative-abundance models. Ecological Applications. 16(1): 202-
212.
Golicher J.D, Ramírez-Marcial N., y Levy-Tacher S.I. 2006. Correlations between precipitation
patterns in Southern Mexico and the El Niño sea surface temperature index. Interciencia. 31(2):
Golicher D. and Newton A.C. 2007. Applying succession models to the conservation of tropical
montane forest. En: Newton A.C. (Ed.). Biodiversity loss and conservation in fragmented forest
landscapes: the forests of montane Mexico and temperate South America. CAB International.
Wallingford, United Kingdom. pp. 200-222
Cayuela L., Golicher D.J., Rey Benayas J.M., González-Espinosa M. & Ramírez-Marcial N. 2006.
Fragmentation, disturbance and tree diversity conservation in tropical montane forests. Journal
of Applied Ecology. 43: 1172-1181
Newton A.C., Stewart G.B., Diaz A., Golicher D., Pullin A.S. 2007. Bayesian Belief Networks as a
tool for evidence-based conservation management. Journal for Nature Conservation 15: 144-
160.
Newton A.C., Marshall E., Schreckenberg K., Golicher D., te Velde D.W., Edouard F. & Arancibia E.
2006. Use of a bayesian belief network to predict the impacts of commercializing non-timber
forest products on livelihoods. Ecology and Society. 11(2): 24. [online] URL:
http://www.ecologyandsociety.org/vol11/iss2/art24
Van der Wal H., Golicher J.D., Caudillo-Caudillo S., Vargas-Domínguez M. 2006. Plant densities,
yields and area demands for maize under shifting cultivation in the Chinantla, Mexico.
Agrociencia. 40: 449-460
Cayuela L., Golicher J.D. and Rey-Benayas J.M. 2006. The extent, distribution, and fragmentation
of vanishing montane cloud forest in the Highlands of Chiapas, Mexico. Biotrópica. 38(4): 544-
554
Cayuela L., Golicher J.D., Salas Rey J., Rey Benayas J.M. 2006. Classification of a complex
landscape using Dempster-Shafer theory of evidence. International Journal of Remote Sensing.
27(10): 1951-1971
Diemont S.A.W., Martin J.F., Levy-Tacher S.I., Nigh R.B., Ramirez Lopez P., Golicher J.D. 2006.
Lacandon Maya forest management: Restoration of soil fertility using native tree species.
Ecological Engineering. 28: 205-212
González-Espinosa M., Rey-Benayas J.M., Ramírez-Marcial N., Huston M.A., Golicher D. 2004.
Tree diversity in the northern Neotropics: regional patterns in highly diverse Chiapas, Mexico.
Ecography. 27(6): 741-756
Levy Tacher S., Golicher D. 2004. How predictive is traditional ecological knowledge? the case
of the lacandon maya fallow enrichment system. Interciencia. 29(9): 496-502.
