On friday my colleague Mario Gonzalez brought an interesting article to my attention that has been published on line in the Open Access biology journal PLoS Biology. I am a fan of this journal myself and I have tried to place a feed to the table of contents on this site (WordPress apparently doesn’t pick it up). I was particularly moved by the editors’ personal reasons for supporting Open Access.

The article in question is available by clicking here.

I was intrigued by the use of the words “Assessing evidence” in the title. The monitoring of large permanent plots has become a staple source of evidence for tropical forest ecologists. In recent years a great deal of progress has been made in understanding the processes at work in tropical forests, largely as a result of heroic efforts in a few well studied plots. Permanent plot monitoring involves a huge investment in time and labour. It produces a great deal of replication at the level of individual trees. However the amount of replication at the landscape scale is clearly limited. This leads to inevitable challenges for both statistical analysis and for the interpretation of results. I was very encouraged by the honesty and transparency in the article’s method section regarding problems associated with data quality. For example the authors explain that *“For the plots with more than two censuses, we were able to correct anomalous dbh values by comparing the stem dbh growth rates across census intervals. If a tree showed a dramatic change in dbh growth rate, we changed the one outside of the range (−5 mm/y, +45 mm/y) with the likely value, and updated the dbh value accordingly. This filter was applied using a computer routine, and then checked manually.”*

This sort of analysis involves a lot of work after the data has been collected. It also leads to some necessary but rather arbitrary decisions being taken. Trees generally really shouldn’t shrink over time, but anyone who has worked with this sort of data in the tropics has confronted the problem. One element that might help with this in the future is the wider use of PDAs in the field to enable real time sanity checking of data as it is recorded. Errors are often made in the data capture process and not detected until field work is completed. The level of training of technicians also tends to improve over time. So the next decade’s results should be more consistent than the last.

However this is not th element that most concerned me. Despite the interesting title I was left rather unsure whether the authors really had managed to evaluate **all **the evidence the data set provided. The problem in my mind arose from a common assumption that uncertainty within a statistical analysis is all attributable to variability in the data. In fact any assumption used in any calculation will change the results.. It is now becoming common practice to attempt to quantify the sensitivity of conclusions to assumptions, but this doesn’t seem to have been done here. Instead,even though a complete census was undertaken, statistical significance was based on within sample variability using an arbitrary division into sub-plots and the assumption of independence between them. It is not clear how useful this is.

In fact the main conclusion from the work would seem to be heavily reliant on the use of an allometric equation to convert diameter (usually measured at breast height) into an estimate of total oven-dry aboveground biomass, the fundamental response variable of interest.

Let’s look at how sensitive conclusions might be to this. There is a mistake in the parentheses of the functions as printed in the paper, but as I understand the equations they can be implemented in R using the following code. I have repeated the function with different name and parametrisation for each of the forest types. The important point is that, at least on my initial reading of the analysis, the authors’ applied the same function to all trees in each plot assuming the equation for a forest type applied to the whole plot.

(The code shown below is available in this text file. Use it if there are problems with quotation marks .. permanentplots1.doc )

dry<-function(r=0.5,D,a1=0.667,a2=1.784,a3=0.207,a4=0.0281){

D<-log(D)

r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}

moist<-function(r=0.5,D,a1=1.499,a2=2.148,a3=0.207,a4=0.0281){

D<-log(D)

r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}

wet<-function(r=0.5,D,a1=1.239,a2=1.980,a3=0.207,a4=0.0281){

D<-log(D)

r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}

x<-1:120

plot(x,wet(0.5,D=x),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass”,xlab=”Diameter”)

lines(x,moist(0.5,D=x),col=”blue”,lwd=2)

lines(x,dry(0.5,D=x),col=”red”,lwd=2)

legend(“topleft”,lwd=2,legend=c(“wet”,”moist”,”dry”),col=c(“green”,”blue”,”red”))

A point to notice is that the equations suggest that trees in moist forest have higher biomass than those in dry or wet forest. Seven of the ten plots in the paper are considered “moist” but these must apparently lie somewhere on a continuum between dry and wet. There seems to be no clear justification for using the same optimum equation for all of these plots. I am not quite sure of the units as the paper says this produces above ground biomass in Mg (tons). The units look as if they should be Kg to me.

Also these rather complex looking formula basically reproduce the same curve that could be provided by simply fitting a quadratic function with three parameters over the likely range for the data (10:120 cm lets say). Two parameters do not change.

The curves diverge only diverge sharply for larger trees and it may be argued that really large trees are rare, so this shouldn’t matter too much. In all undisturbed tropical forests that I know of there tend to be a large number of small trees and a very few large trees. Thus a simple simulated data set can be produced by slightly truncating a log normal distribution.

set.seed(1)

D1<-rlnorm(100000,2.2,0.8 )

D1<-D1[D1>10]

D1<-D1[D1<130]

hist(D1,main=”Simulated diameter distribution”,xlab=”DBH (cm)”,col=”lightgrey”)

With time, patience, knowledge of the plots and perhaps a suitable individual based simulation model to hand it would be possible to simulate some reasonable growth and mortality data for each tree over time. For the moment I will simply assume (quite wrongly of course) that all trees increase their diameters equally by 1cm. I will also assume no mortality at all, so an increment in biomass is assured. Not at all realistic, but the question is, how might the increment in biomass in a plot affected by the assumption made regarding the fixed paramaterisation of the allometric equation.

Let’s compare the assumption that the trees follows the dry forest formula against the assumption of the moist forest formula. The code divides the total number of trees into 200 random subplots and carries out the sort of bootstrap resampling that was used to provide confidence intervals. The test of significance used in the paper relied on whether 95% of the bootstrapped values where above zero. In this simulated case, without mortality, this is not interesting. The question is how important is the assumption made for the allometric equation.

nsubplots<-200

subplot<-as.factor(rep(1:nsubplots,each=length(D1)/nsubplots) )

D1<-D1[1:length(subplot)]

D2<-D1+1

d<-data.frame(subplot,D1,D2,moist=moist(D=D2)-moist(D=D1),dry=dry(D=D2)-dry(D=D1))

a<-with(d,tapply(moist,subplot,sum))

boot.moist<-replicate(1000,mean(a[sample(1:nsubplots,nsubplots,replace=T)]))

b<-with(d,tapply(dry,subplot,sum))

boot.dry<-replicate(1000,mean(b[sample(1:nsubplots,nsubplots,replace=T)]))

d2<-data.frame(category=rep(c(“moist”,”dry”),each=1000),bootstrap=c(boot.moist,boot.dry))

bwplot(bootstrap~category,data=d2,breaks=1000)

So in this simulated data, the confidence intervals attributable to the bootstrapping procedure are overwhelmed completely by the effect of changing the assumptions in the allometric equation. This was simulated data and the effect may be much less in real data. But it should be taken into account in some sense. One way of dealing with it is to bootstrap on the parameters used in the entire calculation as was done here.

Another interesting point is that although PLOs Biology is laudably open access, the raw data itself is not made available. Thus I had to simulate data to try to make a point. The future of open access research should (IMHO) be directed towards open, transparent and documented data analysis. R can play an important role in this as any analysis can be freely shared as a script. We always have to make assumptions when analysing data and some assumptions are a matter of personal preference or instinct. A future model for meta-analysis might be based on an “analysis of analyses”. The more ways a data set is looked at the more likely it is that its fundamental content will be revealed.