Qgis Udig and Postgis
Until comparatively recently visualisation of geographical information tended to mean one thing. ArcView.
When Google Earth (and the perhaps under rated and certainly under used NASA Worldwind) appeared on the scene visualisation of free geographical information became popular and ubiquitous. Three dimensional spinning globes revolutionized the way none GIS professionals could work with geographical data. At around the same time some comparatively modest open source projects turned into mature easily used products that could be applied to two dimensional desktop map composition.
At the time of writing the best known of these are QGis (Quantum GIS) and Udig. Both are cross platform. Udig uses Java and runs well under Windows. Qgis still performs rather better under Linux, at least in my experience.
Neither of these programs are true geographical information systems, in the sense that they do not of themselves provide very much in the way of analytical tools. They simply allow geographical information to be visualised in a convenient way. Visualisation is a fundamental element of geographical analysis so it should be done well. These open source programs are acceptable alternatives to the traditional ESRI benchmark.
Users of Arcview commonly activate a range of plugins to give the program more features. Qgis provides a connection with GRASS to provide analytical capability, although I would assume that few GRASS users routinely run GRASS through Qgis. Both Qgis and Udig have wms (world map server) connectivity, which means that they can be used to customize the sort of maps that are found on web sites showing wms data.
An interesting recent development in GIS has been the rapid growth in the use of spatially aware data bases. Most major databases (MySQL, Oracle, Postgresql) can now be given a spatial component. However in the open source context “spatial database” has become almost synonymous with the postgis component of postgresql. It is the built in connectivity with posttgis feature that gives qgis and udig their rapidly expanding potential value. Spatial databases allow sophisticated extraction and combination of geographical data. The results usually usually have to be shown as maps. Command line sql statements are not visual of themselves. QGis and Udig can thus potentially provide the full functionality of an Arc* type environment through levering the capabilities of postgis.
At present this very powerful framework is still in a rather incipient phase, in the sense that there are no GUIs that carry out analyses comparable with Arcview/ArcGis. Most of the functionality of the expensive commercial software is now available through the open source alternatives. However while the documentation is comprehensive it does usually assume prior knowledge of data base programming. Most of the current postgis users are knowledgeable data base programmers with considerable experience using spatial sql queries in the context of online mapserver based web sites. However this is changing fast. The WKB (Well Known Binary) representation of geographical information used in postgis databases looks like a good standard and will undoubtedly be more widely used. This should result in the very rapid development of higher level forms of manipulating information in this format.
One advantage of using an online data base as opposed to separate shapefiles is that the coordinate system of all data has to be consistent, or made so through reprojection when the data base is set up.This is especially important in Mexico where data is often provided without details of the projection and datum used (NAD27 and WGS84 are both used and projection details are rarely provided). Also details of how to connect to the data base using Qgis or Udig can be provided in an email, while it can be difficult to send large shapefiles. Both Qgis and Udig allow POSTGIS data to be saved locally as shapefiles.
Population density and deforestation
An element of spatial analysis that can be both fascinating and frustrating involves resolving the tension between the attraction of visual pattern matching and the formality of statistical analysis.
Traditional non spatial statistics in an observational setting relies heavily on correlation analysis. This alone can be uninformative, misleading and often confusing. Modern spatial statistics that attempt to go beyond the limitations of correlation can be obscure and difficult to communicate. At the same time maps can be difficult to summarize. Presenting data in the form of patterns that overlay each other in space (and sometimes time) is very difficult to achieve convincingly online and much harder to incorporate into a scientific format that still relies heavily on printed material.
Take the example of the relationship between population density and forest cover in Chiapas. My own experience in the region has led me to some simple, robust conclusions based on the evidence I see around me on a daily basis.
1. The highlands of Chiapas have an extremely high population density that can not be sustained through subsistence agriculture. Around San Cristobal this is above 100 people per km2. Yet a surprising amount of forest cover still remains.
2. The Central Depression has a much lower population density and an entirely agricultural economy. It is largely deforested. This is largely true despite the fact that satellite imagery can underestimate the amount of remaining forest.
3. The rural population of the highlands has continued to grow in recent years. However the amount of land dedicated to agriculture in the highlands has slightly decreased as has the value of agricultural production.
4. Recent deforestation in the highlands has disproportionately affected old growth mature forest. Secondary forest in the highlands has tended to expand.
5. The economy of the highlands of Chiapas is growing while the Central Depression is static.
6. Extreme poverty is still more prevalent in the highlands of Chiapas, while chronic rural deprivation is a feature of the central depression.
These observations have been the subject of academic discussion and are not always accepted in the way I have stated them. In particular the empirical relationship between population density and deforestation is often still at the centre of the controversy.
This may be due to trivialization of the obviously non linear (in the statistical sense) pattern I tried to show in the animated gif above (move back up if you didn’t realize it was an animation, I left a longish pause for thought between each frame). Agricultural “frontiers” with no resident population are trivially forested.The agricultural frontier regions also (trivially) show the highest rates of deforestation (deforestation is not shown explicit here, but see some previous posts). However the extremely complex dynamics of forest loss, regrowth and consequent structural and compositional change as a result of the management of a culturally determined landscape are often misrepresented and misunderstood. Attempting to linearize this sort of pattern though correlation analysis is never going to provide insight.
The animation above is based on the 2000 census data which was imported into a POSTGIS database and then visualized using Qgis. The polygons are watersheds, produced using r.watershed in GRASS, translated into a vector layer and also exported to POSTGIS. The populations within the watersheds can then be calculated using a simple spatial query in POSTGIS. I will provide more technical details on these steps in a subsequent post. The size of the points for the population centres is not proportional to population size as such but was chosen for visual clarity.
Little Blue Heron
The Little Blue heron (Egreta caeurulea) is, as its scientific name suggests, a form of egret. Like other egrets it tends to walk when feeding rather than standing patiently in one spot as would a typical heron. Juveniles are white and easily confused with the Snowy Egret. This individual is feeding on small carp in a rapidly shrinking artificial pond close to the house. Permanent water bodies are rare in the highlands of Chiapas thus there are comparatively few specialist water birds even in the wetlands (humedales) of the valley.
Easter holidays
Northern flicker
There are four groups of the Northern Flicker (Colaptes auratus). This is the southernmost sub species, Colaptes auratus mexicanoides (sometimes known as the Guatemalan Flicker). It is a typical woodpecker in many ways and uses old, decaying trees for food and nesting sites. However the species also spends a surprising amount of time on the ground, often picking insects out of hard dried cow pats. This sort of complex habitat use is very common among “forest” birds. Conservation initiatives in the tropics thus need to take a quite sophisticated view of landscape connectivity and habitat diversity. It is not a simple case of planting more trees. Incidentally as I have been asked to provide georeferences and dates, all photos unless otherwise stated have been taken within 1 km of my house ( 16°42′14.62″N, 92°36′32.71″W) and on the same day that they are posted to the weblog. This was on an alder tree along the side of a small stream.
Potential species distributions 1
A constant frustration when working in tropical regions is the shortage of really high quality data. The quantitative knowledge that applied conservation needs in order to prevent the extinction of the world’s endangered species of plants and animals is still surprisingly hard to come by. At best the available hard data is fragmented, of variable quality, difficult to organize and thus hard to analyze in a formal manner. At worst it is either missing or positively misleading
Information used for conservation planning also forms the central subject matter of the scientific discipline of ecology. In other words, it concerns the abundance and distribution of organisms. Our ignorance of this most basic characteristic of the planet we live on is remarkable. Astronomers (arguably) provide estimates of the number of stars in our galaxy with narrower proportional margins of error than we have for the number of species in a Mexican forest. For example, Wikipedia cites three sources in stating that of 2006, the Milky Way is thought to be comprised of between 200 to 400 billion stars. A defensible estimate of the number of tree species in Chiapas could still range from 500 to 2500 depending on how data of dubious quality is interpreted. This is not good enough for such highly visible and important elements of ecological communities. The hard data is just not there to improve on this.
Although in recent years considerable advances have been made in systematics (Linnean shortfalls), these so called “Wallacean” shortfall remains. The large scale, systematic, planned, coordinated field work needed to redress our ignorance is simply not being funded.
In recent years I have been working with techniques for producing automated distribution maps of tropical tree species using R. Even using comparatively good data for areas that we know at first hand, it is a remarkably difficult task. Modelling species distributions can involve more pragmatism than theoretical insight. At a regional scale the task becomes yet more challenging.
This is not because R is too limited in the number of useful tools it provides. A very broad range of models are available. They range from rule based classification and regression trees through AI “black boxes” such as neural networks, to generalised additive models and simple logistic regression. All can be fit within R in a common framework. The models can be used to produce predicted distributions based on either presence-absence data or known occurrences alone (with the addition of “pseudo absences”). It is also possible to automate linkages between R and other popular software such as maxent or GARP. Increasingly accurate layers for many key predictor variables are now available.
However at least fifty well distributed occurrence points are needed to provide credibly well validated distribution maps for a single species using a fully automated procedure. If fewer data points are available some sort of input from “expert judgement” is inevitably required in order to evaluate which of the potentially infinite outputs from species distribution models are most credible. Well documented, repeatable automation of the fitting process using R is very useful in this process, as many maps can be produced in a short space of time based on different assumptions. However resorting to visual pattern matching is frustratingly informal and breaks a standard rule of data analysis, i.e. ensuring perfectly reproducible results.
The code example below is one possible very simple implementation of GAMs with pseudo absences. It is is applied here to modelling the potential distribution of tree species using data points provided mainly from MOBOT. The code (deliberately in this case) does not take into account spatial trends or autocorrelation, thus potential matching climates will often be suggested outside a “known” species range. This can be corrected by adding in coordinates as predictors to the model, but at the cost of potential loss of insight into the distribution of species that may not yet have been collected from their entire range. Any modelling technique has to find a balance between too many false positives and too many false negatives. A common technique is to look at the ROC curve, but this is not particularly useful if the number of known occurrences is very low and restricted to a small part of the suspected range.
I have come to the conclusion that we should not be overly defensive regarding the failings of automated species distribution mapping algorithms when these are inevitably attributable to the poor quality of the underlying data. There is an unavoidable “garbage in, garbage out” syndrome that is difficult to avoid without extremely time consuming data cleaning. This is best undertaken at the point of origin, i.e. in the herbaria and museums where the data is collated.
The models can however provide some useful heuristic input to the evaluation of the potential range of a species. In my opinion, any assessment of the area actually occupied should not be attempted using regional scale models alone. Even obvious techniques such as using forest cover maps as masks to remove non-forested pixels will have mixed results. Some tree and shrub species are still common in areas classified as pasture at a regional scale. Urban areas also can have many trees. Unpublished studies have suggested that tree diversity is higher in urban Managua than in surrounding agricultural areas. Other species have very specific requirements for non mappable habitat characteristics.
These concerns led our group to the conclusion that given the current data the best that can be achieved at a regional scale is to map the distribution of “climatically associated species pools” (CASPS). The “CASP” approach suffers the serious weakness of rather devaluing individualistic species responses, but may be the best that can be achieved until new initiatives begin to provide contemporary, accurately georeferenced and correctly determined occurrence (and absence) data from across the tropics. Data sharing and pooling between currently active research groups will be a vital first step in this process.
The output from the R code above is overlain on the 3d Blue Marble image from code in the previous posts to give a visual impression at a very broad regional level. Green points are the input to the model (recorded occurrences) and red points are grid squares with similar combinations of regionally important climatic variables mapped at an approximately 5 km x 5 km scale. Further details of this and similar procedures including mapping of species pools are available in this document.mnp-workshop3.pdf
This model could provide some heuristic input to expert evaluation of the potential distribution and threats to some species currently being assessed for IUCN red list status. 450 maps produced through an automated procedure are included as low resolution jpg files within pdf archives in alphabetical order here.
Code 9
During the holiday Mickey and I have been watching a program called brainbox challenge on the BBC. Neither of us can keep up with any of the competitors, although Mickey sometimes beats them to it on the memory games. One of the games is called code 9. The rules are (fairly) simple. The competitors are given a mathematical operation to perform on two numbers. However each digit of the numbers has first to be coded by subtracting it from 9. Thus 1 becomes 8; 12, becomes 87 etc. After completing the operation the competitors have to apply the code again to the answer. So 1 +1 becomes 8+8=16, which is then recoded to give the answer 83. As I just can’t get my brain to do it unaided I thought of writing a function in R to help.
This line carries out the operation on whole positive numbers.
code9<-function(x)as.numeric(paste(unlist(lapply(strsplit(as.character(x),”"),
function(x)9-as.numeric(x))),collapse=”"))
so
code9(code9(10)+code9(10))
[1] 821
This works as a one liner, but there should be a better way requiring fewer steps within the function using a regular expression.
3d Belize
At the request of Bruce Miller in Belize I have added a 3d R object for that part of the world. The Modis Blue marble coverage is a bit coarse for this small country and the relief requires more vertical exaggeration, but quite nice results can still be obtained with a Landsat image, despite the cloud cover. The object can be loaded and visualized with the script below. The R object that is loaded is 3MB in size so Bruce may have a very long wait given the glacial speed of the Belizean internet.
This also gives now me a excuse to add a link to BERDS, the inspiring biodiversity database for Belize set up by Jan Meerman.
Blue marble and Landsat in R 3.
A final note on this theme. To add points to a 3d visualisation you need to know z values in order to place them on the surface. The way x,y,z are referred to by rgl is not intuitive and it took me some time to get this to work. The example uses a set of coordinates for the botanical collections and inventories of tree species that are compiled into a database. We are using these data for modelling regional tree species distributions and their potential response to climate change. A major source of this information has been the wonderful online resource provided by Missouri Botanical Gardens MOBOT, but data has also been provided from other sources including Conabio. Here I only provide the raw coordinates of collections, without species names or further details as some of the data is restricted. The interesting point to note is the spatial distribution. Notice that these points are for all trees, thus any one species will form an extremely small subset. Our knowledge of the real distribution of tree species in Mexico is thus clearly still very dependent on expert knowledge and uncompiled data from little explored areas.
The code is here. You should load rgl and sp before running it. addpoints3d.doc
load(url(”http://duncanjg.files.wordpress.com/2008/03/colpoints.doc”))
load(url(”http://duncanjg.files.wordpress.com/2008/03/bmmexicoapril.doc”))
coordinates(a)<-~x+y
a<-overlay(d,a)
a<-as.data.frame(a)
view.3d.rgb<-function(d,dem=4,red=1,blue=2,green=3,exag=1/30000){
fullgrid(d)<-TRUE
y<-d[[dem]]*exag
y[is.na(y)]<-0
dim(y)<-d@grid@cells.dim
x<-seq(min(d@coords[,1]),max(d@coords[,1]),length=d@grid@cells.dim[1])
z<-seq(min(d@coords[,2]),max(d@coords[,2]),length=d@grid@cells.dim[2])
colvec<- rgb(d[[red]],d[[green]],d[[blue]],m=255)
rgl.clear()
rgl.surface(x, z, y, color=colvec)
}
add.points.3d<-function(d=d,x,y,elev,exag=1/30000,…){
rgl.points(x=x,z=max(d@coords[,2])-y+min(d@coords[,2]), y=elev*exag,…)}
view.3d.rgb(d)
add.points.3d(d,a$x,a$y,a$alt,col=2,size=2)
Blue marble and Landsat in R 2.
I have just been asked if the images in the previous post could be produced simply in two dimensions so that points could be easily over plotted. The code to do this can be taken strait from the rimage package and very slightly adapted for a “SpatialGridDataFrame”. I haven’t worked out how it could be done using lattice.
spimage.doc
require(sp)
load(url(”http://duncanjg.files.wordpress.com/2008/03/bmmexicoapril.doc”))
sp.image.rgb<-function(d,red=1,green=3,blue=2,m=255){
colvec <- rgb(d[[red]],d[[green]],d[[blue]],m=255)
colors <- unique(colvec)
d[["colmat"]] <- match(colvec, colors)
image(d,”colmat”,col=colors)
}
sp.image.rgb(d)
Blue marble and Landsat in R
R is of course a statistical environment, so it really shouldn’t be expected to be doing the job of GRASS (Nviz), Google Earth, WorldWind, Qgis or Udig as well. However some nice visualizations can be done with rgl, including rgb composites from satellite images. They can even be more useful for quick regional overviews than the alternatives mentioned above.
The code to produce these images be found here bluemarble3d.doc. The first line will download the imported spatial object from this site. R does need a lot of memory for images of this size, so I guess you need a minimum of 1GB RAM (I am using a Toshiba Tecra laptop with Nvidia running Ubuntu Feisty 7.04 with 2GB). Given that, you should get very own zoomable 3d image of Mexico and Central America to play with in the same time it takes the screenshot below to run. It shows the steps in real time. (For R beginners, just open the file, paste all the code into the R console and wait a little for the download). The resolution is quite coarse (2 minute) so this is only suitable for a regional overview. You will need the rgl and sp packages installed first of course.
Exactly he same can be done with Landsat imagery. landsat.doc
If you want to get your own coverages for this for your own region the starting point is to run R from GRASS in a lat-long location. For example
g.region -p
projection: 3 (Latitude-Longitude)
datum: wgs84
ellipsoid: a=6378137 es=0.006694379990141317
north: 32:51:47.86586N
south: 5:28:57.975592N
west: 118:30:20.609692W
east: 76:26:43.294846W
nsres: 0:02:00.060768
ewres: 0:01:59.982024
rows: 821
cols: 1262
cells: 1036102
Then get some of the Blue Marble (Modis) imagery that can be downloaded from the NASA wms strait into GRASS using r.in.onearth. From within R you can use system to run GRASS. This does work with GRASS under Cygwin in Windows with the shell command.
system(”r.in.onearth -b file=/tmp/test month=Apr time=2005-3-24 ‘wgetopt=-c -t 5 –user-agent=MSIE5.5′ “)
The layers can then be imported into R with spgrass6.
library(spgrass6)
d<-readRAST6(c(”BMNG_Apr.red”,”BMNG_Apr.blue”,”BMNG_Apr.green”,”alt”))
Notice that this line imported a digital elevation model of my own as well, as one is of course needed for the 3d terrain effect.
Disturbance and the Lotka Volterra competition model
One of the works that provided constant inspiration both as an undergraduate and graduate student was Huston’s huge 1994 monograph. Despite some inevitable repetition in a book of this length I have continued to find new ideas in it over the years. The book contained a very simple demonstration of the intermediate disturbance hypothesis using the Lotka Volterra competition equations that I continue to use in classes. It can be implemented very concisely in R. The idea is simple. Reset the classic competition equations at periodic intervals and see how it affects the balance between species with r or K strategies. A very simple model with important implications.
Source here …lotka-volterra.doc
R<-c(0.1,0.12,0.15,0.2,0.3)
K<-c(1000,800,600,200,100)
Run<-function(runtime=100,freq=101,intensity=0.1){
N<-numeric(runtime*5)
N<-matrix(N,nrow=runtime,ncol=5)
N[1,]<-10
for (i in 2:runtime){
Totpop<-sum(N[i-1,])
for (j in 1:5){
N[i,j]<-N[i-1,j]+N[i-1,j]*R[j]*(1-Totpop/K[j])
}
if(!i%%freq){
N[i,]<-N[i,]*intensity
}
}
res<-data.frame(N)
res
}
#####################################
par(mfcol=c(2,2))
res<-Run()
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Sin perturbacion”)
res<-Run(freq=60)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion infrequente”)
res<-Run(freq=25)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion intermedio”)
res<-Run(freq=12)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion frequente”)
Reference
Huston, M.A. 1994. Biological Diversity: The Coexistence of Species on Changing Landscapes. Cambridge University Press, 708 pp.
Fuelwood model
This morning I designed a simple simulation model in R with a small tcltk interface based on the concepts in the previous two posts. The aim is to allow a prediction of the probable future yield of fuelwood from plots with uncertain planting or survival densities, uncertain diameter distributions and uncertain parameters regarding diameter-height relationships and wood densities.
The model can be run by sourcing the file below into R (install tcltk first). Again this is work in progress and may change as colleagues provide improvement through feedback and new data. The aim is to provide a useful tool for prediction in a highly uncertain environment through giving a probability that a fuelwood or carbon sequestration project’s aims could be met based on known or assumed planting densities and growth rates.
Acer negundo L. subsp. mexicanum (DC.) Wesm.
Acer. negundo is a familiar tree to North Americans who know it as Box Elder” or “Box Elder Maple”. This is apparently due to the fact that its its whitish wood resembles boxwood and the similarity of its pinnately compound leaves with those of some species of elder (Sambucus). Other common names are based upon this maple’s similarity to ash as it is the only North American maple with compound leaves. There are many varieties or sub-species across its range and the species is a good candidate for more work on its genetics.
The species is fully dioecious and these are clearly male flowers. They are very attractive to bees. This individual is in flower this week close to Ecosur.
Permanent forest plots 2.
After thinking a little more about the problem of the allometric equation in the paper that I discussed yesterday, I realised I had a partial solution already to hand. (Again please read this as a marker for an idea that could be developed further. This is one of the motivations for using an informal web log instead of leaving ideas as notes on my own hard disk.)
In order to work with a fairly simple individual based forest model I have implemented many different allometric equations in R over time. My favourites for use in data poor areas are still those derived originally by Botkin and Shugart from way back in the late seventies (see for example Shugart 1984). I have used them in both my thesis and a later implementation of the same model. Results from this are included in a recent book chapter. I aim to include the entire R code for this model on this site some point soon (after tidying up some elements and documenting).
The nice thing about the parameterisations used in these early models is that they were explicitly designed to be run with a minimal amount of knowledge. Superficially that doesn’t sound too encouraging. Shouldn’t we try to improve a model by adding more explicit detail? Not necessarily. After all purely empirical allometric equations assume no knowledge at all, apart from that provided through the data. That is their strength and weakness. In order to be used correctly you have to build them afresh for each area being studied. However if equations can be constrained by some very easily obtained field observations a sensible adjustment can be made for the conditions of each plot without going through the costly process of measuring everything afresh. The results certainly won’t be precise, but that is the point. Uncertainty can be included in the calculations through Monte Carlo simulation. The true value should then fall somewhere in the predicted range.
Botkin’s equations can turn measured diameter at breast height into estimated height and therefore estimated tree volume for any species based simply on an estimate of the maximum diameter the species can obtain and its maximum diameter. It is fundamentally a three parameter quadratic model constrained to take realistic values. As maximum height is quite site specific and can be estimated through rapidly made observations the equations are potentially very useful in the tropics where actual measurements of tree height are hard to obtain, or in cases of models of newly established forests when final height may have to be estimated from some knowledge of site quality. Notice that the equation discussed yesterday must be based implicitly on some sort of rule regarding tree height and form factor. You simply can’t get around this. The equation is a statistical black box that just takes in diameter and churns out above ground biomass. The fact that the equation doesn’t explicitly state that height has been estimated doesn’t make it robust. You simply can’t turn diameter into volume and thus biomass without making some sort of assumption regarding the height and shape of a tree, even if it is hidden within an impenetrable calculation. This is a very common failing of empirical equations and a good reason to generally mistrust their uncritical application (This is often the result of a misplaced appeal to authority, an empirical equation has been published by an important research group so it is eminently citable. Yes fine and good, but the equation still only applies to their particular data set.)
So here are the set of simple equations Botkin suggested. I have prefixed each function with f. to avoid confusing a function placed in the global environment in R with a variable.
(All the code shown below can be downloaded in this script … permanentplots2.doc)
f.B2<-function(Hmax,Dmax)2 * ((Hmax - 137) / Dmax)
f.B3<-function(Hmax,Dmax) (Hmax - 137) / Dmax ^ 2
f.BA<-function(Diameter)pi * (Diameter / 2) ^ 2
f.Height<-function(Diameter,Hmax,Dmax)137 + f.B2(Hmax,Dmax) * Diameter - f.B3(Hmax,Dmax) * Diameter ^ 2
f.AGB<-function(D,Hmax=2000,Dmax=100,FormFactor=0.8,r=0.5)r*FormFactor * f.BA(D) * f.Height(D,Hmax,Dmax)/1000
Let’s see how the equation for height works in practice. Assume maximum diameter is 120 cm, but forests (or species) vary in the maximum height they can reach. Note that as this is a quadratic (i.e. parabola) maximum diameter must coincide with the maximum in the data or the curve turns downwards.
D<-10:120
plot(D,f.Height(D,Dmax=120,Hmax=3000)/100,type=”l”,lwd=2,col=”green”,ylab=”Height (m)”,xlab=”Diameter”,ylim=c(0,30))
lines(D,f.Height(D,Dmax=120,Hmax=2000)/100,col=”blue”,lwd=2)
lines(D,f.Height(D,Dmax=120,Hmax=1000)/100,col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”High (30m)”,”Medium (20m)”,”Low (10m)”),col=c(”green”,”blue”,”red”))
Now let’s compare with the formulas used in the paper and included in the previous post and multiply appropriately by basal area and form to produce volume.
plot(D,wet(0.5,D=D),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass (kg)”,xlab=”Diameter”)
lines(D,moist(0.5,D=D),col=”blue”,lwd=2)
lines(D,dry(0.5,D=D),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))
points(D,f.AGB(D=D,Hmax=3500,Dmax=120),col=”blue”)
points(D,f.AGB(D=D,Hmax=2000,Dmax=120),col=”green”)
points(D,f.AGB(D=D,Hmax=1500,Dmax=120),col=”red”)
legend(”bottomright”,pch=1,legend=c(”Hmax=20m”,”Hmax=35m”,”Hmax=15m”),
col=c(”green”,”blue”,”red”))
So the curves are very comparable. The second set can be produced by making a single reasonable assumption regarding maximum height. In fact the equation for above ground biomass has four parameters. Maximum height, maximum diameter, “form factor” and wood density. The form factor is a term used in forestry and here it might best be expressed as “branchiness”. Due to the way tree growth occurs (the “pipestem”model) total above ground volume can never be expected to exceed the volume of a cylinder. Foresters are only interested in the trunk, but a factor ranging between 0.6 and 1 can be applied quite reasonably to biomass as a whole. Here I have set the value at 0.8. This reproduces quite well the original allometric equations. We can add in uncertainty regarding this parameter in the analysis.
This is now the point at which the equations become useful. If a lot of knowledge is available, then separate parameters can be assigned for each species. Maximum height is quite easy to get.Simply by making this justifiable assumption accuracy should improve markedly over the “one size fits all” assumption that was implicit in the paper under discussion. A nice element is that the quadratic is constrained so the meaningless estimates that can sometimes arise when using polynomials outside their range of applicability will never occur. With less information functional groupings can be used. In the worst case scenario the same equation can be used for all trees in the forest, as was apparently done with the empirical allometric equation in the paper (with the exception of including a species specific term for wood density).
Using R it is very easy to include uncertainty regarding these parameters. As has been pointed out in the previous post, this is likely to affect the conclusions drawn much more than within sample variability. The point is that bootstrapping on parameters of a polynomial allometric equation with no real life interpretation would be very dangerous. The curve could go in many unexpected directions. In contrast bootstrapping on an equation with direct interpretation is reasonable. Lets use a boxplot as a quick way of looking at the results. Lets assume a great deal of uncertainty in Hmax (20m with a sd of 3m) and some uncertainty in the form factor (0.8, sd=0.1). It is easy to extend the uncertainty to all four parameters.
boot.strap<-replicate(10000,f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1)) )
boot.strap<-data.frame(rep(10:120,times=10000),as.vector(boot.strap))
names(boot.strap)<-c(”D”,”V”)
boxplot(V~D,data=boot.strap,range=0.1,outline=F,col=2)
So we can see how uncertainty in total above ground biomass increases with the diameter of the tree. Finally lets see how this could be applied to the simulated data. If we just want an uncertain estimate of total above ground biomass by applying the bootstrapped conversion formula to a single large plot we can do it like this.
D<-rlnorm(100000,2.2,0.8 )
D<-D[D>10]
D<-D[D<120]
boot.strap<-replicate(1000,sum(f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1))) )
quantile(boot.strap/1000000,c(0.05,0.5,0.95))
5% 50% 95%
6.97 9.75 13.04
plot(density(boot.strap/1000000,bw=0.7),xlab=”Total above ground biomass kg x 10^6″)
abline(v=quantile(boot.strap/1000000,c(0.05,0.5,0.95)),col=c(”blue”,”black”,”red”))
The extension of the technique to be used on the difference between two measurement periods is trivial if the data is available. More accuracy is obtained by narrowing the range of parameters by using a separate parameter for each species, so there is clearly a cost of ignorance at work here.. This can even be quantified (cf with this post) The method can clearly also be applied to estimating future biomass yield in the case of plots being planted for fuelwood or biofuels, based on making some very simple but justifiable assumptions regarding growth form, wood density and planting density. Because a measure of uncertainty is produced this is very useful in decision making.
Permanent forest plots
On friday my colleague Mario Gonzalez brought an interesting article to my attention that has been published on line in the Open Access biology journal PLoS Biology. I am a fan of this journal myself and I have tried to place a feed to the table of contents on this site (WordPress apparently doesn’t pick it up). I was particularly moved by the editors’ personal reasons for supporting Open Access.
The article in question is available by clicking here.
I was intrigued by the use of the words “Assessing evidence” in the title. The monitoring of large permanent plots has become a staple source of evidence for tropical forest ecologists. In recent years a great deal of progress has been made in understanding the processes at work in tropical forests, largely as a result of heroic efforts in a few well studied plots. Permanent plot monitoring involves a huge investment in time and labour. It produces a great deal of replication at the level of individual trees. However the amount of replication at the landscape scale is clearly limited. This leads to inevitable challenges for both statistical analysis and for the interpretation of results. I was very encouraged by the honesty and transparency in the article’s method section regarding problems associated with data quality. For example the authors explain that “For the plots with more than two censuses, we were able to correct anomalous dbh values by comparing the stem dbh growth rates across census intervals. If a tree showed a dramatic change in dbh growth rate, we changed the one outside of the range (−5 mm/y, +45 mm/y) with the likely value, and updated the dbh value accordingly. This filter was applied using a computer routine, and then checked manually.”
This sort of analysis involves a lot of work after the data has been collected. It also leads to some necessary but rather arbitrary decisions being taken. Trees generally really shouldn’t shrink over time, but anyone who has worked with this sort of data in the tropics has confronted the problem. One element that might help with this in the future is the wider use of PDAs in the field to enable real time sanity checking of data as it is recorded. Errors are often made in the data capture process and not detected until field work is completed. The level of training of technicians also tends to improve over time. So the next decade’s results should be more consistent than the last.
However this is not th element that most concerned me. Despite the interesting title I was left rather unsure whether the authors really had managed to evaluate all the evidence the data set provided. The problem in my mind arose from a common assumption that uncertainty within a statistical analysis is all attributable to variability in the data. In fact any assumption used in any calculation will change the results.. It is now becoming common practice to attempt to quantify the sensitivity of conclusions to assumptions, but this doesn’t seem to have been done here. Instead,even though a complete census was undertaken, statistical significance was based on within sample variability using an arbitrary division into sub-plots and the assumption of independence between them. It is not clear how useful this is.
In fact the main conclusion from the work would seem to be heavily reliant on the use of an allometric equation to convert diameter (usually measured at breast height) into an estimate of total oven-dry aboveground biomass, the fundamental response variable of interest.
Let’s look at how sensitive conclusions might be to this. There is a mistake in the parentheses of the functions as printed in the paper, but as I understand the equations they can be implemented in R using the following code. I have repeated the function with different name and parametrisation for each of the forest types. The important point is that, at least on my initial reading of the analysis, the authors’ applied the same function to all trees in each plot assuming the equation for a forest type applied to the whole plot.
(The code shown below is available in this text file. Use it if there are problems with quotation marks .. permanentplots1.doc )
dry<-function(r=0.5,D,a1=0.667,a2=1.784,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}
moist<-function(r=0.5,D,a1=1.499,a2=2.148,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}
wet<-function(r=0.5,D,a1=1.239,a2=1.980,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}
x<-1:120
plot(x,wet(0.5,D=x),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass”,xlab=”Diameter”)
lines(x,moist(0.5,D=x),col=”blue”,lwd=2)
lines(x,dry(0.5,D=x),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))
A point to notice is that the equations suggest that trees in moist forest have higher biomass than those in dry or wet forest. Seven of the ten plots in the paper are considered “moist” but these must apparently lie somewhere on a continuum between dry and wet. There seems to be no clear justification for using the same optimum equation for all of these plots. I am not quite sure of the units as the paper says this produces above ground biomass in Mg (tons). The units look as if they should be Kg to me.
Also these rather complex looking formula basically reproduce the same curve that could be provided by simply fitting a quadratic function with three parameters over the likely range for the data (10:120 cm lets say). Two parameters do not change.
The curves diverge only diverge sharply for larger trees and it may be argued that really large trees are rare, so this shouldn’t matter too much. In all undisturbed tropical forests that I know of there tend to be a large number of small trees and a very few large trees. Thus a simple simulated data set can be produced by slightly truncating a log normal distribution.
set.seed(1)
D1<-rlnorm(100000,2.2,0.8 )
D1<-D1[D1>10]
D1<-D1[D1<130]
hist(D1,main=”Simulated diameter distribution”,xlab=”DBH (cm)”,col=”lightgrey”)
With time, patience, knowledge of the plots and perhaps a suitable individual based simulation model to hand it would be possible to simulate some reasonable growth and mortality data for each tree over time. For the moment I will simply assume (quite wrongly of course) that all trees increase their diameters equally by 1cm. I will also assume no mortality at all, so an increment in biomass is assured. Not at all realistic, but the question is, how might the increment in biomass in a plot affected by the assumption made regarding the fixed paramaterisation of the allometric equation.
Let’s compare the assumption that the trees follows the dry forest formula against the assumption of the moist forest formula. The code divides the total number of trees into 200 random subplots and carries out the sort of bootstrap resampling that was used to provide confidence intervals. The test of significance used in the paper relied on whether 95% of the bootstrapped values where above zero. In this simulated case, without mortality, this is not interesting. The question is how important is the assumption made for the allometric equation.
nsubplots<-200
subplot<-as.factor(rep(1:nsubplots,each=length(D1)/nsubplots) )
D1<-D1[1:length(subplot)]
D2<-D1+1
d<-data.frame(subplot,D1,D2,moist=moist(D=D2)-moist(D=D1),dry=dry(D=D2)-dry(D=D1))
a<-with(d,tapply(moist,subplot,sum))
boot.moist<-replicate(1000,mean(a[sample(1:nsubplots,nsubplots,replace=T)]))
b<-with(d,tapply(dry,subplot,sum))
boot.dry<-replicate(1000,mean(b[sample(1:nsubplots,nsubplots,replace=T)]))
d2<-data.frame(category=rep(c(”moist”,”dry”),each=1000),bootstrap=c(boot.moist,boot.dry))
bwplot(bootstrap~category,data=d2,breaks=1000)
So in this simulated data, the confidence intervals attributable to the bootstrapping procedure are overwhelmed completely by the effect of changing the assumptions in the allometric equation. This was simulated data and the effect may be much less in real data. But it should be taken into account in some sense. One way of dealing with it is to bootstrap on the parameters used in the entire calculation as was done here.
Another interesting point is that although PLOs Biology is laudably open access, the raw data itself is not made available. Thus I had to simulate data to try to make a point. The future of open access research should (IMHO) be directed towards open, transparent and documented data analysis. R can play an important role in this as any analysis can be freely shared as a script. We always have to make assumptions when analysing data and some assumptions are a matter of personal preference or instinct. A future model for meta-analysis might be based on an “analysis of analyses”. The more ways a data set is looked at the more likely it is that its fundamental content will be revealed.
Aggregating time series in R: The Iraq body count
Yesterday I was asked whether R could be used to analyse time series. The answer is of course it can. R is used extensively in the financial sector for analysing complex time series such as stock prices. I have already included an example using R in the context of climate variability (El Niño). One challenge is that there are a lot of different ways of working with time series and representing dates. Aggregation can be rather tricky. The zoo package is one of the most powerful tools for working with time series, but it is not always simple to use. I still haven’t got my head around all the different ways to achieve results.
So here is an example. The code first reads in the data from an online database, then uses tapply to sum the number of casualties per day. Zoo is then used to aggregate by month and the total is plotted. If anyone reading this can suggest better ways to do this or add a more sophisticated analysis I would like to know.
(Open this document if the code doesn’t work due to problems with quotation marks bodycount.doc)
Or try this …
source(url(”http://duncanjg.files.wordpress.com/2008/03/bodycount.doc”)) ,
again you might need to retype the quotation marks)
library(zoo)
library(date)
a<-”http://www.iraqbodycount.org/database/download/ibc-incidents”
d<-read.csv(url(a),skip=11,header=T)
a<-tapply(d$Reported.Minimum,d$Start.Date,sum)
x <- zoo(as.vector(a), as.POSIXct(as.date(row.names(a))))
f<-function(x)as.POSIXct(as.yearmon(x))
Deaths<-aggregate(x,f, sum)
par(bg=”lightgrey”)
plot(Deaths,lwd=3,col=2)
However you get there, the result is shocking. These are documented civilian casualties and I chose the lower estimate.
Even if the trend was initially downwards after the start of the surge, it still only bottomed out at around the level it began at when the “mopping up” operations were taking place in the first few months after the invasion. The time series taken from the compiled online data base stops in January 2008, so it doesn’t take into account the recent renewal of violence. Today, Thursday 6 March there were 86 civilian dead. Two bomb attacks killed 68 in Baghdad alone. Apparently this is not an isolated incident. The trend is sadly upwards again.
Joseph Stiglitz has estimated the price of the war at 3 trillion dollars. I wasn’t even sure what a trillion was until he used the figure. It has twelve zeros, in other words a thousand billion. That works out at 30 million dollars for each dead Iraqi civilian or enough to make ever one of these people as rich as Bill Gates. No further comment, apart from a heartfelt request to visit the site of those who have worked so hard to compile this important data set and re-run the code periodically to check the updated figures.
As an addition, I was saddened to hear that Harvard professor Samantha Power has resigned from Obama’s campaign team this week, apparently for speaking too openly on this, among other, issues. I was extremely impressed by her thoughtful, yet emotional, contribution to BBC radio’s start the week in which she talked of the biography she has written on Sergio Vieira de Mello, a heroic figure who constantly impressed me in every interview I heard with him up to his tragic death in Iraq. The recent news suggests that honest, open minded expression by academics is still considered to be a liability, even for apparently honest, open minded, politicians. This is a link to the Hard Talk interview.
Open source, open ideas
The nobel prize winning economist Joseph Stiglitz is very fond of quoting Thomas Jefferson. “Knowledge is like a candle; as it lights another candle, its original light is not diminished”.
Economists like Stiglitz refer to the opposite of this property as exclusivity. In other words if I eat a box of chocolates or turn a tree into timber, you can’t have the chocolates or enjoy the shade and scenic beauty provided by the tree. Only one of us can consume a consumable. On the other hand if I have an idea I can share it with you, if you are interested. If my idea happens to be a good one then denying access to it, or the right to use it, gives rise to an inefficiency. Traditional intellectual property is based on such exclusion. Knowledge is a global public good that is of potential benefit to anyone in the world. There is a global social cost in depriving anyone in the world the right to use available knowledge.
Most of our individual ideas are “memes”. In other words we have all been lit from someone else’s candle. Originality in science is extremely difficult and is only partially achieved even by the small handful of creative individuals that manage to provide a new spark. We all worry that we are wrong or being too derivative when we think or write about scientific issues but I have (quite recently) come to the conclusion that we really shouldn’t be worried that our ideas are not original providing we are casting some light, somewhere. The more candles that are out there, the more light is cast and the collective effect is to everyone’s benefit.
In the open source model of software development the light cast has an even more positive effect, because software is itself useful as a tool for achieving other goals. Software developers in fact make some of the candles. I have neither the time nor the ability to contribute code to open source projects, but I appreciate the tremendous generosity of all those who do and try to support and use open source alternatives to commercial software as a matter of principle.
A simple illustration of an ecological lottery model in R
A colleague contacted me yesterday with some interesting modelling work based on the neutral community model of Stephen Hubbell.
Hubbell’s work fascinated me when it was first published. I have always assumed that it is easily misunderstood. If Hubbell is taken to imply that all species are equal, then the work appears to be eroding the basic subject matter of ecology. Ecologists tend to look for informative differences between species. However this is not exactly what the theory is based on. It is rather more subtle. The theory concerns the equality of individuals (in this sense it almost has political implications). Hubbell writes.
“The theory treats organisms in a community as essentially identical in their per capita probabilities of giving birth, dying, migrating, and speciating. This neutrality is defined at the individual level, not the species level. All that is required is that all individuals of every species obey exactly the same rules of ecological engagement. So, for example, if all individuals and species enjoy a frequency-dependent advantage in per capita birth rate when rare, and if this per capita advantage is exactly the same for each and every individual of a species of equivalent abundance then such a theory would qualify as a bona fide neutral theory by the present definition.”
I have illustrated this idea for students with a very small simulation in R . R code, like matlab or octave, is very terse and efficient so a couple of lines can do a fair amount of work. Lets assume a very simple reproductive rule applies to all individuals. The rule is that all have the same discrete generation time, in other words they all survive one time step of a simulation. At the end of the time step each individual produces two offspring. The lottery then applies. The offspring are randomly mixed and placed back on the finite space occupied by the previous generation. So half find a home and half “die”. The process is then repeated. This was exactly how I first implemented the model. I then re-implemented the idea in fewer lines by making the probability of selection depend on the proportion of individuals in each species, which amounts to the same thing.
This is all the code needed to run the model. (Try this if the quotation marks are not right lottery.doc)
nspecies<-8
gridsize<-20
time<-100
RUN<-function(){
X<<-matrix(0,nspecies,time)
palette(rainbow(nspecies))
mat<-sample(1:nspecies,gridsize*gridsize,replace=T)
X[,1]<-table(mat)
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
for (i in 2:time){
a<-X[,i-1]
mat<-sample(1:nspecies,gridsize*gridsize,prob=a,replace=T)
X[,i]<-table(factor(mat,levels=1:nspecies))
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
gc()
}
matplot(t(X),type=”l”,lwd=2,col=rainbow(nspecies))
}
#However to get a little tcltk interface add this
############################################
library(tcltk)
Run<-function(…){
nspecies<<-as.numeric(tclvalue(”nspecies”))
gridsize<<-as.numeric(tclvalue(”gridsize”))
time<<-as.numeric(tclvalue(”time”))
RUN()}
tt <-tktoplevel()
d1<-tklabel(tt,text=”Numero de species”)
d2<-tklabel(tt,text=”Tamaño del grid”)
d3<-tklabel(tt,text=”Tiempo”)
s1 <- tkscale(tt,from=0, to=20, variable=”nspecies”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s2 <- tkscale(tt, from=2, to=100, variable=”gridsize”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s3 <- tkscale(tt, from=10, to=500, variable=”time”,
showvalue=TRUE, resolution=10, orient=”horiz”)
but <- tkbutton(tt, text=”Run”,command=Run)
tkgrid(d1,s1)
tkgrid(d2,s2)
tkgrid(d3,s3)
tkgrid(but)
The model is to be played with (just download R from CRAN and paste all the code into the console), rather than watched but here are some example runs on You Tube,
The moral of the story is that under a neutral model some species start a random walk to extinction if co-existing in a small space with other species. However we cannot tell which will dominate and which become extinct from their characteristics, because all individuals, regardless of species are governed by exactly the same rules. The fate of species is determined by the cumulative luck or misfortune of the individuals that have been given the label. The larger the grid the more species that can be supported for a longer time, but extinction risk is always present if a long losing streak hits. The element of Hubbell’s theory that this model does not reproduce well is the log-normal (-ish) relative abundance distribution of the surviving species, although come to think of it I have not tried running the model with a really large number of species and a really large grid. R runs out of colours for the illustration, but this is not important.
Many interesting questions remain. Could a simulation model based on this sort of idea be used to predict extinction over a real landscape? Models of this sort seem far too abstract to provide real insight into the real world. But in some sense the process that Hubbell describes in a more formal mathematical way in his book probably does apply. But, how can these sort of ideas be communicated accurately to prevent them from provoking the sort of unproductive SLOSS debate that has often resulted from theory being pushed too far?
Deforestation in La Sepultura 2.
Understanding and predicting patterns of deforestation in El Tablon.
(Note: This is an informal marker for work in progress that will be formalised at a later date as part of the reforlan project. Any suggestions and contributions are welcome. The work has not been reviewed and the conclusions may change. Please contact me if you wish to use any part of this analysis )
Deliberate destruction of existing forest cover is clearly a necessary component of anthropogenic deforestation. However, it is not a sufficient condition. For permanent deforestation to result there must be an impediment to forest regeneration. If vigorous regeneration occurs after timber harvest, or other forest use, then the activity meets at least one of the most basic criteria for sustainability. Unfortunately disturbance of the dry forests of La Sepultura biosphere is not being followed by regeneration. In the last decade the number of cattle grazing extensively within the buffer region of the biosphere reserve has increased rapidly. This has led not only to direct, deliberate removal of forest in order to provide pasture but also an increase in fire severity as the vegetation becomes more open and fuel dries out. Loss of vegetation leads to both chronic and acute soil erosion. Trampling and browsing by cattle prevents tree seedlings establishing. The result is generally agreed by both residents and researchers in the region to be causing degradation in the long term productivity.
This afternoon a colleague sent me a students’ analysis of deforestation in a very different region for comments. The student had produced tables of correlation coefficients and then then fitted a multiple linear regression in order to “explain” the factors responsible for deforestation. This is common way of looking at deforestation, but not a very informative one. I was completely confused and couldn’t see the point of the analysis.
Here I will show what I would consider a more informative way to proceed. In most cases we start an analysis of deforestation with a reasonable amount of prior knowledge concerning causal processes and linkages. The aim of the analysis should be to try to gain some new insights or reinforce an important, interesting and informative linkage. If for example poorer farmers are considered more likely to deforest their land than wealthier farmers then an analysis based on a correlation between income and forest use may be useful. However all correlation analyses must be approached with extreme care. For example, if poorer farmers live in areas with poorer soils where regeneration doesn’t occur then the linkages are clearly more complex.
GIS allows us to analyse whole landscapes with complete raster coverages. This puts into very sharp relief the traditional use of statistical tests of significance. By taking all the pixels huge amounts of data points could be analysed. Significance is then virtually guaranteed for even the most minor correlation, simply by assuming (obviously wrongly) that all points are independent. Now that would clearly be bad practice. But, how many independent data points are there on any landscape? It is really impossible to tell. Fitting variograms can give a hint of typical range of autocorrelative effects but it is always a matter of degree. If one landscape represents a single altitudinal gradient, then in one sense there are no degrees of freedom to play with.
A reasonable starting position is to put on hold the idea of automated statistical inference based on known degrees of freedom (independent replication) and work on producing informative documentation of patterns. Many scientifically interesting hypotheses can be tested during this process.
R and GRASS can work together very easily under Linux. R can be run from GRASS and then on loading the package spgrass6 raster coverages can be read into memory directly form the current GRASS region.
projection: 99 (Transverse Mercator)
zone: 0
datum: wgs84
ellipsoid: a=6378137 es=0.006694379990141317
nsres: 29.99281781
ewres: 30.00276767
rows: 844
cols: 1007
cells: 849908
d<-readRAST6(c(”chisdem”,”slope”,”beam1″,”for75″,
“for87″,”for99″,”for2005″,”basin.tablon”,”streams.buffer”,”cost2″,”topindex”))
These layers are as follows:
chisdem is the digital elevation model
slope in degrees
beam1 direct beam insolation under clear skies on the first clear day of the year.
for75 forest cover in 1975
for87 forest cover in 1987
for99 forest cover in 1999
for2005 forest cover in 2005
basin.tablon watersheds with a threshold of 1000 ha
streams.buffer distances of 100,1000,2000,5000,10000,15000,100000 from water courses.
topindex
Without needing accces to GRASS the RspatialPixelsDataFrame can be loaded into memory from the Ecosur ftp site using the code below. The rest of the analysis can then be duplicated in R under Windows. Source here … tablon.doc
urlstring<-”ftp://anonymous@200.23.34.16/simulacion/Datos”
a<-paste(urlstring,”tablon.rob”,sep=”/”)
load(url(a))
Now although R is capable of fitting models using the entire coverages, this would be “overkill”. We can take either a regular, stratified or random set of points using spsample. If points are going to be fairly close together, i.e. the whole landscape pattern is of interest, then a regular sampling scheme is probably preferable. The code below samples 50,000 points from the whole bounding box, in other words the number of points falling on the masked area is rather less. The points are then overlaid on the coverages to extract the values.
a<-spsample(d,50000,”regular”)
image(d,1,col=terrain.colors(100))
points(a,pch=21,cex=0.15,bg=2)
a<-as.data.frame(overlay(d,a))
Now, how can we look at systematic patterns over the landscape? There are three potentially orthogonal coverages that might collectively form a useful starting point. They are cost of access as measured in cumulative slope percentage from the entrance to the watershed, slope itself, and direct beam radiation in the winter months (an informative variable derived from aspect). Our hypothesis, that has already been partially confirmed by visual pattern analysis, is that deforestation has affected the most accessible areas with gentler slopes first. This is admittedly something of a “no brainer” but controlling for it allows more subtle elements to be analysed. The effect of winter insolation is the most interesting as it has many applied implications for management of the buffer zone. It appears that slopes that receive more sunlight in the winter are vulnerable to deforestation on two counts. 1) Fuel dries out more quickly leading to more severe fires 2) Regeneration is prevented by hydric stress experienced by juvenile trees.
So if a clearly monotonic relationship between direct beam radiation and the probability of a pixel being deforested exists we would have clear evidence that a process is at work. This can support further research in the area and suggest more sustainable management practices. A useful modern statistical tool for looking at this is provided by generalised additive models. There are several reasons for using GAMs as a default first analysis of this complex pattern. 1) The response is binomial (forest-non forest), so scatterplots are not useful. 2) An additive model allows partialling out of effects, in other words statistically holding for the effects of variables 3) Generalised additive models use splines to depict the shape of the relationship. These are flexible curves, thus if the response is multi-modal or inconsistently “wavy” we can spot this early in the analysis and thus not be fooled by the assumption of a simple monotonic relationship if it doesn’t exist in the data. Overly complex responses tend to raise a red flag. In this context they are most probably attributable to other highly autocorrelated spatial processes overlaying the area of interst. For example if three forest plantations had been established at differing elevations a spline would be trimodal, but knowledge of the history of the area would reveal why an attempt to relate tree cover to elevation would not be an informative analysis.
So lets fit a gam and look at the results. We can use the package mgcv
library(mgcv)
mod<-gam(for2005~s(cost2)+s(beam1)+s(slope),data=a,family=”binomial”)
par(mfcol=c(2,2))
plot(mod)
![]()
There are several things to notice in these graphs. First the y axes show the response on the scale of the linear predictor (logit). This lies between – infinity and infinity. It is not natural to think on a logit scale. Back transformation is more usual to improve interpretability. This doesn’t matter here as we are not interested in the absolute calibration for the moment. The questions involve the shape of the relationships. The results are quite encouraging. All three are effectively monotonic and form close approximations to strait lines on the scale of the linear predictor. In other words the probability of finding a forested pixel increases with slope and inaccessibility but decreases sharply with winter direct beam insolation. The slight waviness at the top of the first curve can be attributed to the scarcity of data points at the extreme. The confidence interval widens. So we are quite justified in using a simple linear model as a descriptor of the pattern. Notice that as data points are not independent we do not even think of talking about statistical significance. However we have now picked up a clear trend over the landscape.
However now things can become more complex. General additive models are just that. Additive. In other words we can’t analyse interactions. There must be many interactive effects occurring here. In fact a complete predictive model of the landscape would be best provided by a black box such as a neural network. We can build models of arbitrary complexity form this sort of data. Even very complex uninterpretable terms are often supported. Thus model selection is a very complex issue. Lets simplify it using some pragmatism and common sense rather than statistical reasoning. We have established that winter radiation, accessibility and slope all have monotonic relationships with deforestation on the scale of the logit linear predictor. Thus we can now justify the use of classical logistic regression that can include an interaction term. Let’s ask if accessibility interacts with insolation after accounting for slope.
mod1<-glm(for2005~cost2+beam1+slope,data=a,family=”binomial”)
mod2<-glm(for2005~cost2*beam1+slope,data=a,family=”binomial”)
mod3<-glm(for2005~cost2*beam1,data=a,family=”binomial”)
anova(mod1,mod2,mod3)
Analysis of Deviance Table
Model 1: for2005 ~ cost2 + beam1 + slope
Model 2: for2005 ~ cost2 * beam1 + slope
Model 3: for2005 ~ cost2 * beam1
Resid. Df Resid. Dev Df Deviance
1 18077 17205.8
2 18076 17137.8 1 68.0
3 18077 17675.7 -1 -537.9
So adding in the interaction reduces the deviance, i.e. it produces a better fit to the data. It is particularly important to include slope per se in the model as an additional term. Notice we have still not carried out a significance test (the reduction in deviance would be significant and would be strongly supported by information criteria .
Lets look at model 2 in more detail. We can do this using the effects package.
library(effects)
plot(all.effects(mod2),x.var=”cost2″)
plot(all.effects(mod2),x.var=”beam1″)
Note that effects plots also use the linear predictor scale, but are annotated using the response (probability of deforestation). The panel plots can be arranged using either variable. In the second example they progressively move from accessible to inaccessible (bottom left to top right). So we can conclude that the effects of winter insolation as a factor in deforestation seem to be more marked in more accessible areas even after accounting for slope. This helps to discount the possibility that we are analysing an artefact of the shadowing effect during the classification process and detecting a genuine trend. It also suggests an interesting process at work. Less remote areas have undergone a form of chronic deforestation through the cumulative effects of forest fires, uncontrolled browsing and poor regeneration that has been accentuated on south facing slopes receiving more sun in the winter months. It does indeed seem to be that deforestation is arising from poor regeneration.
We can go a step further and use the model to simulate how the historical deforestation process would have proceeded if these general relationships are assumed to have held in the past. The way to do this is to first predict the probability that a pixel is forested using the model based on the current pattern. Then arrange pixels in rank order. A simple simulation can be made by taking them in this order and deforesting in steps up to the current level.
mod<-glm(for2005~cost2*beam1+slope,data=a,family=”binomial”)
d[["pred"]]<-predict(mod,newdata=d@data)
d[["rank"]]<-rank(jitter(d[["pred"]]),na.last=”keep”)
for (i in seq(1000,90000,length=10)){
d[["defor"]]<-ifelse(d[["rank"]]<i,0,1)
image(d,”defor”,col=c(”red”,”green”))
view.3d(d,1,exag=1.5,14,pal=c(”red”,”green”))
filename <- paste(”pic”, sprintf(”%07.0f”, i),”.png”,sep=”")
rgl.snapshot(filename)
}
Now there are still some quite big issues with this. Although a reasonable match to the current pattern is provided, it is not perfect. This is probably due to nucleation effects of settlements. I will show how this can be added into the analysis and used to look at future deforestation risks in a later post.
Deforestation in La Sepultura
The animation above first shows an accessibility index calculated as a cost surface with the entrance to the watershed as a starting point. There is an excellent “How To” on calculating cost surfaces in GRASS available here. The cost surface (warm colours accessible, cold colours inaccessible) is followed by an animation of deforestation estimated by supervised classification of a series of landsat images taken in1975, 1987, 1999 and 2003. The area is a focus case study, the Tablon watershed in the Sepaltura biosphere reserve. The set of images in Google Earth format are available here.
Landsat imagery is the most frequently used for tracing patterns of deforestation over time. Landsat has been acquiring coverage of the Earth’s surface since 1972 when Landsat 1 was launched. Since then, four other satellites have been in operation. Landsat 1, 2, and 3 flew in a circular orbit 913 kilometers (570 miles) above the Earth’s surface and circled the Earth every 103 minutes (14 times a day). Landsat 4 and 5 fly about 705 kilometers (440 miles) above the Earth and circle every 98 minutes.
Landsat 4 and 5 are still operating. Landsat 6 was launched in 1993. Thus the sensors used to provide images have changed over time and are not easily comparable. The current Thematic Mapper (TM) class of Landsat sensors began with Landsat-4 (1982). This series continued with the nearly identical sensor on Landsat-5 (1984). Landsat-7 Enhanced Thematic Mapper Plus (ETM+) was carried into orbit in 1999. Currently, both the Landsat-5 TM and the Landsat-7 ETM+ are operational and providing data. Landsat-7 ETM+ experienced a failure of its Scan Line Corrector mechanism in May 2003. Data products have been developed to fill these gaps using other ETM+ scenes.
The images used for this analysis were downloaded from the global land cover facility and processed in GRASS. We are currently engaged in a classifying deforestation over a much larger area using similar imagery. Quantifying patterns of deforestation from these historical satellite images is never straightforward, for many reasons. Differences in the conditions when the images where taken can lead to misclassification of pixels leading to errors when documenting the overall spatial pattern. However even though it can be difficult to tell with certainty if a specific small area has changed, the general regional trends detected are quite reliable.
Recently high resolution images of this particular study region have become available and we are acquiring more. It is encouraging to find that at least the classification into “forest-non-forest” that has been achieved using Landsat imagery provides a very acceptable match with the visual impression provided by high resolution images. This can be seen below. You can check your own visual impression of the match by downloading the KML files and changing the transparency in Google Earth. The following lines load the layers into R.
urlstring<-”ftp://anonymous@200.23.34.16/simulacion/Datos”
a<-paste(urlstring,”tablondefor.rob”,sep=”/”)
load(url(a))
In flower this week
Publication bias
One of the stories in the news in the UK this week was the publication of a study by Irving Kirsh from the University of Hull and various colleagues on the efficacy of anti depressant drugs, such as the well known selective serotonin uptake inhibitor, Prozac. (101371_journalpmed0050045-l.pdf) . The researchers gained access to previously unpublished studies. This led them to some interesting conclusions. Although meta-analyses of antidepressant medications report modest benefits over placebo treatment, when the unpublished trial data are included, the benefit falls below accepted criteria for clinical significance. There is a link to a file of a BBC radio podcast on the subject here. There are also some earlier studies by the same group that tell a similar story 1171.pdf
It is easy to see this as a simple example of drug companies being cynically economical with the truth. However even drug companies stand to lose out in the long run if they peddle merchandise that doesn’t do what it says on the can. A more important aspect of the study is its broader implic



