“Raster” data in PostGIS?
PostGIS does not support raster data as yet. Hopefully there should be raster support soon. Raster layers can be visualised from a local or remote server using geoserver. However having raster layers integrated into a PostGIS data base would be extremely useful for tasks such as overlaying herbarium records with continuous extrapolated climate data or elevation.
I have previously experimented with importing the points of a raster layer. This works quite well. It can be used for simple spatial overlays using proximity operations. However the visual impression produced by the points is not very satisfactory and a query using the contains operator would be more efficient.
QGIS has a plugin for graticule generation. This forms a grid of individual polygons. So if a grid is produced using the graticule plugin of QGIS and then imported into PostGIS using SPIT it is possible to partially mimic a raster layer, although with some computational overhead.
The document mnp-workshop3.pdf showed how worldclim data (http://www.worldclim.org/) can be imported into GRASS and R.
If R is run from GRASS the following lines can be used to import the appropriate raster layers at a chosen resolution into R.
Most of the lines of the script are concerned with calculations on the basic worldclim data. The dataset consists of 36 grids of monthly mean minimum temperature, maximum temperature and precipitation. Some mean annual values may be derived for simplicity.
a<-c(paste(”tmin”,1:12,sep=”"),paste(”tmax”,1:12,sep=”"),paste(”prec”,1:12,sep=”") ,”alt”)
d<-readRAST6(a)
d@data[,1:24]<-lapply(d@data[,1:24],function(x)x/10)
fullgrid(d)<-FALSE
f<-function(x)mean(x[1:12])
d@data[["meanmim"]]<-apply(d@data,1,f)
f<-function(x)mean(x[13:24])
d@data[["meanmax"]]<-apply(d@data,1,f)
f<-function(x)mean(x[1:24])
d@data[["meantemp"]]<-apply(d@data,1,f)
f<-function(x)sum(x[25:36])
d@data[["totprec"]]<-apply(d@data,1,f)
The trick for importing the layers to PostGIS from R is shown below. If the data is in R as a spatialPixelDataFrame the coordinates can be added to the data to form a table.
dd<-data.frame(coordinates(d),d@data)
This table can then be saved to a PostGIS database using ODBC.
library(RODBC)
con<-odbcConnect(”mydb”)
odbcQuery(con,”SET search_path =chiapas, pg_catalog;”)
sqlSave(con,dd,”worldclim”)
###########################################
Now either in pgadmin or running the queries directly through R by placing them inside an odbcQuery function.
ALTER TABLE chiapas.worldclim add column gid serial NOT NULL;
ALTER TABLE chiapas.worldclim ADD CONSTRAINT worldclim_pkey PRIMARY KEY(gid);
ALTER TABLE chiapas.worldclim ADD COLUMN the_geom geometry;
UPDATE chiapas.worldclim
SET the_geom = PointFromText(’POINT(’ || s1 || ‘ ‘ || s2 || ‘)’,4326);
CREATE INDEX
grat30arc_idx_geo on grat30arc USING GIST(the_geom GIST_GEOMETRY_OPS);
CREATE INDEX
worldclim_idx_geo on chiapas.worldclim USING GIST(the_geom GIST_GEOMETRY_OPS);
If a polygon graticule called grat30arc covering the same region with a 30 arch second grid has been produced using the graticule plugin in Qgis and imported into the public schema then the average values of the points within each grid square can then be calculated for the graticule and a new table produced using this as a model query.
CREATE TABLE chiapas.worldclim2 with oids as
SELECT a.*, g.totprec, g.meanmin, g.meanmax, g.elev from
grat30arc a,
(SELECT AVG(c.totprec) as totprec,
AVG(c.meanmim) as meanmin,
AVG(c.meanmax) as meanmax,
AVG(c.alt) as elev,d.gid
FROM chiapas.worldclim c, grat30arc d
WHERE c.the_geom && d.the_geom
AND contains(d.the_geom, c.the_geom) group by d.gid) g
WHERE a.gid=g.gid;
The result is very similar to the visualisation of an arcgrid in arcview when it is viewed in QGis over a shaded relief map. In this case the pixels are approximately 1 km x 1 km (30 arc seconds or 0.008333333 degrees) and the elevation lines up well with the contours derived from a 1:250,000 map.
Spatial indices on the herbarium table and the worldclim2 layer with the following
CREATE INDEX
herb_idx_geo on herb.herbario USING GIST(the_geom GIST_GEOMETRY_OPS);
CREATE INDEX
worldclim2_idx_geo on chiapas.worldclim2 USING GIST(the_geom GIST_GEOMETRY_OPS);
Then the following query runs in less than four seconds to provide 6,000 herbarium records with extrapolated climate data to the nearest km.
SELECT h.*, w.totprec, w.elev,w.meanmin,w.meanmax
FROM herb.herbario h, chiapas.worldclim2 w
WHERE h.the_geom && w.the_geom
AND contains(w.the_geom, h.the_geom);
Or
select h.*, w.totprec, w.elev,w.meanmin,w.meanmax
from herb.herbario h, chiapas.worldclim2 w
where ST_contains(w.the_geom, h.the_geom);
Deforestation:Overlaying vector and raster layers in PostGIS using GRASS
A very common operation in GIS involves overlaying vector polygons on a raster map and calculating the area of each class in the raster map. The results are then added to the attribute table of the vector map. How can this be achieved in PostGIS?
If the raster layer is quite small it could be imported as points and the result achieved through a point in polygon SQL overlay. However that is not the usual way to achieve the operation. When large raster layers derived from the analysis of satelite imagery are of interest the conventional GIS approach is to first rasterise the vector layer. The computations involved in the overlay are then simple and fast.
This example taken from our ongoing work. It overlays a map of agricultural units on a raster layer representing deforestation in the last decade derived from analysis of satelite data. Because deforestation in Chiapas has tended to take place on a rather small scale involving only a few hectares at a time, it is not at all easy to see the large scale regional pattern from an image. Calculating the proportion of the total area of each agricultural unit deforested helps to see the pattern. The map above shows the percentage of the total (forested + non-forested) area of each agricultural unit estimated to have been deforested between 1990 and 2000.
The simplest way to achieve this result with PostGIS is to import the vector layer into GRASS, where the raster is also held. There are a few issues that have to be addressed when doing this. The first is that the gid index is lost when the attributes are imported. As a unique identifier for each polygon is needed during the overlay an index should be added to the table first if the gid is the only unique identifier for the table. This is very simple.
ALTER TABLE chiapas.agros ADD COLUMN id int8;
UPDATE chiapas.agros SET id=gid;
Now enter GRASS in a location with the same projection as the PostGIS data base (typically EPSG:4326)
Now there is another technical issue. Clearly vector processing in any GIS involves a combination of the geometery and attribute tables. We have both already in PostGIS. However in GRASS we use functions that work on GRASS’s own representation of the geometry, so an import is needed. GRASS can use a wide variety of databases as a back end for storing attribute data. Among these is Postgresql. What is not obvious, and potentially quite confusing, is that we already have the data in a Postgresql data base! So, does it all need to be imported again? The answer seems to be yes and no. It is quite possible to link the GRASS topology with the original table. However it is much easier to import it all in the usual way using v.in.ogr. This produces a new table, that could either stored in an external data base or perhaps even within the original PostGIS data base (as a new copy). Of course this attribute only table lacks the geometry column.
For example to set up GRASS to use the original local PostGIS data base as the backend for vector layers
db.connect driver=pg database=”host=localhost,dbname=gisdb2″
db.login user=Duncan pass=******
Note that this step is not necessary. By default Grass will store the table as in dbf format.
A layer can be imported from PostGIS using
v.in.ogr dsn=”PG:host=localhost dbname=gisdb2 user=Duncan” layer=chiapas.agros output=agros type=boundary,centroid
If the raster layer is in another coordinate system, as it is in this case, it will be necessary to exit the location and reproject in the location of the raster layer.
v.proj input=agros location=Global mapset=mesoanalysis output=agros
Then form a raster layer.
v.to.rast input=agros output=agros use=attr column=id layer=1 value=1
Then in GRASS the overlay is achieved quite simply using r.stats and sending the output to a text file.
r.stats -a input=agros,defor >agros.txt
This forms a file in which the first column holds the id of the vector layer, the second the class in the raster layer and the third the area in square meters of each combination. For example…
1 111 1319094.000000
1 112 159201.000000
1 122 1041304.500000
1 211 4026323.250000
1 222 85285437.750000
1 244 89347.500000
1 422 52796.250000
1 444 1749586.500000
2 111 565326.000000
So the task is now done, apart from the job of adding the data back into the original PostGIS table. There are many ways to do this. Often only one, or a few focal classes are of sufficient interest to merit a new column. So an easy way to process the output and send it back to PostGIS involves R.
Connect to the PostGIS data base using RODBC and set the schema
library(RODBC)
con<-odbcConnect(”mydb”)
odbcQuery(con,”SET search_path =chiapas, pg_catalog;”)
Read in the data and set the first column as numeric, removing any missing data (by default it will probably have been interpreted as a factor as GRASS will use a * for missing data).
d<-read.table(”agros.txt”)
d$V1<-as.numeric(as.character(d$V1))
d<-na.omit(d)
Now a subset of the data can be taken for each of the columns of interest, named and added back to the data base. In this case code “111″ represents area that was classified as forest in 1990,2000 and 2005. “122″ pixels that were deforested in the period between 1990 and 2000.
odbcQuery(con,”ALTER TABLE agros ADD COLUMN forest int4;”)
d2<-subset(d,d$V2==111)
d2<-d2[,-2]
names(d2)<-c(”gid”,”forest”)
d2$forest<-round(d2$forest/10000,0)
sqlUpdate(con,d2,”agros”)
odbcQuery(con,”ALTER TABLE agros ADD COLUMN defor2000 int4;”)
d2<-subset(d,d$V2==122)
d2<-d2[,-2]
names(d2)<-c(”gid”,”defor2000″)
d2$defor90<-round(d2$defor2000/10000,0)
sqlUpdate(con,d2,”agros”)
We are rather concerned that this cover change analysis, that has been achieved with the standard methodology used by Conservation International, tends to quite seriously underestimate small scale disturbance and canopy opening. Nevertheless in comparative terms it does demonstrate that absolute deforestation in Chiapas is certainly not a homogeneous process. Extensive deforestation is not occurring, although if the map is zoomed out further to show the Marques de Comillas area a very clear hotspot of deforestation is shown. In this part of Chiapas (Central Valley, Sierra and Altos) comparatively few agricultural units have deforested more than 2% of their total area in the last decade.
The other way of looking at this would be as the proportion of remaining forest lost. This woud tend to emphasise deforestation rates in ejidos and agricultural units with a very small ammount of forest in 1990. However at this level the approach would be somewhat unfair as it would be biased by classification errors in small areas.
To put the map in context a map with the proportion of the total area classified as forest can be used. In R the column can be added using
odbcQuery(con,”ALTER TABLE agros ADD COLUMN propforest float;”)
odbcQuery(con,”update agros set propforest=100*forest/hectares;”)
A surprising number of the agricultural units have over 40% tree cover. This has to be placed in the context of a classification method that draws no distinction between a pixel within a pasture with sufficient cover of secondary vegetation to be classified as forest and untouched primary forest. The next step clearly involves analysis of fragmentation in order to clarify this.
An example of using PostGIS from R
One of the interesting elements of PostGIS is that it can be run from R using RODBC. I have found this to be very useful, especially when tables are being altered, as it allows the properties of the results of a query to be quickly analysed and checked for sanity before being added to the data base. So in order to work with POSTGIS I usually open QGIS, PgAdmin and an R console and switch between them.
Here is an example of how you might calculate the population density within polygons and add the results to a PostGIS table.
In R first connect to a data base assuming an ODBC connector has been set up
library(RODBC)
con<-odbcConnect(”mydb”)
Now the example will find the sum of the population within polygons representing watersheds. This layer was originallybuilt using r.watershed in GRASS followed by r.to.vect. The automated watershed limitation doesn’t work very well along the coast, but provides a useful set if polygons for the rest of Chiapas, particularly the central valley. First add a new column to hold the result.
odbcQuery(con,”ALTER TABLE chiapas_basins ADD COLUMN poptot int4;”)
Then the query that calculates the total population can be run from R. The result can by looked at first in R to check that it is sensible. This is useful, as I found differences between using the 2000 census data and the 2005 population count that are not attributable to population change or migration, but rather are caused by errors in the geopositioning of the data.
sql<-”select sum(m.z1) as poptot, b.id from chiapas_conteo_2005 m, chiapas_basins b where m.the_geom && b.the_geom AND contains(b.the_geom, m.the_geom) group by b.id;”
d<-sqlQuery(con,sql)
This takes a while to run. I quickly discovered that it is extremely important to make sure that a GIST index is added before attempting point in polygon operations. This can be added to each table in PgAdmin, for example.
CREATE INDEX chiapas_basins_idx_geo on chiapas_basins USING GIST(the_geom GIST_GEOMETRY_OPS);
Now histograms and totals of the population in the watersheds can be produced in R to look at the results before adding them back into the data base.
For example
hist(log(d$poptot))
The population density is typically log normal.
sum(d$poptot)
4255957
This gives Chiapas a total population of 4.25 million which looks about right. As the query returned the results with a column name that has already been set up and has the primary key this line will update the database.
sqlUpdate(con,d,”chiapas_basins”)
Now we need to calculate the area of the watersheds in km2. To do this I first had to insert an srid for Albers conic projection for North America. This can be found here
http://www.spatialreference.org/ref/epsg/102008/
Then some more queries that can either be run in pgAdmin or wrapped up by odbcQuery(con, “some sql statements “) and run from R directly.
odbcQuery(con,”ALTER TABLE chiapas_basins ADD COLUMN areakm2 float4;”)
odbcQuery(con,”UPDATE chiapas_basins SET areakm2 = area(transform(the_geom,9102008))/1000000; “)
odbcQuery(con,”ALTER TABLE chiapas_basins ADD COLUMN density float4;”)
odbcQuery(con,”UPDATE chiapas_basins SET density = poptot/areakm2; “)
The results can then be visualised in QGIS. Whether this is easier than using ARCGIS certainly depends on what you are used to. I am still working out how to achieve fairly basic operations like this in PostGIS. However when I have achieved them once, it is then easy to repeat the operation for other examples, which is why I am posting fairly simple examples like this to the weblog for reference in case they can help others to overcome the intially steep learning curve with PostGIS. Although the geoprocessing wizard in ArcView is very simple to use and the point in polygon operation does seem to run much faster than the corresponding query in PostGIS, this particular operation would still require some additional work with the database tables if it were to be done using ArcView alone.
Deforestation and floods 4: PostGIS raster layers?
One element struck me when looking at the original data compiled by Dartmouth college on flood events. The data base contained a table headed “cause”. This was full of text such as “rain”, “heavy rain”, “torrential rain”. The orange points on the map above are the centres of recorded flood events in thals three years. Note VillaHermosa in Tabasco. The coloured layer is shaded by rainfall intensity (five bands ranging from below 600mm to 2500 mm +).
So, the headline would be that “scientists conclude that flooding is caused by ….. rain!”
This is not in fact as trivial as it seems. Devastating floods can occur in regions where heavy rain is uncommon, but there must be a clear correlation between flood frequency and the sheer amount of annual rainfall a region experiences. I am constantly surprised by how quickly the vegetation in Mexico changes from dry forest to exuberant tropical forest. The transition zone can be a matter of kilometers on the road north from Chiapas to Tabasco. The flooding in Villa Hermosa in 2007 was mainly the result of rainfall falling on the state of Tabasco itself (and a small part of the extreme north of Chiapas). During the dramatic events a very light drizzle was falling in San Cristóbal. The lowlands of Tabasco are almost completely deforested, but the mountains of northern Chiapas are covered in a mixture of secondary forest and cattle pastures. Rains in late October fall on already saturated soil profiles. Vegetation cover could only have played a marginal role in reducing the amount of water flowing out of the highland watersheds in this particular case.
This provided me with an excuse to experiment with ways to move raster layers into POSTGIS. The POSTGIS data base structure is clearly not designed for raster layers, but some quite good results can be obtained simply by moving a fairly coarse raster layer, such as an interpolated rainfall coverage, into POSTGIS as points. This is similar to a SpatialPixelsDataFrame in R. In fact the easy way to set up the coverage is to move the data from a sp object in R to POSTGIS by converting the coordinates into a geometry. When viewed at some scales an interference pattern results, but the coverage can be very useful because clicking the information tool on any point on the map results in information on the pattern of rainfall over twelve months, although the visualzation can use the total annual rainfall.
The first image used Udig. The lower image uses Qgis. Again although I have not put a clear legend in this very quick informal treatment, the rainfall ranges from 800mm per year (bright red)to 4000 mm (dark blue)
La niña and early rainfall
One of my first posts to this weblog mentioned the statistical correlations between low mid pacific sea surface temperatures and early rainfall in Chiapas.
http://duncanjg.wordpress.com/2008/02/06/checking-on-the-el-nino-effect-using-r/
Heavy rains often do not begin in the state until mid May. However this weekend we have experienced a major rain storm (12 April). Persistent rain fell throughout the night and early morning (14 April). Rerunning the code shows that la Niña effect is still strong.
Of course as any linkages are statistical in nature I do not expect my temporary success at long term weather forecasting to continue.
Floods and forest cover 1
Good decision making should be based on good scientific advice. It is clearly the job of the scientific community to support decision makers whenever they feel able to do so. Floods cause tremendous suffering, not only in terms of loss of life but in the immense disruption caused to lives, livelihoods and economies. Most residents of Southern Mexico have been affected in some way by floods in Tapachula (2005) and Villa Hermosa (2007). Fortunately these severe events caused comparatively few lives to be lost directly. However their consequences to the regional economy were extremely grave. In both cases politicians in the lower part of the watersheds cited deforestation as a major contributory factor.
Conservation of the native forests of Chiapas will not be achieved unless the full value of the services they provide is appreciated by society. Yet the linkages between flooding and deforestation are complex. When a study claims to provide new evidence regarding linkages between deforestation and flooding it must be considered with some care.
Comments on the scientific value of the paper I cited in the previous post should be based only on evidence. I therefore checked most of the websites the authors used to derive their information before analyzing the paper. An important source was the world atlas of flood events at Dartmouth College
http://www.dartmouth.edu/%7Efloods/archiveatlas/index.htm
The data found there can be mapped directly using their web site as shown below.
I also downloaded and reformatted the raw data and moved it into postgis. It can be used as a shapefile through the zipped file here floodevents(change-extension-to-zip).doc
The data is clearly of variable quality. This is par for the course. Prior to 2006 the geographical coordinates were not entered explicitly in this particular data base. All the original raw data from the Dartmouth data (that I downloaded in Excel format) including textual descriptions of the floods can be loaded strait into R using read.table by changing the extension of the following file to txt flood_events.doc. Here are some quick (Qgis) maps derived from the data showing loss of life from the floods that count with coordinates.
There are various points to note before I look at the statistical analysis used in the article in greater detail. Southern and Eastern Asia seems to be particularly flood prone. Most floods on this continent occur around coastlines, on the lowlying floodplains and in the Himalayan watersheds. Africa has comparatively few floods. Flooding in North and Central America tends to be associated with the effects of tropical depressions and hurricanes.
The patterns of flooding at this scale seem to be determined largely by global circulation patterns, but the consequences and probability of being reported will be determined by settlement patterns and economic activity. Floods are rarely reported from the Amazon basin, even though much of the low lying areas of forest (várzea) are seasonally flooded and these floods vary in severity tremendously between years. These sort of details interest forest ecologists trying to trace tree distribution patterns but never make news because they do not affect human lives.
PostGIS and R continued
Last week a colleague sent me an article published in Global Change Biology for comment. I have just got round to looking at it in detail. The abstract is available here (with full text if you have the institutional access.)
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2486.2007.01446.x
The authors claim have identified a correlation at the global scale between forest loss and flooding.
I have a lot of issues with the way the data has been analysed that I intend to discuss here. However one of the positive features of this article is that all the data that used is included as an appendix. This allows anyone to make their own mind up regarding the merits of the analysis.
I will make specific comments in a later post. But I first deal with different technical issue. Having setup a postgis database for Mexico and Central America I realised that I lacked a global country map to help visualise data of this sort. I tried importing the shapefiles that come with Arcview, but with no success. They apparently have errors in the topography. I also tried converting the worldhires data in the R mapdata package, but that also had issues with bad topography (apparently American Samoa was to blame). Finally I found this reply to a R help post from the prolifically helpful Roger Bivand
http://finzi.psych.upenn.edu/R/Rhelp02a/archive/78339.html
I tried
load(url(”http://spatial.nhh.no/R/etc/world_countries.rda”))
library(rgdal)
writeOGR(world_countries, “PG:dbname=mydb”, “world_countries”, “PostgreSQL”)
Bingo. In a couple of minutes I had a rather low resolution world countries map with ISO3166 two letter country codes ready for linkage to any data with a column that uses this international standard. It is now nicely tucked away in my postgis data base ready for use whenever needed. Very satisfying. Notice that I include this code as an example of how R connects to a postgis database once one has been set up. It obviously won’t work if you haven’t gone through the steps to establish your local data base. Here is how on Ubuntu. It is not something I would try under Windows,even though posgresql 8.2 apparently has now been ported to Windows.
I then had to reformat the data from the pdf copy of the article. This was slightly frustrating as it needed a bit of manual work in an open office spreadsheet. I then added the ISO3166 codes. This line (also suggested by Roger Bivand) helped a lot. It would be even better if editors and reviewers insisted on the use of standards when they are available.
d<-data.frame(d,ISO3166=world_countries
$ISO3166[match(toupper(as.character(d$Country)),as.character(world_countries$names))])
The nice thing about a postgis data base is that you can use it for storing all your data, spatial or otherwise, in one convenient place. So using RODBC connectivity the floods data can be incorporated by
library(RODBC)
con<-odbcConnect(”mydb”)
sqlSave(con,d,”floods”)
The only slightly fiddly aspect is that the saved table lacks a primary key. That can be resolved from R with few lines that run sql in postgresql.
odbcQuery(con,”ALTER TABLE floods ADD COLUMN index int4;”)
odbcQuery(con,”UPDATE floods SET index=cast(rownames AS integer);”)
odbcQuery(con,”ALTER TABLE floods ADD CONSTRAINT floods_pkey PRIMARY KEY(\”index\”);”)
odbcQuery(con,”ALTER TABLE floods DROP COLUMN rownames;”)
Then the tables can linked together as a temporary view from within R
sql<-”create view floods2 as select f.*, w.wkb_geometry from floods f,world_countries w where f.ISO3166=w.ISO3166″
odbcQuery(con,sql)
The view can be removed from the database whe it is finished with by
odbcQuery(con,”drop view floods2″)
With the view in place Qgis can be used for the visualisation instead of R, which is really helpful for casual analysis. This convenient interoperability, drawing on the particular strengths of a range of different tools within a common framework (statistical environment, database, GIS visualisation), is one of the great plus points of Open Source software from the point of view of a user. However you do have to discover how to do it first. You don’t get a “wizard” to help.
R does produce very good publication quality maps, but there is rather more overhead involved than when using the simple point and click graphical user interface of QGis I am now finding it much easier to send vector layers strait to postgis and work from there.
The view can also be exported as a shapefile strait from QGis. As the wordpress site only allows me to upload files it thinks are documents I have placed it here with a false doc extension. Don’t try to open it, save the file to disk and then change the extension to .zip. floods2thisisreallyazipfile.doc
Population density and deforestation
An element of spatial analysis that can be both fascinating and frustrating involves resolving the tension between the attraction of visual pattern matching and the formality of statistical analysis.
Traditional non spatial statistics in an observational setting relies heavily on correlation analysis. This alone can be uninformative, misleading and often confusing. Modern spatial statistics that attempt to go beyond the limitations of correlation can be obscure and difficult to communicate. At the same time maps can be difficult to summarize. Presenting data in the form of patterns that overlay each other in space (and sometimes time) is very difficult to achieve convincingly online and much harder to incorporate into a scientific format that still relies heavily on printed material.
Take the example of the relationship between population density and forest cover in Chiapas. My own experience in the region has led me to some simple, robust conclusions based on the evidence I see around me on a daily basis.
1. The highlands of Chiapas have an extremely high population density that can not be sustained through subsistence agriculture. Around San Cristobal this is above 100 people per km2. Yet a surprising amount of forest cover still remains.
2. The Central Depression has a much lower population density and an entirely agricultural economy. It is largely deforested. This is largely true despite the fact that satellite imagery can underestimate the amount of remaining forest.
3. The rural population of the highlands has continued to grow in recent years. However the amount of land dedicated to agriculture in the highlands has slightly decreased as has the value of agricultural production.
4. Recent deforestation in the highlands has disproportionately affected old growth mature forest. Secondary forest in the highlands has tended to expand.
5. The economy of the highlands of Chiapas is growing while the Central Depression is static.
6. Extreme poverty is still more prevalent in the highlands of Chiapas, while chronic rural deprivation is a feature of the central depression.
These observations have been the subject of academic discussion and are not always accepted in the way I have stated them. In particular the empirical relationship between population density and deforestation is often still at the centre of the controversy.
This may be due to trivialization of the obviously non linear (in the statistical sense) pattern I tried to show in the animated gif above (move back up if you didn’t realize it was an animation, I left a longish pause for thought between each frame). Agricultural “frontiers” with no resident population are trivially forested.The agricultural frontier regions also (trivially) show the highest rates of deforestation (deforestation is not shown explicit here, but see some previous posts). However the extremely complex dynamics of forest loss, regrowth and consequent structural and compositional change as a result of the management of a culturally determined landscape are often misrepresented and misunderstood. Attempting to linearize this sort of pattern though correlation analysis is never going to provide insight.
The animation above is based on the 2000 census data which was imported into a POSTGIS database and then visualized using Qgis. The polygons are watersheds, produced using r.watershed in GRASS, translated into a vector layer and also exported to POSTGIS. The populations within the watersheds can then be calculated using a simple spatial query in POSTGIS. I will provide more technical details on these steps in a subsequent post. The size of the points for the population centres is not proportional to population size as such but was chosen for visual clarity.
Potential species distributions 1
A constant frustration when working in tropical regions is the shortage of really high quality data. The quantitative knowledge that applied conservation needs in order to prevent the extinction of the world’s endangered species of plants and animals is still surprisingly hard to come by. At best the available hard data is fragmented, of variable quality, difficult to organize and thus hard to analyze in a formal manner. At worst it is either missing or positively misleading
Information used for conservation planning also forms the central subject matter of the scientific discipline of ecology. In other words, it concerns the abundance and distribution of organisms. Our ignorance of this most basic characteristic of the planet we live on is remarkable. Astronomers (arguably) provide estimates of the number of stars in our galaxy with narrower proportional margins of error than we have for the number of species in a Mexican forest. For example, Wikipedia cites three sources in stating that of 2006, the Milky Way is thought to be comprised of between 200 to 400 billion stars. A defensible estimate of the number of tree species in Chiapas could still range from 500 to 2500 depending on how data of dubious quality is interpreted. This is not good enough for such highly visible and important elements of ecological communities. The hard data is just not there to improve on this.
Although in recent years considerable advances have been made in systematics (Linnean shortfalls), these so called “Wallacean” shortfall remains. The large scale, systematic, planned, coordinated field work needed to redress our ignorance is simply not being funded.
In recent years I have been working with techniques for producing automated distribution maps of tropical tree species using R. Even using comparatively good data for areas that we know at first hand, it is a remarkably difficult task. Modelling species distributions can involve more pragmatism than theoretical insight. At a regional scale the task becomes yet more challenging.
This is not because R is too limited in the number of useful tools it provides. A very broad range of models are available. They range from rule based classification and regression trees through AI “black boxes” such as neural networks, to generalised additive models and simple logistic regression. All can be fit within R in a common framework. The models can be used to produce predicted distributions based on either presence-absence data or known occurrences alone (with the addition of “pseudo absences”). It is also possible to automate linkages between R and other popular software such as maxent or GARP. Increasingly accurate layers for many key predictor variables are now available.
However at least fifty well distributed occurrence points are needed to provide credibly well validated distribution maps for a single species using a fully automated procedure. If fewer data points are available some sort of input from “expert judgement” is inevitably required in order to evaluate which of the potentially infinite outputs from species distribution models are most credible. Well documented, repeatable automation of the fitting process using R is very useful in this process, as many maps can be produced in a short space of time based on different assumptions. However resorting to visual pattern matching is frustratingly informal and breaks a standard rule of data analysis, i.e. ensuring perfectly reproducible results.
The code example below is one possible very simple implementation of GAMs with pseudo absences. It is is applied here to modelling the potential distribution of tree species using data points provided mainly from MOBOT. The code (deliberately in this case) does not take into account spatial trends or autocorrelation, thus potential matching climates will often be suggested outside a “known” species range. This can be corrected by adding in coordinates as predictors to the model, but at the cost of potential loss of insight into the distribution of species that may not yet have been collected from their entire range. Any modelling technique has to find a balance between too many false positives and too many false negatives. A common technique is to look at the ROC curve, but this is not particularly useful if the number of known occurrences is very low and restricted to a small part of the suspected range.
I have come to the conclusion that we should not be overly defensive regarding the failings of automated species distribution mapping algorithms when these are inevitably attributable to the poor quality of the underlying data. There is an unavoidable “garbage in, garbage out” syndrome that is difficult to avoid without extremely time consuming data cleaning. This is best undertaken at the point of origin, i.e. in the herbaria and museums where the data is collated.
The models can however provide some useful heuristic input to the evaluation of the potential range of a species. In my opinion, any assessment of the area actually occupied should not be attempted using regional scale models alone. Even obvious techniques such as using forest cover maps as masks to remove non-forested pixels will have mixed results. Some tree and shrub species are still common in areas classified as pasture at a regional scale. Urban areas also can have many trees. Unpublished studies have suggested that tree diversity is higher in urban Managua than in surrounding agricultural areas. Other species have very specific requirements for non mappable habitat characteristics.
These concerns led our group to the conclusion that given the current data the best that can be achieved at a regional scale is to map the distribution of “climatically associated species pools” (CASPS). The “CASP” approach suffers the serious weakness of rather devaluing individualistic species responses, but may be the best that can be achieved until new initiatives begin to provide contemporary, accurately georeferenced and correctly determined occurrence (and absence) data from across the tropics. Data sharing and pooling between currently active research groups will be a vital first step in this process.
The output from the R code above is overlain on the 3d Blue Marble image from code in the previous posts to give a visual impression at a very broad regional level. Green points are the input to the model (recorded occurrences) and red points are grid squares with similar combinations of regionally important climatic variables mapped at an approximately 5 km x 5 km scale. Further details of this and similar procedures including mapping of species pools are available in this document.mnp-workshop3.pdf
This model could provide some heuristic input to expert evaluation of the potential distribution and threats to some species currently being assessed for IUCN red list status. 450 maps produced through an automated procedure are included as low resolution jpg files within pdf archives in alphabetical order here.
Code 9
During the holiday Mickey and I have been watching a program called brainbox challenge on the BBC. Neither of us can keep up with any of the competitors, although Mickey sometimes beats them to it on the memory games. One of the games is called code 9. The rules are (fairly) simple. The competitors are given a mathematical operation to perform on two numbers. However each digit of the numbers has first to be coded by subtracting it from 9. Thus 1 becomes 8; 12, becomes 87 etc. After completing the operation the competitors have to apply the code again to the answer. So 1 +1 becomes 8+8=16, which is then recoded to give the answer 83. As I just can’t get my brain to do it unaided I thought of writing a function in R to help.
This line carries out the operation on whole positive numbers.
code9<-function(x)as.numeric(paste(unlist(lapply(strsplit(as.character(x),”"),
function(x)9-as.numeric(x))),collapse=”"))
so
code9(code9(10)+code9(10))
[1] 821
This works as a one liner, but there should be a better way requiring fewer steps within the function using a regular expression.
3d Belize
At the request of Bruce Miller in Belize I have added a 3d R object for that part of the world. The Modis Blue marble coverage is a bit coarse for this small country and the relief requires more vertical exaggeration, but quite nice results can still be obtained with a Landsat image, despite the cloud cover. The object can be loaded and visualized with the script below. The R object that is loaded is 3MB in size so Bruce may have a very long wait given the glacial speed of the Belizean internet.
This also gives now me a excuse to add a link to BERDS, the inspiring biodiversity database for Belize set up by Jan Meerman.
Blue marble and Landsat in R 3.
A final note on this theme. To add points to a 3d visualisation you need to know z values in order to place them on the surface. The way x,y,z are referred to by rgl is not intuitive and it took me some time to get this to work. The example uses a set of coordinates for the botanical collections and inventories of tree species that are compiled into a database. We are using these data for modelling regional tree species distributions and their potential response to climate change. A major source of this information has been the wonderful online resource provided by Missouri Botanical Gardens MOBOT, but data has also been provided from other sources including Conabio. Here I only provide the raw coordinates of collections, without species names or further details as some of the data is restricted. The interesting point to note is the spatial distribution. Notice that these points are for all trees, thus any one species will form an extremely small subset. Our knowledge of the real distribution of tree species in Mexico is thus clearly still very dependent on expert knowledge and uncompiled data from little explored areas.
The code is here. You should load rgl and sp before running it. addpoints3d.doc
load(url(”http://duncanjg.files.wordpress.com/2008/03/colpoints.doc”))
load(url(”http://duncanjg.files.wordpress.com/2008/03/bmmexicoapril.doc”))
coordinates(a)<-~x+y
a<-overlay(d,a)
a<-as.data.frame(a)
view.3d.rgb<-function(d,dem=4,red=1,blue=2,green=3,exag=1/30000){
fullgrid(d)<-TRUE
y<-d[[dem]]*exag
y[is.na(y)]<-0
dim(y)<-d@grid@cells.dim
x<-seq(min(d@coords[,1]),max(d@coords[,1]),length=d@grid@cells.dim[1])
z<-seq(min(d@coords[,2]),max(d@coords[,2]),length=d@grid@cells.dim[2])
colvec<- rgb(d[[red]],d[[green]],d[[blue]],m=255)
rgl.clear()
rgl.surface(x, z, y, color=colvec)
}
add.points.3d<-function(d=d,x,y,elev,exag=1/30000,…){
rgl.points(x=x,z=max(d@coords[,2])-y+min(d@coords[,2]), y=elev*exag,…)}
view.3d.rgb(d)
add.points.3d(d,a$x,a$y,a$alt,col=2,size=2)
Blue marble and Landsat in R 2.
I have just been asked if the images in the previous post could be produced simply in two dimensions so that points could be easily over plotted. The code to do this can be taken strait from the rimage package and very slightly adapted for a “SpatialGridDataFrame”. I haven’t worked out how it could be done using lattice.
spimage.doc
require(sp)
load(url(”http://duncanjg.files.wordpress.com/2008/03/bmmexicoapril.doc”))
sp.image.rgb<-function(d,red=1,green=3,blue=2,m=255){
colvec <- rgb(d[[red]],d[[green]],d[[blue]],m=255)
colors <- unique(colvec)
d[["colmat"]] <- match(colvec, colors)
image(d,”colmat”,col=colors)
}
sp.image.rgb(d)
Blue marble and Landsat in R
R is of course a statistical environment, so it really shouldn’t be expected to be doing the job of GRASS (Nviz), Google Earth, WorldWind, Qgis or Udig as well. However some nice visualizations can be done with rgl, including rgb composites from satellite images. They can even be more useful for quick regional overviews than the alternatives mentioned above.
The code to produce these images be found here bluemarble3d.doc. The first line will download the imported spatial object from this site. R does need a lot of memory for images of this size, so I guess you need a minimum of 1GB RAM (I am using a Toshiba Tecra laptop with Nvidia running Ubuntu Feisty 7.04 with 2GB). Given that, you should get very own zoomable 3d image of Mexico and Central America to play with in the same time it takes the screenshot below to run. It shows the steps in real time. (For R beginners, just open the file, paste all the code into the R console and wait a little for the download). The resolution is quite coarse (2 minute) so this is only suitable for a regional overview. You will need the rgl and sp packages installed first of course.
Exactly he same can be done with Landsat imagery. landsat.doc
If you want to get your own coverages for this for your own region the starting point is to run R from GRASS in a lat-long location. For example
g.region -p
projection: 3 (Latitude-Longitude)
datum: wgs84
ellipsoid: a=6378137 es=0.006694379990141317
north: 32:51:47.86586N
south: 5:28:57.975592N
west: 118:30:20.609692W
east: 76:26:43.294846W
nsres: 0:02:00.060768
ewres: 0:01:59.982024
rows: 821
cols: 1262
cells: 1036102
Then get some of the Blue Marble (Modis) imagery that can be downloaded from the NASA wms strait into GRASS using r.in.onearth. From within R you can use system to run GRASS. This does work with GRASS under Cygwin in Windows with the shell command.
system(”r.in.onearth -b file=/tmp/test month=Apr time=2005-3-24 ‘wgetopt=-c -t 5 –user-agent=MSIE5.5′ “)
The layers can then be imported into R with spgrass6.
library(spgrass6)
d<-readRAST6(c(”BMNG_Apr.red”,”BMNG_Apr.blue”,”BMNG_Apr.green”,”alt”))
Notice that this line imported a digital elevation model of my own as well, as one is of course needed for the 3d terrain effect.
Disturbance and the Lotka Volterra competition model
One of the works that provided constant inspiration both as an undergraduate and graduate student was Huston’s huge 1994 monograph. Despite some inevitable repetition in a book of this length I have continued to find new ideas in it over the years. The book contained a very simple demonstration of the intermediate disturbance hypothesis using the Lotka Volterra competition equations that I continue to use in classes. It can be implemented very concisely in R. The idea is simple. Reset the classic competition equations at periodic intervals and see how it affects the balance between species with r or K strategies. A very simple model with important implications.
Source here …lotka-volterra.doc
R<-c(0.1,0.12,0.15,0.2,0.3)
K<-c(1000,800,600,200,100)
Run<-function(runtime=100,freq=101,intensity=0.1){
N<-numeric(runtime*5)
N<-matrix(N,nrow=runtime,ncol=5)
N[1,]<-10
for (i in 2:runtime){
Totpop<-sum(N[i-1,])
for (j in 1:5){
N[i,j]<-N[i-1,j]+N[i-1,j]*R[j]*(1-Totpop/K[j])
}
if(!i%%freq){
N[i,]<-N[i,]*intensity
}
}
res<-data.frame(N)
res
}
#####################################
par(mfcol=c(2,2))
res<-Run()
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Sin perturbacion”)
res<-Run(freq=60)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion infrequente”)
res<-Run(freq=25)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion intermedio”)
res<-Run(freq=12)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion frequente”)
Reference
Huston, M.A. 1994. Biological Diversity: The Coexistence of Species on Changing Landscapes. Cambridge University Press, 708 pp.
Fuelwood model
This morning I designed a simple simulation model in R with a small tcltk interface based on the concepts in the previous two posts. The aim is to allow a prediction of the probable future yield of fuelwood from plots with uncertain planting or survival densities, uncertain diameter distributions and uncertain parameters regarding diameter-height relationships and wood densities.
The model can be run by sourcing the file below into R (install tcltk first). Again this is work in progress and may change as colleagues provide improvement through feedback and new data. The aim is to provide a useful tool for prediction in a highly uncertain environment through giving a probability that a fuelwood or carbon sequestration project’s aims could be met based on known or assumed planting densities and growth rates.
Permanent forest plots 2.
After thinking a little more about the problem of the allometric equation in the paper that I discussed yesterday, I realised I had a partial solution already to hand. (Again please read this as a marker for an idea that could be developed further. This is one of the motivations for using an informal web log instead of leaving ideas as notes on my own hard disk.)
In order to work with a fairly simple individual based forest model I have implemented many different allometric equations in R over time. My favourites for use in data poor areas are still those derived originally by Botkin and Shugart from way back in the late seventies (see for example Shugart 1984). I have used them in both my thesis and a later implementation of the same model. Results from this are included in a recent book chapter. I aim to include the entire R code for this model on this site some point soon (after tidying up some elements and documenting).
The nice thing about the parameterisations used in these early models is that they were explicitly designed to be run with a minimal amount of knowledge. Superficially that doesn’t sound too encouraging. Shouldn’t we try to improve a model by adding more explicit detail? Not necessarily. After all purely empirical allometric equations assume no knowledge at all, apart from that provided through the data. That is their strength and weakness. In order to be used correctly you have to build them afresh for each area being studied. However if equations can be constrained by some very easily obtained field observations a sensible adjustment can be made for the conditions of each plot without going through the costly process of measuring everything afresh. The results certainly won’t be precise, but that is the point. Uncertainty can be included in the calculations through Monte Carlo simulation. The true value should then fall somewhere in the predicted range.
Botkin’s equations can turn measured diameter at breast height into estimated height and therefore estimated tree volume for any species based simply on an estimate of the maximum diameter the species can obtain and its maximum diameter. It is fundamentally a three parameter quadratic model constrained to take realistic values. As maximum height is quite site specific and can be estimated through rapidly made observations the equations are potentially very useful in the tropics where actual measurements of tree height are hard to obtain, or in cases of models of newly established forests when final height may have to be estimated from some knowledge of site quality. Notice that the equation discussed yesterday must be based implicitly on some sort of rule regarding tree height and form factor. You simply can’t get around this. The equation is a statistical black box that just takes in diameter and churns out above ground biomass. The fact that the equation doesn’t explicitly state that height has been estimated doesn’t make it robust. You simply can’t turn diameter into volume and thus biomass without making some sort of assumption regarding the height and shape of a tree, even if it is hidden within an impenetrable calculation. This is a very common failing of empirical equations and a good reason to generally mistrust their uncritical application (This is often the result of a misplaced appeal to authority, an empirical equation has been published by an important research group so it is eminently citable. Yes fine and good, but the equation still only applies to their particular data set.)
So here are the set of simple equations Botkin suggested. I have prefixed each function with f. to avoid confusing a function placed in the global environment in R with a variable.
(All the code shown below can be downloaded in this script … permanentplots2.doc)
f.B2<-function(Hmax,Dmax)2 * ((Hmax - 137) / Dmax)
f.B3<-function(Hmax,Dmax) (Hmax - 137) / Dmax ^ 2
f.BA<-function(Diameter)pi * (Diameter / 2) ^ 2
f.Height<-function(Diameter,Hmax,Dmax)137 + f.B2(Hmax,Dmax) * Diameter - f.B3(Hmax,Dmax) * Diameter ^ 2
f.AGB<-function(D,Hmax=2000,Dmax=100,FormFactor=0.8,r=0.5)r*FormFactor * f.BA(D) * f.Height(D,Hmax,Dmax)/1000
Let’s see how the equation for height works in practice. Assume maximum diameter is 120 cm, but forests (or species) vary in the maximum height they can reach. Note that as this is a quadratic (i.e. parabola) maximum diameter must coincide with the maximum in the data or the curve turns downwards.
D<-10:120
plot(D,f.Height(D,Dmax=120,Hmax=3000)/100,type=”l”,lwd=2,col=”green”,ylab=”Height (m)”,xlab=”Diameter”,ylim=c(0,30))
lines(D,f.Height(D,Dmax=120,Hmax=2000)/100,col=”blue”,lwd=2)
lines(D,f.Height(D,Dmax=120,Hmax=1000)/100,col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”High (30m)”,”Medium (20m)”,”Low (10m)”),col=c(”green”,”blue”,”red”))
Now let’s compare with the formulas used in the paper and included in the previous post and multiply appropriately by basal area and form to produce volume.
plot(D,wet(0.5,D=D),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass (kg)”,xlab=”Diameter”)
lines(D,moist(0.5,D=D),col=”blue”,lwd=2)
lines(D,dry(0.5,D=D),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))
points(D,f.AGB(D=D,Hmax=3500,Dmax=120),col=”blue”)
points(D,f.AGB(D=D,Hmax=2000,Dmax=120),col=”green”)
points(D,f.AGB(D=D,Hmax=1500,Dmax=120),col=”red”)
legend(”bottomright”,pch=1,legend=c(”Hmax=20m”,”Hmax=35m”,”Hmax=15m”),
col=c(”green”,”blue”,”red”))
So the curves are very comparable. The second set can be produced by making a single reasonable assumption regarding maximum height. In fact the equation for above ground biomass has four parameters. Maximum height, maximum diameter, “form factor” and wood density. The form factor is a term used in forestry and here it might best be expressed as “branchiness”. Due to the way tree growth occurs (the “pipestem”model) total above ground volume can never be expected to exceed the volume of a cylinder. Foresters are only interested in the trunk, but a factor ranging between 0.6 and 1 can be applied quite reasonably to biomass as a whole. Here I have set the value at 0.8. This reproduces quite well the original allometric equations. We can add in uncertainty regarding this parameter in the analysis.
This is now the point at which the equations become useful. If a lot of knowledge is available, then separate parameters can be assigned for each species. Maximum height is quite easy to get.Simply by making this justifiable assumption accuracy should improve markedly over the “one size fits all” assumption that was implicit in the paper under discussion. A nice element is that the quadratic is constrained so the meaningless estimates that can sometimes arise when using polynomials outside their range of applicability will never occur. With less information functional groupings can be used. In the worst case scenario the same equation can be used for all trees in the forest, as was apparently done with the empirical allometric equation in the paper (with the exception of including a species specific term for wood density).
Using R it is very easy to include uncertainty regarding these parameters. As has been pointed out in the previous post, this is likely to affect the conclusions drawn much more than within sample variability. The point is that bootstrapping on parameters of a polynomial allometric equation with no real life interpretation would be very dangerous. The curve could go in many unexpected directions. In contrast bootstrapping on an equation with direct interpretation is reasonable. Lets use a boxplot as a quick way of looking at the results. Lets assume a great deal of uncertainty in Hmax (20m with a sd of 3m) and some uncertainty in the form factor (0.8, sd=0.1). It is easy to extend the uncertainty to all four parameters.
boot.strap<-replicate(10000,f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1)) )
boot.strap<-data.frame(rep(10:120,times=10000),as.vector(boot.strap))
names(boot.strap)<-c(”D”,”V”)
boxplot(V~D,data=boot.strap,range=0.1,outline=F,col=2)
So we can see how uncertainty in total above ground biomass increases with the diameter of the tree. Finally lets see how this could be applied to the simulated data. If we just want an uncertain estimate of total above ground biomass by applying the bootstrapped conversion formula to a single large plot we can do it like this.
D<-rlnorm(100000,2.2,0.8 )
D<-D[D>10]
D<-D[D<120]
boot.strap<-replicate(1000,sum(f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1))) )
quantile(boot.strap/1000000,c(0.05,0.5,0.95))
5% 50% 95%
6.97 9.75 13.04
plot(density(boot.strap/1000000,bw=0.7),xlab=”Total above ground biomass kg x 10^6″)
abline(v=quantile(boot.strap/1000000,c(0.05,0.5,0.95)),col=c(”blue”,”black”,”red”))
The extension of the technique to be used on the difference between two measurement periods is trivial if the data is available. More accuracy is obtained by narrowing the range of parameters by using a separate parameter for each species, so there is clearly a cost of ignorance at work here.. This can even be quantified (cf with this post) The method can clearly also be applied to estimating future biomass yield in the case of plots being planted for fuelwood or biofuels, based on making some very simple but justifiable assumptions regarding growth form, wood density and planting density. Because a measure of uncertainty is produced this is very useful in decision making.
Permanent forest plots
On friday my colleague Mario Gonzalez brought an interesting article to my attention that has been published on line in the Open Access biology journal PLoS Biology. I am a fan of this journal myself and I have tried to place a feed to the table of contents on this site (WordPress apparently doesn’t pick it up). I was particularly moved by the editors’ personal reasons for supporting Open Access.
The article in question is available by clicking here.
I was intrigued by the use of the words “Assessing evidence” in the title. The monitoring of large permanent plots has become a staple source of evidence for tropical forest ecologists. In recent years a great deal of progress has been made in understanding the processes at work in tropical forests, largely as a result of heroic efforts in a few well studied plots. Permanent plot monitoring involves a huge investment in time and labour. It produces a great deal of replication at the level of individual trees. However the amount of replication at the landscape scale is clearly limited. This leads to inevitable challenges for both statistical analysis and for the interpretation of results. I was very encouraged by the honesty and transparency in the article’s method section regarding problems associated with data quality. For example the authors explain that “For the plots with more than two censuses, we were able to correct anomalous dbh values by comparing the stem dbh growth rates across census intervals. If a tree showed a dramatic change in dbh growth rate, we changed the one outside of the range (−5 mm/y, +45 mm/y) with the likely value, and updated the dbh value accordingly. This filter was applied using a computer routine, and then checked manually.”
This sort of analysis involves a lot of work after the data has been collected. It also leads to some necessary but rather arbitrary decisions being taken. Trees generally really shouldn’t shrink over time, but anyone who has worked with this sort of data in the tropics has confronted the problem. One element that might help with this in the future is the wider use of PDAs in the field to enable real time sanity checking of data as it is recorded. Errors are often made in the data capture process and not detected until field work is completed. The level of training of technicians also tends to improve over time. So the next decade’s results should be more consistent than the last.
However this is not th element that most concerned me. Despite the interesting title I was left rather unsure whether the authors really had managed to evaluate all the evidence the data set provided. The problem in my mind arose from a common assumption that uncertainty within a statistical analysis is all attributable to variability in the data. In fact any assumption used in any calculation will change the results.. It is now becoming common practice to attempt to quantify the sensitivity of conclusions to assumptions, but this doesn’t seem to have been done here. Instead,even though a complete census was undertaken, statistical significance was based on within sample variability using an arbitrary division into sub-plots and the assumption of independence between them. It is not clear how useful this is.
In fact the main conclusion from the work would seem to be heavily reliant on the use of an allometric equation to convert diameter (usually measured at breast height) into an estimate of total oven-dry aboveground biomass, the fundamental response variable of interest.
Let’s look at how sensitive conclusions might be to this. There is a mistake in the parentheses of the functions as printed in the paper, but as I understand the equations they can be implemented in R using the following code. I have repeated the function with different name and parametrisation for each of the forest types. The important point is that, at least on my initial reading of the analysis, the authors’ applied the same function to all trees in each plot assuming the equation for a forest type applied to the whole plot.
(The code shown below is available in this text file. Use it if there are problems with quotation marks .. permanentplots1.doc )
dry<-function(r=0.5,D,a1=0.667,a2=1.784,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}
moist<-function(r=0.5,D,a1=1.499,a2=2.148,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}
wet<-function(r=0.5,D,a1=1.239,a2=1.980,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}
x<-1:120
plot(x,wet(0.5,D=x),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass”,xlab=”Diameter”)
lines(x,moist(0.5,D=x),col=”blue”,lwd=2)
lines(x,dry(0.5,D=x),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))
A point to notice is that the equations suggest that trees in moist forest have higher biomass than those in dry or wet forest. Seven of the ten plots in the paper are considered “moist” but these must apparently lie somewhere on a continuum between dry and wet. There seems to be no clear justification for using the same optimum equation for all of these plots. I am not quite sure of the units as the paper says this produces above ground biomass in Mg (tons). The units look as if they should be Kg to me.
Also these rather complex looking formula basically reproduce the same curve that could be provided by simply fitting a quadratic function with three parameters over the likely range for the data (10:120 cm lets say). Two parameters do not change.
The curves diverge only diverge sharply for larger trees and it may be argued that really large trees are rare, so this shouldn’t matter too much. In all undisturbed tropical forests that I know of there tend to be a large number of small trees and a very few large trees. Thus a simple simulated data set can be produced by slightly truncating a log normal distribution.
set.seed(1)
D1<-rlnorm(100000,2.2,0.8 )
D1<-D1[D1>10]
D1<-D1[D1<130]
hist(D1,main=”Simulated diameter distribution”,xlab=”DBH (cm)”,col=”lightgrey”)
With time, patience, knowledge of the plots and perhaps a suitable individual based simulation model to hand it would be possible to simulate some reasonable growth and mortality data for each tree over time. For the moment I will simply assume (quite wrongly of course) that all trees increase their diameters equally by 1cm. I will also assume no mortality at all, so an increment in biomass is assured. Not at all realistic, but the question is, how might the increment in biomass in a plot affected by the assumption made regarding the fixed paramaterisation of the allometric equation.
Let’s compare the assumption that the trees follows the dry forest formula against the assumption of the moist forest formula. The code divides the total number of trees into 200 random subplots and carries out the sort of bootstrap resampling that was used to provide confidence intervals. The test of significance used in the paper relied on whether 95% of the bootstrapped values where above zero. In this simulated case, without mortality, this is not interesting. The question is how important is the assumption made for the allometric equation.
nsubplots<-200
subplot<-as.factor(rep(1:nsubplots,each=length(D1)/nsubplots) )
D1<-D1[1:length(subplot)]
D2<-D1+1
d<-data.frame(subplot,D1,D2,moist=moist(D=D2)-moist(D=D1),dry=dry(D=D2)-dry(D=D1))
a<-with(d,tapply(moist,subplot,sum))
boot.moist<-replicate(1000,mean(a[sample(1:nsubplots,nsubplots,replace=T)]))
b<-with(d,tapply(dry,subplot,sum))
boot.dry<-replicate(1000,mean(b[sample(1:nsubplots,nsubplots,replace=T)]))
d2<-data.frame(category=rep(c(”moist”,”dry”),each=1000),bootstrap=c(boot.moist,boot.dry))
bwplot(bootstrap~category,data=d2,breaks=1000)
So in this simulated data, the confidence intervals attributable to the bootstrapping procedure are overwhelmed completely by the effect of changing the assumptions in the allometric equation. This was simulated data and the effect may be much less in real data. But it should be taken into account in some sense. One way of dealing with it is to bootstrap on the parameters used in the entire calculation as was done here.
Another interesting point is that although PLOs Biology is laudably open access, the raw data itself is not made available. Thus I had to simulate data to try to make a point. The future of open access research should (IMHO) be directed towards open, transparent and documented data analysis. R can play an important role in this as any analysis can be freely shared as a script. We always have to make assumptions when analysing data and some assumptions are a matter of personal preference or instinct. A future model for meta-analysis might be based on an “analysis of analyses”. The more ways a data set is looked at the more likely it is that its fundamental content will be revealed.
Aggregating time series in R: The Iraq body count
Yesterday I was asked whether R could be used to analyse time series. The answer is of course it can. R is used extensively in the financial sector for analysing complex time series such as stock prices. I have already included an example using R in the context of climate variability (El Niño). One challenge is that there are a lot of different ways of working with time series and representing dates. Aggregation can be rather tricky. The zoo package is one of the most powerful tools for working with time series, but it is not always simple to use. I still haven’t got my head around all the different ways to achieve results.
So here is an example. The code first reads in the data from an online database, then uses tapply to sum the number of casualties per day. Zoo is then used to aggregate by month and the total is plotted. If anyone reading this can suggest better ways to do this or add a more sophisticated analysis I would like to know.
(Open this document if the code doesn’t work due to problems with quotation marks bodycount.doc)
Or try this …
source(url(”http://duncanjg.files.wordpress.com/2008/03/bodycount.doc”)) ,
again you might need to retype the quotation marks)
library(zoo)
library(date)
a<-”http://www.iraqbodycount.org/database/download/ibc-incidents”
d<-read.csv(url(a),skip=11,header=T)
a<-tapply(d$Reported.Minimum,d$Start.Date,sum)
x <- zoo(as.vector(a), as.POSIXct(as.date(row.names(a))))
f<-function(x)as.POSIXct(as.yearmon(x))
Deaths<-aggregate(x,f, sum)
par(bg=”lightgrey”)
plot(Deaths,lwd=3,col=2)
However you get there, the result is shocking. These are documented civilian casualties and I chose the lower estimate.
Even if the trend was initially downwards after the start of the surge, it still only bottomed out at around the level it began at when the “mopping up” operations were taking place in the first few months after the invasion. The time series taken from the compiled online data base stops in January 2008, so it doesn’t take into account the recent renewal of violence. Today, Thursday 6 March there were 86 civilian dead. Two bomb attacks killed 68 in Baghdad alone. Apparently this is not an isolated incident. The trend is sadly upwards again.
Joseph Stiglitz has estimated the price of the war at 3 trillion dollars. I wasn’t even sure what a trillion was until he used the figure. It has twelve zeros, in other words a thousand billion. That works out at 30 million dollars for each dead Iraqi civilian or enough to make ever one of these people as rich as Bill Gates. No further comment, apart from a heartfelt request to visit the site of those who have worked so hard to compile this important data set and re-run the code periodically to check the updated figures.
As an addition, I was saddened to hear that Harvard professor Samantha Power has resigned from Obama’s campaign team this week, apparently for speaking too openly on this, among other, issues. I was extremely impressed by her thoughtful, yet emotional, contribution to BBC radio’s start the week in which she talked of the biography she has written on Sergio Vieira de Mello, a heroic figure who constantly impressed me in every interview I heard with him up to his tragic death in Iraq. The recent news suggests that honest, open minded expression by academics is still considered to be a liability, even for apparently honest, open minded, politicians. This is a link to the Hard Talk interview.
A simple illustration of an ecological lottery model in R
A colleague contacted me yesterday with some interesting modelling work based on the neutral community model of Stephen Hubbell.
Hubbell’s work fascinated me when it was first published. I have always assumed that it is easily misunderstood. If Hubbell is taken to imply that all species are equal, then the work appears to be eroding the basic subject matter of ecology. Ecologists tend to look for informative differences between species. However this is not exactly what the theory is based on. It is rather more subtle. The theory concerns the equality of individuals (in this sense it almost has political implications). Hubbell writes.
“The theory treats organisms in a community as essentially identical in their per capita probabilities of giving birth, dying, migrating, and speciating. This neutrality is defined at the individual level, not the species level. All that is required is that all individuals of every species obey exactly the same rules of ecological engagement. So, for example, if all individuals and species enjoy a frequency-dependent advantage in per capita birth rate when rare, and if this per capita advantage is exactly the same for each and every individual of a species of equivalent abundance then such a theory would qualify as a bona fide neutral theory by the present definition.”
I have illustrated this idea for students with a very small simulation in R . R code, like matlab or octave, is very terse and efficient so a couple of lines can do a fair amount of work. Lets assume a very simple reproductive rule applies to all individuals. The rule is that all have the same discrete generation time, in other words they all survive one time step of a simulation. At the end of the time step each individual produces two offspring. The lottery then applies. The offspring are randomly mixed and placed back on the finite space occupied by the previous generation. So half find a home and half “die”. The process is then repeated. This was exactly how I first implemented the model. I then re-implemented the idea in fewer lines by making the probability of selection depend on the proportion of individuals in each species, which amounts to the same thing.
This is all the code needed to run the model. (Try this if the quotation marks are not right lottery.doc)
nspecies<-8
gridsize<-20
time<-100
RUN<-function(){
X<<-matrix(0,nspecies,time)
palette(rainbow(nspecies))
mat<-sample(1:nspecies,gridsize*gridsize,replace=T)
X[,1]<-table(mat)
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
for (i in 2:time){
a<-X[,i-1]
mat<-sample(1:nspecies,gridsize*gridsize,prob=a,replace=T)
X[,i]<-table(factor(mat,levels=1:nspecies))
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
gc()
}
matplot(t(X),type=”l”,lwd=2,col=rainbow(nspecies))
}
#However to get a little tcltk interface add this
############################################
library(tcltk)
Run<-function(…){
nspecies<<-as.numeric(tclvalue(”nspecies”))
gridsize<<-as.numeric(tclvalue(”gridsize”))
time<<-as.numeric(tclvalue(”time”))
RUN()}
tt <-tktoplevel()
d1<-tklabel(tt,text=”Numero de species”)
d2<-tklabel(tt,text=”Tamaño del grid”)
d3<-tklabel(tt,text=”Tiempo”)
s1 <- tkscale(tt,from=0, to=20, variable=”nspecies”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s2 <- tkscale(tt, from=2, to=100, variable=”gridsize”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s3 <- tkscale(tt, from=10, to=500, variable=”time”,
showvalue=TRUE, resolution=10, orient=”horiz”)
but <- tkbutton(tt, text=”Run”,command=Run)
tkgrid(d1,s1)
tkgrid(d2,s2)
tkgrid(d3,s3)
tkgrid(but)
The model is to be played with (just download R from CRAN and paste all the code into the console), rather than watched but here are some example runs on You Tube,
The moral of the story is that under a neutral model some species start a random walk to extinction if co-existing in a small space with other species. However we cannot tell which will dominate and which become extinct from their characteristics, because all individuals, regardless of species are governed by exactly the same rules. The fate of species is determined by the cumulative luck or misfortune of the individuals that have been given the label. The larger the grid the more species that can be supported for a longer time, but extinction risk is always present if a long losing streak hits. The element of Hubbell’s theory that this model does not reproduce well is the log-normal (-ish) relative abundance distribution of the surviving species, although come to think of it I have not tried running the model with a really large number of species and a really large grid. R runs out of colours for the illustration, but this is not important.
Many interesting questions remain. Could a simulation model based on this sort of idea be used to predict extinction over a real landscape? Models of this sort seem far too abstract to provide real insight into the real world. But in some sense the process that Hubbell describes in a more formal mathematical way in his book probably does apply. But, how can these sort of ideas be communicated accurately to prevent them from provoking the sort of unproductive SLOSS debate that has often resulted from theory being pushed too far?
Deforestation in La Sepultura 2.
Understanding and predicting patterns of deforestation in El Tablon.
(Note: This is an informal marker for work in progress that will be formalised at a later date as part of the reforlan project. Any suggestions and contributions are welcome. The work has not been reviewed and the conclusions may change. Please contact me if you wish to use any part of this analysis )
Deliberate destruction of existing forest cover is clearly a necessary component of anthropogenic deforestation. However, it is not a sufficient condition. For permanent deforestation to result there must be an impediment to forest regeneration. If vigorous regeneration occurs after timber harvest, or other forest use, then the activity meets at least one of the most basic criteria for sustainability. Unfortunately disturbance of the dry forests of La Sepultura biosphere is not being followed by regeneration. In the last decade the number of cattle grazing extensively within the buffer region of the biosphere reserve has increased rapidly. This has led not only to direct, deliberate removal of forest in order to provide pasture but also an increase in fire severity as the vegetation becomes more open and fuel dries out. Loss of vegetation leads to both chronic and acute soil erosion. Trampling and browsing by cattle prevents tree seedlings establishing. The result is generally agreed by both residents and researchers in the region to be causing degradation in the long term productivity.
This afternoon a colleague sent me a students’ analysis of deforestation in a very different region for comments. The student had produced tables of correlation coefficients and then then fitted a multiple linear regression in order to “explain” the factors responsible for deforestation. This is common way of looking











