Duncan Golicher’s weblog

Priority areas for dry forest restoration in La Frailesca

Posted in Climate change, Evidence and Ecology, POSTGIS, Uncategorized by Duncan Golicher on July 8th, 2008

This is a marker for work in progress. Do not cite.

This morning I was contacted in order to provide geographical information for one of our partners in the Reforlan project. The aim of the research is to find priority areas for forest restoration.

The question of how to go about the analysis is thought provoking. I have already mentioned the problem of aggregation error in this weblog. Pooling data in order to find mean values does not work at the scale in which most of the important processes act. The classic example is mean rainfall for Mexico. At the same time, some elements on the landscape cannot be mapped at a fine scale, and others, such as management units (farms or ejidos) form natural aggregation according to social factors.

So perhaps the first step would be determine priority pixels according to a limited number of physical processes that can be resonably accurately mapped at fine resolution (30 m x 30 m) then aggregate to find the number of pixels in each agronomic unit that could be restored. An analysis of social and economic factors at this scale could then be undertaken in order to discover where restoration might be feasible and bring the most social benefit.

The mappable factors at the 30m pixel scale would be

1. Forest cover and recent deforestation: Pixels that have forest cover presumably do not need restoring. Pixels that have been recently deforested may be prioritised, although natural regneration may be a suitable mechanism in some areas close to the original forest.

2. Slope: Areas with steep slopes must be prioritised for restoration in order to avoid soil erosion. Factors such as slope length may be taken into account.

3. Insolation (aspect): South facing slopes that receive intense insolation during the dry winter months may be in need of restoration, but successful restoration may be more challenging.

4. Distance from rivers: Riparian areas may be in greater need of restoration and trees may establish more successfully in moist soil. At the same time these may be the most valuable areas for irrigated  agriculture.

5. Conservation status: Reserves should perhaps take priority when restoration is being considered.

The interesting element of the research involves finding a suitable way of combing the layers in order to allow ranks to be calculated. A typical approach is weighted averaging, but this may not allow interactive effects to be included.

The map below shows the estimated current extent of the forest in the region and recent deforestation hotspots.

Shaded areas are federal reserves.

Forest cover in La Frailesca: Red and yellow areas are recently deforested: Shaded areas are federal reserves.

Areas with the steepest slopes do tend to coincide with remaining forest cover, as would be expected, but some recent deforestation both intentionally and as a result of forest fires has taken place in the steeply sloping “Sierra” region in the West and South West  and South of the study region.

Slope in degrees (high to low is red through to green passing through blue)

Slope in degrees (high to low is red through to green passing through blue)

North facing slopes receive less direct sunlight during clear winter days and are more likely to be covered in evergreen forest. Fuel dries out on south facing slopes leading to more intense fires. Regeneration may be difficult due to hydric stress.

Yellow pixels are level areas that receive the \

Winter insolation: Yellow pixels are level areas that receive the \


Most deforestation has occured close to the major rivers.

Buffers around major rivers

Buffers around major rivers

It may also be useful to evaluate soil type (not shown) although the maps tend to be at a coarse scale.

Once pixels have been ranked according to aptitude and priority for restoration (the two factors may be inversely related in many cases) they may be aggregated to agricultural management units for further analysis of socio-economic factors.

A range of social economic parameters may be associated with agricultural management units (agebs)

A range of social economic parameters may be associated with agricultural management units (agebs)

Do we see trees or forests?: Knowledge gaps

Posted in Evidence and Ecology, POSTGIS, Uncategorized by Duncan Golicher on July 6th, 2008

It would seem reasonable to assume that research aimed to fill a basic knowledge gap should receive more attention than projects aimed at evaluating complex untestable theory. However in the field of tropical forest ecology it not always clear that fundamental basic knowledge gaps are either being identified or filled quickly enough. An issue that has concerned me for many years is the persistent difficulty in accurately tracing and understanding the spatial distribution of tropical organisms, particularly trees and other plants.

Trees are immobile, long lived, highly visible elements of the tropical landscape. It would therefore seem reasonable to assume that we already know a great deal regarding their distribution. However trees form forests. Forests cause numerous logistical difficulties for field research.  Tropical forests can be hostile, disorienting, pathless environments. They also contain a complex set of natural resources. Illegal exploitation of some of these resources such as valuable timber or plants with interesting pharmaceutical properties (of various types) may obstruct scientific research. Many tropical forests have also provided refuge at times for members of radical political movements.

Tropical trees can be extremely difficult to identify to species. Flowers, fruits and leaves may be out of reach. Their taxonomy is complex and dynamic. Tropical families are unfamiliar to researchers trained in temperate zones. Many species have converged in terms of leaf shape and characteristics.

This has led to chronic ignorance of the distribution of tropical tree species. Only a small fraction of the total area covered by tropical forest of any description has been studied in any detail. Regional species lists rely on data from a few well known areas, but within any given region distribution patterns are poorly known. Many assumptions are made concerning the nature of anthropogenic influence on areas of forest. Some are justifiable, some tend to exaggerate human impact on naturally resilient forest ecosystems.

Given the poor state of our knowledge it is vitally important that the few reliable data sources that are available are used to their full capacity. However even the best global data sources can only provide a small part of the picture alone. Missouri Botanical Gardens (MOBOT) has one of the largest collections of tropical plant specimens of any herbarium. MOBOT has provided online access to most of this data. Yet even when the whole set of tropical tree collections with known coordinates from this source are mapped on to space large gaps are revealed in the empirically based knowledge they provide regarding tree diversity.

A PostGIS query can be used together with the graticule trick mentioned in the last post to map the density of collections within 0.1 degree (approximately 100 km2) grid squares in the MOBOT Vast data base. The query is provided as a model for counting incidences within grid squares. The data itself is stored locally and forms part of a larger set that we have collated over a period of several years.
CREATE TABLE meso.count with oids as
SELECT g.ncollects as Ncollects, a.*  from

grat01 a,

(SELECT d.gid, count(c.id) as ncollects
FROM meso.trees c, grat01 d
WHERE c.the_geom && d.the_geom
AND contains(d.the_geom, c.the_geom)
AND c.source like ‘MOBOT%’
group by d.gid) g

WHERE a.gid=g.gid;

If it is assumed that 100 collections per 100 km2 is a minimum needed to provide a reasonable picture of the distribution of tree species (there may be more than 400 species in each grid square in some regions) the map below shows that only a fraction of the area  of mesoamerica can be considered to be well enough typified by this data set to allow distribution patterns to be mapped through the data alone. This contrasts sharply with the floristic atlases available for most of Western Europe based on known occurrences.

Even when the grid squares with less than 50 collections (far fewer than needed) are added to the map there are many completely unknown areas when this data source is taken alone. Where collections have been taken the density is generally below 5 per 100 km2. Direct empirical evidence regarding tree diversity hot spots is simply not available. Comparisons  between Chiapas, Yucatan, Belize and Costa Rica cannot be made from the data alone.

Light green 1-2, dark green 2-5, blue 5- 50, red >50

Light green 1-2, dark green 2-5, blue 5- 50, red >50

In contrast to our poor knowledge regarding forest composition, our knowledge regarding the extent of both forests and deforestation is becoming ever more accurate and detailed as high resolution satelite imagery becomes available. It appears that we are looking at forests, but not seeing the trees.

Visualising deforestation:Marques de Comillas

Posted in Climate change, Evidence and Ecology, POSTGIS, Uncategorized by Duncan Golicher on July 2nd, 2008

Our ongoing work with Conservation International has led to the first regional forest cover change map that has been derived using a single consistent methodology. The details of the study have not yet been published. However as we work on validation and interpretation of the results a pattern has become very clear.

Previous deforestation studies in Chiapas have usually concentrated attention on well defined study areas. This could produce the impression that deforestation is a homogeneous process over the whole state. In fact study areas selected for deforestation analysis are usually those with the highest rates when compared to the rest of the region. This is quite natural. Many of the areas where deforestation is no longer occurring have already lost a large proportion of their forest cover. However the overall regional deforestation we have quantified is considerably lower than previous studies imply.

Where deforestation has followed the classic pattern of forest conversion to permanent agriculture or pasture the CI methodology, which is based on Landsat imagery, has provided a remarkably good match with high resolution imagery. However where chronic, low level forest disturbance takes place the overall impact of human activities are much less easily quantified at the resolution of Landsat imagery. The difficulty in accurately evaluating forest cover change increases in areas of dry forest. Nevertheless regional patterns are robust.

The clearest deforestation hotspot in the state of Chiapas remains the Marques de Comillas area in the Southern Lacandon. Deforestation and carbon sequestration in this area has previously been studied in some detail by De Jong et al 2000

The two images below are animated gif files which change when clicked on to enlarge them to full size. The show clearly how Landsat based deforestation analysis coincides in this area with the conclusions drawn from visual analysis of recent high resolution imagery. The visual analysis is produced by overlaying our change analysis in Google Earth using Geoserver, and through the use of QGis.

Deforestation:Overlaying vector and raster layers in PostGIS using GRASS

Posted in Evidence and Ecology, POSTGIS, R scripts, Uncategorized by Duncan Golicher on July 1st, 2008

A very common operation in GIS involves overlaying vector polygons on a raster map and calculating the area of each class in the raster map. The results are then added to the attribute table of the vector map. How can this be achieved in PostGIS?

If the raster layer is quite small it could be imported as points and the result achieved through a point in polygon SQL overlay. However that is not the usual way to achieve the operation. When large raster layers derived from the analysis of satelite imagery are of interest the conventional GIS approach  is to first rasterise the vector layer. The computations involved in the overlay are then simple and fast.

This example taken from our ongoing work. It overlays a map of agricultural units on a raster layer representing deforestation in the last decade derived from analysis of satelite data. Because deforestation in Chiapas has tended to take place on a rather small scale involving only a few hectares at a time, it is not at all easy to see the large scale regional pattern from an image. Calculating the proportion of the total area of each agricultural unit deforested helps to see the pattern. The map above shows the percentage of the total (forested + non-forested) area of each agricultural unit estimated to have been deforested between 1990 and 2000.

The simplest way to achieve this result with PostGIS is to import the vector layer into GRASS, where the raster is also held. There are a few issues that have to be addressed when doing this. The first is that the gid index is lost when the attributes are imported. As a unique identifier for each polygon is needed during the overlay an index should be added to the table first if the gid is the only unique identifier for the table. This is very simple.

ALTER TABLE chiapas.agros ADD COLUMN id int8;
UPDATE chiapas.agros SET id=gid;

Now enter GRASS in a location with the same projection as the PostGIS data base (typically EPSG:4326)

Now there is another technical issue. Clearly vector processing in any GIS involves a combination of the geometery and attribute tables. We have both already in PostGIS. However in GRASS we use  functions that work on GRASS’s own representation of the geometry, so an import is needed. GRASS can use a wide variety of databases as a back end for storing attribute data. Among these is Postgresql. What is not obvious, and  potentially quite confusing, is that we already have the data in a Postgresql data base! So, does it all need to be imported again? The answer seems to be yes and no. It is quite possible to link the GRASS topology with the original table. However it is much easier to import it all in the usual way using v.in.ogr. This produces a new table, that could either stored in an external data base or perhaps even within the original PostGIS data base (as a new copy). Of course this attribute only table lacks the geometry column.

For example to set up GRASS to use the original local PostGIS data base as the backend for vector layers

db.connect driver=pg database=”host=localhost,dbname=gisdb2″
db.login user=Duncan pass=******

Note that this step is not necessary. By default Grass will store the table as in dbf format.

A layer can be imported from PostGIS using

v.in.ogr dsn=”PG:host=localhost dbname=gisdb2 user=Duncan” layer=chiapas.agros output=agros type=boundary,centroid

If the raster layer is in another coordinate system, as it is in this case, it will be necessary to exit the location and reproject in the location of the raster layer.

v.proj input=agros location=Global mapset=mesoanalysis output=agros

Then form a raster layer.

v.to.rast input=agros output=agros use=attr column=id layer=1 value=1

Then in GRASS the overlay is achieved quite simply using r.stats and sending the output to a text file.

r.stats -a input=agros,defor >agros.txt

This forms a file in which the first column holds the id of the vector layer, the second the class in the raster layer and the third the area in square meters of each combination. For example…

1 111 1319094.000000
1 112 159201.000000
1 122 1041304.500000
1 211 4026323.250000
1 222 85285437.750000
1 244 89347.500000
1 422 52796.250000
1 444 1749586.500000
2 111 565326.000000

So the task is now done, apart from the job of adding the data back into the original PostGIS table. There are many ways to do this. Often only one, or a few focal classes are of sufficient interest to merit a new column. So an easy way to process the output and send it back to PostGIS involves R.

Connect to the PostGIS data base using RODBC and set the schema

library(RODBC)
con<-odbcConnect(”mydb”)
odbcQuery(con,”SET search_path =chiapas, pg_catalog;”)

Read in the data and set the first column as numeric, removing any missing data (by default it will probably have been interpreted as a factor as GRASS will use a * for missing data).

d<-read.table(”agros.txt”)

d$V1<-as.numeric(as.character(d$V1))
d<-na.omit(d)

Now a subset of the data can be taken for each of the columns of interest, named and added back to the data base. In this case code “111″ represents area that was classified as forest in 1990,2000 and 2005. “122″ pixels that were deforested in the period between 1990 and 2000.

odbcQuery(con,”ALTER TABLE agros ADD COLUMN forest int4;”)
d2<-subset(d,d$V2==111)
d2<-d2[,-2]
names(d2)<-c(”gid”,”forest”)
d2$forest<-round(d2$forest/10000,0)
sqlUpdate(con,d2,”agros”)

odbcQuery(con,”ALTER TABLE agros ADD COLUMN defor2000 int4;”)
d2<-subset(d,d$V2==122)
d2<-d2[,-2]
names(d2)<-c(”gid”,”defor2000″)
d2$defor90<-round(d2$defor2000/10000,0)
sqlUpdate(con,d2,”agros”)

We are rather concerned that this cover change analysis, that has been achieved with the standard methodology used by Conservation International, tends to quite seriously underestimate small scale disturbance and canopy opening. Nevertheless in comparative terms it does demonstrate that absolute deforestation in Chiapas is certainly not a homogeneous process. Extensive deforestation is not occurring, although if the map is zoomed out further to show the Marques de Comillas area a very clear hotspot of deforestation is shown. In this part of Chiapas (Central Valley, Sierra and Altos) comparatively few agricultural units have deforested more than 2% of their total area in the last decade.

The other way of looking at this would be as the proportion of remaining forest lost. This woud tend to emphasise deforestation rates in ejidos and agricultural units with a very small ammount of forest in 1990. However at this level the approach would be somewhat unfair as it would be biased by classification errors in small areas.

To put the map in context a map with the proportion of the total area classified as forest can be used. In R the column can be added using

odbcQuery(con,”ALTER TABLE agros ADD COLUMN propforest float;”)

odbcQuery(con,”update agros set propforest=100*forest/hectares;”)

A surprising number of the agricultural units have over 40% tree cover. This has to be placed in the context of a classification method that draws no distinction between a pixel within a pasture with sufficient cover of secondary vegetation to be classified as forest and untouched primary forest. The next step clearly involves analysis of fragmentation in order to clarify this.

Telecomunicaciones e informática en Ecosur

Posted in Evidence and Ecology, Linux by Duncan Golicher on June 28th, 2008

No creo que sea apto usar este sitio para hacer comentarios con respecto a características de la institución donde trabajo. Nada mas aprovecho una oportunidad para citar un icono cultural.

“Turning and turning in the widening gyre
The falcon cannot hear the falconer;
Things fall apart; the centre cannot hold;
Mere anarchy is loosed upon the world,
The blood-dimmed tide is loosed, and everywhere
The ceremony of innocence is drowned;
The best lack all conviction, while the worst
Are full of passionate intensity.
Surely some revelation is at hand;
Surely the Second Coming is at hand.”
- W. B. Yeats (‘The Second Coming’)

Landscape of the Highlands of Chiapas

Posted in Evidence and Ecology, Poverty in Chiapas and social issues by Duncan Golicher on May 12th, 2008

While backing up some old information on my hard drive I rediscovered a document I produced for Pronatura at the end of November 2003. It lies somewhere between the status of informal field notes and a formal technical report. I presented it to Pronatura a week after a field trip in a light plane. The idea of the document was to record the discussion during the flight.

This was very much at the end of the “pre-google earth” era. Spatial analysis was still rather more of a specialised activity than it is today and I had only begun to become interested in the subject myself. My previous research was generally non-spatial.

Now a “fly over” of the region can be achieved rather more comfortably and safely with a laptop than a four seater plane. The experience was stomach churning at times with a few moments of sheer terror as the turbulence caused by thermals rising from the central depression hit.

Despite the fact that some of the analytical methods used were hurriedly applied and could certainly be refined, I still generally agree with most of the conclusions in the document.So I have placed it here in case it could be useful as a general introduction to the region. It is not directly citable, but similar statements to those included in the document have been made in peer reviewed work I have published together with Luis Cayuela, although there have been very slight differences in emphasis between my interpretation of patterns and causes of deforestation in the region and those adopted by Luis Cayuela and some other colleagues.

Here is the document in PDF form ….

reflections-after-a-flight-over-the-altos-of-chiapas

Hurricanes in Chiapas

Posted in Climate change, Evidence and Ecology, POSTGIS, Uncategorized by Duncan Golicher on April 19th, 2008

I have just added the historical hurricane paths to my PostGIS data base from NOAA http://maps.csc.noaa.gov/hurricanes/

The data base has storm tracks from 1851 to 2006. The above maps shows only those since 1990. This is a very valuable resource that is well worth including in the regional data base and looking at in more detail. The interesting element as far as the state of Chiapas is concerned is that very few hurricanes actually track over the state. This is clearer when the whole (150 year) historical data set is included as shown in the figure below. However the moist air masses associated with hurricanes can cause substantial rainfall in the state even when the centre of the hurricane is some distance to the north or east.

When the map is zoomed out further a very clear cut off point emerges, forming an almost horizontal line at around 8 degrees North. This is a feature associated with  the Intertropical Convergence Zone (ITCZ),

This rather sharp transition also explains some of the strange patterns in the maps of global circulation model simulations which also show a pronounced band around the equatorial ITCZ. The cut off may alter in extent over the next century as the effects of global warming are felt. If it expands slightly northwards, then rainfall associated with hurricanes and tropical storms may also move, leading to a general decline in rainfall in southern Mexico. This could occur even if hurricane intensity and frequency in the Atlantic basin follows an overall increasing trend.

Deforestation and floods 4: PostGIS raster layers?

Posted in Evidence and Ecology, POSTGIS, R scripts by Duncan Golicher on April 15th, 2008

One element struck me when looking at the original data compiled by Dartmouth college on flood  events. The data base contained a table headed “cause”. This was full of text such as “rain”, “heavy rain”, “torrential rain”. The orange points on the map above are the centres of recorded flood events in thals three years. Note VillaHermosa in Tabasco. The coloured layer is shaded by rainfall intensity (five bands ranging from below 600mm to 2500 mm +).

So, the headline would be that “scientists conclude that flooding is caused by ….. rain!”

This is not in fact as trivial as it seems. Devastating floods can occur in regions where heavy rain is uncommon, but there must be a clear correlation between flood frequency and the sheer amount of annual rainfall a region experiences. I am constantly surprised by how quickly the vegetation in Mexico changes from dry forest to exuberant tropical forest. The transition zone can be a matter of kilometers on the road north from Chiapas to Tabasco. The flooding in Villa Hermosa in 2007 was mainly the result of rainfall falling on the state of Tabasco itself (and a small part of the extreme north of Chiapas). During the dramatic events a very light drizzle was falling in San Cristóbal. The lowlands of Tabasco are almost completely deforested, but the mountains of northern Chiapas are covered in a mixture of secondary forest and cattle pastures. Rains in late October fall on already saturated soil profiles. Vegetation cover could only have played a marginal role in reducing the amount of water flowing out of the highland watersheds in this particular case.

This provided me with an excuse to experiment with ways to move raster layers into POSTGIS. The POSTGIS data base structure is clearly not designed for raster layers, but some quite good results can be obtained simply by moving a fairly coarse raster layer, such as an interpolated rainfall coverage, into POSTGIS as points. This is similar to a SpatialPixelsDataFrame in R. In fact the easy way to set up the coverage is to move the data from a sp object in R to POSTGIS by converting the coordinates into a geometry. When viewed at some scales an interference pattern results, but the coverage can be very useful because clicking the information tool on any point on the map results in information on the pattern of rainfall over twelve months, although the visualzation can use the total annual rainfall.

The first image used Udig. The lower image uses Qgis. Again although I have not put a clear legend in this very quick informal treatment, the rainfall ranges from 800mm per year (bright red)to 4000 mm (dark blue)

La niña and early rainfall

Posted in Evidence and Ecology, R scripts, Uncategorized by Duncan Golicher on April 14th, 2008

One of my first posts to this weblog mentioned the statistical correlations between low mid pacific sea surface temperatures and early rainfall in Chiapas.

http://duncanjg.wordpress.com/2008/02/06/checking-on-the-el-nino-effect-using-r/

Heavy rains often do not begin in the state until mid May. However this weekend we have experienced a major rain storm (12 April). Persistent rain fell throughout the night and early morning (14 April). Rerunning the code shows that la Niña effect is still strong.

Of course as any linkages are statistical in nature I do not expect my temporary success at long term weather forecasting to continue.

Tagged with: , ,

Floods and forest cover 2: Ecological inference

Posted in Evidence and Ecology, Probability and statistics by Duncan Golicher on April 3rd, 2008

I have been building up carefully to making a specific comment on Bradshaw et al (2007) because the subject that the paper addresses is too important to risk mistakes.

http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2486.2007.01446.x

The first steps I took yesterday were to look at the data available to the authors. In addition to discovering that the data on the floods themselves were of rather variable quality and lacking in geographical resolution I was surprised by some of the values for recent deforestation in the dataset (from http://www.wri.org) included in the study. In this case I was unable to locate the original data on this web site. However I reformated the data in the paper itself into a GIS and made it available through this weblog.

Despite what looks like exaggeration of the extent of Mexican deforestation and thus probably some other countries, my main issue is not with details regarding numbers within the dataset itself. It is with the underlying logic of the whole analysis. This paper seems to be a classic (almost embarrassing), case of the so called “ecological fallacy” at work. This problem is so well known I was actually quite surprised to see a work of this sort published in a journal with an impact factor of 4.3. It will no doubt be widely cited, particularly by authors interested in strengthening arguments in favour of payments for ecological services. This is likely to have a positive impact on forest conservation. The paper is therefore “politically correct” and could be considered a helpful contribution. However the underlying science is certainly open to criticism.

The term “ecological inference” strangely enough does not come from the discipline of ecology. Ecologists do not often use the term in its original sense. It is much more widely used by social scientists. I will set the scene starting with two classic examples from the social sciences cited by the statistician David Freedman (the tech report from which I took the examples is available here freedman549.pdf)

In 19th century Europe, suicide rates apparently were higher in countries that were more heavily Protestant. The inference could be drawn that suicide is promoted by Protestantism itself. Death rates from breast cancer are higher in countries where fat is a larger component of the diet. Fat intake therefore could cause breast cancer. More recently, floods cause more damage in countries that have high rates of recent deforestation, therefore deforestation causes floods.

These are all “ecological inferences,” that is, inferences about individual behavior drawn from data about aggregates. The ecological fallacy arises when it is believed that characteristics of aggregated data extend to the units from which they have been compiled. In the social sciences this can lead to incorrect inference being drawn regarding the characteristics of the individuals themselves. Individuals do of course live within countries. Floods do take place in defined political units (although lack of exclusivity is an additional issue in this case). However the factors that influence the probability of committing suicide, suffering from cancer or experiencing severe flooding are experienced at the level of the individual or the subregion affected by the flood event. Protestant countries in the nineteenth century were clearly different from the Catholic countries in many ways besides religion (confounding). The original data did not associate individual suicides to any particular religious faith directly (aggregation).

The first problem, that of confounding, must be dealt with in any observational study. But the second problem, that exposure and response are measured only for aggregates rather than for individual, is specific to so called “ecological” studies (the word here being used in the social sciences sense). If there is no confounding, the expected difference between effects for groups and effects for individuals is the “aggregation bias”. In real studies with aggregated data there is usually some confounding and some aggregation bias. Sometimes the message extracted from the aggregated data coincides well with that obtained by an analysis of its components. There is no imperative that fallacious conclusions are always going to be drawn. For example RA Fisher quite wrongly argued against the causal link between smoking and lung cancer due to his feeling that inference was impossible due to confounding (I am not sure how far aggregation was an issue in this particular case). In contrast, in the case of the link between fat intake and breast cancer, more recent studies using data from individuals did raise questions regarding the validity of conclusions drawn from aggregated data.

When data is aggregated in any way at all, details that can help to ensure accurate inference will always be lost. The common reason for this in the social sciences is a bureaucratic imperative to compile tables of regional statistics and a common requirement for confidentiality at the individual level. Election results are a classic example.

In my own experience I have often found that appropriate detail gets lost simply because a student naively aggregates data with the intention of producing a dataset that is easy to handle in a spreadsheet. This can be particularly frustrating if it is done before the data are captured in digital form. There is then no way back. Modern analytical techniques using contemporary statistical environments such as R can handle vast numbers of data points with great ease and speed. Raw data can read be strait into R from online repositories and aggregated according to the needs of a specific analysis with a couple of lines. Mixed effects (hierarchical) models and data mining techniques work with data labelled by individual or unit. There is no longer any reason to continue to repeat avoidable aggregation errors. This is a suitable point to provide a link to a previous post on data handling (in Spanish).

In the final posting on this thread I will finally deal with the specific content of the article in more detail.

Floods and forest cover 1

Posted in Evidence and Ecology, R scripts, Uncategorized by Duncan Golicher on April 2nd, 2008

Good decision making should be based on good scientific advice. It is clearly the job of the scientific community to support decision makers whenever they feel able to do so. Floods cause tremendous suffering, not only in terms of loss of life but in the immense disruption caused to lives, livelihoods and economies. Most residents of Southern Mexico have been affected in some way by floods in Tapachula (2005) and Villa Hermosa (2007). Fortunately these severe events caused comparatively few lives to be lost directly. However their consequences to the regional economy were extremely grave. In both cases politicians in the lower part of the watersheds cited deforestation as a major contributory factor.

Conservation of the native forests of Chiapas will not be achieved unless the full value of the services they provide is appreciated by society. Yet the linkages between flooding and deforestation are complex. When a study claims to provide new evidence regarding linkages between deforestation and flooding it must be considered with some care.

Comments on the scientific value of the paper I cited in the previous post should be based only on evidence. I therefore checked most of the websites the authors used to derive their information before analyzing the paper. An important source was the world atlas of flood events at Dartmouth College

http://www.dartmouth.edu/%7Efloods/archiveatlas/index.htm

The data found there can be mapped directly using their web site as shown below.

floods1.png

I also downloaded and reformatted the raw data and moved it into postgis. It can be used as a shapefile through the zipped file here floodevents(change-extension-to-zip).doc

The data is clearly of variable quality. This is par for the course. Prior to 2006 the geographical coordinates were not entered explicitly in this particular data base. All the original raw data from the Dartmouth data (that I downloaded in Excel format) including textual descriptions of the floods can be loaded strait into R using read.table by changing the extension of the following file to txt flood_events.doc. Here are some quick (Qgis) maps derived from the data showing loss of life from the floods that count with coordinates.

floods2.pngfloods4.pngfloods5.pngfloods3.png

There are various points to note before I look at the statistical analysis used in the article in greater detail. Southern and Eastern Asia seems to be particularly flood prone. Most floods on this continent occur around coastlines, on the lowlying floodplains and in the Himalayan watersheds. Africa has comparatively few floods. Flooding in North and Central America tends to be associated with the effects of tropical depressions and hurricanes.

The patterns of flooding at this scale seem to be determined largely by global circulation patterns, but the consequences and probability of being reported will be determined by settlement patterns and economic activity. Floods are rarely reported from the Amazon basin, even though much of the low lying areas of forest (várzea) are seasonally flooded and these floods vary in severity tremendously between years. These sort of details interest forest ecologists trying to trace tree distribution patterns but never make news because they do not affect human lives.

Tagged with: , ,

PostGIS and R continued

Posted in Climate change, Evidence and Ecology, R scripts, Uncategorized by Duncan Golicher on April 2nd, 2008

Last week a colleague sent me an article published in Global Change Biology for comment. I have just got round to looking at it in detail. The abstract is available here (with full text if you have the institutional access.)

http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2486.2007.01446.x

The authors claim have identified a correlation at the global scale between forest loss and flooding.

I have a lot of issues with the way the data has been analysed that I intend to discuss here. However one of the positive features of this article is that all the data that used is included as an appendix. This allows anyone to make their own mind up regarding the merits of the analysis.

I will make specific comments in a later post. But I first deal with different technical issue. Having setup a postgis database for Mexico and Central America I realised that I lacked a global country map to help visualise data of this sort. I tried importing the shapefiles that come with Arcview, but with no success. They apparently have errors in the topography. I also tried converting the worldhires data in the R mapdata package, but that also had issues with bad topography (apparently American Samoa was to blame). Finally I found this reply to a R help post from the prolifically helpful Roger Bivand

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/78339.html

I tried

load(url(”http://spatial.nhh.no/R/etc/world_countries.rda”))
library(rgdal)
writeOGR(world_countries, “PG:dbname=mydb”, “world_countries”, “PostgreSQL”)

Bingo. In a couple of minutes I had a rather low resolution world countries map with ISO3166 two letter country codes ready for linkage to any data with a column that uses this international standard. It is now nicely tucked away in my postgis data base ready for use whenever needed. Very satisfying. Notice that I include this code as an example of how R connects to a postgis database once one has been set up. It obviously won’t work if you haven’t gone through the steps to establish your local data base. Here is how on Ubuntu. It is not something I would try under Windows,even though posgresql 8.2 apparently has now been ported to Windows.

I then had to reformat the data from the pdf copy of the article. This was slightly frustrating as it needed a bit of manual work in an open office spreadsheet. I then added the ISO3166 codes. This line (also suggested by Roger Bivand) helped a lot. It would be even better if editors and reviewers insisted on the use of standards when they are available.

d<-data.frame(d,ISO3166=world_countries
$ISO3166[match(toupper(as.character(d$Country)),as.character(world_countries$names))])

The nice thing about a postgis data base is that you can use it for storing all your data, spatial or otherwise, in one convenient place. So using RODBC connectivity the floods data can be incorporated by

library(RODBC)
con<-odbcConnect(”mydb”)
sqlSave(con,d,”floods”)

The only slightly fiddly aspect is that the saved table lacks a primary key. That can be resolved from R with few lines that run sql in postgresql.

odbcQuery(con,”ALTER TABLE floods ADD COLUMN index int4;”)
odbcQuery(con,”UPDATE floods SET index=cast(rownames AS integer);”)
odbcQuery(con,”ALTER TABLE floods ADD CONSTRAINT floods_pkey PRIMARY KEY(\”index\”);”)
odbcQuery(con,”ALTER TABLE floods DROP COLUMN rownames;”)

Then the tables can linked together as a temporary view from within R

sql<-”create view floods2 as select f.*, w.wkb_geometry from floods f,world_countries w where f.ISO3166=w.ISO3166″
odbcQuery(con,sql)

The view can be removed from the database whe it is finished with by

odbcQuery(con,”drop view floods2″)

With the view in place Qgis can be used for the visualisation instead of R, which is really helpful for casual analysis. This convenient interoperability, drawing on the particular strengths of a range of different tools within a common framework (statistical environment, database, GIS visualisation), is one of the great plus points of Open Source software from the point of view of a user. However you do have to discover how to do it first. You don’t get a “wizard” to help.

R does produce very good publication quality maps, but there is rather more overhead involved than when using the simple point and click graphical user interface of QGis I am now finding it much easier to send vector layers strait to postgis and work from there.

floods.png

The view can also be exported as a shapefile strait from QGis. As the wordpress site only allows me to upload files it thinks are documents I have placed it here with a false doc extension. Don’t try to open it, save the file to disk and then change the extension to .zip. floods2thisisreallyazipfile.doc

Northern flicker

Posted in Evidence and Ecology, Natural history, Personal and family by Duncan Golicher on March 23rd, 2008

colaptes-auratus.pngcolaptes-auratus1.pngcolaptes-auratus2.pngcolaptes-auratus3.png

There are four groups of the Northern Flicker (Colaptes auratus). This is the southernmost sub species, Colaptes auratus mexicanoides (sometimes known as the Guatemalan Flicker). It is a typical woodpecker in many ways and uses old, decaying trees for food and nesting sites. However the species also spends a surprising amount of time on the ground, often picking insects out of hard dried cow pats. This sort of complex habitat use is very common among “forest” birds. Conservation initiatives in the tropics thus need to take a quite sophisticated view of landscape connectivity and habitat diversity. It is not a simple case of planting more trees. Incidentally as I have been asked to provide georeferences and dates, all photos unless otherwise stated have been taken within 1 km of my house ( 16°42′14.62″N,  92°36′32.71″W) and on the same day that they are posted to the weblog. This was on an alder tree along the side of a small stream.

Potential species distributions 1

Posted in Evidence and Ecology, Natural history, R scripts, Uncategorized by Duncan Golicher on March 22nd, 2008

A constant frustration when working in tropical regions is the shortage of really high quality data. The quantitative knowledge that applied conservation needs in order to prevent the extinction of the world’s endangered species of plants and animals is still surprisingly hard to come by. At best the available hard data is fragmented, of variable quality, difficult to organize and thus hard to analyze in a formal manner. At worst it is either missing or positively misleading

Information used for conservation planning also forms the central subject matter of the scientific discipline of ecology. In other words, it concerns the abundance and distribution of organisms. Our ignorance of this most basic characteristic of the planet we live on is remarkable. Astronomers (arguably) provide estimates of the number of stars in our galaxy with narrower proportional margins of error than we have for the number of species in a Mexican forest. For example, Wikipedia cites three sources in stating that of 2006, the Milky Way is thought to be comprised of between 200 to 400 billion stars. A defensible estimate of the number of tree species in Chiapas could still range from 500 to 2500 depending on how data of dubious quality is interpreted. This is not good enough for such highly visible and important elements of ecological communities. The hard data is just not there to improve on this.

Although in recent years considerable advances have been made in systematics (Linnean shortfalls), these so called “Wallacean” shortfall remains. The large scale, systematic, planned, coordinated field work needed to redress our ignorance is simply not being funded.

In recent years I have been working with techniques for producing automated distribution maps of tropical tree species using R. Even using comparatively good data for areas that we know at first hand, it is a remarkably difficult task. Modelling species distributions can involve more pragmatism than theoretical insight. At a regional scale the task becomes yet more challenging.

This is not because R is too limited in the number of useful tools it provides. A very broad range of models are available. They range from rule based classification and regression trees through AI “black boxes” such as neural networks, to generalised additive models and simple logistic regression. All can be fit within R in a common framework. The models can be used to produce predicted distributions based on either presence-absence data or known occurrences alone (with the addition of “pseudo absences”). It is also possible to automate linkages between R and other popular software such as maxent or GARP. Increasingly accurate layers for many key predictor variables are now available.

However at least fifty well distributed occurrence points are needed to provide credibly well validated distribution maps for a single species using a fully automated procedure. If fewer data points are available some sort of input from “expert judgement” is inevitably required in order to evaluate which of the potentially infinite outputs from species distribution models are most credible. Well documented, repeatable automation of the fitting process using R is very useful in this process, as many maps can be produced in a short space of time based on different assumptions. However resorting to visual pattern matching is frustratingly informal and breaks a standard rule of data analysis, i.e. ensuring perfectly reproducible results.

The code example below is one possible very simple implementation of GAMs with pseudo absences. It is is applied here to modelling the potential distribution of tree species using data points provided mainly from MOBOT. The code (deliberately in this case) does not take into account spatial trends or autocorrelation, thus potential matching climates will often be suggested outside a “known” species range. This can be corrected by adding in coordinates as predictors to the model, but at the cost of potential loss of insight into the distribution of species that may not yet have been collected from their entire range. Any modelling technique has to find a balance between too many false positives and too many false negatives. A common technique is to look at the ROC curve, but this is not particularly useful if the number of known occurrences is very low and restricted to a small part of the suspected range.

I have come to the conclusion that we should not be overly defensive regarding the failings of automated species distribution mapping algorithms when these are inevitably attributable to the poor quality of the underlying data. There is an unavoidable “garbage in, garbage out” syndrome that is difficult to avoid without extremely time consuming data cleaning. This is best undertaken at the point of origin, i.e. in the herbaria and museums where the data is collated.

The models can however provide some useful heuristic input to the evaluation of the potential range of a species. In my opinion, any assessment of the area actually occupied should not be attempted using regional scale models alone. Even obvious techniques such as using forest cover maps as masks to remove non-forested pixels will have mixed results. Some tree and shrub species are still common in areas classified as pasture at a regional scale. Urban areas also can have many trees. Unpublished studies have suggested that tree diversity is higher in urban Managua than in surrounding agricultural areas. Other species have very specific requirements for non mappable habitat characteristics.

These concerns led our group to the conclusion that given the current data the best that can be achieved at a regional scale is to map the distribution of “climatically associated species pools” (CASPS). The “CASP” approach suffers the serious weakness of rather devaluing individualistic species responses, but may be the best that can be achieved until new initiatives begin to provide contemporary, accurately georeferenced and correctly determined occurrence (and absence) data from across the tropics. Data sharing and pooling between currently active research groups will be a vital first step in this process.

simple-gam-models.doc

The output from the R code above is overlain on the 3d Blue Marble image from code in the previous posts to give a visual impression at a very broad regional level. Green points are the input to the model (recorded occurrences) and red points are grid squares with similar combinations of regionally important climatic variables mapped at an approximately 5 km x 5 km scale. Further details of this and similar procedures including mapping of species pools are available in this document.mnp-workshop3.pdf

qcrispipilis.png

This model could provide some heuristic input to expert evaluation of the potential distribution and threats to some species currently being assessed for IUCN red list status. 450 maps produced through an automated procedure are included as low resolution jpg files within pdf archives in alphabetical order here.

A B C D E F G H I J L M N O P Q R S T V W Y Z

Blue marble and Landsat in R 3.

Posted in Evidence and Ecology, R scripts, Uncategorized by Duncan Golicher on March 14th, 2008

botanical-collections.png


A final note on this theme. To add points to a 3d visualisation you need to know z values in order to place them on the surface. The way x,y,z are referred to by rgl is not intuitive and it took me some time to get this to work. The example uses a set of coordinates for the botanical collections and inventories of tree species that are compiled into a database. We are using these data for modelling regional tree species distributions and their potential response to climate change. A major source of this information has been the wonderful online resource provided by Missouri Botanical Gardens MOBOT, but data has also been provided from other sources including Conabio. Here I only provide the raw coordinates of collections, without species names or further details as some of the data is restricted. The interesting point to note is the spatial distribution. Notice that these points are for all trees, thus any one species will form an extremely small subset. Our knowledge of the real distribution of tree species in Mexico is thus clearly still very dependent on expert knowledge and uncompiled data from little explored areas.

The code is here. You should load rgl and sp before running it. addpoints3d.doc

load(url(”http://duncanjg.files.wordpress.com/2008/03/colpoints.doc”))
load(url(”http://duncanjg.files.wordpress.com/2008/03/bmmexicoapril.doc”))

coordinates(a)<-~x+y
a<-overlay(d,a)
a<-as.data.frame(a)

view.3d.rgb<-function(d,dem=4,red=1,blue=2,green=3,exag=1/30000){
fullgrid(d)<-TRUE
y<-d[[dem]]*exag
y[is.na(y)]<-0
dim(y)<-d@grid@cells.dim

x<-seq(min(d@coords[,1]),max(d@coords[,1]),length=d@grid@cells.dim[1])
z<-seq(min(d@coords[,2]),max(d@coords[,2]),length=d@grid@cells.dim[2])

colvec<- rgb(d[[red]],d[[green]],d[[blue]],m=255)

rgl.clear()
rgl.surface(x, z, y, color=colvec)
}

add.points.3d<-function(d=d,x,y,elev,exag=1/30000,…){
rgl.points(x=x,z=max(d@coords[,2])-y+min(d@coords[,2]), y=elev*exag,…)}

view.3d.rgb(d)
add.points.3d(d,a$x,a$y,a$alt,col=2,size=2)

Disturbance and the Lotka Volterra competition model

Posted in Evidence and Ecology, R scripts, Uncategorized by Duncan Golicher on March 12th, 2008

lotka-volterra.png

One of the works that provided constant inspiration both as an undergraduate and graduate student was Huston’s huge 1994 monograph. Despite some inevitable repetition in a book of this length I have continued to find new ideas in it over the years. The book contained a very simple demonstration of the intermediate disturbance hypothesis using the Lotka Volterra competition equations that I continue to use in classes. It can be implemented very concisely in R. The idea is simple. Reset the classic competition equations at periodic intervals and see how it affects the balance between species with r or K strategies. A very simple model with important implications.

Source here …lotka-volterra.doc


R<-c(0.1,0.12,0.15,0.2,0.3)

K<-c(1000,800,600,200,100)

Run<-function(runtime=100,freq=101,intensity=0.1){
N<-numeric(runtime*5)
N<-matrix(N,nrow=runtime,ncol=5)
N[1,]<-10

for (i in 2:runtime){
Totpop<-sum(N[i-1,])
for (j in 1:5){
N[i,j]<-N[i-1,j]+N[i-1,j]*R[j]*(1-Totpop/K[j])
}

if(!i%%freq){
N[i,]<-N[i,]*intensity
}
}
res<-data.frame(N)
res
}
#####################################
par(mfcol=c(2,2))
res<-Run()
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Sin perturbacion”)
res<-Run(freq=60)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion infrequente”)
res<-Run(freq=25)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion intermedio”)
res<-Run(freq=12)
matplot(1:100,res,type=”l”,xlab=”Tiempo”,ylab=”Poblaciones”,lwd=2)
title(”Perturbacion frequente”)

Reference


Huston, M.A. 1994. Biological Diversity: The Coexistence of Species on Changing Landscapes. Cambridge University Press, 708 pp.

Fuelwood model

Posted in Evidence and Ecology, Probability and statistics, R scripts by Duncan Golicher on March 12th, 2008

fuelwood-model1.png

fuelwood-model2.pngfuelwood-model3.png

This morning I designed a simple simulation model in R with a small tcltk interface based on the concepts in the previous two posts. The aim is to allow a prediction of the probable future yield of fuelwood from plots with uncertain planting or survival densities, uncertain diameter distributions and uncertain parameters regarding diameter-height relationships and wood densities.

The model can be run by sourcing the file below into R (install tcltk first). Again this is work in progress and may change as colleagues provide improvement through feedback and new data. The aim is to provide a useful tool for prediction in a highly uncertain environment through giving a probability that a fuelwood or carbon sequestration project’s aims could be met based on known or assumed planting densities and growth rates.

biomasa1.doc

Permanent forest plots 2.

Posted in Climate change, Evidence and Ecology, Probability and statistics, R scripts, Uncategorized by Duncan Golicher on March 11th, 2008

After thinking a little more about the problem of the allometric equation in the paper that I discussed yesterday, I realised I had a partial solution already to hand. (Again please read this as a marker for an idea that could be developed further. This is one of the motivations for using an informal web log instead of leaving ideas as notes on my own hard disk.)

In order to work with a fairly simple individual based forest model I have implemented many different allometric equations in R over time. My favourites for use in data poor areas are still those derived originally by Botkin and Shugart from way back in the late seventies (see for example Shugart 1984). I have used them in both my thesis and a later implementation of the same model. Results from this are included in a recent book chapter. I aim to include the entire R code for this model on this site some point soon (after tidying up some elements and documenting).

The nice thing about the parameterisations used in these early models is that they were explicitly designed to be run with a minimal amount of knowledge. Superficially that doesn’t sound too encouraging. Shouldn’t we try to improve a model by adding more explicit detail? Not necessarily. After all purely empirical allometric equations assume no knowledge at all, apart from that provided through the data. That is their strength and weakness. In order to be used correctly you have to build them afresh for each area being studied. However if equations can be constrained by some very easily obtained field observations a sensible adjustment can be made for the conditions of each plot without going through the costly process of measuring everything afresh. The results certainly won’t be precise, but that is the point. Uncertainty can be included in the calculations through Monte Carlo simulation. The true value should then fall somewhere in the predicted range.

Botkin’s equations can turn measured diameter at breast height into estimated height and therefore estimated tree volume for any species based simply on an estimate of the maximum diameter the species can obtain and its maximum diameter. It is fundamentally a three parameter quadratic model constrained to take realistic values. As maximum height is quite site specific and can be estimated through rapidly made observations the equations are potentially very useful in the tropics where actual measurements of tree height are hard to obtain, or in cases of models of newly established forests when final height may have to be estimated from some knowledge of site quality. Notice that the equation discussed yesterday must be based implicitly on some sort of rule regarding tree height and form factor. You simply can’t get around this. The equation is a statistical black box that just takes in diameter and churns out above ground biomass. The fact that the equation doesn’t explicitly state that height has been estimated doesn’t make it robust. You simply can’t turn diameter into volume and thus biomass without making some sort of assumption regarding the height and shape of a tree, even if it is hidden within an impenetrable calculation. This is a very common failing of empirical equations and a good reason to generally mistrust their uncritical application (This is often the result of a misplaced appeal to authority, an empirical equation has been published by an important research group so it is eminently citable. Yes fine and good, but the equation still only applies to their particular data set.)

So here are the set of simple equations Botkin suggested. I have prefixed each function with f. to avoid confusing a function placed in the global environment in R with a variable.

(All the code shown below can be downloaded in this script … permanentplots2.doc)

f.B2<-function(Hmax,Dmax)2 * ((Hmax - 137) / Dmax)
f.B3<-function(Hmax,Dmax) (Hmax - 137) / Dmax ^ 2
f.BA<-function(Diameter)pi * (Diameter / 2) ^ 2
f.Height<-function(Diameter,Hmax,Dmax)137 + f.B2(Hmax,Dmax) * Diameter - f.B3(Hmax,Dmax) * Diameter ^ 2
f.AGB<-function(D,Hmax=2000,Dmax=100,FormFactor=0.8,r=0.5)r*FormFactor * f.BA(D) * f.Height(D,Hmax,Dmax)/1000

Let’s see how the equation for height works in practice. Assume maximum diameter is 120 cm, but forests (or species) vary in the maximum height they can reach. Note that as this is a quadratic (i.e. parabola) maximum diameter must coincide with the maximum in the data or the curve turns downwards.

D<-10:120

plot(D,f.Height(D,Dmax=120,Hmax=3000)/100,type=”l”,lwd=2,col=”green”,ylab=”Height (m)”,xlab=”Diameter”,ylim=c(0,30))
lines(D,f.Height(D,Dmax=120,Hmax=2000)/100,col=”blue”,lwd=2)
lines(D,f.Height(D,Dmax=120,Hmax=1000)/100,col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”High (30m)”,”Medium (20m)”,”Low (10m)”),col=c(”green”,”blue”,”red”)
)

permanent-plots4.png

Now let’s compare with the formulas used in the paper and included in the previous post and multiply appropriately by basal area and form to produce volume.

plot(D,wet(0.5,D=D),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass (kg)”,xlab=”Diameter”)
lines(D,moist(0.5,D=D),col=”blue”,lwd=2)
lines(D,dry(0.5,D=D),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))

points(D,f.AGB(D=D,Hmax=3500,Dmax=120),col=”blue”)
points(D,f.AGB(D=D,Hmax=2000,Dmax=120),col=”green”)
points(D,f.AGB(D=D,Hmax=1500,Dmax=120),col=”red”)
legend(”bottomright”,pch=1,legend=c(”Hmax=20m”,”Hmax=35m”,”Hmax=15m”),

col=c(”green”,”blue”,”red”))

permanent-plots6.png

So the curves are very comparable. The second set can be produced by making a single reasonable assumption regarding maximum height. In fact the equation for above ground biomass has four parameters. Maximum height, maximum diameter, “form factor” and wood density. The form factor is a term used in forestry and here it might best be expressed as “branchiness”. Due to the way tree growth occurs (the “pipestem”model) total above ground volume can never be expected to exceed the volume of a cylinder. Foresters are only interested in the trunk, but a factor ranging between 0.6 and 1 can be applied quite reasonably to biomass as a whole. Here I have set the value at 0.8. This reproduces quite well the original allometric equations. We can add in uncertainty regarding this parameter in the analysis.

This is now the point at which the equations become useful. If a lot of knowledge is available, then separate parameters can be assigned for each species. Maximum height is quite easy to get.Simply by making this justifiable assumption accuracy should improve markedly over the “one size fits all” assumption that was implicit in the paper under discussion. A nice element is that the quadratic is constrained so the meaningless estimates that can sometimes arise when using polynomials outside their range of applicability will never occur. With less information functional groupings can be used. In the worst case scenario the same equation can be used for all trees in the forest, as was apparently done with the empirical allometric equation in the paper (with the exception of including a species specific term for wood density).

Using R it is very easy to include uncertainty regarding these parameters. As has been pointed out in the previous post, this is likely to affect the conclusions drawn much more than within sample variability. The point is that bootstrapping on parameters of a polynomial allometric equation with no real life interpretation would be very dangerous. The curve could go in many unexpected directions. In contrast bootstrapping on an equation with direct interpretation is reasonable. Lets use a boxplot as a quick way of looking at the results. Lets assume a great deal of uncertainty in Hmax (20m with a sd of 3m) and some uncertainty in the form factor (0.8, sd=0.1). It is easy to extend the uncertainty to all four parameters.

boot.strap<-replicate(10000,f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1)) )
boot.strap<-data.frame(rep(10:120,times=10000),as.vector(boot.strap))
names(boot.strap)<-c(”D”,”V”)
boxplot(V~D,data=boot.strap,range=0.1,outline=F,col=2)

permanent-plots7.png

So we can see how uncertainty in total above ground biomass increases with the diameter of the tree. Finally lets see how this could be applied to the simulated data. If we just want an uncertain estimate of total above ground biomass by applying the bootstrapped conversion formula to a single large plot we can do it like this.

D<-rlnorm(100000,2.2,0.8 )
D<-D[D>10]
D<-D[D<120]

boot.strap<-replicate(1000,sum(f.AGB(D,Hmax=rnorm(1,2000,300),FormFactor=rnorm(1,0.8,0.1))) )
quantile(boot.strap/1000000,c(0.05,0.5,0.95))

5% 50% 95%
6.97 9.75 13.04

plot(density(boot.strap/1000000,bw=0.7),xlab=”Total above ground biomass kg x 10^6″)
abline(v=quantile(boot.strap/1000000,c(0.05,0.5,0.95)),col=c(”blue”,”black”,”red”))

permanent-plots8.png

The extension of the technique to be used on the difference between two measurement periods is trivial if the data is available. More accuracy is obtained by narrowing the range of parameters by using a separate parameter for each species, so there is clearly a cost of ignorance at work here.. This can even be quantified (cf with this post) The method can clearly also be applied to estimating future biomass yield in the case of plots being planted for fuelwood or biofuels, based on making some very simple but justifiable assumptions regarding growth form, wood density and planting density. Because a measure of uncertainty is produced this is very useful in decision making.

biomasa.doc

Permanent forest plots

Posted in Climate change, Current affairs, Evidence and Ecology, R scripts, Uncategorized by Duncan Golicher on March 9th, 2008

On friday my colleague Mario Gonzalez brought an interesting article to my attention that has been published on line in the Open Access biology journal PLoS Biology. I am a fan of this journal myself and I have tried to place a feed to the table of contents on this site (WordPress apparently doesn’t pick it up). I was particularly moved by the editors’ personal reasons for supporting Open Access.

The article in question is available by clicking here.

I was intrigued by the use of the words “Assessing evidence” in the title. The monitoring of large permanent plots has become a staple source of evidence for tropical forest ecologists. In recent years a great deal of progress has been made in understanding the processes at work in tropical forests, largely as a result of heroic efforts in a few well studied plots. Permanent plot monitoring involves a huge investment in time and labour. It produces a great deal of replication at the level of individual trees. However the amount of replication at the landscape scale is clearly limited. This leads to inevitable challenges for both statistical analysis and for the interpretation of results. I was very encouraged by the honesty and transparency in the article’s method section regarding problems associated with data quality. For example the authors explain that “For the plots with more than two censuses, we were able to correct anomalous dbh values by comparing the stem dbh growth rates across census intervals. If a tree showed a dramatic change in dbh growth rate, we changed the one outside of the range (−5 mm/y, +45 mm/y) with the likely value, and updated the dbh value accordingly. This filter was applied using a computer routine, and then checked manually.”

This sort of analysis involves a lot of work after the data has been collected. It also leads to some necessary but rather arbitrary decisions being taken. Trees generally really shouldn’t shrink over time, but anyone who has worked with this sort of data in the tropics has confronted the problem. One element that might help with this in the future is the wider use of PDAs in the field to enable real time sanity checking of data as it is recorded. Errors are often made in the data capture process and not detected until field work is completed. The level of training of technicians also tends to improve over time. So the next decade’s results should be more consistent than the last.

However this is not th element that most concerned me. Despite the interesting title I was left rather unsure whether the authors really had managed to evaluate all the evidence the data set provided. The problem in my mind arose from a common assumption that uncertainty within a statistical analysis is all attributable to variability in the data. In fact any assumption used in any calculation will change the results.. It is now becoming common practice to attempt to quantify the sensitivity of conclusions to assumptions, but this doesn’t seem to have been done here. Instead,even though a complete census was undertaken, statistical significance was based on within sample variability using an arbitrary division into sub-plots and the assumption of independence between them. It is not clear how useful this is.

In fact the main conclusion from the work would seem to be heavily reliant on the use of an allometric equation to convert diameter (usually measured at breast height) into an estimate of total oven-dry aboveground biomass, the fundamental response variable of interest.

Let’s look at how sensitive conclusions might be to this. There is a mistake in the parentheses of the functions as printed in the paper, but as I understand the equations they can be implemented in R using the following code. I have repeated the function with different name and parametrisation for each of the forest types. The important point is that, at least on my initial reading of the analysis, the authors’ applied the same function to all trees in each plot assuming the equation for a forest type applied to the whole plot.

(The code shown below is available in this text file. Use it if there are problems with quotation marks .. permanentplots1.doc )

dry<-function(r=0.5,D,a1=0.667,a2=1.784,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}

moist<-function(r=0.5,D,a1=1.499,a2=2.148,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}

wet<-function(r=0.5,D,a1=1.239,a2=1.980,a3=0.207,a4=0.0281){
D<-log(D)
r*exp(-a1+a2*D+a3*D^2 -a4*D^3)}

x<-1:120
plot(x,wet(0.5,D=x),type=”l”,lwd=2,col=”green”,ylab=”Above ground biomass”,xlab=”Diameter”)
lines(x,moist(0.5,D=x),col=”blue”,lwd=2)
lines(x,dry(0.5,D=x),col=”red”,lwd=2)
legend(”topleft”,lwd=2,legend=c(”wet”,”moist”,”dry”),col=c(”green”,”blue”,”red”))

permanent-plots.png

A point to notice is that the equations suggest that trees in moist forest have higher biomass than those in dry or wet forest. Seven of the ten plots in the paper are considered “moist” but these must apparently lie somewhere on a continuum between dry and wet. There seems to be no clear justification for using the same optimum equation for all of these plots. I am not quite sure of the units as the paper says this produces above ground biomass in Mg (tons). The units look as if they should be Kg to me.

Also these rather complex looking formula basically reproduce the same curve that could be provided by simply fitting a quadratic function with three parameters over the likely range for the data (10:120 cm lets say). Two parameters do not change.

The curves diverge only diverge sharply for larger trees and it may be argued that really large trees are rare, so this shouldn’t matter too much. In all undisturbed tropical forests that I know of there tend to be a large number of small trees and a very few large trees. Thus a simple simulated data set can be produced by slightly truncating a log normal distribution.

set.seed(1)
D1<-rlnorm(100000,2.2,0.8 )
D1<-D1[D1>10]
D1<-D1[D1<130]
hist(D1,main=”Simulated diameter distribution”,xlab=”DBH (cm)”,col=”lightgrey”)

permanent-plots1.png

With time, patience, knowledge of the plots and perhaps a suitable individual based simulation model to hand it would be possible to simulate some reasonable growth and mortality data for each tree over time. For the moment I will simply assume (quite wrongly of course) that all trees increase their diameters equally by 1cm. I will also assume no mortality at all, so an increment in biomass is assured. Not at all realistic, but the question is, how might the increment in biomass in a plot affected by the assumption made regarding the fixed paramaterisation of the allometric equation.

Let’s compare the assumption that the trees follows the dry forest formula against the assumption of the moist forest formula. The code divides the total number of trees into 200 random subplots and carries out the sort of bootstrap resampling that was used to provide confidence intervals. The test of significance used in the paper relied on whether 95% of the bootstrapped values where above zero. In this simulated case, without mortality, this is not interesting. The question is how important is the assumption made for the allometric equation.

nsubplots<-200
subplot<-as.factor(rep(1:nsubplots,each=length(D1)/nsubplots) )
D1<-D1[1:length(subplot)]
D2<-D1+1
d<-data.frame(subplot,D1,D2,moist=moist(D=D2)-moist(D=D1),dry=dry(D=D2)-dry(D=D1))
a<-with(d,tapply(moist,subplot,sum))
boot.moist<-replicate(1000,mean(a[sample(1:nsubplots,nsubplots,replace=T)]))
b<-with(d,tapply(dry,subplot,sum))
boot.dry<-replicate(1000,mean(b[sample(1:nsubplots,nsubplots,replace=T)]))

d2<-data.frame(category=rep(c(”moist”,”dry”),each=1000),bootstrap=c(boot.moist,boot.dry))
bwplot(bootstrap~category,data=d2,breaks=1000)

permanent-plots2.png

So in this simulated data, the confidence intervals attributable to the bootstrapping procedure are overwhelmed completely by the effect of changing the assumptions in the allometric equation. This was simulated data and the effect may be much less in real data. But it should be taken into account in some sense. One way of dealing with it is to bootstrap on the parameters used in the entire calculation as was done here.

Another interesting point is that although PLOs Biology is laudably open access, the raw data itself is not made available. Thus I had to simulate data to try to make a point. The future of open access research should (IMHO) be directed towards open, transparent and documented data analysis. R can play an important role in this as any analysis can be freely shared as a script. We always have to make assumptions when analysing data and some assumptions are a matter of personal preference or instinct. A future model for meta-analysis might be based on an “analysis of analyses”. The more ways a data set is looked at the more likely it is that its fundamental content will be revealed.

Open source, open ideas

Posted in Current affairs, Evidence and Ecology, Linux by Duncan Golicher on March 7th, 2008

The nobel prize winning economist Joseph Stiglitz is very fond of quoting Thomas Jefferson. “Knowledge is like a candle; as it lights another candle, its original light is not diminished”.

Economists like Stiglitz refer to the opposite of this property as exclusivity. In other words if I eat a box of chocolates or turn a tree into timber, you can’t have the chocolates or enjoy the shade and scenic beauty provided by the tree. Only one of us can consume a consumable. On the other hand if I have an idea I can share it with you, if you are interested. If my idea happens to be a good one then denying access to it, or the right to use it, gives rise to an inefficiency. Traditional intellectual property is based on such exclusion. Knowledge is a global public good that is of potential benefit to anyone in the world. There is a global social cost in depriving anyone in the world the right to use available knowledge.

Most of our individual ideas are “memes”. In other words we have all been lit from someone else’s candle. Originality in science is extremely difficult and is only partially achieved even by the small handful of creative individuals that manage to provide a new spark. We all worry that we are wrong or being too derivative when we think or write about scientific issues but I have (quite recently) come to the conclusion that we really shouldn’t be worried that our ideas are not original providing we are casting some light, somewhere. The more candles that are out there, the more light is cast and the collective effect is to everyone’s benefit.

In the open source model of software development the light cast has an even more positive effect, because software is itself useful as a tool for achieving other goals. Software developers in fact make some of the candles. I have neither the time nor the ability to contribute code to open source projects, but I appreciate the tremendous generosity of all those who do and try to support and use open source alternatives to commercial software as a matter of principle.

A simple illustration of an ecological lottery model in R

Posted in Evidence and Ecology, Natural history, Probability and statistics, R scripts by Duncan Golicher on March 7th, 2008

A colleague contacted me yesterday with some interesting modelling work based on the neutral community model of Stephen Hubbell.

Hubbell’s work fascinated me when it was first published. I have always assumed that it is easily misunderstood. If Hubbell is taken to imply that all species are equal, then the work appears to be eroding the basic subject matter of ecology. Ecologists tend to look for informative differences between species. However this is not exactly what the theory is based on. It is rather more subtle. The theory concerns the equality of individuals (in this sense it almost has political implications). Hubbell writes.

“The theory treats organisms in a community as essentially identical in their per capita probabilities of giving birth, dying, migrating, and speciating. This neutrality is defined at the individual level, not the species level. All that is required is that all individuals of every species obey exactly the same rules of ecological engagement. So, for example, if all individuals and species enjoy a frequency-dependent advantage in per capita birth rate when rare, and if this per capita advantage is exactly the same for each and every individual of a species of equivalent abundance then such a theory would qualify as a bona fide neutral theory by the present definition.”

I have illustrated this idea for students with a very small simulation in R . R code, like matlab or octave, is very terse and efficient so a couple of lines can do a fair amount of work. Lets assume a very simple reproductive rule applies to all individuals. The rule is that all have the same discrete generation time, in other words they all survive one time step of a simulation. At the end of the time step each individual produces two offspring. The lottery then applies. The offspring are randomly mixed and placed back on the finite space occupied by the previous generation. So half find a home and half “die”. The process is then repeated. This was exactly how I first implemented the model. I then re-implemented the idea in fewer lines by making the probability of selection depend on the proportion of individuals in each species, which amounts to the same thing.

This is all the code needed to run the model. (Try this if the quotation marks are not right lottery.doc)


nspecies<-8
gridsize<-20
time<-100

RUN<-function(){
X<<-matrix(0,nspecies,time)
palette(rainbow(nspecies))

mat<-sample(1:nspecies,gridsize*gridsize,replace=T)
X[,1]<-table(mat)
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))

for (i in 2:time){
a<-X[,i-1]
mat<-sample(1:nspecies,gridsize*gridsize,prob=a,replace=T)
X[,i]<-table(factor(mat,levels=1:nspecies))
image(matrix(mat,gridsize,gridsize),col=sort(unique(mat)))
gc()
}

matplot(t(X),type=”l”,lwd=2,col=rainbow(nspecies))
}

#However to get a little tcltk interface add this

############################################

library(tcltk)

Run<-function(…){
nspecies<<-as.numeric(tclvalue(”nspecies”))
gridsize<<-as.numeric(tclvalue(”gridsize”))
time<<-as.numeric(tclvalue(”time”))
RUN()}

tt <-tktoplevel()
d1<-tklabel(tt,text=”Numero de species”)
d2<-tklabel(tt,text=”Tamaño del grid”)
d3<-tklabel(tt,text=”Tiempo”)

s1 <- tkscale(tt,from=0, to=20, variable=”nspecies”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s2 <- tkscale(tt, from=2, to=100, variable=”gridsize”,
showvalue=TRUE, resolution=1, orient=”horiz”)
s3 <- tkscale(tt, from=10, to=500, variable=”time”,
showvalue=TRUE, resolution=10, orient=”horiz”)

but <- tkbutton(tt, text=”Run”,command=Run)

tkgrid(d1,s1)
tkgrid(d2,s2)
tkgrid(d3,s3)
tkgrid(but)

The model is to be played with (just download R from CRAN and paste all the code into the console), rather than watched but here are some example runs on You Tube,

The moral of the story is that under a neutral model some species start a random walk to extinction if co-existing in a small space with other species. However we cannot tell which will dominate and which become extinct from their characteristics, because all individuals, regardless of species are governed by exactly the same rules. The fate of species is determined by the cumulative luck or misfortune of the individuals that have been given the label. The larger the grid the more species that can be supported for a longer time, but extinction risk is always present if a long losing streak hits. The element of Hubbell’s theory that this model does not reproduce well is the log-normal (-ish) relative abundance distribution of the surviving species, although come to think of it I have not tried running the model with a really large number of species and a really large grid. R runs out of colours for the illustration, but this is not important.

Many interesting questions remain. Could a simulation model based on this sort of idea be used to predict extinction over a real landscape? Models of this sort seem far too abstract to provide real insight into the real world. But in some sense the process that Hubbell describes in a more formal mathematical way in his book probably does apply. But, how can these sort of ideas be communicated accurately to prevent them from provoking the sort of unproductive SLOSS debate that has often resulted from theory being pushed too far?

Deforestation in La Sepultura

Posted in Climate change, Evidence and Ecology, GRASS scripts, R scripts, Uncategorized by Duncan Golicher on March 3rd, 2008