A constant frustration when working in tropical regions is the shortage of high-quality data. The quantitative knowledge that applied conservation needs in order to prevent the extinction of the world’s endangered species of plants and animals is still surprisingly hard to come by. At best the available hard data is fragmented, of variable quality, difficult to organize and thus hard to analyze in a formal manner. At worst it is either missing or positively misleading.
Information used for conservation planning also forms the central subject matter of the scientific discipline of ecology. In other words, it concerns the abundance and distribution of organisms. Our ignorance of this most basic characteristic of the planet we live on is remarkable. Astronomers (arguably) provide estimates of the number of stars in our galaxy with narrower proportional margins of error than we have for the number of species in a Mexican forest. For example, Wikipedia cites three sources in stating that, as of 2006, the Milky Way is thought to contain between 200 and 400 billion stars. A defensible estimate of the number of tree species in Chiapas could still range from 500 to 2500, depending on how data of dubious quality is interpreted. This is not good enough for such highly visible and important elements of ecological communities. The hard data needed to improve on this is simply not there.
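The comparison of proportional margins can be made concrete with a little arithmetic. The short Python sketch below is only an illustration: the function name `proportional_range` is invented for this example, and taking the midpoint of each range as the "estimate" is an assumption.

```python
def proportional_range(low, high):
    """Return (upper/lower ratio, half-width as a fraction of the midpoint)."""
    midpoint = (low + high) / 2
    half_width = (high - low) / 2
    return high / low, half_width / midpoint

# Ranges quoted in the text above.
stars_ratio, stars_margin = proportional_range(200e9, 400e9)
trees_ratio, trees_margin = proportional_range(500, 2500)

print(f"Milky Way stars:      x{stars_ratio:.1f} range, +/-{stars_margin:.0%} of midpoint")
print(f"Chiapas tree species: x{trees_ratio:.1f} range, +/-{trees_margin:.0%} of midpoint")
```

On these figures the upper bound for stars is twice the lower bound, while the upper bound for tree species is five times the lower bound.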
Although considerable advances have been made in systematics in recent years (addressing the “Linnean” shortfall), the so-called “Wallacean” shortfall, our ignorance of where species actually occur, remains. The large-scale, systematic, planned, coordinated field work needed to redress this ignorance is simply not being funded.
In recent years I have been working with techniques for producing automated distribution maps of tropical tree species using R. Even using comparatively good data for areas that we know at first hand, it is a remarkably difficult task. Modelling species distributions can involve more pragmatism than theoretical insight. At a regional scale the task becomes yet more challenging.
This is not because R is limited in the number of useful tools it provides. A very broad range of models is available. They range from rule-based classification and regression trees, through AI “black boxes” such as neural networks, to generalised additive models and simple logistic regression. All can be fitted within R in a common framework. The models can be used to produce predicted distributions based on either presence-absence data or known occurrences alone (with the addition of “pseudo absences”). It is also possible to automate linkages between R and other popular software such as Maxent or GARP. Increasingly accurate layers for many key predictor variables are now available.
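To make the idea of “pseudo absences” concrete, here is a minimal, self-contained sketch. It is written in plain Python rather than R, it uses a simple logistic regression rather than a GAM, and the occurrence data are synthetic, so it should be read as an illustration of the principle and not as the procedure used for the maps discussed here. All function names, the climate bounds and the two predictors (mean temperature and annual rainfall) are assumptions made for the example.

```python
import math
import random

def sample_pseudo_absences(n, bounds, rng):
    """Draw n background points uniformly within the (min, max) bounds of each variable."""
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]

def standardise(rows):
    """Return the column means and standard deviations used to scale predictors."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return means, sds

def scale(row, means, sds):
    return [(v - m) / s for v, m, s in zip(row, means, sds)]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, row):
    """Predicted probability of presence; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], row)))

def fit_logistic(X, y, steps=500, rate=0.5):
    """Fit a logistic regression by batch gradient descent."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, target in zip(X, y):
            err = predict(weights, row) - target
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - rate * g / len(X) for w, g in zip(weights, grad)]
    return weights

rng = random.Random(42)
bounds = [(10.0, 30.0), (500.0, 4000.0)]  # assumed temperature (C) and rainfall (mm) ranges

# Synthetic "known occurrences" concentrated in warm, wet conditions.
presences = [[rng.gauss(26.0, 1.5), rng.gauss(3000.0, 300.0)] for _ in range(50)]
# Pseudo absences: random background points standing in for true absences.
pseudo_absences = sample_pseudo_absences(200, bounds, rng)

rows = presences + pseudo_absences
labels = [1] * len(presences) + [0] * len(pseudo_absences)
means, sds = standardise(rows)
weights = fit_logistic([scale(r, means, sds) for r in rows], labels)

warm_wet = predict(weights, scale([27.0, 3200.0], means, sds))
cool_dry = predict(weights, scale([12.0, 800.0], means, sds))
print(f"warm, wet cell: {warm_wet:.2f}   cool, dry cell: {cool_dry:.2f}")
```

Because the pseudo absences are drawn from the whole background, some will inevitably fall in suitable climate. The fitted values are therefore better read as a relative index of climatic similarity than as true probabilities of occurrence.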
However, at least fifty well-distributed occurrence points are needed to provide credibly validated distribution maps for a single species using a fully automated procedure. If fewer data points are available, some sort of input from “expert judgement” is inevitably required in order to evaluate which of the potentially infinite outputs from species distribution models are most credible. Well-documented, repeatable automation of the fitting process using R is very useful here, as many maps can be produced in a short space of time based on different assumptions. However, resorting to visual pattern matching is frustratingly informal and breaks a standard rule of data analysis, i.e. ensuring perfectly reproducible results.
The code example below is one possible, very simple implementation of GAMs with pseudo absences. It is applied here to modelling the potential distribution of tree species using data points provided mainly by MOBOT. The code (deliberately in this case) does not take into account spatial trends or autocorrelation, so potentially matching climates will often be suggested outside a “known” species range. This can be corrected by adding coordinates as predictors to the model, but at the cost of a potential loss of insight into the distribution of species that may not yet have been collected from their entire range. Any modelling technique has to find a balance between too many false positives and too many false negatives. A common technique is to look at the ROC curve, but this is not particularly useful if the number of known occurrences is very low and restricted to a small part of the suspected range.
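Separately from the GAM code referred to above, the trade-off between false positives and false negatives can be illustrated with a few lines of plain Python. This is a hedged sketch: the scores and labels are invented, and the function names `roc_points` and `auc` are hypothetical helpers written for this example, not part of any package.

```python
def roc_points(scores, labels):
    """Return (threshold, false positive rate, true positive rate) for each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points

def auc(scores, labels):
    """Chance that a random presence outscores a random pseudo absence (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented model outputs: 1 = recorded occurrence, 0 = pseudo absence.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold, fpr, tpr in roc_points(scores, labels):
    print(f"threshold {threshold:.1f}: false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")
print("AUC:", auc(scores, labels))
```

Lowering the threshold recovers more of the known occurrences but flags more background cells as suitable. With only four presences, as here, each step of the curve moves the true positive rate by a quarter, which shows why the curve says little when occurrences are few.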
I have come to the conclusion that we should not be overly defensive regarding the failings of automated species distribution mapping algorithms when these are attributable to the poor quality of the underlying data. There is a “garbage in, garbage out” syndrome that is difficult to avoid without extremely time-consuming data cleaning. This is best undertaken at the point of origin, i.e. in the herbaria and museums where the data is collated.
The models can, however, provide some useful heuristic input to the evaluation of the potential range of a species. In my opinion, any assessment of the area actually occupied should not be attempted using regional-scale models alone. Even obvious techniques such as using forest cover maps as masks to remove non-forested pixels will have mixed results. Some tree and shrub species are still common in areas classified as pasture at a regional scale. Urban areas can also have many trees. Unpublished studies have suggested that tree diversity is higher in urban Managua than in surrounding agricultural areas. Other species have very specific requirements for non-mappable habitat characteristics.
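The mixed results of masking are easy to demonstrate on a toy grid. The Python sketch below uses a made-up 2 x 3 grid; `apply_mask` and `masked_out_occurrences` are hypothetical helper names, and the data are invented purely to show how a forest cover mask can discard cells where the species has actually been recorded.

```python
def apply_mask(predicted, forest):
    """Keep a predicted-suitable cell only where the mask marks forest cover."""
    return [[p and f for p, f in zip(p_row, f_row)]
            for p_row, f_row in zip(predicted, forest)]

def masked_out_occurrences(occurrences, forest):
    """Return recorded occurrences (row, col) that fall in cells the mask treats as non-forest."""
    return [(r, c) for r, c in occurrences if not forest[r][c]]

# Toy grids: True = climatically suitable / classified as forest.
predicted = [[True, True, False],
             [True, True, True]]
forest = [[True, False, False],
          [False, True, True]]
occurrences = [(0, 1), (1, 2)]  # cells with herbarium records (invented)

masked = apply_mask(predicted, forest)
lost = masked_out_occurrences(occurrences, forest)
print("masked prediction:", masked)
print("records removed by the mask:", lost)
```

Here one of the two records lies in a cell classified as non-forest, so the masked map would exclude a place where the species is known to grow.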
These concerns led our group to the conclusion that, given the current data, the best that can be achieved at a regional scale is to map the distribution of “climatically associated species pools” (CASPs). The CASP approach suffers the serious weakness of rather devaluing individualistic species responses, but may be the best that can be achieved until new initiatives begin to provide contemporary, accurately georeferenced and correctly determined occurrence (and absence) data from across the tropics. Data sharing and pooling between currently active research groups will be a vital first step in this process.
The output from the R code above is overlain on the 3D Blue Marble image from code in the previous posts to give a visual impression at a very broad regional level. Green points are the input to the model (recorded occurrences) and red points are grid squares with similar combinations of regionally important climatic variables mapped at an approximately 5 km x 5 km scale. Further details of this and similar procedures, including mapping of species pools, are available in this document (mnp-workshop3.pdf).
This model could provide some heuristic input to expert evaluation of the potential distribution of, and threats to, some species currently being assessed for IUCN Red List status. A total of 450 maps produced through an automated procedure are included as low-resolution jpg files within pdf archives in alphabetical order here.