Preguntas sobre PostGIS y la presentación de ayer.
1. ¿Necesito Linux para usarlo?
No. El modelo es lo que se llama “cliente-servidor”. Es decir que PostGIS es el servidor. Proporciona los datos, pero la visualización de los datos o el procesamiento de las consultas es al nivel del cliente. Tu puedes escoger el cliente que mas te gusta. En este momento hay varios programas clientes de acceso abierto (gratis) que pueden leer, procesar y visualizar información de un servidor de PostGIS. Lo mas fácil de usar es probablemente QGIS, aunque Udig también tiene buenas características. Open Jump es otra alternativa. Todos estos programas se puede instalar en Windows. Los clientes están en un proceso de desarrollo todavía y van a incrementar su utilidad con el tiempo. PostGIS ya tiene toda la funcionalidad de servir datos a clientes con mas capacidades.
Al mismo tiempo programas como Excel y Access pueden vincularse con el servidor para procesar datos no espaciales.
2. ¿Es difícil instalar Post GIS? Necesitamos apoyo técnico externo para hacerlo.
No. Se instala PostGIS con un servidor de Linux en alrededor de una media hora. Se requiere un poco de planificación y esfuerzo adicional para la configuración óptima de las opciones de seguridad. Pero es importante recordar que los usuarios de la información no van a instalar PostGIS. Ellos pueden bajar un archivo qgis.exe del internet y instalarlo bajo Windows en minutos. Si usan Linux también van a instalar Qgis, o Udig, no PostGIS. El uso de Qgis es sencillo e intuitivo. Se puede dar capacitación en su uso practico en el curso de una mañana.
3. Hemos intentado bajar capas de un mapserver, pero parece complicado.
La relación entre PostGIS y un mapserver es potencialmente un elemento potente, pero no se necesita correr un mapserver para usar PostGIS. Un mapserver recibe capas de PostGIS, Asi que es otra capa entre el usuario y servidor. La configuración de un mapserver para correr con PostGIS no es muy difícil. Hay dos mapservers de acceso abierto (mapserver y geoserver) que pueden usarse. Sin embargo un mapserver es realmente una forma de dar información al publico. El primer paso hacia el uso de un mapserver sería asegurar la velocidad de la conectividad al Internet de Ecosur.
4.Estoy acostumbrado de usar ArcView, tengo licencia para usarlo y muchos shapefiles disponibles. ¿Por qué me interesa esta iniciativa?
La ventaja de una base de datos espacial es que toda la información esta guardada en el mismo lugar, bajo el mismo sistema de coordenadas y actualizado. Las capas pueden comunicarse entre ellas. Si ves por ejemplo que las coordenadas del censo nacional no coinciden con los del conteo de población sería posible tratar de resolver el problema. Una vez resuelto todos los usuarios de la base no van a repetir los errores. Usuarios de ArcView pueden bajar la información como shapefiles con un clic y trabajarla al nivel local de sus propios Pcs. Es muy importante enfatizar que la iniciativa no fuerza usuarios de cambiar su forma de trabajar. Cada persona desarrolla sus propias preferencias por software al nivel de sus propia “desktop” y hay que respetar sus decisiones. Pero al nivel del servidor hay que asegurar que todos los datos están mantenidos bajo estándares rigurosos, uniformes y convencionales.
5.¿Cuando podemos contar con un servicio así?
Un prototipo ya esta funcional y ya puede ser muy útil. Se puede proporcionar una herramienta de uso interno en el espacio de una semana. Pero la verdadera potencia de un sistema así viene con mas esfuerzo. Se puede programar portales Web atractivos para el publico. Se puede organizar sistemas amigables para que estudiantes entren sus datos en el sistema sin tener que saber detalles sobre el uso de SQL. El control de calidad y mejoramiento de la información se consigue a través de esfuerzos de revisión meticulosos y programación con datos. Para contar con un sistema del cual podemos estar orgullosos como institución debemos emplear por lo menos una persona con capacidad en programación de bases de datos tiempo completo.
Sería vital la coordinación entre investigadores que producen nuevos datos de campo, las colecciones, LAIGE, informática y la biblioteca.
6. ¿Cuales son la ventajas de Linux para mi como individuo?
Mis propias razones estan expresados aqui.
Presentación de PostGIS en Ecosur
Abajo he incluido las diapositivas de una presentación que voy a dar hoy en Ecosur sobre la potencia de PostGIS como un base de datos para la investigación institucional en Ecosur. Incluyo una introducción sobre la filosofía del movimiento de aceso libre y la potencia del nuevo modelo “profesionalizado” del software de aceso libre.
Los ejemplos del uso de POSTGIS se ven en los videos de You Tube abajo.
Desafortunadamente la calidad del imagen no es muy alto. De todos modos dado la velocidad de la conexión institucional no creo que se pueden ver dentro del institución.
Hurricanes in Chiapas
I have just added the historical hurricane paths to my PostGIS data base from NOAA http://maps.csc.noaa.gov/hurricanes/
The data base has storm tracks from 1851 to 2006. The above maps shows only those since 1990. This is a very valuable resource that is well worth including in the regional data base and looking at in more detail. The interesting element as far as the state of Chiapas is concerned is that very few hurricanes actually track over the state. This is clearer when the whole (150 year) historical data set is included as shown in the figure below. However the moist air masses associated with hurricanes can cause substantial rainfall in the state even when the centre of the hurricane is some distance to the north or east.
When the map is zoomed out further a very clear cut off point emerges, forming an almost horizontal line at around 8 degrees North. This is a feature associated with the Intertropical Convergence Zone (ITCZ),
This rather sharp transition also explains some of the strange patterns in the maps of global circulation model simulations which also show a pronounced band around the equatorial ITCZ. The cut off may alter in extent over the next century as the effects of global warming are felt. If it expands slightly northwards, then rainfall associated with hurricanes and tropical storms may also move, leading to a general decline in rainfall in southern Mexico. This could occur even if hurricane intensity and frequency in the Atlantic basin follows an overall increasing trend.
An example of using PostGIS from R
One of the interesting elements of PostGIS is that it can be run from R using RODBC. I have found this to be very useful, especially when tables are being altered, as it allows the properties of the results of a query to be quickly analysed and checked for sanity before being added to the data base. So in order to work with POSTGIS I usually open QGIS, PgAdmin and an R console and switch between them.
Here is an example of how you might calculate the population density within polygons and add the results to a PostGIS table.
In R first connect to a data base assuming an ODBC connector has been set up
library(RODBC)
con<-odbcConnect(”mydb”)
Now the example will find the sum of the population within polygons representing watersheds. This layer was originallybuilt using r.watershed in GRASS followed by r.to.vect. The automated watershed limitation doesn’t work very well along the coast, but provides a useful set if polygons for the rest of Chiapas, particularly the central valley. First add a new column to hold the result.
odbcQuery(con,”ALTER TABLE chiapas_basins ADD COLUMN poptot int4;”)
Then the query that calculates the total population can be run from R. The result can by looked at first in R to check that it is sensible. This is useful, as I found differences between using the 2000 census data and the 2005 population count that are not attributable to population change or migration, but rather are caused by errors in the geopositioning of the data.
sql<-”select sum(m.z1) as poptot, b.id from chiapas_conteo_2005 m, chiapas_basins b where m.the_geom && b.the_geom AND contains(b.the_geom, m.the_geom) group by b.id;”
d<-sqlQuery(con,sql)
This takes a while to run. I quickly discovered that it is extremely important to make sure that a GIST index is added before attempting point in polygon operations. This can be added to each table in PgAdmin, for example.
CREATE INDEX chiapas_basins_idx_geo on chiapas_basins USING GIST(the_geom GIST_GEOMETRY_OPS);
Now histograms and totals of the population in the watersheds can be produced in R to look at the results before adding them back into the data base.
For example
hist(log(d$poptot))
The population density is typically log normal.
sum(d$poptot)
4255957
This gives Chiapas a total population of 4.25 million which looks about right. As the query returned the results with a column name that has already been set up and has the primary key this line will update the database.
sqlUpdate(con,d,”chiapas_basins”)
Now we need to calculate the area of the watersheds in km2. To do this I first had to insert an srid for Albers conic projection for North America. This can be found here
http://www.spatialreference.org/ref/epsg/102008/
Then some more queries that can either be run in pgAdmin or wrapped up by odbcQuery(con, “some sql statements “) and run from R directly.
odbcQuery(con,”ALTER TABLE chiapas_basins ADD COLUMN areakm2 float4;”)
odbcQuery(con,”UPDATE chiapas_basins SET areakm2 = area(transform(the_geom,9102008))/1000000; “)
odbcQuery(con,”ALTER TABLE chiapas_basins ADD COLUMN density float4;”)
odbcQuery(con,”UPDATE chiapas_basins SET density = poptot/areakm2; “)
The results can then be visualised in QGIS. Whether this is easier than using ARCGIS certainly depends on what you are used to. I am still working out how to achieve fairly basic operations like this in PostGIS. However when I have achieved them once, it is then easy to repeat the operation for other examples, which is why I am posting fairly simple examples like this to the weblog for reference in case they can help others to overcome the intially steep learning curve with PostGIS. Although the geoprocessing wizard in ArcView is very simple to use and the point in polygon operation does seem to run much faster than the corresponding query in PostGIS, this particular operation would still require some additional work with the database tables if it were to be done using ArcView alone.
Importing maps from BERDS: Reprojection using ogr2ogr and shp2pgsql
I am constantly impressed by the BERDS database for Belize. It is powered by POSTGIS, and also offers downloads of some coverages as shapefiles in UTM. I have just downloaded and imported Jan Meerman’s fire risk map into my POSTGIS system. I did it in two steps that I have found to be generally more robust than trying to use ogr2ogr. directly. First reproject to a new shapefile then import the result using shp2pgsql. I am not sure why this process results in fewer errors, but it does seem to work with some difficult imports I have been attempting (although this particular example is very clean).
ogr2ogr -f “ESRI Shapefile” belizefire.shp -t_srs EPSG:4326 -s_srs EPSG:2028 firerisk.shp
shp2pgsql -s 4326 -c belizefire belizefire mydb | psql -d mydb
Deforestation and floods 4: PostGIS raster layers?
One element struck me when looking at the original data compiled by Dartmouth college on flood events. The data base contained a table headed “cause”. This was full of text such as “rain”, “heavy rain”, “torrential rain”. The orange points on the map above are the centres of recorded flood events in thals three years. Note VillaHermosa in Tabasco. The coloured layer is shaded by rainfall intensity (five bands ranging from below 600mm to 2500 mm +).
So, the headline would be that “scientists conclude that flooding is caused by ….. rain!”
This is not in fact as trivial as it seems. Devastating floods can occur in regions where heavy rain is uncommon, but there must be a clear correlation between flood frequency and the sheer amount of annual rainfall a region experiences. I am constantly surprised by how quickly the vegetation in Mexico changes from dry forest to exuberant tropical forest. The transition zone can be a matter of kilometers on the road north from Chiapas to Tabasco. The flooding in Villa Hermosa in 2007 was mainly the result of rainfall falling on the state of Tabasco itself (and a small part of the extreme north of Chiapas). During the dramatic events a very light drizzle was falling in San Cristóbal. The lowlands of Tabasco are almost completely deforested, but the mountains of northern Chiapas are covered in a mixture of secondary forest and cattle pastures. Rains in late October fall on already saturated soil profiles. Vegetation cover could only have played a marginal role in reducing the amount of water flowing out of the highland watersheds in this particular case.
This provided me with an excuse to experiment with ways to move raster layers into POSTGIS. The POSTGIS data base structure is clearly not designed for raster layers, but some quite good results can be obtained simply by moving a fairly coarse raster layer, such as an interpolated rainfall coverage, into POSTGIS as points. This is similar to a SpatialPixelsDataFrame in R. In fact the easy way to set up the coverage is to move the data from a sp object in R to POSTGIS by converting the coordinates into a geometry. When viewed at some scales an interference pattern results, but the coverage can be very useful because clicking the information tool on any point on the map results in information on the pattern of rainfall over twelve months, although the visualzation can use the total annual rainfall.
The first image used Udig. The lower image uses Qgis. Again although I have not put a clear legend in this very quick informal treatment, the rainfall ranges from 800mm per year (bright red)to 4000 mm (dark blue)
La niña and early rainfall
One of my first posts to this weblog mentioned the statistical correlations between low mid pacific sea surface temperatures and early rainfall in Chiapas.
http://duncanjg.wordpress.com/2008/02/06/checking-on-the-el-nino-effect-using-r/
Heavy rains often do not begin in the state until mid May. However this weekend we have experienced a major rain storm (12 April). Persistent rain fell throughout the night and early morning (14 April). Rerunning the code shows that la Niña effect is still strong.
Of course as any linkages are statistical in nature I do not expect my temporary success at long term weather forecasting to continue.
Las Chimalapas: A reserve waiting to be declared?
Watch the animation below carefully (Click on the map to open the animated GIF). First a blue marble image is presented. Then a map of settlements is overlain (red points) Then the current protected areas of southern Mexico are added (blue shading). Finally the registered plant collections are added.
So, what is that large hole in the centre of the map doing? Does no-one live there? Hasn’t it been declared a reserve? Have any botanical collections been made there?
This is the Chimalapas region, in the isthmus of eastern Oaxaca bordering on Chiapas. The vegetation ranges from lowland rainforest in the Atlantic north to tropical dry forests in the Pacific south Cloud forest and montane pine-oak forest lie in between. Rainfall ranges from 1100 mm to over 3500 mm. The whole area totals more than 600,000 ha. It is still in a good state of preservation. The avifauna is apparently the most diverse for any region of comparable size in the country, totalling at least 464 species in the region as a whole (with more than 300 species in the lowland rainforest) representing 44% of the bird species known from Mexico. Important species present in the region include Harpy Eagle Harpia harpyja and several other large eagles. Yet it has not received protected status and is seldom visited by plant collectors.
Giving protected status to the region would seem on the face of it to to be an automatic “no brainer”. There would be few losers. The resources of this uninhabited region are presently not being exploited. Pressures on the region are ever present, but not intense. The publicity associated with protected status might encourage incipient ecoturism in the area. Various conservation organizations such as the TNC and Conservation International are interested in the region and my group in Ecosur are currently looking at deforestation in the region in detail in order to strengthen the arguments.
However, is the protected area model really the best option for the area? I leave the question open in the hope of receiving some comments.
There are a few points that might be discussed.
1. Are governmental resources available in order to ensure that protected status would indeed prevent activities that negatively impact the biodiversity of las Chimalapas?
2. Could protected status have unexpected consequences, such as encouraging unsustainable activities in the the area?
3. Would protected status for las Chimalapas divert resources from initiatives that could be effectively combating more immediate threats?
4. Would protected status interfere with wider long term economic development of the region, specifically in terms of trans isthmus transportation?
It is quite astonishing to notice that the “hole” in question even stands out at a national level when desert regions are included. The only other large areas without inhabitants in Southern Mexico are around Calakmul.
Floods and forest loss 3
Unfortunately time is short. I have found it more difficult than I expected to make a handful of specific criticisms of the statistics in Bradshaw et al.
I originally aimed to concentrate attention on a few narrow methodological issues when the paper was brought to my attention. I now feel that statistical weaknesses are indicative of a much deeper problem with papers of this sort.
Environmental science should guide environmental policy. Environmental politics should not exercise direct influence on the way scientific evidence is evaluated. If this is allowed to occur there is a constant risk of the underlying science becoming devalued. This has to be resisted. Scientific publishing concerns itself with rigor and objectivity. Scientists do have an obligation to make decision makers more aware of issues such as the wider positive value of forests. There are now many forums available that can allow us to achieve these goals. As individuals we can find always find, develop and strengthen, outlets that permit us to express subjective opinions freely and explicitly. A weblog such as this may even be an example.
However, the traditional scientific peer review process should not be seconded to an environmental cause. It must have a different role. The strict objectivity of peer review gives us the credentials and the confirmed evidence that we can use in order to make a difference when we express our opinions as scientists.
In this particular case the authors’ openly state that they deliberately set out to support decisions that strengthen forest conservation. They write that “for conservation to receive wide political and popular attention and priority, especially in the developing world, there needs to be empirical evidence of nature’s role in supporting human well-being. … For centuries it has been vehemently claimed, and hotly disputed, that forests provide natural protection from floods … Yet, because the claim lacks broad-scale empirical support, the development and implementation of clear flood-mitigation policies regularly stall (FAO & CIFOR, 2005; Calder & Aylward, 2006).” The authors stated intention was thus to correct this deficit. I find this quite uncomfortable reading. Scientific evidence doesn’t fall out of a data set as a result of political necessity to provide it.
Why does the claim of a link between deforestation and flooding “lack empirical support”? I don’t believe that it does. There is ample evidence on which to base a consensus. Flooding, and more specifically, the loss of life and property associated with flooding, is the result of a broad set of factors that combine in complex, case specific ways. These combinations (that could be called the “cause” of flooding) are often quite well understood at the local level when any specific event is investigated. However the “causes” of floods in general can rarely be analyzed successfully by looking for global scale correlations. The number of interactions involved prevents purely statistical analyses from incorporating enough of our current understanding to cast new light on the issue. Contemporary, process based, hydrological models now capture the dynamics of flooding remarkably well. Water does move in predictable ways, even if higher powers, including politicians, do not. There are relatively few deep scientific mysteries to be resolved.
The intrinsic difficulty is still in effectively communicating what is already known to decision makers. Flooding is largely predictable (at least in a probabilistic sense). Nevertheless, poor political decisions are constantly made that ignore both science and common sense. They will continue to be made if decision makers can claim that a scientific consensus has not been reached and then use the resulting confusion as a justification to act on non-scientific agendas.
In some specific circumstances forest cover can indeed mitigate the effects of intense rainfall events. This is especially true in the case of small scale, local events. In other cases flood plains are inundated regardless of whether the upper parts of watersheds are forested. Politicians sometimes use deforestation in order to distract attention from their own poor planning in heavily populated flood plains. Sound environmental management involves both protecting forests and ensuring high quality urban planning with sophisticated risk management in the flood plains. Distracting attention from the complexities of the trade offs involved by setting up false dichotomies based on questionable analysis of evidence is unhelpful. Floods cost lives. We must try to get this right.
Here are a few quick maps based on the authors, data.
- Proportion of deforestation in South America
- Proportion of deforestation in Sub Saharan Africa
- Proportion of deforestation in South Asia
- Legend for absolute deforestation (km2)
- Absolute deforestation (global)
- Legend for relative deforestation (% of 2000 cover lost since 1990)
The authors investigated two statements. (i) flood frequency is correlated with the total forest cover (natural and plantation) and/or (ii) flood frequency is correlated with the total forest cover loss over the period of interest. Only floods caused by heavy or brief torrential rain were included; those caused by typhoons, cyclones, dam breakage and tsunamis were excluded because they represent events that originate independently of landscape characteristics.
It should be clear that total forest cover must be correlated generally with climate. Absolute forest loss is positively correlated with the size of the country with large countries clearly losing more. Proportional forest loss is negatively correlated with the size of the country, with small countries losing more forest. So the authors’ attempted to hold for these effects statistically by including them in an initial model.
As I read the paper, this analysis looks increasingly odd. Instead of concentrating on “offsetting” effects such as total area, population or rainfall they threw them directly into the statistical model. The use of “random effect” to refer to country level soil moisture is also extremely odd. On initial reading I thought that country id must have been used as a random factor in a GLMM that looked at within country effects, sensu hierarchical mixed effects models as used in social science. I could find no clue that the authors had investigated any random effects as I understand the term.
Using the data set that I had previously made available for download in a previous post on this weblog I fit one of the models cited in the paper using R.
mod summary(mod)
Call:
glm(formula = nfloods ~ area + slope + rain + degrad + for2000 +
forloss, family = gaussian(link = “sqrt”), data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-12.570 -4.091 -1.526 2.090 38.183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.377e-01 6.697e-01 1.102 0.27666
area 6.295e-07 3.268e-07 1.926 0.06054 .
slope 1.182e-01 2.022e-01 0.585 0.56171
rain 9.215e-04 3.056e-04 3.015 0.00425 **
degrad 1.925e-06 7.054e-07 2.729 0.00909 **
for2000 -1.732e-06 5.392e-07 -3.211 0.00247 **
forloss 8.676e-05 5.967e-05 1.454 0.15307
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 74.80764)
Null deviance: 15817.0 on 50 degrees of freedom
Residual deviance: 3291.4 on 44 degrees of freedom
(5 observations deleted due to missingness)
AIC: 373.26
In other words, at first sight no significant effect of deforestation at this scale. I am still not sure what the authors’ analysis actually involved but although the (often justifiable) use of information criteria to mediate between models appears sophisticated, and the lack of mention of statistical significance testing is sensible, in this particular setting the contemporary feel to the analysis mainly serves to make it more difficult to penetrate for the uninitiated.
As my major issue remains that of aggregation errors (for example what does median annual rainfall of 616 mm mean for a countrty like Mexico?) I am reluctant to go into further detail.
Connecting Qgis to an Ecosur based POSTGIS server
The internet connection speed from Ecosur is unfortunately extremely slow. We have a limited number of external IPS. In order to set up a temporary trial POSTGIS server I had to commandeer a PC that is also being used for image processing. So there is only limited and intermittent connectivity. It would fail completely if many users connected at the same time.
However I aim to put some of the most useful basic coverages for Chiapas and Mexico on this server as an alternative to passing around shapefiles. One advantage is that everything is held in the same projection using the same datum (EPSG code 4326, PROJ4 +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs
ESRI GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137,298.257223563]],
PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]]
Import to POSTGIS requires clean topology. I have unfortunately found that many of the available layers have problems with this and do not import cleanly. I am trying to fix by cleaning them in GRASS but it is tedious and does not always work. However the layers I do get into the system are then guaranteed to be usable by any software.
The easiest way to connect to the server is to download Qgis (available for windows here http://download.qgis.org/downloads.rhtml - choose a stable version)
Once Qgis is running choose “add a postgis layer”.
Connect to the server with the details shown below (The password is available by contacting me directly to avoid over use and details such as the name of the database may change).
Then layers can be loaded strait into Qgis. For example the 1:25000 soils map for Mexico can be downloaded by choosing the layer shown below.
This can then be saved locally as a shapefile by right clicking on the name of the layer in the panel on the right. I have also had problems with encoding (bad characters) as most of the dbf files are not in UTF8. I am working on this. Again the advantage is that once everything is done there are no future problems.
Floods and forest cover 2: Ecological inference
I have been building up carefully to making a specific comment on Bradshaw et al (2007) because the subject that the paper addresses is too important to risk mistakes.
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2486.2007.01446.x
The first steps I took yesterday were to look at the data available to the authors. In addition to discovering that the data on the floods themselves were of rather variable quality and lacking in geographical resolution I was surprised by some of the values for recent deforestation in the dataset (from http://www.wri.org) included in the study. In this case I was unable to locate the original data on this web site. However I reformated the data in the paper itself into a GIS and made it available through this weblog.
Despite what looks like exaggeration of the extent of Mexican deforestation and thus probably some other countries, my main issue is not with details regarding numbers within the dataset itself. It is with the underlying logic of the whole analysis. This paper seems to be a classic (almost embarrassing), case of the so called “ecological fallacy” at work. This problem is so well known I was actually quite surprised to see a work of this sort published in a journal with an impact factor of 4.3. It will no doubt be widely cited, particularly by authors interested in strengthening arguments in favour of payments for ecological services. This is likely to have a positive impact on forest conservation. The paper is therefore “politically correct” and could be considered a helpful contribution. However the underlying science is certainly open to criticism.
The term “ecological inference” strangely enough does not come from the discipline of ecology. Ecologists do not often use the term in its original sense. It is much more widely used by social scientists. I will set the scene starting with two classic examples from the social sciences cited by the statistician David Freedman (the tech report from which I took the examples is available here freedman549.pdf)
In 19th century Europe, suicide rates apparently were higher in countries that were more heavily Protestant. The inference could be drawn that suicide is promoted by Protestantism itself. Death rates from breast cancer are higher in countries where fat is a larger component of the diet. Fat intake therefore could cause breast cancer. More recently, floods cause more damage in countries that have high rates of recent deforestation, therefore deforestation causes floods.
These are all “ecological inferences,” that is, inferences about individual behavior drawn from data about aggregates. The ecological fallacy arises when it is believed that characteristics of aggregated data extend to the units from which they have been compiled. In the social sciences this can lead to incorrect inference being drawn regarding the characteristics of the individuals themselves. Individuals do of course live within countries. Floods do take place in defined political units (although lack of exclusivity is an additional issue in this case). However the factors that influence the probability of committing suicide, suffering from cancer or experiencing severe flooding are experienced at the level of the individual or the subregion affected by the flood event. Protestant countries in the nineteenth century were clearly different from the Catholic countries in many ways besides religion (confounding). The original data did not associate individual suicides to any particular religious faith directly (aggregation).
The first problem, that of confounding, must be dealt with in any observational study. But the second problem, that exposure and response are measured only for aggregates rather than for individual, is specific to so called “ecological” studies (the word here being used in the social sciences sense). If there is no confounding, the expected difference between effects for groups and effects for individuals is the “aggregation bias”. In real studies with aggregated data there is usually some confounding and some aggregation bias. Sometimes the message extracted from the aggregated data coincides well with that obtained by an analysis of its components. There is no imperative that fallacious conclusions are always going to be drawn. For example RA Fisher quite wrongly argued against the causal link between smoking and lung cancer due to his feeling that inference was impossible due to confounding (I am not sure how far aggregation was an issue in this particular case). In contrast, in the case of the link between fat intake and breast cancer, more recent studies using data from individuals did raise questions regarding the validity of conclusions drawn from aggregated data.
When data is aggregated in any way at all, details that can help to ensure accurate inference will always be lost. The common reason for this in the social sciences is a bureaucratic imperative to compile tables of regional statistics and a common requirement for confidentiality at the individual level. Election results are a classic example.
In my own experience I have often found that appropriate detail gets lost simply because a student naively aggregates data with the intention of producing a dataset that is easy to handle in a spreadsheet. This can be particularly frustrating if it is done before the data are captured in digital form. There is then no way back. Modern analytical techniques using contemporary statistical environments such as R can handle vast numbers of data points with great ease and speed. Raw data can read be strait into R from online repositories and aggregated according to the needs of a specific analysis with a couple of lines. Mixed effects (hierarchical) models and data mining techniques work with data labelled by individual or unit. There is no longer any reason to continue to repeat avoidable aggregation errors. This is a suitable point to provide a link to a previous post on data handling (in Spanish).
In the final posting on this thread I will finally deal with the specific content of the article in more detail.
Floods and forest cover 1
Good decision making should be based on good scientific advice. It is clearly the job of the scientific community to support decision makers whenever they feel able to do so. Floods cause tremendous suffering, not only in terms of loss of life but in the immense disruption caused to lives, livelihoods and economies. Most residents of Southern Mexico have been affected in some way by floods in Tapachula (2005) and Villa Hermosa (2007). Fortunately these severe events caused comparatively few lives to be lost directly. However their consequences to the regional economy were extremely grave. In both cases politicians in the lower part of the watersheds cited deforestation as a major contributory factor.
Conservation of the native forests of Chiapas will not be achieved unless the full value of the services they provide is appreciated by society. Yet the linkages between flooding and deforestation are complex. When a study claims to provide new evidence regarding linkages between deforestation and flooding it must be considered with some care.
Comments on the scientific value of the paper I cited in the previous post should be based only on evidence. I therefore checked most of the websites the authors used to derive their information before analyzing the paper. An important source was the world atlas of flood events at Dartmouth College
http://www.dartmouth.edu/%7Efloods/archiveatlas/index.htm
The data found there can be mapped directly using their web site as shown below.
I also downloaded and reformatted the raw data and moved it into postgis. It can be used as a shapefile through the zipped file here floodevents(change-extension-to-zip).doc
The data is clearly of variable quality. This is par for the course. Prior to 2006 the geographical coordinates were not entered explicitly in this particular data base. All the original raw data from the Dartmouth data (that I downloaded in Excel format) including textual descriptions of the floods can be loaded strait into R using read.table by changing the extension of the following file to txt flood_events.doc. Here are some quick (Qgis) maps derived from the data showing loss of life from the floods that count with coordinates.
There are various points to note before I look at the statistical analysis used in the article in greater detail. Southern and Eastern Asia seems to be particularly flood prone. Most floods on this continent occur around coastlines, on the lowlying floodplains and in the Himalayan watersheds. Africa has comparatively few floods. Flooding in North and Central America tends to be associated with the effects of tropical depressions and hurricanes.
The patterns of flooding at this scale seem to be determined largely by global circulation patterns, but the consequences and probability of being reported will be determined by settlement patterns and economic activity. Floods are rarely reported from the Amazon basin, even though much of the low lying areas of forest (várzea) are seasonally flooded and these floods vary in severity tremendously between years. These sort of details interest forest ecologists trying to trace tree distribution patterns but never make news because they do not affect human lives.
PostGIS and R continued
Last week a colleague sent me an article published in Global Change Biology for comment. I have just got round to looking at it in detail. The abstract is available here (with full text if you have the institutional access.)
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2486.2007.01446.x
The authors claim have identified a correlation at the global scale between forest loss and flooding.
I have a lot of issues with the way the data has been analysed that I intend to discuss here. However one of the positive features of this article is that all the data that used is included as an appendix. This allows anyone to make their own mind up regarding the merits of the analysis.
I will make specific comments in a later post. But I first deal with different technical issue. Having setup a postgis database for Mexico and Central America I realised that I lacked a global country map to help visualise data of this sort. I tried importing the shapefiles that come with Arcview, but with no success. They apparently have errors in the topography. I also tried converting the worldhires data in the R mapdata package, but that also had issues with bad topography (apparently American Samoa was to blame). Finally I found this reply to a R help post from the prolifically helpful Roger Bivand
http://finzi.psych.upenn.edu/R/Rhelp02a/archive/78339.html
I tried
load(url(”http://spatial.nhh.no/R/etc/world_countries.rda”))
library(rgdal)
writeOGR(world_countries, “PG:dbname=mydb”, “world_countries”, “PostgreSQL”)
Bingo. In a couple of minutes I had a rather low resolution world countries map with ISO3166 two letter country codes ready for linkage to any data with a column that uses this international standard. It is now nicely tucked away in my postgis data base ready for use whenever needed. Very satisfying. Notice that I include this code as an example of how R connects to a postgis database once one has been set up. It obviously won’t work if you haven’t gone through the steps to establish your local data base. Here is how on Ubuntu. It is not something I would try under Windows,even though posgresql 8.2 apparently has now been ported to Windows.
I then had to reformat the data from the pdf copy of the article. This was slightly frustrating as it needed a bit of manual work in an open office spreadsheet. I then added the ISO3166 codes. This line (also suggested by Roger Bivand) helped a lot. It would be even better if editors and reviewers insisted on the use of standards when they are available.
d<-data.frame(d,ISO3166=world_countries
$ISO3166[match(toupper(as.character(d$Country)),as.character(world_countries$names))])
The nice thing about a postgis data base is that you can use it for storing all your data, spatial or otherwise, in one convenient place. So using RODBC connectivity the floods data can be incorporated by
library(RODBC)
con<-odbcConnect(”mydb”)
sqlSave(con,d,”floods”)
The only slightly fiddly aspect is that the saved table lacks a primary key. That can be resolved from R with few lines that run sql in postgresql.
odbcQuery(con,”ALTER TABLE floods ADD COLUMN index int4;”)
odbcQuery(con,”UPDATE floods SET index=cast(rownames AS integer);”)
odbcQuery(con,”ALTER TABLE floods ADD CONSTRAINT floods_pkey PRIMARY KEY(\”index\”);”)
odbcQuery(con,”ALTER TABLE floods DROP COLUMN rownames;”)
Then the tables can linked together as a temporary view from within R
sql<-”create view floods2 as select f.*, w.wkb_geometry from floods f,world_countries w where f.ISO3166=w.ISO3166″
odbcQuery(con,sql)
The view can be removed from the database whe it is finished with by
odbcQuery(con,”drop view floods2″)
With the view in place Qgis can be used for the visualisation instead of R, which is really helpful for casual analysis. This convenient interoperability, drawing on the particular strengths of a range of different tools within a common framework (statistical environment, database, GIS visualisation), is one of the great plus points of Open Source software from the point of view of a user. However you do have to discover how to do it first. You don’t get a “wizard” to help.
R does produce very good publication quality maps, but there is rather more overhead involved than when using the simple point and click graphical user interface of QGis I am now finding it much easier to send vector layers strait to postgis and work from there.
The view can also be exported as a shapefile strait from QGis. As the wordpress site only allows me to upload files it thinks are documents I have placed it here with a false doc extension. Don’t try to open it, save the file to disk and then change the extension to .zip. floods2thisisreallyazipfile.doc
















