I have been building up carefully to a specific comment on Bradshaw et al. (2007) because the subject the paper addresses is too important to risk mistakes.
The first steps I took yesterday were to look at the data available to the authors. In addition to discovering that the data on the floods themselves were of rather variable quality and lacking in geographical resolution, I was surprised by some of the values for recent deforestation in the dataset (from http://www.wri.org) used in the study. I was unable to locate the original data on that web site. However, I reformatted the data from the paper itself into a GIS and made it available through this weblog.
Despite what looks like an exaggeration of the extent of Mexican deforestation, and thus probably of some other countries, my main issue is not with details regarding numbers within the dataset itself. It is with the underlying logic of the whole analysis. This paper seems to be a classic (almost embarrassing) case of the so-called “ecological fallacy” at work. The problem is so well known that I was quite surprised to see work of this sort published in a journal with an impact factor of 4.3. It will no doubt be widely cited, particularly by authors interested in strengthening arguments in favour of payments for ecological services. This is likely to have a positive impact on forest conservation. The paper is therefore “politically correct” and could be considered a helpful contribution. However, the underlying science is certainly open to criticism.
The term “ecological inference”, strangely enough, does not come from the discipline of ecology, and ecologists do not often use it in its original sense. It is much more widely used by social scientists. I will set the scene with two classic examples from the social sciences cited by the statistician David Freedman (the tech report from which I took the examples is available here: freedman549.pdf).
In 19th century Europe, suicide rates apparently were higher in countries that were more heavily Protestant. The inference could be drawn that suicide is promoted by Protestantism itself. Death rates from breast cancer are higher in countries where fat is a larger component of the diet. Fat intake therefore could cause breast cancer. More recently, floods cause more damage in countries that have high rates of recent deforestation, therefore deforestation causes floods.
These are all “ecological inferences”: inferences about individual behavior drawn from data about aggregates. The ecological fallacy arises when characteristics of aggregated data are assumed to extend to the units from which they were compiled. In the social sciences this can lead to incorrect inferences being drawn about the individuals themselves. Individuals do, of course, live within countries, and floods do take place in defined political units (although lack of exclusivity is an additional issue in this case). However, the factors that influence the probability of committing suicide, suffering from cancer or experiencing severe flooding are experienced at the level of the individual, or of the subregion affected by the flood event. Protestant countries in the nineteenth century clearly differed from Catholic countries in many ways besides religion (confounding). And the original data did not directly associate individual suicides with any particular religious faith (aggregation).
The first problem, confounding, must be dealt with in any observational study. But the second problem, that exposure and response are measured only for aggregates rather than for individuals, is specific to so-called “ecological” studies (the word here being used in the social-science sense). If there is no confounding, the expected difference between effects for groups and effects for individuals is the “aggregation bias”. In real studies with aggregated data there is usually some confounding and some aggregation bias. Sometimes the message extracted from the aggregated data coincides well with that obtained from an analysis of its components, so fallacious conclusions are not inevitable. For example, R. A. Fisher quite wrongly argued against the causal link between smoking and lung cancer because he felt that confounding made inference impossible (I am not sure how far aggregation was an issue in that particular case). In contrast, in the case of the link between fat intake and breast cancer, more recent studies using data from individuals did raise questions about the validity of conclusions drawn from aggregated data.
When data are aggregated in any way, details that could help to ensure accurate inference are inevitably lost. The common reasons for this in the social sciences are a bureaucratic imperative to compile tables of regional statistics and a routine requirement for confidentiality at the individual level. Election results are a classic example.
In my own experience I have often found that appropriate detail gets lost simply because a student naively aggregates data with the intention of producing a dataset that is easy to handle in a spreadsheet. This can be particularly frustrating if it is done before the data are captured in digital form: there is then no way back. Modern analytical techniques using contemporary statistical environments such as R can handle vast numbers of data points with great ease and speed. Raw data can be read straight into R from online repositories and aggregated according to the needs of a specific analysis in a couple of lines. Mixed-effects (hierarchical) models and data-mining techniques work with data labelled by individual or unit. There is no longer any reason to keep repeating avoidable aggregation errors. This is a suitable point to provide a link to a previous post on data handling (in Spanish).
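A minimal sketch of that workflow, here in Python with only the standard library (the same principle applies in R or any other environment, and the records are invented for illustration): keep the raw per-district rows, and compute whatever aggregate a particular analysis needs on demand, rather than aggregating before the data are captured.

```python
# Invented per-district records, kept in raw (disaggregated) form.
import csv, io
from collections import defaultdict

raw = """country,district,deforestation_pct,flood_damage
MX,Chiapas,12.0,3
MX,Oaxaca,4.0,1
BR,Para,20.0,5
BR,Acre,6.0,2
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Aggregate on demand -- here, country means -- in a couple of lines,
# while the individual records stay available for finer-grained analysis.
by_country = defaultdict(list)
for r in rows:
    by_country[r["country"]].append(float(r["deforestation_pct"]))
means = {c: sum(v) / len(v) for c, v in by_country.items()}
print(means)  # {'MX': 8.0, 'BR': 13.0}
```

Because the aggregation happens at analysis time, a different question (district-level models, mixed-effects fits by country, and so on) can be answered from the same raw rows without any loss of detail.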
In the final post in this thread I will deal with the specific content of the article in more detail.