Duncan Golicher’s weblog

Aggregating time series in R: The Iraq body count

Posted in Current affairs, Personal and family, R scripts, Uncategorized by Duncan Golicher on March 7th, 2008

Yesterday I was asked whether R could be used to analyse time series. The answer is of course it can. R is used extensively in the financial sector for analysing complex time series such as stock prices. I have already included an example using R in the context of climate variability (El Niño). One challenge is that there are a lot of different ways of working with time series and representing dates. Aggregation can be rather tricky. The zoo package is one of the most powerful tools for working with time series, but it is not always simple to use. I still haven’t got my head around all the different ways to achieve results.

So here is an example. The code first reads in the data from an online database, then uses tapply to sum the number of casualties per day. Zoo is then used to aggregate by month and the total is plotted. If anyone reading this can suggest better ways to do this or add a more sophisticated analysis I would like to know.

(Open this document if the code doesn’t work due to problems with quotation marks bodycount.doc)

Or try this  …

source(url(”http://duncanjg.files.wordpress.com/2008/03/bodycount.doc”)) ,

again you might need to retype the quotation marks)

library(zoo)

library(date)

a<-”http://www.iraqbodycount.org/database/download/ibc-incidents”
d<-read.csv(url(a),skip=11,header=T)
a<-tapply(d$Reported.Minimum,d$Start.Date,sum)
x <- zoo(as.vector(a), as.POSIXct(as.date(row.names(a))))
f<-function(x)as.POSIXct(as.yearmon(x))
Deaths<-aggregate(x,f, sum)
par(bg=”lightgrey”)
plot(Deaths,lwd=3,col=2)

deaths.png

However you get there, the result is shocking. These are documented civilian casualties and I chose the lower estimate.

Even if the trend was initially downwards after the start of the surge, it still only bottomed out at around the level it began at when the “mopping up” operations were taking place in the first few months after the invasion. The time series taken from the compiled online data base stops in January 2008, so it doesn’t take into account the recent renewal of violence. Today, Thursday 6 March there were 86 civilian dead. Two bomb attacks killed 68 in Baghdad alone. Apparently this is not an isolated incident. The trend is sadly upwards again.

Joseph Stiglitz has estimated the price of the war at 3 trillion dollars. I wasn’t even sure what a trillion was until he used the figure. It has twelve zeros, in other words a thousand billion. That works out at 30 million dollars for each dead Iraqi civilian or enough to make ever one of these people as rich as Bill Gates. No further comment, apart from a heartfelt request to visit the site of those who have worked so hard to compile this important data set and re-run the code periodically to check the updated figures.

As an addition, I was saddened to hear that Harvard professor Samantha Power has resigned from Obama’s campaign team this week, apparently for speaking too openly on this, among other, issues. I was extremely impressed by her thoughtful, yet emotional, contribution to BBC radio’s start the week in which she talked of the biography she has written on Sergio Vieira de Mello, a heroic figure who constantly impressed me in every interview I heard with him up to his tragic death in Iraq. The recent news suggests that honest, open minded expression by academics is still considered to be a liability, even for apparently honest, open minded, politicians. This is a link to the Hard Talk interview.

A forest inventory problem

Posted in R scripts by Duncan Golicher on February 13th, 2008

Sometimes actions that are easy using “group by” in SQL aren’t obvious in R or appear to need too many lines of code using “tapply”, “aggregate” or “by” in R.

Splitting and substituting back into the original data frame can work neatly in some cases.

Take this example that is also mentioned in the document available here. A study measures the diameter at breast height and the heights of trees. The trees may have multiple stems. The researcher wants to work with the total sum of the basal area per individual, the maximum height of the tallest stem and the maximum diameter of the stems.

A trial data set can be simulated with.

cuad_Id<-sort(sample(1:180,replace=T,1000))
esp_Id<-sample(1:10,replace=T,1000)
ind_Id<-1:1000
dap<-round(rnorm(1000,30,5),1)
alt<-round(dap/2+rnorm(1000,0,2),2)
tallos<-data.frame(cuad_Id,ind_Id,esp_Id,dap,alt)

Add multiple stems per individual.

tallos2<-tallos[sample(1:1000,500,rep=T),]
tallos2$dap<-round(rnorm(500,30,5),1)
tallos2$alt<-round(tallos2$dap/2+rnorm(500,0,2),2)

tallos<-rbind(tallos,tallos2)
tallos<-tallos[order(tallos$ind_Id),]
row.names(tallos)<-1:1500

So the simulated data looks like this …

fiig141.png

The basal area (ab) in m² per stem can be calculated using ..

tallos<-data.frame(tallos,ab=pi*(tallos$dap/200)^2)

To achieve the result substitute the sum of the basal area for each individual using split and tapply.

split(tallos$ab,tallos$ind_Id)<-tapply(tallos$ab,tallos$ind_Id,sum)

Use the same method to substitute the maximum heights and diameters

split(tallos$alt,tallos$ind_Id)<-tapply(tallos$alt,tallos$ind_Id,max)
split(tallos$dap,tallos$ind_Id)<-tapply(tallos$dap,tallos$ind_Id,max)

Now just remove duplicated rows.

tallos<-tallos[!duplicated(tallos),]

Done.

Tagged with: , , ,