One of the most frustrating and time-consuming parts of statistical analysis is shuffling data into a format suitable for analysis. No one enjoys changing data formats; researchers want to get results, finish the task, and move on. Routine reformatting of data is often made difficult by complications that could have been avoided. Students who are not instructed in data management, or who ignore that instruction, hold their data in spreadsheets, a practice that leads to confusion. The reshape package in R not only helps to rescue data lost in overly complex Excel spreadsheets, it turns what could be a boring cut-and-paste operation into an interesting intellectual exercise in its own right.
Students should be taught from the start to collect and hold data in a standardized “long” form, in which there is only one variable for each form of measurement. This resolves most problems and saves hours of tutors' time. However, even if they do, they still need help in reshaping the data to their own needs. This is especially true if they use SPSS: data reformatting in SPSS is complex and confusing, and SPSS lacks powerful generic routines for handling very complex operations.
There are many tricks for handling data in R using tapply, lapply, stack, aggregate, by and a range of other functions. However, R now has an efficient and powerful syntax for data handling provided by Hadley Wickham’s elegant reshape package. I have only just discovered reshape and I am very impressed. Reshaping operations were always far easier in R than in a spreadsheet, but that is not to say they were ever that easy; with reshape they are a lot simpler. I now think I can safely trust my students who use R to handle most of the cases they come across with a little thought and guidance. As an example, here is some data extracted from tables of goals scored in the 2006-2007 and 2007-2008 football seasons.
The number of goals scored by each team is identified by the fifteen-minute period of the match (0-15 = m15, 15-30 = m30, etc.) and the year in which the season started. This is not raw data; that would consist of the exact times at which the goals were scored, identified by team. The data tables available on the web site are in a wide format. There was an issue caused by the fact that the original tables hold goals for and against together; this had to be resolved using strsplit, but that is a different matter that I don't go into here. The reshaped data looks like this. Note how xtable can provide nicely formatted tables for reports.
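Since the table itself is not reproduced here, a minimal sketch of the structure (the team names and goal counts below are invented placeholders, not the real figures):

goals <- data.frame(Team   = c("TeamA", "TeamA", "TeamB"),    # placeholder teams
                    Season = c(2006, 2006, 2007),             # year the season started
                    Time   = factor(c("m15", "m30", "m15")),  # fifteen-minute period
                    Goals  = c(9, 11, 7))                     # invented counts
library(xtable)
xtable(goals)   # formats the data frame as a LaTeX (or HTML) table for reports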
Visualizing data in long format using R
If data is in this standard “long format”, everything is easy. For example, after loading lattice, a box plot is produced simply by using the standard model formula.
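A minimal sketch, assuming the long data frame is called goals with the columns shown above:

library(lattice)
bwplot(Goals ~ Time, data = goals)   # one box of goal counts per fifteen-minute period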
Or the boxplots can be placed the other way around.
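Swapping the two sides of the formula is all it takes:

bwplot(Time ~ Goals, data = goals)   # the same boxplots, drawn horizontally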
The plotmeans function in gplots is very useful for plotting confidence intervals for means, one of the more difficult common operations for students to achieve in R. Notice this is not Hadley Wickham’s ggplot package; it is Greg Warnes' gplots.
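A sketch with the same assumed column names:

library(gplots)
plotmeans(Goals ~ Time, data = goals)   # mean goals per period with 95% confidence intervals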
Modelling data in long format
It is particularly nice to have data in long format because the formula we use to visualize the data corresponds directly to a simple linear model for the data.
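For instance, a sketch of such a model (the exact specification used in the original analysis is not shown here):

goals$Season <- factor(goals$Season)                    # treat season as a categorical predictor
goals.mod <- aov(Goals ~ Time * Season, data = goals)   # main effects plus their interaction
summary(goals.mod)                                      # the ANOVA table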
There is clearly one significant main effect: the time in the match. Neither season nor the interaction is significant.
The effects package can be used to form a very quick graphical picture of the model, and it is an effective alternative to using plotmeans.
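A sketch, reusing the model fitted above:

library(effects)
plot(allEffects(goals.mod))   # effect plots with confidence bands for every term in the model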
Notice that the main message is that more goals are scored before half time and at the end of the match. Some of this may be attributable to a few minutes of injury time being added on, but there is an upward trend in both halves anyway.
The standard “long” format unifies data analysis and visualization in R. Unfortunately, we do often need to switch between this convenient long format and wider formats for presenting data as tables and for some forms of analysis.
How does Hadley Wickham’s package reshape help us to do this?
Full technical details of the package are available here.
To kick-start the process of using reshape I will provide practical examples as templates, although to understand its full power I recommend the original formal documentation.
The logic of the package is first to “melt” the data into a form that is consistent for any type of data. This is similar to the long form of the data, except that one column stores the name of the variable and another the value. You thus always have a number of columns equal to the number of identifying (grouping) variables + 2.
To melt the data you need to specify which variables identify cases and which hold measurements. These can be specified as column names or as column numbers; it is often quicker to use numbers.
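A sketch using the goals data frame assumed above; the two calls are equivalent:

library(reshape)
goals.m <- melt(goals, id.vars = c("Team", "Season", "Time"), measure.vars = "Goals")
goals.m <- melt(goals, id.vars = 1:3, measure.vars = 4)   # the same melt, by column number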
As Hadley notes, getting data into this form is not always easy. However, any properly designed “long” data set will melt like butter; wide data sets are sometimes more challenging. Once you have the data in this standard “molten” form, Hadley Wickham’s formula-based approach takes over and can reshape it into almost any form you could want. There is something quite magical about seeing data turn itself into exactly what you need with just two lines of code, especially if you have had enough previous experience with Excel to understand just how time consuming this can be without reshape to help. Even with tapply, by and aggregate in R, things can often be tricky. Under Windows, data can be moved to and from Excel through the clipboard, so an R window can easily do the work that pivot tables previously managed in a less elegant way.
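On Windows the clipboard can be read and written like a file, so something along these lines moves tables in and out of Excel (result here stands for any data frame you want to copy out):

goals <- read.delim("clipboard")                                  # paste a table copied from Excel
write.table(result, "clipboard", sep = "\t", row.names = FALSE)   # copy a result back to Excel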
The logic of reshape is rather challenging at first; there is no free lunch. It does take a little getting used to, but once mastered it is so powerful that you will never look back. Not only can you turn long tables into wide tables, you can also apply aggregations using the same logic. The trick is that the variables you want as the column names are placed in the second part of a formula. Understanding this, and how to use the two special variables “.” and “…”, is the key to using reshape. The “…” syntax means “all the variables” and “.” means none. Thus if all the variables are used to form column names, a “wide” format results. Look at the examples below.
Moving to a wide format
Reshape the whole table to a wide format using all the variables. Time will form the columns holding the measurements (number of goals). Notice the way … is being used in the formula.
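A sketch using the molten data from above:

goals.wide <- cast(goals.m, ... ~ Time)   # "..." keeps all remaining identifiers (Team, Season) as rows;
                                          # the levels of Time (m15 ... m90) become the columns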
Sum the goals for each team over the two seasons, with Time as the column header. Note that some teams only played in the premier league for one season; reshape can easily produce a simple count of this by further aggregation, and this can be used to subset the data.
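For example (a sketch):

cast(goals.m, Team ~ Time, sum)    # goals per period summed over both seasons
cast(goals.m, Team ~ ., length)    # rows per team: teams with only one season have half as many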
Margins can also be calculated using reshape. This is a nice feature in some cases when you need to display the data. A bad aspect of students' use of Excel is their frequent inclusion of marginal totals within the columns in which the data are held. This breaks a fundamental rule: everything in a column of data should be the product of the same measurement protocol. However, marginal totals are often useful when the data is presented in its final form. That is the right time to calculate them, and it is easy with reshape.
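A sketch of the same aggregation with margins added:

cast(goals.m, Team ~ Time, sum, margins = TRUE)   # appends grand row and column totals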
Moving from wide to long format
This is one of the most complex operations to get right using the traditional stack function, and there is often a need to rename column headings. It works fine with reshape, however.
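Going back is a single melt; a sketch using the wide table created earlier:

goals.long <- melt(goals.wide, id.vars = c("Team", "Season"))
names(goals.long)[names(goals.long) == "variable"] <- "Time"   # rename the generated column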
A simple summary of forest inventory data using reshape
A common task is calculating summed basal areas and counts of individuals from forest inventory data. First let's make up a very short simulated data set with just four species in ten plots. The procedure for reshaping data on 400 species in 10,000 plots is exactly the same and takes no more time, although the resulting table would make a very long appendix! Notice that the species names are first kept as a separate table and then merged to form the data. It is very important always to do this to avoid mistakes; I have often seen students typing species names multiple times, which is clearly a recipe for disaster.
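A sketch of such a simulation; the species names and the distribution used here are arbitrary choices:

set.seed(1)                                                  # make the simulation reproducible
species <- data.frame(sp   = 1:4,
                      name = c("Pinus", "Quercus", "Abies", "Fagus"))   # typed once, in one place
inv <- data.frame(plot = sample(1:10, 200, replace = TRUE),  # ten plots
                  sp   = sample(1:4, 200, replace = TRUE),   # four species
                  dbh  = rlnorm(200, meanlog = 3))           # diameters at breast height, in cm
inv <- merge(inv, species)                                   # names attached by merging, never retyped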
Counts of individuals are simple, as the function “length” counts the number of observations in each cell. Notice that you must not include margins=TRUE if you are going to work with the data as a matrix for multivariate analysis, as margins add row and column totals.
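A sketch, melting the simulated inventory and then counting stems with length:

inv.m <- melt(inv, id.vars = c("plot", "name"), measure.vars = "dbh")
cast(inv.m, plot ~ name, length)   # individuals per species per plot; omit margins = TRUE
                                   # if the result feeds a multivariate analysis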
More interestingly, we can apply a function that calculates and sums basal areas directly to the melted data in the same way.
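Assuming dbh is measured in centimetres, the basal area of a stem in square metres is pi * (dbh / 200)^2, so:

cast(inv.m, plot ~ name, function(x) sum(pi * (x / 200)^2))   # summed basal area per species per plot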