Analysing Likert scale satisfaction scores

Analysing Likert scale satisfaction scores

Duncan Golicher


Full code on Rpubs


Although there are many possible ways of scoring a response to a question, the Likert scale has recently become a default option. In particular, the national student satisfaction survey (NSS) is based on questions socred on the Likert scale. NSS scores are used in league tables, and are influential metrics that affect decision making within universities. The relative performance of courses based on these scores can impact strategic planning. It is therefore vital to use rigorous statistical methods when analysing NSS scores to ensure that the evidence they provide is neither misinterpreted nor over interpreted.

NSS scores are based on the following responses to questions such as “Overall, I am satisfied with the quality of the course.”

  1. Definitely disagree
  2. Mostly disagree
  3. Neither agree nor disagree
  4. Mostly agree
  5. Definitely agree

A Google search for the terms “Likert scale”” and “statistical analysis” produces a large number of documents of varying academic reliability. Many are not particularly helpful. Some are misleading and provide unsatisfactory advice regarding appropriate analytical techniques. I will base my suggestions on consideration of very well known and uncontroversial statistical principals rather than citing any “conventional” method that has been specifically suggested for the Likert scale, due to number of inapppriate analytical teechniques that have been applied to this type of data.

Suggestions for finding the right “test” for a null hypothesis of no difference in the responses between courses are not generally helpful. The term “significant” can be misinterpreted by decision makers. It is particularly important to base findings not just on a p-value but to use reliable and interpretable measures of effect size along with statistically justifiable confidence intervals. It is also vital to provide a clear visualisation of results that can be quickly scanned and interpreted.

Before considering the statistical nature of the data it is worth considering the potential for unseen biases in the scoring. If an online form presents the scale as a series of checkboxes to choose between, the response could depend on the order in which the choices are listed. If a slider is used on an online form it may provoke different responses than woud be chosen using check boxes. The wording in which the question is framed may lead to aquiescence bias or social desirability bias. Some forms of wording may deter respondents from choosing extreme responses while others may encourage them. These are all psychological aspects that should be taken into account when the survey is designed.

Assuming that the designers of the survey are aware of these issues, the problem I address concerns the mumerical part of the data analysis.

While a mean can be calculated for any set of numbers, most of the advice found online regarding the Likert scale quite rightly points out that a mean Likert score is difficult to interpret. The responses are not on a simple linear scale. The nature of the scale prevents the calculation of a valid standard deviation. Classical parametric methods based on assumption of normality are very clearly not appropriate when analysing responses to any single question. If the responses to many individual questions are pooled it has been suggested that the central limit theorem kicks in and the resulting pooled scores can be treated as if they follow a Gaussian distribution. However this is not very good advice either, as the assumption is being made that there is independence in the responses. This is rarely justifiable in the case of satisfaction scores. Students who are satisfied regarding one aspect of the course will also tend to be satisfied about other aspects, violating the assumption of independence.

Other suggestions for null hypothesis testing include using non-parametric procedures such as Kruskall Wallis tests. However this does not test a very meaningful hypothesis nor does it provide a basis for comparing effect sizes.

Establishing differences in the response to a question between courses involves a simple bivariate analysis with one numerical variabe (score) and one factor (course) which has as many levels as the number of courses. However despite the apparent simplcity, there are several complications.

I will first illustrate the issues involved by setting up some very artificial data and show the consequences of simplistic analyses before going on to suggesting some informative and statistically robust approaches.

The “marmite effect”

One of the difficult aspects of the Likert scale is the potential for confounding a set of neutral responses with a set of strongly held likes and dislikes. I’ll refer to this as the “marmite effect”. In the case of student satisfaction scores, courses with more challenging material often provoke more extreme “marmite” like responses than courses with blander material. Some forms of analysis may not be able to distinguish between these pattern of results. Others may overemphasise the marmite effect.

To illustrate the issue I’ll simulate some data with the same mean for the response scores. Course 1 is the marmite course. The simulation has set an equal probability of a respondent strongly agreeing or strongly disagreeing, with nothing in between. Course two never provokes any strong responses at all. Course three has a flat, equal probability of choosing any of the five responses. All are centred around a score of three on the Likert scale.

set.seed(3) library(ggplot2) n<-50 x1<-sample(c(1,5), n, replace =TRUE) x2<-sample(c(2,3,4), n, replace =TRUE) x3<-sample(c(1,2,3,4,5), n, replace =TRUE) x<-c(x1,x2,x3) q<-rep(c("Course_1","Course_2","Course_3"),each=n) d<-data.frame(q,x) g0<-ggplot(d,aes(x=x)) g0+geom_bar()+facet_wrap("q")

d %>% group_by(q) %>%summarise(mean=mean(x),median=median(x)) ->dd kable(dd)
q mean median
Course_1 2.84 1
Course_2 3.04 3
Course_3 3.00 3

While such an extreme pattern will never occur in practice, the simulation illustrates the issue. The pattern of responses are clearly quite different, but all have similar mean scores. Statistical tests of central tendency will therefore not distinguish between the courses. There is an argument that this could be a desirable result as it avoids any prejudice against “marmite” courses. However note that the median for the marmite course is one, i.e strongly disagree, even though the probability of choosing either of the extremes was set to be identical. “By chance” over 50% of the answers happened to be stongly disagree rather than strongly agree. The next time a sample is drawn the result may be reversed. So if no statistical analysis was used at all it could be stated that “a majority of the students on the course were very disatisfied”, but next time the course is run the “majority”” could be very satisfied. This sort of effect can also occur when respondents are forced into making a binary choice on a nuanced question. such as the EU referendum.

It might be argued that a fair analysis should completely ignore any marmite effect and just score the three courses equally. However an even better analysis would quantify the effect fairly and transparently while taking into account the way random variability influences any conclusions.


A naive but totally statistically invalid test for differences between the courses would be to use one way analysis of variance. It is easy to run, although it is the wrong analysis to use.

mod<-lm(x~q) anova(mod)
## Analysis of Variance Table ## ## Response: x ## Df Sum Sq Mean Sq F value Pr(>F) ## q 2 1.12 0.5600 0.2475 0.7811 ## Residuals 147 332.64 2.2629

The ANOVA shows no significant differences between the mean scores for the three courses. While it might be argued that this is being fair on the marmite course, it is also very uninformative and misses the obvious difference. It also absolutely unjustifiable statistically.

Although an analysis based on the mean score is rather difficult to interpret, it is still possible to find robust confidence intervals for the mean through bootstrapping rather than using an invalid anova model based on assumptions of normality. The general idea of analysing differences in means as a measure of central tendency should not be totally ruled out as a complementary option in combination with other forms of analysis. We will return to this later.

Kruskall-Wallis test

The Kuskall-Wallis test is often advised for Likert scale data. As it is based on ranks it is technically valid. It is tempting to apply it as there is no strong mathematical objection. The null being tested is that the location parameters of the distribution of the scores are the same in each group. This is sometimes assumed to be a test that the median’s are not significantly different, but this is not quite the correct interpretation. This also shows no significant difference between the three courses. However it is very hard to interpret and doesn’t lead to any clear measure of effect size.

## ## Kruskal-Wallis rank sum test ## ## data: d$x by d$q ## Kruskal-Wallis chi-squared = 0.5704, df = 2, p-value = 0.7519

The overall results for each question are indeed centred in the same position so the test correctly fails to reject the null. However it is clear that the are differences in the pattern of responses that are not being picked up by this valid, but uninformative procedure.

Chi-squared test

One way to deal with the “marmite”” issue is to simplify the data into binary classes and look at the number of responses falling into each class. This is in fact the commonest approach taken when analysing NSS scores. The measure typically used is the proportion of students who are satisfied i.e. giving a score of 3 or above.

d$x1<-1*(d$x>3) ## Take only sores above 3 and convert to ones and zeros tb<-table(d$x1,d$q) ## Tabulate round(prop.table(tb,margin=2)*100,1)
## ## Course_1 Course_2 Course_3 ## 0 54 58 60 ## 1 46 42 40

The data can be quite validly analysed using a chi-squared test, although there is a much better way that I will show later

## ## Pearson's Chi-squared test ## ## data: tb ## X-squared = 0.38154, df = 2, p-value = 0.8263

This again, quite correctly, shows no significant difference between the reponses when scores over 3 are scored as ones and scores below are scored as zeros.

Chi-squared could be used to look at stronger feelings towards the course by changing the splitting rule.

d$x1<-1*(d$x>4) ## Score extremely satisfied as ones tb<-table(d$x1,d$q) round(prop.table(tb,margin=2)*100,1)
## ## Course_1 Course_2 Course_3 ## 0 54 100 82 ## 1 46 0 18
## ## Pearson's Chi-squared test ## ## data: tb ## X-squared = 32.018, df = 2, p-value = 1.115e-07

Now there are very clear differences between courses. The test is highly signficant and it could be re-run using other splitting criteria.

So, ANOVA is statistically invalid, Kruskal-Wallis is hard to interpret and doesn’t lead into the calculation of confidence intervals. An approach based on a binary split is capable of either clearly distinguishing between marmite courses or ignoring the marmite effect, depending on how the data are split. It does provide a fairly clear, interpretable score, measured in terms of “proportion of satisfied students” or “proportion of students holding strong feelings regarding their course”. To be most informative the binary split based approach needs to be run more than once. The Chisq-test establishes statistical signficance, but is not particularly useful when used in this form.

So, what is the general solution?

Varying sample sizes

An even more important issue that the marmite effect that must be taken into account in any analysis is the influence of sample size on the overall score. Courses with low numbers of students are much more likely to have either very high or very low scores as result of chance. Statistical methods are specifically designed to handle the sample size issue, but the right method must still be chosen.

Splitting the data into just two classes leads to a situation in which classic binomial theory may be used to calculate confidence intervals and take into account sample size.

Let’s make up some data again to show the issue more clearly. The simulation sets exactly the same probability of a student providing any one of the five responses on all three courses. However the number of students on the courses varies between 5 and 100.

n1<-5 n2<-20 n3<-100 n<-n1+n2+n3 course<-c(rep("Course_1",n1), rep("Course_2",n2), rep("Course_3",n3)) nss<-sample(c(1,2,3,4,5), n, replace =TRUE) d<-data.frame(course,nss) d$satisfied<-(d$nss>3)*1 d %>% group_by(course) %>% summarise(n=n(),satisfied=sum(satisfied),percent=(100*satisfied/n)) ->dd datatable(dd,autoHideNavigation=TRUE)



course n satisfied percent
1 Course_1 5 3 60
2 Course_2 20 7 35
3 Course_3 100 40 40

Showing 1 to 3 of 3 entries

Note that the data were simulated with exactly the same probabilities. However the percent satisfied on the course with only five students appears to be much higher than the on the other two courses. This is the typical effect of small sample sizes. More extreme results are more likely when the sample size is low.

tb<-table(d$course,d$satisfied) chisq.test(tb)
## ## Pearson's Chi-squared test ## ## data: tb ## X-squared = 1.0417, df = 2, p-value = 0.594

So. the Chi-squared test confirms the lack of any significant differences. However we haven’t measured the effect size yet, nor extracted confidence intervals.

One possibility is to use logistic regression.

mod<-glm(data=d,satisfied~course,family="binomial") anova(mod,test="Chisq")
## Analysis of Deviance Table ## ## Model: binomial, link: logit ## ## Response: satisfied ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 124 168.25 ## course 2 1.0226 122 167.23 0.5997

The overall p-value is almost identical to the ch-squared test, as it should be. The coefficients of the model could be back transformed and combined. However there is an even simpler way.

Using the Binomial test to extract confidence intervals

The binomial test in R by default tests the null hypothesis that the true proportion is not equal to 0.5. This can be set to the proportion over all courses in order to test whether any course is “significantly different” from the general overall satisfaction level across all the courses. The binomial test alse provides confidence intervals through the Clopper and Pearson procedure. This guarantees that the confidence level is at least the confidence level, but does not necessarily give the shortest-length confidence interval.

prop<-sum(d$satisfied)/length(d$satisfied) binom.test(3,5,prop)
## ## Exact binomial test ## ## data: 3 and 5 ## number of successes = 3, number of trials = 5, p-value = 0.3952 ## alternative hypothesis: true probability of success is not equal to 0.4 ## 95 percent confidence interval: ## 0.1466328 0.9472550 ## sample estimates: ## probability of success ## 0.6

We can now use the binomial test to build a function that takes the original vector of Likert scores and returns the percentage satisfied along with upper and lower bounds along with a p value for the significance of differences from a baseline value.

satisfied_ci<-function(x,score=3,baseline=0.4){ satisfied<-(x>score)*1 n<-length(x) p<-sum(satisfied) mid<-round(p/n*100,0) b<-binom.test(p,n,baseline) c(mid,round(b$*100,0),round(b$p.value,3)) }

Tabulating confidence intervals

The results can be tabulated for each group using dplyr.

cut<-3 baseline<-sum(d$nss>cut)/length(d$nss) baseline
## [1] 0.4
d %>% group_by(course)%>% summarise(lwr=satisfied_ci(nss,cut,baseline)[2], med=satisfied_ci(nss,cut,baseline)[1], upr=satisfied_ci(nss,cut,baseline)[3], n=n(), n_sat=sum((nss>3)*1), p_val=satisfied_ci(nss,cut,baseline)[4] )->dd datatable(dd,autoHideNavigation=TRUE,fillContainer=FALSE)



course lwr med upr n n_sat p_val
1 Course_1 15 60 95 5 3 0.395
2 Course_2 15 35 59 20 7 0.82
3 Course_3 30 40 50 100 40 1

Showing 1 to 3 of 3 entries

Illustrating the analysis with a larger simulated data set

To show how this can be used with realistic data for thirty courses that have underlying differences in the pattern of responses I will simulate some data by varying both the number of respondents and the pattern of response.

sim_course<-function(i){ course<-paste("course",i,sep="_") n<-sample(4:30,1) shape<-5 scale<-runif(1,2,5) nss<-round(rbeta(n,shape,scale)*4)+1 d<-data.frame(course=course,nss=nss) d } d<"rbind",lapply(1:30,sim_course))

Data table

An interogable data table can be set up that can be ordered by clicking on the column headings.

cut<-3 baseline<-sum(d$nss>cut)/length(d$nss) baseline
## [1] 0.492986
d %>% group_by(course)%>% summarise(lwr=satisfied_ci(nss,cut,baseline)[2], med=satisfied_ci(nss,cut,baseline)[1], upr=satisfied_ci(nss,cut,baseline)[3], n=n(), n_sat=sum((nss>cut)*1), p_val=satisfied_ci(nss,3,baseline)[4] )->dd datatable(dd,autoHideNavigation=TRUE,fillContainer=FALSE)



course lwr med upr n n_sat p_val
1 course_1 39 69 91 13 9 0.174
2 course_2 19 40 64 20 8 0.504
3 course_3 49 79 95 14 11 0.033
4 course_4 28 58 85 12 7 0.576
5 course_5 13 32 57 19 6 0.168
6 course_6 29 53 76 19 10 0.822
7 course_7 3 22 60 9 2 0.18
8 course_8 16 40 68 15 6 0.608
9 course_9 12 33 62 15 5 0.302
10 course_10 26 48 70 21 10 1

Showing 1 to 10 of 30 entries

Eyballing the table suggests that courses with very high, or very low, differences from the baseline tend to have low mumbers of students. The differences also tend to be non-significant for small courses. So, in a real life situation the interpretation of the raw scores must be approached with due caution. Differences between course scores may be indicative but not definitive evidence of a problem with a course, or may be an indication to consider a course to be outstanding but provide insufficient evidence to suggest that it is radically outperforming. A 95% confidence interval may be rather conservative, and decision makers may prefer a narrower interval, but the statistical principle remains.

Plotting the confidence intervals

A quick and intuitive way of looking at the data is to plot the confidence intervals after ranking the scores.

dd<-dd[order(dd$med),] dd$course = factor(dd$course, levels=dd$course[order(dd$med)], ordered=TRUE) #dd<-subset(dd,dd$med>90) g0<-ggplot(dd,aes(x=course)) g0<-g0+geom_point(aes(y=med),colour="red") g0<-g0+ geom_hline(yintercept=baseline*100, col="green") g1<-g0+geom_errorbar(aes(ymin=lwr,ymax=upr))+coord_flip() library(plotly) g1

Notice that the confidence intervals are asymetrical. A course with 100% recorded satisfaction can still have a statistical probability of a lower score.

This allows courses that are significantly underperforming or overperforming with regard to student satisfaction to be quickly identified. There are typically not many courses that are very clearly significantly different from the baseline, although there may be some indications that courses do need attention even if they do not differ significantly from the baseline at the confidence interval chosen.

Notice that the confidence interval for the top performer (course 17) clips the baseline. The evidence that this course is exceptional is not as compelling as that for course 26, which is ranked at #3. The confidence intervals of very few of the courses don’t include the baseline, which is a rather typical result with real life data. Many of the differences are rather illusory and so can be easily reversed if a different sample of students take exactly the same course.

Changing the cut off level

The analysis can be re-run using any cut off point in order to add depth. I’ll just show the figure this time.

cut<-4 baseline<-sum(d$nss>cut)/length(d$nss) baseline
## [1] 0.05811623
d %>% group_by(course)%>% summarise(lwr=satisfied_ci(nss,cut,baseline)[2], med=satisfied_ci(nss,cut,baseline)[1], upr=satisfied_ci(nss,cut,baseline)[3], n=n(), n_sat=sum((nss>cut)*1), p_val=satisfied_ci(nss,3,baseline)[4] )->dd dd<-dd[order(dd$med),] dd$course = factor(dd$course, levels=dd$course[order(dd$med)], ordered=TRUE) #dd<-subset(dd,dd$med>90) g0<-ggplot(dd,aes(x=course)) g0<-g0+geom_point(aes(y=med),colour="red") g0<-g0+ geom_hline(yintercept=baseline*100, col="green") g1<-g0+geom_errorbar(aes(ymin=lwr,ymax=upr))+coord_flip() library(plotly) g1

Course 3 and course 1 stand out as producing significantly high levels of extreme satisfaction. Most A careful inspection of these courses with respect to other measures may be needed. The same analysis could be applied to dissatisfaction to identify courses with levels that are statistically significantly different to the baseline after accounting for chance effects and course numbers.

Bootstrapping mean scores

Although parametric analysis of differences in mean scores are not valid, and the mean itself may be difficult to interpret it is possible to produce confidence intervals for the mean through bootstrapping. This involves resampling with replaecements from the data. In other words, if a course has only five students who score the course as 1,3,3,4,5 we could take a sample from this at random that could very ocassionally be 5,5,5,5,5 or 1,1,1,1,1 but will typically include a mixture of the values. If repeat the resampling thousand of times and exclude the extreme values which occur very infrequently we can get a bootstrapped confidence interval for the mean by calculating it for all the random samples. This approach will occasionally break down for small samples (for example when all the values are identical) but in general it is quite robust and will never throw up values outside the bounds of the data. So this does provide a potential complementary tool to identify courses with unusual scores.

boot_mean<-function(x){ n<-length(x) x<-replicate(1000,mean(sample(x,n,replace=TRUE))) round(quantile(x,c(0.025,0.5,0.975)),2) } d %>% group_by(course)%>% summarise(n=n(), mean=boot_mean(nss)[2], lwr=boot_mean(nss)[1], upr=boot_mean(nss)[3], )->dd datatable(dd,autoHideNavigation=TRUE,fillContainer=FALSE)



course n mean lwr upr
1 course_1 13 3.85 3.54 4.23
2 course_2 20 3.15 2.8 3.5
3 course_3 14 4.07 3.64 4.43
4 course_4 12 3.58 3.33 3.83
5 course_5 19 3.26 3 3.53
6 course_6 19 3.53 3.26 3.74
7 course_7 9 3 2.56 3.44
8 course_8 15 3.13 2.73 3.53
9 course_9 15 3.13 2.8 3.47
10 course_10 21 3.38 3.05 3.71

Showing 1 to 10 of 30 entries

dd<-dd[order(-dd$mean),] dd$course = factor(dd$course, levels=dd$course[order(dd$mean)], ordered=TRUE) g0<-ggplot(dd,aes(x=course)) g0<-g0+geom_point(aes(y=mean),colour="red") g0<-g0+ geom_hline(yintercept=mean(d$nss), col="green") g1<-g0+geom_errorbar(aes(ymin=lwr,ymax=upr))+coord_flip() g1

This analysis generally picks up the same courses in the simulated data set with confidence intervals that do not overlap the baseline. This is useful, as it suggests that the conclusions do not rely too much on the assumptions, providing a statistically justifiable procedure is being used. The bootstrap does break down in this example as one course threw up a set of identical responses. However, again, in a real life situation some additional evidence would be used when considering the results in detail.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s