Over the past 45 years, effect sizes (ESs) – statistics that measure the magnitude of a difference or the degree of an association – have become an ever more common part of research design, analysis and reporting. Their use is central to the following:
(i) the calculation of statistical power, that is, deciding whether a planned study has enough subjects to detect the expected differences with the desired probability;
(ii) the calculations required for a meta-analysis, that is, the aggregation and comparison of findings from different studies; and
(iii) the comparison of findings within a single study.
To understand ES, we can start by considering how to compare the outcomes from two clinical trials for major depression: one trialling drug A against placebo and the other trialling drug B against placebo. Table 1 shows results from two such studies. Both studies report outcomes for the Hamilton Rating Scale for Depression (HRSD) on which lower scores indicate lower levels of depression. The second study also reports the percentage of patients who recovered.
Table 1. Post-treatment outcomes from two clinical trials for major depression [table not reproduced here].
* Hamilton Rating Scale for Depression (HRSD).
† Clinical Global Improvement (CGI) (yes/no improved).
Looking just at the HRSD, the t-tests for the significance of the difference between two means would lead us to conclude from both studies that drug is more effective than placebo. An obvious next question is whether the two drugs differ from each other in efficacy.
One obvious way to try to answer that question is simply to look at the differences between the two studies in their scores on the HRSD: 20 − 16 = 4 vs. 9 − 7 = 2, which points to drug A being better. There is a lot to be said for making comparisons directly using our measures, but there is also considerable reluctance. Researchers often do not know how to interpret their measures – what does a 2-point difference mean? – and so they prefer some statistic that appears to relieve them of having to make such judgments. Even when they are comfortable with their measures, they can be unclear about how to factor in aspects of the data like differences in standard deviations. Finally, there is the difficulty of how to compare results when, as above, one study gives mean differences on the HRSD as well as percentages on the Clinical Global Improvement (CGI) scale as its outcomes.
A tempting approach – less often yielded to these days – is to look at the P values or significance levels from the two t-tests (or whichever test has been carried out). Here, this would lead you to regard drug B as more effective because its P is smaller. The use of P values to draw conclusions like this from statistical evidence is controversial, and while they might say something about the strength (or precision) of the evidence for a difference, there is now a general recognition that they say little about the size of the difference. A demonstration of this is to note that if, for example, one simply increases the sample size for the drug A study, then the P value decreases with no change in the observed difference.
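The point about sample size can be checked with a short numerical sketch. The mean difference of 4 points comes from the drug A study; the common standard deviation of 6 and the equal group sizes are assumptions chosen for illustration, and the p value uses a large-sample normal approximation rather than the t distribution.

```python
import math

def t_and_p(diff, sd, n):
    """Two-sample t statistic for equal group sizes and a common SD,
    with a large-sample normal approximation for the two-sided p value."""
    t = diff / (sd * math.sqrt(2.0 / n))
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return t, p

# Drug A study: a fixed mean difference of 4 HRSD points (20 - 16),
# with an assumed common SD of 6 and varying group sizes.
for n in (25, 100, 400):
    t, p = t_and_p(4.0, 6.0, n)
    print(f"n per group = {n:3d}: t = {t:5.2f}, p ~ {p:.1g}")
```

The observed difference never changes, yet the p value shrinks as n grows, which is why P is a poor proxy for the size of an effect.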
We can gain a graphical insight into our question from Fig. 1, which shows the distributions of scores from the two studies, with the drug B study drawn below and upside-down for clarity. The shaded sections correspond to the overlap between each study’s two treatment arms and show that, after allowing for the differences in the shapes of the distributions, the proportion of overlap is quite similar. If the outcomes for drug and placebo overlap to a similar extent, then, however widely or narrowly spread the scores are, we are inclined to believe that there is little difference between the outcomes of the two studies. This lack of difference is not related to the degree of change from before treatment to after treatment because, if in both studies the patients had started at similar levels of initial severity, then clearly in one study both groups have changed far more than in the other. How can we talk about this overlap?
Fig. 1. Distribution of post-treatment outcome scores on the HRSD from the two clinical trials. Lower scores correspond to better outcomes. Shaded areas indicate the overlap in outcomes for the pair of treatments in each study. Vertical lines correspond to means.
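The similarity in overlap can be quantified. For two normal distributions with equal standard deviations whose means differ by d standardised units, the overlapping proportion works out to 2Φ(−|d|/2), where Φ is the standard normal CDF. A minimal sketch (the function names are ours):

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def overlap(d):
    """Overlapping proportion of two equal-variance normal
    distributions whose means differ by d standard deviations."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)

# Both studies show a standardised difference of about 0.67,
# so roughly three-quarters of the two outcome distributions overlap.
print(round(overlap(0.67), 2))  # 0.74
```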
One tool we have for describing different distributions is standardisation, that is, expressing distances between means in standard deviation units rather than in the original scale. These standardised units allow us to compare more directly (a) the difference between placebo and drug A with (b) the difference between placebo and drug B. A standardised difference between two means is an example of an effect size – although note that not all ESs for two means are standardised differences. In this case, the ES is defined as follows:

ES = (MDRUG − MPLACEBO)/S

where the standardising value S, roughly speaking, is an average of the two standard deviations. When applied to the HRSD in Table 1, we obtain the results in Table 2.
Table 2. Calculation of ESs for HRSD [table not reproduced here].
The ESs labelled g are the same value, −0.67, indicating that there is a sense in which the differences between drug and placebo are identical, a sense related to the overlap of the two distributions. Note that the ES is negative because we calculated it for drug minus placebo: as the drug means are less than the placebo means, the drug mean lies 0.67 standardised units below the placebo mean.
To get a feeling for what this ES means, we look at Fig. 2, which shows a drug group whose mean is 1 ES below the placebo group’s. Using our assumption that the scores have a normal distribution, we can calculate what percentage of each group falls below a particular score. Half the placebo group falls below the mean for that group (μPLACEBO), but 84% of the drug group falls below that mean – 84% of patients on drug do better than the average patient on placebo. If a score of μDRUG (or −1σ, one standard deviation below μPLACEBO) corresponds to ‘recovery’, then 50% of those on drug have recovered vs. 16% of those on placebo.
Fig. 2. Distribution of outcomes expressed in standardised scores. The drug minus placebo ES is 1. Percentages give the proportion of each group whose scores fall below the indicated score.
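The percentages in Fig. 2 follow directly from the standard normal CDF, under the same normality assumption made in the text. A brief check:

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# With an ES of 1, the drug mean sits 1 SD below the placebo mean.
# Share of the drug group scoring below the placebo mean:
print(round(normal_cdf(1.0), 2))   # 0.84
# Share of the placebo group scoring below the drug mean (-1 SD):
print(round(normal_cdf(-1.0), 2))  # 0.16
```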
The calculations for the ES labelled g combine the two standard deviations (see formulae below), and in doing so, we assume that they are two estimates of the same common population value, which in many cases is a reasonable assumption. There are cases, however, where it is not, and the ‘average’ standard deviation is not useful. An example would be where one group has a small standard deviation because of floor effects (patients are about as well as possible) or ceiling effects (patients are all at the top end of severity). In such instances, an ES is better calculated using the standard deviation of either the placebo (control) group or the drug (treatment) group. Typically, the former is used, and we have done that in calculating the ESs labelled Δ in Table 2. The ESs now differ, although not by much because the standard deviations are similar, so the conclusion is similar. We can also make interpretations as in Fig. 2, but we have to remember that the scale (or x-axis) is in units of the placebo group’s standard deviation so that, for example, placebo scores two standard deviations below μPLACEBO (−2σ) no longer correspond to drug scores one standard deviation below μDRUG.
Formulae: The ES reported as g is often called Hedges’s g, which is defined as follows:

g = (M1 − M2)/Sp

where the standardising value Sp is defined as follows:

Sp = √[((N1 − 1)S1² + (N2 − 1)S2²)/(N1 + N2 − 2)]

and where M1 is the mean for one group, S1 its standard deviation, N1 its group size and so on. As g tends to be an overestimate, an adjusted value, g′, is often reported instead:

g′ = g × [1 − 3/(4(N1 + N2) − 9)]
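These formulae translate directly into code. The means below come from the two studies; the common standard deviations (6 and 3) and group sizes (50 per arm) are assumptions chosen so that the pooled SDs reproduce the g of −0.67 discussed in the text.

```python
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Hedges's g: standardised mean difference using the pooled SD Sp."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def adjusted_g(m1, s1, n1, m2, s2, n2):
    """Bias-corrected g' for small samples."""
    g = hedges_g(m1, s1, n1, m2, s2, n2)
    return g * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))

# Drug A study: drug mean 16, placebo mean 20, assumed SD 6, n = 50 per arm
print(round(hedges_g(16, 6, 50, 20, 6, 50), 2))  # -0.67
# Drug B study: drug mean 7, placebo mean 9, assumed SD 3
print(round(hedges_g(7, 3, 50, 9, 3, 50), 2))    # -0.67
```

Note how the two studies, with raw differences of 4 and 2 points, yield the same standardised difference once each is scaled by its own pooled SD.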
We also reported Δ, which is often called Glass’s Δ and which is defined as follows:

Δ = (MDRUG − MPLACEBO)/SPLACEBO

and which can also be adjusted into Δ′ as above.
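Glass’s Δ needs only the control group’s statistics; a minimal sketch (the placebo SD of 6 is an assumption, as before):

```python
def glass_delta(m_treatment, m_control, s_control):
    """Glass's Delta: mean difference standardised by the control
    (placebo) group's standard deviation alone."""
    return (m_treatment - m_control) / s_control

# Drug A study: drug mean 16, placebo mean 20, assumed placebo SD 6
print(round(glass_delta(16, 20, 6), 2))  # -0.67
```

Because Δ ignores the treatment group’s SD, it is the safer choice when a floor or ceiling effect has compressed one group’s spread.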

