Reports in both scientific journals and the media have questioned whether the true benefits of antidepressant medications have been exaggerated (Goleman, 1995; Fisher & Greenberg, 1997; Horgan, 1998; Kirsch & Sapirstein, 1999), and a recent review of the Food and Drug Administration (FDA) database found that as many as half of antidepressant trials yield negative results (Khan et al, 2002). A major hindrance to establishing antidepressant efficacy is the remarkably high rate of improvement among participants receiving placebo, which has been increasing over the past two decades (Walsh et al, 2002). Factors that have been implicated in the placebo response include the instillation of hope, response expectancies (Kirsch, 1985), motivation to please investigators (Orne, 1969), the therapeutic impact of assessment contact, rater bias and spontaneous improvement (Harrington, 1999). A better understanding of how much each factor contributes would allow a more accurate gauge of the true antidepressant effect and could lead to improved trial designs.
In the present study, we sought to evaluate the therapeutic impact of frequent follow-up assessments. In standard antidepressant trials, participants are usually seen on a weekly basis to assess depression severity, level of functioning and side-effects. Such visits typically last 30 min or more and are conducted by trained research assistants over the course of 6 weeks. The impact of so much contact with a healthcare provider is unknown but could be substantial. Furthermore, this amount of contact is much greater than in routine clinical practice, where two to three 15-min visits for management of medication are the norm (Posternak et al, 2002a). To evaluate the impact of these follow-up assessments, we conducted a meta-analysis of 41 double-blind, placebo-controlled antidepressant trials published over the past two decades. We primarily focused on the impact that follow-up assessments had on the placebo response but also examined their effect on participants receiving active medication.
Sources of data and criteria for review
The collection of studies used here is the same as in our previous meta-analysis, which evaluated the time course of improvement on antidepressant medication and placebo (Posternak & Zimmerman, 2005). These studies were compiled by reviewing the bibliography of the meta-analysis evaluating placebo response rates in antidepressant trials published over the past two decades (Walsh et al, 2002). To augment this database, we also systematically reviewed each article published from January 1992 through December 2001 in six psychiatric journals (American Journal of Psychiatry, Archives of General Psychiatry, British Journal of Psychiatry, Journal of Clinical Psychiatry, Journal of Clinical Psychopharmacology and Psychopharmacology Bulletin).
Studies were included if they: (a) were in English; (b) were published from January 1981 through December 2001; (c) were primarily composed of out-patients with major depressive disorder according to Research Diagnostic Criteria (RDC; Spitzer et al, 1978); (d) had at least 20 participants in the placebo group; (e) randomly assigned participants to receive a putative antidepressant drug or drugs and placebo; (f) reported the total number of participants assigned to the placebo and medication group(s); (g) assessed participants under double-blind conditions; and (h) utilised the Hamilton Rating Scale for Depression (HRSD; Hamilton, 1960) to assess improvement. We excluded studies that did not report mean baseline HRSD scores, did not present weekly or biweekly (every other week) changes in HRSD scores, evaluated agents with unproven antidepressant properties or evaluated accepted antidepressant agents used at subtherapeutic doses, or focused on specific subpopulations of patients such as the elderly. Forty-seven trials met these inclusion criteria and were included in our original meta-analysis. Of these, we excluded six studies (Claghorn et al, 1983; Dominguez et al, 1985; Hormazabal et al, 1985; Amsterdam et al, 1986; Ferguson et al, 1994; Khan, 1995) from the present meta-analysis because they did not conduct outcome assessments at week 6.
For the 41 studies included in the present meta-analysis, three types of follow-up schedule were used: 15 studies (Cohn & Wilcox, 1985; Byerley et al, 1988; Cohn et al, 1989; Lineberry et al, 1990; Reimherr et al, 1990; Smith et al, 1990; Fontaine et al, 1994; Heiligenstein et al, 1994; Wilcox et al, 1994; Bremner, 1995; Claghorn & Lesem, 1995; Fabre et al, 1995; Mendels et al, 1995; Claghorn et al, 1996; Schatzberg, 2000) conducted weekly follow-up assessments over the course of 6 weeks (weekly cohort); 19 studies (Feighner & Boyer, 1989; Versiani et al, 1989; Gelenberg et al, 1990; Claghorn et al, 1992; Cohn & Wilcox, 1992; Fabre, 1992; Kiev, 1992; Rickels et al, 1992; Shrivastava et al, 1992; Smith & Glaudin, 1992; Mendels et al, 1993; Cunningham et al, 1994; Cunningham, 1997; Thase, 1997; Khan et al, 1998; Rudolph et al, 1998; Rudolph & Feiger, 1999; Silverstone & Ravindran, 1999; Stahl, 2000) conducted assessments at weeks 1, 2, 3, 4 and 6 without an assessment at week 5 (skip week 5 cohort); and seven studies (Feighner et al, 1983; Merideth & Feighner, 1983; Rickels et al, 1985; Mendels & Schless, 1986; Rickels et al, 1991; Anonymous, 1994; Laakman et al, 1995) conducted assessments at weeks 1, 2, 4 and 6 without assessments at weeks 3 and 5 (skip weeks 3 and 5 cohort). We utilised these differences in follow-up schedules as a way to focus on the specific therapeutic effects of follow-up assessments.
Establishing reduction in HRSD scores
The method for establishing mean baseline scores and weekly improvement in HRSD scores is the same as in our previous meta-analysis (Posternak & Zimmerman, 2005). Baseline HRSD scores and weekly reductions in HRSD scores were established for each study, and all analyses accounted for differences in sample size between studies. Some studies depicted changes in HRSD scores graphically. In these instances, weekly changes in HRSD scores were obtained by measuring each data-point with rounding to the nearest 0.5. A research assistant who was unaware of the purposes of the study remeasured each data-point. Of the 476 data-points extracted from graphs, 456 (95.8%) were remeasured by the research assistant to within 0.5 points, suggesting that data extraction was performed reliably and without bias.
We hypothesised that follow-up assessments would have a discernible therapeutic effect on placebo response rates. Differences in follow-up schedules allowed us to compare reductions in HRSD scores in cohorts that met on a weekly basis with those that by design skipped 1 or 2 weeks. Our specific hypotheses were: (a) reductions in HRSD scores from week 4 to week 6 will be greater for the weekly cohort compared with the skip week 5 and skip weeks 3 and 5 cohorts; (b) reductions in HRSD scores from week 2 to week 4 will be greater for the weekly cohort and the skip week 5 cohort compared with the skip weeks 3 and 5 cohort; (c) there will be a proportional and cumulative therapeutic effect of having multiple extra assessments; to examine this question, we compared reductions in HRSD scores from week 2 to week 6 in the skip weeks 3 and 5 cohort, the skip week 5 cohort and the weekly cohort; (d) to confirm that placebo effects do not otherwise differ between cohorts, we predicted that reductions in HRSD scores would be comparable between cohorts from baseline through week 2; because we considered this the most direct method of confirming that there are no random differences in placebo response rates, we deemed it unnecessary to control for potential confounding variables such as fixed v. flexible dose design, year of publication, etc.; (e) if follow-up assessments are found to convey a therapeutic effect for participants receiving placebo, we predicted that all of the above findings would be replicated in cohorts receiving antidepressant medication.
Finally, if follow-up assessments convey a non-specific therapeutic effect, we hypothesised that treatment effect sizes would be greater in trials with fewer follow-up assessments. However, only a handful of studies published weekly or end-point standard deviations. Therefore, we were unable to establish effect sizes or confidence intervals.
For participants randomised to placebo, the weekly cohort comprised 941 people from 15 separate studies; the skip week 5 cohort comprised 1449 people drawn from 19 studies and the skip weeks 3 and 5 cohort comprised 673 participants drawn from seven studies. The baseline mean HRSD scores for these three groups were 25.6 (s.d.=1.78), 25.9 (s.d.=1.47) and 24.3 (s.d.=2.53) respectively.
For participants randomised to active medication, the weekly cohort comprised 1507 people from 25 cohorts (some studies included more than one active medication group); the skip week 5 cohort comprised 2284 people from 31 cohorts and the skip weeks 3 and 5 cohort comprised 820 participants from nine cohorts. The baseline HRSD scores for these three groups were 25.6 (s.d.=1.82), 25.9 (s.d.=1.49) and 25.0 (s.d.=2.42) respectively.
Week 5 assessment
From week 4 to week 6, the mean decrease in HRSD scores for cohorts receiving placebo that met at week 5 (the weekly cohort) was 1.52 points. For cohorts that did not meet at week 5 (the skip week 5 and the skip weeks 3 and 5 cohorts), the mean decrease in HRSD scores from week 4 to week 6 was 0.85 points. Thus, participants who returned for an extra follow-up visit at week 5 experienced a 0.67-point greater reduction in HRSD scores over this 2-week period than those who did not have a week 5 visit. This difference represents 44% of the decrease in HRSD scores observed in the weekly cohort over this period.
Week 3 assessment
From week 2 to week 4, the mean decrease in HRSD scores for cohorts receiving placebo that met at week 3 (the weekly cohort and the skip week 5 cohort) was 2.56 points. For cohorts that did not have a scheduled follow-up assessment at week 3 (the skip weeks 3 and 5 cohort), the mean decrease in HRSD scores from week 2 to week 4 was 1.70 points. Thus, participants who returned for an extra follow-up visit at week 3 experienced a 0.86-point greater reduction in HRSD scores over this 2-week period than those who did not have a week 3 follow-up visit. This represents 34% of the decrease in HRSD scores observed in the cohorts that met at week 3 over this period.
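As a quick sanity check on the arithmetic above (the HRSD figures are taken from the text; the helper function and its name are ours, added purely for illustration and not part of the original analysis), the extra reduction attributable to a visit and its share of the total decrease can be reproduced as:

```python
# Sanity check of the visit-effect arithmetic reported in the text.
def visit_effect(drop_with_visit, drop_without_visit):
    """Extra HRSD reduction attributable to the visit, and its share
    (as a rounded percentage) of the reduction seen with the visit."""
    extra = drop_with_visit - drop_without_visit
    return round(extra, 2), round(100 * extra / drop_with_visit)

# Week 5 visit: 1.52 points (weekly cohort) v. 0.85 points without it
print(visit_effect(1.52, 0.85))  # (0.67, 44)

# Week 3 visit: 2.56 points v. 1.70 points without it
print(visit_effect(2.56, 1.70))  # (0.86, 34)
```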
Therapeutic impact of multiple extra assessments
To examine whether there is a cumulative and proportional therapeutic impact of multiple extra assessments, we compared reductions in HRSD scores from week 2 to week 6 in the weekly cohort with reductions in the skip week 5 and skip weeks 3 and 5 cohorts. The first group had four scheduled follow-up assessments, the second group had three and the third group had two. Reductions in HRSD scores were 4.24, 3.33 and 2.49 points respectively. Thus, the additional reduction associated with one extra assessment (skip weeks 3 and 5 cohort v. skip week 5 cohort) was 0.84 HRSD points, whereas that associated with two extra assessments (skip weeks 3 and 5 cohort v. weekly cohort) was 1.75 HRSD points. This suggests that the therapeutic impact of follow-up assessments is cumulative and proportional.
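The cumulative comparison can be checked the same way; the dictionary below simply maps the number of scheduled follow-up assessments between weeks 2 and 6 to the reported HRSD reduction (a minimal sketch, not part of the original analysis):

```python
# Week 2 -> week 6 HRSD reductions (points) keyed by the number of
# scheduled follow-up assessments over that period, from the text.
reductions = {4: 4.24, 3: 3.33, 2: 2.49}

one_extra = round(reductions[3] - reductions[2], 2)  # one extra visit
two_extra = round(reductions[4] - reductions[2], 2)  # two extra visits
print(one_extra, two_extra)  # 0.84 1.75
```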
To evaluate whether placebo effects are otherwise comparable between the cohorts of interest, we compared reductions in HRSD scores from baseline to week 2 between the weekly cohort and the skip week 5 and skip weeks 3 and 5 cohorts. Because all three cohorts received weekly follow-up assessments through week 2, we predicted that reductions in HRSD scores would be similar. The reduction in HRSD scores from baseline to week 2 in the weekly cohort was 5.35 points. In the two cohorts that subsequently skipped one or two follow-up assessments, the reduction in HRSD scores was 5.41 points. Thus, placebo effects were comparable between the cohorts when the frequency of follow-up visits was the same.
Participants receiving active medication
We repeated all the analyses described above for participants receiving active medication. Reduction in HRSD score from week 4 to week 6 for the weekly cohort was 2.35 points compared with 1.38 for cohorts that did not have a week 5 visit (a difference of 0.97 points). Reduction in HRSD score from week 2 to week 4 for cohorts that met at week 3 (the weekly cohort and the skip week 5 cohort) was 3.69 points compared with 2.57 for cohorts that did not have a week 3 visit (a difference of 1.12 points). Reductions in HRSD scores from week 2 to week 6 for the weekly cohort, skip week 5 cohort and skip weeks 3 and 5 cohort were 5.87, 5.05 and 4.29 respectively. One extra assessment visit therefore accounted for a reduction of 0.76 HRSD points, whereas a second extra assessment accounted for an additional 0.82 points. For the control analysis, we again compared reductions in HRSD scores from baseline to week 2 in the weekly cohort with the two cohorts that skipped at least one follow-up assessment. Reductions in HRSD scores were 7.78 and 7.61 HRSD points respectively, again suggesting that treatment effects were comparable in the absence of differences in follow-up schedules.
The ubiquitous and robust placebo response has for years both intrigued and frustrated mood disorder researchers. Although there is general consensus as to which factors are responsible for the placebo response, it remains unclear how much each particular component contributes to the overall effect. One exception to this is the role that spontaneous improvement may play. In a meta-analysis comparing treatment effect sizes for people with depression randomised to placebo with those randomised to no treatment, spontaneous improvement was estimated to constitute about one-third of the placebo response (Kirsch & Sapirstein, 1999). Other investigators have provided independent confirmation of this estimate (Posternak & Zimmerman, 2001; Posternak et al, 2006).
In the present study, we isolated one of the remaining components – the therapeutic impact of follow-up assessments – to determine how much this factor contributes to the remaining two-thirds of the placebo response. We found that scheduling an extra follow-up visit at week 3 was associated with an additional 0.86-point reduction in HRSD scores, whereas scheduling an additional week 5 visit was associated with an additional 0.67-point reduction in HRSD scores. These reductions represent approximately 40% of the placebo response that occurred over their respective time frames. When we examined the cumulative effect of scheduling two additional follow-up visits, we found that the therapeutic impact of each visit was cumulative and proportional. That is, one extra visit was associated with a 0.84-point greater reduction in the HRSD score, whereas a second extra visit was associated with a further 0.91-point reduction. As further illustration of the impact of follow-up assessments on the placebo response, participants receiving placebo who were assessed on a weekly basis experienced an overall drop in HRSD scores of 9.6 points over the course of 6 weeks, whereas those who were assessed only four times experienced only a 7.3-point drop.
Since follow-up assessments had a discernible therapeutic effect for participants receiving placebo, we expected they would also have a discernible and comparable effect for those receiving active medication. Indeed, each of our analyses from the placebo cohorts was replicated for cohorts receiving active medication, as each additional follow-up visit was associated with a further reduction of 0.97–1.12 points in HRSD scores.
Design of meta-analysis
The ideal method for evaluating the therapeutic impact of follow-up assessments on the placebo response would be to randomise participants with depression receiving placebo to different follow-up schedules. Such a study has not been performed to date and most likely never will be. In the present meta-analysis, we have in effect randomised cohorts rather than individuals. Since the methodology of efficacy trials of antidepressants has remained largely unchanged over the years (Thase, 1999), heterogeneity between studies is likely to be minimal: all studies involved out-patients with moderate-to-severe depression who received identical treatment (placebo) over the course of 6 weeks, with the same outcome measure (the HRSD). Where an extra follow-up assessment was conducted, a clear therapeutic effect was associated with that visit, as hypothesised. Although it is possible that this could be attributable to random differences between studies, we would argue that this is extremely unlikely. First, the present meta-analysis included the majority of acute-phase, placebo-controlled antidepressant trials published over the past two decades, and our analyses were therefore based on large sample sizes. Second, improvement on placebo was comparable between all three cohorts during the first 2 weeks of treatment, when follow-up assessment schedules were identical. As this is the most direct method of evaluating random differences in placebo response rates, it would be superfluous to attempt to control for other potential confounding variables such as year of publication, episode duration, comorbidity, etc. Third, all of our findings supporting a clear therapeutic effect of assessment contact were replicated in cohorts receiving active medication.
We would argue that our results are not undermined by our reliance solely on published studies. Publication bias is a concern for many meta-analyses because negative trials often go unpublished, and attempts to establish effect sizes may consequently overestimate treatment benefits. The goal of the present study, however, was to estimate the therapeutic impact of follow-up assessments. Excluding unpublished studies would undermine our results only if the assessment visits in unpublished studies systematically had less therapeutic impact (for example, if raters in unpublished studies were consistently less empathic). Unpublished studies, however, by virtue of having failed to separate drug from placebo, would be expected to have more rather than less robust placebo response rates, and the therapeutic impact of follow-up assessments might, if anything, be more pronounced.
One limitation of our study is that, because few studies published weekly or end-point standard deviations of HRSD scores, we were unable to confirm that differences between cohorts were statistically significant. Although our analyses yielded what appears to be a large and consistent effect from extra follow-up visits, the lack of statistical confirmation warrants caution in interpreting these findings. We also wondered whether the greater therapeutic effect found in cohorts that met more frequently might be a consequence of greater retention rates in these cohorts. In most clinical trials, rating scores for participants who drop out are handled using the last-observation-carried-forward method of analysis. Perhaps participants who do not present on a weekly basis are more likely to drop out and therefore do not have the opportunity to demonstrate improvement. To address this concern, we evaluated completion rates in each of the three cohorts and found no correlation between frequency of visits and completion rates: skip weeks 3 and 5, 58.5% (326 of 557); skip week 5, 62.5% (847 of 1356); weekly, 58.8% (403 of 685). Thus, the therapeutic effect we found does not appear to be a function of improved adherence.
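The completion rates quoted above can be reproduced from the reported counts (a short sketch; the cohort labels and variable names are ours):

```python
# Completion rates by cohort (completers / randomised), from the text.
completion = {
    "skip weeks 3 and 5": (326, 557),
    "skip week 5": (847, 1356),
    "weekly": (403, 685),
}
rates = {name: round(100 * done / total, 1)
         for name, (done, total) in completion.items()}
print(rates)  # {'skip weeks 3 and 5': 58.5, 'skip week 5': 62.5, 'weekly': 58.8}
```

Note that the cohort seen most often (weekly) completed at essentially the same rate as the cohort seen least often, which is what rules out retention as the driver of the effect.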
Design of trials
Considering the relatively modest effect size of FDA-approved antidepressants over placebo, the possibility that side-effects may unmask raters in favour of eliciting drug–placebo differences (Greenberg et al, 1992) and the fact that most negative trials never get published, several investigators have suggested that the benefits of antidepressant medications have been exaggerated over the years (Fisher & Greenberg, 1997; Kirsch & Sapirstein, 1999). Although these arguments are persuasive, we believe an alternative explanation also exists – that the methodology used to elicit and establish antidepressant efficacy is inefficient. As reviewed elsewhere (Posternak et al, 2002b), the methodology used in antidepressant trials evolved largely from traditions established over three decades ago and has never undergone empirical testing. Our results suggest that the frequent and extensive monitoring that occurs in clinical trials confers a significant therapeutic effect on participants receiving placebo (and active medication). High placebo response rates reduce treatment effect sizes and increase the risk that an efficacious agent will be deemed ineffective. Although a comparable therapeutic effect from follow-up visits was found in participants randomised to active medication, reducing an equivalent amount of ‘noise’ in both cohorts would increase the power to detect differences between the active medication and control groups (Cohen, 1988).
Knowing the impact that follow-up assessments have on placebo response rates, the design of antidepressant trials could be modified either by reducing the amount of time devoted to assessing participants in follow-up, reducing the frequency of follow-up assessments, or relying more on off-site raters or interactive computer assessment. Of course, consideration of these changes must be balanced against ethical concerns of having insufficient monitoring over the course of a clinical trial. This would apply both to participants randomised to placebo and to those receiving a putative antidepressant agent, especially if there are concerns regarding the potential for increased suicidal ideation following the initiation of an antidepressant.
Explaining the placebo response
Our results suggest that the follow-up assessment schedules of standard antidepressant efficacy trials convey a significant therapeutic effect for participants receiving placebo, and that these assessment visits account for an estimated 40% of the placebo response. This does not take into account the therapeutic effect of the initial evaluation, which is typically much more extensive than the follow-up assessments and would be expected to convey a larger therapeutic effect. For years, there has been much speculation as to which ingredients comprise the powerful and seemingly magical placebo pill, with some investigators even suggesting that different coloured pills may be associated with different placebo response rates (Jacobs & Nordan, 1979; Buckalew & Coffield, 1982). Our findings suggest that, after accounting for spontaneous improvement, the placebo response in trials of antidepressants stems largely from the attention and care received during the course of the clinical trial.