Randomized controlled trials (RCTs) have been around since the 1940s, but the design’s prestige gained a big boost in the 1990s, thanks to the Women’s Health Initiative. Under the aegis of the National Institutes of Health (NIH) in the United States, the Women’s Health Initiative included three clinical trials, alongside one observational study. The goal was to learn about risk factors for cardiovascular disease and cancer, among other matters.
Observational evidence had suggested that a particular hormone replacement therapy – estrogen plus progestin – provided women with strong protection against heart disease and cancer. This was of course good news: when the Women’s Health Initiative RCT was initiated, an estimated 16 million US women were using hormone replacement therapy, so any protection against heart disease and cancer would be correspondingly widespread. However, there were concerns that women who took hormone replacement therapy differed in several ways from women who did not. If so, perhaps it was not hormone replacement therapy itself that guarded these women against serious disease but rather other characteristics of the sorts of women who received the treatment. To figure out what was responsible for the differential health outcomes, researchers designed and executed an RCT.Reference Vaughan, Espeland and Snively1
In the clinical trial, women who had a uterus were randomly assigned to the estrogen–progestin therapy combination or to a placebo group. At the same time, another group of women, who had had their uteruses removed, were independently randomized to estrogen alone or to placebo. Contrary to expectation, women randomly assigned to receive hormone replacement therapy, and particularly the estrogen–progestin combination, were slightly more likely than women assigned to take a placebo to be diagnosed with heart disease or cancer. The effect of the result on practice was dramatic. Within a few years of the trial, there was a sudden drop in the number of women using hormone replacement therapy.
The Women’s Health Initiative study seemingly proved the advantage of RCTs, ushering in a new era of confidence in the methodology. By now, this advantage should be familiar: RCTs ensure that participants in experimental and control groups are comparable, or that they differ only by chance, so that investigators can isolate causal relationships from confounding factors. After all, confounding factors are legion; usually, there are many possible explanations for why a measure of health outcomes gets better or worse. When people become ill, they often get better on their own, and it can be difficult to determine whether they improved because of a treatment or whether they would have healed naturally. It is therefore essential, at least in some research contexts, to control for such confounding factors.
While randomization is a foundation of RCTs, they may be designed with several other anti-bias measures in mind. For instance, many RCTs are double blinded. This is a valuable anti-bias tool because beliefs about treatment benefits or harms might affect the way a research investigator treats participants or assembles data. This is known as experimenter bias. Similarly, study participants are affected by their beliefs about the treatments they are receiving, which may influence their adherence to treatment. This is known as subject bias. Blinding both the investigators and the study participants (double blinding) helps reduce these biases.
Another bulwark against bias is the systematic data-collection protocol introduced in Chapter 2. Such a protocol reduces bias by assuring that information about participants is gathered using a standardized set of rules and on a well-defined schedule. In contrast, records from clinical care can be biased because the decision to see a provider is influenced by a variety of factors, including the patient’s levels of wellness. It is more likely that someone will make the effort to be evaluated and tested when they are concerned about a health condition. The structured protocols in RCTs reduce the influence of this sort of bias.
Is it any wonder, then, that advocates for evidence-based medicine have described a hierarchy of evidence (Figure 5.1) with RCTs and reviews of RCTs at the top level – considered to be the study designs that provide the highest-quality evidence?Reference Yetley, MacFarlane and Greene-Finestone2 As we move down the evidence hierarchy, we find cohort (observational) studies, case-control studies, surveys, case reports, mechanistic studies, and finally editorials, which are seen as inviting the greatest degree of bias. It is assumed that the RCT stands alone as maximally bias-free.

Figure 5.1 Levels of evidence from highest to lowest. Methods at the top are assumed to have less bias; as we go down the list, potential bias increases and is taken as an index of lower quality. Randomized clinical trials (RCTs) are assumed to be free of bias, while aggregations of results from multiple RCTs are considered the highest level of evidence.
The same confidence indicated by the hierarchy of evidence is clear in the translation from research to practice, a process that overwhelmingly favors RCT findings. For example, the United States Preventive Services Task Force (USPSTF), the most influential developer of clinical guidelines in the United States, primarily relies on RCT findings in endorsing the preventive-services guidance used in primary care medicine. Typically, only procedures evaluated using the RCT method get high grades from the USPSTF, which dramatically influences uptake. Favored procedures, such as screening colonoscopies, mammograms, and pap smears, may be made available to US citizens without insurance copayments. Indeed, billions of dollars are spent or withheld on the basis of USPSTF recommendations. Similarly, the Food and Drug Administration (FDA) typically requires RCT evidence for approval of pharmaceutical products.
With so much at stake, it is important that RCTs be as bias-free as decision-makers believe. Unfortunately, they are not. While the RCT is an excellent method for establishing causation under controlled circumstances, it is absolutely subject to bias – sometimes bias that skews results to the point where they are clinically useless and even cause unwarranted harm. In this chapter, we concentrate on two key sources of bias common to RCTs. First is the challenging problem of participant selection. We have noted this issue already, but here dig into it more deeply. We argue that the trial samples in influential RCTs are often systematically different from patients seen in clinical practice. The result is a disconnect between internal and external validity: A careful clinical trial will likely produce internally valid results, but these results may not generalize to the broader population.
It is essential to recognize that the nongeneralizability resulting from restrictive participant selection is not some minor issue, nor can it be addressed through statistical adjustments. To take but one example – and we will detail several more in this chapter – a systematic review of studies concerning interventions for people with alcohol, tobacco, and illicit drug use disorders estimated that between 64 and 96 percent of potential participants were excluded from research studies. Potential participants were excluded because of diagnoses of psychiatric disorders, medical problems, estimated low likelihood of compliance, youth, old age, social instability, geographic distance from the relevant treatment facility, low educational attainment, difficulties with the legal system such as having been arrested, financial instability, and severity of substance-use problems. In other words, the majority of people who volunteered to participate in studies were turned away, and for reasons closely associated with substance abuse. These exclusion policies assured that the trial samples were decidedly not representative of those in the broader population with substance abuse disorders.Reference Moberg and Humphreys3
Bias resulting from participant selection may be incidental – a byproduct of well-intentioned efforts to ensure internal validity – or deliberate. The same is true of the second source of bias we focus on, which lies in the selection of a trial’s control condition. We argue that control and placebo groups are sometimes selected with the intention of emphasizing the effectiveness of a treatment. Shockingly, and in clear contradiction of ethics protocols demanding transparency, investigators often hide the composition of placebos. Placebos are frequently not inert substances, and they may be designed with the intention of making treatments seem harmless or especially effective. Specifically, treatment side effects may be made to appear no worse than those of the placebo, because the placebo could be designed to cause symptoms. Then too, placebos crafted with the best of intentions can lead to significant errors in data interpretation – errors that might be avoided if researchers took greater care. We will return to these issues throughout this chapter.
Internal and External Validity
Assessing bias in RCTs requires that we consider the internal and external validity of the research design. Internal validity addresses whether the design of the study and the analysis of data address the research questions without systematic bias. External validity considers whether the findings of the research study can be applied meaningfully beyond the research context – especially for purposes of real-world clinical care.
Threats to Internal Validity
Most threats to the internal validity of clinical trials were well understood more than 60 years ago. A classic text by Donald Campbell and Julian Stanley, first published in 1963,Reference Campbell and Stanley4 offered a catalogue of such threats. Fortunately, most of these can be controlled by randomization.
When thinking about threats to internal validity, we have in mind nonobvious influences on health. Trialists need not worry too much about results skewed by obvious causes, such as a pandemic, a bad winter, or a military conflict. It is usually easy to ascribe outcomes to events like these, the effects of which should also be roughly evenly distributed across a study sample. A major earthquake that occurs six months into a study might have big effects on health outcomes, yet the consequences of the earthquake should be equally likely in those randomly assigned to a treatment or control condition. So the earthquake is not a threat to internal validity.
But most often events that could affect health are more subtle. When treated and control groups are assessed, these known or unknown events serve as alternative explanations for differences in outcome. Randomization, along with consistent follow-up schedules, is the best method for ruling out these alternative explanations.
Campbell and Stanley also discuss threats to internal validity from maturation effects, which occur as a result of developmental change. In studies of children especially, changes in health outcomes or cognitive functioning may result from ordinary developmental processes. Randomization comes to the rescue: When children are randomly assigned to treatment or control conditions, maturation effects are evenly distributed across the two conditions and therefore do not bias estimates of treatment effects.
Instrumentation effects can also be mitigated through randomization. These effects are consequences of the act of measurement itself. For example, when a study participant is given a blood pressure test in the course of a trial, the test might signal to the participant that they have a health problem that needs attention. That feedback by itself might initiate actions that remedy the problem, complicating study findings. But randomization makes these effects of instrumentation equally likely in the treated and control groups.
As for factors that are not controlled by randomization, Campbell and Stanley describe biases that might creep into a study. For instance, investigators can skew results by intervening in a study to counteract what may appear to be negative impacts of participating in a placebo group or to help patients who seem most ill. After all, clinical investigators are often clinical practitioners as well, and ideally are people who care about the health of others. Campbell and Stanley call this practice “compensatory equalization.”
In summary, there are a variety of threats to internal validity. Yet most of them can be avoided using the time-tested technique of randomization. But randomization does not fix threats to external validity.
Threats to External Validity
Threats to external validity usually result from the nonrepresentativeness of RCTs. The fact is that the typical RCT looks very little like what happens in clinical practice. In trials, data are collected at regular intervals, typically using the latest, highly validated outcome measures. Trial participants are usually paid much more attention than patients in clinical practice and are rarely allowed to miss appointments.
For example, a treatment for type 2 diabetes might be evaluated in an RCT that provides extensive support for adherence to medicine, getting regular exercise, and modifying diet. Providing these services is expensive and time consuming and requires a great deal of commitment on the part of the patient. In the context of an RCT, though, the burden is relatively light: All of this professional attention, and the treatment itself, typically come without charge and right on time. In contrast, real-world practice can be chaotic. Encounters between patients and providers may be on a regular schedule but more often are initiated by the patient. Support services are less accessible, and outcome measurements occur erratically. Protocols for managing patients can vary substantially across different health care organizations and practices, so that care will be inconsistent.
For all these reasons, treatments that appear to work well when evaluated in carefully controlled trials often achieve less impressive results when applied in standard care scenarios.Reference Chambers, Glasgow and Stange5 And none of this addresses the most common threat to external validity: the nonrepresentativeness of the study population (Box 5.1).
One recent analysis showed that more than 80 percent of 187 phase III cancer clinical trials published between 2007 and 2017 failed to show meaningful improvements in survival.Reference Shen, Ferro, Xu, Kramer, Patell and Kazi6 Overall survival was the primary endpoint in all 187 trials. In 131 of the trials, the effects of treatment were not statistically significant. Among the 56 trials that found a statistically significant benefit, 33 (59 percent) failed to meet the criterion of having an observed benefit that was equal to or greater than expected. Indeed, statistically significant improvements in life duration tend to be modest. One review of treatments for solid tumors reported that, between 2002 and 2014, improvements in survival averaged only 2.1 months.Reference Fojo, Mailankody and Lo7
The results for overall survival stood in contrast to studies that reported progression-free survival (PFS) as the primary endpoint. PFS is the length of time during and after cancer treatment that a patient remains alive without the disease getting worse.Reference Walia, Tuia and Prasad8 Studies looking at PFS showed statistically significant benefits in 134 of the 208 studies (64.4 percent).Reference Shen, Ferro, Xu, Kramer, Patell and Kazi9 This is a suggestive disconnect. A treatment that appears successful according to one outcome measure may appear much less successful according to a different measure. When a study fails on the measure of overall survival – typically the measure that matters most to patients – but succeeds on others, this is a sign that the reported results are not focused on patient interests.
Notably, prior to the year 2000, large US RCTs usually did report that the treatment “worked.” But after that point, investigators were required to prospectively register trials at ClinicalTrials.gov. This has made it much more difficult to retrospectively focus on whatever outcome variable demonstrated benefit. After the requirement was imposed, the number of trials posting positive results substantially declined.Reference Kaplan and Irvin10
Representativeness of Clinical Trials and Generalizability of Results
When treatments have been shown to be effective, it is often assumed that they are effective for most if not all people. As scientists, we know this is not the case: Efficacy trials show only that treatments can work under certain highly controlled conditions. We should keep that in mind when considering the usefulness of trials. We also should inhabit the patient’s perspective: The patient wants to know whether the treatment works for someone like themselves. If you discovered that the people tested in clinical trials were quite different from yourself, would that reduce your confidence in a treatment those trials recommend?
Generalizability of research results has received surprisingly little attention in the clinical trials literature. Of late, there has been more emphasis on recruiting people from traditionally underrepresented ethnic and minority groups, which is an important step toward ensuring representativeness. But such strategies cannot compensate for a study sample designed to present a treatment in the best possible light.
Extensive exclusion criteria and narrow inclusion criteria should have you asking questions about the external validity of studies. If you are a healthcare provider, would you turn away a patient seeking care for high blood pressure if they also have diabetes? Would you turn away a patient who lives more than 6 miles from the medical center? These are common exclusion criteria in major clinical trials, yet practicing physicians rarely have exclusion criteria. Certainly, they may refer patients to specialists, but this is hardly the same thing. A cardiologist may refer a patient with a skin lesion to a dermatologist, but if the patient also has heart and circulatory system problems, the cardiologist will almost certainly attempt to treat these conditions.
In contrast, the selection criteria in clinical trials routinely result in the exclusion of as many as 95 percent of potential participants. Thus a study of new treatments for high blood pressure might exclude people who have diabetes in addition to elevated blood pressure. The rationale would be that the investigators want to estimate the effects of the treatment on blood pressure alone, so they want a “clean” treatment population without confounding variables. That means participants with elevated blood pressure but normal values on other cardiovascular risk factors. The real world is not clean, though. Cardiovascular risk factors are highly correlated; there are very few people who have only one risk factor, and their numbers decline with age.Reference Haug, Deischinger, Gyimesi, Kautzky-Willer, Thurner and Klimek11 Choosing study participants from this select group may assure that the trial results will not generalize to the larger population that has multiple risk factors.
Even when researchers have no interest in skewing results, structural factors make it highly likely that their samples will be unlike the wider population. Consider that academic medical centers remain a common source of participants for published clinical research. Some years ago, Larry Green and colleagues considered the representativeness of academic medical center patients.Reference Green, Fryer, Yawn, Lanier and Dovey12 Green and coauthors argued that if we follow 1,000 people over a short period of time, 800 will report a symptom, 327 will consider getting care, 217 will visit a physician’s office, 113 will see a primary care doctor, 65 will have a consultation from a complementary or alternative medical care provider, 21 will visit a hospital outpatient clinic, 14 will receive home health care, 13 will be seen in an emergency department, 8 will be hospitalized, and 1 will be hospitalized in an academic medical center. In other words, the odds of being a patient hospitalized in an academic medical center are very small. How can we be confident that studies drawing patients from academic medical centers generalize to the 99.9 percent of the population not treated in such institutions? Unfortunately, we can’t.
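To make the narrowing concrete, the cascade can be restated as proportions of the original 1,000 people. The sketch below is our own illustration, using the figures reported by Green and colleagues:

```python
# Ecology-of-medical-care cascade (Green et al.), counts per 1,000 people per month.
cascade = {
    "report a symptom": 800,
    "consider getting care": 327,
    "visit a physician's office": 217,
    "see a primary care doctor": 113,
    "are hospitalized": 8,
    "are hospitalized in an academic medical center": 1,
}

population = 1_000
for event, count in cascade.items():
    print(f"{count / population:6.1%} {event}")

# Share of the population NOT hospitalized in an academic medical center:
share_not_academic = 1 - cascade["are hospitalized in an academic medical center"] / population
print(f"{share_not_academic:.1%} are never seen in an academic medical center")
```

Each step drops sharply, and the final step – the one from which much published research draws its participants – captures only 0.1 percent of the population.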
Countless examples demonstrate that participants in randomized clinical trials are systematically different from those seen in routine clinical care. One powerful demonstration comes from a systematic review of participants in trials evaluating acetylcholinesterase inhibitors, a medication for Alzheimer’s disease. Investigators considered 16 articles that reported the ages of study participants. For comparison, they used a nationwide cohort of people who had been diagnosed with Alzheimer’s disease in Finland. Figure 5.2 shows the mean, 5th, and 95th percentiles for age of participants across the 16 trials. Averaged across studies, the mean age of participants was 73.9 years. Yet the mean age of people with a new diagnosis of Alzheimer’s disease in the general population was 79.7 years. On average, participants in the RCTs were 5.8 years younger than those in the Alzheimer’s disease population. Further, the proportion of study participants under the age of 65 was four times the proportion of the general Alzheimer’s population under 65.Reference Leinonen, Koponen and Hartikainen13

Figure 5.2 Mean age of RCT participants and reference population with Alzheimer’s disease. Forest plot from Leinonen et al.Reference Leinonen, Koponen and Hartikainen13
Other literature reviews of studies evaluating treatments for Alzheimer’s disease have also found that study participants differ from the average person affected by the disease. One evaluation considered the representativeness of clinical trial participants for studies of aducanumab, a drug that, at the time it was evaluated in clinical trials, was believed to have strong potential for reducing the loss of cognitive functioning. Aducanumab became controversial because the US FDA approved it against the advice of its own advisory committee. In the run-up to the evaluation, there was pressure on Biogen, the company that made the drug, to fill its portfolio with positive studies. A review of the patient-recruitment process revealed that the selection criteria applied across Biogen’s studies would have eliminated 92 percent of US Medicare beneficiaries who had Alzheimer’s disease or related dementias. Further, 85 percent of people with mild cognitive impairment would have been eliminated because of their age or comorbid conditions. Nevertheless, the FDA approved aducanumab with labeling that would make the drug available to substantial numbers of people who would have been excluded from the clinical trials.Reference Anderson, Ayanian, Souza and Landon14 A footnote: once on the market, aducanumab never became a commonly used drug, and it was withdrawn from the market in January 2024.
A similar example comes from studies of heart failure. One systematic review considered reasons why volunteers had been turned away from studies of heart-failure medications. Heart failure is a condition that primarily affects older adults with shorter than average life expectancies. Yet 25.5 percent of volunteers were excluded because they were too old, and 36.3 percent were turned away because they were judged to have short life expectancies. Fully 80 percent of potential study participants were excluded because they had other diseases in addition to heart failure.Reference Cherubini, Oristrell and Pla15 In practice, it is quite rare that a person is diagnosed only with heart failure. Here, exclusion criteria assured that the study sample was atypical.
Yet another example comes from the Systolic Blood Pressure Intervention Trial (SPRINT), one of the most influential clinical trials in cardiovascular disease prevention and among the most influential trials of any kind published in the last decade. The trial was designed to investigate the threshold for initiating treatment for high blood pressure. Before the SPRINT trial, most guidelines considered the threshold for treating high systolic blood pressure to be 140 mmHg. US guideline committees, however, leaned toward lowering the threshold to 130 mmHg. The choice of a threshold is important because most biological variables, including systolic blood pressure, are approximately normally distributed in the population. Choosing a treatment threshold closer to the center of the distribution – as the US committees were considering – means that a substantial portion of the population will become eligible for the intervention. In other words, setting a threshold nearer to the center of the distribution increases the number of “patients” substantially.Reference Kaplan16
The results of the SPRINT trial would help determine where the threshold was set, influencing care decisions for potentially all American adults – 219.4 million people, when the study was conducted. This was despite the fact that the selection criteria for SPRINT resulted in mass exclusions. The trial excluded people under the age of 50, so the potential pool of participants was reduced from 219.4 million to 95.1 million. Then the investigators reduced the potential participant pool further by excluding 37.3 million people with systolic blood pressure less than 130 mmHg. Next, another 26.4 million people who had other cardiovascular risk factors were eliminated from the participant pool. That left 16.8 million potential participants. Although the results were generalized to all 219.4 million US adults, only 7.6 percent of them were eligible to enroll in the study. From there, many other people were excluded for a variety of other small and technical reasons.Reference Bress, Tanner, Hess, Colantonio, Shimbo and Muntner17
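The eligibility arithmetic is easy to check. The sketch below is our own illustration, restating the figures in the paragraph above (in millions) and computing the eligible share:

```python
# SPRINT eligibility cascade, in millions of US adults (figures as reported above).
us_adults = 219.4        # all US adults at the time of the study
age_50_plus = 95.1       # remaining after excluding adults under age 50
excluded_low_bp = 37.3   # excluded for systolic blood pressure < 130 mmHg
excluded_risk = 26.4     # excluded for other cardiovascular risk-factor criteria
eligible = 16.8          # eligible pool after all major exclusions

# Roughly 7.6-7.7 percent, depending on rounding of the published figures.
eligible_share = eligible / us_adults
print(f"Eligible: {eligible} of {us_adults} million ({eligible_share:.1%})")
```

Put differently, for every thirteen adults to whom the results would be generalized, about twelve could not have enrolled in the trial.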
Ultimately, the SPRINT trial had a profound impact on practice guidelines, even as the study sample looks little like the population to which those guidelines apply. Following the publication of SPRINT, several physician groups lowered their targets for initiating antihypertensive medications. People with systolic blood pressure readings between 120 and 130 mmHg were more likely to be put on medications. Although the SPRINT trial demonstrated a statistically significant improvement with respect to death from any cause, the absolute risk reduction was only about 1.2 percent: Among people in the intensive treatment condition, 3.3 percent died, in comparison to 4.5 percent in the standard treatment condition. Meanwhile, there were significant adverse events associated with the intensive treatment. People receiving it were significantly more likely to experience hypotension (low blood pressure), syncope (fainting), electrolyte abnormalities, and acute kidney injury or kidney failure.
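The gap between statistical significance and clinical meaning is easier to see when absolute and relative effects are computed side by side. A sketch of our own, using the mortality figures just cited:

```python
# All-cause mortality in SPRINT, as reported above.
risk_standard = 0.045    # 4.5% died under standard treatment
risk_intensive = 0.033   # 3.3% died under intensive treatment

arr = risk_standard - risk_intensive   # absolute risk reduction
rrr = arr / risk_standard              # relative risk reduction
nnt = 1 / arr                          # number needed to treat to prevent one death

print(f"Absolute risk reduction: {arr:.1%}")   # 1.2%
print(f"Relative risk reduction: {rrr:.0%}")
print(f"Number needed to treat:  {nnt:.0f}")
```

A relative reduction of about 27 percent sounds impressive; the absolute figure of 1.2 percent, and the roughly 83 people who must be treated intensively to prevent one death, convey the clinical scale more plainly.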
Our final example comes from a heroic effort by Yen Yi Tan and colleagues, who sought to determine whether participants in RCTs were like patients receiving care for the same conditions in clinical practice.Reference Tan, Papez, Chang, Mueller, Denaxas and Lai18 These investigators used ClinicalTrials.gov to identify 43,895 clinical trials registered through spring 2021. Separately, using specialized data-mining tools, the researchers obtained primary-care records from 5,685,738 patients in clinical practices and extracted clinical information from those records. The massive study was able to compare trial samples to patient populations for 989 different drugs that had been used for 286 medical conditions.
One striking observation was how common exclusions were. In the clinical population, about 41 percent of patients had diagnoses of more than one chronic condition. Even though multimorbidity is common in the real world, over 91 percent of potential participants were excluded from clinical trials. Further, in the real world, more than 94 percent of patients treated in primary care settings were taking more than one medication, yet more than half of potential trial participants were excluded because of concomitant medicine use.
And, lest we forget, many volunteers are excluded from clinical trials for reasons other than comorbidities. Indeed, there are significant structural barriers to trial participation. One examination found that about 56 percent of people who might want to participate in a trial are turned away because no trial is available in their community. Patients wishing to participate are deemed ineligible about 22 percent of the time. Overall, only about 8 percent of patients interested in studies end up participating.Reference Pant and Lee19 This is unfortunate because there are reasons to believe that interested potential participants are more likely to represent understudied populations than are those selected to participate in trials.Reference Kalbaugh, Kalbaugh, McManus and Fisher20
To summarize, problems of representativeness and external validity upset the binary classification commonly used to appraise clinical research: RCTs get check marks, while nonrandomized studies are placed in the “not worthy of attention” column. To be clear, we appreciate the importance of RCTs for establishing causation and for significantly reducing bias from several sources. Yet the biases inherent in RCTs ought not to be overlooked. And these biases are indeed inherent. In particular, most RCTs cannot operate without exclusion criteria that ensure noncorrespondence between the study population and the population that is likely to take advantage of the studied treatment. It is quite often simply not possible to achieve statistically significant results with a sample that is truly representative of the population to be treated.
Flawed Comparators
Alongside nonrepresentative study samples, a serious problem afflicting RCTs is the use of flawed control conditions. Such conditions can have a substantial effect on a study’s conclusions, so it is not unusual for the selection of the control condition to provoke skepticism on the part of reviewers. Commercial sponsors, for example, have been accused of selecting control conditions that are likely to augment differences between the treatment and control groups, making the sponsor’s treatment appear more attractive.Reference Abramson21 To be clear, there may be biased comparators in any clinical research design, not just RCTs. But only RCTs are presumed to be maximally free of bias, so we focus on the effects of biased comparators in these sorts of trials.
There are many types of comparators a researcher might choose from. In studies evaluating pharmaceutical products, the most common comparators are placebos, a different dose of the same medication, a different active medication, or no treatment at all. In some cases, a pharmaceutical treatment is compared with a behavioral treatment or with a surgical or other nonpharmaceutical medical intervention. With such a variety of comparison conditions available, a researcher bent on achieving statistically significant results at any cost could well be found designing a comparator that serves their agenda.
Results are consistently linked to the choice of control condition. For example, people who use selective serotonin reuptake inhibitors (SSRIs) to treat depression often appear much improved. But Kirsch has shown that SSRIs rarely produce better outcomes when compared with placebos.Reference Kirsch22 The placebo control group is crucial because it controls for the benefits patients and providers expect from treatment. These expectations are often derived from beliefs about the effects of SSRIs on brain chemistry. In theory, SSRIs work because they increase serotonin levels in the brain. Yet studies show that the medications can increase, decrease, or have no effect on serotonin levels.Reference Kirsch23 Because of this variability, the theory linking the drugs’ benefits to serotonin levels seems implausible. Other systematic reviews show a small benefit of SSRIs in comparison to placebos, but also identify more adverse events, including suicidal ideation and suicide attempts.Reference Locher, Koechlin and Zion24 For reasons like these, placebo groups – or other groups that control for expected benefit – are essential.
What’s in That Placebo?
Most definitions characterize placebos as agents that have no active properties. Often, placebos are described as sugar pills. Nonetheless, hundreds of studies substantiate the so-called placebo effect. This ample literature shows that people commonly report improved symptoms after taking a pill that should have no effect on their health condition.Reference Enck and Klosterhalfen25
But not all placebos are inert.Reference von Wernsdorff, Loef, Tuschen-Caffier and Schmidt26 In fact, some are designed to cause harm, with the goal of mimicking a treatment’s side effects. One problem in clinical trials is that the experience of side effects might signal to both the patient and the investigator which patients are using the active drug, vitiating the bias-preventing mechanism of blinding. To compensate for this problem, investigators turn to active placebos. It has even been suggested that the comparison should be called a “harmcebo.” Such placebos will not only produce side effects of the treatment, but will also be designed to look, feel, taste, and in all other ways seem indistinguishable from the treatment.
Placebos that are engineered to seem like a treatment can cause side effects that confuse the interpretation of a study. In one example, a trial evaluating the effects of fish oils included a control group that took a placebo made of mineral oils. The problem was that the placebo elevated some lipid and inflammatory biomarkers that were used as outcome variables – that is, the placebo induced the effect that, according to the hypothesis, fish oils might mitigate. As a result, the study showed that inflammation was lower in the group that took the fish oils, leading to the incorrect conclusion that fish oils reduced inflammation.Reference Schwartz, Woloshin and Lu27 A later study that used corn oil as the placebo showed there were no clear benefits for fish oil.Reference Nicholls, Lincoff and Garcia28
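A toy simulation makes the mineral-oil problem concrete. The marker levels and the comparator's harm below are invented values, not data from the trials cited. A treatment with no real effect on an inflammatory marker appears to lower it whenever the comparator itself raises the marker:

```python
import random

random.seed(1)

N = 5_000
BASELINE = 2.0          # assumed baseline level of an inflammatory marker
COMPARATOR_HARM = 0.3   # assumed rise caused by the "placebo" itself

def marker(shift):
    """Simulated end-of-study marker level for one participant."""
    return random.gauss(BASELINE + shift, 0.5)

treatment = [marker(0.0) for _ in range(N)]       # treatment has no real effect
harmful_placebo = [marker(COMPARATOR_HARM) for _ in range(N)]
inert_placebo = [marker(0.0) for _ in range(N)]

def mean(xs):
    return sum(xs) / len(xs)

# Against the harmful comparator, the inert treatment looks anti-inflammatory;
# against a truly inert comparator, it shows no effect.
print("vs harmful placebo:", round(mean(treatment) - mean(harmful_placebo), 2))
print("vs inert placebo:  ", round(mean(treatment) - mean(inert_placebo), 2))
```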
Another example is the evaluation of oseltamivir, a drug used to treat influenza. Oseltamivir has a bitter taste, so the placebo used for the study included hydrochloric acid, which also tastes bitter. But, in addition, hydrochloric acid also causes gastrointestinal (GI) symptoms. In this case, the placebo caused GI distress just as the treatment did, leading to the incorrect conclusion that oseltamivir did not cause GI problems.Reference Webster, Howick and Hoffmann29
A particular concern surrounding placebo selection is that it can be impossible to know what is in placebos and therefore what the comparison condition of a trial actually is.Reference Demasi and Jefferson30 It is not uncommon for pharmaceutical companies to make the placebos they use in clinical trials and then hide their contents. When manufacturers submit their applications for new drug approval, they are required to provide a certificate of analysis that includes technical details of placebos used as comparators, but the certificate itself is not made public. The composition of the placebo remains proprietary information, known only to the manufacturer and the regulator, and regulators usually do not make it public. Most medical journals do little to enhance transparency in this regard, as they do not routinely require the composition of placebos to be reported.
Some of these problems came into focus when the researcher Richard Shader criticized a 2017 study, published in the New England Journal of Medicine, in which a treatment group was injected with monoclonal antibodies and the control group received injections of a “matching” placebo. But what chemicals were in the matching placebo? Exactly how was it matched? None of that was reported.Reference Shader31 As then-editor of the journal Clinical Therapeutics, Shader took action. In January 2018, the journal announced that its authors would now be required to detail the exact composition of placebos and comparison medications. Concerns about comparator transparency long predate this episode, however; the practices Clinical Therapeutics adopted had been recommended years earlier in the behavioral clinical trials community.Reference Mohr, Spring and Freedland32
There have also been questions about the control condition in the most influential clinical trial evaluating cholesterol-lowering statin medications, which we describe more fully in Chapter 14. The JUPITER trial, sponsored by the statin-maker AstraZeneca and involving more than 17,800 people in many different countries, is the only large-scale study to show that people randomly assigned to take statin medications live longer (or have lower rates of death from any cause) in comparison with those randomly assigned to take a placebo. Indeed, JUPITER showed not only considerable benefits of statins but also comparable rates of side effects between the statin treatment and the placebo used for comparison. In particular, there has long been concern that statins cause muscle aches as a side effect, yet JUPITER found no difference in rate of muscle ache reports between the placebo group and those taking the active medication. This finding may, however, reflect less the safety of statins than the harmfulness of the particular placebo used. Consider that the rate of muscle aches in the JUPITER placebo group was 15.4 percent, in comparison with less than 5 percent in the placebo groups in other randomized clinical trials evaluating statins.
What was this placebo that caused so many side effects, thereby making AstraZeneca’s treatment seem benign – no more damaging than a “sugar pill”? The New England Journal of Medicine article that summarized the study did not report anything about the contents of the placebo pills. Inquiries to the study’s principal investigator received no response. Epidemiologist Tom Jefferson and science journalist Maryanne Demasi contacted the European Medicines Agency for information, but were turned away.Reference Demasi and Jefferson30 They then inquired with Dutch and Australian regulators in possession of the proprietary study details; neither provided information on the composition of the placebo. Finally, they contacted AstraZeneca. The company replied that they could apply to obtain the information, but would have to sign a confidentiality agreement limiting disclosure of any information they received. Given AstraZeneca’s antitransparency terms, Jefferson and Demasi refused. The composition of AstraZeneca’s placebo remains a mystery to the public.
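The arithmetic behind this concern can be sketched with invented rates; the per-arm size and ache rates below are illustrative assumptions, not JUPITER's actual data. A drug that genuinely causes excess muscle aches can show no apparent excess when its comparator causes aches at a similar rate:

```python
import random

random.seed(2)

N = 8_000              # participants per arm (illustrative, not the trial's size)
DRUG_ACHE = 0.15       # assumed rate of muscle aches on the drug
HARSH_PLACEBO = 0.15   # a comparator that itself causes aches at a similar rate
INERT_PLACEBO = 0.05   # a comparator closer to a true "sugar pill"

def ache_count(rate):
    """Number of participants in an arm of size N reporting muscle aches."""
    return sum(random.random() < rate for _ in range(N))

drug = ache_count(DRUG_ACHE)
harsh = ache_count(HARSH_PLACEBO)
inert = ache_count(INERT_PLACEBO)

# Against the harsh comparator the drug shows no apparent excess of aches;
# against the inert comparator the excess is obvious.
print(f"drug {drug / N:.1%} vs harsh placebo {harsh / N:.1%}")
print(f"drug {drug / N:.1%} vs inert placebo {inert / N:.1%}")
```

The comparison against the harsher placebo yields roughly equal rates, which is why a side-effect-free reading of such a trial depends entirely on what the placebo actually was.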
What can be done about this problem? Demasi and Jefferson have offered several simple and practical solutions. For example, regulatory agencies could and should develop universal standards for the regulation of placebos, and medical journals should require authors to report exactly what is in placebos. At the moment, placebos are just called placebos, even though they may differ from study to study and, in many cases, are not inert.
To be clear, we do not mean to suggest that active placebos are, in general, harmful or designed to produce results that favor treatment. Active placebos are tools that can be used for good science, even as they are also subject to manipulation. A recent review of 21 trials comparing treatments with both active and inert placebos did show that studies produced positive results more often when active placebos were used, but also found no clear difference in outcomes assessed using patient-reported measures.Reference Laursen, Nejstgaard and Bjørkedal33 This finding, though not dispositive, is at least suggestive: If active placebo-controlled studies are systematically biased toward positive findings that do not materialize in the patient experience, then skepticism is appropriate. We need not discard such studies, but we should read them carefully to better understand the placebo’s potential influence on results. The bottom line is that, when interpreting studies, one should never take the term “placebo” as self-evident. More broadly, one should always probe a trial’s control group carefully, as its design is highly relevant to study conclusions.
Control Groups in Behavioral Trials
There is no clear consensus on the appropriateness of control conditions in behavioral trials, but as in any trial, the control condition can have profound effects on the results reported by investigators.
Although the principles for selecting the control condition are similar for all clinical trials, behavioral trials may require more attention for at least three reasons. First, fidelity to treatment is particularly important in behavioral trials. In contrast to pharmaceutical trials where all participants may take an identical-looking pill, behavioral trials test treatments that may be administered differently by different people and in different settings. In order to address this problem, behavioral trialists try to design and test treatments that can be delivered in as consistent a manner as possible. This can sometimes be achieved using a structured manual that offers step-by-step instruction for administering the treatment. Even so, it is important to evaluate the extent to which the protocol is adhered to across trial settings.
The second particular challenge of crafting control conditions for behavioral clinical trials results from clinician allegiance. Trained clinicians usually come to a study with beliefs about the efficacy and effectiveness of treatments. On the one hand, clinical trials should test the value of treatments under optimal circumstances, so we want clinicians who have extensive experience with a particular treatment approach. On the other hand, when clinicians are selected because of their unique training and experience with a particular intervention, biases are likely. Clinician allegiance can introduce many different biases, and its influence is not limited to the selection of the control condition. That is why intellectual conflicts of interest should also be taken into consideration when reviewing study results.
Finally, in behavioral trials, it is almost inevitable that the control condition itself will have some effect. The fact is that it is practically impossible to develop a true placebo for a behavioral intervention. Often, behavioral trials use a waitlist control, whereby participants are randomly assigned to the active treatment or to wait for the opportunity to participate in the active treatment. First, the active group is treated. When this phase is complete, the active and waitlist groups are compared to appraise treatment efficacy. Finally, the waitlist group is treated. This is a clever design, but it is not foolproof. Many people are dissatisfied with assignment to the waitlist, which can have important effects on self-reported outcomes and on the likelihood that participants remain actively involved in the study.
An important paper by David Mohr and colleagues offers a catalog of control conditions commonly used in behavioral studies. The authors explore, for example, the comparison of a treatment involving multiple components against a treatment comprising just one of the components. Some studies use “nonspecific” controls, in which a treatment designed for a specific purpose is compared with an intervention not expected, on theoretical grounds, to improve the outcome. For example, an active treatment might be compared with a control group that receives attention but no theoretically effective treatment. Attention alone, however, may result in better outcomes; a so-called attention control can therefore be used to separate the effect of the treatment from the effect of attention alone.
Another common control condition is “treatment as usual.” This has many advantages because it is representative of what happens in the real world. However, it is difficult to define exactly what happens in treatment-as-usual conditions. Clinicians providing treatment as usual might not themselves be able to describe what they do, and clinicians apply varying approaches, so that treatment as usual is necessarily an ambiguous category (Box 5.2).
Observational studies are assumed to produce biased estimates of treatment effects compared with randomized clinical trials. But how often do RCTs and observational studies come to different conclusions?
A recent umbrella review of 47 systematic reviews by the Cochrane Library included 2,869 RCTs with 3,882,115 participants, plus 3,924 observational studies with 19,499,970 participants, covering diverse topics in medicine and healthcare.
The reviewers found either no difference or only very small differences between the treatment-effect estimates produced by observational methods and those produced by RCTs.
Study design alone does not guarantee the absence of bias. Observational studies frequently produce results that correspond to those of more complex and expensive randomized trials.Reference Toews, Anglemyer and Nyirenda34
Conclusions
Although RCTs go a long way toward eliminating bias, they may also systematically introduce bias. In this chapter, we have concentrated on two ways RCTs can mislead. First, RCTs may have high internal validity but low external validity. This occurs because all RCTs have inclusion and exclusion criteria. These selection criteria often assure that participants in the trial are unlike the population to whom the results will be generalized.
Second, all RCTs use comparison conditions, the selection of which can have a profound effect on how data are interpreted and on a study’s conclusions. In some circumstances, a control condition may be designed to confirm investigators’ a priori beliefs or ensure that treatments appear safe and effective. Even when control conditions are selected for good scientific reasons, they can produce confusing effects that lead researchers to misinterpret data. Just as randomization does not, in and of itself, eliminate bias or ensure that RCT results are useful, so too control groups do not ensure that RCTs produce valid data or valuable outcomes.
Careful scientists appreciate that RCTs have limitations. But policy makers, the media, patients, and some clinicians are less aware of these limitations, and commercial entities are frequently invested in ignoring them. Researchers are in part responsible for these misconceptions: Even as most of us know that RCTs are not bias-free and are often not as externally valid as treatment marketers claim, many of us work for or are funded by commercial entities that use our research deceptively. And, regardless of their employers or sponsors, researchers may affirm mistaken shorthands such as the hierarchy of evidence. Yes, RCTs can provide valid, useful, and important evidence. But they are not always, in some universal sense, “better” at producing evidence than are other research designs.
There is no pyramid with RCTs at the top. There are diverse research questions of varying importance, some of which should be answered using RCTs, some of which should be answered using other methods, and some of which should be discarded in favor of more useful questions. And all research questions that are studied should be approached using well-designed methods. Because RCTs are particularly complicated, their methodologies demand special scrutiny. As ever, the devil lurks in the details – and the more details, the more devils one is likely to find (Box 5.3).
Randomized clinical trials are often viewed as being free of bias. Some reviewers of the literature only include RCTs in their analysis.
Although RCTs are excellent methods for establishing causation and estimating whether a treatment can work, they often include systematic biases.
RCT selection criteria may systematically exclude portions of the population to whom results are likely to be generalized.
The selection of a control group can create systematic bias. Sometimes control groups are systematically chosen to create an artificially strong contrast that will enhance the apparent effect of the treatment.
Behavioral clinical trials are subject to unique difficulties that must be approached carefully.

