Before moving to Harvard University in the summer of 1983, I spent two wonderful years as a Professor at the University of Chicago, 75% in the Department of Statistics and 25% in the Department of Education. During that brief period of time, I supervised several great PhD students, one of whom was Neal Thomas, who, despite a strong mathematics background, had a deep appreciation for applied statistics. The issue of the subasymptotic properties of propensity score matching was of real interest to me at the time, and remains an important one today. His PhD thesis addressed this topic, but he and I made more progress after he was out and working at RAND, and then ETS. The result was a sequence of four articles, which started with quite general theoretical results and then moved to apply that theory.
The first of these, Chapter 15, Rubin and Thomas (1992a), was published in the Annals of Statistics and was built on a substantial extension of EPBR matching methods studied in Chapter 6, Rubin (1976b). Affinely invariant matching methods are ones that are closed under any wide-sense linear transformation of the data, such as discriminant matching (using either the estimated or the population discriminant), propensity score matching (again, using either the population or estimated, e.g., by logistic regression), as well as Mahalanobis-metric matching, and many other methods.
Matched samples are usually created to aid in the design of a study to assess the causal effect of some active treatment or intervention relative to some control treatment, based on nonrandomized observational data. Consequently, this summary of advice on matching will also offer some general suggestions for the design of such studies based on my four decades of work on them. The theoretical perspective for this advice, the “Rubin Causal Model” (RCM – Holland, 1986b; Rubin, 2006), has two essential parts: the definition of the scientific situation using “potential outcomes” to define causal effect estimands, and the formulation of a real or hypothetical “assignment mechanism”; and a third optional part, the modeling of the science to produce imputations of missing potential outcomes. Matched sampling is focused on the second part. As stated in the initial introduction, a full-length textbook from this perspective is Imbens and Rubin (2006b); and recent summaries of the RCM appear in Imbens and Rubin (2006a) and Rubin (2006a).
The first part of the RCM implies that we should always start by carefully defining all causal estimands in terms of potential outcomes, which are all values that could be observed in some real or hypothetical experiment comparing the results under an active treatment to the results under a control treatment. That is, causal effects are defined by a comparison of (a) the values that would be observed if the active treatment were applied and (b) the values that would be observed if instead the control treatment were applied.
Few issues in behavior genetics have received more public and scientific attention than that given to the possible role of an extra Y chromosome in human aggression. Soon after the literature began to suggest an elevated frequency of the XYY genotype among inmates of institutions for criminals and delinquents, interest in this issue had a meteoric rise; and it has been sustained ever since. This happened for several reasons. Stories about a few men who had or were presumed to have an extra Y chromosome and who had committed serious crimes were given prominent attention in the press, suggesting the intriguing idea that the single Y chromosome normally found in males contributes to “aggressive tendencies” in that sex and that an extra Y carries these tendencies beyond their usual bounds. Reports of antisocial behavior in XYY men, often based on a single case, soon began to appear in the scientific literature and were taken as evidence of an XYY-aggression linkage. The serious moral and legal implications of such a linkage attracted the interest of social scientists and legal groups to the XYY phenomenon (Shah, 1970), and students of genetics and psychology saw in it, as Lederberg (1973) has said, “one of the most tangible leads for connecting genetic constitution with behavior in man.”
A number of studies have supported the earlier finding of an elevated frequency of cases with an XYY complement among men in institutions, particularly in penal-mental institutions.
Abstract: The problem of drawing causal inferences from retrospective case-control studies is considered. A model for causal inference in prospective studies is reviewed and then applied to retrospective studies. The limitations of case-control studies are formulated in terms of the level of causally relevant parameters that can be estimated in such studies. An example using data from a large retrospective study of coffee-drinking and myocardial infarctions is used to illustrate the ideas of the article.
Philosophical discussions of causality often emphasize the meaning of causation. Scientists are usually concerned with understanding causal mechanisms. Purely statistical discussions of causality are substantially more limited in scope, because the unique contribution of statistics is to the measuring of causal effects and not to the understanding of causal mechanisms or to the meaning of causation. This distinction is sometimes expressed as “statistics can establish correlation, but not causation.” We feel our emphasis on measurement is more appropriate, because it focuses on what statistical theory can contribute to discussions of causality. Measuring causal effects accurately without any understanding whatsoever of the causal mechanisms involved or of the meaning of causation is not only possible but is, of course, a commonplace experience of everyday life; that is, people are quite capable of using automobiles, ovens, calculators, and typewriters safely and effectively without any knowledge of how these devices work. Of course, careful measurements of causal effects often lead to a better understanding of the causal mechanisms involved.
Abstract: Several matching methods that match all of one sample from another larger sample on a continuous matching variable are compared with respect to their ability to remove the bias of the matching variable. One method is a simple mean-matching method and three are nearest available pair-matching methods. The methods' abilities to remove bias are also compared with the theoretical maximum given fixed distributions and fixed sample sizes. A summary of advice to an investigator is included.
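As a rough illustration of what a nearest available pair-matching method does, the following minimal sketch matches each treated unit, in a chosen order, to the closest control that has not yet been used. The function names, the `order` option names, and the data in the usage note are assumptions made for illustration; they are not taken from the chapter.

```python
def nearest_available_match(treated, controls, order="high_low"):
    """Greedy nearest-available pair matching on one continuous variable.

    treated, controls: lists of matching-variable values.
    order: "high_low", "low_high", or "as_given" (illustrative option names)
           controlling the order in which treated units are matched.
    Returns a list of (treated_index, control_index) pairs.
    """
    if len(controls) < len(treated):
        raise ValueError("need at least as many controls as treated units")

    indices = list(range(len(treated)))
    if order == "high_low":
        indices.sort(key=lambda i: treated[i], reverse=True)
    elif order == "low_high":
        indices.sort(key=lambda i: treated[i])
    elif order != "as_given":
        raise ValueError("unknown order: " + repr(order))

    available = set(range(len(controls)))
    pairs = []
    for i in indices:
        # Closest control still available; ties are broken by lower index.
        j = min(available, key=lambda c: (abs(controls[c] - treated[i]), c))
        available.remove(j)  # each control is used at most once
        pairs.append((i, j))
    return pairs


def matched_mean_difference(treated, controls, pairs):
    """Average treated-minus-control difference over the matched pairs."""
    return sum(treated[i] - controls[j] for i, j in pairs) / len(pairs)
```

With the hypothetical values `treated = [2.0, 3.0]` and `controls = [2.6, 1.0, 10.0]`, the two orderings pair the units differently, which shows why the order in which treated units are processed can matter for a greedy method.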
INTRODUCTION
Matched sampling is a method of data collection and organization designed to reduce bias and increase precision in observational studies, i.e., in those studies in which the random assignment of treatments to units (subjects) is absent. Although there are examples of observational studies which could have been conducted as properly randomized experiments, in many other cases the investigator could not randomly assign treatments to subjects. For example, consider the Kihlberg and Robinson [1968] study comparing severity of injury in automobile accidents for motorists using and not using seatbelts. One would not want to randomly assign subjects to “seatbelt” and “no seatbelt” treatments and then have them collide at varying speeds, angles of impact, etc. Neither, however, would one want to simply compare the severity of injury in “random” samples of motorists in accidents using and not using seatbelts; important variables such as “speed of automobile at time of accident” may be differently distributed in the two groups (i.e., seatbelted motorists are generally more cautious and therefore tend to drive more slowly).
Abstract: Propensity score matching refers to a class of multivariate methods used in comparative studies to construct treated and matched control samples that have similar distributions on many covariates. This matching is the observational study analog of randomization in ideal experiments, but is far less complete as it can only balance the distribution of observed covariates, whereas randomization balances the distribution of all covariates, both observed and unobserved. An important feature of propensity score matching is that it can be easily combined with model-based regression adjustments or with matching on a subset of special prognostic covariates or combinations of prognostic covariates that have been identified as being especially predictive of the outcome variables. We extend earlier results by developing approximations for the distributions of covariates in matched samples created with linear propensity score methods for the practically important situation where matching uses both the estimated linear propensity scores and a set of special prognostic covariates. Such matching on a subset of special prognostic covariates is an observational study analog of blocking in a randomized experiment. An example combining propensity score matching with Mahalanobis metric matching and regression adjustment is presented that demonstrates the flexibility of these methods for designing an observational study that effectively reduces both bias due to many observed covariates and bias and variability due to a more limited subset of covariates.
Abstract: Investigators in observational studies have no control over treatment assignment. As a result, large differences can exist between the treatment and control groups on observed covariates, which can lead to badly biased estimates of treatment effects. Propensity score methods are an increasingly popular method for balancing the distribution of the covariates in the two groups to reduce this bias; for example, using matching or subclassification, sometimes in combination with model-based adjustment. To estimate propensity scores, which are the conditional probabilities of being treated given a vector of observed covariates, we must model the distribution of the treatment indicator given these observed covariates. Much work has been done in the case where covariates are fully observed. We address the problem of calculating propensity scores when covariates can have missing values. In such cases, which commonly arise in practice, the pattern of missing covariates can be prognostically important, and then propensity scores should condition both on observed values of covariates and on the observed missing-data indicators. Using the resulting generalized propensity scores to adjust for the observed background differences between treatment and control groups leads, in expectation, to balanced distributions of observed covariates in the treatment and control groups, as well as balanced distributions of patterns of missing data. The methods are illustrated using the generalized propensity scores to create matched samples in a study of the effects of postterm pregnancy.
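The idea of conditioning on both the observed covariate values and the missing-data indicators can be sketched for the simplest case of a few categorical covariates. Here the score is estimated as the proportion treated among units sharing the same observed values and the same missingness pattern. This is only an illustrative stand-in under that assumption, not the estimation method of the chapter; the function names and the use of `None` to mark a missing value are hypothetical.

```python
from collections import defaultdict


def observed_values_and_indicators(row):
    """Split a covariate row into observed values and missing-data indicators.

    Missing entries are marked by None. Returns (observed, indicators), where
    indicators[k] is 1 if covariate k is missing and 0 otherwise.
    """
    indicators = tuple(1 if value is None else 0 for value in row)
    observed = tuple(value for value in row if value is not None)
    return observed, indicators


def generalized_propensity_scores(covariates, treated):
    """Estimate P(treated | observed values, missingness pattern) by cell.

    covariates: list of rows of categorical values, with None for missing.
    treated: list of 0/1 treatment indicators, one per row.
    Returns one estimated score per unit.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [number treated, number of units]
    cells = [observed_values_and_indicators(row) for row in covariates]
    for cell, z in zip(cells, treated):
        counts[cell][0] += z
        counts[cell][1] += 1
    return [counts[cell][0] / counts[cell][1] for cell in cells]
```

Units with the same observed values but different missingness patterns fall in different cells, so the pattern itself can shift the estimated score. With many or continuous covariates the cells become too sparse and a model for treatment assignment is needed instead.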
Patients write DNR orders to avoid a vegetative state or an otherwise low-quality life, which they fear will follow resuscitation, but policy makers are concerned that DNR orders hasten death, especially for vulnerable subgroups. Both patients and policy makers would be well informed by knowing how often and for whom obeying a DNR order will cause the early termination of life and how often those terminated lives would have been high quality.
There are two ways to learn about the causal effects of DNR orders: an observational study, such as that performed by Shepardson et al. (1999), and a randomized clinical trial (RCT). At the end of our discussion, we consider an alternative to both an observational study and a traditional RCT that involves random assignment, yet may be both ethical and more policy relevant than a traditional RCT.
When treatment evaluation does not benefit from any random assignment, observational study methods must be used to assess treatment effects. Here, the authors use propensity score methods and conclude that a patient's choice to write a DNR order on average hastens death. At first glance, it may seem strange to use formal quantitative methods to evaluate this question. After all, can forbidding “resuscitation to preserve life” do anything but hasten death? However, the authors give their attention to assessing the magnitude of the effect, which has far more policy relevance than the more mundane question of assessing statistical significance.
Abstract: Matched sampling is a standard technique in the evaluation of treatments in observational studies. Matching on estimated propensity scores comprises an important class of procedures when there are numerous matching variables. Recent theoretical work (Rubin, D. B., and Thomas, N., 1992a, reprinted in this volume as Chapter 15) on affinely invariant matching methods with ellipsoidal distributions provides a general framework for evaluating the operating characteristics of such methods. Moreover, Rubin and Thomas (1992b, reprinted in this volume as Chapter 16) uses this framework to derive several analytic approximations under normality for the distribution of the first two moments of the matching variables in samples obtained by matching on estimated linear propensity scores. Here we provide a bridge between these theoretical approximations and actual practice. First, we complete and refine the normal-based analytic approximations, thereby making it possible to apply these results to practice. Second, we perform Monte Carlo evaluations of the analytic results under normal and nonnormal ellipsoidal distributions, which confirm the accuracy of the analytic approximations, and demonstrate the predictable ways in which the approximations deviate from simulation results when normal assumptions are violated within the ellipsoidal family. Third, we apply the analytic approximations to real data with clearly nonellipsoidal distributions, and show that the theoretical expressions, although derived under artificial distributional conditions, produce useful guidance for practice. Our results delineate the wide range of settings in which matching on estimated linear propensity scores performs well, thereby providing useful information for the design of matching studies.
Abstract: Observational studies comparing groups of treated and control units are often used to estimate the effects caused by treatments. Matching is a method for sampling a large reservoir of potential controls to produce a control group of modest size that is ostensibly similar to the treated group. In practice, there is a trade-off between the desires to find matches for all treated units and to obtain matched treated–control pairs that are extremely similar to each other. We derive expressions for the bias in the average matched pair difference due to (i) the failure to match all treated units – incomplete matching, and (ii) the failure to obtain exact matches – inexact matching. A practical example shows that the bias due to incomplete matching can be severe, and moreover, can be avoided entirely by using an appropriate multivariate nearest available matching algorithm, which, in the example, leaves only a small residual bias due to inexact matching.
INTRODUCTION
The Effects Caused by Treatments
A treatment is an intervention that can, in principle, be given to or withheld from any experimental unit under study. With an experimental treatment and a control treatment, each unit has two potential responses: a response r1 that would be observed if the unit received the experimental treatment, and a response r0 that would be observed if the unit received the control treatment.
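The two potential responses can be made concrete with a small sketch. It assumes, purely for illustration, a table in which both r1 and r0 are known for every unit, something never available in real data, where only the response under the treatment actually received is observed. All function names and numbers are hypothetical.

```python
def average_effect(r1, r0):
    """Average of the unit-level effects r1 - r0 (needs both potential responses)."""
    return sum(a - b for a, b in zip(r1, r0)) / len(r1)


def observed_responses(r1, r0, z):
    """Response actually observed for each unit: r1 if treated (z = 1), else r0."""
    return [a if t == 1 else b for a, b, t in zip(r1, r0, z)]


def difference_in_observed_means(r1, r0, z):
    """Mean observed response of treated units minus that of control units."""
    observed = observed_responses(r1, r0, z)
    treated = [y for y, t in zip(observed, z) if t == 1]
    control = [y for y, t in zip(observed, z) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical potential responses for four units.
r1 = [5, 7, 6, 8]  # response if given the experimental treatment
r0 = [3, 6, 6, 5]  # response if given the control treatment
z = [1, 0, 1, 0]   # treatment actually received

true_average_effect = average_effect(r1, r0)
naive_comparison = difference_in_observed_means(r1, r0, z)
```

In this made-up example the treated and control groups differ in their potential responses, so the simple comparison of observed means need not equal the average effect computed from both columns.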
Abstract: This paper proposes a simple technique for assessing the range of plausible causal conclusions from observational studies with a binary outcome and an observed categorical covariate. The technique assesses the sensitivity of conclusions to assumptions about an unobserved binary covariate relevant to both treatment assignment and response. A medical study of coronary artery disease is used to illustrate the technique.
INTRODUCTION AND NOTATION
Inevitably, the results of clinical studies are subject to dispute. In observational studies, one basis for dispute is obvious: since patients were not assigned to treatments at random, patients at greater risk may be over-represented in some treatment groups. This paper proposes a method for assessing the sensitivity of causal conclusions to an unmeasured patient characteristic relevant to both treatment assignment and response. Despite their limitations, observational studies will continue to be a valuable source of information, and therefore it is prudent to develop appropriate methods of analysis for them.
Our sensitivity analysis consists of the estimation of the average effect of a treatment on a binary outcome variable after adjustment for observed categorical covariates and an unobserved binary covariate u, under several sets of assumptions about u. Both Cornfield et al. (1959) and Bross (1966) have proposed guidelines for determining whether an unmeasured binary covariate having specified properties could explain all of the apparent effect of a treatment, that is, whether the treatment effect, after adjustment for u could be zero.