Abstract: The ability of matched sampling and linear regression adjustment to reduce the bias of an estimate of the treatment effect in two sample observational studies is investigated for a simple matching method and five simple estimates. Monte Carlo results are given for moderately linear exponential response surfaces and analytic results are presented for quadratic response surfaces. The conclusions are (1) in general both matched sampling and regression adjustment can be expected to reduce bias, (2) in some cases when the variance of the matching variable differs in the two populations both matching and regression adjustment can increase bias, (3) when the variance of the matching variable is the same in the two populations and the distributions of the matching variable are symmetric the usual covariance adjusted estimate based on random samples is almost unbiased, and (4) the combination of regression adjustment in matched samples generally produces the least biased estimate.
INTRODUCTION
This paper is an extension of Rubin [1973a] to include regression adjusted estimates and parallel nonlinear response surfaces. The reader is referred to Sections 1 and 2 of that paper for the statement of the general problem and an introduction to the notation.
After presenting the estimates of the treatment effect to be considered in the remainder of Section 1, we go on in Section 2 to present Monte Carlo results for the expected bias of the estimates assuming four exponential response surfaces, normally distributed X, and the random order, nearest available matching method.
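As a rough illustration of the random order, nearest available matching method named above, the following sketch matches on a single covariate. It assumes that treated units are processed in a random order and that each takes the closest control not yet used. The function name, the tie-breaking rule, and the toy data are illustrative assumptions, not taken from the paper.

```python
import random


def nearest_available_match(treated_x, control_x, seed=0):
    """Random order, nearest available pair matching on one covariate.

    Returns a list of (treated_index, control_index) pairs.
    """
    if len(control_x) < len(treated_x):
        raise ValueError("need at least as many controls as treated units")
    rng = random.Random(seed)
    order = list(range(len(treated_x)))
    rng.shuffle(order)                      # random ordering of treated units
    available = set(range(len(control_x)))  # controls not yet matched
    pairs = []
    for i in order:
        # Closest unused control; ties broken by lowest index (an assumption).
        j = min(available, key=lambda c: (abs(control_x[c] - treated_x[i]), c))
        available.remove(j)
        pairs.append((i, j))
    return pairs


def mean(values):
    return sum(values) / len(values)


# Hypothetical data: a small treated group and a larger control reservoir.
treated_x = [1.0, 1.5, 2.0]
control_x = [0.0, 0.2, 0.9, 1.4, 2.2, -1.0]

pairs = nearest_available_match(treated_x, control_x)
matched_control_x = [control_x[j] for _, j in pairs]

# Difference in covariate means before and after matching.
bias_before = mean(treated_x) - mean(control_x)
bias_after = mean(treated_x) - mean(matched_control_x)
```

In this toy example each treated unit has a distinct nearest control, so the processing order does not change the matched set. In general the order can matter, because an early treated unit may take a control that a later one would have matched more closely.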
Abstract: The aim of many analyses of large databases is to draw causal inferences about the effects of actions, treatments, or interventions. Examples include the effects of various options available to a physician for treating a particular patient, the relative efficacies of various health care providers, and the consequences of implementing a new national health care policy. A complication of using large databases to achieve such aims is that their data are almost always observational rather than experimental. That is, the data in most large data sets are not based on the results of carefully conducted randomized clinical trials, but rather represent data collected through the observation of systems as they operate in normal practice without any interventions implemented by randomized assignment rules. Such data are relatively inexpensive to obtain, however, and often do represent the spectrum of medical practice better than the settings of randomized experiments. Consequently, it is sensible to try to estimate the effects of treatments from such large data sets, even if only to help design a new randomized experiment or shed light on the generalizability of results from existing randomized experiments. However, standard methods of analysis using available statistical software (such as linear or logistic regression) can be deceptive for these objectives because they provide no warnings about their propriety. Propensity score methods are more reliable tools for addressing such objectives because the assumptions needed to make their answers appropriate are more assessable and transparent to the investigator.
Abstract: Propensity score methodology can be used to help design observational studies in a way analogous to the way randomized experiments are designed: without seeing any answers involving outcome variables. The typical models used to analyze observational data (e.g., least squares regressions, difference of difference methods) involve outcomes, and so cannot be used for design in this sense. Because the propensity score is a function only of covariates, not outcomes, repeated analyses attempting to balance covariate distributions across treatment groups do not bias estimates of the treatment effect on outcome variables. This theme will be the primary focus of this article: how to use the techniques of matching, subclassification and/or weighting to help design observational studies. The article also proposes a new diagnostic table to aid in this endeavor, which is especially useful when there are many covariates under consideration. The conclusion of the initial design phase may be that the treatment and control groups are too far apart to produce reliable effect estimates without heroic modeling assumptions. In such cases, it may be wisest to abandon the intended observational study, and search for a more acceptable data set where such heroic modeling assumptions are not necessary. The ideas and techniques will be illustrated using the initial design of an observational study for use in the tobacco litigation based on the NMES data set.
Abstract: Matched sampling is a method of data collection designed to reduce bias and variability due to specific matching variables. Although often used to control for bias in studies in which randomization is practically impossible, there is virtually no statistical literature devoted to investigating the ability of matched sampling to control bias in the common case of many matching variables. An obvious problem in studying the multivariate matching situation is the variety of sampling plans, underlying distributions, and intuitively reasonable matching methods. This article considers one class of multivariate matching methods which yield the same percent reduction in expected bias for each of the matching variables. The primary result is the derivation of the expression for the maximum attainable percent reduction in bias given fixed distributions and fixed sample sizes. An examination of trends in this maximum leads to a procedure for estimating minimum ratios of sample sizes needed to obtain well-matched samples.
INTRODUCTION
This introduction is brief; the reader is referred to Rubin [1976b] for a more detailed explanation of the notation and terminology.
Let G1 and G2 be two random samples of sizes N1 and N2 from two populations, P1 and P2. Matched sampling is an attempt to find subsamples of G1 and G2, G1* and G2* of sizes N1* and N2*, such that the distributions of the p matching variables X are more similar in G1* and G2* than in G1 and G2.
Abstract: Matched sampling is a method for selecting units from a large reservoir of potential controls to produce a control group of modest size that is similar to a treated group with respect to the distribution of observed covariates. We illustrate the use of multivariate matching methods in an observational study of the effects of prenatal exposure to barbiturates on subsequent psychological development. A key idea is the use of the propensity score as a distinct matching variable.
INTRODUCTION: BACKGROUND; WHY MATCH?
Matched Sampling in Observational Studies. In many observational studies, there is a relatively small group of subjects exposed to a treatment and a much larger group of control subjects not exposed. When the costs associated with obtaining outcome or response data from subjects are high, some sampling of the control reservoir is often necessary. Matched sampling attempts to choose the controls for further study so that they are similar to the treated subjects with respect to background variables measured on all subjects.
The Danish Cohort. We examine multivariate matched sampling using initial data from a proposed study of the effects on psychological development of prenatal exposure to barbiturates. The analyses presented are preliminary and intended only to explore methodological options; none of the matched samples are the actual ones to be used for study of in utero exposure to barbiturates. The children under study were born between 1959 and 1961 and have been the object of other studies (e.g., Mednick et al. 1971; Zachau-Christiansen and Ross 1975).
Abstract: Causal inference is an important topic and one that is now attracting serious attention of statisticians. Although there exist recent discussions concerning the general definition of causal effects and a substantial literature on specific techniques for the analysis of data in randomized and nonrandomized studies, there has been relatively little discussion of modes of statistical inference for causal effects. This presentation briefly describes and contrasts four basic modes of statistical inference for causal effects, emphasizes the common underlying causal framework with a posited assignment mechanism, and describes practical implications in the context of an example involving the effects of switching from a name-brand to a generic drug. A fundamental conclusion is that in such nonrandomized studies, sensitivity of inference to the assignment mechanism is the dominant issue, and it cannot be avoided by changing modes of inference, for instance, by changing from randomization-based to Bayesian methods.
INTRODUCTION
Causal Inference
Causal inference is a topic that statisticians are addressing more vigorously and rigorously in recent years. This is a desirable development for statistics, as supported by Cox's (1986) comment on Holland (1986b) that “the issues explicitly and implicitly raised by the article seem to me more important for the foundations of our subject than the discussion of the nature of probability”.
Even though this book is limited to reprinting publications of mine on matched sampling, it seems useful to provide some background history to the topic. As described in the initial introduction, I had already started working on matching before I met Bill Cochran. But Bill had been working on the design and analysis of observational studies for many years before my appearance.
I reviewed Cochran's work on observational studies, including his early papers, in Rubin (1984c), the first chapter in this book. This was originally a chapter written for a volume honoring Cochran's impact on statistics, edited by Rao and Sedransk (1984). My review starts with Cochran (1953a), which focused on stratified and pair matching on a single covariate, X, and their effects on efficiency of estimation. That chapter continued with Cochran (1965), which was a compendium of advice on the design and analysis of observational studies. Also reviewed was Cochran (1968a), on subclassification and stratified matching, an article that I regard as extremely important. It was a departure from his earlier work on matching, as well as other early work on the effects of matching, all of which focused on the efficiency of estimation (e.g., Wilks (1932), which assumed exact matching with a normally distributed variable; Greenberg (1953), which compared mean matching and regression adjustment; Billewicz (1965), which I've always found relatively confusing).
Part III begins with another pair of back-to-back Biometrics articles, Rubin (1976b, c), which were written shortly after completing my PhD, and were focused entirely on analytic results concerning multivariate matching. Obviously, for practice the multivariate case is the norm; in fact, often there are very many matching variables.
Chapter 6, Rubin (1976b), defines a class of matching methods called “Equal Percent Bias Reducing” (EPBR). EPBR methods have the property that the percent reduction in bias due to the matching is the same for each of the matching variables. There are always linear combinations of the covariates that have the same means in the treatment and control groups before matching, and if a method is not EPBR, some of these will have different means after matching, implying that the matching infinitely increases the bias for them! This is not an attractive feature of a matching method – to increase bias in some directions, especially in the context of outcomes, Y, that are commonly assumed to be approximately linearly related to the X variables.
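The EPBR property described above can be written compactly. The notation here is assumed for illustration and is not quoted from the chapter: $\mu_t$ and $\mu_c$ are the covariate mean vectors in the treated and control populations, and $\mu_{t^*}$ and $\mu_{c^*}$ are the expected covariate mean vectors in the matched samples.

```latex
% EPBR: matching shrinks the mean difference by one scalar factor \gamma
\mu_{t^*} - \mu_{c^*} = \gamma \,(\mu_t - \mu_c), \qquad \gamma \in \mathbb{R}.

% Percent reduction in bias, identical for every matching variable X_k
% and for every linear combination \beta^{\top} X with nonzero initial bias:
100\,(1 - \gamma).

% If \beta^{\top}(\mu_t - \mu_c) = 0 before matching, EPBR implies
\beta^{\top}(\mu_{t^*} - \mu_{c^*}) = \gamma\,\beta^{\top}(\mu_t - \mu_c) = 0 ,
% whereas for a non-EPBR method some such \beta has
% \beta^{\top}(\mu_{t^*} - \mu_{c^*}) \neq 0 : a bias created from none.
```

The last two lines restate the argument in the text: a linear combination that is balanced before matching stays balanced under an EPBR method, but can become unbalanced under a method that is not EPBR.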
Chapter 6 goes on to describe classes of matching methods (e.g., caliper methods, inner-product methods) and corresponding distributional conditions on X (e.g., exchangeable, ellipsoidally symmetric) that lead to the methods being EPBR for those distributions. The most generally useful of these methods have turned out to be inner-product methods, including Mahalanobis-metric matching and discriminant matching, which can be a special case of the now widely used propensity score matching.
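To make Mahalanobis-metric matching concrete, here is a minimal sketch restricted to two covariates so that the covariance matrix can be inverted in closed form. Using the pooled within-group sample covariance and processing treated units in their given order are both assumptions made for brevity. All function names and data are hypothetical.

```python
def pooled_cov_2d(groups):
    """Pooled within-group sample covariance (sxx, sxy, syy) for 2-D points."""
    sxx = sxy = syy = 0.0
    dof = 0
    for pts in groups:
        n = len(pts)
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        sxx += sum((p[0] - mx) ** 2 for p in pts)
        sxy += sum((p[0] - mx) * (p[1] - my) for p in pts)
        syy += sum((p[1] - my) ** 2 for p in pts)
        dof += n - 1
    return sxx / dof, sxy / dof, syy / dof


def mahalanobis_sq(a, b, cov):
    """Squared distance (a-b)' S^{-1} (a-b) for a 2x2 covariance S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Closed-form inverse of [[sxx, sxy], [sxy, syy]] applied to (dx, dy).
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def mahalanobis_match(treated, controls):
    """Nearest available matching using the Mahalanobis metric."""
    cov = pooled_cov_2d([treated, controls])
    available = set(range(len(controls)))
    pairs = []
    for i, t in enumerate(treated):
        j = min(available, key=lambda c: (mahalanobis_sq(t, controls[c], cov), c))
        available.remove(j)
        pairs.append((i, j))
    return pairs


# Hypothetical data: each treated unit has an exact twin among the controls.
treated = [(1.0, 10.0), (2.0, 30.0)]
controls = [(0.0, 5.0), (2.0, 30.0), (3.0, 20.0), (1.0, 10.0), (4.0, 50.0)]
pairs = mahalanobis_match(treated, controls)
```

Because the metric is scaled by the covariance matrix, the matches do not depend on the units in which each covariate is recorded, which is one reason such inner-product methods are convenient in practice.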
This volume reprints my publications on matched sampling, or more succinctly, matching, produced during a period of over three decades. My work on matching began just after I graduated college in 1965 and has continued to the present, and beyond, in the sense that there are publications on matching subsequent to those collected here, and I have continuing work in progress on the topic. For most of the years during this period, I believe I was one of the few statistical researchers publishing in this area, and therefore this collection is, I hope, both interesting and historically relevant. In the introduction to each part, I attempt to set the stage for the particular articles in that part. When read together, the part introductions provide a useful overview of developments in matched sampling. In contrast to the earlier years, in the last few years, there have been many other researchers making important contributions to matching. Among these, ones by technically adroit economists and other social scientists are particularly notable, for example: Hahn (1998); Dehejia and Wahba (1999); Lechner (2002); Hansen (2004); Hill, Reiter, and Zanutto (2004); Hirano, Imbens, and Ridder (2004); Imbens (2004); Zhao (2004); Abadie and Imbens (2005); and Diamond and Sekhon (2005). Some of these have had a direct or indirect connection to a course on causal inference I've taught at Harvard for over a decade, sometimes jointly with Guido Imbens.
The chapters in Part VII are more pure applications than the others in this volume. Chapter 23, Witkin et al. (1976), was a study of criminality in XYY and XXY men in Copenhagen, stimulated by proposals by John Mitchell, the U.S. Attorney General under President Richard Nixon, to identify such babies at birth and warn their parents of the potential antisocial tendencies; see the introductory section of this article for the underlying reasoning. This study matched XXY and XYY males to normal XY males on height, IQ, education, and parental socioeconomic status. The method used was simple stratified matching. Further model-based adjustments were also made. This article was a true committee effort, with the different authors of the different parts not always agreeing on what was important to say or how to say it. A careful reading is therefore more confusing and interesting than usual.
Chapter 24, Rubin (1991b), was the written version of a President's Invited Address to the Biometric Society, and reported some matched sample analyses that I was doing for Sandoz Pharmaceuticals. The issue concerned the effects of switching from a name-brand drug to a generic version after the original drug's patent expired. In particular, Sandoz had an antipsychotic drug for which it was investigating evidence that switching from the name-brand to a generic version was potentially harmful. Much of the article concerns how to formulate this causal inference problem correctly, especially how to deal with the post-intervention information.
Abstract: When assignment to treatment group is made solely on the basis of the value of a covariate, X, effort should be concentrated on estimating the conditional expectations of the dependent variable Y given X in the treatment and control groups. One then averages the difference between these conditional expectations over the distribution of X in the relevant population. There is no need for concern about “other” sources of bias, e.g., unreliability of X, unmeasured background variables. If the conditional expectations are parallel and linear, the proper regression adjustment is the simple covariance adjustment. However, since the quality of the resulting estimates may be sensitive to the adequacy of the underlying model, it is wise to search for nonparallelism and nonlinearity in these conditional expectations. Blocking on the values of X is also appropriate, although the quality of the resulting estimates may be sensitive to the coarseness of the blocking employed. In order for these techniques to be useful in practice, there must be either substantial overlap in the distribution of X in the treatment groups or strong prior information.
INTRODUCTION
In some studies, the experimental units are divided into two treatment groups solely on the basis of a covariate, X. By this we mean that if two units have the same value of X either they both must receive the same treatment or they must be randomly assigned (not necessarily with probability 0.5) to treatments.
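The simple covariance adjustment mentioned in the abstract can be sketched as follows, assuming parallel linear regressions of Y on X in the two groups. The estimate is the difference in mean Y minus the pooled within-group slope times the difference in mean X. The function name and the data are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)


def covariance_adjusted_effect(x_t, y_t, x_c, y_c):
    """Difference in mean Y, adjusted by the pooled within-group slope of Y on X."""
    mxt, myt, mxc, myc = mean(x_t), mean(y_t), mean(x_c), mean(y_c)
    sxy = (sum((x - mxt) * (y - myt) for x, y in zip(x_t, y_t))
           + sum((x - mxc) * (y - myc) for x, y in zip(x_c, y_c)))
    sxx = (sum((x - mxt) ** 2 for x in x_t)
           + sum((x - mxc) ** 2 for x in x_c))
    beta = sxy / sxx  # common slope under the parallel-lines assumption
    return (myt - myc) - beta * (mxt - mxc)


# Hypothetical data: Y = 2X + 1 for controls, plus a constant effect of 5
# for treated units; assignment depends only on X (treated if X >= 3).
x_c, y_c = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
x_t, y_t = [3.0, 4.0, 5.0], [12.0, 14.0, 16.0]

raw_difference = mean(y_t) - mean(y_c)
adjusted = covariance_adjusted_effect(x_t, y_t, x_c, y_c)
```

The two groups in this example do not overlap in X at all, so the adjusted estimate is correct only because the linear, parallel model happens to hold exactly. This is the situation in which the abstract warns that either substantial overlap or strong prior information is needed.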
Abstract: Matched sampling is a common technique used for controlling bias in observational studies. We present a general theoretical framework for studying the performance of such matching methods. Specifically, results are obtained concerning the performance of affinely invariant matching methods with ellipsoidal distributions, which extend previous results on equal percent bias reducing methods. Additional extensions cover conditionally affinely invariant matching methods for covariates with conditionally ellipsoidal distributions. These results decompose the effects of matching into one subspace containing the best linear discriminant, and the subspace of variables uncorrelated with the discriminant. This characterization of the effects of matching provides a theoretical foundation for understanding the performance of specific methods such as matched sampling using estimated propensity scores. Calculations for such methods are given in subsequent articles.
BACKGROUND
Matched sampling is a popular and important technique for controlling bias in observational studies. It has received increasing attention in the statistical literature in recent years [Cochran (1968a); Cochran and Rubin (1973); Rubin (1973a, b), (1976b, c), (1979b); Carpenter (1977); and Rosenbaum and Rubin (1983a, 1985a)]. The basic situation has two populations of units, treated (e.g., smokers) and control (e.g., nonsmokers), and a set of observed matching variables X = (X1, …, Xp) (e.g., age, gender, weight). The objective is to compare the distributions of the outcome variables having adjusted for differences in the distributions of X in the two populations. Matched sampling is a way of adjusting for X through data collection.
Abstract: Matched sampling is a standard technique for controlling bias in observational studies due to specific covariates. Since Rosenbaum & Rubin (1983a), multivariate matching methods based on estimated propensity scores have been used with increasing frequency in medical, educational, and sociological applications. We obtain analytic expressions for the effect of matching using linear propensity score methods with normal distributions. These expressions cover cases where the propensity score is either known, or estimated using either discriminant analysis or logistic regression, as is typically done in current practice. The results show that matching using estimated propensity scores not only reduces bias along the population propensity score, but also controls variation of components orthogonal to it. Matching on estimated rather than population propensity scores can therefore lead to relatively large variance reduction, as much as a factor of two in common matching settings where close matches are possible. Approximations are given for the magnitude of this variance reduction, which can be computed using estimates obtained from the matching pools. Related expressions for bias reduction are also presented which suggest that, in difficult matching situations, the use of population scores leads to greater bias reduction than the use of estimated scores.
INTRODUCTION
Matched Sampling
Matched sampling refers to an important class of applied statistical procedures designed to control bias in observational studies (Cochran & Rubin, 1973; Rubin, 1973a, b; McKinlay, 1977; Carpenter, 1977; Rosenbaum & Rubin, 1985a; Kane, 1991).
Abstract: Monte Carlo methods are used to study the efficacy of multivariate matched sampling and regression adjustment for controlling bias due to specific matching variables X when dependent variables are moderately nonlinear in X. The general conclusion is that nearest available Mahalanobis metric matching in combination with regression adjustment on matched pair differences is a highly effective plan for controlling bias due to X.
INTRODUCTION
Our objective is to study the utility of matched sampling and regression adjustment (covariance adjustment) for controlling specific matching variables in observational studies. This introduction is brief; we assume that the reader is familiar with the literature on matching and covariance adjustment in observational studies (e.g., Althauser and Rubin 1970; Billewicz 1964, 1965; Campbell and Erlebacher 1970; Cochran 1953a, 1968; Cochran and Rubin 1973; Gilbert, Light, and Mosteller 1975; Greenberg 1953; Lord 1960; McKinlay 1974, 1975a, b; and Rubin 1974, 1977a, 1978a). In particular, this work is a natural extension of earlier Monte Carlo work on one matching variable (Rubin 1973a, b) and theoretical work on multivariate matching methods (Rubin 1976b, c).
Matched sampling refers to the selection of treatment units (e.g., smokers) and control units (e.g., nonsmokers) that have similar values of matching variables, X (e.g., age, weight), whereas regression adjustment refers to a statistical procedure that adjusts estimates of the treatment effects by estimating the relationship between the dependent variable Y (e.g., blood pressure) and X in each treatment group.
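One reading of regression adjustment on matched pair differences, for a single covariate, is to regress the within-pair differences in Y on the within-pair differences in X and take the intercept as the adjusted estimate. The sketch below follows that reading. The function name, the fallback when no slope can be estimated, and the data are assumptions made for illustration.

```python
def mean(v):
    return sum(v) / len(v)


def matched_pair_adjusted_effect(pairs):
    """pairs: list of ((x_t, y_t), (x_c, y_c)) for matched treated/control units.

    Returns the intercept of the least squares regression of the
    Y differences on the X differences.
    """
    dx = [t[0] - c[0] for t, c in pairs]
    dy = [t[1] - c[1] for t, c in pairs]
    mdx, mdy = mean(dx), mean(dy)
    sxx = sum((d - mdx) ** 2 for d in dx)
    if sxx == 0:
        # All X differences are equal, so no slope can be estimated;
        # fall back to the unadjusted mean difference (an assumption).
        return mdy
    slope = sum((a - mdx) * (b - mdy) for a, b in zip(dx, dy)) / sxx
    return mdy - slope * mdx


def response(x):
    """Hypothetical control response surface, linear in X."""
    return 3.0 * x + 2.0


effect = 1.5  # hypothetical constant treatment effect
matched_x = [(1.1, 1.0), (2.0, 2.2), (3.4, 3.0)]  # (treated X, control X)
pairs = [((xt, response(xt) + effect), (xc, response(xc)))
         for xt, xc in matched_x]

unadjusted = mean([t[1] - c[1] for t, c in pairs])
adjusted = matched_pair_adjusted_effect(pairs)
```

Here the matches are close but not exact, so the unadjusted mean difference still carries the residual imbalance in X. The adjustment removes it because the response is exactly linear in this example, which will not hold for the nonlinear surfaces studied in the paper.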
One of Don Campbell's many influential contributions was to the design of studies to estimate causal effects (e.g., Campbell & Stanley, 1966). He had particular interest in the trade-offs between matching and covariance adjustments (e.g., Campbell & Erlebacher, 1970; Cook & Campbell, 1979). One of the authors (Rubin), in fact, had his first conversation with Don on the topic, more than a quarter of a century ago, having recently completed his Ph.D. thesis under the direction of W. G. Cochran on the potential benefits of matching in observational studies. That author believes that the topic of this chapter, using matching in randomized experiments, would have been of great interest to Don and that this chapter would have benefited from his insightful comments. Moreover, we hope that he would have been pleased to see our example of an educational evaluation that did not have to rely on quasi-experimental techniques.
Randomized designs have been recognized since the ground-breaking work of R. A. Fisher in the early part of the 20th century as the most principled way to identify empirically causal relationships between treatments and outcomes. The strength of the randomized design lies in its ability to create treatment groups that have similar background characteristics on average. Randomization balances not only the observed characteristics but also the unobserved characteristics of the experimental units.
In 1978 I had a Guggenheim Fellowship to pursue work on causal inference. I had recently completed my Annals of Statistics article, Rubin (1978a), dealing with Bayesian inference for causal effects, and I was actively consulting with June Reinisch, Director of the Kinsey Institute, on matching problems with a large Danish cohort (used for illustrative purposes in several articles, including Rosenbaum and Rubin (1985a, b), both reprinted in this part, and Rubin and Thomas (1996, 2000), both reprinted in Part V). Some of the empirical context for this problem appears in Chapter 25 in Part VII.
Starting in the winter of 1978, I was able to take a one-semester leave from ETS to visit Harvard University, and taught a seminar on causal inference. There was an outstanding first-year PhD student who took my seminar and ended up writing his thesis largely under my supervision – Paul Rosenbaum. The combination of his brilliance and our real data problems led to a series of papers on propensity score methods that appear to have started an avalanche of very recent contributions addressing propensity score methods; Google, at the end of 2005, lists close to a million entries for “propensity scores” or “propensity score.”
Chapter 10, Rosenbaum and Rubin (1983a), introduced the propensity score. It provided definitions of balancing scores, propensity scores, and strongly ignorable treatment assignment (a stronger version of ignorable treatment assignment, defined in Rubin (1978a) but implied in Rubin (1976a)).
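For orientation, the three definitions can be stated in their standard form. The notation is assumed here rather than quoted from this introduction: $Z$ is the treatment indicator, $X$ the observed covariates, and $Y(1)$, $Y(0)$ the potential outcomes under treatment and control.

```latex
% Balancing score: a function b(X) such that treatment assignment is
% independent of the covariates given b(X)
X \perp\!\!\!\perp Z \mid b(X)

% Propensity score: the conditional probability of receiving treatment
e(x) = \Pr(Z = 1 \mid X = x)

% Strongly ignorable treatment assignment given X
\bigl(Y(1),\, Y(0)\bigr) \perp\!\!\!\perp Z \mid X,
\qquad 0 < e(x) < 1 \ \text{for all } x
```

The propensity score is itself a balancing score, which is what allows matching or subclassification on the scalar $e(X)$ to stand in for adjustment on all of $X$.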
Each of the chapters in Part VI has a real application as its focus, but each one also includes some new methodological contribution. This is in some contrast to the chapters in Part VII, where the focus is almost entirely on the application and/or providing some general advice based on work done in other articles.
Chapter 19, Holland and Rubin (1988), considers case-control or retrospective studies (e.g., see Breslow and Day, 1980). Such studies, which are often used to investigate rare diseases, do not match treated and control subjects, as with all the methods we have studied here that generate the “prospective” study. Rather, the pair-matched retrospective study matches cases (e.g., infected with a particular disease) to controls (free of that disease) on the basis of background variables. Although this is a very helpful design when searching for possible causes of a disease, such matching is inherently “backward” and is fully correct only under the null hypothesis of no effect of treatment on the disease. Retrospective matching of cases and controls, therefore, does not generally control for the background variables used to create the matches – additional stratification or model-based adjustment is still needed. This fact is pointed out in the article, which also illustrates the point using data from a case-control study of the effect of coffee drinking on myocardial infarction.
This conclusion was previously known but not stated within the context of the “Rubin Causal Model” using potential outcomes and an assignment mechanism.