Objective: To test whether exposure to phenobarbital in utero is associated with deficits in intelligence scores in adult men and whether the magnitude of the postnatal effect is mediated by exposure parameters and/or postnatal environmental factors.
Design: Two double-blind studies were conducted on independent samples of adult men prenatally exposed to phenobarbital and matched control samples using different measures of general intelligence. Based on data from control subjects, regression models were built relating intelligence scores to relevant pre-exposure matching variables and age at testing. Models generated predicted scores for each exposed subject. Group mean differences between the individually predicted and observed scores estimated exposure effects.
Setting: Copenhagen, Denmark.
Participants: Exposed subjects were adult men born at the largest hospital in Copenhagen between 1959 and 1961 who were exposed to phenobarbital during gestation via maternal medical treatment and whose mothers had no history of a central nervous system disorder and no treatment during pregnancy with any other psychopharmacological drug. Study 1 included 33 men and study 2, 81 men. Controls were unexposed members of the same birth cohort matched on a wide spectrum of maternal variables recorded prenatally and perinatally. Controls for studies 1 and 2 included 52 and 101 men, respectively.
Main Outcome Measures: In study 1: Wechsler Adult Intelligence Scale (Danish version); in study 2: Danish Military Draft Board Intelligence Test (Børge Priens Prøve).
Results: Men exposed prenatally to phenobarbital had significantly lower verbal intelligence scores (approximately 0.5 SD) than predicted. Lower socioeconomic status and being the offspring of an “unwanted” pregnancy increased the magnitude of the negative effects.
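A minimal sketch of the prediction-based comparison described in the Design above, assuming hypothetical pandas DataFrames `controls` and `exposed` holding the matching variables, age at testing, and an intelligence score; the column names and the use of ordinary least squares via statsmodels are illustrative assumptions, not details taken from the original studies.

```python
# Sketch of the Design above: fit a regression on controls relating the
# outcome to pre-exposure covariates, predict a score for each exposed
# subject, and take the mean observed-minus-predicted difference as the
# estimated exposure effect.  Column names and OLS are illustrative choices.
import pandas as pd
import statsmodels.api as sm

def exposure_effect(controls: pd.DataFrame, exposed: pd.DataFrame,
                    covariates: list[str], outcome: str) -> float:
    # Regression built from control subjects only.
    X_c = sm.add_constant(controls[covariates])
    model = sm.OLS(controls[outcome], X_c).fit()

    # Individually predicted score for each exposed subject.
    X_e = sm.add_constant(exposed[covariates])
    predicted = model.predict(X_e)

    # Group mean difference between observed and predicted scores.
    return float((exposed[outcome] - predicted).mean())
```

A negative value would correspond to exposed subjects scoring below what the control-based model predicts, in the direction of the deficit reported in the Results above.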
Abstract: The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two-dimensional plot.
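In the notation of the abstract, with z the treatment indicator and x the vector of observed covariates, the propensity score and the balancing property referred to above can be written as follows (a standard restatement for concreteness, not a quotation from the paper):

```latex
e(x) = \Pr(z = 1 \mid x), \qquad x \;\perp\; z \mid e(x),
```

that is, treated and control units with the same value of the scalar e(x) have the same distribution of the observed covariates x.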
DEFINITIONS
The Structure of Studies for Causal Effects
Inferences about the effects of treatments involve speculations about the effect one treatment would have had on a unit which, in fact, received some other treatment. We consider the case of two treatments, numbered 1 and 0. In principle, the ith of the N units under study has both a response r1i that would have resulted if it had received treatment 1, and a response r0i that would have resulted if it had received treatment 0. In this formulation, causal effects are comparisons of r1i and r0i, for example r1i – r0i or r1i/r0i. Since each unit receives only one treatment, either r1i or r0i is observed, but not both, so comparisons of r1i and r0i imply some degree of speculation.
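Writing z_i = 1 or 0 for the treatment actually received by unit i, the quantities just described can be summarized as (a restatement of the definitions above):

```latex
\text{unit-level effect: } r_{1i} - r_{0i}, \qquad
\text{average effect: } \frac{1}{N}\sum_{i=1}^{N} (r_{1i} - r_{0i}), \qquad
\text{observed response: } z_i r_{1i} + (1 - z_i)\, r_{0i},
```

which makes explicit that only one of the two potential responses is ever observed for each unit.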
Abstract: Monte Carlo methods are used to study the ability of nearest-available, Mahalanobis-metric matching to make the means of matching variables more similar in matched samples than in random samples.
INTRODUCTION
Matched sampling is a method for selecting treatment units (e.g., smokers) and control units (e.g., nonsmokers) with similar values of matching variables X (e.g., age, sex, family medical history). Rubin (1976b, c) presented analytical work on a class of matching methods for multivariate X, which are called ‘equal percent bias reducing’ (EPBR) because they yield the same percent reduction in expected bias for each matching variable, and thus for each linear combination of the matching variables. If the matching method is not EPBR, then there are different percent reductions in bias for different matching variables; in fact, any non-EPBR matching method increases bias for some linear combinations of the matching variables.
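Stated formally, with η_t and η_c the expected values of X in the treated and control populations and X̄_mc the mean of X in the matched control sample, a matching method is EPBR when (a standard restatement of the definition above, not a quotation)

```latex
E(\bar{X}_t - \bar{X}_{mc}) = \gamma\,(\eta_t - \eta_c)
```

for a scalar γ that is the same for every coordinate of X, so that the percent reduction in bias, 100(1 − γ), is the same for every matching variable and every linear combination of them.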
This communication extends the previous work of Rubin (1976b, c) by presenting Monte Carlo values for the percent reduction in bias when X is bivariate normal, and nearest-available, Mahalanobis-metric, pair matching is used. Related theoretical work on this matching method appears in Carpenter (1977). We chose to study Mahalanobis-metric matching because it is: (i) EPBR under commonly assumed distributional conditions (see Rubin, 1976b, §2); (ii) a natural extension of univariate nearest-available-pair matching methods studied earlier (Rubin, 1973a, b; Cochran and Rubin, 1973), in the sense that it attempts to find pair matches close on all matching variables; (iii) easy to implement in practice by using commonly available computer programs for sorting and calculating sample covariance matrices; and (iv) superior to another EPBR pair-matching method for nonlinear response surfaces (Rubin, 1979b).
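A minimal sketch of nearest-available, Mahalanobis-metric pair matching as studied here: treated units are taken in some order and each is matched, without replacement, to the closest remaining control in the metric defined by the inverse of a sample covariance matrix of X. The arrays, the file-order processing of treated units, and the use of the combined-sample covariance matrix are illustrative assumptions.

```python
# Nearest-available Mahalanobis-metric pair matching (illustrative sketch).
# X_t: (n_t, p) treated covariates; X_c: (n_c, p) control covariates, n_c >= n_t.
import numpy as np

def mahalanobis_pair_match(X_t: np.ndarray, X_c: np.ndarray) -> list[tuple[int, int]]:
    # Metric: inverse of the sample covariance matrix of the matching variables.
    S_inv = np.linalg.inv(np.cov(np.vstack([X_t, X_c]), rowvar=False))
    available = list(range(len(X_c)))
    pairs = []
    for i, x in enumerate(X_t):                       # treated units in file order
        d = x - X_c[available]                        # differences to remaining controls
        dist = np.einsum('ij,jk,ik->i', d, S_inv, d)  # squared Mahalanobis distances
        j = available.pop(int(np.argmin(dist)))       # closest available control, removed
        pairs.append((i, j))
    return pairs
```

The ordering of the treated units and the choice of covariance matrix both affect the resulting matches; the combined-sample covariance used here is only one of several reasonable choices.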
The statistical study of the utility of general methods of matched sampling starts with the simplest setting, that with only one matching variable, X, as in Cochran (1968a). When attempting to control for bias in X, a prominent competitor to matching is regression adjustment, also called covariance adjustment (ANCOVA = analysis of covariance). Therefore, it is important when studying matching to compare the relative merits of the methods, including their combination, that is, regression adjustment on the matched samples. In fact, this topic was the focus of my thesis.
Chapters 3 and 4, Rubin (1973a, b), were originally published back-to-back in Biometrics, and were improved versions of the material in my PhD thesis. The first considered matching on a single normally distributed covariate, X, where the outcome variable, Y, was monotonely but possibly nonlinearly (e.g., exponentially) related to X with possibly nonparallel regressions in the treatment and control groups. The estimand was the average causal effect for the treated group, and the estimator was the simple difference in Y means. The criterion for comparing methods was “percent reduction in the bias” due to selecting samples by pair matching or mean matching on X, rather than random sampling. The conditions considered involved various treatment sample sizes and ratios of control to treatment sample sizes, as well as various mean differences on X between treatment and control groups, and ratios of variance of X in the treatment and control groups.
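In this setting the criterion can be written explicitly: with τ the average causal effect for the treated group, Ȳ_mc the mean outcome in the matched control sample and Ȳ_rc the mean in a random control sample, the percent reduction in bias of the simple difference-in-means estimator is (a standard restatement, not a quotation)

```latex
100 \left( 1 - \frac{E(\bar{Y}_t - \bar{Y}_{mc}) - \tau}{E(\bar{Y}_t - \bar{Y}_{rc}) - \tau} \right).
```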
Abstract: The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Previous theoretical arguments have shown that subclassification on the propensity score will balance all observed covariates. Subclassification on an estimated propensity score is illustrated, using observational data on treatments for coronary artery disease. Five subclasses defined by the estimated propensity score are constructed that balance 74 covariates, and thereby provide estimates of treatment effects using direct adjustment. These subclasses are applied within subpopulations, and model-based adjustments are then used to provide estimates of treatment effects within these subpopulations. Two appendixes address theoretical issues related to the application: the effectiveness of subclassification on the propensity score in removing bias, and balancing properties of propensity scores with incomplete data.
INTRODUCTION: SUBCLASSIFICATION AND THE PROPENSITY SCORE
Adjustment by Subclassification in Observational Studies
In observational studies for causal effects, treatments are assigned to experimental units without the benefits of randomization. As a result, treatment groups may differ systematically with respect to relevant characteristics and, therefore, may not be directly comparable. One commonly used method of controlling for systematic differences involves grouping units into subclasses based on observed characteristics, and then directly comparing only treated and control units who fall in the same subclass. Obviously such a procedure can only control the bias due to imbalances in observed covariates.
Cochran (1968a) presents an example in which the mortality rates of cigarette smokers, cigar/pipe smokers, and nonsmokers are compared after subclassification on the covariate age.
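A minimal sketch of subclassification with direct adjustment in the spirit of this passage (and of the five propensity-score subclasses in the preceding abstract): units are grouped into subclasses on an observed scalar, treated-minus-control mean differences are computed within each subclass, and the differences are combined with weights proportional to subclass size. The variable names, the quantile rule, and the default of five subclasses are illustrative assumptions.

```python
# Subclassification on a scalar (e.g., age or an estimated propensity score)
# followed by direct adjustment: a weighted average of within-subclass
# treated-minus-control mean differences.  Illustrative sketch; assumes every
# subclass contains both treated and control units.
import numpy as np

def direct_adjustment(score, treated, outcome, n_subclasses=5):
    score, treated, outcome = map(np.asarray, (score, treated, outcome))
    # Subclass boundaries at quantiles of the score distribution.
    cuts = np.quantile(score, np.linspace(0, 1, n_subclasses + 1))
    sub = np.clip(np.searchsorted(cuts, score, side='right') - 1, 0, n_subclasses - 1)

    effect, total = 0.0, len(score)
    for s in range(n_subclasses):
        in_s = sub == s
        diff = (outcome[in_s & (treated == 1)].mean()
                - outcome[in_s & (treated == 0)].mean())
        effect += diff * in_s.sum() / total      # weight by subclass size
    return effect
```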
Abstract: Multivariate matching methods are commonly used in the behavioral and medical sciences in an attempt to control bias when randomization is not feasible. Some examples of multivariate matching methods are discussed in Althauser and Rubin [1970] and Cochran and Rubin [1973], but such methods have otherwise received little attention in the literature. Here we present examples of multivariate matching methods that will yield the same percent reduction in bias for each matching variable for a variety of underlying distributions. Eleven distributional cases are considered and for each one, matching methods are described which are equal percent bias reducing. The methods discussed in Section 8 will probably be the most generally applicable in practice. These matching methods are based on the values of the sample best linear discriminant or define distance by the inverse of the sample covariance matrix.
INTRODUCTION
In an observational (non-randomized) study the objective is often to determine the effect of a dichotomous treatment variable (e.g., exposure to a specific drug) on several dependent variables (e.g., blood pressure, cholesterol levels). The treatment variable defines two populations of units, P1 (e.g., subjects given the drug) and P2 (subjects not given the drug). Since the treatments were not randomly assigned to the units, estimating the effect of the treatment variable on the dependent variables using random samples from P1 and P2 may be quite biased.
William G. Cochran worked on statistically rich and scientifically important problems. Throughout his career he participated in the design, analysis, and evaluation of statistical studies directed at important real world problems. The field of observational studies is a perfect example of a general topic that Cochran helped to define and advance with many contributions. Cochran's work provides an essential foundation for continuing research in this important area of statistics.
An observational study, for purposes here, is a study intended to assess causal effects of treatments where the rule that governs the assignment of treatments to units is at least partially unknown. Thus a randomized experiment on rats for the effect of smoke inhalation on lung cancer is a controlled experiment rather than an observational study, but an analysis of health records for samples of smokers and nonsmokers from the U.S. population is an observational study. The obvious problem created by observational studies is that there may exist systematic differences between the treatment groups besides treatment exposure, and so any observed differences between the groups (e.g., between smokers and nonsmokers) with respect to an outcome variable (e.g., incidence of lung cancer) might be due to confounding variables (e.g., age, genetic susceptibility to cancer) rather than the treatments themselves. Consequently, a primary objective in the design and analysis of observational studies is to control, through sampling and statistical adjustment, the possible biasing effects of those confounding variables that can be measured; a primary objective in the evaluation of observational studies is to speculate about the remaining biasing effects of those confounding variables that cannot be measured.
Summary. This chapter discusses the nature of probability as it is used to represent both variability and uncertainty in the various approaches to statistical inference. After some preliminary remarks, the way in which a frequency notion of probability can be used to assess uncertainty is reviewed. Then two contrasting notions of probability as representing degree of belief in an uncertain event or hypothesis are examined.
General remarks
We can now consider some issues involved in formulating and comparing the different approaches.
In some respects the Bayesian formulation is the simpler and in other respects the more difficult. Once a likelihood and a prior are specified to a reasonable approximation all problems are, in principle at least, straightforward. The resulting posterior distribution can be manipulated in accordance with the ordinary laws of probability. The difficulties centre on the concepts underlying the definition of the probabilities involved and then on the numerical specification of the prior to sufficient accuracy.
Sometimes, as in certain genetical problems, it is reasonable to think of θ as generated by a stochastic mechanism. There is no dispute that the Bayesian approach is at least part of a reasonable formulation and solution in such situations. In other cases to use the formulation in a literal way we have to regard probability as measuring uncertainty in a sense not necessarily directly linked to frequencies. We return to this issue later.
Summary. This chapter deals in outline with a number of topics that fall outside the main theme of the book. The topics are prediction, decision analysis and point estimation, concentrating especially on estimates that are exactly or approximately unbiased. Finally some isolated remarks are made about methods, especially for relatively complicated models, that avoid direct use of the likelihood.
Prediction
In prediction problems the target of study is not a parameter but the value of an unobserved random variable. In so-called hierarchical models, however, this includes estimating the value of a random parameter attached to a particular portion of the data. In Bayesian theory the formal distinction between prediction and estimation largely disappears in that all unknowns have probability distributions. In frequentist theory the simplest approach is to use Bayes' theorem to find the distribution of the aspect of interest and to replace unknown parameters by good estimates. In special cases more refined treatment is possible.
In the special case when the value Y*, say, to be predicted is conditionally independent of the data given the parameters, the Bayesian solution is particularly simple. A predictive distribution is found by averaging the density fY*(y*; θ) over the posterior distribution of the parameter.
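In symbols, with f_{Θ|Y}(θ | y) the posterior density of the parameter, the predictive density just described is

```latex
f_{Y^* \mid Y}(y^* \mid y) = \int f_{Y^*}(y^*; \theta)\, f_{\Theta \mid Y}(\theta \mid y)\, d\theta,
```

a restatement of the preceding sentence rather than an additional assumption.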
In special cases a formally exact frequentist predictive distribution is obtained by the following device.
Summary. First a number of distinct situations are given in which significance tests may be relevant. The nature of a simple significance test is set out and its implications explored. The relation with interval estimation is emphasized. While most of the discussion is from a frequentist perspective, relations with Bayesian theory are outlined in the final section.
General remarks
So far, in our frequentist discussion, we have summarized information about the unknown parameter ψ by finding procedures that would, in a long run of hypothetical repeated applications, give upper (or lower) bounds for ψ a specified proportion of times. This is close to but not the same as specifying a probability distribution for ψ; it avoids having to treat ψ as a random variable, and moreover as one with a known distribution in the absence of the data.
Suppose now there is specified a particular value ψ0 of the parameter of interest and we wish to assess the relation of the data to that value. Often the hypothesis that ψ = ψ0 is called the null hypothesis and conventionally denoted by H0. It may, for example, assert that some effect is zero or takes on a value given by a theory or by previous studies, although ψ0 does not have to be restricted in that way.
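One standard way of making this assessment precise, in line with the simple significance tests of this chapter, uses a test statistic T whose large values indicate stronger departures from H0; writing t_obs for its observed value, the significance level (p-value) is

```latex
p_{\mathrm{obs}} = \Pr(T \ge t_{\mathrm{obs}}; \, \psi = \psi_0),
```

stated here for concreteness as one common formulation.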
A very thorough account of the history of the more mathematical side of statistics up to the 1930's is given by Hald (1990, 1998, 2006). Stigler (1990) gives a broader perspective and Heyde and Seneta (2001) have edited a series of vignettes of prominent statisticians born before 1900.
Many of the great eighteenth and early nineteenth century mathematicians had some interest in statistics, often in connection with the analysis of astronomical data. Laplace (1749–1827) made extensive use of flat priors and what was then called the method of inverse probability, now usually called Bayesian methods. Gauss (1777–1855) used both this and essentially frequentist ideas, in particular in his development of least squares methods of estimation. Flat priors were strongly criticized by the Irish algebraist Boole (1815–1864) and by later Victorian mathematicians and these criticisms were repeated by Todhunter (1865) in his influential history of probability. Karl Pearson (1857–1936) began as, among other things, an expert in the theory of elasticity, and brought Todhunter's history of that theory to posthumous publication (Todhunter, 1886, 1893).
In one sense the modern era of statistics started with Pearson's (1900) development of the chi-squared goodness of fit test. He assessed this without comment by calculating and tabulating the tail area of the distribution. Pearson had some interest in Bayesian ideas but seems to have regarded prior distributions as essentially frequency distributions.
Summary. Maximum likelihood estimation and related procedures provide effective solutions for a wide range of problems. There can, however, be difficulties leading at worst to inappropriate procedures with properties far from those sketched above. Some of the difficulties are in a sense mathematical pathologies but others have serious statistical implications. The first part of the chapter reviews the main possibilities for anomalous behaviour. For illustration relatively simple examples are used, often with a single unknown parameter. The second part of the chapter describes some modifications of the likelihood function that sometimes allow escape from these difficulties.
Multimodal likelihoods
In some limited cases, notably connected with exponential families, convexity arguments can be used to show that the log likelihood has a unique maximum. More commonly, however, there is at least the possibility of multiple maxima and saddle-points in the log likelihood surface. See Note 7.1.
There are a number of implications. First, proofs of the convergence of algorithms are of limited comfort, in that convergence to a local maximum that is not the overall maximum of the likelihood is unhelpful or worse. Convergence to the global maximum is nearly always required for correct interpretation. When there are two or more local maxima giving similar values of the log likelihood, it will in principle be desirable to know them all; the natural confidence set may consist of disjoint intervals surrounding these local maxima.
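A minimal sketch of one common response to this difficulty: maximize the log likelihood from many starting points and record all distinct local maxima found. The Cauchy location model is used purely as an example of a log likelihood that can have several local maxima; the data and the grid of starting values are hypothetical.

```python
# Multi-start maximization of a possibly multimodal log likelihood.
# Illustration: Cauchy location model, whose log likelihood can have several
# local maxima for small samples.  Data and starting values are hypothetical.
import numpy as np
from scipy.optimize import minimize

y = np.array([-4.2, -0.5, 0.1, 5.3, 5.9])            # hypothetical observations

def neg_log_lik(theta):
    # Cauchy log likelihood with unit scale, up to an additive constant.
    return np.sum(np.log(1.0 + (y - theta) ** 2))

starts = np.linspace(y.min() - 1, y.max() + 1, 25)     # grid of starting values
maxima = []
for t0 in starts:
    res = minimize(neg_log_lik, x0=[t0], method='BFGS')
    theta_hat = float(res.x[0])
    if res.success and not any(abs(theta_hat - m) < 1e-3 for m in maxima):
        maxima.append(theta_hat)                       # keep distinct local maxima

# Report each local maximum with its log likelihood; the global maximum is the
# largest, but nearby local maxima with similar values may also matter.
for m in sorted(maxima):
    print(f"theta = {m:8.4f}   log lik = {-neg_log_lik(np.array([m])):8.4f}")
```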
Summary. This chapter continues the comparative discussion of frequentist and Bayesian arguments by examining rather more complicated situations. In particular several versions of the two-by-two contingency table are compared and further developments indicated. More complicated Bayesian problems are discussed.
General remarks
The previous frequentist discussion, especially that of Chapter 3, yields a theoretical approach which is limited in two senses. It is restricted to problems with no nuisance parameters or ones in which elimination of nuisance parameters is straightforward. An important step in generalizing the discussion is to extend the notion of a Fisherian reduction. Then we turn to a more systematic discussion of the role of nuisance parameters.
By comparison, as noted previously in Section 1.5, a great formal advantage of the Bayesian formulation is that, once the formulation is accepted, all subsequent problems are computational and the simplifications consequent on sufficiency serve only to ease calculations.
General Bayesian formulation
The argument outlined in Section 1.5 for inference about the mean of a normal distribution can be generalized as follows. Consider the model fY|Θ(y | θ), where, because we are going to treat the unknown parameter as a random variable, we now regard the model for the data-generating process as a conditional density. Suppose that Θ has the prior density fΘ(θ), specifying the marginal distribution of the parameter, i.e., in effect the distribution Θ has when the observations y are not available.
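The posterior density then follows by Bayes' theorem applied to these two ingredients (the standard formula, restated here for reference):

```latex
f_{\Theta \mid Y}(\theta \mid y) =
\frac{f_{Y \mid \Theta}(y \mid \theta)\, f_{\Theta}(\theta)}
     {\int f_{Y \mid \Theta}(y \mid \phi)\, f_{\Theta}(\phi)\, d\phi}.
```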
Summary. A different approach to statistical inference is outlined based not on a probabilistic model of the data-generating process but on the randomization used in study design. The implications of this are developed in simple cases, first for sampling and then for the design of experiments.
General remarks
The discussion throughout the book so far rests centrally on the notion of a probability model for the data under analysis. Such a model represents, often in considerably idealized form, the data-generating process. The parameters of interest are intended to capture important and interpretable features of that generating process, separated from the accidental features of the particular data. That is, the probability model is a model of physically generated variability, of course using the word ‘physical’ in some broad sense. This whole approach may be called model-based.
In some contexts of sampling existing populations and of experimental design there is a different approach in which the probability calculations are based on the randomization used by the investigator in the planning phases of the investigation. We call this a design-based formulation.
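A minimal sketch of a design-based calculation of the kind just described: in a completely randomized two-treatment experiment, the reference distribution of a test statistic is obtained by re-assigning the treatment labels exactly as the design would have, with no probability model for the responses. The data and the number of re-randomizations are illustrative.

```python
# Design-based (randomization) test for a completely randomized experiment:
# the reference distribution comes from re-randomizing treatment labels as the
# design would have, not from a model of the responses.  Illustrative sketch.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([7.1, 5.4, 6.3, 8.0, 4.9, 5.1, 6.7, 5.8])   # hypothetical responses
z = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # actual random assignment

def diff_in_means(y, z):
    return y[z == 1].mean() - y[z == 0].mean()

observed = diff_in_means(y, z)
draws = np.array([diff_in_means(y, rng.permutation(z)) for _ in range(10_000)])
p_value = np.mean(np.abs(draws) >= abs(observed))          # two-sided randomization p-value
print(observed, p_value)
```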
Fortunately there is a close similarity between the methods of analysis emerging from the two approaches. The more important differences between them concern interpretation of the conclusions. Despite the close similarities it seems not to be possible to merge a theory of the purely design-based approach seamlessly into the theory developed earlier in the book.