In Chapters 18 and 19, we introduced a statistical formalization of causal effects using potential outcomes, focusing on the estimation of average causal effects and interactions using data from controlled experiments. In practice, logistical, ethical, or financial constraints can make it difficult or impossible to externally assign treatments, and simple estimates of the treatment effect based on differences or regressions can be biased when selection into treatment and control groups is not random. To estimate effects when there is imbalance and lack of overlap between treatment and control groups, you should include as regression predictors all the confounders that explain this selection. The present chapter discusses methods for causal inference in the presence of systematic pre-treatment differences between treatment and control groups. A key difficulty is that there can be many pre-treatment variables with mismatch, hence the need for adjustment on many variables.
Statistical inference can be formulated as a set of operations on data that yield estimates and uncertainty statements about predictions and parameters of some underlying process or population. From a mathematical standpoint, these probabilistic uncertainty statements are derived based on some assumed probability model for observed data. In this chapter, we sketch the basics of probability modeling, estimation, bias and variance, and the interpretation of statistical inferences and statistical errors in applied work. We introduce the theme of uncertainty in statistical inference and discuss how it is a mistake to use hypothesis tests or statistical significance to attribute certainty from noisy data.
This introductory chapter lays out the key challenges of statistical inference in general and regression modeling in particular. We present a series of applied examples to show how complex and subtle regression can be, and why a book-length treatment is needed, not just on the mathematics of regression modeling but on how to apply and understand these methods.
This chapter is a departure from the rest of the book, which focuses on data analysis: building, fitting, understanding, and evaluating models fit to existing data. In the present chapter, we consider the design of studies, in particular asking the question of what sample size is required to estimate a quantity of interest to some desired precision. We focus on the paradigmatic inferential tasks of estimating population averages, proportions, and comparisons in sample surveys, or estimating treatment effects in experiments and observational studies. However, the general principles apply for other inferential goals such as prediction and data reduction. We present the relevant algebra and formulas for sample size decisions and demonstrate them with a range of examples, but we also criticize the standard design framework of “statistical power,” which when studied naively yields unrealistic expectations of success and can lead to the design of ineffective, noisy studies. As we frame it, the goal of design is not to attain statistical significance with some high probability, but rather to have a sense, before and after data have been collected, of what can realistically be learned from statistical analysis of an empirical study.
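As a rough illustration of the kind of algebra involved, the following Python sketch (our own, not taken from the chapter) inverts the textbook standard-error formulas sd/√n, √(p(1−p)/n), and sd·√(2/n) to find the sample size needed for a target standard error. The function names are hypothetical, and the formulas assume simple random sampling and, for the comparison, two equal-sized groups with a common standard deviation.

```python
import math


def n_for_mean(sd, target_se):
    """Sample size (rounded up) so that sd / sqrt(n) is about target_se."""
    return math.ceil((sd / target_se) ** 2)


def n_for_proportion(p, target_se):
    """Sample size (rounded up) so that sqrt(p * (1 - p) / n) is about target_se."""
    return math.ceil(p * (1 - p) / target_se ** 2)


def n_for_difference(sd, target_se):
    """Per-group sample size for comparing two equal-sized groups.

    Assumes a common standard deviation sd in both groups, so the
    standard error of the difference in means is sd * sqrt(2 / n).
    """
    return math.ceil(2 * (sd / target_se) ** 2)


# Illustrative inputs only; these numbers are assumptions, not from the text.
print("mean:", n_for_mean(sd=10, target_se=1))
print("proportion:", n_for_proportion(p=0.5, target_se=0.05))
print("difference (per group):", n_for_difference(sd=10, target_se=2))
```

Note that halving the target standard error quadruples the required sample size, and that the guessed standard deviation or proportion drives the answer, which is one reason design calculations deserve skepticism about their inputs.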
So far, we have been interpreting regressions predictively: given the values of several inputs, the fitted model allows us to predict y, typically considering the n data points as a simple random sample from a hypothetical infinite “superpopulation” or probability distribution. Then we can make comparisons across different combinations of values for these inputs. This section of the book considers causal inference, which concerns what would happen to an outcome y as a result of a treatment, intervention, or exposure. This chapter introduces the notation and ideas of causal inference in the context of randomized experiments, which allow clean inference for average causal effects and serve as a starting point for understanding the tools and challenges of causal estimation.
With logistic as with linear regression, fitting is only part of the story. In this chapter we develop more advanced graphics to visualize data and fitted logistic regressions with one or more predictors. We discuss the challenges of interpreting coefficients in the presence of interactions and the use of linear transformations to aid understanding. We show how to make probabilistic predictions and how to average these predictions to obtain summaries–average predictive comparisons–that can be more interpretable than logistic regression coefficients. We discuss the evaluation of fitted models using binned residual plots and predictive errors, and we present all these tools in the context of a worked example. The chapter concludes with a discussion of the use of Bayesian inference and prior distributions to resolve a challenge of inference that arises with sparse discrete data, which again we illustrate with an applied example.
Going forward, there are various ways in which we find it useful in applied work to push against the boundaries of linear regression and generalized linear models. Consider this concluding chapter as an introduction to various methods that we plan to discuss in more detail in the sequel to this book.
It is not always best to fit a regression using data in their raw form. In this chapter we start by discussing linear transformations for standardizing predictors and outcomes in a regression, which connects to “regression to the mean,” discussed earlier in Chapter 6, and its relation to linear transformations and correlation. We then discuss logarithmic and other transformations in the context of a series of examples in which input and outcome variables are transformed and combined in various ways in order to get more understandable models and better predictions. This leads us to more general thoughts about building and comparing regression models in an applied context, which we develop in the context of an additional example.
Simple methods from introductory mathematics and statistics have three important roles in regression modeling. First, linear algebra and simple probability distributions are the building blocks for elaborate models. Second, it is useful to understand the basic ideas of inference separately from the details of particular classes of model. Third, it is often useful in practice to construct quick estimates and comparisons for small parts of a problem–before fitting an elaborate model, or in understanding the output from such a model. This chapter provides a quick review of some of these basic ideas.
As we move from the simple model, y = a + bx + error, to the more general y = β₀ + β₁x₁ + β₂x₂ + ... + error, complexities arise, involving choices of what predictors x to include in the model, interpretations of the coefficients and how they interact, and construction of new predictors from existing variables to capture discreteness and nonlinearity. We need to learn how to build and understand models as new predictors are added. We discuss these challenges through a series of examples illustrated with R code and graphs of data and fitted models.
Linear regression is an additive model, which does not work for binary outcomes, that is, data y that take on the values 0 or 1. To model binary data, we need to add two features to the base model y = a + bx: a nonlinear transformation that bounds the output between 0 and 1 (unlike a + bx, which is unbounded), and a model that treats the resulting numbers as probabilities and maps them into random binary outcomes. This chapter and the next describe one such model, logistic regression, and then in Chapter 15 we discuss generalized linear models, a larger class that includes linear and logistic regression as special cases. In the present chapter, we introduce the mathematics of logistic regression and also its latent-data formulation, in which the binary outcome y is a discretized version of an unobserved or latent continuous measurement z. As with the linear model, we show how to fit logistic regression, interpret its coefficients, and plot data and fitted curves. The nonlinearity of the model increases the challenges of interpretation and model-building, as we discuss in the context of several examples.
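The two ingredients described here, and the latent-data view, can be sketched in a few lines of Python. This is our own minimal illustration rather than code from the chapter: the function names and the coefficient values are assumptions, and the latent version uses the standard fact that thresholding a + bx plus logistic-distributed noise at zero gives Pr(y = 1) = logit⁻¹(a + bx).

```python
import math
import random


def invlogit(x):
    """Inverse logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_binary(a, b, xs, rng):
    """Draw y = 1 with probability invlogit(a + b * x), else y = 0."""
    return [1 if rng.random() < invlogit(a + b * x) else 0 for x in xs]


def simulate_latent(a, b, xs, rng):
    """Latent-data formulation: z = a + b * x + e, and y = 1 if z > 0.

    The error e is drawn from a standard logistic distribution by
    applying the logit function to a uniform draw on (0, 1).
    """
    ys = []
    for x in xs:
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        e = math.log(u / (1.0 - u))
        z = a + b * x + e
        ys.append(1 if z > 0 else 0)
    return ys


# Hypothetical coefficients and a single predictor value, for illustration.
rng = random.Random(123)
a, b = 0.5, 1.0
xs = [0.0] * 20000
p_direct = sum(simulate_binary(a, b, xs, rng)) / len(xs)
p_latent = sum(simulate_latent(a, b, xs, rng)) / len(xs)
print(f"invlogit(a) = {invlogit(a):.3f}")
print(f"direct simulation: {p_direct:.3f}, latent simulation: {p_latent:.3f}")
```

Both simulated proportions should land close to invlogit(a), which is the sense in which the latent-data formulation and the probability formulation describe the same model.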
In this chapter we use our favored technique of fake-data simulation to understand a simple regression model, use a real-data example of height and earnings to warn against unwarranted causal interpretations, and discuss the historical origins of regression as it relates to comparisons and statistical adjustment.
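To make the idea of fake-data simulation concrete, here is a minimal Python sketch of our own (the book itself works in R): choose "true" parameter values, simulate data from y = a + bx + error, fit a line by ordinary least squares, and check whether the estimates recover the values used in the simulation. The parameter values, the range of x, and the function names are all assumptions made for illustration.

```python
import random


def simulate_fake_data(a, b, sigma, n, rng):
    """Simulate n points from y = a + b * x + error, with normal errors."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Assumed "true" values; in a fake-data check we know these by construction.
rng = random.Random(1)
true_a, true_b, true_sigma = 2.0, 0.5, 1.0
xs, ys = simulate_fake_data(true_a, true_b, true_sigma, 1000, rng)
a_hat, b_hat = fit_line(xs, ys)
print(f"true a = {true_a}, estimated a = {a_hat:.2f}")
print(f"true b = {true_b}, estimated b = {b_hat:.2f}")
```

Repeating the simulation with different seeds shows how much the estimates vary from one fake dataset to the next, which is a direct way to see the uncertainty in a fitted regression.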
The previous chapters described causal inference strategies that assume ignorability of exposure or treatment assignment. It is reasonable to be concerned about this assumption, however. After all, when are we really confident that we have measured all confounders? This chapter explores several alternative causal inference strategies that rely on slightly different sets of assumptions that may be more plausible in certain settings. We also discuss the relation between statistical causal inference (estimates of the effects of specified treatments or exposures) and causal explorations or searches for causes of patterns in observed data.