In the usual regression context, predictive inference relates to comparisons between units, whereas causal inference addresses comparisons of different treatments if applied to the same units. More generally, causal inference can be viewed as a special case of prediction in which the goal is to predict what would have happened under different treatment options. Causal interpretations of regression coefficients can only be justified by relying on much stronger assumptions than are needed for predictive inference. As discussed in the previous chapter, controlled experiments are ideal settings for using regression to estimate a treatment effect because the design of data collection guarantees that treatment assignment is independent of the potential outcomes, conditional on the information used in the design. This chapter illustrates the use of regression in the setting of controlled experiments, going through issues of adjustment for pre-treatment predictors, interactions, and pitfalls that can arise when building a regression using experimental data and interpreting coefficients causally.
Simulation of random variables is important in applied statistics for several reasons. First, we use probability models to mimic variation in the world, and the tools of simulation can help us better understand how this variation plays out. Second, we can use simulation to approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures. Third, regression models are not deterministic; they produce probabilistic predictions. Simulation is the most convenient and general way to represent uncertainties in forecasts.
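As a minimal sketch of the second use, here is a Monte Carlo approximation of the sampling distribution of a sample mean; the population model and its parameters are made up for illustration:

```python
import random
import statistics

random.seed(7)

# Hypothetical population model: values ~ Normal(mean=63.7, sd=2.7).
# Approximate the sampling distribution of the mean of n = 100 draws
# by repeating the data-generation process many times.
n_sims, n = 1000, 100
sim_means = []
for _ in range(n_sims):
    sample = [random.gauss(63.7, 2.7) for _ in range(n)]
    sim_means.append(statistics.mean(sample))

# The spread of the simulated means approximates the theoretical
# standard error, sd / sqrt(n) = 2.7 / 10 = 0.27.
se_approx = statistics.stdev(sim_means)
print(round(se_approx, 2))
```

The same loop structure propagates to more complicated estimates: replace `statistics.mean` with any statistical procedure and the simulated spread approximates that procedure's sampling variability.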
As discussed in Chapter 1, regression is fundamentally a technology for predicting an outcome y from inputs x_1, x_2, .... In this chapter we introduce regression in the simple (but not trivial) case of a linear model predicting a continuous y from a single continuous x, thus fitting the model y_i = a + b x_i + error to data (x_i, y_i), i = 1, ..., n. We demonstrate with an applied example that includes the steps of fitting the model, displaying the data and fitted line, and interpreting the fit. We then show how to check the fitting procedure using fake-data simulation, and the chapter concludes with an explanation of how linear regression includes simple comparison as a special case.
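The fake-data check described above can be sketched in a few lines: simulate data from known parameters, fit, and see whether the procedure approximately recovers them. This is an illustrative least-squares version (the book itself fits with Bayesian routines); the true values a = 0.2, b = 0.3 are arbitrary:

```python
import random

def fit_line(x, y):
    # Least-squares estimates of a and b in y = a + b*x + error.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Fake-data simulation: generate data from known parameters ...
random.seed(1)
a_true, b_true, sigma = 0.2, 0.3, 0.5
x = [i / 10 for i in range(100)]
y = [a_true + b_true * xi + random.gauss(0, sigma) for xi in x]

# ... then check that the fit recovers them to within simulation error.
a_hat, b_hat = fit_line(x, y)
print(round(a_hat, 2), round(b_hat, 2))
```

If the recovered estimates were far from the assumed truth across repeated simulations, that would indicate a bug in the fitting procedure.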
Most of this book is devoted to examples and tools for the practical use and understanding of regression models, starting with linear regression with a single predictor and moving to multiple predictors, nonlinear models, and applications in prediction and causal inference. In this chapter, we lay out some of the mathematical structure of inference for regression models and some algebra to help you understand estimation for linear regression. We also explain the rationale for the use of the Bayesian fitting routine stan_glm and its connection to classical linear regression. This chapter thus provides background and motivation for the mathematical and computational tools used in the rest of the book.
The present chapter considers two sorts of operations that are done as adjuncts to fitting a regression. In poststratification, the outputs from a fitted model are combined to make predictions about a new population that can differ systematically from the data. The model allows us to adjust for differences between sample and population, as long as the relevant adjustment variables are included as predictors in the regression and their distribution is known in the target population. Poststratification is a form of post-processing of inferences that is important in survey research and also arises in causal inference for varying treatment effects, as discussed in subsequent chapters. In contrast, missing-data analysis is a pre-processing step in which data are cleaned or imputed so that they can be used more easily in a statistical analysis. This chapter introduces the basic ideas of poststratification and missing-data imputation using a mix of real and simulated-data examples.
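In its simplest form, poststratification is a weighted average: estimate within each cell, then weight the cell estimates by the cells' known shares of the target population. A toy two-cell sketch (all numbers hypothetical):

```python
# Suppose a survey over-represents one group. Estimate within each cell,
# then reweight by the population's cell shares (known, e.g., from a census).
cell_estimates = {"men": 0.30, "women": 0.40}     # estimated support by cell
population_shares = {"men": 0.48, "women": 0.52}  # known population shares

poststrat = sum(cell_estimates[c] * population_shares[c]
                for c in cell_estimates)
print(round(poststrat, 3))  # 0.352
```

The sample's own mix of men and women never enters the final average; only the population shares do, which is what corrects for the sampling imbalance.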
We can apply the principle of logistic regression, taking a linear “link function” y = a + bx and extending it through a nonlinear transformation and a probability model, to allow it to predict bounded or discrete data of different forms. This chapter presents this generalized linear modeling framework and goes through several important special cases, including Poisson or negative binomial regression for count data, the logistic-binomial and probit models, ordered logistic regression, robust regression, and some extensions. As always, we explain these models with a variety of examples, with graphs of data and fitted models along with associated R code, with the goal that you should be able to build, fit, understand, and evaluate these models on new problems.
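The logistic case of the nonlinear transformation can be shown directly: the inverse logit maps an unbounded linear predictor a + bx onto a probability in (0, 1). The coefficients below are made up for illustration, not from any fitted model:

```python
import math

def inv_logit(z):
    # Inverse of the logit link: maps the linear predictor to a probability.
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients a = -1.0, b = 0.5:
a, b = -1.0, 0.5
probs = {x: round(inv_logit(a + b * x), 2) for x in [0, 2, 4]}
print(probs)  # {0: 0.27, 2: 0.5, 4: 0.73}
```

The other special cases in the chapter swap in different transformations and probability models, for example exp() with a Poisson distribution for count data, while keeping the same linear predictor underneath.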
In this chapter we turn to the assumptions of the regression model, along with diagnostics that can be used to assess whether some of these assumptions are reasonable. Some of the most important assumptions rely on the researcher’s knowledge of the subject area and may not be directly testable from the available data alone. Hence, it is good to understand the ideas underlying the model, while recognizing that there is no substitute for engagement with data and the purposes for which they are being used. We show different sorts of plots of data, fitted models, and residuals, developing these methods in the context of real and simulated-data examples. We consider diagnostics based on predictive simulation from the fitted model, along with numerical summaries of fit, including residual error, explained variance, external validation, and cross validation. The goal is to develop a set of tools that you can use in constructing, interpreting, and evaluating regression models with multiple predictors.
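Two of the numerical summaries mentioned, residual error and explained variance, can be computed directly from residuals. A toy example with made-up data and pretend fitted coefficients:

```python
import statistics

# Hypothetical data and coefficients (pretend a, b came from a fitted regression).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = 0.1, 1.96

# Residuals: observed minus predicted.
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residual standard deviation, and R^2 = 1 - (residual variance / outcome variance).
sigma_hat = statistics.stdev(resid)
r_squared = 1 - statistics.variance(resid) / statistics.variance(y)
print(round(sigma_hat, 2), round(r_squared, 3))
```

These in-sample summaries are optimistic about predictive performance, which is why the chapter also covers external validation and cross validation.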
Bayesian inference involves three steps that go beyond classical estimation. First, the data and model are combined to form a posterior distribution, which we typically summarize by a set of simulations of the parameters in the model. Second, we can propagate uncertainty in this distribution; that is, we can get simulation-based predictions for unobserved or future outcomes that account for uncertainty in the model parameters. Third, we can include additional information into the model using a prior distribution. The present chapter describes all three of these steps in the context of examples capturing challenges of prediction and inference.
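The second step, propagating parameter uncertainty into predictions, can be sketched with simulations. Here the "posterior draws" are faked around hypothetical point estimates rather than produced by a real fit, just to show the mechanics:

```python
import random
random.seed(2)

# Step 1 (assumed already done): posterior for (a, b, sigma) of a linear model,
# summarized by 1000 simulation draws. We mimic such draws around
# hypothetical estimates a = 46.3, b = 3.0, sigma = 3.9.
post = [(random.gauss(46.3, 1.7),
         random.gauss(3.0, 0.7),
         abs(random.gauss(3.9, 0.3)))
        for _ in range(1000)]

# Step 2: predict the outcome for a new case with x = 2.0; each posterior
# draw of (a, b, sigma) yields one simulated future outcome.
x_new = 2.0
pred = [a + b * x_new + random.gauss(0, s) for a, b, s in post]
pred.sort()
interval = (round(pred[25], 1), round(pred[974], 1))
print(interval)  # approximate 95% predictive interval
```

Because each prediction uses a different draw of (a, b, sigma), the interval reflects both parameter uncertainty and the model's residual error, which a plug-in point estimate would understate.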
Before fitting a model, though, it is a good idea to understand where your numbers are coming from. The present chapter demonstrates through examples how to use graphical tools to explore and understand data and measurements.
The triangle packing number ν(G) of a graph G is the maximum size of a set of edge-disjoint triangles in G. Tuza conjectured that in any graph G there exists a set of at most 2ν(G) edges intersecting every triangle in G. We show that Tuza’s conjecture holds in the random graph G = G(n, m) when m ⩽ 0.2403 n^{3/2} or m ⩾ 2.1243 n^{3/2}. This is done by analysing a greedy algorithm for finding large triangle packings in random graphs.
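The greedy idea, repeatedly take a triangle whose edges are all still unused and remove its edges, can be sketched in a few lines. This is an illustrative implementation of the general greedy strategy, not the specific process analysed in the paper:

```python
def greedy_triangle_packing(n, edges):
    # Greedily build a set of edge-disjoint triangles: scan the edges, and
    # whenever an edge still-unused edge closes a triangle with two other
    # unused edges, take that triangle and remove its three edges.
    edge_set = {frozenset(e) for e in edges}
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    packing = []
    for u, v in edges:
        if frozenset((u, v)) not in edge_set:
            continue  # edge already used by an earlier triangle
        for w in adj[u] & adj[v]:
            tri = [frozenset((u, v)), frozenset((u, w)), frozenset((v, w))]
            if all(t in edge_set for t in tri):
                packing.append((u, v, w))
                for t in tri:
                    edge_set.remove(t)
                break
    return packing

# On the complete graph K4, the greedy packing has size 1, which is optimal
# there: any two triangles in K4 share an edge.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(greedy_triangle_packing(4, k4)))  # 1
```

The paper's analysis concerns how large such greedy packings get in G(n, m); the sketch above only illustrates the packing step itself.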
A celebrated theorem of Pippenger states that any almost regular hypergraph with small codegrees has an almost perfect matching. We show that one can find such an almost perfect matching which is ‘pseudorandom’, meaning that, for instance, the matching contains as many edges from a given set of edges as predicted by a heuristic argument.
In this study, we propose new risk measures from a regulator’s perspective on the regulatory capital requirements. The proposed risk measures possess many desired properties, including monotonicity, translation-invariance, positive homogeneity, subadditivity, nonnegative loading, and stop-loss order preserving. The new risk measures not only generalize the existing, well-known risk measures in the literature, including the Dutch, tail value-at-risk (TVaR), and expectile measures, but also provide new approaches to generate feasible and practical coherent risk measures. As examples of the new risk measures, TVaR-type generalized expectiles are investigated in detail. In particular, we present the dual and Kusuoka representations of the TVaR-type generalized expectiles and discuss their robustness with respect to the Wasserstein distance.
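For concreteness, the TVaR component of these measures has a standard empirical estimator: average the losses beyond the p-quantile. A toy computation (the data and the simple index-based quantile rule are illustrative, not the paper's construction):

```python
def tvar(losses, p):
    # Empirical tail value-at-risk at level p: the mean of the worst
    # (1 - p) fraction of losses, using a simple index-based cutoff.
    s = sorted(losses)
    k = int(p * len(s))
    tail = s[k:]
    return sum(tail) / len(tail)

losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(tvar(losses, 0.8))  # mean of the worst 20%: (9 + 10) / 2 = 9.5
```

The generalized expectiles studied in the paper build on this tail-averaging idea while retaining the coherence properties listed above.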
This paper relaxes assumptions defining multivariate Brownian motion (BM) to construct processes with dependent increments as tractable models for problems in engineering and management science. We show that any Gaussian Markov process starting at zero and possessing stationary increments and a symmetric smooth kernel has a parametric kernel of a particular form, and we derive the unique unbiased, jointly sufficient, maximum-likelihood estimators of those parameters. As an application, we model a single-server queue driven by such a process and derive its transient distribution conditional on its history.
Social jetlag (SJ) occurs when sleep-timing irregularities from social or occupational demands conflict with endogenous sleep–wake rhythms. SJ is associated with evening chronotype and poor mental health, but mechanisms supporting this link remain unknown. Impaired ability to retrieve extinction memory is an emotion regulatory deficit observed in some psychiatric illnesses. Thus, SJ-dependent extinction memory deficits may provide a mechanism for poor mental health. To test this, healthy male college students completed 7–9 nights of actigraphy, sleep questionnaires, and a fear conditioning and extinction protocol. As expected, greater SJ, but not total sleep time discrepancy, was associated with poorer extinction memory. Unexpectedly, greater SJ was associated with a tendency toward morning rather than evening chronotype. These findings suggest that deficient extinction memory represents a potential mechanism linking SJ to psychopathology and that SJ is particularly problematic for college students with a greater tendency toward a morning chronotype.