The likelihood function plays a central role in both statistical theory and practice. Basic results about likelihood inference, which we call first order asymptotics, were developed in fundamental work by R. A. Fisher during the 1920s, and now form an essential and widely taught part of both elementary and advanced courses in statistics. It is less well known that Fisher later proposed a more refined approach, which has been developed over the past three decades into a theory of higher order asymptotics. While this theory leads to some extremely accurate methods for parametric inference, accounts of the theory can appear forbidding, and the results may be thought to have little importance for statistical practice.
The purpose of this book is to dispel this view, to show how higher order asymptotics may be applied in realistic examples with very little more effort than is needed for first order procedures, and to compare the resulting improved inferences with those from other approaches. To do this we have collected a range of examples and case studies, provided details on the implementation of higher order approximations, and compared the resulting inference to that based on other methods, usually first order likelihood theory, but where appropriate also methods based on simulation. Our examples are nearly all derived from regression models for discrete or continuous data, but range quite widely over the types of models and inference problems where likelihood methods are applied.
This book is about the statistical analysis of data, and in particular approximations based on the likelihood function. We emphasize procedures that have been developed using the theory of higher order asymptotic analysis and which provide more precise inferences than are provided by standard theory. Our goal is to illustrate their use in a range of applications that are close to many that arise in practice. We generally restrict attention to parametric models, although extensions of the key ideas to semi-parametric and non-parametric models exist in the literature and are briefly mentioned in contexts where they may be appropriate. Most of our examples consist of a set of independent observations, each of which consists of a univariate response and a number of explanatory variables.
Most applications of likelihood inference rely on first order asymptotics, by which we mean the application of the central limit theorem to conclude that the statistics of interest are approximately normally distributed, with mean and variance consistently estimable from the data. There has, however, been great progress over the past twenty-five years or so in the theory of likelihood inference, and two main themes have emerged. The first is that very accurate approximations to the distributions of statistics such as the maximum likelihood estimator are relatively easily derived using techniques adapted from the theory of asymptotic expansions. The second is that even in situations where first order asymptotics is to be used, it is often helpful to use procedures suggested by these more accurate approximations, as they provide modifications to naive approaches that result in more precise inferences.
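As a minimal illustration of this first order recipe (our own sketch, not an example from the book), the following R lines compute a Wald confidence interval for a binomial proportion from the maximum likelihood estimate and its estimated standard error:

# First order inference for a binomial proportion: the maximum likelihood
# estimate is treated as approximately normal, with variance estimated
# from the data (the counts below are made up for illustration).
y <- 37; n <- 120                      # observed successes and trials
p.hat <- y / n                         # maximum likelihood estimate
se <- sqrt(p.hat * (1 - p.hat) / n)    # estimated standard error
p.hat + c(-1, 1) * qnorm(0.975) * se   # approximate 95% confidence interval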
In the examples in later chapters we use parametric models almost exclusively. These models are used to incorporate a key element of statistical thinking: the explicit recognition of uncertainty. In frequentist settings imprecise knowledge about the value of a single parameter is typically expressed through a collection of confidence intervals, or equivalently by computation of the P-values associated with a set of hypotheses. If prior information is available then Bayes' theorem can be employed to perform posterior inference.
In almost every realistic setting, uncertainty is gauged using approximations, the most common of which rely on the application of the central limit theorem to quantities derived from the likelihood function. Not only does likelihood provide a powerful and very general framework for inference, but the resulting statements have many desirable properties.
In this chapter we provide a brief overview of the main approximations for likelihood inference. We present both first order and higher order approximations; first order approximations are derived from limiting distributions, and higher order approximations are derived from further analysis of the limiting process. A minimal amount of theory is given to structure the discussion of the examples in Chapters 3 to 7; more detailed discussion of asymptotic theory is given in Chapter 8.
Scalar parameter
In the simplest situation, observations y1, …, yn are treated as a realization of independent identically distributed random variables Y1, …, Yn whose probability density function f(y; θ) depends on an unknown scalar parameter θ.
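To fix ideas with a concrete f(y; θ) (our own illustration, using an exponential model and made-up data), the log-likelihood ℓ(θ) = Σ log f(yi; θ) and the likelihood root r(θ) = sign(θ̂ − θ)[2{ℓ(θ̂) − ℓ(θ)}]^(1/2), which recurs throughout the book, can be computed in a few lines of R:

# Log-likelihood and likelihood root for an exponential sample with rate theta
# (made-up data; the MLE has the closed form 1 / mean(y)).
y <- c(0.8, 1.5, 0.2, 2.1, 0.9, 1.7, 0.4, 1.1)
loglik <- function(theta) sum(dexp(y, rate = theta, log = TRUE))
theta.hat <- 1 / mean(y)
r <- function(theta) sign(theta.hat - theta) *
       sqrt(2 * (loglik(theta.hat) - loglik(theta)))
r(1)   # first order theory treats r as approximately standard normal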
A superficially appealing way to implement higher order inference procedures would be to write general routines in a computer algebra system such as Maple or Mathematica, designed so that the user need provide the minimum input specific to his or her problem. One would use these routines to derive symbolic expressions for quantities such as r*, and then evaluate these expressions numerically, concealing the hideous details in the computer. Many models have special structure which this approach does not exploit, however, leading to burdensome derivations of intermediate quantities which then simplify enormously, and symbolic computation packages are generally ill-suited for numerical work on the scale needed for applied statistics. Thus although computer algebra systems can be powerful tools for research in higher order asymptotics, those currently available are unsuitable for passing on the fruits of that research to potential users. It is possible to interface separate packages for symbolic and numerical computation, but this halfway house is system-dependent and demands knowledge of advanced features of the packages.
A more practicable approach recognises that many classes of models can be treated in a modular way, so higher order quantities can be expressed using a few elementary building-blocks. In some cases these must be computed specifically for the problem at hand, but the rudimentary symbolic manipulation facilities of environments for numerical computing such as R can then be exploited. A technique that we call pivot profiling can then be used to obtain higher order quantities for the range of interest, by computing them over a fixed grid of values between which they are interpolated.
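A minimal sketch of pivot profiling (our own, with the first order likelihood root standing in for a higher order pivot such as r*): evaluate the pivot over a fixed grid of parameter values, interpolate, and invert the interpolant to read off confidence limits.

# Pivot profiling for an exponential sample: evaluate the (approximately
# standard normal) likelihood root on a grid of theta values, interpolate
# theta as a function of the pivot, and invert for 95% limits.
y <- c(0.8, 1.5, 0.2, 2.1, 0.9, 1.7, 0.4, 1.1)
loglik <- function(theta) sum(dexp(y, rate = theta, log = TRUE))
theta.hat <- 1 / mean(y)
r <- function(theta) sign(theta.hat - theta) *
       sqrt(2 * (loglik(theta.hat) - loglik(theta)))
grid <- seq(0.3, 2.5, length.out = 50)           # fixed grid of theta values
rvals <- sapply(grid, r)                         # pivot over the grid
theta.of.r <- splinefun(rev(rvals), rev(grid))   # interpolate between grid points
theta.of.r(qnorm(c(0.975, 0.025)))               # approximate 95% confidence limits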
In this chapter we give a brief overview of the main theoretical results and approximations used in this book. These approximations are derived from the theory of higher order likelihood asymptotics. We present these fairly concisely, with few details on the derivations. There is a very large literature on theoretical aspects of higher order asymptotics, and the bibliographic notes give guidelines to the references we have found most helpful.
The building blocks for the likelihood approximations are some basic approximation techniques: Edgeworth and saddlepoint approximations to the density and distribution of the sample mean, Laplace approximation to integrals, and some approximations related to the chi-squared distribution. These techniques are summarized in Appendix A, and the reader wishing to have a feeling for the mathematics of the approximations in this chapter may find it helpful to read that first.
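As a small taste of one of these building blocks (a standard textbook example, not taken from Appendix A itself), the Laplace approximation ∫ exp{h(x)} dx ≈ exp{h(x̂)} {2π / (−h''(x̂))}^(1/2), where x̂ maximizes h, recovers Stirling's approximation to the gamma function:

# Laplace approximation applied to Gamma(k) = integral of exp{h(x)} dx,
# with h(x) = (k - 1) * log(x) - x; the maximum of h is at xhat = k - 1.
k <- 10
h <- function(x) (k - 1) * log(x) - x
xhat <- k - 1
h2 <- -(k - 1) / xhat^2                        # second derivative h''(xhat)
laplace <- exp(h(xhat)) * sqrt(2 * pi / (-h2))
c(laplace = laplace, exact = gamma(k))         # relative error below 1%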
We provide background and notation for likelihood, exponential family models and transformation models in Section 8.2 and describe the limiting distributions of the main likelihood statistics in Section 8.3. Approximations to densities, including the very important p* approximation, are described in Section 8.4. Tail area approximations for inference about a scalar parameter are developed in Sections 8.5 and 8.6. These tail area approximations are illustrated in most of the examples in the earlier chapters. Approximations for Bayesian posterior distribution and density functions are described in Section 8.7. Inference for vector parameters, using adjustments to the likelihood ratio statistic, is described in Section 8.8.
In Chapters 4 and 5, we presented a selection of case studies with the goals of emphasizing the application and of illustrating higher order approximations as an adjunct to inference. The selection of our case studies was informed by these twin goals – we used relatively small data sets, and our discussion was sometimes rather remote from the original application.
In this chapter we present more detailed analyses of data collected to address particular scientific problems, with emphasis on the modelling and the conclusions. While we use higher order methods as an adjunct to the analysis, the main focus is on the data analysis rather than the inference methods. These case studies are a subset of examples that have crossed our desks in collaborative work. In Chapter 7 we take the opposite approach, and illustrate a selection of inference problems that are amenable to higher order approximation.
Sections 6.2 and 6.3 present slightly non-standard analysis of binary data; in the first case a natural model leads to the complementary log–log link, and in the second we consider conditional assessment of the binary model, eliminating the parameters in the binary regression by conditioning on the sufficient statistic. In Section 6.4 we present detailed analysis of a published set of herbicide data, with particular emphasis on the nlreg package of the hoa bundle. This package provides an extensive set of diagnostics and plots that are a useful adjunct to first order analysis, as well as providing an implementation of higher order approximations.
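For readers who want a starting point, a binary regression with the complementary log–log link can be fitted in base R as sketched below, with made-up grouped dose–response counts; the higher order counterparts discussed in the text are implemented in the hoa bundle, not in glm itself.

# First order fit of a binomial regression with the complementary log-log link
# (illustrative counts only).
dose  <- c(1, 2, 3, 4, 5, 6)
dead  <- c(1, 3, 6, 10, 16, 19)
alive <- c(19, 17, 14, 10, 4, 1)
fit <- glm(cbind(dead, alive) ~ dose, family = binomial(link = "cloglog"))
summary(fit)$coefficients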
In this chapter we illustrate the breadth of application of higher order asymptotics by presenting a variety of examples, most of which have appeared in the published literature. In contrast to the earlier chapters, the emphasis is on the methods for higher order approximation, with the data treated as mainly illustrative. Section 7.2 outlines a problem of calibration in normal linear regression, and the two succeeding sections outline higher order approximation for a variance components setting and for dependent data, respectively. Sections 7.5 and 7.6 concern a problem of gamma regression; we compare Bartlett correction to Skovgaard's multivariate adjustment to the likelihood ratio statistic, and indicate the use of Laplace approximation for Bayes inference. In Section 7.7 we consider whether it is worthwhile to apply higher order approximation to partial likelihood. The final section concerns use of a constructed exponential family to find the distribution of interest for a randomisation test.
Calibration
Table 7.1 shows measurements of the concentration of an enzyme in human blood plasma. The true concentration x is obtained using an accurate but expensive laboratory method, and the measured concentration y is obtained by a faster and less expensive automatic method. The goal is to use the observed data pairs to estimate values of the true concentration based on further measurements using the less expensive method. This is an example of a calibration problem: we have a model for E(y|x) that depends on some unknown parameters, and use a sample of pairs (x1, y1),…,(xn, yn) to estimate these parameters.
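In the simplest linear version of the problem (our own sketch, with made-up numbers rather than the data of Table 7.1), one fits E(y | x) = a + bx to the calibration pairs and inverts the fitted line at a new measurement y0:

# Linear calibration: estimate a and b from the calibration pairs, then invert
# the fitted line to estimate the true concentration behind a new measurement.
x <- c(1, 2, 4, 6, 8, 10)                # accurate (expensive) determinations
y <- c(1.2, 2.1, 4.3, 5.8, 8.4, 9.9)     # automatic-method measurements
fit <- lm(y ~ x)
a <- coef(fit)[1]; b <- coef(fit)[2]
y0 <- 5.0                                # a new automatic-method measurement
(x0.hat <- unname((y0 - a) / b))         # point estimate of the true concentration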
Consider an educational study with data from students in many schools, predicting in each school the students' grades y on a standardized test given their scores on a pre-test x and other information. A separate regression model can be fit within each school, and the parameters from these schools can themselves be modeled as depending on school characteristics (such as the socioeconomic status of the school's neighborhood, whether the school is public or private, and so on). The student-level regression and the school-level regression here are the two levels of a multilevel model.
In this example, a multilevel model can be expressed in (at least) three equivalent ways as a student-level regression:
A model in which the coefficients vary by school (thus, instead of a model such as y = α + βx + error, we have y = αj + βjx + error, where the subscripts j index schools),
A model with more than one variance component (student-level and school-level variation),
A regression with many predictors, including an indicator variable for each school in the data.
More generally, we consider a multilevel model to be a regression (a linear or generalized linear model) in which the parameters—the regression coefficients—are given a probability model. This second-level model has parameters of its own—the hyperparameters of the model—which are also estimated from data.
The two key parts of a multilevel model are varying coefficients, and a model for those varying coefficients (which can itself include group-level predictors).
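One common way to fit such a model in R is with the lme4 package; the sketch below simulates a small hypothetical school data set (the numbers and variable names are ours, chosen only for illustration) and fits varying intercepts and slopes.

# Varying-intercept, varying-slope model fitted with lme4 (simulated data).
library(lme4)
set.seed(1)
J <- 20; n <- 25                                   # 20 schools, 25 students each
school <- rep(1:J, each = n)
a <- rnorm(J, 50, 5)                               # school-level intercepts
b <- rnorm(J, 0.8, 0.2)                            # school-level slopes
x <- rnorm(J * n, 60, 10)                          # pre-test scores
y <- a[school] + b[school] * x + rnorm(J * n, 0, 5)
fit <- lmer(y ~ x + (1 + x | school))              # y = alpha_j + beta_j * x + error
summary(fit)                                       # fixed effects and variance components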
Statistical graphics are sometimes summarized as “exploratory data analysis” or “presentation” or “data display.” But these only capture part of the story. Graphs are a way to communicate graphical and spatial information to ourselves and others. Long before worrying about how to convince others, you first have to understand what's happening yourself.
Why to graph
Going back through the dozens of examples in this book, what are our motivations for graphing data and fitted models? Ultimately, the goal is communication (to self or others). More immediately, graphs are comparisons (to zero, to other graphs, to horizontal lines, and so forth). We “read” a graph both by pulling out the expected (for example, the slope of a fitted regression line, the comparisons of a series of confidence intervals to zero and each other) and the unexpected.
In our experience, the unexpected is usually not an “outlier” or aberrant point but rather a systematic pattern in some part of the data. For example, consider the binned residual plots in Section 5.6 for the well-switching models. There was an unexpectedly low rate of switching from wells that were just barely over the dangerous level for arsenic, possibly suggesting that people were moderating their decisions when in this ambiguous zone, or that there was other information not included in the model that could explain these decisions.
Analysis of variance (ANOVA) refers to a specific set of methods for data analysis and to a way of summarizing multilevel models:
As a tool for data analysis, ANOVA is typically used to learn the relative importance of different sources of variation in a dataset. For example, Figure 13.8 displays success rates of pilots at a flight simulator under five different treatments at eight different airports. How much of the variation in the data is explained by treatments, how much by airports, and how much remains after these factors have been included in a linear model?
If a multilevel model has already been fit, it can be summarized by the variation in each of its batches of coefficients. For example, in the radon modeling in Chapter 12, how much variation in radon levels is explained by floor of measurement and how much by geographical variation? Or, in the analysis of public opinion by state in Section 14.1, how much of the variation is explained by demographic factors (sex, age, ethnicity, education), and how much by states and regions?
These “analysis of variance” questions can be of interest even for models that are primarily intended for prediction, or for estimating particular regression coefficients.
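A sketch of this second use (our own simulated stand-in for the treatment-by-airport data, again using the lme4 package): fit a multilevel model with a batch of effects for each source of variation and compare the estimated standard deviations.

# ANOVA-style summary of a multilevel model with two crossed batches of effects.
library(lme4)
set.seed(2)
d <- expand.grid(treatment = factor(1:5), airport = factor(1:8))
d$y <- rnorm(5, 0, 1)[as.integer(d$treatment)] +
       rnorm(8, 0, 2)[as.integer(d$airport)] + rnorm(40, 0, 0.5)
fit <- lmer(y ~ 1 + (1 | treatment) + (1 | airport), data = d)
VarCorr(fit)   # estimated standard deviation for each source of variation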
The sections of this chapter address the different roles of ANOVA in multilevel data analysis. We begin in Section 22.1 with a brief review of the goals and methods of classical analysis of variance, outlining how they fit into our general multilevel modeling approach.
Linear regression is a method that summarizes how the average values of a numerical outcome variable vary over subpopulations defined by linear functions of predictors. Introductory statistics and regression texts often focus on how regression can be used to represent relationships between variables, rather than as a comparison of average outcomes. By focusing on regression as a comparison of averages, we are being explicit about its limitations for defining these relationships causally, an issue to which we return in Chapter 9. Regression can be used to predict an outcome given a linear function of these predictors, and regression coefficients can be thought of as comparisons across predicted values or as comparisons among averages in the data.
One predictor
We begin by understanding the coefficients without worrying about issues of estimation and uncertainty. We shall fit a series of regressions predicting cognitive test scores of three- and four-year-old children given characteristics of their mothers, using data from a survey of adult American women and their children (a subsample from the National Longitudinal Survey of Youth).
For a binary predictor, the regression coefficient is the difference between the averages of the two groups
We start by modeling the children's test scores given an indicator for whether the mother graduated from high school (coded as 1) or not (coded as 0).
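The point in the heading above can be checked directly in R (made-up scores, with 1 indicating that the mother completed high school): the coefficient of the binary predictor equals the difference between the two group averages.

# The slope for a binary predictor reproduces the difference in group means.
kid_score <- c(98, 105, 110, 92, 120, 85, 79, 88, 95, 70)
mom_hs    <- c( 1,   1,   1,  1,   1,  0,  0,  0,  0,  0)
coef(lm(kid_score ~ mom_hs))["mom_hs"]
mean(kid_score[mom_hs == 1]) - mean(kid_score[mom_hs == 0])   # the same value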