A statistical model is a probability distribution constructed to enable inferences to be drawn or decisions made from data. This idea is the basis of most tools in the statistical workshop, in which it plays a central role by providing economical and insightful summaries of the information available.
This book is intended as an integrated modern account of statistical models covering the core topics for studies up to a master's degree in statistics. It can be used for a variety of courses at this level and for reference. After outlining basic notions, it contains a treatment of likelihood that includes non-regular cases and model selection, followed by sections on topics such as Markov processes, Markov random fields, point processes, censored and missing data, and estimating functions, as well as more standard material. Simulation is introduced early to give a feel for randomness, and later used for inference. There are major chapters on linear and nonlinear regression and on Bayesian ideas, the latter sketching modern computational techniques. Each chapter has a wide range of examples intended to show the interplay of subject-matter, mathematical, and computational considerations that makes statistical work so varied, so challenging, and so fascinating.
The target audience is senior undergraduate and graduate students, but the book should also be useful for others wanting an overview of modern statistics. The reader is assumed to have a good grasp of calculus and linear algebra, and to have followed a course in probability including joint and conditional densities, moment-generating functions, elementary notions of convergence and the central limit theorem, for example using Grimmett and Welsh (1986) or Stirzaker (1994).
Chapter 4 described methods related to a central notion in inference, namely likelihood. This chapter and the next discuss how those ideas apply to some particular situations, beginning with the simplest model for the dependence of one variable on another, straight-line regression. There is then an account of exponential family distributions, which include many models commonly used in practice, such as the normal, exponential, gamma, Poisson and binomial densities, and which play a central role in statistical theory. We then briefly describe group transformation models, which are also important in statistical theory. This is followed by a description of models for data in the form of lifetimes, which are common in medical and industrial settings, and a discussion of missing data and the EM algorithm.
Straight-Line Regression
We have already met situations where we focus on how one variable depends on others. In such problems there are two or more variables, some of which are regarded as fixed, and others as random. The random quantities are known as responses and the fixed ones as explanatory variables. We shall suppose that only one variable is regarded as a response. Such models, known as regression models, are discussed extensively in Chapters 8, 9, and 10. Here we outline the basic results for the simplest regression model, where a single response depends linearly on a single covariate. We start with an example.
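The straight-line model sketched above can be fitted by least squares in closed form. The following is a minimal illustrative sketch (the function name and the noise-free data are ours, not the book's), using the standard estimates: the slope is the ratio of the sample covariance of x and y to the sample variance of x, and the intercept follows from the sample means.

```python
# Least-squares fit of the straight-line regression model
#   y_j = beta0 + beta1 * x_j + error_j.
# Closed-form estimates:
#   beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   beta0_hat = ybar - beta1_hat * xbar

def fit_line(x, y):
    """Return (intercept, slope) least-squares estimates."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Noise-free illustration: data generated from y = 1 + 2x are recovered exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(x, y)
```

With real data the residuals would of course be nonzero, and inference about the slope and intercept proceeds as described in the chapters on regression.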
The previous chapter outlined likelihood analysis of some standard models. Here we turn to data in which the dependence among the observations is more complex. We start by explaining how our earlier discussion extends to Markov processes in discrete and continuous time. We then extend this to more complex indexing sets and in particular to Markov random fields, in which basic concepts from graph theory play an important role. A special case is the multivariate normal distribution, an important model for data with several responses. We give some simple notions for time series, a very widespread form of dependent data, and then turn to point processes, describing models for rare events in passing.
Markov Chains
In certain applications interest is focused on transitions among a small number of states. A simple example is rainfall modelling, where a sequence … 010011 … indicates whether or not it has rained each day. Another is in panel studies of employment, where many individuals are interviewed periodically about their employment status, which might be full-time, part-time, home-worker, unemployed, retired, and so forth. Here interest will generally focus on how variables such as age, education, family events, health, and changes in the job market affect employment history for each interviewee, so that there are many short sequences of state data taken at unequal intervals, unlike the single long rainfall sequence. In each case, however, the key aspect is that transitions occur amongst discrete states, even though these typically are crude summaries of reality.
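For a sequence such as the rainfall record above, the natural estimate of the transition probability from state r to state s is the observed fraction of transitions out of r that land in s, which is also the maximum likelihood estimate. A minimal sketch, with a made-up 0/1 sequence standing in for real rainfall data:

```python
# Estimating the transition matrix of a finite-state Markov chain from a
# single observed sequence. The estimate of p_{rs} = P(X_t = s | X_{t-1} = r)
# is n_{rs} / n_{r+}: transitions from r to s, divided by transitions out of r.
# The sequence below is illustrative, not real rainfall data.

def transition_matrix(seq, n_states=2):
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, curr in zip(seq, seq[1:]):
        counts[prev][curr] += 1
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs

seq = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # 0 = dry day, 1 = wet day
P = transition_matrix(seq)
```

For the panel-study setting, the same counts would instead be pooled over many short sequences, one per interviewee, with larger state spaces and covariate-dependent transition probabilities.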
The key idea in statistical modelling is to treat the data as the outcome of a random experiment. The purpose of this chapter is to understand some consequences of this: how to summarize and display different aspects of random data, and how to use results of probability theory to appreciate the variation due to this randomness. We outline the elementary notions of statistics and parameters, and then describe how data and statistics derived from them vary under sampling from statistical models. Many quantities used in practice are based on averages or on ordered sample values, and these receive special attention. The final section reviews moments and cumulants, which will be useful in later chapters.
Statistics and Sampling Variation
Data summaries
The most basic element of data is a single observation, y — usually a number, but perhaps a letter, curve, or image. Throughout this book we shall assume that whatever their original form, the data can be recoded as numbers. We shall mostly suppose that single observations are scalar, though sometimes they are vectors or matrices.
We generally deal with an ensemble of n observations, y1, …, yn, known as a sample. Occasionally interest centres on the given sample alone, and if n is not tiny it will be useful to summarize the data in terms of a few numbers. We say that a quantity s = s(y1, …, yn) that can be calculated from y1, …, yn is a statistic.
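Two familiar statistics illustrate the definition: the sample average and the sample median, the latter an example of a quantity based on the ordered sample values. A small sketch with invented numbers:

```python
# Each statistic is a single number s(y_1, ..., y_n) computed from the sample.

def sample_mean(y):
    return sum(y) / len(y)

def sample_median(y):
    ys = sorted(y)              # a statistic based on ordered sample values
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

y = [3.1, 2.4, 5.0, 4.2, 3.8]  # an illustrative sample with n = 5
m = sample_mean(y)
med = sample_median(y)
```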
Every statistical investigation takes place in a context. Information about what question is to be addressed will suggest what data are needed to give useful answers. Before the data are available, one role for this information is to suggest suitable probability models. There may also be information about the values of unknown parameters, and if this can be expressed as a probability density, an approach to inference based on Bayes' theorem is possible. Many statisticians make the stronger claim that this theorem provides the only entirely consistent basis for inference, and insist on its use.
This chapter outlines some aspects of the Bayesian approach to modelling. We first give an account of basic uses of Bayes' theorem and of the role and construction of prior densities. We then turn to inference, dealing with analogues of confidence intervals, tests, approaches to model criticism, and model uncertainty. Until recently computational difficulties placed realistic Bayesian modelling largely out of reach, but over the last 20 years there has been rapid progress and complex models can now be fitted routinely. Section 11.3 gives an account of Bayesian computation, first of analytical approaches based on integral approximations, and then of Monte Carlo methods. The chapter concludes with brief introductions to hierarchical and empirical Bayesian procedures.
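The basic use of Bayes' theorem is simplest to see in a conjugate setting. The following sketch is a standard textbook illustration, not an example taken from the chapter itself: a Beta(a, b) prior density for a binomial success probability p, combined with k successes in n trials, yields a Beta(a + k, b + n − k) posterior.

```python
# Bayes' theorem with a conjugate prior: Beta prior, binomial likelihood,
# Beta posterior. The prior and data below are illustrative assumptions.

def beta_binomial_update(a, b, k, n):
    """Return the posterior parameters (a', b') and the posterior mean of p."""
    a_post, b_post = a + k, b + n - k
    return a_post, b_post, a_post / (a_post + b_post)

# Uniform prior Beta(1, 1); observe 7 successes in 10 trials.
a_post, b_post, mean_post = beta_binomial_update(1, 1, 7, 10)
```

Such closed-form updates are the exception rather than the rule; for realistic models the integrals involved are intractable, which is why the analytical approximations and Monte Carlo methods of Section 11.3 are needed.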
Until now we have confined discussion to scatterplot smoothers. This setting served well to illustrate the main concepts behind smoothing. However, there is a gap between the methodology and the needs of practitioners. As exemplified by the problems described in Chapter 1, most applications of regression involve several predictors. To begin closing the gap, this chapter introduces a class of multiple regression models that have a nonparametric component involving only a single predictor and a parametric component for the other predictors. Having both parametric and nonparametric components means the models are semiparametric. This class of simple semiparametric models is important in its own right but also serves as an introduction to more complex semiparametric regression models of later chapters, where the effects of several predictors are modeled nonparametrically.
Beyond Scatterplot Smoothing
The previous chapter concluded a fairly lengthy description of how to smooth a scatterplot and perform the corresponding inference. In Chapter 3 we described three general approaches: penalized splines, local polynomial fitting, and series approximation. For penalized splines, we presented both an algorithmic approach based on ridge regression and a mixed model approach based on maximum likelihood and best prediction. There are other approaches to scatterplot smoothing that we did not describe at all.
Mixed models are an extension of regression models that allow for the incorporation of random effects. However, they also turn out to be closely related to smoothing. In fact, we will show in Section 4.9 that the penalized spline smoother exactly corresponds to the optimal predictor in a mixed model framework. This link allows for mixed model methodology and software to be used in semiparametric regression analysis, as we will demonstrate in subsequent chapters.
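The ridge-regression view of the penalized spline smoother mentioned in Chapter 3 can be sketched directly. Assuming a truncated-line basis with illustrative knots and data (all of our choosing, not the book's), the fit solves (CᵀC + λD)β = Cᵀy, where D penalizes only the knot coefficients, leaving the linear part unpenalized:

```python
# Penalized spline smoothing as ridge regression, with a truncated-line basis
#   1, x, (x - k_1)_+, ..., (x - k_K)_+,
# penalizing only the knot coefficients. Knots, data, and lambda below are
# illustrative assumptions.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def penalized_spline_fit(x, y, knots, lam):
    # Design matrix: columns 1, x, then one truncated line per knot.
    C = [[1.0, xi] + [max(xi - k, 0.0) for k in knots] for xi in x]
    p = 2 + len(knots)
    CtC = [[sum(C[r][i] * C[r][j] for r in range(len(x))) for j in range(p)]
           for i in range(p)]
    Cty = [sum(C[r][i] * y[r] for r in range(len(x))) for i in range(p)]
    for j in range(2, p):          # ridge penalty on knot coefficients only
        CtC[j][j] += lam
    beta = solve(CtC, Cty)
    fitted = [sum(ci * bi for ci, bi in zip(row, beta)) for row in C]
    return beta, fitted

# Data exactly on the line y = 1 + 2x: the unpenalized linear part absorbs
# the fit, so the knot coefficients shrink to zero and the line is recovered.
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.0 + 2.0 * xi for xi in x]
beta, fitted = penalized_spline_fit(x, y, knots=[1.0, 2.0], lam=1.0)
```

The mixed model correspondence of Section 4.9 amounts to treating the penalized knot coefficients as random effects, so that the ridge solution above coincides with the best predictor.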
This chapter begins with a brief review of mixed models. Readers with detailed knowledge of mixed models could skip these sections and proceed directly to Section 4.9.
Mixed Models
Much of the early work on mixed models – in particular, the special case of variance component models – was motivated by the analysis of data from animal breeding experiments and driven by the need to incorporate heritabilities and genetic correlations in a parsimonious fashion. They have also played an important role in establishing quality control procedures and in the determination of sampling designs, among other applications. Overviews of this vast topic are provided by Searle, Casella, and McCulloch (1992), Vonesh and Chinchilli (1997), Pinheiro and Bates (2000), Verbeke and Molenberghs (2000), and McCulloch and Searle (2001).
A more contemporary application of mixed models is the analysis of longitudinal data sets (see e.g. Laird and Ware 1982; Diggle et al. 2002). We will use this setting to illustrate the essence of mixed modeling.
Figure 4.1 shows two representations of data pertaining to weight measurements of 48 pigs for nine successive weeks. Figure 4.1(a) is simply a scatterplot of the weights against their corresponding week number; in Figure 4.1(b), lines are drawn connecting those measurements that belong to the same pig.
The final draft of this book was written in 2002 and reflects our priorities and views on semiparametric regression at that time. However, the interplay between statistical methodology and applications is currently in a dynamic state. We hope that our coverage of the main ideas of semiparametric regression will serve as a reasonable basis for whatever new directions semiparametric regression takes. In this closing chapter, we note that the approach to semiparametric regression used throughout most of the book can be distilled into just a few basic ideas. We also mention some notable omissions and comment on future directions.
Minimalist Statistics
One of the major themes of this book is the use of the mixed model framework to fit and make inferences concerning a wide variety of semiparametric regression models, though we have intentionally used both the mixed model and more classical GCV methods in our examples. This approach has the advantage of requiring little more than familiarity with mixed model methodology, as outlined in Chapter 4 and Section 10.8. In particular, fitting is achieved through just two fundamental and well-established principles:
(1) estimation of parameters via (restricted) maximum likelihood; and
(2) prediction of random effects via best prediction.
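Principle (2) is simplest to see in the random-intercept model y_ij = μ + u_i + ε_ij, where the best predictor of u_i shrinks each group's mean deviation toward zero. The sketch below takes the variance components as known for clarity; in practice they would be estimated by (restricted) maximum likelihood under principle (1). All numbers are illustrative assumptions.

```python
# Best prediction of random intercepts in y_ij = mu + u_i + e_ij:
#   u_i_hat = shrink_i * (ybar_i - mu),
#   shrink_i = sigma_u^2 / (sigma_u^2 + sigma_e^2 / n_i),
# so groups with fewer observations are shrunk more heavily toward zero.
# Variance components are taken as known here (in practice: REML estimates).

def blup_intercepts(groups, mu, sigma_u2, sigma_e2):
    """Best linear unbiased predictors of the random intercepts."""
    preds = []
    for y in groups:
        n_i = len(y)
        ybar = sum(y) / n_i
        shrink = sigma_u2 / (sigma_u2 + sigma_e2 / n_i)
        preds.append(shrink * (ybar - mu))
    return preds

groups = [[10.0, 12.0, 11.0], [14.0, 15.0]]   # two illustrative groups
u_hat = blup_intercepts(groups, mu=12.0, sigma_u2=1.0, sigma_e2=2.0)
```

The smaller second group receives the stronger shrinkage factor, which is the essence of borrowing strength across groups in a mixed model.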
If there is an important scientific exception to the basic model – such as a predictor being subject to measurement error – then these principles can still be used for fitting, as demonstrated in Chapter 15. However, as seen there and in Section 10.8, maximum likelihood and best prediction are sometimes hindered by the presence of intractable integrals.