This book is primarily intended for advanced undergraduates or beginning graduate students in statistics. It should also be of interest to many students and professionals in the social and health sciences. Although written as a textbook, it can be read on its own. The focus is on applications of linear models, including generalized least squares, two-stage least squares, probits and logits. The bootstrap is explained as a technique for estimating bias and computing standard errors.
The contents of the book can fairly be described as what you have to know in order to start reading empirical papers that use statistical models. The emphasis throughout is on the connection—or lack of connection—between the models and the real phenomena. Much of the discussion is organized around published studies; the key papers are reprinted for ease of reference. Some observers may find the tone of the discussion too skeptical. If you are among them, I would make an unusual request: suspend belief until you finish reading the book. (Suspension of disbelief is all too easily obtained, but that is a topic for another day.)
The first chapter contrasts observational studies with experiments, and introduces regression as a technique that may help to adjust for confounding in observational studies. There is a chapter that explains the regression line, and another chapter with a quick review of matrix algebra. (At Berkeley, half the statistics majors need these chapters.)
1. In table 1, there were 837 deaths from other causes in the total treatment group (screened plus refused) and 879 in the control group. Not much different.
Comments. (i) Groups are the same size, so we can look at numbers or rates. (ii) The difference in number of deaths is relatively small, and not statistically significant.
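Comment (ii) can be checked roughly with the sketch below, which computes an approximate z-statistic for the difference in death counts. It rests on an assumption not stated in the text: deaths are rare enough that each count is approximately Poisson, so the variance of a count is about the count itself. With equal group sizes, the group size then cancels out of the statistic. The function name is hypothetical.

```python
import math

def count_difference_z(deaths_a: int, deaths_b: int) -> float:
    """Approximate z-statistic for the difference of two counts
    from groups of equal size.

    Assumption: each count is roughly Poisson, so Var(count) ~ count,
    and the variance of the difference is the sum of the two counts.
    """
    diff = deaths_b - deaths_a
    se = math.sqrt(deaths_a + deaths_b)
    return diff / se

# Deaths from other causes: 837 (treatment) versus 879 (control).
z = count_difference_z(837, 879)
print(round(z, 2))  # about 1.01, well below the usual 1.96 cutoff
```

Under these assumptions the difference is about one standard error, which agrees with the comment that it is not statistically significant.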
2. This comparison is biased. The control group includes women who would have accepted screening if they had been asked, and are therefore comparable to women in the screening group. But the control group also includes women who would have refused screening. The latter are poorer, less well educated, less at risk from breast cancer. (A comparison that includes only the subjects who follow the investigators' treatment plans is called “per protocol analysis,” and is generally biased.)
3. Natural experiment. The fact that the Lambeth Company moved its pipe (i) sets up the comparison with Southwark & Vauxhall (table 2) and (ii) makes it harder to explain the difference in death rates between the Lambeth customers and the Southwark & Vauxhall customers on the basis of some difference between the two groups, other than the water. For instance, people were generally not choosing between the two water companies on the basis of how the water tasted. If they had been, self-selection and confounding would be bigger issues. The change in water intake point is one basis for the view that the data could be analyzed as if they were from a randomized controlled experiment.
Every statistician and data analyst has to make choices. The need arises especially when data have been collected and it is time to think about which model to use to describe and summarise the data. Another choice, often, is whether all measured variables are important enough to be included, for example, to make predictions. Can we make life simpler by only including a few of them, without making the prediction significantly worse?
In this book we present several methods to help make these choices easier. Model selection is a broad area and it reaches far beyond deciding on which variables to include in a regression model.
Two generations ago, setting up and analysing a single model was already hard work, and one rarely went to the trouble of analysing the same data via several alternative models. Thus ‘model selection’ was not much of an issue, apart from perhaps checking the model via goodness-of-fit tests. In the 1970s and later, proper model selection criteria were developed and actively used. With unprecedented versatility and convenience, long lists of candidate models, whether thought through in advance or not, can be fitted to a data set. But this creates problems too. With a multitude of models fitted, it is clear that methods are needed that somehow summarise model fits.
Data can often be modelled in different ways. There might be simple approaches and more advanced ones that perhaps have more parameters. When many covariates are measured we could attempt to use them all to model their influence on a response, or only a subset of them, which would make it easier to interpret and communicate the results. For selecting a model among a list of candidates, Akaike's information criterion (AIC) is among the most popular and versatile strategies. Its essence is a penalised version of the attained maximum log-likelihood, for each model. In this chapter we shall see AIC at work in a range of applications, in addition to unravelling its basic construction and properties. Attention is also given to natural generalisations and modifications of AIC that in various situations aim at performing more accurately.
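The sketch below shows AIC as a penalised maximum log-likelihood. It assumes Gaussian linear regression with polynomial candidates, and it uses the common "smaller is better" form AIC = -2 log L + 2k; some texts use the sign-flipped version, where larger is better. The data, the function name `gaussian_aic` and the candidate list are hypothetical.

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC = -2 * (maximised log-likelihood) + 2 * (number of parameters).

    Assumes a linear model with independent N(0, sigma^2) errors.
    Smaller values indicate a better trade-off between fit and complexity.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n                                  # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1                                         # regression coefficients plus sigma^2
    return -2 * loglik + 2 * k

# Synthetic data from a quadratic trend (illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Candidate models: polynomials of degree 0 to 5.
aic_by_degree = {
    d: gaussian_aic(y, np.vander(x, d + 1, increasing=True)) for d in range(6)
}
best_degree = min(aic_by_degree, key=aic_by_degree.get)
print(aic_by_degree, best_degree)
```

Each extra coefficient must raise the maximised log-likelihood by more than one unit to lower the score, which is how the penalty discourages needlessly large models.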
Information criteria for balancing fit with complexity
In Chapter 1 various problems were discussed where the task of selecting a suitable statistical model, from a list of candidates, was an important ingredient. By necessity there are different model selection strategies, corresponding to different aims and uses associated with the selected model. Most (but not all) selection methods are defined in terms of an appropriate information criterion, a mechanism that uses data to give each candidate model a certain score; this then leads to a fully ranked list of candidate models, from the ostensibly best to the worst.
In this chapter we compare some information criteria with respect to consistency and efficiency, which are classical themes in model selection. The comparison is driven by a study of the ‘penalty’ applied to the maximised log-likelihood value, in a framework with increasing sample size. AIC is not strongly consistent, though it is efficient, while the opposite is true for the BIC. We also introduce Hannan and Quinn's criterion, which has properties similar to those of the BIC, while Mallows's Cp and Akaike's FPE behave like AIC.
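The penalties behind these criteria can be compared directly, as in the sketch below. It uses the standard textbook forms, applied to -2 times the maximised log-likelihood: 2k for AIC, k log n for the BIC, and 2ck log log n for Hannan and Quinn's criterion, where k is the number of parameters and c is a constant usually taken to be greater than 1. The function names and the default value of `c` are assumptions made for illustration.

```python
import math

def aic_penalty(k: int, n: int) -> float:
    """AIC penalty: does not grow with the sample size n."""
    return 2.0 * k

def bic_penalty(k: int, n: int) -> float:
    """BIC penalty: grows like log n."""
    return k * math.log(n)

def hq_penalty(k: int, n: int, c: float = 1.01) -> float:
    """Hannan-Quinn penalty: grows like log log n (needs n >= 3).

    c is a tuning constant, usually taken slightly above 1 (assumed default).
    """
    return 2.0 * c * k * math.log(math.log(n))

# Penalty for a model with k = 3 parameters as the sample size grows.
for n in (20, 100, 1000, 100000):
    print(n, aic_penalty(3, n), round(hq_penalty(3, n), 2), round(bic_penalty(3, n), 2))
```

The fixed AIC penalty is what allows overfitting to persist as n grows. The BIC and Hannan-Quinn penalties both diverge, with Hannan-Quinn growing much more slowly.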
Comparing selectors: consistency, efficiency and parsimony
If we make the assumption that there exists one true model that generated the data and that this model is one of the candidate models, we would want the model selection method to identify this true model. This is related to consistency. A model selection method is weakly consistent if, with probability tending to one as the sample size tends to infinity, the selection method is able to select the true model from the candidate models. Strong consistency is obtained when the selection of the true model happens almost surely. Often, we do not wish to make the assumption that the true model is amongst the candidate models. If instead we are willing to assume that there is a candidate model that is closest in Kullback–Leibler distance to the true model, we can state weak consistency as the property that, with probability tending to one, the model selection method picks such a closest model.
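The small simulation below illustrates the idea of consistency; it is not a proof. It assumes a Gaussian linear regression in which only the first of three covariates matters, with three nested candidate models of which the smallest is the true one. It then estimates how often a criterion with a given penalty per parameter selects the true model. All names and settings (`select_model`, `true_model_rate`, the sample size, the number of repetitions) are hypothetical.

```python
import numpy as np

def select_model(y, X_full, n_columns_list, penalty_per_param):
    """Return the index of the candidate minimising -2 loglik + penalty * k."""
    n = len(y)
    scores = []
    for p in n_columns_list:
        X = X_full[:, :p]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        neg2_loglik = n * (np.log(2 * np.pi * rss / n) + 1)  # Gaussian, ML variance
        scores.append(neg2_loglik + penalty_per_param * (p + 1))
    return int(np.argmin(scores))

def true_model_rate(n, penalty_per_param, reps=200, seed=0):
    """Fraction of simulated data sets in which the true model is selected."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
        y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)  # only the first covariate matters
        # Candidates use the first 2, 3 or 4 columns; index 0 is the true model.
        hits += select_model(y, X, [2, 3, 4], penalty_per_param) == 0
    return hits / reps

n = 2000
aic_rate = true_model_rate(n, penalty_per_param=2.0)        # AIC-type penalty
bic_rate = true_model_rate(n, penalty_per_param=np.log(n))  # BIC-type penalty
print(aic_rate, bic_rate)
```

In this set-up one would expect the BIC-type rate to approach one as n grows, while the AIC-type rate stays bounded away from one because the fixed penalty keeps admitting superfluous covariates.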
This book is about making choices. If there are several possibilities for modelling data, which should we take? If multiple explanatory variables are measured, should they all be used when forming predictions, making classifications, or attempting to summarise analysis of what influences response variables, or will including only a few of them work equally well, or better? If so, which ones can we best include? Model selection problems arrive in many forms and on widely varying occasions. In this chapter we present some data examples and discuss some of the questions they lead to. Later in the book we come back to these data and suggest some answers. A short preview of what is to come in later chapters is also provided.
Introduction
With the current ease of data collection, which in many fields of applied science has become cheaper and cheaper, there is a growing need for methods that point to interesting, important features of the data and that help to build a model. The model we wish to construct should be rich enough to explain relations in the data, but on the other hand simple enough to understand, explain to others, and use. It is when we negotiate this balance that model selection methods come into play. They provide formal support to guide data users in their search for good models, or for determining which variables to include when making predictions and classifications.
The model selection methods presented earlier (such as AIC and the BIC) have one thing in common: they select one single ‘best model’, which should then be used to explain all aspects of the mechanisms underlying the data and predict all future data points. The tolerance discussion in Chapter 5 showed that sometimes one model is best for estimating one type of estimand, whereas another model is best for another estimand. The point of view expressed via the focussed information criterion (FIC) is that a ‘best model’ should depend on the parameter under focus, such as the mean, or the variance, or the particular covariate values, etc. Thus the FIC allows and encourages different models to be selected for different parameters of interest.
Estimators and notation in submodels
In model selection applications there is a list of models to consider. We shall assume here that there is a ‘smallest’ and a ‘biggest’ model among these, and that the others lie between these two extremes. More concretely, there is a narrow model, which is the simplest model that we possibly might use for the data, having an unknown parameter vector θ of length p. Secondly, in the wide model, the largest model that we consider, there are an additional q parameters γ = (γ1, …, γq). We assume that the narrow model is a special case of the wide model, which means that there is a value γ0 such that with γ = γ0 in the wide model, we get precisely the narrow model.
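As one concrete instance of this set-up, consider linear regression with two blocks of covariates. This example is an illustration chosen here and is not taken from the passage above. The covariates $x_i$ are always included, while the additional covariates $z_i$ (of length q) are optional:

```latex
\begin{align*}
\text{wide model:}\quad
  & Y_i = x_i^{\mathrm{t}}\beta + z_i^{\mathrm{t}}\gamma + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathrm{N}(0, \sigma^2), \quad i = 1, \ldots, n, \\
\text{narrow model:}\quad
  & Y_i = x_i^{\mathrm{t}}\beta + \varepsilon_i
  \qquad \text{(the wide model with } \gamma = \gamma_0 = 0\text{)}.
\end{align*}
```

Here $\theta = (\beta, \sigma)$ plays the role of the parameter vector of length p that is present in every candidate model, and each submodel between the two extremes corresponds to including some subset of the components $\gamma_1, \ldots, \gamma_q$ while fixing the rest at zero.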
In this chapter model selection and averaging methods are applied in some usual regression set-ups, like those of generalised linear models and the Cox proportional hazards regression model, along with some less straightforward models for multivariate data. Answers are suggested to several of the specific model selection questions posed about the data sets of Chapter 1. In the process we explain in detail what the necessary key quantities are, for different strategies, and how these are estimated from data. A concrete application of methods for statistical model selection and averaging is often a nontrivial task. It involves a careful listing of all candidate models as well as specification of focus parameters, and there might be different possibilities for estimating some of the key quantities involved in a given selection criterion. Some of these issues are illustrated in this chapter, which is concerned with data analysis and discussion only; for the methodology we refer to earlier chapters.
AIC and BIC selection for Egyptian skull development data
We perform model selection for the data set consisting of measurements on skulls of male Egyptians, living in different time eras; see Section 1.2 for more details. Our interest lies in studying a possible trend in the measurements over time and in the correlation structure between measurements.
Assuming the normal approximation holds, we construct, for each time period and for each of the four measurements, pointwise 95% confidence intervals for the expected average measurement of that variable in that time period.
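A minimal sketch of such pointwise intervals follows, using the usual normal-approximation form: mean plus or minus 1.96 standard errors. The data are synthetic stand-ins, not the skull measurements, and the period labels, group sizes and the function name `pointwise_ci` are hypothetical.

```python
import numpy as np

def pointwise_ci(samples, z=1.96):
    """Normal-approximation confidence interval for the mean of each column.

    samples: array of shape (n_observations, n_variables) for one time period.
    Returns (lower, upper), each of length n_variables.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    mean = samples.mean(axis=0)
    se = samples.std(axis=0, ddof=1) / np.sqrt(n)   # standard error of the mean
    return mean - z * se, mean + z * se

rng = np.random.default_rng(1)
# Synthetic data: 3 periods, 30 observations each, 4 measurements per observation.
periods = {
    name: rng.normal(loc=[131, 133, 99, 50], scale=5, size=(30, 4))
    for name in ["period A", "period B", "period C"]
}
for name, data in periods.items():
    lower, upper = pointwise_ci(data)
    print(name, np.round(lower, 1), np.round(upper, 1))
```

The intervals are pointwise: each covers its own mean with approximate probability 0.95, and no adjustment is made for looking at several periods and variables at once.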