In this chapter we give an account of the main ideas of decision theory. Our motivation for beginning our account of statistical inference here is simple. As we have noted, decision theory requires formal specification of all elements of an inference problem, so starting with a discussion of decision theory allows us to set up notation and basic ideas that run through the remainder of the book in a formal but accessible manner. In later chapters, we will develop the specific techniques of statistical inference that are central to the three paradigms of inference. In many cases these techniques can be seen as involving the removal of certain elements of the decision theory structure, or a focus on particular elements of that structure.
Central to decision theory is the notion of a set of decision rules for an inference problem. Comparison of different decision rules is based on examination of the risk functions of the rules. The risk function describes the expected loss in use of the rule, under hypothetical repetition of the sampling experiment giving rise to the data x, as a function of the parameter of interest. Identification of an optimal rule requires introduction of fundamental principles for discrimination between rules, in particular the minimax and Bayes principles.
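In symbols (a sketch in generic notation, which may differ in detail from that adopted in the chapter), for a decision rule d and loss function L the risk function is the expected loss under repeated sampling from the model f(x; θ),
$$
R(\theta, d) \;=\; \mathbb{E}_{\theta}\bigl\{L\bigl(\theta, d(X)\bigr)\bigr\} \;=\; \int L\bigl(\theta, d(x)\bigr)\, f(x; \theta)\, dx .
$$
A minimax rule minimises the worst-case risk sup_θ R(θ, d) over rules d, while a Bayes rule with respect to a prior density π minimises the Bayes risk ∫ R(θ, d) π(θ) dθ.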
In Chapter 8, we sketched the asymptotic theory of likelihood inference. It is our primary purpose in this chapter to describe refinements to that asymptotic theory, our discussion having two main origins. One motivation is to improve on the first-order limit results of Chapter 8, so as to obtain approximations whose asymptotic accuracy is higher by one or two orders. The other is the Fisherian proposition that inferences on the parameter of interest should be obtained by conditioning on an ancillary statistic, rather than from the original model. We also introduce in this chapter some rather more advanced ideas of statistical theory, which provide an important underpinning for statistical methods applied in many contexts.
Some mathematical preliminaries are described in Section 9.1, most notably the notion of an asymptotic expansion. The concept of parameter orthogonality and its consequences for inference are discussed in Section 9.2. Section 9.3 describes ways of dealing with a nuisance parameter, through the notions of marginal likelihood and conditional likelihood. Parametrisation invariance (Section 9.4) provides an important means of discrimination between different inferential procedures. Two particularly important forms of asymptotic expansion, Edgeworth expansion and saddlepoint expansion, are described in Section 9.5 and Section 9.6 respectively. The Laplace approximation method for approximation of integrals is described briefly in Section 9.7. The remainder of the chapter is concerned more with inferential procedures. Section 9.8 presents a highlight of modern likelihood-based inference: the so-called p* approximation to the conditional density of a maximum likelihood estimator, given an ancillary statistic.
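For orientation, the p* formula has the following shape (a sketch in one standard form, with l the log-likelihood, j the observed information, a the ancillary statistic and c a normalising constant; the precise statement and regularity conditions are given in Section 9.8):
$$
p^{*}(\hat{\theta} \mid a; \theta) \;=\; c(\theta, a)\,\bigl|\, j(\hat{\theta})\,\bigr|^{1/2} \exp\bigl\{\, l(\theta) - l(\hat{\theta})\,\bigr\}.
$$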
This chapter is concerned primarily with point estimation of a parameter θ. For many parametric problems, including in particular problems with exponential families, it is possible to summarise all the information about θ contained in a random variable X by a function T = T(X), which is called a sufficient statistic. The implication is that any reasonable estimator of θ will be a function of T(X). However, there are many possible sufficient statistics – we would like to use the one which summarises the information as efficiently as possible. This is called the minimal sufficient statistic, which is essentially unique. Completeness is a technical property of a sufficient statistic. A sufficient statistic which is also complete must be minimal sufficient (the Lehmann–Scheffé Theorem). Another feature of a complete sufficient statistic T is that, if some function of T is an unbiased estimator of θ, then it must be the unique unbiased estimator which is a function of a sufficient statistic. The final section of the chapter demonstrates that, when the loss function is convex (including, in particular, the case of squared error loss), there is a best unbiased estimator, which is a function of the sufficient statistic, and that, if the sufficient statistic is also complete, this estimator is unique. In the case of squared error loss this is equivalent to the celebrated Rao–Blackwell Theorem on the existence of minimum variance unbiased estimators.
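Two of the results summarised above can be stated compactly in generic notation (a sketch, not the chapter's own statement). By the factorisation criterion, T = T(X) is sufficient for θ if and only if the model function factorises as
$$
f(x; \theta) \;=\; g\bigl(T(x), \theta\bigr)\, h(x);
$$
and, in the Rao–Blackwell direction, if an unbiased estimator of θ with finite variance is replaced by its conditional expectation given a sufficient statistic T, the result is again unbiased with variance no larger for any θ:
$$
\hat{\theta} \;=\; \mathbb{E}\bigl\{\tilde{\theta}(X) \mid T\bigr\}
\quad\Longrightarrow\quad
\operatorname{Var}_{\theta}\bigl(\hat{\theta}\bigr) \;\le\; \operatorname{Var}_{\theta}\bigl(\tilde{\theta}\bigr).
$$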
The focus of our discussion so far has been inference for the unknown parameter of the probability distribution assumed to have generated the sample data. Sometimes, interest lies instead in assessing future, unobserved values from the same probability distribution, typically the next observation. We saw in Section 3.9 that in a Bayesian approach such prediction is easily accommodated, since there all unknowns are regarded as random variables, so that the distinction between an unknown constant (parameter) and a future observation (random variable) disappears. However, a variety of other approaches to prediction have been proposed.
The prediction problem is as follows. The data x are the observed value of a random variable X with density f(x; θ), and we wish to predict the value of a random variable Z, which, conditionally on X = x, has distribution function G(z | x; θ), depending on θ.
As a simple case, we might have X formed from independent and identically distributed random variables X1, …, Xn, and Z a further, independent observation from the same distribution. A more complicated example is that of time series prediction, where the observations are correlated and prediction of a future value depends directly on the observed value as well as on any unknown parameters that have to be estimated. Example 10.2 is a simple case of time series prediction.
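In the Bayesian approach recalled above, for instance, prediction in this setting rests on the predictive density of Z given x, obtained by averaging the conditional density of Z over the posterior distribution of θ (a sketch, writing g(z | x; θ) for the density corresponding to G and π(θ | x) for the posterior):
$$
p(z \mid x) \;=\; \int_{\Theta} g(z \mid x; \theta)\, \pi(\theta \mid x)\, d\theta .
$$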
In statistical inference, experimental or observational data are modelled as the observed values of random variables, to provide a framework from which inductive conclusions may be drawn about the mechanism giving rise to the data.
We wish to analyse observations x = (x1, …, xn) by:
1. Regarding x as the observed value of a random variable X = (X1, …, Xn) having an (unknown) probability distribution, conveniently specified by a probability density, or probability mass function, f(x).
2. Restricting the unknown density to a suitable family or set F. In parametric statistical inference, f(x) is of known analytic form, but involves a finite number of real unknown parameters θ = (θ1, …, θd). We specify the region Θ ⊆ ℝd of possible values of θ, the parameter space. To denote the dependency of f(x) on θ, we write f(x; θ) and refer to this as the model function. Alternatively, the data could be modelled non-parametrically, a non-parametric model simply being one which does not admit a parametric representation. We will be concerned almost entirely in this book with parametric statistical inference.
The objective that we then assume is that of assessing, on the basis of the observed data x, some aspect of θ, which for the purpose of the discussion in this paragraph we take to be the value of a particular component, θi say. In that regard, we identify three main types of inference: point estimation, confidence set estimation and hypothesis testing.
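A hypothetical example, not drawn from the text, may help to fix the three types. If X1, …, Xn are modelled as independent N(θ, 1) variables, then a natural point estimate, 95% confidence interval and 5%-level test of the hypothesis θ = 0 are, respectively,
$$
\hat{\theta} = \bar{x}, \qquad
\bar{x} \pm 1.96/\sqrt{n}, \qquad
\text{reject } \theta = 0 \text{ if } \bigl|\sqrt{n}\,\bar{x}\bigr| > 1.96 .
$$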
This chapter develops the key ideas in the Bayesian approach to inference. Fundamental ideas are described in Section 3.1. The key conceptual point is the way that the prior distribution on the unknown parameter θ is updated, on observing the realised value of the data x, to the posterior distribution, via Bayes’ law. Inference about θ is then extracted from this posterior. In Section 3.2 we revisit decision theory, to provide a characterisation of the Bayes decision rule in terms of the posterior distribution. The remainder of the chapter discusses various issues of importance in the implementation of Bayesian ideas. Key issues that emerge, in particular in realistic data analytic examples, include the question of choice of prior distribution and computational difficulties in summarising the posterior distribution. Of particular importance, therefore, in practice are ideas of empirical Bayes inference (Section 3.5), Monte Carlo techniques for application of Bayesian inference (Section 3.7) and hierarchical modelling (Section 3.8). Elsewhere in the chapter we provide discussion of Stein's paradox and the notion of shrinkage (Section 3.4). Though not primarily a Bayesian problem, we shall see that the James–Stein estimator may be justified (Section 3.5.1) as an empirical Bayes procedure, and the concept of shrinkage is central to practical application of Bayesian thinking. We also provide here a discussion of predictive inference (Section 3.9) from a Bayesian perspective, as well as a historical description of the development of the Bayesian paradigm (Section 3.6).
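The updating step referred to here is, in generic notation (the chapter's own notation may differ), the computation of the posterior density as prior times likelihood, renormalised:
$$
\pi(\theta \mid x) \;=\; \frac{\pi(\theta)\, f(x; \theta)}{\int_{\Theta} \pi(\theta')\, f(x; \theta')\, d\theta'} \;\propto\; \pi(\theta)\, f(x; \theta).
$$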
Much of the early history of social statistics, strongly influenced by Quetelet, can be viewed as a search for the “average man” – that improbable man without qualities who could be comfortable with his feet in the ice chest and his hands in the oven. Some of this obsession can be attributed to the seductive appeal of the Gaussian law of errors. Everyone, as Poincaré famously quipped, believes in the normal law of errors: the theorists because they believe it is an empirical fact, and the empiricists because they believe that it is a mathematical theorem. Once in the grip of this Gaussian faith, it suffices to learn about means. But sufficiency, despite all its mathematical elegance, should be tempered by a skeptical empiricism: a willingness to peer occasionally outside the cathedral of mathematics and see the world in all its diversity.
There have been many prominent statistical voices who, like Galton, reveled in the heterogeneity of statistical life – who resisted proposals to throw the mountains of Switzerland into its lakes. Edgeworth (1920) mocked excessive reliance on “reasoning with the aid of the gens d'arme's hat – from which, as from the conjuror's, so much can be extracted.” Models for the conditional mean in which independently and identically distributed Gaussian “errors” are tacked on almost as an afterthought are rife throughout the realms of science. They are indispensable approximations in many settings. We have argued that it is sometimes useful to deconstruct these models, complementing the estimation of models for the conditional mean with estimates of a family of conditional quantile functions.
Francis Galton, in a famous passage defending the “charms of statistics” against its many detractors, chided his statistical colleagues
[who] limited their inquiries to Averages, and do not seem to revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of a native of one of our flat English counties, whose retrospect of Switzerland was that, if the mountains could be thrown into its lakes, two nuisances would be got rid of at once
(Natural Inheritance, p. 62).
It is the fundamental task of statistics to bring order out of the diversity – at times the apparent chaos – of scientific observation. And this task is often very effectively accomplished by exploring how averages of certain variables depend on the values of other “conditioning” variables. The method of least squares, which pervades statistics, is admirably suited for this purpose. And yet, like Galton, one may question whether the exclusive focus on conditional mean relations among variables ignores some “charm of variety” in matters statistical.
As a resident of one of the flattest American counties, my recollections of Switzerland and its attractive nuisances are quite different from the retrospect described by Galton. Not only the Swiss landscape, but also many of its distinguished statisticians have in recent years made us more aware of the charms and perils of the diversity of observations and the consequences of too blindly limiting our inquiry to averages.
Much of applied statistics may be viewed as an elaboration of the linear regression model and associated estimation methods of least squares. In beginning to describe these techniques, Mosteller and Tukey (1977), in their influential text, remark:
What the regression curve does is give a grand summary for the averages of the distributions corresponding to the set of xs. We could go further and compute several different regression curves corresponding to the various percentage points of the distributions and thus get a more complete picture of the set. Ordinarily this is not done, and so regression often gives a rather incomplete picture. Just as the mean gives an incomplete picture of a single distribution, so the regression curve gives a correspondingly incomplete picture for a set of distributions.
My objective in the following pages is to describe explicitly how to “go further.” Quantile regression is intended to offer a comprehensive strategy for completing the regression picture.
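The strategy rests on the observation that, just as the sample mean minimises a sum of squared residuals, the τth sample quantile, and by extension the τth regression quantile, minimises an asymmetrically weighted sum of absolute residuals. In generic notation,
$$
\hat{\beta}(\tau) \;=\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p}} \sum_{i=1}^{n} \rho_{\tau}\bigl(y_i - x_i^{\top}\beta\bigr),
\qquad
\rho_{\tau}(u) \;=\; u\,\bigl(\tau - I(u < 0)\bigr),
$$
so that τ = 1/2 recovers median (ℓ1) regression, and varying τ over (0, 1) traces out the family of conditional quantile functions.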
Why does least-squares estimation of the linear regression model so pervade applied statistics? What makes it such a successful tool? Three possible answers suggest themselves. One should not discount the obvious fact that the computational tractability of linear estimators is extremely appealing. Surely this was the initial impetus for their success. Second, if observational noise is normally distributed (i.e., Gaussian), least-squares methods are known to enjoy a certain optimality. But, as it was for Gauss himself, this answer often appears to be an ex post rationalization designed to replace the first response.
This chapter provides a practical guide to statistical inference for quantile regression applications. In the earlier chapters, we have described a number of applications of quantile regression and provided various representations of the precision of these estimates; in this chapter and the one to follow, we will describe a variety of inference methods more explicitly. There are several competing approaches to inference in the literature and some guidance will be offered on their advantages and disadvantages. Ideally, of course, we would aspire to provide a finite-sample apparatus for statistical inference about quantile regression like the elegant classical theory of least-squares inference under independently and identically distributed (iid) Gaussian errors. But we must recognize that even in the least-squares theory it is necessary to resort to asymptotic approximations as soon as we depart significantly from idealized Gaussian conditions.
Nevertheless, we will begin by briefly describing what is known about the finite-sample theory of the quantile regression estimator and its connection to the classical theory of inference for the univariate quantiles. The asymptotic theory of inference is introduced with a heuristic discussion of the asymptotic behavior of the ordinary sample quantile; then a brief overview of quantile regression asymptotics is given. A more detailed treatment of the asymptotic theory of quantile regression is deferred to Chapter 4. Several approaches to inference are considered: Wald tests and related problems of direct estimation of the asymptotic covariance matrix, rank tests based on the dual quantile regression process, likelihood-ratio-type tests based on the value of the objective function under null and alternative models, and, finally, several resampling methods.
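As a rough guide to the form the Wald-type results take, under the simplest iid-error model with common error distribution F and density f (a sketch only; the precise conditions and the non-iid generalisation are given in Chapter 4),
$$
\sqrt{n}\,\bigl(\hat{\beta}(\tau) - \beta(\tau)\bigr) \;\rightsquigarrow\; N\!\Bigl(0,\; \frac{\tau(1-\tau)}{f^{2}\bigl(F^{-1}(\tau)\bigr)}\, D^{-1}\Bigr),
\qquad
D \;=\; \lim_{n\to\infty} n^{-1} \sum_{i=1}^{n} x_i x_i^{\top},
$$
so that practical Wald inference requires an estimate of the sparsity 1/f(F⁻¹(τ)), one of the issues taken up below.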
Although early advocates of absolute error methods like Boscovich, Laplace, and Edgeworth all suggested ingenious methods for minimizing sums of absolute errors for bivariate regression problems, it was not until the introduction of the simplex algorithm in the late 1940s, and the formulation of the ℓ1 regression problem as a linear program somewhat later, that a practical, general method for computing absolute error regression estimates became available.
We have already seen that the linear programming formulation of quantile regression is an indispensable tool for understanding its statistical behavior. Like the Euclidean geometry of the least-squares estimator, the polyhedral geometry of minimizing weighted sums of absolute errors plays a crucial role in understanding these methods. This chapter begins with a brief account of the classical theory of linear programming, stressing its geometrical nature and introducing the simplex method. The simplex approach to computing quantile regression estimates is then described and the special role of simplex-based methods for “sensitivity analysis” is emphasized.
Parametric programming in a variety of quantile regression contexts is treated in Section 6.3. Section 6.4 describes some recent developments in computation that rely on “interior point” methods for solving linear programs. These new techniques are especially valuable in large quantile regression applications, where the simplex approach becomes impractical. Further gains in computational efficiency are possible by preprocessing of the linear programming problems as described in Section 6.5. Interior point methods are also highly relevant for nonlinear quantile regression problems, a topic addressed in Section 6.6.
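To make the linear programming formulation concrete, the following is a minimal sketch in Python that solves the primal quantile regression LP with SciPy's general-purpose solver rather than the specialised simplex and interior point implementations discussed in this chapter; the function name and the simulated data are purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau=0.5):
    """Solve the tau-th regression quantile as a linear program:
    minimise  tau * 1'u + (1 - tau) * 1'v
    subject to  X b + u - v = y,   u >= 0,  v >= 0,  b free.
    """
    n, p = X.shape
    # Decision vector is [b (p, unrestricted), u (n, >= 0), v (n, >= 0)].
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Illustration on simulated data: median regression recovers the line y = 1 + 0.5 x.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x1])
y = 1.0 + 0.5 * x1 + rng.standard_normal(n)
print(quantile_regression_lp(X, y, tau=0.5))
```

For problems of the size discussed in Sections 6.4 and 6.5, a dense general-purpose formulation like this quickly becomes impractical, which is precisely the motivation for the interior point and preprocessing methods described there.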