Multivariate t distributions have attracted somewhat limited attention from researchers over the last 70 years, in spite of their increasing importance in classical as well as in Bayesian statistical modeling. For all these years they have been perhaps unjustly overshadowed by the multivariate normal distribution. Both the multivariate t and the multivariate normal are members of the general family of elliptically symmetric distributions. However, we feel that it is desirable to focus on these distributions separately for several reasons:
Multivariate t distributions are generalizations of the classical univariate Student t distribution, which is of central importance in statistical inference. The possible structures are numerous, and each one possesses special characteristics relevant to current and potential applications.
Application of multivariate t distributions is a very promising approach in multivariate analysis. Classical multivariate analysis is firmly tilted toward the multivariate normal distribution, while multivariate t distributions offer a more viable alternative for real-world data, particularly because their tails are more realistic. We have recently seen some unexpected applications in novel areas such as cluster analysis, discriminant analysis, multiple regression, robust projection indices, and missing data imputation.
A carefully planned investigation can give much more insight into the question at hand than a haphazard one, whose data may be useless. Experimental design is a highly developed subject, though its principles are not universally appreciated. In this chapter we outline some basic ideas and describe some simple designs and associated analyses. The first section discusses the importance of randomization, and shows how it can be used to justify standard linear models and how it strengthens inferences. Section 9.2 then describes some common designs and analyses. Interaction, contrasts and analysis of covariance are discussed in Section 9.3. Section 9.4 outlines the consequences of having more than one level of variability.
Randomization
The purpose of a designed experiment is to compare how treatments affect a response, by applying them to experimental units, on each of which the response is to be measured. The units are the raw material of the investigation; formally, a unit is the smallest subdivision of this material such that any two different units might receive different treatments. The treatments are clearly defined procedures, one of which is to be applied to each experimental unit. In an agricultural field trial the treatments might be different amounts of nitrogen and potash, while a unit is a plot of land. In a medical setting, treatments might be types of operation and different therapies, with units being patients who are operated upon and then given therapy to aid recovery.
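To make the idea concrete, here is a minimal sketch in S of a completely randomized allocation; the treatment labels A–D and the twenty units are hypothetical, not taken from any particular trial.

    # Completely randomized design: allocate four treatments, each
    # replicated five times, to twenty experimental units at random.
    treatments <- rep(c("A", "B", "C", "D"), 5)  # hypothetical labels
    allocation <- sample(treatments)             # random permutation
    split(1:20, allocation)                      # units receiving each treatment

Because every permutation of the labels is equally likely, each unit has the same chance of receiving each treatment, which is the basis of the randomization arguments developed in this chapter.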
Chapter 4 introduced likelihood and explored associated concepts such as likelihood ratio statistics and maximum likelihood estimators, which were then extensively used for inference in Chapters 5 and 6. In this chapter we turn aside from the central theme of the book and discuss some more theoretical topics. Estimation is a fundamental statistical activity, and in Section 7.1 we consider what properties a good estimator should have, including a brief discussion of nonparametric density estimators and the mathematically appealing topic of minimum variance unbiased estimation. One of the most important approaches to constructing estimators is as solutions to systems of estimating equations. In Section 7.2 we discuss the implications of this, showing how it complements minimum variance unbiased estimation, and seeing its implications for robust estimation and for stochastic processes. We then give an account of some of the main ideas underlying another major statistical activity, the testing of hypotheses, discussing the construction of tests with good properties, and making the connection to estimation.
Estimation
Mean squared error
Suppose that we wish to estimate some aspect of a probability model f(y). In principle we might try to estimate almost any feature of f, but we largely confine ourselves to estimation of the unknown parameter θ, or a function ψ(θ) of it, in a parametric model f(y; θ). Suppose that our data Y comprise a random sample Y1, …, Yn from f, and let the statistic T = t(Y) be an estimator of ψ(θ).
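A standard overall measure of the accuracy of T is its mean squared error, which splits into a variance term and a squared bias term:

    MSE(T) = E[{T − ψ(θ)}²] = var(T) + {E(T) − ψ(θ)}²,

so T estimates ψ(θ) well only if both its variance and its bias E(T) − ψ(θ) are small.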
The regression models of Chapters 8 and 9 involve a continuous response that depends linearly on the parameters. Linear models remain the backbone of most statistical data analysis, but they have their deficiencies. In many applications, response variables are discrete, or statistical or substantive considerations suggest that covariates will appear nonlinearly. Models of this sort appeared on a somewhat ad hoc basis in the literature up to about 1970, but since then there has been an explosion of generalizations of the linear model. Two important developments were the use of iterative weighted least squares for fitting, and the systematic use of exponential family response distributions. The iterative weighted least squares algorithm has wide applicability in nonlinear models, and we outline its properties in Section 10.2, together with a discussion of likelihood inference in this context. Exponential family response densities play a central role in generalized linear models, which we describe in Section 10.3, turning to the important special cases of binomial and Poisson responses in Sections 10.4 and 10.5. These models are widely used, but real data often display too much variation for them to be taken at face value. In Section 10.6 we outline remedies for this, based on the discussion of estimating functions in Section 7.2.
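To give the flavour of the algorithm of Section 10.2, here is a minimal sketch in S of iterative weighted least squares for a Poisson log-linear model; the function name, its arguments, and the crude zero starting value are illustrative assumptions, not code from the book's library.

    # Iterative weighted least squares for a Poisson model with log link:
    # repeatedly regress the working response z on X with weights w = mu.
    iwls.poisson <- function(X, y, tol = 1e-8, maxit = 25) {
      beta <- rep(0, ncol(X))                # crude starting value
      for (it in 1:maxit) {
        eta <- drop(X %*% beta)              # linear predictor
        mu  <- exp(eta)                      # fitted means
        z   <- eta + (y - mu) / mu           # working response
        w   <- mu                            # working weights
        beta.new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
        if (max(abs(beta.new - beta)) < tol) { beta <- beta.new; break }
        beta <- beta.new
      }
      beta
    }

This is essentially what the built-in glm function does for a Poisson response, with elaborations such as better starting values and deviance-based convergence checks.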
In each of these generalizations of the linear model our key notion that a few parameters summarize the entire model is retained. Section 10.7 branches out in a different direction, taking the viewpoint that the regression curve itself is more central than the parameters that summarize it.
The list below gives key words for practicals written in the statistical language S and intended to accompany the chapters of the book. The practicals themselves may be downloaded from
http://statwww.epfl.ch/people/~davison/SM
together with a library of functions and data.
Variation
Speed of light data. Exploratory data analysis.
Maths marks data. Brush and spin plots.
Probability plots for simulated data.
Illustration of central limit theorem using simulated data.
Data on air-conditioning failures. Exponential probability plots.
Uncertainty
Properties of half-normal distribution. Half-normal plot.
Simulation of Student t statistic, following original derivation.
Simulation of Wiener process and Brownian bridge.
Normal random number generation by summing uniform variables.
Implementation and assessment of a linear congruential generator (see the sketch after this list).
Coverage of Student t confidence interval under various scenarios.
Likelihood
Loss of information due to rounding of normal data.
Birth data. Maximum likelihood estimation for Poisson and gamma models. Assessment of fit.
Data on sizes of groups of people. Maximum likelihood fit of truncated Poisson distribution. Pearson's statistic.
α-particle data. Maximum likelihood fit of Poisson process model.
Blood group data. Maximum likelihood fit of multinomial model.
Generalized Pareto distribution. Nonregular estimation of endpoint.
Models
Boiling point of water data. Straight-line regression.
Survival data on leukaemia. Exponential and Weibull models.
HUS data. EM algorithm for mixture of Poisson distributions.
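For the random-number practical flagged above, the following minimal sketch in S shows the idea; the multiplier, increment, and modulus are common textbook choices, assumed here purely for illustration.

    # Linear congruential generator: x[i+1] = (a * x[i] + c) mod m,
    # returning u = x / m as approximate uniform (0, 1) variables.
    lcg <- function(n, seed, a = 69069, c = 1, m = 2^32) {
      u <- numeric(n)
      x <- seed
      for (i in 1:n) {
        x <- (a * x + c) %% m
        u[i] <- x / m
      }
      u
    }
    u <- lcg(1000, seed = 313)
    plot(u[-1000], u[-1])   # successive pairs: look for lattice structure

Plotting successive pairs (u_i, u_{i+1}) is one simple assessment: a poor generator shows visible lattice structure.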
Statistics concerns what can be learned from data. Applied statistics comprises a body of methods for data collection and analysis across the whole range of science, and in areas such as engineering, medicine, business, and law — wherever variable data must be summarized, or used to test or confirm theories, or to inform decisions. Theoretical statistics underpins this by providing a framework for understanding the properties and scope of methods used in applications.
Statistical ideas may be expressed most precisely and economically in mathematical terms, but contact with data and with scientific reasoning has given statistics a distinctive outlook. Whereas mathematics is often judged by its elegance and generality, many statistical developments arise as a result of concrete questions posed by investigators and data that they hope will provide answers, and elegant and general solutions are not always available. The huge variety of such problems makes it hard to develop a single over-arching theory, but nevertheless common strands appear. Uniting them is the idea of a statistical model.
Charles Robert Darwin (1809–1882) was rich enough not to have to earn his living. His reading and studies at Edinburgh and Cambridge exposed him to contemporary scientific ideas, and prepared him for the voyage of the Beagle (1831–1836), which formed the basis of his life's work as a naturalist — at one point he spent 8 years dissecting and classifying barnacles. He wrote numerous books including The Origin of Species, in which he laid out the theory of evolution by natural selection. […]