Each of the problems described in the previous chapter can benefit from regression analysis. In this book we focus on the combination of classical parametric regression techniques and modern nonparametric regression techniques to develop useful models for such analyses. It is therefore essential to have a good grounding in the principles of parametric regression before proceeding to the more complicated semiparametric regression chapters. In particular, some of the theoretical aspects of regression should be well understood, since these are important in extensions to semiparametric regression. The present chapter can serve either as a brief introduction to parametric regression for readers without a background in that field or as a refresher for those with a working knowledge of parametric regression who could benefit from a review. If you are very familiar with parametric regression methodology and theory, then this chapter can be skimmed. Of course, this brief introduction can only cover the main concepts and a few special models; many widely used parametric models are not discussed. The chapter provides sufficient background in parametric regression for the chapters to follow, but readers wishing to apply parametric regression models more broadly should consult a textbook on the subject, such as Weisberg (1985), Neter et al. (1996), or Draper and Smith (1998).
Note, moreover, that Section 2.5 contains some new perspectives on parametric regression that are relevant to later chapters on semiparametric models, so this is worth covering regardless of experience.
Toward the end of the chapter we describe some limitations of parametric regression. Most of the remainder of the book is concerned with extensions of parametric regression that have much more flexibility.
Classical statistics treats parameters as fixed unknown quantities. Bayesian statistics is based on a different philosophy; parameters are treated as random variables. The probability distribution of a parameter characterizes knowledge about the parameter's value, and this distribution changes as new data are acquired. The mixed models of classical statistics have a Bayesian flavor because some parameters are treated as random. However, in a mixed model both the fixed effects and the variance components are treated as nonrandom unknowns. Bayesians go one step beyond mixed models in that they treat all parameters as random. In this chapter we take the mixed model formulation of Section 4.9 and extend it to a fully Bayesian model.
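As a rough sketch of this extension (the notation here is generic rather than that of Section 4.9), a Gaussian linear mixed model
\[
y = X\beta + Zu + \varepsilon, \qquad u \sim N(0, \sigma_u^2 I), \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2 I),
\]
treats \(\beta\), \(\sigma_u^2\), and \(\sigma_\varepsilon^2\) as fixed unknowns. The fully Bayesian version places prior distributions on these quantities as well, for example \(\beta \sim N(0, \sigma_\beta^2 I)\) together with diffuse priors on the variance components, and bases inference on the joint posterior distribution of all unknowns given \(y\).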
Bayesian statistics differs from classical statistics in two important respects:
(1) the use of the prior distribution to characterize knowledge of the parameter values prior to data collection; and
(2) the use of the posterior distribution – that is, the conditional distribution of the parameters given the data – as the basis of inference.
Some statisticians are uneasy about the use of priors, but when done with care it is quite sensible. In some situations we might have strong prior beliefs that will influence our analysis. For example, suppose we needed to estimate the probability that a toss of a coin comes up heads.
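To make the coin example concrete (a standard conjugate-prior sketch, not a substitute for the discussion in the text), suppose the prior for the heads probability \(\theta\) is \(\mathrm{Beta}(a, b)\) and we observe \(x\) heads in \(n\) independent tosses. Bayes' rule gives the posterior
\[
p(\theta \mid x) \;\propto\; p(x \mid \theta)\, p(\theta) \;\propto\; \theta^{x}(1-\theta)^{n-x}\,\theta^{a-1}(1-\theta)^{b-1},
\]
so that \(\theta \mid x \sim \mathrm{Beta}(a + x,\; b + n - x)\), with posterior mean \((a+x)/(a+b+n)\). A strong prior belief that the coin is fair corresponds to choosing large, equal values of \(a\) and \(b\).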
The primary aim of this book is to guide researchers needing to flexibly incorporate nonlinear relationships into their regression analyses. Flexible nonlinear regression is traditionally known as nonparametric regression; it differs from parametric regression in that the shapes of the functional relationships are not predetermined but can adjust to capture unusual or unexpected features of the data.
Almost all existing regression texts treat either parametric or nonparametric regression exclusively, and the level of exposition differs quite alarmingly between books of the two types. In this book we argue that nonparametric regression can be viewed as a relatively simple extension of parametric regression and treat the two together. We refer to this combination as semiparametric regression. Our approach to semiparametric regression is based on penalized regression splines and mixed models. Indeed, every model in this book is a special case of the linear mixed model or its generalized counterpart. This makes the methodology modular and is in keeping with our general philosophy of minimalist statistics (see Section 19.2), where the amount of methodology, terminology, and so on is kept to a minimum. This is the first smoothing book that makes use of the mixed model representation of smoothers.
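A minimal sketch of that connection (using a truncated-line basis with knots \(\kappa_1, \dots, \kappa_K\) chosen here purely for illustration): a penalized regression spline
\[
f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} u_k (x - \kappa_k)_+,
\]
in which the coefficients \(u_k\) are shrunk by a penalty, can be fit as a linear mixed model by treating \(u_1, \dots, u_K\) as random effects with \(u_k \sim N(0, \sigma_u^2)\). The smoothing parameter then corresponds to a ratio of variance components and can be estimated by (restricted) maximum likelihood.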
Unlike many other texts on nonparametric regression, this book is very much problem-driven. Examples from our collaborative research (and elsewhere) have driven the selection of material and emphases and are used throughout the book.
The book is suitable for several audiences. One audience consists of students or working scientists with only a moderate background in regression, though familiarity with matrix and linear algebra is assumed. Marginal notes and the appendices are intended for beginners, especially those from interface disciplines.
The additive models of Chapters 8 and 11 have many attractive features. The joint effect of all the predictor variables upon the response is expressed as a sum of individual effects. These individual effects show how the expected response varies as any single predictor varies with the others held fixed at arbitrary values; because of the additivity, the effect of one predictor does not depend on the values at which the others are fixed. Thus, the individual component functions can be plotted separately to visualize the effect of each predictor, and these functions – taken together – allow us to understand the joint effects of all the predictors upon the expected response. If, for example, we wish to find conditions under which the expected response is maximized, then we need only maximize separately each of the component functions of the additive model. In summary, it is extremely convenient whenever an additive model provides an accurate summary of the data.
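In symbols (a generic sketch, with notation not tied to any particular chapter), an additive model for a response \(y\) and predictors \(x_1, \dots, x_d\) takes the form
\[
E(y \mid x_1, \dots, x_d) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_d(x_d),
\]
so each component function \(f_j\) can be estimated, plotted, and interpreted on its own.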
However, there are no guarantees that an additive model will provide a satisfactory fit in any given situation. Nonadditivity means that, as one predictor is varied, the effect on the expected response depends on the fixed values of the other predictors. A deviation from additivity is called an interaction.
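One simple way to express this (again only a sketch) is to augment the additive form with a bivariate term, for example
\[
E(y \mid x_1, x_2) = \beta_0 + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2),
\]
where a nonzero interaction surface \(f_{12}\) means that the effect of \(x_1\) depends on the value at which \(x_2\) is held fixed.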
SECTION 1 offers some reasons why anyone who uses probability should know about the measure theoretic approach.
SECTION 2 describes some of the added complications, and some of the compensating benefits that come with the rigorous treatment of probabilities as measures.
SECTION 3 argues that there are advantages in approaching the study of probability theory via expectations, interpreted as linear functionals, as the basic concept.
SECTION 4 describes the de Finetti convention of identifying a set with its indicator function, and of using the same symbol for a probability measure and its corresponding expectation.
SECTION *5 presents a fair-price interpretation of probability, which emphasizes the linearity properties of expectations. The interpretation is sometimes a useful guide to intuition.
Why bother with measure theory?
Following the appearance of the little book by Kolmogorov (1933), which set forth a measure theoretic foundation for probability theory, it has been widely accepted that probabilities should be studied as special sorts of measures. (More or less true—see the Notes to the Chapter.) Anyone who wants to understand modern probability theory will have to learn something about measures and integrals, but it takes surprisingly little to get started.
For a rigorous treatment of probability, the measure theoretic approach is a vast improvement over the arguments usually presented in undergraduate courses. Let me remind you of some difficulties with the typical introduction to probability.
Independence
There are various elementary definitions of independence for random variables.
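One such elementary definition (stated here for two real random variables, as a reminder rather than as the definition adopted later): \(X\) and \(Y\) are independent if
\[
P\{X \le x,\; Y \le y\} = P\{X \le x\}\, P\{Y \le y\} \qquad \text{for all real } x, y,
\]
or, when joint and marginal densities exist, if the joint density factorizes as \(f_{X,Y}(x,y) = f_X(x)\, f_Y(y)\).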
SECTION 3 defines the integral with respect to a measure as a linear functional on a cone of measurable functions. The definition sidesteps the details of the construction of integrals from measures.
SECTION *4 constructs integrals of nonnegative measurable functions with respect to a countably additive measure.
SECTION 5 establishes the Dominated Convergence theorem, the Swiss Army knife of measure theoretic probability.
SECTION 6 collects together a number of simple facts related to sets of measure zero.
SECTION *7 presents a few facts about spaces of functions with integrable pth powers, with emphasis on the case p=2, which defines a Hilbert space.
SECTION 8 defines uniform integrability, a condition slightly weaker than domination. Convergence in L^1 is characterized as convergence in probability plus uniform integrability.
SECTION 9 defines the image measure, which includes the concept of the distribution of a random variable as a special case.
SECTION 10 explains how generating class arguments, for classes of sets, make measure theory easy.
SECTION *11 extends generating class arguments to classes of functions.
Measures and sigma-fields
As promised in Chapter 1, we begin with measures as set functions, then work quickly towards the interpretation of integrals as linear functionals. Once we are past the purely set-theoretic preliminaries, I will start using the de Finetti notation (Section 1.4) in earnest, writing the same symbol for a set and its indicator function.
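For example (a small illustration of the convention, not a quotation from Section 1.4), if \(A\) denotes both a set and its indicator function, one may write \(\mu A = \int A \, d\mu\) for the measure of the set, and an identity such as \(A \cup B = A + B - AB\) becomes an identity between functions.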
SECTION 1 collects together some facts about stochastic processes and the normal distribution, for easier reference.
SECTION 2 defines Brownian motion as a Gaussian process indexed by a subinterval T of the real line. Existence of Brownian motions with and without continuous sample paths is discussed. Wiener measure is defined.
SECTION 3 constructs a Brownian motion with continuous sample paths, using an orthogonal series expansion of square integrable functions.
SECTION *4 describes some of the finer properties—lack of differentiability, and a modulus of continuity—for Brownian motion sample paths.
SECTION 5 establishes the strong Markov property for Brownian motion. Roughly speaking, the process starts afresh as a new Brownian motion after stopping times.
SECTION *6 describes a family of martingales that can be built from a Brownian motion, then establishes Lévy's martingale characterization of Brownian motion with continuous sample paths.
SECTION *7 shows how square integrable functions of the whole Brownian motion path can be represented as limits of weighted sums of increments. The result is a thinly disguised version of a remarkable property of the isometric stochastic integral, which is mentioned briefly.
SECTION *8 explains how the result from Section 7 is the key to the determination of option prices in a popular model for changes in stock prices.
Prerequisites
Broadly speaking, Brownian motion is to stochastic process theory as the normal distribution is to the theory for real random variables.
SECTION 1 presents a few of the basic properties of Fourier transforms that make them such a valuable tool of probability theory.
SECTION 2 exploits a mysterious coincidence, involving the Fourier transform and the density function of the normal distribution, to establish inversion formulas for recovering distributions from Fourier transforms.
SECTION *3 explains why the coincidence from Section 2 is not really so mysterious.
SECTION 4 shows that the inversion formula from Section 2 has a continuity property, which explains why pointwise convergence of Fourier transforms implies convergence in distribution.
SECTION *5 establishes a central limit theorem for triangular arrays of martingale differences.
SECTION 6 extends the theory to multivariate distributions, pointing out how the calculations reduce to one-dimensional analogs for linear combinations of coordinate variables—the Cramér and Wold device.
SECTION *7 provides a direct proof (no Fourier theory) of the fact that the family of (one-dimensional) distributions for all linear combinations of a random vector uniquely determines its multivariate distribution.
SECTION *8 illustrates the use of complex-variable methods to prove a remarkable property of the normal distribution—the Lévy-Cramér theorem.
Definitions and basic properties
Some probabilistic calculations simplify when reexpressed in terms of suitable transformations, such as the probability generating function (especially for random variables taking only positive integer values), the Laplace transform (especially for random variables taking only nonnegative values), or the moment generating function (for random variables with rapidly decreasing tail probabilities).
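For reference (standard definitions, stated here with generic notation for a real random variable \(X\)): the probability generating function is \(g(s) = E\, s^X\) for integer-valued \(X\), the Laplace transform is \(L(t) = E\, e^{-tX}\) for \(X \ge 0\), the moment generating function is \(M(t) = E\, e^{tX}\), and the Fourier transform (characteristic function) is \(\psi(t) = E\, e^{itX}\), which has the advantage of existing for every real \(t\) and every distribution.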
This book began life as a set of handwritten notes, distributed to students in my one-semester graduate course on probability theory, a course that had humble aims: to help the students understand results such as the strong law of large numbers, the central limit theorem, conditioning, and some martingale theory. Along the way they could expect to learn a little measure theory and maybe even a smattering of functional analysis, but not as much as they would learn from a course on Measure Theory or Functional Analysis.
In recent years the audience has consisted mainly of graduate students in statistics and economics, most of whom have not studied measure theory. Most of them have no intention of studying measure theory systematically, or of becoming professional probabilists, but they do want to learn some rigorous probability theory—in one semester.
Faced with the reality of an audience that might have neither the time nor the inclination to devote itself completely to my favorite subject, I sought to compress the essentials into a course as self-contained as I could make it. I tried to pack into the first few weeks of the semester a crash course in measure theory, with supplementary exercises and a whirlwind exposition (Appendix A) for the enthusiasts. I tried to eliminate duplication of mathematical effort if it served no useful role. After many years of chopping and compressing, the material that I most wanted to cover all fit into a one-semester course, divided into 25 lectures, each lasting from 60 to 75 minutes.