SECTION 1 illustrates the usefulness of coupling, by means of three simple examples.
SECTION 2 describes how sequences of random elements of separable metric spaces that converge in distribution can be represented by sequences that converge almost surely.
SECTION *3 establishes Strassen's Theorem, which translates the Prohorov distance between two probability measures into a coupling.
SECTION *4 establishes Yurinskii's coupling for sums of independent random vectors to normally distributed random vectors.
SECTION 5 describes a deceptively simple example (Tusnády's Lemma) of a quantile coupling, between a symmetric Binomial distribution and its corresponding normal approximation.
SECTION 6 uses the Tusnády Lemma to couple the Haar coefficients for the expansions of an empirical process and a generalized Brownian Bridge.
SECTION 7 derives one of the most striking results of modern probability theory, the KMT coupling of the uniform empirical process with the Brownian Bridge process.
What is coupling?
A coupling of two probability measures, P and Q, consists of a probability space (Ω, ℱ, ℙ) supporting two random elements X and Y, such that X has distribution P and Y has distribution Q. Sometimes interesting relationships between P and Q can be coded in some simple way into the joint distribution for X and Y. Three examples should make the concept clearer.
Example. Let Pα denote the Bin(n,α) distribution. As α gets larger, the distribution should “concentrate on bigger values.” More precisely, for each fixed x, the tail probability Pα[x, n] should be an increasing function of α. A coupling argument will give an easy proof.
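The coupling argument can be sketched in a few lines of Python (an illustration, not part of the text): drive both binomial counts with the same uniforms, so that Xα = Σᵢ 1{Uᵢ ≤ α} has the Bin(n, α) distribution, and α ≤ β forces Xα ≤ Xβ pointwise, which yields Pα[x, n] ≤ Pβ[x, n] for every x.

```python
import random

def coupled_binomials(n, alpha, beta, seed=0):
    # One set of uniforms drives both counts: this joint construction
    # is the coupling of Bin(n, alpha) and Bin(n, beta).
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    x_alpha = sum(u <= alpha for u in us)   # distributed Bin(n, alpha)
    x_beta = sum(u <= beta for u in us)     # distributed Bin(n, beta)
    return x_alpha, x_beta

# With alpha <= beta the coupled pair always satisfies X_alpha <= X_beta,
# so the tail probability P_alpha[x, n] <= P_beta[x, n] for every x.
for seed in range(1000):
    xa, xb = coupled_binomials(20, 0.3, 0.7, seed)
    assert xa <= xb
```

The monotonicity of the tail probabilities follows with no calculation at all, which is the point of the example.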
SECTION 1 considers the elementary case of conditioning on a map that takes only finitely many different values, as motivation for the general definition.
SECTION 2 defines conditional probability distributions for conditioning on the value of a general measurable map.
SECTION 3 discusses existence of conditional distributions by means of a slightly more general concept, disintegration, which is essential for the understanding of general conditional densities.
SECTION 4 defines conditional densities. It develops the general analog of the elementary formula for a conditional density: (joint density)/(marginal density).
SECTION *5 illustrates how conditional distributions can be identified by symmetry considerations. The classical Borel paradox is presented as a warning against the misuse of symmetry.
SECTION 6 discusses the abstract Kolmogorov conditional expectation, explaining why it is natural to take the conditioning information to be a sub-sigma-field.
SECTION *7 discusses the statistical concept of sufficiency.
Conditional distributions: the elementary case
In introductory probability courses, conditional probabilities of events are defined as ratios, ℙ(A∣B) = ℙAB/ℙB, provided ℙB ≠ 0. The division by ℙB ensures that ℙ(· ∣ B) is also a probability measure, which puts zero mass outside the set B, that is, ℙ(Bc ∣ B) = 0. The conditional expectation of a random variable X is defined as its expectation with respect to ℙ(· ∣ B), or, more succinctly, ℙ(X ∣ B) = ℙ(XB)/ℙB. If ℙB = 0, the conditional probabilities and conditional expectations are either left undefined or are extracted by some heuristic limiting argument.
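The elementary ratio definition can be sketched on a toy example (illustrative only, not from the text): two fair dice, conditioning on the event B that the first die shows a six.

```python
from fractions import Fraction

# Sample space for two fair dice, with uniform mass 1/36 on each outcome.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = Fraction(1, 36)

B = [w for w in omega if w[0] == 6]              # first die shows 6
A = [w for w in omega if w[0] + w[1] >= 10]      # sum is at least 10

P_B = len(B) * P
P_AB = len([w for w in omega if w in A and w in B]) * P
P_A_given_B = P_AB / P_B   # the ratio definition P(A|B) = P(AB)/P(B)

print(P_A_given_B)  # 1/2: given a first 6, the second die must show 4, 5, or 6
```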
SECTION 1 defines the concepts of weak convergence for sequences of probability measures on a metric space, and of convergence in distribution for sequences of random elements of a metric space and derives some of their consequences. Several equivalent definitions for weak convergence are noted.
SECTION 2 establishes several more equivalences for weak convergence of probability measures on the real line, then derives some central limit theorems for sums of independent random variables by means of Lindeberg's substitution method.
SECTION 3 explains why the multivariate analogs of the methods from Section 2 are not often explicitly applied.
SECTION 4 develops the calculus of stochastic order symbols.
SECTION *5 derives conditions under which sequences of probability measures have weakly convergent subsequences.
Definition and consequences
Roughly speaking, central limit theorems give conditions under which sums of random variables have approximately normal distributions. For example:
If ξ1, …, ξn are independent random variables with ℙξi = 0 for each i and ∑i var(ξi) = 1, and if none of the ξi makes too large a contribution to their sum, then ∑i ξi is approximately N(0, 1) distributed.
The traditional way to formalize approximate normality requires, for each real x, that ℙ{∑i ξi ≤ x} ≈ ℙ{Z ≤ x} where Z has a N(0, 1) distribution. Of course the variable Z is used just as a convenient way to describe a calculation with the N(0, 1) probability measure; Z could be replaced by any other random variable with the same distribution.
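The displayed assertion can be checked by simulation (an illustrative sketch, not part of the text): standardized sums of centered uniforms, scaled so their variances sum to one, give empirical frequencies close to the corresponding N(0, 1) probabilities.

```python
import math
import random

def phi(x):
    # Standard normal cdf, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clt_demo(n=200, reps=20000, x=1.0, seed=0):
    # xi_i are centered uniforms, scaled so that sum_i var(xi_i) = 1;
    # var of Uniform(-1/2, 1/2) is 1/12, hence the scale factor.
    rng = random.Random(seed)
    scale = math.sqrt(12.0 / n)
    count = 0
    for _ in range(reps):
        s = sum(scale * (rng.random() - 0.5) for _ in range(n))
        count += s <= x
    return count / reps   # empirical estimate of P{sum_i xi_i <= x}

print(abs(clt_demo() - phi(1.0)))  # close to 0: the frequency approximates Φ(1)
```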
SECTION 1 explains why the traditional split of introductory probability courses into two segments—the study of discrete distributions, and the study of “continuous” distributions—is unnecessary in a measure theoretic treatment. Absolute continuity of one measure with respect to another measure is defined. A simple case of the Radon-Nikodym theorem is proved.
SECTION *2 establishes the Lebesgue decomposition of a measure into parts absolutely continuous and singular with respect to another measure, a result that includes the Radon-Nikodym theorem as a particular case.
SECTION 3 shows how densities enter into the definitions of various distances between measures.
SECTION 4 explains the connection between the classical concept of absolute continuity and its measure theoretic generalization. Part of the Fundamental Theorem of Calculus is deduced from the Radon-Nikodym theorem.
SECTION *5 establishes the Vitali covering lemma, the key to the identification of derivatives as densities.
SECTION *6 presents the proof of the other part of the Fundamental Theorem of Calculus, showing that absolutely continuous functions (on the real line) are Lebesgue integrals of their derivatives, which exist almost everywhere.
Densities and absolute continuity
Nonnegative measurable functions create new measures from old.
Let (X, A, µ) be a measure space, and let Δ(·) be a function in M+(X, A). The increasing, linear functional defined on M+(X, A) by νf ≔ µ(fΔ) inherits from µ the Monotone Convergence property, which identifies it as an integral with respect to a measure ν on A.
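A concrete sketch (illustrative only) with µ taken as counting measure on a four-point set shows the recipe νf ≔ µ(fΔ) producing a new measure.

```python
from fractions import Fraction

# mu: counting measure on X = {0, 1, 2, 3}; Delta a nonnegative function on X.
X = [0, 1, 2, 3]
Delta = {x: Fraction(x, 6) for x in X}   # Delta >= 0, chosen so nu(X) = 1

def mu(g):
    # Integral against counting measure is just a sum over X.
    return sum(g(x) for x in X)

def nu(f):
    # The new functional: nu(f) = mu(f * Delta).
    return mu(lambda x: f(x) * Delta[x])

# nu inherits linearity (and Monotone Convergence) from mu.
assert nu(lambda x: 1) == 1                # nu is a probability measure here
assert nu(lambda x: x) == Fraction(14, 6)  # the mean under nu
```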
SECTION 1 introduces independence as a property that justifies some sort of factorization of probabilities or expectations. A key factorization Theorem is stated, with proof deferred to the next Section, as motivation for the measure theoretic approach. The Theorem is illustrated by a derivation of a simple form of the strong law of large numbers, under an assumption of bounded fourth moments.
SECTION 2 formally defines independence as a property of sigma-fields. The key Theorem from Section 1 is used as motivation for the introduction of a few standard techniques for dealing with independence. Product sigma-fields are defined.
SECTION 3 describes a method for constructing measures on product spaces, starting from a family of kernels.
SECTION 4 specializes the results from Section 3 to define product measures. The Tonelli and Fubini theorems are deduced. Several important applications are presented.
SECTION *5 discusses some difficulties encountered in extending the results of Sections 3 and 4 when the measures are not sigma-finite.
SECTION 6 introduces a blocking technique to refine the proof of the strong law of large numbers from Section 1, to get a version that requires only a second moment condition.
SECTION *7 introduces a truncation technique to further refine the proof of the strong law of large numbers, to get a version that requires only a first moment condition for identically distributed summands.
SECTION *8 discusses the construction of probability measures on products of countably many spaces.
SECTION 1 explains why you will not learn from this Chapter everything there is to know about the multivariate normal distribution.
SECTION 2 introduces Fernique's inequality. As illustration, Sudakov's lower bound for the expected value of a maximum of correlated normals is derived.
SECTION *3 proves Fernique's inequality.
SECTION 4 introduces the Gaussian isoperimetric inequality. As an application, Borell's tail bound for the distribution of the maximum of correlated normals is derived.
SECTION *5 proves the Gaussian isoperimetric inequality.
Introduction
Of all the probability distributions on multidimensional Euclidean spaces the multivariate normal is the most studied and, in many ways, the most tractable. In years past, the statistical subject known as “Multivariate Analysis” was almost entirely devoted to the study of the multivariate normal. The literature on Gaussian processes—stochastic processes whose finite dimensional distributions are all multivariate normal—is vast. It is important to know a little about the multivariate normal.
As you saw in Section 8.6, the multivariate normal is uniquely determined by its vector of means and its matrix of covariances. In principle, everything that one might want to know about the distribution can be determined by calculation of means and covariances, but in practice it is not completely straightforward. In this Chapter you will see two elegant examples of what can be achieved: Fernique's (1975) inequality, which deduces important information about the spread in a multivariate normal distribution from its covariances; and Borell's (1975) Gaussian isoperimetric inequality, with a proof due to Ehrhard (1983a, 1983b).
SECTION 1 gives some examples of martingales, submartingales, and supermartingales.
SECTION 2 introduces stopping times and the sigma-fields corresponding to “information available at a random time.” A most important Stopping Time Lemma is proved, extending the martingale properties to processes evaluated at stopping times.
SECTION 3 shows that positive supermartingales converge almost surely.
SECTION 4 presents a condition under which a submartingale can be written as a difference between a positive martingale and a positive supermartingale (the Krickeberg decomposition). A limit theorem for submartingales then follows.
SECTION *5 proves the Krickeberg decomposition.
SECTION *6 defines uniform integrability and shows how uniformly integrable martingales are particularly well behaved.
SECTION *7 shows that martingale theory works just as well when time is reversed.
SECTION *8 uses reverse martingale theory to study exchangeable probability measures on infinite product spaces. The de Finetti representation and the Hewitt-Savage zero-one law are proved.
What are they?
The theory of martingales (and submartingales and supermartingales and other related concepts) has had a profound effect on modern probability theory. Whole branches of probability, such as stochastic calculus, rest on martingale foundations. The theory is elegant and powerful: amazing consequences flow from an innocuous assumption regarding conditional expectations. Every serious user of probability needs to know at least the rudiments of martingale theory.
A little notation goes a long way in martingale theory. A fixed probability space (Ω, ℱ, ℙ) sits in the background.
In late 1982, Ted Hannan discussed with me a question he had been asked by some astronomers – how could you estimate the two frequencies in two sinusoids when the frequencies were so close together that you could not tell, by looking at the periodogram, that there were two frequencies? He asked me if I would like to work with him on the problem and gave me a reprint of his paper (Hannan 1973) on the estimation of frequency. Together we wrote a paper (Hannan and Quinn 1989) which derived the regression sum of squares estimators of the frequencies, and showed that the estimators were strongly consistent and satisfied a central limit theorem. It was clear that there were no problems asymptotically if the two frequencies were fixed, so Ted's idea was to fix one frequency, and let the other converge to it at a certain rate, in much the same way as the alternative hypothesis is constructed to calculate the asymptotic power of a test. Since then, I have devoted much of my research to sinusoidal models. In particular, I have spent a lot of time constructing algorithms for the estimation of parameters in these models, implementing the algorithms in practice and, for me perhaps the most challenging, establishing the asymptotic (large sample) properties of the estimators.
Chapter 10 provided an overview of Monte Carlo methods and dealt solely with the problem of generating from the uniform distribution. Since the uniform distribution is the fundamental distribution, we're now prepared to deal with the postponed problem of generating from other distributions. Given the results of Chapter 10, this problem should be viewed as transforming a source sequence of IID uniform random variables {Ui} to an IID sequence of random variables {Xi} with cumulative distribution function (cdf) F. A discussion of general methods for generating from continuous distributions forms Section 11.2. Specific algorithms designed for various distributions, such as the normal and Student's t, follow in Section 11.3. General methods for discrete distributions are discussed in Section 11.4, with specific cases in Section 11.5. Special problems, including random sampling from a population, are handled in Section 11.6. The problem of accuracy in Monte Carlo is tackled in Section 11.7.
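One of the general continuous-distribution methods of Section 11.2, the inverse-cdf transformation, can be sketched as follows (an illustrative fragment, not the book's own code): if U is Uniform(0, 1) and F is a continuous cdf, then F⁻¹(U) has cdf F.

```python
import math
import random

def inverse_transform_exponential(n, lam=1.0, seed=0):
    # Inverse-cdf method for the Exponential(lam) distribution:
    # F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -log(1 - u) / lam.
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

xs = inverse_transform_exponential(100000, lam=2.0)
print(sum(xs) / len(xs))  # close to the mean 1/lam = 0.5
```

The method is exact whenever F⁻¹ is available in closed form; for distributions like the normal, the specialized algorithms of Section 11.3 are preferred.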
Some general remarks are in order before pursuing the problem at hand. Algorithms for generating random variables should always be simple, fast, and exact. Simplicity is paramount, since users must often code and debug their own programs. Finding errors in random output is very difficult (see Exercises 11.14 and 11.20). If an algorithm is simple, most mistakes will bring consequences so severe that the error can be easily discovered. Speed is not so important, since the computational effort in generation is usually only a small fraction of the total effort in the Monte Carlo experiment.
Maximum likelihood is generally regarded as the best all-purpose approach for statistical analysis. Outside of the most common statistical procedures, when the “optimal” or “usual” method is unknown, most statisticians follow the principle of maximum likelihood for parameter estimation and statistical hypothesis tests. Bayesian statistical methods also rely heavily on maximum likelihood. The main reason for this reliance is that following the principle of maximum likelihood usually leads to very reasonable and effective estimators and tests. From a theoretical viewpoint, under very mild conditions, maximum likelihood estimators (MLEs) are consistent, asymptotically unbiased, and efficient. Moreover, MLEs are invariant under reparameterizations or transformations: the MLE of a function of the parameter is the function of the MLE. From a practical viewpoint, the estimates and test statistics can be constructed without a great deal of analysis, and large-sample standard errors can be computed. Overall, experience has shown that maximum likelihood works well most of the time.
The biggest computational challenge comes from the naive expectation that any statistical problem can be solved if the maximum of some function is found. Instead of relying solely on the unconstrained optimization methods presented in Chapter 8 to meet this unrealistic expectation, the nature of the likelihood function can be exploited in ways that are more effective for computing MLEs. Since the exploitable properties of likelihood functions follow from the large-sample theory, this chapter will begin with a summary of the consistency and asymptotic normality properties of MLEs.
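As a toy illustration (not from the chapter) of computing an MLE by direct maximization, and of the invariance property, consider exponential data: a simple one-dimensional minimizer applied to the negative log-likelihood can be checked against the closed-form MLE, the reciprocal of the sample mean.

```python
import math
import random

def neg_loglik(lam, data):
    # Exponential(lam) negative log-likelihood.
    n = len(data)
    return -(n * math.log(lam) - lam * sum(data))

def mle_golden(data, lo=1e-6, hi=100.0, tol=1e-10):
    # Golden-section minimization of the negative log-likelihood;
    # a stand-in for the general-purpose optimizers of Chapter 8.
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if neg_loglik(c, data) < neg_loglik(d, data):
            b = d
        else:
            a = c
    return (a + b) / 2.0

rng = random.Random(1)
data = [-math.log(rng.random()) / 3.0 for _ in range(5000)]  # Exp(3) sample
lam_hat = mle_golden(data)
# Closed form: the MLE of lam is 1/(sample mean); by invariance,
# the MLE of the mean 1/lam is simply 1/lam_hat.
assert abs(lam_hat - len(data) / sum(data)) < 1e-6
```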
One of the main advantages of Monte Carlo integration is a rate of convergence that is unaffected by increasing dimension, but a more important advantage for statisticians is the familiarity of the technique and its tools. Although Markov chain Monte Carlo (MCMC) methods are designed to integrate high-dimensional functions, the ability to exploit distributional tools makes these methods much more appealing to statisticians. In contrast to importance sampling with weighted observations, MCMC methods produce observations that are no longer independent; rather, the observations come from a stationary distribution and so time-series methods are needed for their analysis. The emphasis here will be on using MCMC methods for Bayesian problems with the goal of generating a series of observations whose stationary distribution π(t) is proportional to the unnormalized posterior p*(t). Standard statistical methods can then be used to gain information about the posterior.
The two general approaches covered in this chapter are known as Gibbs sampling and the Metropolis–Hastings algorithm, although the former can be written as a special case of the latter. Gibbs sampling shows the potential of MCMC methods for Bayesian problems with hierarchical structure, also known as random effects or variance components. The key ingredient in Gibbs sampling is the ability to generate from the conditional distribution of each variable given the others; in the case of three components, generating from f(x | Y = y, Z = z), f(y | X = x, Z = z), and f(z | X = x, Y = y).
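A random-walk Metropolis sketch (a minimal hypothetical implementation, not code from this chapter) shows why the unnormalized posterior p*(t) suffices: the normalizing constant cancels in the acceptance ratio.

```python
import math
import random

def metropolis_hastings(log_target, n_steps, x0=0.0, step=1.0, seed=0):
    # Random-walk Metropolis: propose x' = x + step * Z with Z ~ N(0, 1),
    # accept with probability min(1, pi(x') / pi(x)).  Only the log of an
    # unnormalized target density is needed.
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    chain = []
    for _ in range(n_steps):
        prop = x + step * rng.gauss(0.0, 1.0)
        lp_prop = log_target(prop)
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop   # accept the proposal
        chain.append(x)             # otherwise repeat the current state
    return chain

# Unnormalized standard normal log-density standing in for a posterior.
chain = metropolis_hastings(lambda t: -0.5 * t * t, 50000)
burned = chain[10000:]
print(sum(burned) / len(burned))  # near the posterior mean 0
```

As the text notes, the resulting draws are dependent, so summaries of the chain call for time-series methods rather than IID standard errors.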
We encounter periodic phenomena every day of our lives. Those of us who still use analogue clocks are acutely aware of the 60 second, 60 minute and 12 hour periods associated with the sweeps of the second, minute and hour hands. We are conscious of the fact that the Earth rotates on its axis roughly every 24 hours and that it completes a revolution of the Sun roughly every 365 days. These periodicities are reasonably accurate. The quantities we are interested in measuring are not precisely periodic and there will also be error associated with their measurement. Indeed, some phenomena only seem periodic. For example, some biological population sizes appear to fluctuate regularly over a long period of time, but it is hard to justify, using common sense, any periodicity other than that associated with the annual cycle. It has been argued in the past that some cycles occur because of predator-prey interaction, while in other cases there is no obvious reason. On the other hand, the sound associated with musical instruments can reasonably be thought of as periodic, locally in time, since musical notes are produced by regular vibration and propagated through the air via the regular compression and expansion of the air. The ‘signal’ will not be exactly periodic, since there are errors associated with the production of the sound, with its transmission through the air (since the air is not a uniform medium) and because the ear is not a perfect receiver.