In this chapter we start by reviewing the different types of inference procedures: frequentist, Bayesian, parametric and non-parametric. We introduce notation by providing a list of the probability distributions that will be used later on, together with their first two moments. We review some results on conditional moments and work through several examples. We review definitions of stochastic processes, stationary processes and Markov processes, and finish by introducing the most common discrete-time stochastic processes that exhibit dependence in time and space.
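Two conditional-moment results of the kind reviewed here are the tower property and the law of total variance; these are standard identities, recalled for convenience rather than quoted from the chapter:

\[
\mathbb{E}(Y)=\mathbb{E}\{\mathbb{E}(Y\mid X)\}, \qquad
\operatorname{Var}(Y)=\mathbb{E}\{\operatorname{Var}(Y\mid X)\}+\operatorname{Var}\{\mathbb{E}(Y\mid X)\}.
\]

For instance, if \(Y \mid X \sim \mathrm{Poisson}(X)\) and \(X \sim \mathrm{Gamma}(a,b)\) with mean \(a/b\) and variance \(a/b^2\), then \(\mathbb{E}(Y)=a/b\) and \(\operatorname{Var}(Y)=a/b+a/b^2\), the extra term reflecting the variability of \(X\).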
In this chapter we conclude the book by presenting dependent models for random vectors and for stochastic processes. The types of dependence are exchangeable, Markov, moving average, spatial or a combination of the latter two.
Bringing together years of research into one useful resource, this text empowers the reader to creatively construct their own dependence models. Intended for senior undergraduate and postgraduate students, it takes a step-by-step look at the construction of specific dependence models, including exchangeable, Markov, moving average and, in general, spatio-temporal models. All constructions share the desirable property that the marginal distribution is pre-specified and kept invariant. They do not separate the dependence from the marginals, and the mechanisms used to induce dependence are general enough to apply to a very large class of parametric distributions. All the constructions are based on appropriate definitions of three building blocks in a Bayesian analysis context: prior distribution, likelihood function and posterior distribution. All results are illustrated with examples and graphical representations. Applications with data and code are interspersed throughout the book, covering fields including insurance and epidemiology.
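As a minimal sketch of the three-building-block idea, here is one standard prior–likelihood–posterior construction of a Markov chain with an invariant Beta marginal (the code, function name and parameter values are ours, written as an illustration rather than as the book's own implementation):

```python
import numpy as np

def beta_binomial_chain(T, a=2.0, b=3.0, n=10, seed=0):
    """Markov chain with invariant Beta(a, b) marginal.

    Given X_t ~ Beta(a, b), draw a latent Y_t | X_t ~ Binomial(n, X_t)
    (the likelihood), then X_{t+1} | Y_t ~ Beta(a + Y_t, b + n - Y_t)
    (the posterior).  Marginally every X_t is Beta(a, b); the dependence
    is controlled by n (lag-one correlation n / (a + b + n)).
    """
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    x[0] = rng.beta(a, b)                      # start from the target marginal
    for t in range(T - 1):
        y = rng.binomial(n, x[t])              # likelihood step
        x[t + 1] = rng.beta(a + y, b + n - y)  # posterior step
    return x

if __name__ == "__main__":
    x = beta_binomial_chain(T=50_000)
    # Sample mean should be near a / (a + b) = 0.4; lag-one correlation near 2/3.
    print(x.mean(), np.corrcoef(x[:-1], x[1:])[0, 1])
```

The same recipe applies whenever the prior is conjugate to the likelihood, which is what makes this kind of construction portable across a large class of parametric marginals.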
This enthusiastic introduction to the fundamentals of information theory builds from classical Shannon theory through to modern applications in statistical learning, equipping students with a uniquely well-rounded and rigorous foundation for further study. The book introduces core topics such as data compression, channel coding, and rate-distortion theory using a unique finite blocklength approach. With over 210 end-of-part exercises and numerous examples, students are introduced to contemporary applications in statistics, machine learning and modern communication theory. This textbook presents information-theoretic methods with applications in statistical learning and computer science, such as f-divergences, PAC-Bayes and the variational principle, Kolmogorov's metric entropy, strong data-processing inequalities, and entropic upper bounds for statistical estimation. Accompanied by a solutions manual for instructors, and additional stand-alone chapters on more specialized topics in information theory, this is the ideal introductory textbook for senior undergraduate and graduate students in electrical engineering, statistics, and computer science.
The main reason data compression is possible is an empirical law: real-world sources produce very restricted sets of sequences. How do we model these restrictions? Chapter 10 looks at the first of three compression types that we will consider: variable-length lossless compression.
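As a toy illustration of variable-length lossless compression (our own sketch, not code from the chapter): a Huffman code assigns shorter codewords to more probable symbols, and its expected length is within one bit of the source entropy.

```python
import heapq
from math import log2

def huffman_code(probs):
    """Binary Huffman code for a dict {symbol: probability}.

    Returns {symbol: codeword}.  Expected length L satisfies
    H(X) <= L < H(X) + 1 for any source distribution.
    """
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    code = huffman_code(probs)
    avg_len = sum(probs[s] * len(w) for s, w in code.items())
    entropy = -sum(p * log2(p) for p in probs.values())
    print(code, avg_len, entropy)  # for this dyadic source both equal 1.75 bits
```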
In this chapter our goal is to determine the achievable region of the exponent pairs for the type-I and type-II error probabilities. Our strategy is to apply the achievability and (strong) converse bounds from Chapter 14 in conjunction with the large-deviations theory developed in Chapter 15. After characterizing the full tradeoff we will discuss an adaptive setting of hypothesis testing where, instead of committing ahead of time to testing on the basis of n samples, one can decide adaptively whether to request more samples or stop. We will find out that adaptivity greatly increases the region of achievable error exponents and will learn about the sequential probability ratio test (SPRT) of Wald. In the closing sections we will discuss relations to more complicated settings in hypothesis testing: one with composite hypotheses and one with communication constraints.
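To make Wald's SPRT concrete, here is a minimal sketch (ours, with made-up Bernoulli hypotheses): accumulate the log-likelihood ratio one sample at a time and stop as soon as it leaves an interval set by the target error probabilities, using Wald's classical approximate thresholds.

```python
import numpy as np

def sprt_bernoulli(samples, p0=0.4, p1=0.6, alpha=0.05, beta=0.05):
    """Sequential probability ratio test for H0: p = p0 vs H1: p = p1.

    Stops at the first time the cumulative log-likelihood ratio exits
    (log(beta/(1-alpha)), log((1-beta)/alpha)) -- Wald's approximate
    thresholds for type-I error alpha and type-II error beta.
    Returns (decision, number of samples used).
    """
    upper = np.log((1 - beta) / alpha)
    lower = np.log(beta / (1 - alpha))
    llr = 0.0
    for t, x in enumerate(samples, start=1):
        llr += np.log(p1 / p0) if x == 1 else np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", t
        if llr <= lower:
            return "H0", t
    # Ran out of data without a decision; fall back on the sign of the LLR.
    return ("H1" if llr > 0 else "H0"), len(samples)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.binomial(1, 0.6, size=10_000)  # ground truth is H1
    print(sprt_bernoulli(data))
```

With these parameters the test typically stops after a few dozen samples, which is the sense in which adaptivity helps relative to a fixed-sample-size test.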
The operation of mapping (naturally occurring) continuous-time/analog signals into (electronics-friendly) discrete/digital signals is known as quantization, which is an important subject in signal processing in its own right. In information theory, the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959. To start, in Chapter 24 we will take a closer look at quantization, followed by the information-theoretic formulation. A simple (and tight) converse bound is then given, with the matching achievability bound deferred to Chapter 25.
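As a quick numerical illustration of the interplay (our own, with an arbitrary 8-level quantizer, not a construction from the text): quantize a standard Gaussian source uniformly and compare the entropy of the quantizer output with the Gaussian rate-distortion function R(D) = ½ log₂(σ²/D), which lower-bounds the rate of any scheme achieving mean-squared distortion D.

```python
import numpy as np

def uniform_quantizer_demo(num_levels=8, num_samples=200_000, seed=0):
    """Uniformly quantize N(0, 1) samples and compare the output entropy
    (the rate of an entropy-coded quantizer) with the Gaussian
    rate-distortion bound R(D) = 0.5 * log2(1 / D) at the measured
    mean-squared distortion D (sigma^2 = 1 here)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)
    # Uniform quantizer covering roughly +/- 4 standard deviations.
    edges = np.linspace(-4.0, 4.0, num_levels + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(x, edges) - 1, 0, num_levels - 1)
    distortion = np.mean((x - centers[idx]) ** 2)
    counts = np.bincount(idx, minlength=num_levels)
    p = counts[counts > 0] / num_samples
    rate = -np.sum(p * np.log2(p))                # bits per sample
    rd_bound = 0.5 * np.log2(1.0 / distortion)    # Gaussian R(D)
    print(f"D = {distortion:.4f}, rate = {rate:.3f} bits, R(D) = {rd_bound:.3f} bits")

if __name__ == "__main__":
    uniform_quantizer_demo()   # the measured rate exceeds R(D), as it must
```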
In Chapter 4 we collect some results on variational characterizations of information measures. A well-known method in analysis is to study a functional by proving variational characterizations that represent it as a supremum or infimum of other, simpler (often linear) functionals. Such representations can be useful for multiple purposes:
Convexity: the pointwise supremum of convex functions is convex.
Regularity: the pointwise supremum of lower semicontinuous (lsc) functions is lsc.
Bounds: the upper and lower bounds on the functional follow by choosing good solutions in the optimization problem.
We will see in this chapter that the divergence has two different sup-characterizations (over partitions and over functions). Mutual information is even more special: in addition to inheriting these from the Kullback–Leibler divergence, it possesses two more, an inf-representation over (centroid) measures and a sup-representation over Markov kernels. As applications of these variational characterizations, we discuss the Gibbs variational principle, which serves as the basis of many modern algorithms in machine learning, including the EM algorithm and variational autoencoders; see Section 4.4. An important theoretical construct in machine learning is the idea of PAC-Bayes bounds (Section 4.8*).
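The sup-characterization over functions alluded to here is the Donsker–Varadhan representation, \(D(P\|Q)=\sup_f\{\mathbb{E}_P[f]-\log\mathbb{E}_Q[e^{f}]\}\), attained at \(f=\log\frac{dP}{dQ}\). A quick numerical sanity check for two discrete distributions (our own illustration; the distributions and witness functions are arbitrary):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(P||Q) in nats for discrete P, Q."""
    return float(np.sum(p * np.log(p / q)))

def donsker_varadhan(f, p, q):
    """Donsker-Varadhan objective E_P[f] - log E_Q[e^f]; always <= D(P||Q)."""
    return float(np.sum(p * f) - np.log(np.sum(q * np.exp(f))))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

f_opt = np.log(p / q)               # optimal witness function
f_sub = np.array([1.0, 0.0, -1.0])  # an arbitrary suboptimal choice

print("D(P||Q)               =", kl(p, q))
print("DV with f = log dP/dQ =", donsker_varadhan(f_opt, p, q))  # equals D(P||Q)
print("DV with arbitrary f   =", donsker_varadhan(f_sub, p, q))  # strictly smaller
```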
So far we have been focusing on the paradigm for one-way communication: data are mapped to codewords and transmitted, and later decoded based on the received noisy observations. Chapter 23 looks at the more practical setting (except for storage), where the communication frequently goes both ways, so that the receiver can provide some form of feedback to the transmitter. As a motivating example, consider the communication channel of the downlink transmission from a satellite to Earth. Downlink transmission is very expensive (power constraint at the satellite), but the uplink from Earth to the satellite is cheap, which makes virtually noiseless feedback readily available at the transmitter (satellite). In general, a channel with noiseless feedback is interesting when such asymmetry exists between uplink and downlink. Even in less ideal settings, noisy or partial feedback is commonly available and can potentially improve the reliability or complexity of communication. In the first half of our discussion, we shall follow Shannon to show that even with noiseless feedback “nothing” can be gained in the conventional setup. In the process, we will also introduce the concept of Massey’s directed information. In the second half of the chapter we examine situations where feedback is extremely helpful: low probability of error, variable transmission length and variable transmission power.
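For reference, Massey’s directed information from the input block \(X^n\) to the output block \(Y^n\) is the standard quantity

\[
I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}),
\]

which differs from the mutual information \(I(X^n;Y^n)=\sum_{i=1}^{n} I(X^n; Y_i \mid Y^{i-1})\) only in that each term conditions on the past inputs \(X^i\) rather than the whole input block, thereby capturing the causal direction of influence.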
In Chapter 21 we will consider an interesting variation of the channel coding problem. Instead of constraining the blocklength (i.e., the number of channel uses), we will constrain the total cost incurred by the codewords. The motivation is the following. Consider a deep-space probe that has a k-bit message that needs to be delivered to Earth (or a satellite orbiting it). The duration of transmission is of little concern to the probe; what is really limited is the amount of energy stored in its battery. In this chapter we will learn how to study this question abstractly and how this fundamental limit is related to communication over continuous-time channels.
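One headline instance of this fundamental limit, stated here as standard background rather than as a result quoted from the chapter: for the AWGN channel with (one-sided) noise power spectral density \(N_0\), reliable communication requires an energy per information bit of at least

\[
E_b \ge N_0 \ln 2, \qquad \text{i.e.}\quad \Bigl(\tfrac{E_b}{N_0}\Bigr)_{\min} = \ln 2 \approx -1.59\ \text{dB},
\]

regardless of how long the probe is allowed to transmit.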
In Chapter 25 we present the hard direction of the rate-distortion theorem: the random coding construction of a quantizer. This method is extended to the development of a covering lemma and a soft-covering lemma, which lead to the sharp result of Cuff showing that the fundamental limit of channel simulation is given by Wyner’s common information. We also derive (a strengthened form of) Han and Verdú’s results on approximating output distributions in Kullback–Leibler divergence.
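For orientation, Wyner’s common information of a pair \((X,Y)\) mentioned here is usually defined as

\[
C(X;Y) = \min_{P_{W \mid X,Y}\,:\, X - W - Y} I(X,Y; W),
\]

the smallest rate of a common variable \(W\) that renders \(X\) and \(Y\) conditionally independent; it always satisfies \(C(X;Y) \ge I(X;Y)\).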
The topic of this chapter is the deterministic (worst-case) theory of quantization. The main object of interest is the metric entropy of a set, which allows us to answer two key questions (a short worked example follows the list):
(1) covering number: the minimum number of points to cover a set up to a given accuracy;
(2) packing number: the maximal number of elements of a given set with a prescribed minimum pairwise distance.
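As a simple worked example (ours, for orientation): for the unit interval \([0,1]\) with the usual distance, about \(1/(2\varepsilon)\) points of the form \(\varepsilon, 3\varepsilon, 5\varepsilon, \dots\) cover it with \(\varepsilon\)-balls, and roughly the same number of points can be packed at pairwise distance \(2\varepsilon\), so the metric entropy scales as

\[
\log N([0,1], \varepsilon) \asymp \log \tfrac{1}{\varepsilon},
\]

whereas for the unit ball of \(\mathbb{R}^d\) the covering number grows like \((C/\varepsilon)^d\) and the metric entropy like \(d \log(1/\varepsilon)\).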
The foundational theory of metric entropy was put forth by Kolmogorov, who, together with his students, also determined the behavior of metric entropy in a variety of problems for both finite and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s thirteenth problem, which concerns the possibility or impossibility of representing multivariable functions as compositions of functions of fewer variables. Metric entropy has found numerous connections to and applications in other fields, such as approximation theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.