This enthusiastic introduction to the fundamentals of information theory builds from classical Shannon theory through to modern applications in statistical learning, equipping students with a uniquely well-rounded and rigorous foundation for further study. The book introduces core topics such as data compression, channel coding, and rate-distortion theory using a unique finite blocklength approach. With over 210 end-of-part exercises and numerous examples, students are introduced to contemporary applications in statistics, machine learning, and modern communication theory. This textbook presents information-theoretic methods with applications in statistical learning and computer science, such as f-divergences, PAC-Bayes and variational principle, Kolmogorov’s metric entropy, strong data-processing inequalities, and entropic upper bounds for statistical estimation. Accompanied by additional stand-alone chapters on more specialized topics in information theory, this is the ideal introductory textbook for senior undergraduate and graduate students in electrical engineering, statistics, and computer science.
The topic of this chapter is the deterministic (worst-case) theory of quantization. The main object of interest is the metric entropy of a set, which allows us to answer two key questions (both formalized after the list below):
(1) covering number: the minimum number of points to cover a set up to a given accuracy;
(2) packing number: the maximal number of elements of a given set with a prescribed minimum pairwise distance.
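For concreteness, one standard formalization is the following (the chapter’s notation and conventions may differ slightly). For a subset Θ of a metric space (V, d), the ε-covering number and ε-packing number are
\[ N(\Theta, d, \epsilon) = \min\Bigl\{ n : \exists\, v_1,\dots,v_n \in V \text{ such that } \Theta \subseteq \bigcup_{i=1}^{n} \{ v : d(v, v_i) \le \epsilon \} \Bigr\}, \]
\[ M(\Theta, d, \epsilon) = \max\Bigl\{ m : \exists\, \theta_1,\dots,\theta_m \in \Theta \text{ such that } d(\theta_i, \theta_j) > \epsilon \text{ for all } i \ne j \Bigr\}, \]
and the metric entropy is $\log N(\Theta, d, \epsilon)$. Under these conventions the two quantities sandwich each other: $M(\Theta, d, 2\epsilon) \le N(\Theta, d, \epsilon) \le M(\Theta, d, \epsilon)$.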
The foundational theory of metric entropy was put forth by Kolmogorov, who, together with his students, also determined the behavior of metric entropy in a variety of problems for both finite and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s thirteenth problem, which concerns the possibility or impossibility of representing multivariable functions as compositions of functions of fewer variables. Metric entropy has found numerous connections to and applications in other fields, such as approximation theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In Chapter 17 we introduce the concept of an error-correcting code (ECC). We discuss what it means for a code to have a low probability of error and what the optimal (ML or MAP) decoder is. For the special case of coding over the binary symmetric channel (BSC), we showcase the evolution of our understanding of fundamental limits, from pre-Shannon approaches to the modern finite-blocklength view. We also briefly review the history of ECCs. We conclude with a conceptually important proof of a weak converse (impossibility) bound on the performance of ECCs.
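One standard route to such a weak converse, via Fano’s inequality (the chapter’s argument may differ in detail), runs as follows: for an (n, M, ε)-code with equiprobable messages $W$, used over a memoryless channel of capacity $C$ bits per channel use without feedback,
\[ \log_2 M = H(W) = I(W; \hat{W}) + H(W \mid \hat{W}) \le nC + 1 + \epsilon \log_2 M, \]
where the inequality combines the data-processing inequality $I(W;\hat{W}) \le I(X^n; Y^n) \le nC$ with Fano’s inequality $H(W \mid \hat{W}) \le 1 + \epsilon \log_2 M$. Rearranging gives $\log_2 M \le \frac{nC + 1}{1 - \epsilon}$; for the BSC with crossover probability δ, $C = 1 - h(\delta)$.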
Chapter 33 introduces strong data-processing inequalities (SDPIs), which are quantitative strengthenings of the DPIs from Part I. As applications, we show how SDPIs yield lower bounds for various estimation problems on graphs and in distributed settings. The purpose of this chapter is twofold. First, we introduce general properties of SDPI coefficients. Second, we show how SDPIs help prove sharp lower (impossibility) bounds for statistical estimation problems. The flavor of the statistical problems in this chapter differs from the rest of the book: here the information about the unknown parameter θ is either spread more “thinly” across a high-dimensional vector of observations than in classical X = θ + Z models (see the spiked Wigner and tree-coloring examples), or distributed across different terminals (as in the correlation and mean estimation examples).
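To fix notation, the SDPI (contraction) coefficient of a channel $P_{Y|X}$ for KL divergence is commonly defined as
\[ \eta_{\mathrm{KL}}(P_{Y|X}) = \sup_{P_X \ne Q_X} \frac{D(P_Y \| Q_Y)}{D(P_X \| Q_X)}, \]
where $P_Y$ and $Q_Y$ are the output distributions induced by inputs $P_X$ and $Q_X$. The ordinary DPI says $\eta_{\mathrm{KL}} \le 1$; an SDPI is the statement that $\eta_{\mathrm{KL}} < 1$ for a given channel, which in particular yields $I(U;Y) \le \eta_{\mathrm{KL}}(P_{Y|X})\, I(U;X)$ for every Markov chain $U \to X \to Y$.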
So far our discussion on information-theoretic methods has been mostly focused on statistical lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis. In Chapter 32 we will discuss three information-theoretic upper bounds for statistical estimation under KL divergence (Yang–Barron), Hellinger (Le Cam–Birgé), and total variation (Yatracos) loss metrics. These three results apply to different loss functions and are obtained using completely different means. However, they take on exactly the same form, involving the appropriate metric entropy of the model. In particular, we will see that these methods achieve minimax optimal rates for the classical problem of density estimation under smoothness constraints.
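Schematically, and suppressing constants, the choice of covering metric, and whether the loss is squared (all of which differ across the three results), the common form is a balance of approximation error against metric entropy per sample: for $n$ i.i.d. observations from a model class $\mathcal{P}$,
\[ \text{minimax risk} \;\lesssim\; \inf_{\epsilon > 0} \left\{ \epsilon^2 + \frac{\log N(\mathcal{P}, \epsilon)}{n} \right\}, \]
where $N(\mathcal{P}, \epsilon)$ denotes an ε-covering number of $\mathcal{P}$ in the appropriate metric.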
Chapter 29 gives an exposition of the classical large-sample asymptotics for smooth parametric models in fixed dimension, highlighting the role of the Fisher information introduced in Chapter 2. Notably, we discuss how to deduce the classical lower bounds (Hammersley–Chapman–Robbins, Cramér–Rao, van Trees) from the variational characterization and the data-processing inequality (DPI) of the χ2-divergence introduced in Chapter 7.
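To give one instance (stated in the scalar, single-observation case): the Hammersley–Chapman–Robbins bound says that any estimator $\hat\theta$ based on $X \sim P_\theta$ satisfies
\[ \mathrm{Var}_\theta(\hat\theta) \;\ge\; \sup_{\theta' \ne \theta} \frac{\bigl( \mathbb{E}_{\theta'}[\hat\theta] - \mathbb{E}_{\theta}[\hat\theta] \bigr)^2}{\chi^2(P_{\theta'} \| P_{\theta})}, \]
a direct consequence of the variational characterization of $\chi^2$. Letting $\theta' \to \theta$ and using the local expansion $\chi^2(P_{\theta'} \| P_{\theta}) = (\theta' - \theta)^2 I_F(\theta) + o((\theta' - \theta)^2)$ recovers the Cramér–Rao bound for unbiased estimators.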
A commonly used method in combinatorics for bounding the number of certain objects from above involves a clever application of Shannon entropy. The precision of this method can be increased in three ways: the marginal bound, the pairwise bound (Shearer’s lemma and its generalization; see Theorem 1.8), and the chain rule (exact calculation).
In this chapter, we give three applications using the above three methods, respectively, in order of increasing difficulty:
(1) enumerating binary vectors of a given average weight;
(2) counting triangles and other subgraphs; and
(3) Brégman’s theorem.
Finally, to demonstrate how the entropy method can also be used for questions in Euclidean spaces, we prove the Loomis–Whitney and Bollobás–Thomason theorems based on analogous properties of differential entropy.
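As a small illustration of the marginal bound behind item (1) (a standard argument, not necessarily the chapter’s exact worked example): if $S \subseteq \{0,1\}^n$ consists of strings with at most $k \le n/2$ ones and $X = (X_1, \dots, X_n)$ is uniform on $S$, then
\[ \log_2 |S| = H(X) \le \sum_{i=1}^{n} H(X_i) = \sum_{i=1}^{n} h(p_i) \le n\, h\Bigl( \tfrac{1}{n} \textstyle\sum_{i=1}^{n} p_i \Bigr) \le n\, h\bigl( \tfrac{k}{n} \bigr), \]
where $p_i = \mathbb{P}[X_i = 1]$ and $h$ is the binary entropy in bits, using subadditivity of entropy, concavity of $h$, and its monotonicity on $[0, 1/2]$ together with $\frac{1}{n} \sum_i p_i = \frac{1}{n} \mathbb{E}[\mathrm{weight}(X)] \le k/n$.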
Chapter 26 evaluates the rate-distortion function for Gaussian and Hamming sources. We also discuss the important foundational implication that an optimal (lossy) compressor paired with an optimal error-correcting code together form an optimal end-to-end communication scheme (known as the joint source–channel coding separation principle). This principle explains why “bits” are the natural currency of the digital age.
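For reference, the two answers take the following form (in bits, with $h$ the binary entropy): for a Gaussian source $\mathcal{N}(0, \sigma^2)$ under mean-squared error distortion and for an equiprobable binary source under Hamming distortion, respectively,
\[ R(D) = \tfrac{1}{2} \log_2 \frac{\sigma^2}{D}, \quad 0 < D \le \sigma^2, \qquad\text{and}\qquad R(D) = 1 - h(D), \quad 0 \le D \le \tfrac{1}{2}, \]
with $R(D) = 0$ for larger distortion levels.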
In Chapter 31 we study three commonly used techniques for proving minimax lower bounds, namely, Le Cam’s method, Assouad’s lemma, and Fano’s method. Compared to the results in Chapter 29, which are geared toward large-sample asymptotics in smooth parametric models, the approach here is more generic, less tied to mean-squared error, and applicable in non-asymptotic settings such as nonparametric or high-dimensional problems. The common rationale of all three methods is reducing statistical estimation to hypothesis testing.
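To fix ideas, the conclusion of Fano’s method (stated schematically; the chapter’s formulation may differ) is the following: if $\theta_1, \dots, \theta_M$ are $2\delta$-separated under a metric loss $d$ and $\theta$ is uniform on $\{\theta_1, \dots, \theta_M\}$, then
\[ \inf_{\hat\theta} \max_{j} \mathbb{E}_{\theta_j}\bigl[ d(\hat\theta, \theta_j) \bigr] \;\ge\; \delta \left( 1 - \frac{I(\theta; X) + \log 2}{\log M} \right), \]
so a minimax lower bound follows from exhibiting a large packing of the parameter space whose elements are statistically hard to distinguish (small mutual information).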
In Chapter 2 we introduced the Kullback–Leibler (KL) divergence that measures the dissimilarity between two distributions. This turns out to be a special case of the family of f-divergences between probability distributions, introduced by Csiszár. Like KL divergence, f-divergences satisfy a number of useful properties: operational significance, invariance to bijective transformations, data-processing inequality, variational representations (à la Donsker–Varadhan), and local behavior.
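Concretely, for a convex function $f$ with $f(1) = 0$ and distributions $P \ll Q$ (the chapter treats the general case more carefully), the f-divergence is
\[ D_f(P \| Q) = \mathbb{E}_Q\!\left[ f\!\left( \frac{dP}{dQ} \right) \right], \]
so that $f(x) = x \log x$ recovers the KL divergence, $f(x) = (x-1)^2$ the $\chi^2$-divergence, $f(x) = \frac{1}{2}|x - 1|$ the total variation, and $f(x) = (1 - \sqrt{x})^2$ the squared Hellinger distance.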
The purpose of Chapter 7 is to establish these properties and prepare the ground for applications in subsequent chapters. An important highlight is the joint-range theorem of Harremoës and Vajda, which gives the sharpest possible comparison inequality between arbitrary f-divergences (and puts an end to a long sequence of results starting from Pinsker’s inequality, Theorem 7.10).
Chapter 3 defines perhaps the most famous concept in the entire field of information theory, mutual information. It was originally defined by Shannon, although the name was coined later by Robert Fano. It has two equivalent expressions (as a Kullback–Leibler divergence and as difference of entropies), both having their merits. In this chapter, we collect some basic properties of mutual information (non-negativity, chain rule, and the data-processing inequality). While defining conditional information, we also introduce the language of directed graphical models, and connect the equality case in the data-processing inequality with Fisher’s concept of sufficient statistics. The connection between information and estimation is furthered in Section 3.7*, in which we relate mutual information and minimum mean-squared error in Gaussian noise (I-MMSE relation). From the latter we also derive the entropy-power inequality, which plays a central role in high-dimensional probability and concentration of measure.
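In symbols, the two equivalent expressions are
\[ I(X;Y) = D\bigl( P_{XY} \,\|\, P_X \otimes P_Y \bigr) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y), \]
with the entropy forms valid whenever the entropies involved are finite (e.g., for discrete $X$ and $Y$).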
In Chapter 30 we describe a strategy for proving statistical lower bounds that we call the mutual information method (MIM): compare the amount of information the data provide about the parameter with the minimum amount of information needed to achieve a given estimation accuracy. As in Section 29.2, the main information-theoretic ingredient is the data-processing inequality, this time for mutual information rather than for f-divergences.
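Schematically, if $\theta \to X \to \hat\theta$ denotes the Markov chain of parameter, data, and estimator, and the estimator achieves expected distortion $\mathbb{E}[d(\theta, \hat\theta)] \le D$ under a prior on $\theta$, then
\[ \varphi(D) \;\le\; I(\theta; \hat\theta) \;\le\; I(\theta; X), \qquad \varphi(D) := \inf\bigl\{ I(\theta; \tilde\theta) : \mathbb{E}[d(\theta, \tilde\theta)] \le D \bigr\}, \]
so a lower bound on the achievable accuracy $D$ follows by comparing the rate-distortion-type quantity $\varphi(D)$ (the minimum information needed) with an upper bound on $I(\theta; X)$ (the information the data provide).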
So far, our discussion of channel coding has mostly followed the same lines as M-ary hypothesis testing (HT) in statistics. In Chapter 18 we introduce a key departure from this: the principal and most interesting goal in information theory is the design of the encoder mapping an input message to the channel input. Once the codebook is chosen, the problem indeed becomes one of M-ary HT and can be tackled by standard statistical tools. However, the task of choosing the encoder has no exact analog in statistical theory (the closest being the design of experiments). It turns out that the problem of choosing a good encoder is much simplified if we adopt a suboptimal way of resolving the M-ary HT, based on thresholding the information density.
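For reference, the information density of a joint distribution $P_{XY}$ is
\[ i(x; y) = \log \frac{P_{Y|X}(y \mid x)}{P_Y(y)}, \]
so that $\mathbb{E}[i(X;Y)] = I(X;Y)$; roughly speaking, the suboptimal decision rule alluded to above decodes to a message whose codeword has information density with the received output exceeding a fixed threshold (the precise variants are developed in the chapter).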
Consider the following problem: given a stream of independent Ber(p) bits with unknown p, we want to turn them into pure random bits, that is, independent Ber(1/2) bits. Our goal is to find a universal way to extract as many bits as possible; in other words, we want to extract as many fair coin flips as possible from possibly biased coin flips, without knowing the actual bias. In 1951 von Neumann proposed the following scheme: divide the stream into pairs of bits; output 0 if the pair is 10, output 1 if it is 01, and otherwise output nothing and move on to the next pair. Since both 01 and 10 occur with probability pq, where q = 1 − p, regardless of the value of p, we obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme, note that, on average, 2n input bits produce 2pqn output bits, so the efficiency (rate) is pq. The question is: can we do better? It turns out that the fundamental limit (maximal efficiency) is given by the binary entropy $h(p)$. In this chapter we discuss two optimal randomness extractors, due to Elias and to Peres, respectively, and several related problems.
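As a quick illustration of von Neumann’s scheme described above, here is a minimal sketch in Python (the function name is ours, not the book’s):

import random

def von_neumann_extractor(bits):
    # Turn a stream of i.i.d. Ber(p) bits (p unknown) into fair bits.
    # Process non-overlapping pairs: 10 emits 0, 01 emits 1,
    # and 00/11 pairs are discarded, matching the convention above.
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if (a, b) == (1, 0):
            out.append(0)
        elif (a, b) == (0, 1):
            out.append(1)
    return out

# Example: extract fair bits from a biased source with p = 0.8.
p, n_pairs = 0.8, 100_000
biased = [1 if random.random() < p else 0 for _ in range(2 * n_pairs)]
fair = von_neumann_extractor(biased)
print(len(fair) / len(biased))        # empirical rate, close to p*(1-p) = 0.16
print(sum(fair) / max(len(fair), 1))  # fraction of ones, close to 1/2

The Elias and Peres extractors discussed in the chapter achieve rates approaching the entropy limit $h(p)$; Peres’s construction, for instance, does so by recursively reusing the randomness that von Neumann’s scheme discards.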