In this and the next chapter we discuss two binary classification schemes known as the perceptron and support vector machines. In contrast to logistic regression, these methods approximate neither the conditional pdf nor the joint pdf.
Inference deals with the estimation of hidden parameters or random variables from observations of other related variables. In this chapter, we study the basic, yet fundamental, problem of inferring an unknown random quantity from observations of another random quantity by using the mean-square-error (MSE) criterion. Several other design criteria can be used for inference purposes besides MSE, such as the mean-absolute error (MAE) and the maximum a-posteriori (MAP) criteria. We will encounter these possibilities in future chapters, starting with the next chapter. We begin our discussion of inference problems, however, by focusing on the MSE criterion because of its mathematical tractability and because it sheds light on several important questions that arise in the study of inference problems in general.
In this chapter we describe two ensemble learning techniques, known as bagging and boosting, which aggregate the decisions of a mixture of learners to enable enhanced classification performance. In particular, they help transform a collection of “weak” learners into a more robust learning machine.
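As a rough illustration of the bagging idea only (boosting is not covered here), the following sketch trains several weak learners on bootstrap resamples and aggregates them by majority vote. The one-dimensional threshold "stump", the function names, and the toy data are all assumptions made for this example and are not taken from the chapter.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a weak 1-D learner: predict +1 if x > t, else -1.

    The threshold t is chosen among the sample's own x values to
    minimize the number of training errors (fixed polarity for brevity).
    """
    best_t, best_err = None, None
    for t, _ in points:
        err = sum(1 for x, y in points if (1 if x > t else -1) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(points, n_learners=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_learners):
        sample = [rng.choice(points) for _ in points]
        stumps.append(train_stump(sample))
    return stumps

def bagging_predict(stumps, x):
    """Aggregate the weak decisions by majority vote."""
    votes = Counter(1 if x > t else -1 for t in stumps)
    return votes.most_common(1)[0][0]

# Toy data: class -1 on the left, class +1 on the right (hypothetical).
data = [(float(x), -1) for x in range(5)] + [(float(x), 1) for x in range(6, 11)]
stumps = bagging_fit(data)
print(bagging_predict(stumps, 1.0), bagging_predict(stumps, 9.0))
```

An odd number of learners is used so that the two-class vote cannot tie.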
The expectation-maximization (EM) and Baum–Welch algorithms are particularly useful for the processing of data arising from mixture models. Both techniques enable us to identify the parameters of the underlying components, both when the observations are independent of each other and when they follow a first-order Markov process. In this chapter, we consider another important example of a mixture model consisting of a collection of independent sources, a mixing matrix, and the observations. The objective is to undo the mixing and recover the original sources. The resulting technique is known as independent component analysis (ICA).
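The mixing model described above can be written compactly as follows. The symbols are illustrative assumptions rather than the chapter's notation: $s$ collects the independent sources, $A$ is the mixing matrix, $x$ is the observation vector, and $W$ is the unmixing matrix to be determined.

```latex
\[
  x = A\,s, \qquad y = W x \approx P\,D\,s
\]
```

Here $P$ is a permutation matrix and $D$ a diagonal scaling matrix, reflecting the fact that, without further assumptions, the sources can generally be recovered only up to their ordering and scale.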
Over the last few decades sparsity has become a driving force in the development of new and better algorithms in signal and image processing. In the context of the recent deep learning zenith, a pivotal work by Papyan et al. showed that deep neural networks can be interpreted and analyzed as pursuit algorithms seeking sparse representations of signals belonging to a multilayer synthesis sparse model. In this chapter we review recent contributions showing that this observation is correct but incomplete, in the sense that such a model provides a symbiotic mixture of coupled synthesis and analysis sparse priors. We make this observation precise and use it to expand on uniqueness guarantees and stability bounds for the pursuit of multilayer sparse representations. We then explore a convex relaxation of the resulting pursuit and derive efficient optimization algorithms to approximate its solution. Importantly, we deploy these algorithms in a supervised learning formulation that generalizes feed-forward convolutional neural networks into recurrent ones, improving their performance without increasing the number of parameters of the model.
The Laplace method approximates the posterior distribution through a Gaussian probability density function (pdf) that is not always accurate. The Markov chain Monte Carlo (MCMC) method, on the other hand, relies on sampling from auxiliary (proposal) distributions and provides a powerful way to approximate posterior distributions albeit through repeated simulations. In this chapter, we describe a third approach for approximating the posterior distribution, known as expectation propagation (EP). This method restricts the class of distributions from which the posterior is approximated to the Gaussian or exponential family and assumes a factored form for the posterior. The method can become analytically demanding, depending on the nature of the factors used for the posterior, because these factors can make the computation of certain moments unavailable in closed form. The EP method has been observed to lead to good performance in some applications such as the Bayesian logit classification problem, but this behavior is not universal and performance can degrade for other problems, especially when the posterior distribution admits a mixture model.
The k-nearest neighbor (k-NN) rule is appealing. However, each new feature requires searching over the entire training set to determine the neighborhood around it.
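The search cost mentioned above can be seen in a minimal brute-force sketch: every query computes a distance to each training point before voting. The function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs. The sort below
    computes a distance to every training point, so each query costs at
    least one full pass over the training set.
    """
    ranked = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
print(knn_predict(train, (0.2, 0.1)), knn_predict(train, (5.5, 5.5)))
```

Structures such as k-d trees or approximate-neighbor indexes are common ways to avoid the full scan, at the cost of extra preprocessing.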
In this chapter, we describe three other data-based generative methods that approximate the solution to the optimal Bayes classifier (52.8) in the absence of knowledge of the conditional probabilities. The methods estimate the prior probabilities for the classes and, in some cases, assume a Gaussian form for the reverse conditional distribution. The training data is used to estimate the priors and the first- and second-order moments.
This chapter provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, non-robustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. On the basis of the theoretical observations, we propose new open problems.
The discussion in the last two chapters focused on directed graphical models or Bayesian networks, where a directed link from one variable toward another carries with it an implicit connotation of “causal effect” by the former on the latter. In many instances, this implication need not be appropriate or can even be limiting. For example, there are cases where conditional independence relations cannot be represented by a directed graph. One such example is provided in Prob. 43.1. In this chapter, we examine another form of graphical representation where the links are no longer required to be directed, and the probability distributions are replaced by potential functions. These are strictly positive functions defined over sets of connected nodes; they broaden the level of representation by graphical models. The potential functions carry with them a connotation of “similarity” or “affinity” among the variables, but can also be rolled back to represent probability distributions. Over undirected graphs, edges linking nodes will continue to reflect pairwise relationships between the variables but will lead to a fundamental factorization result in terms of the product of clique potential functions. We will show that these functions play a prominent role in the development of message-passing algorithms for the solution of inference problems.
The inference of a random variable from observations requires that we evaluate the posterior distribution as happens, for example, in inference formulations based on mean-square-error (MSE), maximum a-posteriori (MAP), or probability of error metrics. In previous chapters, we described several techniques to facilitate the computation or approximation of such posterior distributions using Monte Carlo or variational inference methods. We will encounter other types of approximations in later chapters. For example, in the context of naïve Bayes classifiers in Chapter 55, we will assume that, conditioned on the latent variable, the observations are independent of each other, so that their joint conditional distribution can be written as a product of individual conditional distributions.
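The conditional-independence assumption behind naïve Bayes classifiers amounts to a product factorization. In the sketch below the symbols are illustrative assumptions, not the chapter's notation: $\gamma$ denotes the latent variable and $h_1,\ldots,h_M$ denote the observations.

```latex
\[
  f(h_1, h_2, \ldots, h_M \mid \gamma) \;=\; \prod_{m=1}^{M} f(h_m \mid \gamma)
\]
```

Under this assumption, the posterior of $\gamma$ follows from Bayes' rule by multiplying the product above by the prior of $\gamma$ and normalizing.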
We give a concise review of the dynamical systems and control theory approaches to deep learning. From the viewpoint of dynamical systems, the back-propagation algorithm in deep learning becomes a simple consequence of the variational equations in ODEs. From the viewpoint of control theory, deep learning is a case of mean-field control in that all the agents share the same control. As an application, we discuss a new class of algorithms for deep learning based on Pontryagin’s maximum principle in control theory.