In Chapters 33 and 34 we described three methods for approximating posterior distributions: the Laplace method, the Markov chain Monte Carlo (MCMC) method, and the expectation-propagation (EP) method. Given an observable variable and a latent variable, the Laplace method approximates the posterior of the latent variable by a Gaussian distribution; it was seen to be suitable for problems with low-dimensional latent spaces because its implementation involves a matrix inversion. The Gaussian approximation, however, is not sufficient in many instances and can perform poorly. The MCMC method is more powerful, and also more popular, and relies on elegant sampling techniques such as the Metropolis–Hastings algorithm. However, MCMC requires a large number of samples, does not perform well on complex models, and does not scale well to higher dimensions and large datasets. The EP method, on the other hand, limits the class of distributions from which the posterior is approximated to the Gaussian or exponential families, and can be analytically demanding. In this chapter, we develop a fourth powerful method for posterior approximation known as variational inference. One of its advantages is that it usually scales better to large datasets and high-dimensional problems.
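As a quick illustration of the sampling flavor of MCMC, the random-walk Metropolis–Hastings recursion can be sketched in a few lines. The toy target density, proposal step size, and burn-in length below are illustrative assumptions, not choices made in the text:

```python
import numpy as np

# Random-walk Metropolis-Hastings targeting a standard normal density
# (toy example: in practice the target is an intractable posterior).
rng = np.random.default_rng(0)

def log_target(z):
    return -0.5 * z**2  # unnormalized log-density of N(0, 1)

z = 0.0
samples = []
for _ in range(20000):
    proposal = z + rng.normal(scale=1.0)     # symmetric random-walk proposal
    log_accept = log_target(proposal) - log_target(z)
    if np.log(rng.uniform()) < log_accept:   # Metropolis acceptance test
        z = proposal
    samples.append(z)

samples = np.array(samples[5000:])           # discard burn-in
```

The chain's empirical mean and standard deviation should approach those of the target (0 and 1), which is how such a sampler is typically sanity-checked.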
We formulated the maximum-likelihood (ML) approach in the previous chapter, where an unknown parameter is estimated by maximizing the log-likelihood function. We showed there that in some cases of interest this problem can be solved analytically in closed form and an expression for the parameter estimate can be determined in terms of the observations. However, there are many important scenarios where the ML solution cannot be pursued in closed form, either due to mathematical intractability or due to the presence of missing data or hidden (unobservable) variables. In this chapter, we motivate and describe the expectation-maximization (EM) procedure as a useful tool for constructing ML estimates under these more challenging conditions. We also illustrate how EM can be used to fit mixture models to data.
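The EM recursion alternates an expectation step (computing responsibilities) with a maximization step (re-estimating parameters). A minimal sketch for a two-component one-dimensional Gaussian mixture, assuming for simplicity that both components have known unit variance (the general algorithm also updates the variances):

```python
import numpy as np

# EM for a two-component 1-D Gaussian mixture with unit variances.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

pi, mu = 0.5, np.array([-1.0, 1.0])   # initial mixing weight and means
for _ in range(50):
    # E-step: responsibility of component 0 for each point
    # (normalization constants cancel since variances are equal)
    p0 = pi * np.exp(-0.5 * (data - mu[0])**2)
    p1 = (1 - pi) * np.exp(-0.5 * (data - mu[1])**2)
    r0 = p0 / (p0 + p1)
    # M-step: re-estimate mixing weight and component means
    pi = r0.mean()
    mu = np.array([np.sum(r0 * data) / r0.sum(),
                   np.sum((1 - r0) * data) / (1 - r0).sum()])
```

After a few dozen iterations the estimated means settle near the true component centers (-2 and 3 in this synthetic example).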
In this chapter we illustrate one important application of the linear mean-square-error (MSE) theory to the derivation of the famed Kalman filter. The filter is a powerful recursive technique for updating the estimates of the state (hidden) variables of a state-space model from noisy observations. The state evolution satisfies a Markovian property in the sense that the distribution of the state at any time instant depends only on the most recent past state. Likewise, the distribution of the observation at the same time instant depends only on the current state. The state and observation variables are represented by a linear state-space model, which will be shown to enable a powerful recursive solution. One key step in the argument is the introduction of the innovations process and the exploitation to great effect of the principle of orthogonality. In Chapter 35 we will allow for nonlinear state-space models and derive the class of particle filters by relying instead on the concept of sequential importance sampling.
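A scalar special case conveys the structure of the recursive solution: predict the state estimate, form the innovation, and correct with the Kalman gain. The random-walk model and noise variances below are toy assumptions:

```python
import numpy as np

# Scalar Kalman filter for a random-walk state x_{t+1} = x_t + process noise,
# observed as y_t = x_t + measurement noise.
rng = np.random.default_rng(2)
Q, R = 0.01, 1.0                 # process and measurement noise variances
x_true, x_hat, P = 0.0, 0.0, 1.0  # true state, estimate, error variance
errors = []
for _ in range(500):
    x_true += rng.normal(scale=np.sqrt(Q))     # simulate state evolution
    y = x_true + rng.normal(scale=np.sqrt(R))  # noisy observation
    P += Q                       # prediction: error variance grows by Q
    K = P / (P + R)              # Kalman gain
    x_hat += K * (y - x_hat)     # correction by the innovation y - x_hat
    P *= (1 - K)                 # updated error variance
    errors.append((x_hat - x_true)**2)
mse = np.mean(errors)
```

The filtered estimate should be far more accurate than the raw measurements: here the empirical MSE lands well below the measurement noise variance R = 1.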
We discussed the least-squares problem in the previous chapter, which uses a collection of data points to determine an optimal parameter by minimizing an empirical quadratic risk.
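For a data set of N scalar observations paired with feature vectors, the empirical quadratic risk of least squares is commonly written as follows; the symbols here are generic placeholders and not necessarily the book's notation:

```latex
\min_{w}\;\frac{1}{N}\sum_{n=1}^{N}\big(\gamma(n)-h_n^{\mathsf{T}}w\big)^2
```

where $\gamma(n)$ denotes the n-th target value, $h_n$ the corresponding feature vector, and $w$ the parameter vector being sought.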
Most learning algorithms, including deep neural networks with many layers and parameters, act as black-box procedures where feature vectors at the input layer are transformed into label predictions at the output layer through a succession of nonlinear transformations. Given how prevalent learning-based systems are becoming in modern practice, including their use in fields such as medical diagnosis, autonomous systems, and even legal proceedings, it is necessary to have confidence in their predictions in order to ensure reliable, fair, and nondiscriminatory conclusions. For this reason, one needs to understand how classification results are attained, and what attributes in the input data have influenced the decisions most heavily. Questions of this type are addressed under the topic of explainability in machine learning.
In the previous chapter we clarified the representation power of Bayesian networks. In this chapter, we examine the solution to inference problems over these networks. In particular, given some observations, we would like to determine (a) the states of some nodes or (b) the most probable hypothesis or explanation corresponding to the observations. We present three inference methods known as (1) inference by enumeration, which is a brute-force exact method, (2) inference by variable elimination, which is a more efficient procedure, and (3) belief propagation. The last method is described in the next chapter, where it is shown to be a special case of the sum-product message-passing algorithm. In general, solving inference problems over Bayesian networks is NP-complete (i.e., no algorithm with complexity polynomial in the number of nodes is known); for the benefit of the reader, we describe the various notions of NP-complexity in the concluding remarks of the chapter. We also describe approaches for learning the underlying graph structure from observations, including the Chow–Liu algorithm and graphical LASSO.
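Inference by enumeration can be sketched directly: sum the joint distribution over all configurations consistent with the evidence and normalize. The two-node network and probability tables below are illustrative assumptions:

```python
# Inference by enumeration on a toy two-node network Rain -> WetGrass.
# The conditional probability tables are assumptions for illustration.
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True:  {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def posterior_rain(wet_obs):
    # Enumerate the joint over all values of Rain that are consistent
    # with the evidence, then normalize to obtain the posterior.
    joint = {r: P_rain[r] * P_wet_given_rain[r][wet_obs]
             for r in (True, False)}
    z = sum(joint.values())
    return {r: v / z for r, v in joint.items()}

post = posterior_rain(True)   # P(Rain | WetGrass = true)
```

With these tables, observing wet grass raises the probability of rain from the prior 0.2 to 0.18/0.34 ≈ 0.53. The brute-force nature of the method is visible: the enumeration grows exponentially with the number of unobserved nodes, which is what variable elimination improves upon.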
In this and the next chapter we discuss two binary classification schemes known as the perceptron and support vector machines. In contrast to logistic regression, these methods approximate neither the conditional pdf of the label given the feature vector nor the joint pdf of the data.
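The perceptron rule updates the weight vector only on misclassified samples. A minimal sketch on a synthetic, linearly separable data set; the data generation, margin, and epoch count are assumptions for illustration:

```python
import numpy as np

# Perceptron learning on a linearly separable toy data set.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.5][:200]  # enforce a margin around the boundary
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels from a linear rule

w, b = np.zeros(2), 0.0
for _ in range(50):                           # epochs over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:            # misclassified sample: update
            w += yi * xi
            b += yi
errors = np.sum(y * (X @ w + b) <= 0)         # remaining training mistakes
```

Because the data are separable with a positive margin, the classical convergence argument guarantees the loop stops making updates after finitely many mistakes, and the final training error is zero.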
Inference deals with the estimation of hidden parameters or random variables from observations of other related variables. In this chapter, we study the basic, yet fundamental, problem of inferring an unknown random quantity from observations of another related random quantity by using the mean-square-error (MSE) criterion. Several other design criteria can be used for inference purposes besides MSE, such as the mean-absolute error (MAE) and maximum a posteriori (MAP) criteria. We will encounter these possibilities in future chapters, starting with the next chapter. We initiate our discussion of inference problems, though, by focusing on the MSE criterion due to its mathematical tractability and because it sheds light on several important questions that arise in the study of inference problems in general.
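A quick Monte Carlo check illustrates the central fact of MSE inference, namely that the conditional mean is the optimal estimator. The additive Gaussian model below is an assumption chosen so that the conditional mean is known in closed form:

```python
import numpy as np

# Monte Carlo check that the conditional mean minimizes the MSE.
# Toy model: y = x + v with x ~ N(0,1) and v ~ N(0,1) independent,
# for which E[x | y] = y / 2 and the minimum MSE equals 1/2.
rng = np.random.default_rng(4)
x = rng.normal(size=100000)
y = x + rng.normal(size=100000)
mse_opt = np.mean((x - 0.5 * y)**2)   # conditional-mean estimator
mse_alt = np.mean((x - 0.8 * y)**2)   # an arbitrary suboptimal linear estimator
```

The empirical MSE of the conditional-mean estimator comes out near the theoretical minimum of 0.5, and any other scaling of the observation does worse.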
In this chapter we describe two ensemble learning techniques, known as bagging and boosting, which aggregate the decisions of a mixture of learners to enable enhanced classification performance. In particular, they help transform a collection of “weak” learners into a more robust learning machine.
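Bagging can be sketched as a majority vote over weak learners, each trained on a bootstrap resample of the data. The one-dimensional data, label-noise level, decision-stump learner, and number of rounds below are all toy assumptions:

```python
import numpy as np

# Bagging: majority vote over decision stumps trained on bootstrap resamples.
rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=300)
y = np.where(X > 0.1, 1, -1)                  # true labeling rule
y[rng.uniform(size=300) < 0.2] *= -1          # 20% label noise

def fit_stump(Xb, yb):
    # Weak learner: pick the (threshold, sign) pair with fewest errors.
    best = None
    for t in Xb:
        for s in (1, -1):
            err = np.mean(s * np.where(Xb > t, 1, -1) != yb)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best[1], best[2]

votes = np.zeros_like(X)
for _ in range(25):                           # 25 bootstrap rounds
    idx = rng.integers(0, len(X), len(X))     # resample with replacement
    t, s = fit_stump(X[idx], y[idx])
    votes += s * np.where(X > t, 1, -1)       # accumulate each stump's vote
pred = np.sign(votes)
acc = np.mean(pred == y)
```

Each individual stump is destabilized by the bootstrap resampling and the label noise, but the aggregated vote recovers a threshold close to the true rule, which is the sense in which bagging turns weak learners into a more robust machine.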
The expectation-maximization (EM) and Baum–Welch algorithms are particularly useful for processing data arising from mixture models. Both techniques enable us to identify the parameters of the underlying components, both when the observations are independent of each other and when they follow a first-order Markov process. In this chapter, we consider another important example of a mixture model, consisting of a collection of independent sources, a mixing matrix, and the observations. The objective is to undo the mixing and recover the original sources. The resulting technique is known as independent component analysis (ICA).
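A minimal ICA-style sketch for two sources: whiten the observed mixtures, then search for the rotation that maximizes non-Gaussianity (measured here by the magnitude of the excess kurtosis). The sources, mixing matrix, and brute-force grid search are illustrative assumptions rather than the algorithm developed in the chapter:

```python
import numpy as np

# Toy ICA: recover two independent sub-Gaussian sources from linear mixtures.
rng = np.random.default_rng(6)
s = rng.uniform(-1, 1, size=(2, 5000))        # independent uniform sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # "unknown" mixing matrix
x = A @ s                                     # observed mixtures

# Whitening: decorrelate and normalize the mixtures
cov = np.cov(x)
d, E = np.linalg.eigh(cov)
z = E @ np.diag(d**-0.5) @ E.T @ x

def kurt(u):
    # Excess kurtosis; far from 0 indicates non-Gaussianity
    return np.mean(u**4) - 3 * np.mean(u**2)**2

# After whitening, the sources differ from z only by a rotation:
# grid-search the angle that maximizes non-Gaussianity.
best_ang, best_val = 0.0, -np.inf
for ang in np.linspace(0, np.pi / 2, 180):
    u = np.cos(ang) * z[0] + np.sin(ang) * z[1]
    if abs(kurt(u)) > best_val:
        best_ang, best_val = ang, abs(kurt(u))
c, sn = np.cos(best_ang), np.sin(best_ang)
s_hat = np.array([[c, sn], [-sn, c]]) @ z     # recovered sources
```

Up to the usual ICA ambiguities of permutation, sign, and scale, each recovered component correlates strongly with one of the original sources.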