We continue our discussion of hidden Markov models (HMMs) and consider in this chapter the solution of decoding problems. Specifically, given a sequence of observations, we would like to devise mechanisms that allow us to estimate the underlying sequence of state or latent variables. That is, we would like to recover the state evolution that “most likely” explains the measurements. We already know how to perform decoding for the case of mixture models with independent observations by using (38.12a)–(38.12b). The solution is more challenging for HMMs because of the dependency among the states.
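The standard dynamic-programming solution to this decoding problem is the Viterbi algorithm. As a rough sketch (the variable names and log-domain formulation below are illustrative choices, not the chapter's notation), for a discrete HMM with initial distribution pi, transition matrix A, and emission matrix B:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden-state path for a discrete HMM (log domain).

    pi : (S,)   initial state probabilities
    A  : (S, S) A[i, j] = P(state j at time n | state i at time n-1)
    B  : (S, O) B[i, k] = P(observing symbol k | state i)
    obs: (N,)   integer-coded observation sequence
    """
    S, N = len(pi), len(obs)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]      # best log-prob of paths ending in each state
    psi = np.zeros((N, S), dtype=int)         # backpointers
    for n in range(1, N):
        scores = delta[:, None] + logA        # scores[i, j]: arrive at state j from state i
        psi[n] = np.argmax(scores, axis=0)
        delta = scores[psi[n], np.arange(S)] + logB[:, obs[n]]
    path = np.empty(N, dtype=int)             # backtrack from the best final state
    path[-1] = np.argmax(delta)
    for n in range(N - 1, 0, -1):
        path[n - 1] = psi[n, path[n]]
    return path
```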
The various reinforcement learning algorithms described in the last two chapters rely on estimating state values or state–action values directly.
One prominent application of the variational inference methodology of Chapter 36 arises in the context of topic modeling. In this application, the objective is to discover similarities between texts or documents such as news articles. For example, given a large library of articles, running perhaps into the millions, such as a database of newspaper articles written over 100 years, it would be useful to be able to discover in an automated manner the multitude of topics that are covered in the database and to cluster together articles dealing with similar topics such as sports or health or politics. In another example, when a user is browsing an article online, it would be useful to be able to identify automatically the subject matter of the article in order to recommend to the reader other articles of similar content. Latent Dirichlet allocation (or LDA) refers to the procedure that results from applying variational inference techniques to topic modeling in order to address questions of this type.
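As a rough illustration of this workflow, the sketch below runs scikit-learn's LatentDirichletAllocation (which implements a variational Bayes scheme) on a toy corpus; the documents, topic count, and settings are purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for a large article database.
docs = [
    "the team won the match in extra time",
    "a balanced diet improves heart health",
    "parliament passed the budget after a long debate",
    "the striker scored twice and the team won again",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic mixtures (rows sum to 1); documents with similar
# mixtures can be clustered together or recommended to the same reader.
theta = lda.transform(counts)
print(np.round(theta, 2))
```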
The maximum-likelihood (ML) formulation is one of the most formidable tools for the solution of inference problems in modern statistical analysis. It allows the estimation of unknown parameters in order to fit probability density functions (pdfs) onto data measurements. We introduce the ML approach in this chapter and limit our discussions to properties that will be relevant for the future developments in the text. The presentation is not meant to be exhaustive, but targets key concepts that will be revisited in later chapters. We also avoid anomalous situations and focus on the main features of ML inference that are generally valid under some reasonable regularity conditions.
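As a simple concrete instance, fitting a scalar Gaussian pdf to measurements by ML amounts to computing the sample mean and the 1/N-normalized sample variance; the synthetic data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)   # the "measurements"

# For a Gaussian, the log-likelihood is maximized by the sample mean
# and the (biased, 1/N-normalized) sample variance.
mu_ml = data.mean()
var_ml = ((data - mu_ml) ** 2).mean()
print(mu_ml, var_ml)   # close to the true values 2.0 and 1.5**2 = 2.25
```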
The temporal learning algorithms TD(0) and TD(λ) of the previous chapter are useful procedures for state value evaluation; i.e., they permit the estimation of the state value function for a given target policy by observing actions and rewards arising from this policy (on-policy learning) or another behavior policy (off-policy learning). In most situations, however, we are not interested in state values but rather in determining optimal policies, i.e., in selecting what optimal actions an agent should follow in a Markov decision process (MDP).
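One representative procedure of this kind is Q-learning, which estimates optimal state–action values off-policy. The minimal sketch below shows a single tabular update; the step-size and discount parameters are illustrative rather than the chapter's notation:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning update on a tabular Q of shape (num_states, num_actions):

        Q[s, a] <- Q[s, a] + alpha * (r + gamma * max_a' Q[s', a'] - Q[s, a])
    """
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap with the greedy value
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```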
We have described a range of supervised learning algorithms in the previous chapters, including several neural network implementations and their training by means of the backpropagation algorithm. The performance of some of these algorithms has been demonstrated in practice to match or even exceed human performance in important applications. At the same time, it has also been observed that the algorithms are susceptible to adversarial attacks that can drive them to erroneous decisions under minimal perturbations to the data. For instance, adding small perturbations to an image that may not even be perceptible to the human eye has been shown to cause learning algorithms to classify the image incorrectly.
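The classical instance of such an attack is the fast gradient sign method (FGSM), which nudges the input in the direction that most increases the training loss. The sketch below applies it to a simple logistic classifier; the model, variable names, and perturbation size eps are illustrative stand-ins, not the constructions studied in the text:

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """Perturb input x (1-D array) against a logistic classifier (w, b).

    Moves each entry of x by +/- eps along the sign of the gradient of
    the cross-entropy loss, the direction that most increases the loss
    for the true label y in {0, 1}.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted P(y = 1 | x)
    grad_x = (p - y) * w                     # gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)
```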
The expectation-maximization (EM) algorithm can be used to estimate the underlying parameters of the conditional probability density functions (pdfs) by approximating the maximum-likelihood (ML) solution. We found that the algorithm operates on a collection of independent observations, where each observation is generated independently from one of the mixture components. In this chapter and the next, we extend this construction and consider hidden Markov models (HMMs), where the mixture component for one observation is now dependent on the component used to generate the most recent past observation.
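To make this dependency concrete, the sketch below samples from a hypothetical two-state Gaussian HMM: the active mixture component follows a Markov chain instead of being drawn independently at each step (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],    # A[i, j] = P(next component j | current component i)
              [0.2, 0.8]])
means, std = np.array([-2.0, 2.0]), 0.5

z = rng.integers(2)          # initial component (uniform, for simplicity)
states, obs = [], []
for _ in range(200):
    states.append(z)
    obs.append(rng.normal(means[z], std))  # emit from the active component
    z = rng.choice(2, p=A[z])              # next component depends on the current one
```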
Oftentimes, the dimension of the feature space is prohibitively large for computational or visualization purposes. In these situations, it becomes necessary to perform an initial dimensionality reduction step where each feature vector is replaced by a lower-dimensional one.
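A common choice for this reduction step is principal component analysis (PCA). The sketch below assumes a hypothetical data matrix H whose rows are the feature vectors and projects them onto their top-m principal axes:

```python
import numpy as np

def pca_reduce(H, m):
    """Project the rows of H (shape (N, M)) onto the top-m principal axes."""
    Hc = H - H.mean(axis=0)                            # center the features
    U, s, Vt = np.linalg.svd(Hc, full_matrices=False)  # rows of Vt: principal axes
    return Hc @ Vt[:m].T                               # (N, m) reduced coordinates
```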
Maximum likelihood (ML) is a powerful statistical tool that determines model parameters in order to fit probability density functions (pdfs) onto data measurements. The estimated pdfs can then be used for at least two purposes. First, they can help construct optimal estimators or classifiers (such as the conditional mean estimator, the maximum a-posteriori (MAP) estimator, or the Bayes classifier) since, as we already know from previous chapters, these optimal constructions require knowledge of the conditional or joint probability distributions of the variables involved in the inference problem. Second, once a pdf is learned, we can sample from it to generate additional observations. For example, consider a database consisting of images of cats and assume we are able to characterize (or learn) the pdf of the pixel values in these images. Then, we could use the learned pdf to generate “fake” cat-like images (i.e., ones that look like real cats). We will learn later in this text that this construction is possible and some machine-learning architectures are based on this principle: They use data to learn what we call a “generative model,” and then use the model to generate “similar” data. We provide a brief explanation to this effect in the next section, where we explain the significance of posterior distributions.
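As a toy illustration of this fit-then-sample idea (with a two-component Gaussian mixture standing in for the far richer generative models needed for images; all names and values are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in "real" data; in the cat example each row would instead hold
# (features of) an image's pixel values.
data = np.concatenate([rng.normal(-3, 1, (500, 2)),
                       rng.normal( 3, 1, (500, 2))])

# Learn a pdf for the data by (EM-based, approximate) ML, then sample
# from the learned pdf to generate new, "fake" observations.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
fake, _ = gmm.sample(100)
```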