Natural organisms inhabit a dynamical environment and arguably a large part of natural intelligence is in modelling causal relations and consequences of actions. In this sense, modelling temporal data is of fundamental interest. In a more artificial environment, there are many instances where predicting the future is of interest, particularly in areas such as finance and also in tracking of moving objects.
In Part IV, we discuss some of the classical models of timeseries that may be used to represent temporal data and also to make predictions of the future. Many of these models are well known in different branches of science from physics to engineering and are heavily used in areas such as speech recognition, financial prediction and control. We also discuss some more sophisticated models in Chapter 25, which may be skipped at first reading.
As an allusion to the fact that natural organisms inhabit a temporal world, we also address in Chapter 26 some basic models of how information processing might be achieved in distributed systems.
When the distribution is multiply connected it would be useful to have a generic inference approach that is efficient in its reuse of messages. In this chapter we discuss an important structure, the junction tree, that by clustering variables enables one to perform message passing efficiently (although the structure on which the message passing occurs may consist of intractably large clusters). The most important thing is the junction tree itself, based on which different message-passing procedures can be considered. The junction tree helps forge links with the computational complexity of inference in fields from computer science to statistics and physics.
Clustering variables
In Chapter 5 we discussed efficient inference for singly connected graphs, for which variable elimination and message-passing schemes are appropriate. In the multiply connected case, however, one cannot in general perform inference by passing messages only along existing links in the graph. The idea behind the Junction Tree Algorithm (JTA) is to form a new representation of the graph in which variables are clustered together, resulting in a singly connected graph in the cluster variables (albeit on a different graph). The main focus of the development will be on marginal inference, though similar techniques apply to other inference tasks, such as finding the most probable state of the distribution.
At this stage it is important to point out that the JTA is not a magic method to deal with intractabilities resulting from multiply connected graphs; it is simply a way to perform correct inference on a multiply connected graph by transforming to a singly connected structure.
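To give a feel for the clustering idea before it is developed formally, consider a minimal example (the potentials here are generic placeholders). A distribution on a single loop,

\[ p(a,b,c,d) \propto \phi(a,b)\,\phi(b,c)\,\phi(c,d)\,\phi(d,a), \]

is multiply connected. Adding the chord b-d (a step known as triangulation) produces cliques {a,b,d} and {b,c,d}, on which the distribution can be rewritten as

\[ p(a,b,c,d) = \frac{p(a,b,d)\,p(b,c,d)}{p(b,d)}. \]

The two clusters, joined through the separator {b,d}, form a singly connected structure on which message passing applies; the price is that we now manipulate tables over three variables rather than two.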
Hidden Markov models assume that the underlying process is discrete; linear dynamical systems that the underlying process is continuous. However, there are scenarios in which the underlying system might jump from one continuous regime to another. In this chapter we discuss a class of models that can be used in this situation. Unfortunately the technical demands of this class of models are somewhat more involved than in previous chapters, although the models are correspondingly more powerful.
Introduction
Complex timeseries which are not well described globally by a single linear dynamical system may be divided into segments, each modelled by a potentially different LDS. Such models can handle situations in which the underlying model ‘jumps’ from one parameter setting to another. For example, a single LDS might well represent the normal flows in a chemical plant. When a break in a pipeline occurs, the dynamics of the system changes from one set of linear flow equations to another. This scenario can be modelled using a set of two linear systems, each with different parameters. The discrete latent variable at each time step, s_t ∈ {normal, pipe broken}, indicates which of the LDSs is most appropriate at the current time. This is called a Switching LDS (SLDS) and is used in many disciplines, from econometrics to machine learning [12, 63, 59, 235, 324, 189].
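In one common parameterisation (the notation here is illustrative rather than fixed by the chapter), the joint distribution factorises as

\[ p(s_{1:T}, h_{1:T}, v_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(h_t \mid h_{t-1}, s_t)\, p(v_t \mid h_t, s_t), \]

where each conditional is linear-Gaussian given the switch state:

\[ h_t = A(s_t)\,h_{t-1} + \eta_t(s_t), \qquad v_t = B(s_t)\,h_t + \epsilon_t(s_t), \]

with h_t the continuous hidden state, v_t the observation, and η, ε zero-mean Gaussian noise whose covariances may also depend on s_t. At t = 1 the transition terms are replaced by priors.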
Probabilistic models explicitly take into account uncertainty and deal with our imperfect knowledge of the world. Such models are of fundamental significance in Machine Learning since our understanding of the world will always be limited by our observations and understanding. We will focus initially on using probabilistic models as a kind of expert system.
In Part I, we assume that the model is fully specified. That is, given a model of the environment, how can we use it to answer questions of interest? We will relate the complexity of inferring quantities of interest to the structure of the graph describing the model. In addition, we will describe operations in terms of manipulations on the corresponding graphs. As we will see, provided the graphs are simple tree-like structures, most quantities of interest can be computed efficiently.
Part I deals with manipulating mainly discrete variable distributions and forms the background to all the later material in the book.
In Part II we address how to learn a model from data. In particular we will discuss learning a model as a form of inference on an extended distribution, now taking into account the parameters of the model.
Learning a model or model parameters from data forces us to deal with uncertainty since with only limited data we can never be certain which is the ‘correct’ model. We also address how the structure of a model, not just its parameters, can in principle be learned.
In Part II we show how learning can be achieved under simplifying assumptions, such as maximum likelihood, which sets parameters to those that would most likely reproduce the observed data. We also discuss the problems that arise when, as is often the case, there is missing data.
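In symbols, maximum likelihood selects the parameters

\[ \theta^{*} = \operatorname{argmax}_{\theta}\; p(\mathcal{D} \mid \theta), \]

that is, the setting under which the observed data \(\mathcal{D}\) is most probable.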
Together with Part I, Part II prepares the basic material required to embark on understanding models in machine learning, having the tools required to learn models from data and subsequently query them to answer questions of interest.
Sampling methods are popular and well known for approximate inference. In this chapter we give an introduction to the less well known class of deterministic approximation techniques. These have been spectacularly successful in branches of the information sciences and many have their origins in the study of large-scale physical systems.
Introduction
Deterministic approximate inference methods are an alternative to the sampling techniques discussed in Chapter 27. Drawing exact independent samples is typically computationally intractable and assessing the quality of the sample estimates is difficult. In this chapter we discuss some alternatives. The first, Laplace's method, is a simple perturbation technique. The second class of methods are those that produce rigorous bounds on quantities of interest. Such methods are interesting since they provide certain knowledge – it may be sufficient, for example, to show that a marginal probability is greater than 0.1 in order to make an informed decision. A further class of methods are the consistency methods, such as loopy belief propagation. Such methods have revolutionised certain fields, including error correction [197]. It is important to bear in mind that no single approximation technique, deterministic or stochastic, is going to beat all others on all problems, given the same computational resources. In this sense, insight as to the properties of the various approximations is useful in matching an approximation method to the problem at hand.
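To give a flavour of the perturbation idea behind Laplace's method: writing p(x) = e^{-E(x)}/Z for x ∈ R^D and expanding the ‘energy’ E to second order around a mode x* (where the gradient vanishes),

\[ E(x) \approx E(x^{*}) + \tfrac{1}{2}(x - x^{*})^{\mathsf{T}} H (x - x^{*}), \qquad H = \nabla\nabla E(x^{*}), \]

yields a Gaussian approximation with mean x* and covariance H^{-1}, together with the normalisation estimate

\[ Z \approx e^{-E(x^{*})} \sqrt{\frac{(2\pi)^{D}}{\det H}}. \]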
We can now make a first connection between probability and graph theory. A belief network introduces structure into a probabilistic model by using graphs to represent independence assumptions among the variables. Probability operations such as marginalising and conditioning then correspond to simple operations on the graph, and details about the model can be ‘read’ from the graph. There is also a benefit in terms of computational efficiency. Belief networks cannot capture all possible relations among variables. However, they are natural for representing ‘causal’ relations, and they are a part of the family of graphical models we study further in Chapter 4.
The benefits of structure
It's tempting to think of feeding a mass of undigested data and probability distributions into a computer and getting back good predictions and useful insights in extremely complex environments. Unfortunately, such a naive approach is likely to fail. The number of possible ways the variables can interact is extremely large, so that without some sensible assumptions we are unlikely to make a useful model. Independently specifying all the entries of a table p(x1, …, xN) over binary variables xi takes O(2^N) space, which is impractical for more than a handful of variables. This is clearly infeasible in many machine learning and related application areas where we need to deal with distributions on potentially hundreds if not millions of variables. Structure is also important for the computational tractability of inferring quantities of interest.
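The saving that structure buys can be made precise. A belief network factorises the joint distribution into local conditional distributions,

\[ p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p\big(x_i \mid \mathrm{pa}(x_i)\big), \]

where pa(x_i) denotes the parents of x_i in the graph. If each binary variable has at most K parents, the tables require only O(N 2^K) entries rather than O(2^N), an exponential saving whenever K ≪ N.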
From Part One - Frameworks for Scaling Up Machine Learning
By Mihai Budiu, Dennis Fetterly, Michael Isard, Frank McSherry, and Yuan Yu, Microsoft Research, Mountain View, CA, USA
This chapter describes DryadLINQ, a general-purpose system for large-scale data-parallel computing, and illustrates its use on a number of machine learning problems.
The main motivation behind the development of DryadLINQ was to make it easier for nonspecialists to write general-purpose, scalable programs that can operate on very large input datasets. In order to appeal to nonspecialists, we designed the programming interface to use a high level of abstraction that insulates the programmer from most of the detail and complexity of parallel and distributed execution. In order to support general-purpose computing, we embedded these high-level abstractions in .NET, giving developers access to full-featured programming languages with rich type systems and proven mechanisms (such as classes and libraries) for managing complex, long-lived, and geographically distributed software projects. In order to support scalability over very large data and compute clusters, the DryadLINQ compiler generates code for the Dryad runtime, a well-tested and highly efficient distributed execution engine.
As machine learning moves into the industrial mainstream and operates over diverse data types including documents, images, and graphs, it is increasingly appealing to move away from domain-specific languages like MATLAB and toward general-purpose languages that support rich types and standardized libraries. The examples in this chapter demonstrate that a general-purpose language such as C# supports effective, concise implementations of standard machine learning algorithms and that DryadLINQ efficiently scales these implementations to operate over hundreds of computers and very large datasets, limited primarily by disk capacity.
From Part Two - Supervised and Unsupervised Learning Algorithms
By Joseph Gonzalez, Yucheng Low, and Carlos Guestrin, Carnegie Mellon University, Pittsburgh, PA, USA
Probabilistic graphical models are used in a wide range of machine learning applications. From reasoning about protein interactions (Jaimovich et al., 2006) to stereo vision (Sun, Shum, and Zheng, 2002), graphical models have facilitated the application of probabilistic methods to challenging machine learning problems. A core operation in probabilistic graphical models is inference – the process of computing the probability of an event given particular observations. Although inference is NP-complete in general, there are several popular approximate inference algorithms that typically perform well in practice. Unfortunately, the approximate inference algorithms are still computationally intensive and therefore can benefit from parallelization. In this chapter, we parallelize loopy belief propagation (loopy BP for short), which is used in a wide range of ML applications (Jaimovich et al., 2006; Sun et al., 2002; Lan et al., 2006; Baron, Sarvotham, and Baraniuk, 2010; Singla and Domingos, 2008).
We begin by briefly reviewing the sequential BP algorithm as well as the necessary background in probabilistic graphical models. We then present a collection of parallel shared memory BP algorithms that demonstrate the importance of scheduling in parallel BP. Next, we develop the Splash BP algorithm, which combines new scheduling ideas to address the limitations of existing sequential BP algorithms and achieve theoretically optimal parallel performance. Finally, we present how to efficiently implement loopy BP algorithms in the distributed parallel setting by addressing the challenges of distributed state and load balancing.
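As a point of reference for the scheduling discussion, the following is a minimal sketch of the synchronous (fixed-schedule) loopy BP baseline on a pairwise model; the data structures and names are ours, and the chapter's contribution lies precisely in replacing this rigid schedule with adaptive ones such as Splash.

    import numpy as np

    def loopy_bp(node_pot, edge_pot, edges, n_iters=50):
        """Synchronous loopy BP on a pairwise model.

        node_pot : dict  node -> 1-D potential array
        edge_pot : dict  (i, j) -> 2-D potential, rows index x_i, cols index x_j
        edges    : list of undirected edges (i, j)
        Returns approximate marginals for every node.
        """
        # Directed messages m[(i, j)](x_j), initialised uniform.
        msgs = {}
        for i, j in edges:
            msgs[(i, j)] = np.ones(len(node_pot[j]))
            msgs[(j, i)] = np.ones(len(node_pot[i]))

        nbrs = {v: [] for v in node_pot}
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)

        for _ in range(n_iters):
            new = {}
            for (i, j) in msgs:
                # Product of node potential and all incoming messages except from j.
                belief = node_pot[i].copy()
                for k in nbrs[i]:
                    if k != j:
                        belief *= msgs[(k, i)]
                pot = edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
                m = belief @ pot            # sum over x_i
                new[(i, j)] = m / m.sum()   # normalise for numerical stability
            msgs = new                      # synchronous: all messages updated at once

        marginals = {}
        for v in node_pot:
            b = node_pot[v].copy()
            for k in nbrs[v]:
                b *= msgs[(k, v)]
            marginals[v] = b / b.sum()
        return marginals

Because every message in a round depends only on the previous round's messages, the inner loop over directed edges is embarrassingly parallel, which is what makes the synchronous schedule a natural, if suboptimal, starting point for parallelization.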
In this chapter, we address distributed learning algorithms for statistical latent variable models, with a focus on topic models. Many high-dimensional datasets, such as text corpora and image databases, are too large to allow one to learn topic models on a single computer. Moreover, a growing number of applications require that inference be fast or in real time, motivating the exploration of parallel and distributed learning algorithms.
We begin by reviewing topic models such as Latent Dirichlet Allocation and Hierarchical Dirichlet Processes. We discuss parallel and distributed algorithms for learning these models and show that these algorithms can achieve substantial speedups without sacrificing model quality. Next we discuss practical guidelines for running our algorithms within various parallel computing frameworks and highlight complementary speedup techniques. Finally, we generalize our distributed approach to handle Bayesian networks.
Several of the results in this chapter have appeared in previous papers in the specific context of topic modeling. The goal of this chapter is to present a comprehensive overview of distributed inference algorithms and to extend the general ideas to a broader class of Bayesian networks.
Latent Variable Models
Latent variable models are a class of statistical models that explain observed data with latent (or hidden) variables. Topic models and hidden Markov models are two examples of such models, where the latent variables are the topic assignment variables and the hidden states, respectively. Given observed data, the goal is to perform Bayesian inference over the latent variables and use the learned model to make inferences or predictions.
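Formally, with observed data x and latent variables z, such a model defines a joint distribution p(x, z) = p(x | z) p(z); the observed-data likelihood is obtained by marginalising,

\[ p(x) = \sum_{z} p(x \mid z)\, p(z), \]

and Bayesian inference over the latent variables amounts to computing the posterior

\[ p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}. \]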
Automatic speech recognition (ASR) allows multimedia content to be transcribed from acoustic waveforms into word sequences. It is an exemplar of a class of machine learning applications where increasing compute capability is enabling new industries such as automatic speech analytics. Speech analytics helps customer service call centers search through recorded content, track service quality, and provide early detection of service issues. Fast and efficient ASR enables the economical application of a wide range of text-based data analytics to multimedia content, opening the door to many possibilities.
In this chapter, we describe our approach for scalable parallelization of the most challenging component of ASR: the speech inference engine. This component takes a sequence of audio features extracted from a speech waveform as input, compares them iteratively to a speech model, and produces the most likely interpretation of the speech waveform as a word sequence. The speech model is a database of acoustic characteristics, word pronunciations, and phrases from a particular language. Speech models for natural languages are represented with large irregular graphs consisting of millions of states and arcs. Referencing these models involves accessing an unpredictable data working set guided by “what was said” in the speech input. The inference process is highly challenging to parallelize efficiently.
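As a toy illustration of what the inference engine computes (the real engine traverses much larger, irregular graphs under pruning strategies), the most likely state sequence for a small HMM-style model can be found with the Viterbi recursion; everything below is a simplified sketch of ours, not the system described in the chapter.

    import numpy as np

    def viterbi(log_init, log_trans, log_emit, obs):
        """Most likely hidden state sequence for a small HMM.

        log_init  : (S,)    log initial state probabilities
        log_trans : (S, S)  log_trans[i, j] = log p(state j | state i)
        log_emit  : (S, V)  log emission probabilities
        obs       : sequence of observation indices
        """
        S = len(log_init)
        T = len(obs)
        delta = log_init + log_emit[:, obs[0]]   # best log-score ending in each state
        back = np.zeros((T, S), dtype=int)       # backpointers

        for t in range(1, T):
            scores = delta[:, None] + log_trans  # (S, S): prev state -> next state
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[:, obs[t]]

        # Trace back the best path from the highest-scoring final state.
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]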
We demonstrate that parallelizing an application is much more than recoding the program in another language. It requires careful consideration of data, task, and runtime concerns to successfully exploit the full parallelization potential of an application.
This book attempts to aggregate state-of-the-art research in parallel and distributed machine learning. We believe that parallelization provides a key pathway for scaling up machine learning to large datasets and complex methods. Although large-scale machine learning has been increasingly popular in both industrial and academic research communities, there has been no singular resource covering the variety of approaches recently proposed. We did our best to assemble the most representative contemporary studies in one volume. While each contributed chapter concentrates on a distinct approach and problem, together with their references they provide a comprehensive view of the field.
We believe that the book will be useful to the broad audience of researchers, practitioners, and anyone who wants to grasp the future of machine learning. To smooth the ramp-up for beginners, the first five chapters provide introductory material on machine learning algorithms and parallel computing platforms. Although the book gets deeply technical in some parts, the reader is assumed to have only basic prior knowledge of machine learning and parallel/distributed computing, along with college-level mathematical maturity. We hope that an engineering undergraduate who is familiar with the notion of a classifier and has had some exposure to threads, MPI, or MapReduce will be able to understand the majority of the book's content. We also hope that a seasoned expert will find this book full of new, interesting ideas to inspire future research in the area.
Mining frequent subtrees in a database of rooted and labeled trees is an important problem in many domains, ranging from phylogenetic analysis to biochemistry and from linguistic parsing to XML data analysis. In this work, we revisit this problem and develop an architecture-conscious solution targeting emerging multicore systems. Specifically, we identify a sequence of memory-related optimizations that significantly improve the spatial and temporal locality of a state-of-the-art sequential algorithm – alleviating the effects of memory latency. Additionally, these optimizations are shown to reduce the pressure on the front-side bus, an important consideration in the context of large-scale multicore architectures. We then demonstrate that these optimizations, although necessary, are not sufficient for efficient parallelization on multicores, primarily because of parametric and data-driven factors that make load balancing a significant challenge. To address this challenge, we present a methodology that adaptively and automatically modulates the type and granularity of the work being shared among different cores. The resulting algorithm achieves near-perfect parallel efficiency on up to 16 processors on challenging real-world applications. The optimizations we present have general-purpose utility, and a key outcome is the development of a general-purpose scheduling service for moldable task scheduling on emerging multicore systems.
The field of knowledge discovery is concerned with extracting actionable knowledge from data efficiently. Although most of the early work in this field focused on mining simple transactional datasets, recently there has been a significant shift toward analyzing data with complex structure such as trees and graphs.
Computer vision is a challenging application area for learning algorithms. For instance, the task of object detection, which remains largely unsolved, is a critical problem for many systems such as mobile robots. In order to interact with the world, robots must be able to locate and recognize large numbers of objects accurately and at reasonable speeds. Unfortunately, off-the-shelf computer vision algorithms do not yet achieve sufficiently high detection performance for these applications. A key difficulty with many existing algorithms is that they are unable to take advantage of large numbers of examples. As a result, they must rely heavily on prior knowledge and hand-engineered features that account for the many kinds of errors that can occur. In this chapter, we present two methods for improving performance by scaling up learning algorithms to large datasets: (1) using graphics processing units (GPUs) and distributed systems to scale up the standard components of computer vision algorithms and (2) using GPUs to automatically learn high-quality feature representations using deep belief networks (DBNs). These methods are capable of not only achieving high performance but also removing much of the need for hand-engineering common in computer vision algorithms.
The fragility of many vision algorithms comes from their lack of knowledge about the multitude of visual phenomena that occur in the real world. Whereas humans can intuit information about depth, occlusion, lighting, and even motion from still images, computer vision algorithms generally lack the ability to deal with these phenomena without being engineered to account for them in advance.
Facing the problem of clustering a multimillion-data-point collection, a machine learning practitioner may choose to apply the simplest clustering method possible, because it is hard to believe that fancier methods can be applicable to datasets of such scale. Whoever is about to adopt this approach should first weigh the following considerations:
Simple clustering methods are rarely effective. Indeed, four decades of research would not have been spent on data clustering if a simple method could solve the problem. Moreover, even the simplest methods may run for long hours on a modern PC, given a large-scale dataset. For example, consider a simple online clustering algorithm (which, we believe, is machine learning folklore): first initialize k clusters with one data point per cluster, then iteratively assign the remaining data points to their closest clusters (in Euclidean space); a sketch of this algorithm is given after these considerations. If k is small enough, we can run this algorithm on one machine, because it is unnecessary to keep the entire dataset in RAM. However, besides being slow, it will produce low-quality results, especially when the data is high-dimensional.
State-of-the-art clustering methods can scale well, which we aim to justify in this chapter.
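To make the first consideration concrete, here is a sketch of the folklore online algorithm described above (our own rendering; a single pass over the data, so the full dataset never needs to be in RAM):

    import numpy as np

    def online_cluster(points, k):
        """Folklore online clustering: seed k clusters with the first k points,
        then assign each remaining point to its nearest centroid (Euclidean),
        updating that centroid as the running mean of its members."""
        centroids = [np.asarray(p, dtype=float) for p in points[:k]]
        counts = [1] * k
        labels = list(range(k))

        for p in points[k:]:
            p = np.asarray(p, dtype=float)
            dists = [np.linalg.norm(p - c) for c in centroids]
            j = int(np.argmin(dists))
            counts[j] += 1
            centroids[j] += (p - centroids[j]) / counts[j]  # incremental mean
            labels.append(j)
        return labels, centroids

Even in this streaming form the result depends on the order in which points arrive, and nearest-centroid assignments in Euclidean space degrade as dimensionality grows, which is exactly the quality problem noted above.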
With the deployment of large computational facilities (such as Amazon.com's EC2, IBM's BlueGene, and HP's XC), the parallel computing paradigm is probably the only currently available option for tackling gigantic data processing tasks. Parallel methods are becoming an integral part of any data processing system and are thus receiving special attention (e.g., universities introduce parallel methods into their core curricula; see Johnson et al., 2008).