Computer vision is a challenging application area for learning algorithms. For instance, the task of object detection is a critical problem for many systems, like mobile robots, that remains largely unsolved. In order to interact with the world, robots must be able to locate and recognize large numbers of objects accurately and at reasonable speeds. Unfortunately, off-the-shelf computer vision algorithms do not yet achieve sufficiently high detection performance for these applications. A key difficulty with many existing algorithms is that they are unable to take advantage of large numbers of examples. As a result, they must rely heavily on prior knowledge and hand-engineered features that account for the many kinds of errors that can occur. In this chapter, we present two methods for improving performance by scaling up learning algorithms to large datasets: (1) using graphics processing units (GPUs) and distributed systems to scale up the standard components of computer vision algorithms and (2) using GPUs to automatically learn high-quality feature representations using deep belief networks (DBNs). These methods are capable of not only achieving high performance but also removing much of the need for hand-engineering common in computer vision algorithms.
The fragility of many vision algorithms comes from their lack of knowledge about the multitude of visual phenomena that occur in the real world. Whereas humans can intuit information about depth, occlusion, lighting, and even motion from still images, computer vision algorithms generally lack the ability to deal with these phenomena without being engineered to account for them in advance.
Facing the problem of clustering a multimillion-data-point collection, a machine learning practitioner may choose to apply the simplest clustering method possible, because it is hard to believe that fancier methods can be applicable to datasets of such scale. Whoever is about to adopt this approach should first weigh the following considerations:
Simple clustering methods are rarely effective. Indeed, four decades of research would not have been spent on data clustering if a simple method could solve the problem. Moreover, even the simplest methods may run for long hours on a modern PC, given a large-scale dataset. For example, consider a simple online clustering algorithm (which, we believe, is machine learning folklore): first initialize k clusters with one data point per cluster, then iteratively assign the rest of the data points to their closest clusters (in Euclidean space). If k is small enough, we can run this algorithm on one machine, because it is unnecessary to keep the entire dataset in RAM. However, besides being slow, it will produce low-quality results, especially when the data is highly multi-dimensional.
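Written out, the folklore algorithm is only a few lines. The Python sketch below is one reasonable reading of the description above (we summarize each cluster by a running mean and update it incrementally, a detail the text leaves open); the function name is ours. It shows why only k centroids and counts, rather than the whole dataset, need to sit in RAM, while also making the slow, point-by-point nature of the pass obvious.

```python
import numpy as np

def online_cluster(stream, k):
    """Folklore online clustering, as described above: seed k clusters with the
    first k points, then assign every remaining point to its nearest cluster.
    Each cluster is summarized by a running mean, so only k centroids and counts
    stay in RAM, not the whole dataset."""
    stream = iter(stream)
    centroids = np.array([next(stream) for _ in range(k)], dtype=float)
    counts = np.ones(k)
    labels = list(range(k))
    for x in stream:
        x = np.asarray(x, dtype=float)
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # closest cluster, Euclidean distance
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]             # incremental centroid update
        labels.append(j)
    return centroids, labels

# Example: one pass over 100,000 ten-dimensional points with k = 100.
centroids, labels = online_cluster(np.random.randn(100_000, 10), k=100)
```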
State-of-the-art clustering methods can scale well, which we aim to justify in this chapter.
With the deployment of large computational facilities (such as Amazon.com's EC2, IBM's BlueGene, and HP's XC), the parallel computing paradigm is probably the only currently available option for tackling gigantic data processing tasks. Parallel methods are becoming an integral part of any data processing system and are thus receiving special attention (e.g., universities are introducing parallel methods into their core curricula; see Johnson et al., 2008).
The set of features used by a learning algorithm can have a dramatic impact on the performance of the algorithm. Including extraneous features can make the learning problem more difficult by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, excluding useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity.
The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional datasets become the standard in real-world machine learning tasks. Scalability can become a problem for even simple approaches. For example, common feature selection approaches that evaluate each new feature by training a new model containing that feature require learning a linear number of models each time they add a new feature. This computational cost can add up quickly when we iteratively add many new features. Even those techniques that use relatively computationally inexpensive tests of a feature's value, such as mutual information, require at least linear time in the number of features being evaluated.
As a simple illustrative example, consider the task of classifying websites. In this case, the dataset could easily contain many millions of examples. Including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model.
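As a concrete illustration of the cheaper, filter-style tests mentioned above, the sketch below ranks binary features (e.g., unigram or HTML-tag indicators for the website task) by their empirical mutual information with the class label. Even this inexpensive test requires a pass that is linear in the number of candidate features. The function names and the top-k cutoff are our illustrative choices, not a method from the chapter.

```python
import numpy as np

def mutual_information(feature, label):
    """Empirical mutual information I(X;Y), in nats, between two discrete arrays."""
    feature, label = np.asarray(feature), np.asarray(label)
    mi = 0.0
    for fv in np.unique(feature):
        for lv in np.unique(label):
            p_xy = np.mean((feature == fv) & (label == lv))
            p_x, p_y = np.mean(feature == fv), np.mean(label == lv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def rank_features(X, y, top_k=1000):
    """Score every column of a (binary) feature matrix against the labels and keep the top_k.
    Note the cost: one scan of the data per candidate feature."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:top_k]
```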
Micro-robots, unmanned aerial vehicles, imaging sensor networks, wireless phones, and other embedded vision systems all require low cost and high-speed implementations of synthetic vision systems capable of recognizing and categorizing objects in a scene.
Many successful object recognition systems use dense features extracted on regularly spaced patches over the input image. The majority of the feature extraction systems have a common structure composed of a filter bank (generally based on oriented edge detectors or 2D Gabor functions), a nonlinear operation (quantization, winner-take-all, sparsification, normalization, and/or pointwise saturation), and finally a pooling operation (max, average, or histogramming). For example, the scale-invariant feature transform (SIFT) (Lowe, 2004) operator applies oriented edge filters to a small patch and determines the dominant orientation through a winner-take-all operation. Finally, the resulting sparse vectors are added (pooled) over a larger patch to form a local orientation histogram. Some recognition systems use a single stage of feature extractors (Lazebnik, Schmid, and Ponce, 2006; Dalal and Triggs, 2005; Berg, Berg, and Malik, 2005; Pinto, Cox, and DiCarlo, 2008).
Other models such as HMAX-type models (Serre, Wolf, and Poggio, 2005; Mutch, and Lowe, 2006) and convolutional networks use two more layers of successive feature extractors. Different training algorithms have been used for learning the parameters of convolutional networks. In LeCun et al. (1998b) and Huang and LeCun (2006), pure supervised learning is used to update the parameters. However, recent works have focused on training with an auxiliary task (Ahmed et al., 2008) or using unsupervised objectives (Ranzato et al., 2007b; Kavukcuoglu et al., 2009; Jarrett et al., 2009; Lee et al., 2009).
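The common structure described above (filter bank, pointwise nonlinearity, pooling) can be sketched generically. In the sketch below, the oriented-edge filters, absolute-value rectification, and average pooling are illustrative placeholders, not the particular choices of SIFT, HMAX, or any other cited system; a multi-stage extractor would simply stack such stages.

```python
import numpy as np
from scipy.signal import convolve2d

def extract_features(image, filters, pool=4):
    """One generic feature-extraction stage: filter bank -> pointwise nonlinearity -> pooling."""
    maps = []
    for f in filters:
        r = convolve2d(image, f, mode="valid")          # filter bank (oriented edge detectors)
        r = np.abs(r)                                   # pointwise nonlinearity (rectification)
        h, w = (r.shape[0] // pool) * pool, (r.shape[1] // pool) * pool
        pooled = r[:h, :w].reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
        maps.append(pooled)                             # average pooling over pool x pool patches
    return np.stack(maps)

# Two crude oriented-edge filters standing in for a Gabor filter bank.
edge = np.array([[1.0, 0.0, -1.0]] * 3)
features = extract_features(np.random.rand(64, 64), [edge, edge.T])
```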
In this chapter we look at leveraging the MapReduce distributed computing framework (Dean and Ghemawat, 2004) for parallelizing machine learning methods of wide interest, with a specific focus on learning ensembles of classification or regression trees. Building a production-ready implementation of a distributed learning algorithm can be a complex task. With the wide and growing availability of MapReduce-capable computing infrastructures, it is natural to ask whether such infrastructures may be of use in parallelizing common data mining tasks such as tree learning. For many data mining applications, MapReduce may offer scalability as well as ease of deployment in a production setting (for reasons explained later).
We initially give an overview of MapReduce and outline its application in a classic clustering algorithm, k-means. Subsequently, we focus on PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations and implements each one using the MapReduce model. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
MapReduce is a simple model for distributed computing that abstracts away many of the difficulties in parallelizing data management operations across a cluster of commodity machines.
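To make the abstraction concrete before turning to PLANET, here is a minimal single-machine simulation of one k-means iteration phrased as a map and a reduce. The function names and the in-memory grouping are ours, standing in for what a real MapReduce runtime does across workers; the sketch conveys the shape of the computation rather than the chapter's implementation.

```python
import numpy as np

def kmeans_map(shard, centroids):
    """Map: for each point in this shard, emit (index of nearest centroid, (point, 1))."""
    for x in shard:
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        yield j, (np.asarray(x, dtype=float), 1)

def kmeans_reduce(key, values):
    """Reduce: average all points assigned to centroid `key` to get its new position."""
    total = sum(point for point, _ in values)
    count = sum(c for _, c in values)
    yield key, total / count

def one_kmeans_iteration(shards, centroids):
    """Simulate the shuffle by grouping map output by key, then apply the reducer."""
    grouped = {}
    for shard in shards:                     # on a real cluster, each shard runs on a different worker
        for key, value in kmeans_map(shard, centroids):
            grouped.setdefault(key, []).append(value)
    new_centroids = centroids.copy()
    for key, values in grouped.items():
        for k, centroid in kmeans_reduce(key, values):
            new_centroids[k] = centroid
    return new_centroids
```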
The graphics processing unit (GPU) of modern computers has evolved into a powerful, general-purpose, massively parallel numerical (co-)processor. The numerical computation in a number of machine learning algorithms fits well on the GPU. To help identify such algorithms, we present uniformly fine-grained data-parallel computing and illustrate it on two machine learning algorithms, clustering and regression clustering, on a GPU and central processing unit (CPU) mixed computing architecture. We discuss the key issues involved in a successful design of the algorithms, data structures, and computation partitioning between a CPU and a GPU. Performance gains on a CPU and GPU mixed architecture are compared with the performance of the regression clustering algorithm implemented completely on a CPU. Significant speedups are reported. A GPU and CPU mixed architecture also achieves better cost-performance and energy-performance ratios.
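One way to picture "uniformly fine-grained data-parallel computing" is the assignment step of (regression) clustering: every data point runs the same small, independent computation against all k centroids. The numpy broadcast below is a CPU stand-in for a GPU kernel launched with one thread per point; the partitioning shown is our illustration, not the chapter's implementation.

```python
import numpy as np

def assign_points(points, centroids):
    """Each point independently computes its squared distance to all k centroids and
    picks the nearest; on a GPU this body would be a kernel with one thread per point."""
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d); summing over d gives (n, k) distances.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

points = np.random.rand(100_000, 8).astype(np.float32)
centroids = np.random.rand(64, 8).astype(np.float32)
labels = assign_points(points, centroids)    # identical, independent work items: ideal for a GPU
```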
The computing power of the CPU has increased dramatically in the past few decades, supported by both miniaturization and increasing clock frequencies. More and more electronic gates were packed onto the same area of a silicon die as miniaturization continued. Hardware-supported parallel computing, pipelining for example, further increased the computing power of CPUs. Frequency increases sped up CPUs even more directly. However, the long-predicted physical limit of the miniaturization process was finally hit a few years ago: increasing the frequency further was no longer feasible because of the accompanying nonlinear increase in power consumption, even though miniaturization itself continues.
From Part One - Frameworks for Scaling Up Machine Learning
By Edwin Pednault, IBM Research, Yorktown Heights, NY, USA; Elad Yom-Tov, Yahoo! Research, New York, NY, USA; Amol Ghoting, IBM Research, Yorktown Heights, NY, USA
In many ways, the objective of the IBM Parallel Machine Learning Toolbox (PML) is similar to that of Google's MapReduce programming model (Dean and Ghemawat, 2004) and the open source Hadoop system, which is to provide Application Programming Interfaces (APIs) that enable programmers who have no prior experience in parallel and distributed systems to nevertheless implement parallel algorithms with relative ease. Like MapReduce and Hadoop, PML supports associative-commutative computations as its primary parallelization mechanism. Unlike MapReduce and Hadoop, PML fundamentally assumes that learning algorithms can be iterative in nature, requiring multiple passes over data. It also extends the associative-commutative computational model in various aspects, the most important of which are:
The ability to maintain the state of each worker node between iterations, making it possible, for example, to partition and distribute data structures across workers
Efficient distribution of data, including the ability for each worker to read a subset of the data, to sample the data, or to scan the entire dataset
Access to both sparse and dense datasets
Parallel merge operations using tree structures for efficient collection of worker results on very large clusters
To support these extensions to the computational model while still addressing ease of use, PML provides an object-oriented API in which algorithms are objects that implement a predefined set of interface methods. The PML infrastructure then uses these interface methods to distribute algorithm objects and their computations across multiple compute nodes.
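The sketch below shows what such an algorithm object might look like for k-means. The method names (begin_iteration, process_record, merge, end_iteration) are hypothetical stand-ins, not PML's actual API; they are chosen only to illustrate worker-local state kept between iterations and associative-commutative results that can be merged in a tree.

```python
class KMeansAlgorithm:
    """Illustrative algorithm object in the style described above; method names are
    hypothetical, not PML's real interface. The infrastructure would instantiate this
    object on every worker, feed it that worker's data partition, and merge the results."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]   # worker-local state kept across iterations

    def begin_iteration(self):
        self.partial_sums = {}                                  # per-iteration accumulators

    def process_record(self, record):
        # Called once per record in this worker's partition of the data.
        j = min(range(len(self.centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(record, self.centroids[i])))
        s, c = self.partial_sums.get(j, ([0.0] * len(record), 0))
        self.partial_sums[j] = ([a + b for a, b in zip(s, record)], c + 1)

    def merge(self, other):
        # Associative-commutative merge of another worker's results; the infrastructure
        # can apply it pairwise in a tree to collect results efficiently on large clusters.
        for j, (s, c) in other.partial_sums.items():
            s0, c0 = self.partial_sums.get(j, ([0.0] * len(s), 0))
            self.partial_sums[j] = ([a + b for a, b in zip(s0, s)], c0 + c)

    def end_iteration(self):
        # Update the model from the merged sums; the caller decides whether to iterate again.
        for j, (s, c) in self.partial_sums.items():
            self.centroids[j] = [a / c for a in s]
```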
By Ron Bekkerman, LinkedIn Corporation, Mountain View, CA, USA; Mikhail Bilenko, Microsoft Research, Redmond, WA, USA; John Langford, Yahoo! Research, New York, NY, USA
Distributed and parallel processing of very large datasets has been employed for decades in specialized, high-budget settings, such as financial and petroleum industry applications. Recent years have brought dramatic progress in usability, cost effectiveness, and diversity of parallel computing platforms, with their popularity growing for a broad set of data analysis and machine learning tasks.
The current rise in interest in scaling up machine learning applications can be partially attributed to the evolution of hardware architectures and programming frameworks that make it easy to exploit the types of parallelism realizable in many learning algorithms. A number of platforms make it convenient to implement concurrent processing of data instances or their features. This allows fairly straightforward parallelization of many learning algorithms that view input as an unordered batch of examples and aggregate isolated computations over each of them.
Increased attention to large-scale machine learning is also due to the spread of very large datasets across many modern applications. Such datasets are often accumulated on distributed storage platforms, motivating the development of learning algorithms that can be distributed appropriately. Finally, the proliferation of sensing devices that perform real-time inference based on high-dimensional, complex feature representations drives additional demand for utilizing parallelism in learning-centric applications. Examples of this trend include speech recognition and visual object detection becoming commonplace in autonomous robots and mobile devices.
Semi-supervised learning (SSL) is the process of training decision functions using small amounts of labeled data and relatively large amounts of unlabeled data. In many applications, annotating training data is time consuming and error prone. Speech recognition is a typical example: it requires large amounts of meticulously annotated speech data (Evermann et al., 2005) to produce an accurate system. In the case of document classification for internet search, it is not even feasible to accurately annotate a relatively large number of web pages for all categories of potential interest. SSL lends itself as a useful technique in many machine learning applications because one need annotate only relatively small amounts of the available data. SSL is related to the problem of transductive learning (Vapnik, 1998). In general, a learner is transductive if it is designed for prediction only on a closed dataset, where the test set is revealed at training time. In practice, however, transductive learners can be modified to handle unseen data (Sindhwani, Niyogi, and Belkin, 2005; Zhu, 2005a). Chapter 25 of Chapelle, Scholkopf, and Zien (2007) gives a full discussion of the relationship between SSL and transductive learning. In this chapter, SSL refers to the semi-supervised transductive classification problem.
Let x ∈ X denote the input to the decision function (classifier) f, and y ∈ Y denote its output label, that is, f : X → Y. In most cases, f(x) = argmax_{y ∈ Y} p(y | x).
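In code, this decision rule is just an argmax over the class posteriors for a given input; the label set and probabilities below are invented purely for illustration.

```python
def classify(posteriors):
    """Decision rule f(x) = argmax over y in Y of p(y | x), given the posteriors for one input x."""
    return max(posteriors, key=posteriors.get)

# Toy example with an invented label set Y and an invented posterior p(y | x).
posteriors = {"sports": 0.2, "politics": 0.5, "technology": 0.3}
print(classify(posteriors))   # -> "politics"
```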
The web search ranking task has become increasingly important because of the rapid growth of the internet. With the growth of the web and the number of web search users, the amount of available training data for learning web ranking models has also increased. We investigate the problem of learning to rank on a cluster using web search data composed of 140,000 queries and approximately 14 million URLs. For datasets much larger than this, distributed computing will become essential, because of both speed and memory constraints. We compare against a baseline algorithm that has been carefully engineered to allow training on the full dataset on a single machine, in order to evaluate the loss or gain incurred by the distributed algorithms we consider. The underlying algorithm we use is a boosted tree ranking algorithm called LambdaMART, where a split at a given vertex in each decision tree is determined by the split criterion for a particular feature. Our contributions are twofold. First, we implement a method for improving the speed of training when the training data fits in main memory on a single machine by distributing the vertex split computations of the decision trees. The model produced is equivalent to the one produced by centralized training, but training is faster. Second, we develop a training method for the case where the training data size exceeds the main memory of a single machine. This second approach scales easily to far larger datasets, that is, billions of examples, and is based on data distribution.
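The first contribution rests on a simple observation: the search for the best vertex split decomposes by feature, so disjoint subsets of features can be scanned on different workers and a coordinator can take the overall best, yielding exactly the split that centralized training would choose. The sketch below illustrates this idea with a squared-error-reduction criterion; the criterion and function names are our illustrative choices, not LambdaMART's specifics.

```python
import numpy as np

def best_split_for_feature(x, targets):
    """Scan the sorted values of one feature and return (gain, threshold) for the best split.
    Squared-error reduction is used as an illustrative split criterion."""
    order = np.argsort(x)
    x, targets = x[order], targets[order]
    n, total = len(targets), targets.sum()
    best_gain, best_threshold = -np.inf, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += targets[i - 1]
        if x[i] == x[i - 1]:
            continue                                   # no threshold separates equal values
        right_sum = total - left_sum
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total ** 2 / n
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i - 1]) / 2
    return best_gain, best_threshold

def distributed_best_split(X, targets, feature_shards):
    """Each 'worker' scans a disjoint shard of feature indices; the coordinator takes the
    max over per-worker bests, so the chosen split matches centralized training exactly."""
    per_worker_best = []
    for shard in feature_shards:                       # each shard stands in for one worker
        best = max(((*best_split_for_feature(X[:, j], targets), j) for j in shard),
                   key=lambda t: t[0])
        per_worker_best.append(best)
    gain, threshold, feature = max(per_worker_best, key=lambda t: t[0])
    return feature, threshold, gain
```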
We provide an overall description of the Ciao multiparadigm programming system emphasizing some of the novel aspects and motivations behind its design and implementation. An important aspect of Ciao is that, in addition to supporting logic programming (and, in particular, Prolog), it provides the programmer with a large number of useful features from different programming paradigms and styles and that the use of each of these features (including those of Prolog) can be turned on and off at will for each program module. Thus, a given module may be using, e.g., higher order functions and constraints, while another module may be using assignment, predicates, Prolog meta-programming, and concurrency. Furthermore, the language is designed to be extensible in a simple and modular way. Another important aspect of Ciao is its programming environment, which provides a powerful preprocessor (with an associated assertion language) capable of statically finding non-trivial bugs, verifying that programs comply with specifications, and performing many types of optimizations (including automatic parallelization). Such optimizations produce code that is highly competitive with other dynamic languages or, with the (experimental) optimizing compiler, even that of static languages, all while retaining the flexibility and interactive development of a dynamic language. This compilation architecture supports modularity and separate compilation throughout. The environment also includes a powerful autodocumenter and a unit testing framework, both closely integrated with the assertion system. The paper provides an informal overview of the language and program development environment. It aims at illustrating the design philosophy rather than at being exhaustive, which would be impossible in a single journal paper, pointing instead to previous Ciao literature.
Spectral clustering is a technique for finding group structure in data. It makes use of the spectrum of the data similarity matrix to perform dimensionality reduction for clustering in fewer dimensions. Spectral clustering algorithms have been shown to be more effective in finding clusters than traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computation time when the size of a dataset is large. To perform clustering on large datasets, in this work, we parallelize both memory use and computation using MapReduce and MPI. Through an empirical study on a document set of 534,135 instances and a photo set of 2,121,863 images, we show that our parallel algorithm can effectively handle large problems.
Clustering is one of the most important tasks in machine learning and data mining. In the last decade, spectral clustering (e.g., Shi and Malik, 2000; Meila and Shi, 2000; Fowlkes et al., 2004), motivated by normalized graph cut, has attracted much attention. Unlike traditional partition-based clustering, spectral clustering exploits a pairwise data similarity matrix. It has been shown to be more effective than traditional methods such as k-means, which considers only the similarity between instances and k centroids (Ng, Jordan, and Weiss, 2001). Because of its effectiveness, spectral clustering has been widely used in several areas such as information retrieval and computer vision (e.g., Dhillon, 2001; Xu, Liu, and Gong, 2003; Shi and Malik, 2000; Yu and Shi, 2003).
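For readers new to the method, the serial pipeline being parallelized looks roughly as follows: build a pairwise similarity matrix, normalize it, embed the data with the top-k eigenvectors, and run k-means on the embedding. The sketch below follows a common textbook recipe (Gaussian similarity, normalized affinity matrix) and is not the chapter's MapReduce/MPI implementation; its dense n x n similarity matrix is precisely where the scalability problem arises.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma=1.0):
    """Serial sketch of spectral clustering. It materializes the full n x n similarity
    matrix, which is exactly the memory and compute bottleneck the parallel algorithm
    addresses by distributing these same steps."""
    # 1. Pairwise Gaussian similarities with a zeroed diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # 2. Normalized affinity matrix D^{-1/2} S D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # 3. Embed each point using the eigenvectors of the k largest eigenvalues.
    _, V = np.linalg.eigh(L)                       # eigenvalues in ascending order
    U = V[:, -k:]
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize the embedding
    # 4. Run k-means in the embedded space to obtain the final clusters.
    _, labels = kmeans2(U, k, minit="++")
    return labels

labels = spectral_clustering(np.random.rand(300, 2), k=3)
```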
From Part Two - Supervised and Unsupervised Learning Algorithms
By Edward Y. Chang, Hongjie Bai, Kaihua Zhu, Hao Wang, Jian Li, and Zhihuan Qiu, Google Research, Beijing, China
The paradigm of Tabled Logic Programming (TLP) is now supported by a number of Prolog systems, including XSB, YAP Prolog, B-Prolog, Mercury, ALS, and Ciao. The reasons for this are partly theoretical: tabling ensures termination and optimal known complexity for queries to a large class of programs. However, the overriding reasons are practical. TLP allows sophisticated programs to be written concisely and efficiently, especially when mechanisms such as tabled negation and call and answer subsumption are supported. As a result, TLP has now been used in a variety of applications from program analysis to querying over the semantic web. This paper provides a survey of TLP and its applications as implemented in the XSB Prolog, along with discussion of how XSB supports tabling with dynamically changing code, and in a multi-threaded environment.
Massive training datasets, ranging in size from tens of gigabytes to several terabytes, arise in diverse machine learning applications in areas such as text mining of web corpora, multimedia analysis of image and video data, retail modeling of customer transaction data, bioinformatic analysis of genomic and microarray data, medical analysis of clinical diagnostic data such as functional magnetic resonance imaging (fMRI) images, and environmental modeling using sensor and streaming data. Provost and Kolluri (1999), in their overview of machine learning with massive datasets, emphasize the need for developing parallel algorithms and implementations for these applications.
In this chapter, we describe the Transform Regression (TReg) algorithm (Pednault, 2006), which is a general-purpose, non-parametric methodology suitable for a wide variety of regression applications. TReg was originally created for the data mining component of the IBM InfoSphere Warehouse product, guided by a challenging set of requirements:
1. The modeling time should be comparable to linear regression.
2. The resulting models should be compact and efficient to apply.
3. The model quality should be reliable without any further tuning.
4. The model training and scoring should be parallelized for large datasets stored as partitioned tables in IBM's DB2 database systems.
Requirements 1 and 2 were deemed necessary for a successful commercial algorithm, although this ruled out certain ensemble-based methods that produce high-quality models but have high computation and storage requirements. Requirement 3 ensured that the chosen algorithm did not unduly compromise model quality in view of requirements 1 and 2.