Regression analysis has been a major theoretical pillar for supervised machine learning since it is applicable to a broad range of identification, prediction and classification problems. There are two major approaches to the design of robust regressors. The first category involves a variety of regularization techniques whose principle lies in incorporating both the error and the penalty terms into the cost function. It is represented by the ridge regressor. The second category is based on the premise that the robustness of the regressor could be enhanced by accounting for potential measurement errors in the learning phase. These techniques are known as errors-in-variables models in statistics and are relatively new to the machine learning community. In our discussion, such errors in variables are viewed as additive input perturbation.
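The principle behind the first category, a cost function that sums a squared-error term and a penalty term, can be illustrated with the ridge regressor's closed-form solution. The sketch below is a minimal NumPy illustration; the toy data and the regularization weight `lam` are made-up assumptions, not an example from the text:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimise ||Xw - y||^2 + lam * ||w||^2, which has the
    # closed form w = (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Noisy samples of y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.0])
w = ridge_fit(X, y, lam=0.1)
```

Increasing `lam` shrinks the solution toward zero, which is exactly the trade between fitting the error term and paying the penalty term.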
This chapter aims at enhancing the robustness of estimators by incorporating input perturbation into the conventional regression analysis. It develops a kernel perturbation-regulated regressor (PRR) that is based on the errors-in-variables models. The PRR offers a strong smoothing capability that is critical to the robustness of regression or classification results. For Gaussian cases, the notion of orthogonal polynomials is instrumental to optimal estimation and its error analysis. More exactly, the regressor may be expressed as a linear combination of many simple Hermite regressors, each focusing on one (and only one) orthogonal polynomial.
This chapter will cover the fundamental theory of linear regression and regularization analysis. The analysis leads to a closed-form error formula that is critical for order-error tradeoff.
In Chapter 8, it is shown that the kernel ridge regressor (KRR) offers a unified treatment for over-determined and under-determined systems. Another way of achieving unification of these two linear systems approaches is by means of the support vector machine (SVM) learning model proposed by Vapnik [41, 280, 281].
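One way to see the unification is through the standard dual form of kernel ridge regression, where the same closed-form expression applies regardless of whether the underlying linear system is over- or under-determined. The following is a hedged sketch of that standard formula; the linear kernel and toy data are illustrative assumptions, not the chapter's example:

```python
import numpy as np

def krr_fit(K, y, lam):
    # Dual closed form of kernel ridge regression:
    # alpha = (K + lam*I)^{-1} y.  The same formula applies
    # whether the underlying system is over- or under-determined.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

# Toy 1-D data with a linear kernel k(x, x') = x * x'
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])          # y = 2x exactly
K = np.outer(x_train, x_train)
alpha = krr_fit(K, y_train, lam=0.01)

# Predict at x* = 4: f(x*) = sum_i alpha_i * k(x*, x_i)
pred = (4.0 * x_train) @ alpha
```

With a small `lam` the prediction at x* = 4 lands close to the noiseless target 8.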
Just like FDA, SVM aims at the separation of two classes. FDA focuses on separating the positive and negative centroids, taking the total data distribution into account. In contrast, SVM aims at the separation of only the so-called support vectors, i.e. only those training vectors deemed critical for class separation.
Just like ridge regression, the objective of the SVM classifier also involves minimization of the two-norm of the decision vector.
The key component in SVM learning is to identify a set of representative training vectors deemed to be most useful for shaping the (linear or nonlinear) decision boundary. These training vectors are called “support vectors.” The rest of the training vectors are called non-support vectors. Note that only support vectors can directly take part in the characterization of the decision boundary of the SVM.
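To make the notion concrete, here is a minimal linear soft-margin SVM trained by sub-gradient descent on the hinge loss. This is a toy NumPy sketch with made-up data (a library SVM would normally be used); after training, the points whose margins sit at or below 1 are the support vectors, while the rest lie safely beyond the margin and do not shape the boundary:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=2000):
    # Sub-gradient descent on  lam*||w||^2 + mean(hinge loss),
    # with labels y in {-1, +1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # margin violators
        gw = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w, b = w - lr * gw, b - lr * gb
    return w, b

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [3., 3.], [3., 4.], [4., 3.]])
y = np.array([-1, -1, -1, 1, 1, 1])
w, b = train_linear_svm(X, y)
# Training points with y*(w.x + b) near 1 hug the margin and act
# as support vectors; points with larger margins do not.
```

Only the inner points of each cluster end up with margins near 1; perturbing the outer points would leave the learned boundary essentially unchanged.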
SVM has successfully been applied to an enormously broad spectrum of application domains, including signal processing and classification, image retrieval, multimedia, fault detection, communication, computer vision, security/authentication, time-series prediction, biomedical prediction, and bioinformatics.
Two primary techniques for dimension-reducing feature extraction are subspace projection and feature selection. This chapter will explore the key subspace projection approaches, i.e. PCA and KPCA.
(i) Section 3.2 provides motivations for dimension reduction by pointing out (1) the potential adverse effect of large feature dimensions and (2) the potential advantage of focusing on a good set of highly selective representations.
(ii) Section 3.3 introduces subspace projection approaches to feature-dimension reduction. It shows that the well-known PCA offers the optimal solution under two information-preserving criteria: least-squares error and maximum entropy.
(iii) Section 3.4 discusses several numerical methods commonly adopted for computation of PCA, including singular value decomposition (on the data matrix), spectral decomposition (on the scatter matrix), and spectral decomposition (on the kernel matrix).
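The first two of these routes can be checked against each other numerically: the eigenvalues of the scatter matrix are the squared singular values of the centred data matrix, and the leading eigenvectors coincide up to sign. A short NumPy sketch, with random data standing in for a real data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)            # centre the data

# Route 1: SVD of the (centred) data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: spectral decomposition of the scatter matrix S = Xc^T Xc
S = Xc.T @ Xc
evals, evecs = np.linalg.eigh(S)   # returned in ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]

# evals should equal s**2, and the leading columns of evecs should
# match the rows of Vt up to sign.
```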
(iv) Section 3.5 shows that spectral factorization of the kernel matrix leads to both kernel-based spectral space and kernel PCA (KPCA) [238]. In fact, KPCA is synonymous with the kernel-induced spectral feature vector. We shall show that nonlinear KPCA offers an enhanced capability in handling complex data analysis. By use of examples, it will be demonstrated that nonlinear kernels offer greater visualization flexibility in unsupervised learning and higher discriminating power in supervised learning.
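As a concrete illustration of the computation in (iv) — centre the kernel matrix, then take its spectral factorization — here is a short sketch. The RBF kernel, the value of `gamma`, and the toy clusters are illustrative assumptions, not the book's formulation:

```python
import numpy as np

def kpca(X, new_dim, gamma=1.0):
    # Kernel PCA: spectral decomposition of the centred kernel matrix
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                  # RBF kernel matrix
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # centring in feature space
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:new_dim]
    # Project the training points onto the leading kernel components
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Two tight, well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Z = kpca(X, new_dim=2)
```

On data like this the first kernel component separates the two clusters by sign, which is the kind of visualization flexibility the section refers to.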
Why dimension reduction?
In many real-world applications, the feature dimension (i.e. the number of features or attributes in an input vector) could easily be as high as tens of thousands. Such an extreme dimensionality could be very detrimental to data analysis and processing.
This part contains two chapters concerning reduction of the dimension of the feature space, which plays a vital role in improving learning efficiency as well as prediction performance.
Chapter 3 covers the most prominent subspace projection approach, namely the classical principal component analysis (PCA), cf. Algorithm 3.1. Theorems 3.1 and 3.2 establish the optimality of PCA for both the minimum reconstruction error and maximum entropy criteria. The optimal error and entropy attainable by PCA are given in closed form. Algorithms 3.2, 3.3, and 3.4 describe the numerical procedures for the computation of PCA via the data matrix, scatter matrix, and kernel matrix, respectively.
Given a finite training dataset, the PCA learning model meets the LSP condition, and thus the conventional PCA model can be kernelized. When a nonlinear kernel is adopted, it further extends to the kernel-PCA (KPCA) learning model. The KPCA algorithms can be presented in intrinsic space or empirical space (see Algorithms 3.5 and 3.6). For several real-life datasets, visualization via KPCA shows more visible data separability than that via PCA. Moreover, KPCA is closely related to the kernel-induced spectral space, which proves instrumental for error analysis in unsupervised and supervised applications.
Chapter 4 explores various aspects of feature selection methods for supervised and unsupervised learning scenarios. It presents several filtering-based and wrapper-based methods for feature selection, a popular method for dimension reduction.
A series of important applications of combinatorics on words has emerged with the development of computerized text and string processing. The aim of this volume, the third in a trilogy, is to present a unified treatment of some of the major fields of applications. After an introduction that sets the scene and gathers together the basic facts, there follow chapters in which applications are considered in detail. The areas covered include core algorithms for text processing, natural language processing, speech processing, bioinformatics, and areas of applied mathematics such as combinatorial enumeration and fractal analysis. No special prerequisites are needed, and no familiarity with the application areas or with the material covered by the previous volumes is required. The breadth of application, combined with the inclusion of problems and algorithms and a complete bibliography will make this book ideal for graduate students and professionals in mathematics, computer science, biology and linguistics.
This is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation of learning systems based on recent advances in statistical learning theory. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, and biosequence analysis, and are now established as one of the standard tools for machine learning and data mining. Students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and its applications. The concepts are introduced gradually in accessible and self-contained stages, while the presentation is rigorous and thorough. Pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. Equally, the book and its associated web site will guide practitioners to updated literature, new applications, and on-line software.
As one of the most comprehensive machine learning texts around, this book does justice to the field's incredible richness, but without losing sight of the unifying principles. Peter Flach's clear, example-based approach begins by discussing how a spam filter works, which gives an immediate introduction to machine learning in action, with a minimum of technical fuss. Flach provides case studies of increasing complexity and variety with well-chosen examples and illustrations throughout. He covers a wide range of logical, geometric and statistical models and state-of-the-art topics such as matrix factorisation and ROC analysis. Particular attention is paid to the central role played by features. The use of established terminology is balanced with the introduction of new and useful concepts, and summaries of relevant background material are provided with pointers for revision if necessary. These features ensure Machine Learning will set a new standard as an introductory textbook.
AND SO WE HAVE come to the end of our journey through the ‘making sense of data’ landscape. We have seen how machine learning can build models from features for solving tasks involving data. We have seen how models can be predictive or descriptive; learning can be supervised or unsupervised; and models can be logical, geometric, probabilistic or ensembles of such models. Now that I have equipped you with the basic concepts to understand the literature, there is a whole world out there for you to explore. So it is only natural for me to leave you with a few pointers to areas you may want to learn about next.
One thing that we have often assumed in the book is that the data comes in a form suitable for the task at hand. For example, if the task is to label e-mails we conveniently learn a classifier from data in the form of labelled e-mails. For tasks such as class probability estimation I introduced the output space (for the model) as separate from the label space (for the data) because the model outputs (class probability estimates) are not directly observable in the data and have to be reconstructed. An area where the distinction between data and model output is much more pronounced is reinforcement learning. Imagine you want to learn how to be a good chess player. This could be viewed as a classification task, but then you require a teacher to score every move.
TWO HEADS ARE BETTER THAN ONE – a well-known proverb suggesting that two minds working together can often achieve better results. If we read ‘features’ for ‘heads’ then this is certainly true in machine learning, as we have seen in the preceding chapters. But we can often further improve things by combining not just features but whole models, as will be demonstrated in this chapter. Combinations of models are generally known as model ensembles. They are among the most powerful techniques in machine learning, often outperforming other methods. This comes at the cost of increased algorithmic and model complexity.
The topic of model combination has a rich and diverse history, to which we can only partly do justice in this short chapter. The main motivations came from computational learning theory on the one hand, and statistics on the other. It is a well-known statistical intuition that averaging measurements can lead to a more stable and reliable estimate because we reduce the influence of random fluctuations in single measurements. So if we were to build an ensemble of slightly different models from the same training data, we might be able to similarly reduce the influence of random fluctuations in single models. The key question here is how to achieve diversity between these different models. As we shall see, this can often be achieved by training models on random subsets of the data, and even by constructing them from random subsets of the available features.
TREE MODELS ARE among the most popular models in machine learning. For example, the pose recognition algorithm in the Kinect motion sensing device for the Xbox game console has decision tree classifiers at its heart (in fact, an ensemble of decision trees called a random forest about which you will learn more in Chapter 11). Trees are expressive and easy to understand, and of particular appeal to computer scientists due to their recursive ‘divide-and-conquer’ nature.
In fact, the paths through the logical hypothesis space discussed in the previous chapter already constitute a very simple kind of tree. For instance, the feature tree in Figure 5.1 (left) is equivalent to the path in Figure 4.6 (left) on p.117. This equivalence is best seen by tracing the path and the tree from the bottom upward.
The left-most leaf of the feature tree represents the concept at the bottom of the path, covering a single positive example.
The next concept up in the path generalises the literal Length = 3 into Length = [3,5] by means of internal disjunction; the added coverage (one positive example) is represented by the second leaf from the left in the feature tree.
By dropping the condition Teeth = few we add another two covered positives.
Dropping the ‘Length’ condition altogether (or extending the internal disjunction with the one remaining value ‘4’) adds the last positive, and also a negative.
THE PREVIOUS CHAPTER introduced binary classification and associated tasks such as ranking and class probability estimation. In this chapter we will go beyond these basic tasks in a number of ways. Section 3.1 discusses how to handle more than two classes. In Section 3.2 we consider the case of a real-valued target variable. Section 3.3 is devoted to various forms of learning that are either unsupervised or aimed at learning descriptive models.
Handling more than two classes
Certain concepts are fundamentally binary. For instance, the notion of a coverage curve does not easily generalise to more than two classes. We will now consider general issues related to having more than two classes in classification, scoring and class probability estimation. The discussion will address two issues: how to evaluate multi-class performance, and how to build multi-class models out of binary models. The latter is necessary for some models, such as linear classifiers, that are primarily designed to separate two classes. Other models, including decision trees, handle any number of classes quite naturally.
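The one-versus-rest construction is a common way to build multi-class models out of binary ones: train one binary scorer per class (that class against the rest) and predict the class whose scorer is most confident. In the sketch below a deliberately naive centroid scorer stands in for a real binary classifier; all names and data are illustrative:

```python
import numpy as np

def train_one_vs_rest(X, y, train_binary):
    # One binary model per class: class c versus the rest
    return {c: train_binary(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_one_vs_rest(models, score, x):
    # Predict the class whose binary model scores x highest
    return max(models, key=lambda c: score(models[c], x))

# Stand-in binary "learner": remember the positive-class centroid,
# and score a point by its negative distance to that centroid.
train_binary = lambda X, y01: X[y01 == 1].mean(axis=0)
score = lambda centroid, x: -np.linalg.norm(x - centroid)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 0.], [10., 0.]])
y = np.array([0, 0, 1, 1, 2, 2])
models = train_one_vs_rest(X, y, train_binary)
```

Swapping in a genuine binary learner (a linear classifier, say) only changes `train_binary` and `score`; the reduction itself stays the same.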
Multi-class classification
Classification tasks with more than two classes are very common. For instance, once a patient has been diagnosed as suffering from a rheumatic disease, the doctor will want to classify him or her further into one of several variants. If we have k classes, performance of a classifier can be assessed using a k-by-k contingency table. Assessing performance is easy if we are interested in the classifier's accuracy, which is still the sum of the descending diagonal of the contingency table, divided by the number of test instances.
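The accuracy computation generalises directly from the two-class case; a small sketch with a made-up 3-by-3 contingency table (rows index the true class, columns the predicted class):

```python
import numpy as np

def accuracy(C):
    # Sum of the descending diagonal of the contingency table,
    # divided by the number of test instances (the sum of all cells)
    C = np.asarray(C)
    return np.trace(C) / C.sum()

C = [[15,  2,  3],   # true class a
     [ 7, 15,  8],   # true class b
     [ 2,  3, 45]]   # true class c
```

Here 15 + 15 + 45 = 75 of the 100 test instances are classified correctly, giving an accuracy of 0.75.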
IN THIS CHAPTER and the next we take a bird's-eye view of the wide range of different tasks that can be solved with machine learning techniques. ‘Task’ here refers to whatever it is that machine learning is intended to improve performance of (recall the definition of machine learning on p.3), for example, e-mail spam recognition. Since this is a classification task, we need to learn an appropriate classifier from training data. Many different types of classifiers exist: linear classifiers, Bayesian classifiers, distance-based classifiers, to name a few. We will refer to these different types as models; they are the subject of Chapters 4–9. Classification is just one of a range of possible tasks for which we can learn a model: other tasks that will pass the review in this chapter are class probability estimation and ranking. In the next chapter we will discuss regression, clustering and descriptive modelling. For each of these tasks we will discuss what it is, what variants exist, how performance at the task could be assessed, and how it relates to other tasks. We will start with some general notation that is used in this chapter and throughout the book (see Background 2.1 for the relevant mathematical concepts).
The objects of interest in machine learning are usually referred to as instances. The set of all possible instances is called the instance space, denoted 𝒳 in this book.
MACHINE LEARNING IS a practical subject as much as a computational one. While we may be able to prove that a particular learning algorithm converges to the theoretically optimal model under certain assumptions, we need actual data to investigate, e.g., the extent to which those assumptions are actually satisfied in the domain under consideration, or whether convergence happens quickly enough to be of practical use. We thus evaluate or run particular models or learning algorithms on one or more data sets, obtain a number of measurements and use these to answer particular questions we might be interested in. This broadly characterises what is known as machine learning experiments.
In the natural sciences, an experiment can be seen as a question to nature about a scientific theory. For example, Arthur Eddington's famous 1919 experiment to verify Einstein's theory of general relativity asked the question: Are rays of light bent by gravitational fields produced by large celestial objects such as the Sun? To answer this question, the perceived position of stars was recorded under several conditions including a total solar eclipse. Eddington was able to show that these measurements indeed differed to an extent unexplained by Newtonian physics but consistent with general relativity.
While you don't have to travel to the island of Príncipe to perform machine learning experiments, they bear some similarity to experiments in physics in that machine learning experiments pose questions about models that we try to answer by means of measurements on data.