This part contains two chapters on reducing the dimension of the feature space, which plays a vital role in improving both learning efficiency and prediction performance.
Chapter 3 covers the most prominent subspace projection approach, namely classical principal component analysis (PCA), cf. Algorithm 3.1. Theorems 3.1 and 3.2 establish the optimality of PCA under both the minimum-reconstruction-error and maximum-entropy criteria. The optimal error and entropy attainable by PCA are given in closed form. Algorithms 3.2, 3.3, and 3.4 describe the numerical procedures for computing PCA via the data matrix, the scatter matrix, and the kernel matrix, respectively.
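To make the scatter-matrix route concrete, the sketch below computes a PCA projection by eigendecomposition of the centered scatter matrix. It is a minimal NumPy illustration (the function name pca_via_scatter and the random example data are our own assumptions), not a verbatim rendering of Algorithm 3.2 or 3.3.

import numpy as np

def pca_via_scatter(X, m):
    """Project an N x M data matrix X onto its top-m principal components."""
    X_centered = X - X.mean(axis=0)        # remove the sample mean
    S = X_centered.T @ X_centered          # M x M scatter matrix
    _, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :m]          # top-m principal directions
    return X_centered @ top                # N x m projected data

# Example usage on random data
X = np.random.randn(100, 10)
Z = pca_via_scatter(X, m=2)
print(Z.shape)  # (100, 2)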
Given a finite training dataset, the PCA learning model meets the LSP condition, and thus the conventional PCA model can be kernelized. When a nonlinear kernel is adopted, it further extends to the kernel-PCA (KPCA) learning model. The KPCA algorithms can be presented in intrinsic space or in empirical space (see Algorithms 3.5 and 3.6). For several real-life datasets, visualization via KPCA reveals clearer data separability than visualization via PCA. Moreover, KPCA is closely related to the kernel-induced spectral space, which proves instrumental for error analysis in unsupervised and supervised applications.
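As an illustration of the empirical-space formulation, the following sketch performs kernel PCA by eigendecomposition of a centered kernel matrix. The choice of an RBF kernel, the bandwidth parameter gamma, and the helper name kpca_rbf are assumptions made for this example; it is not a verbatim rendering of Algorithm 3.5 or 3.6.

import numpy as np

def kpca_rbf(X, m, gamma=1.0):
    """Return the top-m kernel-PCA components of an N x M data matrix X."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # RBF kernel matrix
    N = K.shape[0]
    one = np.ones((N, N)) / N
    K_c = K - one @ K - K @ one + one @ K @ one   # center the kernel in feature space
    eigvals, eigvecs = np.linalg.eigh(K_c)        # eigenvalues in ascending order
    eigvals = eigvals[::-1][:m]
    eigvecs = eigvecs[:, ::-1][:, :m]
    # scale so each column gives the projection onto a unit-norm feature-space direction
    return eigvecs * np.sqrt(np.maximum(eigvals, 1e-12))

# Example usage on random data
X = np.random.randn(50, 5)
Z = kpca_rbf(X, m=2, gamma=0.5)
print(Z.shape)  # (50, 2)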
Chapter 4 explores various aspects of feature selection, a popular approach to dimension reduction, for both supervised and unsupervised learning scenarios. It presents several filter-based and wrapper-based methods for selecting features.
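As a simple example of the filtering idea, the sketch below ranks the features of a two-class dataset by a Fisher-style score (between-class separation over within-class spread). The specific score and the helper name fisher_scores are illustrative assumptions, not a particular criterion from Chapter 4.

import numpy as np

def fisher_scores(X, y):
    """Score each feature of X (N x M) by between-class vs. within-class spread."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2   # between-class separation
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12    # within-class spread
    return num / den

# Example usage: keep the 3 highest-scoring features
X = np.random.randn(80, 6)
y = np.random.randint(0, 2, size=80)
scores = fisher_scores(X, y)
top_features = np.argsort(scores)[::-1][:3]
print(top_features)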