In Chapter 1 we gave a general overview of pattern analysis. We identified three properties that we expect of a pattern analysis algorithm: computational efficiency, robustness and statistical stability. Motivated by the observation that recoding the data can increase the ease with which patterns can be identified, we now outline the kernel methods approach adopted in this book. This approach to pattern analysis first embeds the data in a suitable feature space, and then uses algorithms based on linear algebra, geometry and statistics to discover patterns in the embedded data.
The current chapter will elucidate the different components of the approach by working through a simple example task in detail. The aim is to demonstrate all of the key components and hence provide a framework for the material covered in later chapters.
Any kernel methods solution comprises two parts: a module that performs the mapping into the embedding or feature space and a learning algorithm designed to discover linear patterns in that space. There are two main reasons why this approach should work. First of all, detecting linear relations has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are both well understood and efficient. Secondly, we will see that there is a computational shortcut which makes it possible to represent linear patterns efficiently in high-dimensional spaces to ensure adequate representational power. The shortcut is what we call a kernel function.
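To make the shortcut concrete, here is a standard illustration (the particular feature map is chosen for exposition and is not taken from the text). For two-dimensional inputs, consider the embedding $\phi(\mathbf{x}) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2)$. The inner product in the three-dimensional feature space can be computed directly from the original inputs:

$$\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\,x_1 x_2 z_1 z_2 = \langle \mathbf{x}, \mathbf{z} \rangle^2 = \kappa(\mathbf{x}, \mathbf{z}),$$

so the kernel $\kappa$ evaluates inner products in the feature space without ever computing the embedding explicitly.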
This chapter presents a number of algorithms for particular pattern analysis tasks such as novelty-detection, classification and regression. We consider criteria for choosing particular pattern functions, in many cases derived from stability analysis of the corresponding tasks they aim to solve. The optimisation of the derived criteria can be cast in the framework of convex optimisation, either as linear or convex quadratic programs. This ensures that, as with the algorithms of the last chapter, the methods developed here do not suffer from the problem of local minima. They include such celebrated methods as support vector machines for both classification and regression.
We start, however, by describing how to find the smallest hypersphere containing the training data in the embedding space, together with the use and analysis of this algorithm for detecting anomalous or novel data. The techniques introduced for this problem are easily adapted to the task of finding the maximal margin hyperplane, or support vector solution, that separates two sets of points, again possibly allowing some fraction of the points to be exceptions. This in turn leads to algorithms for the case of regression.
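As a sketch of the optimisation involved (the notation here is ours and may differ from the text), write $\phi$ for the feature map and $\mathbf{x}_1, \dots, \mathbf{x}_\ell$ for the training points; the smallest enclosing hypersphere with centre $\mathbf{c}$ and radius $r$ solves

$$\min_{\mathbf{c},\, r}\; r^2 \quad \text{subject to} \quad \|\phi(\mathbf{x}_i) - \mathbf{c}\|^2 \le r^2, \qquad i = 1, \dots, \ell,$$

a convex quadratic program; a new point is then flagged as novel when its image falls outside the learned hypersphere.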
An important feature of many of these systems is that, while enforcing the learning biases suggested by the stability analysis, they also produce ‘sparse’ dual representations of the hypothesis, resulting in efficient algorithms for both training and test point evaluation. This is a result of the Karush–Kuhn–Tucker conditions, which play a crucial role in the practical implementation and analysis of these algorithms.
It is often the case that we know something about the process generating the data. For example, DNA sequences have been generated through evolution in a series of modifications from ancestor sequences, text can be viewed as being generated by a source of words perhaps reflecting the topic of the document, a time series may have been generated by a dynamical system of a certain type, 2-dimensional images by projections of a 3-dimensional scene, and so on.
For all of these data sources we have some, albeit imperfect, knowledge about the source generating the data, and hence of the type of invariances, features and similarities (in a word, patterns) that we can expect it to contain. Even simple and approximate models of the data can be used to create kernels that take account of the insight thus afforded.
Models of data can be either deterministic or probabilistic and there are also several different ways of turning them into kernels. In fact some of the kernels we have already encountered can be regarded as derived from generative data models.
However, in this chapter we put the emphasis on generative models of the data that are frequently pre-existing. We aim to show how these models can be exploited to provide features for an embedding function for which the corresponding kernel can be efficiently evaluated.
Although the emphasis will be mainly on two classes of kernels induced by probabilistic models, P-kernels and Fisher kernels, other methods exist to incorporate generative information into kernel design.
As we have seen in Chapter 2, the use of kernel functions provides a powerful and principled way of detecting nonlinear relations using well-understood linear algorithms in an appropriate feature space. The approach decouples the design of the algorithm from the specification of the feature space. This inherent modularity not only increases the flexibility of the approach, it also makes both the learning algorithms and the kernel design more amenable to formal analysis. Regardless of which pattern analysis algorithm is being used, the theoretical properties of a given kernel remain the same. It is the purpose of this chapter to introduce the properties that characterise kernel functions.
We present the fundamental properties of kernels, thus formalising the intuitive concepts introduced in Chapter 2. We provide a characterisation of kernel functions, derive their properties, and discuss methods for designing them. We also discuss the role of prior knowledge in kernel-based learning machines, showing that a universal machine is not possible, and that kernels must be chosen for the problem at hand with a view to capturing our prior belief about the relatedness of different examples. Finally, we give a framework for quantifying the match between a kernel and a learning task.
Given a kernel and a training set, we can form the matrix known as the kernel matrix, or Gram matrix: the matrix containing the evaluation of the kernel function on all pairs of data points.
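In symbols, writing $\ell$ for the number of training examples, the Gram matrix of a kernel $\kappa$ on a training set $S = \{\mathbf{x}_1, \dots, \mathbf{x}_\ell\}$ is the $\ell \times \ell$ matrix $\mathbf{K}$ with entries

$$\mathbf{K}_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j), \qquad i, j = 1, \dots, \ell.$$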
The study of patterns in data is as old as science. Consider, for example, the astronomical breakthroughs of Johannes Kepler formulated in his three famous laws of planetary motion. They can be viewed as relations that he detected in a large set of observational data compiled by Tycho Brahe.
Equally the wish to automate the search for patterns is at least as old as computing. The problem has been attacked using methods of statistics, machine learning, data mining and many other branches of science and engineering.
Pattern analysis deals with the problem of (automatically) detecting and characterising relations in data. Most statistical and machine learning methods of pattern analysis assume that the data is in vectorial form and that the relations can be expressed as classification rules, regression functions or cluster structures; these approaches often go under the general heading of ‘statistical pattern recognition’. ‘Syntactical’ or ‘structural pattern recognition’ represents an alternative approach that aims to detect rules among, for example, strings, often in the form of grammars or equivalent abstractions.
The evolution of automated algorithms for pattern analysis has undergone three revolutions. In the 1960s efficient algorithms for detecting linear relations within sets of vectors were introduced, and their computational and statistical behaviour was analysed. The Perceptron algorithm, introduced in 1957, is one example. The question of how to detect nonlinear relations was posed as a major research goal at that time.
As discussed in Chapter 1, perhaps the most important property of a pattern analysis algorithm is that it should identify statistically stable patterns. A stable relation is one that reflects some property of the source generating the data, and is therefore not a chance feature of the particular dataset. Proving that a given pattern is indeed significant is the concern of ‘learning theory’, a body of principles and methods that estimate the reliability of pattern functions under appropriate assumptions about the way in which the data was generated. The most common assumption is that the individual training examples are generated independently according to a fixed distribution, this being the same distribution under which the expected value of the pattern function is small. Statistical analysis of the problem can therefore make use of the law of large numbers through the ‘concentration’ of certain random variables.
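A standard instance of such a concentration result (quoted here for orientation; it is not necessarily the precise bound used later in the text) is Hoeffding's inequality: if $X_1, \dots, X_\ell$ are independent random variables taking values in $[0,1]$ with common mean $\mu$, then for any $\epsilon > 0$

$$P\!\left( \left| \frac{1}{\ell} \sum_{i=1}^{\ell} X_i - \mu \right| \ge \epsilon \right) \le 2 \exp\!\left(-2 \ell \epsilon^2\right),$$

so the empirical average of a fixed pattern function concentrates sharply around its expectation as the sample grows.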
Concentration would be all that we need if we were only to consider one pattern function. Pattern analysis algorithms typically search for pattern functions over whole classes of functions, by choosing the function that best fits the particular training sample. We must therefore be able to prove stability not of a pre-defined pattern, but of one deliberately chosen for its fit to the data.
Clearly, the more pattern functions at our disposal, the more likely it is that this choice could be a spurious pattern. The critical factor that controls how much our choice may have compromised the stability of the resulting pattern is the ‘capacity’ of the function class.
A structure is a user-defined data type that allows a single variable to store more than one type of data. In Visual Basic 6, structures were called user-defined types. Structures have been improved in VB.NET, however, because a VB.NET structure can contain both data and subprocedures that operate on those data. Structures are very similar in form to classes, though they have several limitations that restrict their usefulness for solving object-oriented programming problems. Because they are similar in form to classes, structures provide an excellent introduction to the use of classes, which is why we spend an entire chapter discussing them.
USING STRUCTURES
In Chapter 2 you were introduced to the concept of the abstract data type. To implement an ADT in VB.NET, we need to use a special data type called a structure. A structure allows us to store multiple components of different data types in one logical unit. In some languages structures are known as records.
The atomic data types (Integer, Single, String) allow us to store only a single type of data in a variable, as does an array (well, not strictly, since an array can be of type Object). For example, when we store a number in an Integer variable, we can store only a number in that variable. With structures, we can store more than one piece of data in the structure, and the data can be of different types.
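A minimal sketch of such a structure is shown below; the Student type and its fields are illustrative and are not taken from the text.

' A structure grouping data of different types in one logical unit.
Structure Student
    Public Name As String          ' String component
    Public IDNumber As Integer     ' Integer component
    Public GPA As Single           ' Single component

    ' Unlike a VB6 user-defined type, a VB.NET structure may also contain subprocedures.
    Public Function Summary() As String
        Return Name & " (" & IDNumber.ToString() & ")"
    End Function
End Structure

Module StructureDemo
    Sub Main()
        Dim s As Student               ' one variable holding several data items
        s.Name = "Ada Lovelace"
        s.IDNumber = 1001
        s.GPA = 3.9F
        Console.WriteLine(s.Summary()) ' prints: Ada Lovelace (1001)
    End Sub
End Module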
Object-oriented programming (OOP) is just one in a series of technologies that have been deemed the savior of software. In the world of computing, software often gets a bad reputation because so many programs are delivered to the end user late, with significant bugs, or without completely solving the problem they were written to solve. Object-oriented programming provides a set of tools and techniques to help programmers manage program complexity.
OOP DEFINED
Object-oriented programming is a programming technique that involves structuring a program around special, user-defined data types called classes. Classes are used to break a problem up into a set of objects that interact with each other. A class consists of both the data that define an object and subprograms, called methods, which describe the object's behavior.
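For instance, a class combining data and methods might look like the following sketch; the BankAccount class and its members are hypothetical, chosen only to make the definition concrete.

' A class combining data (the balance field) with methods that describe behavior.
Public Class BankAccount
    Private balance As Decimal                    ' data defining the object's state

    Public Sub Deposit(ByVal amount As Decimal)   ' a method: behavior that acts on the data
        balance += amount
    End Sub

    Public Function GetBalance() As Decimal
        Return balance
    End Function
End Class

Module ClassDemo
    Sub Main()
        Dim acct As New BankAccount()             ' create an object from the class
        acct.Deposit(100D)
        Console.WriteLine(acct.GetBalance())      ' prints: 100
    End Sub
End Module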
A language must implement three concepts to be called a true OOP language: encapsulation, polymorphism, and inheritance. A language that lacks any of the three, as Visual Basic 6 does, is considered object-based rather than truly object-oriented.
Encapsulation
In traditional, third-generation programming languages (such as earlier versions of Basic, C, and Fortran), all programs consist of two primary elements: program statements and data. The program statements perform operations on the data, but the two elements are always considered separate parts of a computer program.
This chapter explains how to perform database programming in an object-oriented way. There are several techniques (patterns) we can use to make a VB.NET/ADO.NET program object-oriented. Many of these techniques were first discussed (though not necessarily first used) in Martin Fowler's book Patterns of Enterprise Application Architecture (Fowler 2003). This chapter will distill some of the patterns he presents into working code that a VB.NET programmer will recognize, especially a programmer who now understands OOP. First, though, we'll provide you with an overview of how to use ADO.NET to access data stored in a database.
AN OVERVIEW OF ADO.NET
ActiveX Data Objects.NET (ADO.NET) is an object-oriented database API that allows a programmer to use one set of classes to access many different types of databases.
ADO.NET Objects
ADO.NET consists of a set of classes that encapsulate the behavior of the different aspects of a database. These classes include objects that represent the database, individual tables, columns within tables, and rows in tables. There are also specialized objects for making database connections and database commands.
The following list highlights these objects (a short usage sketch follows the list):
DataSet represents a subset of a database and is a parent object to many of the other objects used in ADO.NET.
DataTable is used to work with the contents of a single table.
DataColumn is used to represent each column in a table.
DataRow represents a row of data from a table. Row data are retrieved from the Rows collection of a DataTable object.
DataAdapter is used as a bridge between a DataTable object and the physical data source, or database, the program is using.
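Put together, a typical flow uses a connection and a DataAdapter to fill a DataSet and then walks the rows of one of its tables. The sketch below assumes a SQL Server database accessed through System.Data.SqlClient; the connection string, table name, and column name are placeholders, not values from the text.

Imports System.Data
Imports System.Data.SqlClient

Module AdoNetDemo
    Sub Main()
        ' Placeholder connection string; replace the server and database names with your own.
        Dim connString As String = "Server=localhost;Database=SchoolDB;Integrated Security=True"

        Using conn As New SqlConnection(connString)
            ' The DataAdapter bridges the physical database and the in-memory DataSet.
            Dim adapter As New SqlDataAdapter("SELECT * FROM Students", conn)
            Dim ds As New DataSet()
            adapter.Fill(ds, "Students")            ' runs the query and fills a DataTable named "Students"

            Dim table As DataTable = ds.Tables("Students")
            For Each row As DataRow In table.Rows   ' DataRow objects come from the Rows collection
                Console.WriteLine(row("LastName"))  ' access a column value by name
            Next
        End Using
    End Sub
End Module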
Programs written in VB.NET utilize the Throw-Catch model of exception handling when dealing with errors. The classes found in the .NET Framework use this model for handling errors, but the classes you develop must generate their own exceptions. In this chapter, we discuss how exception classes are created and how to use them. We start the chapter with a review of the Throw-Catch model, which includes the Try-Catch-Finally statement.
EXCEPTION HANDLING IN VB.NET
The term VB.NET uses for errors that occur in executing code is exception. Writing code that deals with errors in a program is called exception handling. Exception handling in VB.NET consists of writing code that watches for exceptions when they're thrown and writing code that causes an exception to be thrown when an error condition arises. A VB.NET programmer is not responsible for always writing exception-generating code, since the .NET Framework classes throw their own exceptions.
For example, trying to open a file that doesn't exist throws an exception because there is no file to open. The exception object thrown in this case is a FileNotFoundException. In the next section we examine how to write code to catch built-in exception objects.
Writing Exception-Handling Code
As we've discussed, trying to open a nonexistent file throws an exception. For our program to recognize the exception, we have to use a special construct—the Try-Catch-Finally statement.
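A minimal sketch of this construct is shown below; the file name is a placeholder used only for illustration, and the .NET StreamReader class throws a FileNotFoundException when the named file is missing.

Imports System.IO

Module ExceptionDemo
    Sub Main()
        Dim reader As StreamReader = Nothing
        Try
            ' "data.txt" is a placeholder file name; opening a missing file throws FileNotFoundException.
            reader = New StreamReader("data.txt")
            Console.WriteLine(reader.ReadLine())
        Catch ex As FileNotFoundException
            ' Runs only when the file could not be found.
            Console.WriteLine("Could not find the file: " & ex.FileName)
        Finally
            ' Runs whether or not an exception was thrown.
            If reader IsNot Nothing Then reader.Close()
        End Try
    End Sub
End Module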