The previous chapter gave a formal definition of the learning problem, and showed that it can be solved if the class H of functions is finite. However, many interesting function classes are not finite. For example, the number of functions computed by the perceptron with real-valued weights and inputs is infinite. Many other neural networks can also be represented as a parameterized function class with an infinite parameter set. We shall see that learning is possible for many (but not all) function classes like this, provided the function class is not too complex. In this chapter, we examine two measures of the complexity of a function class, the growth function and the VC-dimension, and we show that these are intimately related. In the next two chapters, we shall see that the growth function and VC-dimension of a function class determine the inherent sample complexity of the learning problem.
The Growth Function
Consider a finite subset S of the input space X. For a function class H, the restriction of H to the set S (that is, the set of restrictions to S of all functions in H) is denoted by H|S. If H|S is the set of all functions from S to {0, 1}, then clearly, H is as powerful as it can be in classifying the points in S. We can view the cardinality of H|S (and in particular how it compares with 2^|S|) as a measure of the classification complexity of H with respect to the set S.
Chapters 4 and 5 show that the Vapnik-Chervonenkis dimension is crucial in characterizing learnability by binary-output networks, and that it can be used to bound the growth function. Chapter 10 shows that covering numbers are a generalization of the growth function useful for analysing classification by real-output neural networks (or, more generally, by real-valued function classes). We see later in the book that covering numbers are also important in analysing other models of learning. It is natural to ask whether there is a ‘combinatorial’ measure analogous to the VC-dimension that can be used to bound the covering numbers of a class of real-valued functions, and hence to quantify the sample complexity of classification learning. This is largely true, although the definitions and proofs are more complicated than for the binary classification case. In this chapter we introduce the key ‘dimensions’ that we use in our analysis of learning with real function classes and establish some associated basic results and useful properties. In the next chapter we show how these dimensions may be used to bound the covering numbers.
The Pseudo-Dimension
The definition of the pseudo-dimension
To introduce the first of the new dimensions, we first present a slightly different formulation of the definition of the VC-dimension. For a set of functions H mapping from X to {0, 1}, recall that a subset S = {x1, x2, …, xm} of X is shattered by H if H|S has cardinality 2^m.
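To make these definitions concrete, here is a minimal Python sketch (our own illustration; the class of one-dimensional threshold functions is an assumed example, not one from the text) that computes the restriction H|S of a class to a finite set, counts the dichotomies it realizes, and tests whether the set is shattered.

    def restriction(H, S):
        # The restriction H|S: the set of distinct labelings ("dichotomies")
        # that functions in H produce on the finite set S.
        return {tuple(h(x) for x in S) for h in H}

    def is_shattered(H, S):
        # S is shattered by H iff H|S contains all 2^|S| functions from S to {0, 1}.
        return len(restriction(H, S)) == 2 ** len(S)

    # Illustrative class: threshold functions x -> 1 if x >= t, else 0.
    H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.5, 1.5, 2.5, 3.5)]

    S1 = [1.0]
    S2 = [1.0, 2.0]
    print(len(restriction(H, S1)), is_shattered(H, S1))  # 2 True
    print(len(restriction(H, S2)), is_shattered(H, S2))  # 3 False

Here a single point is shattered, but the pair {1.0, 2.0} is not, since no threshold function labels 1.0 with 1 and 2.0 with 0.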
In this chapter, and many subsequent ones, we deal with feed-forward neural networks. Initially, we shall be particularly concerned with feed-forward linear threshold networks, which can be thought of as combinations of perceptrons.
To define a neural network class, we need to specify the architecture of the network and the parameterized functions computed by its components. In general, a feed-forward neural network has as its main components a set of computation units, a set of input units, and a set of connections from input or computation units to computation units. These connections are directed; that is, each connection is from a particular unit to a particular computation unit. The key structural property of a feed-forward network—the feed-forward condition—is that these connections do not form any loops. This means that the units can be labelled with integers in such a way that if there is a connection from the unit labelled i to the computation unit labelled j then i < j.
Associated with each unit is a real number called its output. The output of a computation unit is a particular function of the outputs of units that are connected to it. The feed-forward condition guarantees that the outputs of all units in the network can be written as an explicit function of the network inputs.
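As an illustration of how the feed-forward condition permits this explicit computation, the following Python sketch (our own construction, not from the text) evaluates a small linear threshold network by visiting computation units in increasing label order; the particular wiring, which computes XOR, is an assumption made purely for illustration.

    def network_outputs(inputs, units):
        # inputs: dict mapping input-unit labels to their real-valued outputs.
        # units:  dict mapping each computation-unit label j to a pair
        #         (incoming, threshold), where incoming is a list of
        #         (source_label, weight) pairs with source_label < j
        #         (the feed-forward condition).
        out = dict(inputs)
        for j in sorted(units):          # increasing labels: sources already computed
            incoming, theta = units[j]
            activation = sum(w * out[i] for i, w in incoming)
            out[j] = 1.0 if activation >= theta else 0.0
        return out

    # Two-input network computing XOR with two first-layer units and one output unit.
    inputs = {0: 1.0, 1: 0.0}
    units = {
        2: ([(0, 1.0), (1, 1.0)], 0.5),    # OR of the inputs
        3: ([(0, 1.0), (1, 1.0)], 1.5),    # AND of the inputs
        4: ([(2, 1.0), (3, -1.0)], 0.5),   # OR and not AND, i.e. XOR
    }
    print(network_outputs(inputs, units)[4])   # 1.0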
This book is about the use of artificial neural networks for supervised learning problems. Many such problems occur in practical applications of artificial neural networks. For example, a neural network might be used as a component of a face recognition system for a security application. After seeing a number of images of legitimate users' faces, the network needs to determine accurately whether a new image corresponds to the face of a legitimate user or an imposter. In other applications, such as the prediction of the future price of shares on the stock exchange, we may require a neural network to model the relationship between a pattern and a real-valued quantity.
In general, in a supervised learning problem, the learning system must predict the labels of patterns, where the label might be a class label or a real number. During training, it receives some partial information about the true relationship between patterns and their labels in the form of a number of correctly labelled patterns. For example, in the face recognition application, the learning system receives a number of images, each labelled as either a legitimate user or an imposter. Learning to accurately label patterns from training data in this way has two major advantages over designing a hard-wired system to solve the same problem: it can save an enormous amount of design effort, and it can be used for problems that cannot easily be specified precisely in advance, perhaps because the environment is changing.
In this chapter, we consider learning algorithms for classes F of real-valued functions that can be expressed as convex combinations of functions from some class G of basis functions. The key example of such a class is that of feed-forward networks with a linear output unit in which the sum of the magnitudes of the output weights is bounded by some constant B. In this case, the basis function class G is the set of functions that can be computed by any non-output unit in the network, and their negations, scaled by B. We investigate two algorithms. Section 26.2 describes Construct, an algorithm for the real prediction problem, and Section 26.3 describes Adaboost, an algorithm for the restricted version of the real classification problem. Both algorithms use a learning algorithm for the basis function class to iteratively add basis functions to a convex combination, leaving previous basis functions fixed.
Real Estimation with Convex Combinations of Basis Functions
Theorem 14.10 (Section 14.4) shows that any convex combination of bounded basis functions can be accurately approximated (with respect to the distance dL2(P), for instance) using a small convex combination. This shows that the approximate-SEM problem for the class co(G) can be solved by considering only small convex combinations of functions from G. In fact, the problem can be simplified even further. The following theorem shows that we can construct a small convex combination in an iterative way, by greedily minimizing error as each basis function is added.
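A minimal sketch of such a greedy construction, assuming a finite basis class G, squared-error loss, and a grid of step sizes (none of which are specified here; this is not the Construct algorithm itself, merely the general greedy scheme it exemplifies), might look as follows.

    import numpy as np

    def greedy_convex_fit(G, X, y, stages=10, alphas=np.linspace(0.05, 1.0, 20)):
        # Greedily build a convex combination of basis functions to fit (X, y).
        # At each stage, choose the pair (g, alpha) minimizing the empirical
        # squared error of (1 - alpha) * f + alpha * g, with the current f fixed.
        f = np.zeros(len(y))                  # start from the zero function
        for _ in range(stages):
            err, a, g = min(
                ((np.mean(((1 - a) * f + a * g(X) - y) ** 2), a, g)
                 for g in G for a in alphas),
                key=lambda t: t[0],
            )
            f = (1 - a) * f + a * g(X)
        return f, err

    # Illustrative basis class: signs of affine functions of a scalar input.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=200)
    y = np.sin(2 * X)                         # illustrative target to approximate
    G = [lambda x, w=w, b=b: np.sign(w * x + b)
         for w in (-2.0, -1.0, 1.0, 2.0) for b in np.linspace(-1, 1, 9)]
    f, err = greedy_convex_fit(G, X, y)
    print(round(err, 4))

Starting from the zero function keeps the sketch simple; since the step-size grid includes α = 1, the first stage can select a pure basis function.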
In the previous chapters we showed that a class of functions of finite VC-dimension is learnable by the fairly natural class of SEM algorithms, and we provided bounds on the estimation error and sample complexity of these learning algorithms in terms of the VC-dimension of the class. In this chapter we provide lower bounds on the estimation error and sample complexity of any learning algorithm. These lower bounds are also in terms of the VC-dimension, and are not vastly different from the upper bounds of the previous chapter. We shall see, as a consequence, that the VC-dimension not only characterizes learnability, in the sense that a function class is learnable if and only if it has finite VC-dimension, but also provides precise information about the number of examples required.
A Lower Bound for Learning
A technical lemma
The first step towards a general lower bound on the sample complexity is the following technical lemma, which will also prove useful in later chapters. It concerns the problem of estimating the parameter describing a Bernoulli random variable.
Lemma 5.1 Suppose that α is a random variable uniformly distributed on {α−, α+}, where α− = 1/2 − ε/2 and α+ = 1/2 + ε/2, with 0 < ε < 1. Suppose that ξ1, …, ξm are i.i.d. (independent and identically distributed) {0, 1}-valued random variables with Pr(ξi = 1) = α for all i. Let f be a function from {0, 1}^m to {α−, α+}.
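To get a feel for the estimation problem the lemma concerns, the following simulation sketch may help (the majority-style rule below is one natural choice of f that we assume purely for illustration; it is not part of the lemma).

    import numpy as np

    rng = np.random.default_rng(1)
    eps, m, trials = 0.1, 50, 100_000
    alpha_minus, alpha_plus = 0.5 - eps / 2, 0.5 + eps / 2

    # Draw alpha uniformly from {alpha-, alpha+}, then m Bernoulli(alpha) samples.
    alpha = rng.choice([alpha_minus, alpha_plus], size=trials)
    samples = rng.random((trials, m)) < alpha[:, None]

    # A natural rule f: guess alpha+ iff the sample mean exceeds 1/2.
    guess = np.where(samples.mean(axis=1) > 0.5, alpha_plus, alpha_minus)
    print((guess != alpha).mean())

With ε = 0.1 and m = 50 the empirical error probability comes out near 0.24: with few samples and small ε, even this natural rule is frequently wrong.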
In this part of the book, we turn our attention to aspects of the time complexity, or computational complexity, of learning. Until now we have discussed only the sample complexity of learning, and we have been using the phrase ‘learning algorithm’ without any reference to algorithmics. But issues of running time are crucial. If a learning algorithm is to be of practical value, it must, first, be possible to implement it on a computer; that is, it must be computable and therefore, in a real sense, an algorithm, not merely a function. Furthermore, it should be possible to produce a good output hypothesis ‘quickly’.
One subtlety that we have not so far explicitly dealt with is that a practical learning algorithm does not really output a hypothesis; rather, it outputs a representation of a hypothesis. In the context of neural networks, such a representation consists of a state of the network; that is, an assignment of weights and thresholds. In studying the computational complexity of a learning algorithm, one therefore might take into account the ‘complexity’ of the representation output by the learning algorithm. However, this will not be necessary in the approach taken here. For convenience, we shall continue to use notation suggesting that the output of a learning algorithm is a function from a class of hypotheses, but the reader should be aware that, formally, the output is a representation of such a function.
This part of the book examines supervised learning problems in which we require a learning system to model the relationship between a pattern and a real-valued quantity. For example, in using a neural network to predict the future price of shares on the stock exchange, or to estimate the probability that a particular patient will experience problems during a surgical procedure, the predictions are represented by the real-valued output of the network.
In the pattern classification problems studied in Parts 1 and 2, the (x, y) pairs are generated by a probability distribution on the product space X × {0, 1}. In a similar way, we assume in this part of the book that the data is generated by a probability distribution P on X × ℝ. This is a generalization of the pattern classification model, and includes a number of other data-generating processes as special cases. For example, it can model a deterministic relationship between patterns and their labels, where each (x, y) pair satisfies y = f(x) for some function f. It can model a deterministic relationship with additive independent observation noise, where yi = f(xi) + ηi, and the ηi are independent and identically distributed random variables. It can also model a noisy relationship in which the observation noise variables ηi are mutually independent, but the distribution of ηi depends on the pattern xi.
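These three special cases are easy to write down explicitly. In the Python sketch below, the target function, the noise distributions, and the pattern distribution are all illustrative assumptions of ours.

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.tanh                                # illustrative target function
    x = rng.uniform(-2, 2, size=5)             # patterns

    y_deterministic = f(x)                                   # y = f(x) exactly
    y_additive = f(x) + rng.normal(0, 0.1, size=x.shape)     # i.i.d. observation noise
    y_pattern_dep = f(x) + rng.normal(0, 0.1 * np.abs(x))    # noise law depends on x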
In this chapter we bound the pseudo-dimension and the fat-shattering dimension of the function classes computed by certain neural networks. The pseudo-dimension bounds follow easily from VC-dimension bounds obtained earlier, so these shall not detain us for long. Of more importance are the bounds we obtain on the fat-shattering dimension. We derive these bounds by bounding certain covering numbers. Later in the book, we shall use these covering number bounds directly.
We bound the covering numbers and fat-shattering dimensions for networks that are fully connected between adjacent layers, that have units with a bounded activation function satisfying a Lipschitz constraint, and that have all weights (or all weights in certain layers) constrained to be small. We give two main results on the covering numbers and fat-shattering dimensions of networks of this type. In Section 14.3 we give bounds in terms of the number of parameters in the network. In contrast, Section 14.4 gives bounds on the fat-shattering dimension that instead grow with the bound on the size of the parameters and, somewhat surprisingly, are independent of the number of parameters in the network. This result is consistent with the intuition we obtain by studying networks of linear units (units with the identity function as their activation function). For a network of this kind, no matter how large, the function computed by the network is a linear combination of the input variables, and so its pseudo-dimension does not increase with the number of parameters.
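The claim about linear networks is easy to verify numerically: however many linear layers are composed, the result is a single linear map. The layer sizes in this NumPy check are arbitrary choices of ours.

    import numpy as np

    rng = np.random.default_rng(0)
    # Three linear layers, 10 -> 7 -> 5 -> 1, with no activation functions.
    W1 = rng.normal(size=(7, 10))
    W2 = rng.normal(size=(5, 7))
    W3 = rng.normal(size=(1, 5))

    x = rng.normal(size=10)
    deep = W3 @ (W2 @ (W1 @ x))          # output of the three-layer network
    flat = (W3 @ W2 @ W1) @ x            # the same single linear map
    print(np.allclose(deep, flat))       # True: depth adds no expressive power here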
In this section we describe the basic model of learning we use in this part of the book. This model is applicable to neural networks with one output unit that computes either the value 0 or 1; that is, it concerns the types of neural network used for binary classification problems. Later in the book we develop more general models of learning applicable to many other types of neural network, such as those with a real-valued output.
The definition of learning we use is formally described using the language of probability theory. For the moment, however, we move towards the definition in a fairly non-technical manner, providing some informal motivation for the technical definitions that will follow.
In very general terms, in a supervised learning environment, neural network ‘learning’ is the adjustment of the network's state in response to data generated by the environment. We assume this data is generated by some random mechanism, which is, for many applications, reasonable. The method by which the state of the network is adjusted in response to the data constitutes a learning algorithm. That is, a learning algorithm describes how to change the state in response to training data. We assume that the ‘learner’ knows little about the process generating the data. This is a reasonable assumption for many applications of neural networks: if it is known that the data is generated according to a particular type of statistical process, then in practice it might be better to take advantage of this information by using a more restricted class of functions rather than a neural network.
Results in the previous chapter show that the VC-dimension of the class of functions computed by a network of linear threshold units with W parameters is no larger than a constant times W log W. These results cannot immediately be extended to networks of sigmoid units (with continuous activation functions), since the proofs involve counting the number of distinct outputs of all linear threshold units in the network as the input varies over m patterns, and a single sigmoid unit has an infinite number of output values. In this chapter and the next we derive bounds on the VC-dimension of certain sigmoid networks, including networks of units having the standard sigmoid activation function σ(α) = 1/(1 + e^−α). Before we begin this derivation, we study an example that shows that the form of the activation function is crucial.
The Need for Conditions on the Activation Functions
One might suspect that if we construct networks of sigmoid units with a well-behaved activation function, they will have finite VC-dimension. For instance, perhaps it suffices if the activation function is sufficiently smooth, bounded, and monotonically increasing. Unfortunately, the situation is not so simple. The following result shows that there is an activation function that has all of these properties, and even has its derivative monotonically increasing to the left of zero and decreasing to the right (so it is convex and concave in those regions), and yet is such that a two-layer network having only two computation units in the first layer, each with this activation function, has infinite VC-dimension.
By Tom Heskes and Wim Wiegerinck, RWCP Theoretical Foundation, SNN, Department of Medical Physics and Biophysics, University of Nijmegen, Geert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands.
We study the dynamics of on-line learning with time-correlated patterns. In doing so, we make a distinction between “small” networks and “large” networks. “Small” networks have a finite number of input units and are usually studied using tools from stochastic approximation theory in the limit of small learning parameters. “Large” networks have an extensive number of input units; a description in terms of individual weights is no longer useful, and tools from statistical mechanics can be applied instead to compute the evolution of macroscopic order parameters. We give general derivations for both cases, but in the end focus on the effect of correlations on plateaus: long time spans in which the performance of the network hardly changes. Learning in both “small” and “large” multi-layered perceptrons is often hampered by the presence of plateaus. The effect of correlations, however, appears to be quite different in the two regimes: correlations can have a huge beneficial effect in small networks, but seem to have only marginal effects in large networks.
Introduction
On-line learning with correlations
The ability to learn from examples is an essential feature in many neural network applications (Hertz et al., 1991; Haykin, 1994). Learning from examples enables the network to adapt its parameters or weights to its environment without the need for explicit knowledge of that environment. In on-line learning, examples from the environment are continually presented to the network at distinct time steps. At each time step, a small adjustment of the network's weights is made on the basis of the currently presented pattern, as in the sketch below. This procedure is iterated for as long as the network is learning.
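A minimal on-line update of this kind, for a single linear unit trained with the LMS (delta) rule, might be sketched as follows; the learning parameter, the pattern stream, and the target weights are all illustrative assumptions of ours.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.zeros(3)                        # network weights
    eta = 0.05                             # small learning parameter

    target = np.array([0.5, -1.0, 2.0])    # unknown "environment" weights
    for t in range(2000):                  # one pattern per time step
        x = rng.normal(size=3)             # pattern presented at time t
        y = target @ x                     # label supplied by the environment
        w += eta * (y - w @ x) * x         # small adjustment from current pattern only
    print(np.round(w, 2))                  # approaches the target weights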