As a rule teachers are unreliable. From time to time they mix up questions or answer absentmindedly. How much can a student network learn about a target rule if some of the examples in the training set are corrupted by random noise? What is the optimal strategy for the student in this more complicated situation?
To analyse these questions in detail for the two-perceptron scenario is the aim of the present chapter. Let us emphasize that quite generally a certain robustness with respect to random influences is an indispensable requirement for any information processing system, both in biological and in technical contexts. If learning from examples were possible only for perfectly error-free training sets it would be of no practical interest. In fact, since the noise blurring the correct classifications of the teacher may usually be assumed to be independent of the examples, one expects that it will remain possible to infer the rule, probably at the expense of a larger training set.
A general feature of noisy generalization tasks is that the training set is no longer generated by a rule that can be implemented by the student. The problem is said to be unrealizable. A simple example is a training set containing the same input with different outputs, which is quite possible for noisy teachers. This means that for large enough training sets no student exists who is able to reproduce all classifications and the version space becomes empty.
So far we have been considering learning scenarios in which generalization shows up as a gradual process of improvement with the generalization error ε decreasing continuously from its initial pure guessing value ε = 0.5 to the asymptotic limit ε = 0. In the present chapter we study systems which display a quite different behaviour with sudden changes of the generalization ability taking place during the learning process. The reason for this new feature is the presence of discrete degrees of freedom among the parameters, which are adapted during the learning process. As we will see, discontinuous learning is a rather subtle consequence of this discreteness and methods of statistical mechanics are well suited to describe the situation. In particular the abrupt changes which occur in the generalization process can be described as first order phase transitions well studied in statistical physics.
Smooth networks
The learning scenarios discussed so far have been described in the framework of statistical mechanics as a continuous shift of the balance between energetic and entropic terms. In the case of perfect learning the energy describes how difficult it is for the student vector to stay in the version space (see (2.13)). For independent examples it is naturally given as a sum over the training set and scales for large α as e ∼ αε since the generalization error ε gives the probability of error and hence of an additional cost when a new example is presented.
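As a schematic illustration of this scaling (the notation here is assumed and need not coincide with (2.13)): if the energy counts the training examples on which student and teacher disagree, and a randomly drawn example is misclassified with probability ε, then

```latex
E(\mathbf{J}) \;=\; \sum_{\mu=1}^{p}\,\theta\!\left(-\,\sigma_T^{\mu}\,\operatorname{sign}\!\left(\mathbf{J}\cdot\boldsymbol{\xi}^{\mu}\right)\right),
\qquad
\langle E\rangle \;\approx\; p\,\varepsilon \;=\; \alpha N\varepsilon,
\qquad
e \;=\; \frac{E}{N}\;\sim\;\alpha\varepsilon ,
```

with θ(·) the Heaviside step function and p = αN the size of the training set.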
In the present chapter we introduce the basic notions necessary to study learning problems within the framework of statistical mechanics. We also demonstrate the efficiency of learning from examples by the numerical analysis of a very simple situation. Generalizing from this example we will formulate the basic setup of a learning problem in statistical mechanics to be discussed in numerous modifications in later chapters.
Artificial neural networks
The statistical mechanics of learning has been developed primarily for networks of so-called formal neurons. The aim of these networks is to model some of the essential information processing abilities of biological neural networks on the basis of artificial systems with a similar architecture. Formal neurons, the microscopic building blocks of these artificial neural networks, were introduced more than 50 years ago by McCulloch and Pitts as extremely simplified models of the biological neuron [1]. They are bistable linear threshold elements which are either active or passive, to be denoted in the following by a binary variable S = ±1. The state S_i of a given neuron i changes with time because of the signals it receives through its synaptic couplings J_ij from either the “outside world” or other neurons j.
More precisely, neuron i sums up the incoming activity of all the other neurons weighted by the corresponding synaptic coupling strengths to yield the post-synaptic potential ∑_j J_ij S_j and compares the result with a threshold θ_i specific to neuron i.
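A minimal sketch of this update rule in Python (the variable names, the tie-breaking convention and the exclusion of self-couplings are illustrative assumptions, not taken from the text):

```python
import numpy as np

def update_neuron(i, S, J, theta):
    """One formal-neuron (McCulloch-Pitts) update of neuron i.

    S     : current states of all neurons, entries +1 (active) or -1 (passive)
    J     : synaptic couplings, J[i, j] is the coupling strength from neuron j to neuron i
    theta : thresholds, one per neuron
    """
    # post-synaptic potential: activity of the other neurons weighted by the couplings
    h = np.dot(J[i], S) - J[i, i] * S[i]   # exclude the self-coupling term
    # the neuron becomes (or stays) active if the potential reaches its threshold
    return 1 if h >= theta[i] else -1

# tiny synthetic illustration with random couplings
rng = np.random.default_rng(0)
N = 5
S = rng.choice([-1, 1], size=N)
J = rng.normal(size=(N, N))
theta = np.zeros(N)
S_new = np.array([update_neuron(i, S, J, theta) for i in range(N)])
```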
There is an important extreme case of learning from a noisy source as discussed in the previous chapter which deserves special consideration. It concerns the situation of an extremely noisy teacher in which the added noise is so strong that it completely dominates the teacher's output. The task for the student is then to reproduce a mapping with no correlations between input and output so that the notion of a teacher actually becomes obsolete. The central question is how many input–output pairs can typically be implemented by an appropriate choice of the couplings J. This is the so-called storage problem. Its investigation yields a measure for the flexibility of the network under consideration with respect to the implementation of different mappings between input and output.
The reason why we include a discussion of this case in the present book, which is mainly devoted to the generalization behaviour of networks, is threefold. Firstly, there is a historical point: in the physics community the storage properties of neural networks were discussed before emphasis was laid on their ability to learn from examples, and several important concepts have been introduced in connection with these earlier investigations. Secondly, in several situations the storage problem is somewhat simpler to analyse and therefore forms a suitable starting point for the more complicated investigation of the generalization performance. Thirdly, we will see in chapter 10 that the flexibility of a network architecture with respect to the implementation of different input–output relations also gives useful information on its generalization behaviour.
Understanding intelligent behaviour has always been fascinating to both laymen and scientists. The question has become very topical through the concurrence of a number of different issues. First, there is a growing awareness of the computational limits of serial computers, while parallel computation is gaining ground, both technically and conceptually. Second, several new non-invasive scanning techniques allow the human brain to be studied from its collective behaviour down to the activity of single neurons. Third, the increased automatization of our society leads to an increased need for algorithms that control complex machines performing complex tasks. Finally, conceptual advances in physics, such as scaling, fractals, bifurcation theory and chaos, have widened its horizon and stimulate the modelling and study of complex non-linear systems. At the crossroads of these developments, artificial neural networks have something to offer to each of them.
The observation that these networks can learn from examples and are able to discern an underlying rule has spurred a decade of intense theoretical activity in the statistical mechanics community on the subject. Indeed, the ability to infer a rule from a set of examples is widely regarded as a sign of intelligence. Without embarking on a thorny discussion about the nature or definition of intelligence, we just note that quite a few of the problems posed in standard IQ tests are exactly of this nature: given a sequence of objects (letters, pictures, …) one is asked to continue the sequence “meaningfully”, which requires one to decipher the underlying rule.
In the preceding chapters we have described various properties of learning in the perceptron, exploiting the fact that its simple architecture allows a rather detailed mathematical analysis. However, the perceptron suffers from a major deficiency that led to its demise in the late 1960s: being able to implement only linearly separable Boolean functions, its computational capacities are rather limited. An obvious generalization is feed-forward multilayer networks with one or more intermediate layers of formal neurons between input and output (cf. fig. 1.1c). On the one hand these may be viewed as being composed of individual perceptrons, so that their theoretical analysis may build on what has been accomplished for the perceptron. On the other hand the addition of internal degrees of freedom makes them computationally much more powerful. In fact multilayer neural networks are able to realize all possible Boolean functions between input and output, which makes them an attractive choice for practical applications. There is also a neurophysiological motivation for the study of multilayer networks, since most neurons in biological neural nets are interneurons, connected directly neither to sensory inputs nor to motor outputs.
The higher complexity of multilayer networks as compared to the simple perceptron makes the statistical mechanics analysis of their learning abilities more complicated and in general precludes the general and detailed characterization which was possible for the perceptron. Nevertheless, for tailored architectures and suitable learning scenarios very instructive results may be obtained, some of which will be discussed in the present chapter.
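To make the gain in computational power concrete, here is a toy sketch of a two-layer network of binary threshold units implementing the XOR (parity) function of two ±1 inputs, a Boolean function that no single perceptron can realize. The particular weights and thresholds are one possible choice, not taken from the text:

```python
import numpy as np

def threshold_unit(x, w, theta):
    """Binary threshold unit: +1 if the weighted input exceeds the threshold, else -1."""
    return 1 if np.dot(w, x) > theta else -1

def xor_network(x1, x2):
    """Two-layer network computing XOR of two +/-1 inputs.

    Hidden unit 1 fires only for the input (+1, -1), hidden unit 2 only for (-1, +1);
    the output unit acts as an OR of the two hidden units.
    """
    x = np.array([x1, x2])
    h1 = threshold_unit(x, np.array([1.0, -1.0]), 1.0)
    h2 = threshold_unit(x, np.array([-1.0, 1.0]), 1.0)
    return threshold_unit(np.array([h1, h2]), np.array([1.0, 1.0]), -1.0)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_network(x1, x2))   # output is +1 exactly when the inputs differ
```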
The generalization performance of some of the learning rules introduced in the previous chapter could be characterized either by using simple arguments from statistics as in the case of the Hebb rule, or by exploiting our results on Gibbs learning obtained in chapter 2 as in the case of the Bayes rule. Neither of these attempts is successful, however, in determining the generalization error of the remaining learning rules.
In this chapter we will introduce several modifications of the central statistical mechanics method introduced in chapter 2 which will allow us to analyse the generalization behaviour of these remaining rules. The main observation is that all these learning rules can be interpreted as prescriptions to minimize appropriately chosen cost functions. Generalizing the concept of Gibbs learning to non-zero training error will pave the way to studying such minimization problems in a unified fashion.
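Schematically, and with notation assumed here rather than quoted from chapter 2, this amounts to replacing the hard version-space constraint by a finite-temperature Gibbs measure over the couplings:

```latex
P(\mathbf{J}) \;\propto\; \exp\!\big(-\beta\,E(\mathbf{J})\big),
```

where E(J) is the cost function defined by the learning rule under consideration and the zero-training-error (version-space) scenario is recovered in the limit β → ∞.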
Before embarking on these general considerations, however, we will discuss in the first section of this chapter how learning rules aiming at maximal stabilities are most conveniently analysed.
The main results of this chapter concerning the generalization error of the various rules are summarized in fig. 4.3 and table 4.1.
Maximal stabilities
A minor extension of the statistical mechanics formalism introduced in chapter 2 is sufficient to analyse the generalization performance of the adatron and the pseudo-inverse rule. The common feature of these two rules is that they search for couplings with maximal stabilities, formalized by the maximization of the stability parameter κ.
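For concreteness, and with notation assumed here along the lines of the perceptron conventions used earlier in the book, the stability of an example and the parameter these rules aim to maximize may be written as

```latex
\Delta^{\mu} \;=\; \frac{\sigma_T^{\mu}\,\mathbf{J}\cdot\boldsymbol{\xi}^{\mu}}{|\mathbf{J}|},
\qquad
\kappa \;=\; \min_{\mu}\,\Delta^{\mu},
```

so that couplings of maximal stability are those for which the smallest stability κ over the training set is as large as possible.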
So far we have focused on the performance of various learning rules as a function of the size of the training set with examples which are all selected before training starts and remain available during the whole training period. However, in both real life and many practical situations, the training examples come and go with time. Learning then has to proceed on-line, using only the training example which is available at any particular time. This is to be contrasted with the previous scenario, called off-line or batch learning, in which all the training examples are available at all times.
For the Hebb rule, the off-line and on-line scenario coincide: each example provides an additive contribution to the synaptic vector, which is independent of the other examples. We mentioned already in chapter 3 that this rule performs rather badly for large training sets, precisely because it treats all the learning examples in exactly the same way. The purpose of this chapter is to introduce more advanced or alternative on-line learning rules, and to compare their performance with that of their off-line versions.
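In formulas, the additivity of the Hebb rule can be made explicit as follows (a sketch with assumed notation; the 1/√N normalization is one common convention):

```latex
\mathbf{J}^{\mathrm{Hebb}} \;=\; \frac{1}{\sqrt{N}}\sum_{\mu=1}^{p}\sigma_T^{\mu}\,\boldsymbol{\xi}^{\mu},
```

so the coupling vector is the same whether the examples are presented one at a time or all at once.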
Stochastic gradient descent
In an on-line scenario, the training examples are presented once and in a sequential order and the coupling vector J is updated at each time step using information from this single example only.
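A minimal sketch of such an on-line scenario in Python, using a perceptron-type update as the example rule (the learning rate, the particular rule and the synthetic teacher-student data are illustrative assumptions):

```python
import numpy as np

def online_learning(examples, labels, eta=0.1):
    """Sequential (on-line) learning: each example is seen once and then discarded.

    examples : array of shape (p, N), one input vector per row
    labels   : array of shape (p,), teacher classifications +/-1
    eta      : learning rate
    """
    p, N = examples.shape
    J = np.zeros(N)                          # coupling vector of the student
    for xi, sigma in zip(examples, labels):
        # perceptron-type rule: update only if the current example is misclassified
        if sigma * np.dot(J, xi) <= 0:
            J += eta * sigma * xi / np.sqrt(N)
    return J

# synthetic teacher-student illustration
rng = np.random.default_rng(1)
N, p = 100, 500
teacher = rng.normal(size=N)
X = rng.normal(size=(p, N))
y = np.sign(X @ teacher)
J = online_learning(X, y)
overlap = J @ teacher / (np.linalg.norm(J) * np.linalg.norm(teacher))
```

The final overlap with the teacher vector gives a rough indication of the generalization ability reached after a single pass through the data.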
In this chapter we investigate ICA models in which the number of sources, M, may be less than the number of sensors, N: so-called non-square mixing.
The ‘extra’ sensor observations are explained as observation noise. This general approach may be called Probabilistic Independent Component Analysis (PICA) by analogy with the Probabilistic Principal Component Analysis (PPCA) model of Tipping & Bishop [1997]; ICA and PCA do not include observation noise, whereas PICA and PPCA do.
Non-square ICA models give rise to a likelihood model for the data involving an integral which is intractable. In this chapter we build on previous work in which the integral is estimated using a Laplace approximation. By making the further assumption that the unmixing matrix lies on the decorrelating manifold we are able to make a number of simplifications. Firstly, the observation noise can be estimated using PCA methods, and, secondly, optimisation takes place in a space of much reduced dimensionality, with of the order of M² parameters rather than M × N. Again building on previous work, we derive a model order selection criterion for choosing the appropriate number of sources, based on the Laplace approximation as applied to the decorrelating manifold. This criterion is then compared with PCA model order selection methods on music and EEG datasets.
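The generative model referred to here can be written schematically as follows (a sketch with assumed notation, not the chapter's exact equations): the N-dimensional observations are generated from M < N sources plus observation noise, and the marginal likelihood requires integrating the sources out, which is intractable for non-Gaussian source densities:

```latex
\mathbf{x} \;=\; \mathbf{A}\,\mathbf{s} + \mathbf{n},
\qquad \mathbf{A}\in\mathbb{R}^{N\times M},\; M < N,
\qquad
p(\mathbf{x}\,|\,\mathbf{A}) \;=\; \int p(\mathbf{x}\,|\,\mathbf{A},\mathbf{s})\,p(\mathbf{s})\,\mathrm{d}\mathbf{s}.
```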
Non-Gaussianity is of paramount importance in ICA estimation. Without non-Gaussianity the estimation is not possible at all (unless the independent components have time-dependences). Therefore, it is not surprising that non-Gaussianity could be used as a leading principle in ICA estimation.
In this chapter, we derive a simple principle of ICA estimation: the independent components can be found as the projections that maximize non-Gaussianity. In addition to its intuitive appeal, this approach allows us to derive a highly efficient ICA algorithm, FastICA. This is a fixed-point algorithm that can be used for estimating the independent components one by one. At the end of the chapter, it will be seen that it is closely connected to maximum likelihood or infomax estimation as well.
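A minimal sketch of the one-unit fixed-point iteration in Python (whitened data is assumed; the tanh nonlinearity, the convergence test and the function names are common illustrative choices rather than a prescription from the text):

```python
import numpy as np

def fastica_one_unit(Z, max_iter=200, tol=1e-6, seed=0):
    """Estimate one independent component from whitened data Z (shape N x T).

    The fixed-point update drives w towards a direction of maximal non-Gaussianity:
        w <- E{ z g(w.z) } - E{ g'(w.z) } w,  followed by normalization,
    with g = tanh used as the nonlinearity here.
    """
    N, T = Z.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(size=N)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = w @ Z                               # current projection, shape (T,)
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:     # converged up to sign
            return w_new
        w = w_new
    return w
```

Further components can be estimated one by one by repeating the iteration and orthogonalizing each new w against those already found.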
Whitening
First, let us consider preprocessing techniques that are essential if we want to develop fast ICA methods.
The rather trivial preprocessing that is used in many cases is to centre x, i.e. subtract its mean vector m = E{x} so as to make x a zero-mean variable. This implies that s is zero-mean as well. This preprocessing is made solely to simplify the ICA algorithms: it does not mean that the mean could not be estimated. After estimating the mixing matrix A with centred data, we can complete the estimation by adding the mean vector of s back to the centred estimates of s.
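A short Python sketch of this centring step, together with the whitening transform that this section builds towards (the function names and the eigendecomposition-based whitening are illustrative choices):

```python
import numpy as np

def centre(X):
    """Remove the mean vector m = E{x} from the data (X has shape N x T)."""
    m = X.mean(axis=1, keepdims=True)
    return X - m, m

def whiten(X):
    """Linearly transform centred data so that its covariance becomes the identity.

    Uses the eigendecomposition C = E D E^T of the covariance matrix and applies
    the whitening matrix E D^{-1/2} E^T to the centred data.
    """
    Xc, m = centre(X)
    C = np.cov(Xc)                                   # N x N covariance estimate
    d, E = np.linalg.eigh(C)
    W_white = E @ np.diag(1.0 / np.sqrt(d)) @ E.T    # whitening matrix
    return W_white @ Xc, W_white, m
```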
An unsupervised classification algorithm is derived by modelling observed data as a mixture of several mutually exclusive classes that are each described by linear combinations of independent, non-Gaussian densities. The algorithm estimates the density of each class and is able to model class distributions with non-Gaussian structure. It can improve classification accuracy compared with standard Gaussian mixture models. When applied to images, the algorithm can learn efficient codes (basis functions) for images that capture the statistical structure of the images. We applied this method to the problem of unsupervised classification, segmentation and de-noising of images. This method was effective in classifying complex image textures such as trees and rocks in natural scenes. It was also useful for de-noising and filling in missing pixels in images with complex structures. The advantage of this model is that image codes can be learned with increasing numbers of classes thus providing greater flexibility in modelling structure and in finding more image features than in either Gaussian mixture models or standard ICA algorithms.
Introduction
Recently, Blind Source Separation by Independent Component Analysis has been applied to signal processing problems including speech enhancement, telecommunications and medical signal processing. ICA finds a linear non-orthogonal coordinate system in multivariate data determined by second- and higher-order statistics. The goal of ICA is to linearly transform the data in such a way that the transformed variables are as statistically independent from each other as possible [Jutten & Herault, 1991, Comon, 1994, Bell & Sejnowski, 1995, Cardoso & Laheld, 1996, Lee et al., 2000b]. ICA generalizes the technique of Principal Component Analysis (PCA) and, like PCA, has proven a useful tool for finding structure in data.
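In the standard square, noiseless formulation this amounts to the following model (schematic, with assumed notation):

```latex
\mathbf{x} \;=\; \mathbf{A}\,\mathbf{s},
\qquad
\mathbf{u} \;=\; \mathbf{W}\,\mathbf{x} \;\approx\; \mathbf{s}
\quad \text{(up to permutation and scaling of the components),}
```

where the sources s are assumed mutually independent and non-Gaussian, A is the unknown mixing matrix and W is the unmixing matrix estimated from the data.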
Independent Component Analysis (ICA), as a signal processing tool, has shown great promise in many application domains, two of the most successful being telecommunications and biomedical engineering. With the growing awareness of ICA many other less obvious applications of the transform are starting to appear, for example financial time series prediction [Back & Weigend, 1998] and information retrieval [Isbell & Viola, 1999, Girolami, 2000a, Vinokourov & Girolami, 2000]. In such applications ICA is being used as an unsupervised means of exploring and, hopefully, uncovering meaningful latent traits or structure within the data. In the case of financial time series prediction the factors which drive the evolution of the time series are hopefully uncovered, whereas within information retrieval the latent concepts or topics which generate key-words that occur in the documents are sought after. The strong assumption of independence of the hidden factors in ICA is difficult to argue for when there is limited a priori knowledge of the data; indeed, it is desired that the analysis uncover informative components which may, or may not, be independent. Nevertheless the independence assumption allows analytically tractable statistical models to be developed.
This chapter will consider how the standard ICA model can be extended and used in the unsupervised classification and visualisation of multivariate data. Prior to the formal presentation, to set the context of the remainder of the chapter, a short review of the ICA signal model and the corresponding statistical representation is given. The remaining sections propose ICA inspired techniques for the unsupervised classification and visualisation of multivariate data.
In recent years there has been an explosion of interest in the application and theory of independent component analysis (ICA). This book aims to provide a self-contained introduction to the subject as well as offering a set of invited contributions which we see as lying at the cutting edge of ICA research.
ICA is intimately linked with the problem of blind source separation, that is, attempting to recover a set of underlying sources when only a noisy mapping from these sources (the observations) is given, and we regard this as the canonical form of ICA. Until recently this mapping was taken to be linear (but see Chapter 4) and “traditionally” (if tradition is allowed in a field of such recent developments) noiseless, with the number of observations equal to the number of hypothesised sources. It is surprising that even the simplest of ICA models can be invaluable and offer new insights into data analysis and interpretation. This, at first sight unreasonable, claim may be supported by noting that many observations of physical systems are produced by a linear combination of underlying sources. Furthermore, in many applications, it is an end in itself to produce a set of “sources” which are statistically independent rather than just decorrelated (see Chapter 1), and for this ICA would appear an ideal tool.
One may think of blind source separation as the problem of identifying speakers (sources) in a room given only recordings from a number of microphones, each of which records a linear mixture of the sources, whose statistical characteristics are unknown.
Here we consider the blind source separation problem when the mixing of the sources is non-stationary. Pursuing the speakers in a room analogy, we address the problem of identifying the speakers when they (or equivalently, the microphones) are moving. The problem is cast in terms of a hidden state (the mixing proportions of the sources) which we track using particle filter methods, which permit the tracking of arbitrary state densities. Murata et al. [1997] have addressed this problem by adapting the learning rate, and we mention work by Penny et al. [2000] on hidden Markov models for ICA which allows for abrupt changes in the mixing matrix with stationary periods in between.
We first briefly re-review classical Independent Component Analysis. ICA with non-stationary mixing is described in terms of a hidden state model and methods for estimating the sources and the mixing are described. Particle filter techniques are then introduced for the modelling of state densities. Finally, we address the non-stationary mixing problem when the sources are independent, but possess temporal correlations.
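As a rough illustration of the particle filter machinery invoked here, the following Python sketch implements a generic bootstrap (sampling/importance-resampling) filter for a random-walk hidden state; the scalar state, the noise levels and the Gaussian likelihood are illustrative stand-ins for the time-varying mixing proportions, not the chapter's actual model:

```python
import numpy as np

def bootstrap_particle_filter(observations, likelihood, n_particles=500,
                              drift_std=0.05, seed=0):
    """Track a random-walk hidden state with a bootstrap particle filter.

    observations : sequence of observed data points
    likelihood   : function likelihood(obs, particles) -> unnormalized weight per particle
    Returns the posterior mean of the state at each time step.
    """
    rng = np.random.default_rng(seed)
    particles = rng.normal(size=n_particles)        # samples from the initial state density
    means = []
    for obs in observations:
        # propagate: random-walk dynamics for the hidden state
        particles = particles + drift_std * rng.normal(size=n_particles)
        # weight each particle by how well it explains the current observation
        w = likelihood(obs, particles)
        w /= w.sum()
        # resample particles in proportion to their weights
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
        means.append(particles.mean())
    return np.array(means)

# toy usage: noisy observations of a slowly drifting state
def gaussian_likelihood(obs, particles, obs_std=0.2):
    return np.exp(-0.5 * ((obs - particles) / obs_std) ** 2)

true_state = np.cumsum(0.05 * np.random.default_rng(1).normal(size=200))
obs = true_state + 0.2 * np.random.default_rng(2).normal(size=200)
estimates = bootstrap_particle_filter(obs, gaussian_likelihood)
```

Because the state density is represented by samples rather than by a parametric form, the same machinery can track arbitrary (e.g. multimodal) densities, which is the property exploited in this chapter.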