Optimization is an important operation in many domains of science and technology. Local optimization techniques typically employ some form of iterative procedure, based on derivatives of the function to be optimized (the objective function). These techniques normally involve parameters that must be set by the user, often by trial and error, and those parameters can have a strong influence on the convergence speed of the optimization. In several cases a significant speed advantage could be gained by varying these parameters during the optimization, to reflect the local characteristics of the function being optimized. Some parameter adaptation methods have been proposed for this purpose in deterministic optimization settings. For stochastic (also called on-line) optimization, however, there appears to be no simple and effective parameter adaptation method.
This paper proposes a new method for parameter adaptation in stochastic optimization. The method is applicable to a wide range of objective functions, as well as to a large set of local optimization techniques. We present the derivation of the method, details of its application to gradient descent and to some of its variants, and examples of its use in the gradient optimization of several functions, as well as in the training of a multilayer perceptron by on-line backpropagation.
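To make the setting concrete, here is a minimal sketch of per-parameter step-size adaptation for stochastic gradient descent. It is illustrative only: the function names and constants are hypothetical, and the rule shown (grow a step size when the current noisy gradient agrees in sign with a smoothed average of past gradients, shrink it otherwise) captures the general flavour of such schemes rather than the exact method derived in the paper.

```python
import numpy as np

def adaptive_sgd(grad, w0, n_steps, eta0=0.01, up=1.2, down=0.5, rho=0.9):
    """Stochastic gradient descent with one adaptive step size per parameter.

    Hypothetical sketch, not the paper's exact rule.
    """
    w = np.asarray(w0, dtype=float).copy()
    eta = np.full_like(w, eta0)      # per-parameter step sizes
    g_bar = np.zeros_like(w)        # smoothed past gradient (tames noise)
    for _ in range(n_steps):
        g = grad(w)                  # noisy gradient from a single example
        prod = g * g_bar             # sign agreement with the past, per parameter
        eta = np.where(prod > 0, eta * up,
                       np.where(prod < 0, eta * down, eta))
        w -= eta * g
        g_bar = rho * g_bar + (1 - rho) * g
    return w
```

The smoothing constant rho matters here: comparing against a raw previous gradient would make the adaptation itself a victim of the sampling noise that stochastic optimization introduces.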
Introduction
Optimization is an operation that is used in many different domains of science and technology. It normally consists of maximizing or minimizing a given function (the objective function), chosen to represent the quality of a given system. The system may be physical (mechanical, chemical, etc.), a mathematical model, a computer program, or even a mixture of several of these.
By
Sara Solla, Physics and Astronomy, Northwestern University, Evanston, IL 60208; Physiology, Northwestern University Medical School, Chicago, IL 60611, USA,
Ole Winther, CONNECT, The Niels Bohr Institute, 2100 Copenhagen Ø, Denmark; Theoretical Physics II, Lund University, S-223 62 Lund, Sweden
The recently proposed Bayesian approach to online learning is applied to learning a rule defined as a noisy single layer perceptron with either continuous or binary weights. In the Bayesian online approach the exact posterior distribution is approximated by a simpler parametric posterior that is updated online as new examples are incorporated into the dataset. In the case of continuous weights, the approximate posterior is chosen to be Gaussian. The computational complexity of the resulting online algorithm is found to be at least as high as that of the Bayesian offline approach, making the online approach less attractive. A Hebbian approximation based on casting the full covariance matrix into an isotropic diagonal form significantly reduces the computational complexity and yields a previously identified optimal Hebbian algorithm. In the case of binary weights, the approximate posterior is chosen to be a biased binary distribution. The resulting online algorithm is derived and shown to outperform several other online approaches to this problem.
Introduction
Neural networks are adaptive systems characterized by a set of parameters w, the weights and biases that specify the connectivity among the neuronal computational elements. Of particular interest is the ability of these systems to learn from examples. Traditional formulations of the learning problem are based on a dynamical prescription for the adaptation of the parameters w. The learning process thus generates a trajectory in w space that starts from a random initial assignment w0 and leads to a specific w* that is in some sense optimal.
By
Yoshiyuki Kabashima, Dept. of Comp. Intelligence and Systems Science, Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama 226, Japan,
Shigeru Shinomoto, Dept. of Physics, Kyoto University, Sakyo-ku, Kyoto 606-8502, Japan
By
A. C. C. Coolen, Department of Mathematics, King's College, University of London, Strand, London WC2R 2LS, U.K.,
D. Saad, Department of Computer Science and Applied Mathematics, Aston University, Aston Triangle, Birmingham B4 7ET, U.K.
We study the dynamics of supervised learning in layered neural networks, in the regime where the size p of the training set is proportional to the number N of inputs. Here the local fields are no longer described by Gaussian distributions. We show how dynamical replica theory can be used to predict the evolution of macroscopic observables, including the relevant performance measures, incorporating the theory of complete training sets in the limit p/N → ∞ as a special case. For simplicity we restrict ourselves here to single-layer networks and realizable tasks.
Introduction
In the last few years much progress has been made in the analysis of the dynamics of supervised learning in layered neural networks, using the strategy of statistical mechanics: by deriving from the microscopic dynamical equations a set of closed laws describing the evolution of suitably chosen macroscopic observables (dynamic order parameters) in the limit of an infinite system size [e.g. Kinzel & Rujan (1990), Kinouchi & Caticha (1992), Biehl & Schwarze (1992, 1995), Saad & Solla (1995)]. A recent review and more extensive guide to the relevant references can be found in Mace & Coolen (1998a).
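For orientation, the macroscopic observables conventionally chosen in this line of work, for a single-layer student with weights w learning a teacher with weights B over N inputs, are the overlaps below; for a boolean perceptron with isotropic Gaussian inputs (the standard complete-training-set setting that this chapter recovers as a special case) they determine the generalization error in closed form. This is background notation, not a result specific to the chapter.

```latex
R = \frac{1}{N}\,\mathbf{w}\cdot\mathbf{B}, \qquad
Q = \frac{1}{N}\,\mathbf{w}\cdot\mathbf{w}, \qquad
T = \frac{1}{N}\,\mathbf{B}\cdot\mathbf{B}, \qquad
\epsilon_g = \frac{1}{\pi}\arccos\!\left(\frac{R}{\sqrt{QT}}\right).
```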
By
Magnus Rattray, Neural Computing Research Group, Aston University, Birmingham B4 7ET, UK,
David Saad, Neural Computing Research Group, Aston University, Birmingham B4 7ET, UK
We analyse the dynamics of a number of second order on-line learning algorithms training multi-layer neural networks, using the methods of statistical mechanics. We first consider on-line Newton's method, which is known to provide optimal asymptotic performance. We determine the asymptotic generalization error decay for a soft committee machine, which is shown to compare favourably with the result for standard gradient descent. Matrix momentum provides a practical approximation to this method by allowing an efficient inversion of the Hessian. We consider an idealized matrix momentum algorithm which requires access to the Hessian and find close correspondence with the dynamics of on-line Newton's method. In practice, the Hessian will not be known on-line and we therefore consider matrix momentum using a single example approximation to the Hessian. In this case good asymptotic performance may still be achieved, but the algorithm is now sensitive to parameter choice because of noise in the Hessian estimate. On-line Newton's method is not appropriate during the transient learning phase, since a suboptimal unstable fixed point of the gradient descent dynamics becomes stable for this algorithm. A principled alternative is to use Amari's natural gradient learning algorithm and we show how this method provides a significant reduction in learning time when compared to gradient descent, while retaining the asymptotic performance of on-line Newton's method.
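As a point of reference, one idealized on-line Newton step on a single example might look like the sketch below (hypothetical names; it assumes oracle access to the Hessian, which, as the chapter stresses, is exactly what is unavailable on-line and what matrix momentum approximates).

```python
import numpy as np

def online_newton_step(w, grad_fn, hess_fn, example, eta=1.0, eps=1e-4):
    """One idealized on-line Newton update on a single training example.

    grad_fn and hess_fn return the single-example gradient and an
    estimate of the Hessian; eps regularizes the inversion.
    """
    g = grad_fn(w, example)
    H = hess_fn(w, example)
    H_reg = H + eps * np.eye(len(w))        # keep the Hessian invertible
    return w - eta * np.linalg.solve(H_reg, g)
```

Replacing the single-example Hessian estimate with an exact Hessian or a Fisher-matrix preconditioner turns this same update shape into the idealized Newton and natural-gradient variants the chapter compares.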
Introduction
On-line learning is a popular method for training multi-layer feed-forward neural networks, especially for large systems and for problems requiring rapid and adaptive data processing. Under the on-line learning framework, network parameters are updated according to only the latest in a sequence of training examples.
Online learning is discussed from the viewpoint of Bayesian statistical inference. By replacing the true posterior distribution with a simpler parametric distribution, one can define an online algorithm by a repetition of two steps: an update of the approximate posterior when a new example arrives, and an optimal projection into the parametric family. Choosing this family to be Gaussian, we show that the algorithm achieves asymptotic efficiency. An application to learning in single layer neural networks is given.
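In one dimension the two steps can be written down directly. The sketch below is illustrative only, with a grid-based projection that would not scale to real networks: it multiplies the current Gaussian approximation by the likelihood of the new example, then projects back onto the Gaussian family by matching mean and variance.

```python
import numpy as np

def bayes_online_step(mu, var, likelihood, n_grid=2001, width=8.0):
    """One step of Bayesian online learning with a Gaussian approximation.

    Step 1 (update): multiply the Gaussian posterior by the likelihood
    of the new example.  Step 2 (projection): match mean and variance
    to return to the Gaussian parametric family.
    """
    s = np.sqrt(var)
    w = np.linspace(mu - width * s, mu + width * s, n_grid)
    post = np.exp(-0.5 * (w - mu) ** 2 / var) * likelihood(w)
    post /= np.trapz(post, w)                        # normalize on the grid
    new_mu = np.trapz(w * post, w)                   # matched mean
    new_var = np.trapz((w - new_mu) ** 2 * post, w)  # matched variance
    return new_mu, new_var
```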
Introduction
Neural networks have the ability to learn from examples. For batch learning, a set of training examples is collected and subsequently an algorithm is run on the entire training set to adjust the parameters of the network. On the other hand, for many practical problems, examples arrive sequentially and an instantaneous action is required at each time step. In order to save memory and time, this action should not depend on the entire set of data that has arrived so far. This principle is realized in online algorithms, where usually only the last example is used for an update of the network's parameters. Obviously, some amount of information about the past examples is discarded in this approach. Surprisingly, recent studies have shown that online algorithms can achieve performance similar to that of batch algorithms when the number of data points grows large (Biehl and Riegler 1994; Barkai et al. 1995; Kim and Sompolinsky 1996).
In order to understand the abilities and limitations of online algorithms, the question of optimal online learning has been raised.
By
David Barber, Department of Medical Biophysics, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands,
Peter Sollich, Department of Physics, University of Edinburgh, Edinburgh EH9 3JZ, U.K.
We analyse online gradient descent learning from finite training sets at non-infinitesimal learning rates η for both linear and non-linear networks. In the linear case, exact results are obtained for the time-dependent generalization error of networks with a large number of weights N, trained on p = αN examples. This allows us to study in detail the effects of finite training set size α on, for example, the optimal choice of learning rate η. We also compare online and offline learning, for respective optimal settings of η at given final learning time. Online learning turns out to be much more robust to input bias and actually outperforms offline learning when such bias is present; for unbiased inputs, online and offline learning perform almost equally well. Our analysis of online learning for non-linear networks (namely, soft-committee machines), advances the theory to more realistic learning scenarios. Dynamical equations are derived for an appropriate set of order parameters; these are exact in the limiting case of either linear networks or infinite training sets. Preliminary comparisons with simulations suggest that the theory captures some effects of finite training sets, but may not yet account correctly for the presence of local minima.
Introduction
The analysis of online (gradient descent) learning, which is one of the most common approaches to supervised learning found in the neural networks community, has recently been the focus of much attention. The characteristic feature of online learning is that the weights of a network (‘student’) are updated each time a new training example is presented, such that the error on this example is reduced.
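Concretely, for a linear 'student' trained on a fixed set of p examples, the update reads as in the sketch below (a hypothetical minimal setting, not the soft-committee machine analysed in the chapter); note that eta multiplies the full single-example gradient, i.e. it is the non-infinitesimal learning rate of the analysis.

```python
import numpy as np

def online_gd(X, y, eta, n_steps, seed=0):
    """On-line gradient descent over a fixed training set of p examples.

    Each step presents one randomly drawn example and reduces the
    squared error on that example alone.
    """
    rng = np.random.default_rng(seed)
    p, N = X.shape
    w = np.zeros(N)
    for _ in range(n_steps):
        i = rng.integers(p)          # the latest example presented
        err = X[i] @ w - y[i]        # error on this single example
        w -= eta * err * X[i]        # gradient step at learning rate eta
    return w
```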
By
Mauro Copelli, Limburgs Universitair Centrum, B-3590 Diepenbeek, Belgium,
Nestor Caticha, Instituto de Física, Universidade de São Paulo, Caixa Postal 66318, 05389-970 São Paulo, SP, Brazil
On-line supervised learning in the general K Tree Committee Machine (TCM) is studied for a uniform distribution of inputs. Examples are corrupted by multiplicative noise in the teacher output. From the differential equations which describe the learning dynamics, the modulation function which optimizes the generalization ability is obtained exactly for any finite K. The asymptotic behavior of the generalization error is shown to be independent of K. Robustness with respect to a misestimation of the noise level is also shown to be independent of K.
Introduction
When looking into the properties of different neural network architectures by studying their performance in different model situations, the main objective is to search for similarities rather than to delve into the many differences. It is from these similarities that intrinsic properties of learning that go beyond the particular characteristics of the simple models may be identified.
In order to develop a program of this nature, several studies within the community of Statistical Mechanics of Neural Networks (Watkin, Rau and Biehl, 1993) have been pursued. Among the most important contributions this approach brings to the study of machine learning are the possibility of dealing with networks of very large size, that is, in the thermodynamic limit (TL), and the introduction of efficient techniques to average over the randomness associated with the data. The model scenarios that have been analysed arise from combinations of the different factors conditioning learning. These include, among others, unsupervised versus supervised learning, realizable rules or not, learning in the presence of noise or in the more idealized noiseless case, and learning in a time-dependent or constant environment.
A great deal of hyperbole has been devoted to neural networks, both in their first wave around 1960 (Widrow & Hoff, 1960; Rosenblatt, 1962) and in their renaissance from about 1985 (chiefly inspired by Rumelhart & McClelland, 1986), but the ideas of biological relevance seem to us to have detracted from the essence of what is being discussed, and are certainly not relevant to practical applications in pattern recognition. Because ‘neural networks’ has become a popular subject, it has collected many techniques which are only loosely related and were not originally biologically motivated. In this chapter we will discuss the core area of feed-forward or ‘back-propagation’ neural networks, which can be seen as extensions of the ideas of the perceptron (Section 3.6). From this connection, these networks are also known as multi-layer perceptrons.
A formal definition of a feed-forward network is given in the glossary. Informally, they have units which have one-way connections to other units, and the units can be labelled from inputs (low numbers) to outputs (high numbers) so that each unit is only connected to units with higher numbers. The units can always be arranged in layers so that connections go from one layer to a later layer. This is best seen graphically; see Figure 5.1.
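A minimal sketch of the forward computation may help fix ideas (hypothetical code, with logistic hidden units and a linear output, one common choice among many): information flows one way, each layer feeding only later layers.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a feed-forward network (multi-layer perceptron)."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # hidden layers: logistic units
    return weights[-1] @ h + biases[-1]          # output layer: linear units
```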
Pattern recognition has a long and respectable history within engineering, especially for military applications, but the cost of the hardware both to acquire the data (signals and images) and to compute the answers made it for many years a rather specialist subject. Hardware advances have made the concerns of pattern recognition of much wider applicability. In essence it covers the following problem:
‘Given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples.’
There are many examples from everyday life:
Name the species of a flowering plant.
Grade bacon rashers from a visual image.
Classify an X-ray image of a tumour as cancerous or benign.
Decide to buy or sell a stock option.
Give or refuse credit to a shopper.
Many of these are currently performed by human experts, but it is increasingly becoming feasible to design automated systems to replace the expert and either perform better (as in credit scoring) or ‘clone’ the expert (as in aids to medical diagnosis).
Neural networks have arisen from analogies with models of the way that humans might approach pattern recognition tasks, although they have developed a long way from the biological roots. Great claims have been made for these procedures, and although few of these claims have withstood careful scrutiny, neural network methods have had great impact on pattern recognition practice.
Tree-based methods for classification are relatively unfamiliar in both statistics and pattern recognition, yet they are widely used in some applications, such as botany (Figure 7.1) and medical diagnosis, because they are extremely easy to comprehend (and hence to have confidence in).
The automatic construction of decision trees dates from work in the social sciences by Morgan & Sonquist (1963) and Morgan & Messenger (1973). (Later work such as Doyle, 1973, and Doyle & Fenwick, 1975, commented on the pitfalls of such automated procedures.) In statistics Breiman et al. (1984) had a seminal influence both in bringing the work to the attention of statisticians and in proposing new algorithms for constructing trees. At around the same time decision tree induction was beginning to be used in the field of machine learning, which we review in Section 7.4, and in engineering (for example, Sethi & Sarvarayudu, 1982).
The terminology of trees is graphic, although conventionally trees such as Figure 7.2 are shown growing down the page. The root is the top node, and examples are passed down the tree, with decisions being made at each node until a terminal node or leaf is reached. Each non-terminal node contains a question on which a split is based. Each leaf contains the label of a classification. A subtree of T is a tree with root a node of T; it is a rooted subtree if its root is the root of T.
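The terminology translates directly into a data structure. The sketch below (hypothetical names) represents each non-terminal node as a question, a threshold on one feature, and passes an example down from the root until a leaf label is reached:

```python
class Node:
    """A tree node: either a question (feature, threshold) or a leaf (label)."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right   # rooted subtrees below the split
        self.label = label                    # set only at a terminal node

def classify(node, x):
    """Pass an example down the tree until a terminal node (leaf) is reached."""
    while node.label is None:                 # non-terminal: ask the question
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label
```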
The supervised methods considered so far have learnt both the structure of the probability distributions and the numerical values from the training set, or in the case of parametric methods, imposed a conventional structure for convenience. Other methods incorporate non-numerical ‘real-world’ knowledge about the subject domain into the structure of the probability distributions. Such knowledge is often about causal relationships, or perhaps the lack of causality as expressed by conditional independence.
These ideas have been most explored within the field of expert systems. This is a loosely defined area, and definitions vary:
‘The label “expert system” is, broadly speaking, a program intended to make reasoned judgements or to give assistance in a complex area in which human skills are fallible or scarce. …’
(Lauritzen & Spiegelhalter, 1988, p. 157)
‘A program designed to solve problems at a level comparable to that of a human expert in a given domain.’ (Cooper, 1989)
‘An expert system has two parts. The first one is the knowledge base. It usually makes up most of the system. In its simplest form it is a list of IF … THEN rules: each specifies what to do, or what conclusions to draw, under a set of well-defined circumstances’.
The second part of the expert system often goes under the name of 'shell'. As the name implies, it acts as a receptacle for the knowledge base and contains instruments for making efficient use of it.
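A toy version of this division of labour might look as follows (entirely hypothetical rules and names): the knowledge base is a list of IF … THEN rules, and the 'shell' is the loop that fires them against a set of established facts until no new conclusion can be drawn.

```python
def forward_chain(facts, rules):
    """Minimal 'shell': fire IF ... THEN rules until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)         # rule fires: add its conclusion
                changed = True
    return facts

# Hypothetical two-rule knowledge base.
rules = [({"has_fever", "has_rash"}, "suspect_measles"),
         ({"suspect_measles"}, "refer_to_specialist")]
print(forward_chain({"has_fever", "has_rash"}, rules))
```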