To act in a complex problem domain, a decision maker needs to know the current state of the domain in order to choose the most appropriate action. In a domain about which the decision maker has only uncertain knowledge and partial observations, it is often impossible to estimate the state of the domain with certainty. We introduce Bayesian networks as a concise graphical representation of a decision maker's probabilistic knowledge of an uncertain domain. We raise the issue of how to use such knowledge to estimate the current state of the domain effectively. To accomplish this task, the idea of message passing in graphical models is illustrated with several alternative methods. Subsequent chapters will present representational and computational techniques to address the limitation of these methods.
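As a minimal illustration (the network and all numbers here are invented for exposition, not taken from the text), the following sketch shows how a Bayesian network represents a joint distribution concisely as a product of conditional probability tables attached to the nodes of a DAG:

```python
# Minimal sketch: a three-node Bayesian network Cloudy -> Rain -> WetGrass
# over binary variables. The joint factorizes along the DAG as
# P(c, r, w) = P(c) * P(r | c) * P(w | r), so only 5 independent
# parameters are stored instead of the 2**3 - 1 = 7 of the full joint.
from itertools import product

P_c = {True: 0.5, False: 0.5}                 # P(Cloudy)
P_r = {True: {True: 0.8, False: 0.2},         # P(Rain | Cloudy), keyed P_r[c][r]
       False: {True: 0.1, False: 0.9}}
P_w = {True: {True: 0.9, False: 0.1},         # P(WetGrass | Rain), keyed P_w[r][w]
       False: {True: 0.0, False: 1.0}}

def joint(c, r, w):
    """P(c, r, w) from the chain-rule factorization."""
    return P_c[c] * P_r[c][r] * P_w[r][w]

# Sanity check: the factorized joint sums to one.
total = sum(joint(c, r, w) for c, r, w in product([True, False], repeat=3))
assert abs(total - 1.0) < 1e-12
```

For a domain with n binary variables the full joint needs 2^n − 1 numbers, whereas the network stores only one table per node; this gap is what makes the graphical representation concise.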
The basics of Bayesian probability theory are reviewed in Section 2.2. This is followed in Section 2.3 by a demonstration of the intractability of traditional belief updating using joint probability distributions. The necessary background in graph theory is then provided in Section 2.4. Section 2.5 introduces Bayesian networks as a concise graphical model for probabilistic knowledge. In Section 2.6, the fundamental idea of local computation and message passing in modern probabilistic inference using graphical models is illustrated using so-called λ–π message passing in tree-structured models. The limitation of λ–π message passing is discussed, followed by the presentation of an alternative exact inference method, loop cutset conditioning, in Section 2.7, and an alternative approximate inference method, forward stochastic sampling, in Section 2.8.
In Chapters 6 through 9, we studied in detail why and how a set of agents over a large and complex domain should be organized into an MSBN, and how those agents can perform probabilistic reasoning exactly, effectively, and distributively. In this chapter, we discuss other important issues that have not yet been addressed but will merit research effort in the near future.
Multiagent Reasoning in Dynamic Domains
Practical problem domains can be static or dynamic. In a static domain, each domain variable takes a value from its space and does not change that value over time. Hence, the instant at which an agent observes a variable makes no difference. In a dynamic domain, on the other hand, a variable may take different values from its space at different times. The temperature of a house changes after the heating is turned on. The pressure of a sealed boiler at a chemical plant increases after the liquid inside boils. A patient suffers from a disease and recovers after proper treatment. A device in a piece of equipment behaves normally until it wears out. Dynamic domains are more general: a static domain can be viewed as a snapshot of a dynamic domain at a particular instant, or within a time period in which the changes of variable values are negligible.
A Bayesian network can be used to model static and dynamic domains.
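For instance, a dynamic BN replicates each variable once per time slice and links consecutive slices by a transition model. The sketch below (illustrative only; the transition and observation probabilities are invented) performs forward filtering for a single binary state variable, in the spirit of the house-temperature example above:

```python
# Sketch of filtering in a simple dynamic model: a binary state Hot(t)
# evolves under a transition model and emits a noisy sensor reading.
transition = {True: 0.9, False: 0.3}   # P(Hot_t = True | Hot_{t-1} = True/False)
emission   = {True: 0.8, False: 0.1}   # P(reading "warm" | Hot_t = True/False)
belief     = 0.5                       # P(Hot_0 = True)

def filter_step(belief, warm):
    """Predict with the transition model, then condition on the reading."""
    predicted = belief * transition[True] + (1 - belief) * transition[False]
    l_true  = emission[True] if warm else 1 - emission[True]
    l_false = emission[False] if warm else 1 - emission[False]
    numer = l_true * predicted
    return numer / (numer + l_false * (1 - predicted))

for warm in [True, True, False]:       # a short observation sequence
    belief = filter_step(belief, warm)
print(f"P(Hot | readings) = {belief:.3f}")
```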
Chapter 3 has shown that, in order to use concise message passing in a single cluster graph for exact belief updating with a nontree BN, one must reorganize the DAG into a junction tree. As seen in Chapter 2, graphical representations of probabilistic knowledge gain their efficiency by exploiting conditional independence in terms of graphical separation. Therefore, the reorganization needs to preserve the independence–separation relations of the BN as much as possible. This chapter formally describes how independence is mapped into separation in different graphical structures and presents algorithms for converting a DAG dependence structure into a junction tree while preserving graphical separation to the extent possible.
Section 4.2 defines the graphical separation in three types of graphs commonly used for modeling probabilistic knowledge: u-separation in undirected graphs, d-separation in directed acyclic graphs, and h-separation in junction trees. The relation between conditional independence and the sufficient content of a message in concise message passing is established in Section 4.3. In Section 4.4, the concept of the independence map or I-map, which ties a graphical model to a problem domain based on the extent to which the model captures the conditional independence of the domain, is introduced. The concept of a moral graph is also introduced as an intermediate undirected graphical model to facilitate the conversion of a DAG model to a junction tree model. Section 4.5 introduces a class of undirected graphs known as chordal graphs and establishes the relation between chordal graphs and junction trees.
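To make moralization concrete, here is a small sketch (a standard construction; the example DAG is invented): for each node, all pairs of its parents are connected ("married"), and edge directions are dropped.

```python
# Sketch of moralization: marry all co-parents of each node, then drop
# the direction of every DAG edge.
from itertools import combinations

def moralize(parents):
    """parents: dict mapping node -> list of its parents in the DAG.
    Returns the edge set of the moral graph as frozensets."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                      # drop direction of each DAG edge
            edges.add(frozenset((p, child)))
        for u, v in combinations(pa, 2):  # marry co-parents
            edges.add(frozenset((u, v)))
    return edges

# Example: the v-structure A -> C <- B gains the moral edge A - B.
print(moralize({"A": [], "B": [], "C": ["A", "B"]}))
```

The added moral edge in the example reflects the fact that the two parents of a common child become dependent once the child is observed, so the undirected model must not separate them.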
Chapter 7 has presented the compilation of an MSBN into a linked junction forest (LJF) as an alternative dependence structure suitable for multiagent belief updating by concise message passing. Just as in the single-agent paradigm, in which the conditional probability distributions of a BN are converted into potentials in a junction tree model, the conditional probability distributions in an MSBN need to be converted into potentials in the LJF before inference can take place. This chapter presents methods for performing such conversions and for passing potentials as messages effectively among agents so that each agent can update its belief correctly with respect to the observations made by all agents in the system.
Section 8.2 defines the potentials associated with the components of an LJF and describes their initialization based on the probability distributions in the original MSBN. Section 8.3 analyzes the topological structures of two linkage trees over an agent interface computed by two adjacent agents through distributed computation. This analysis demonstrates that, even though each linkage tree is created by one of the agents independently, the two linkage trees have equivalent topologies. This result ensures that the two agents will have identical message structures when they communicate through the corresponding linkage trees. Sections 8.4 and 8.5 present direct interagent message passing between a pair of agents, and the effects of such message passing are formally established. The algorithms for multiagent communication through intra- and interagent message passing are presented in Section 8.6.
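To make the initialization step concrete, the following sketch shows its single-agent analogue in a junction tree (the LJF case in the text distributes this across agents, which this sketch does not capture). Each node's conditional probability table is assigned to one cluster containing the node's family; a cluster's potential is then the product of the tables assigned to it, and a cluster receiving nothing keeps a unit potential.

```python
# Sketch: assign each node's CPT to a home cluster whose variable set
# contains the node and its parents (the node's "family").
def assign_cpts(clusters, families):
    """clusters: list of variable sets; families: dict node -> family set.
    Returns cluster index -> nodes whose CPTs that cluster absorbs."""
    assignment = {i: [] for i in range(len(clusters))}
    for node, family in families.items():
        home = next(i for i, c in enumerate(clusters) if family <= c)
        assignment[home].append(node)
    return assignment

clusters = [{"A", "B", "C"}, {"C", "D"}]
families = {"A": {"A"}, "B": {"B"}, "C": {"A", "B", "C"}, "D": {"C", "D"}}
print(assign_cpts(clusters, families))   # {0: ['A', 'B', 'C'], 1: ['D']}
```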
An intelligent agent is a computational or natural system that senses its environment and takes actions intelligently according to its goals. We focus on computational (versus natural) agents that act in the interests of their human principals. Such intelligent agents aid humans in making decisions, and they can play several roles in the human decision process: that of a consultant, an assistant, or a delegate. For simplicity, we will refer to intelligent agents as just agents.
When an agent acts as a consultant (Figure 1.1), it senses the environment but does not take actions directly. Instead, it tells the human principal what it thinks should be done; the final decision rests with the human principal. Many expert systems, such as medical expert systems (Teach and Shortliffe [75]), are used in this way. In one possible scenario, human doctors independently examine patients and arrive at their own opinions about the diseases in question. Before the physicians finalize their diagnoses and treatments, however, the recommendations from expert systems are considered, possibly causing the doctors to revise their original opinions. Intelligent agents are used as consultants when the decision process can be conducted properly by humans with satisfactory results, when the consequences of a bad decision are serious, and when agent performance is comparable to that of humans but the agents have not yet been accorded a high degree of trust.
In Chapter 6, MSBNs were derived as the knowledge representation for multiagent uncertain reasoning under the five basic assumptions. As in the case of single-agent BNs, we want agents organized into an MSBN to perform exact inference effectively by concise message passing. Chapter 4 discussed converting, or compiling, a multiply connected BN into a junction tree model to perform belief updating by message passing. Because each subnet in an MSBN is multiply connected in general, a similar compilation is needed to perform belief updating in an MSBN by message passing. In this chapter, we present the issues and algorithms for the structural compilation of an MSBN. The outcome of the compilation is an alternative dependence structure called a linked junction forest. Most steps involved in compiling an MSBN, such as moralization, triangulation, and junction tree construction, parallel those used in compiling a BN, although additional issues must be dealt with.
The motivations for distributed compilation are discussed in Section 7.2. Section 7.3 presents algorithms for multiagent distributive compilation of the MSBN structure into its moral graph structure. Sections 7.4 and 7.5 introduce an alternative representation of the agent interface called a linkage tree, which is used to support concise interagent message passing. The need to construct linkage trees imposes additional constraints when the moral graph structure is triangulated into the chordal graph structure. Section 7.6 develops algorithms for multiagent distributive triangulation subject to these constraints.
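For orientation, the sketch below shows the core of triangulation by node elimination in the single-agent setting (the multiagent distributive algorithms of Section 7.6 add coordination constraints on top of this basic construction; the example graph is invented):

```python
# Sketch: triangulate an undirected (moral) graph by eliminating nodes in a
# given order; connecting each eliminated node's remaining neighbours
# pairwise adds the fill-in edges that make the graph chordal.
def triangulate(adj, order):
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    fill_ins = set()
    for v in order:
        nbs = list(adj[v])
        for i in range(len(nbs)):
            for j in range(i + 1, len(nbs)):
                u, w = nbs[i], nbs[j]
                if w not in adj[u]:
                    adj[u].add(w)
                    adj[w].add(u)
                    fill_ins.add(frozenset((u, w)))
        for u in nbs:                              # remove v from the graph
            adj[u].discard(v)
        del adj[v]
    return fill_ins

# The 4-cycle A-B-C-D-A is not chordal; eliminating A first adds the
# chord B-D, after which the graph is chordal.
cycle = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
print(triangulate(cycle, ["A", "B", "C", "D"]))   # {frozenset({'B', 'D'})}
```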
In the preceding chapters we investigated in detail the scenario of a student perceptron learning from a teacher perceptron. This is a typical example of what is commonly referred to as supervised learning. But we all gratefully acknowledge that learning from examples does not always require the presence of a teacher!
However, what is it that can be learned besides some specific classification of examples provided by a teacher? The key observation is that learning from unclassified examples is possible if their distribution has some underlying structure. The main issue in unsupervised learning is then to extract these intrinsic features from a set of examples alone. This problem is central to many pattern recognition and data compression tasks with a variety of important applications [110].
Far from attempting to review the many existing approaches to unsupervised learning, we will show in the present chapter how statistical mechanics methods introduced before can be applied to some special scenarios of unsupervised learning closely related to the teacher–student perceptron problem. This will illustrate on the one hand how statistical mechanics can be used for the analysis of unsupervised situations, while on the other hand we will gain new understanding of the supervised problem by reformulating it as a special case of an unsupervised one.
For a fixed set of input examples, one can decompose the N-sphere into cells each consisting of all the perceptron coupling vectors J giving rise to the same classification of those examples. Several aspects of perceptron learning discussed in the preceding chapters are related to the geometric properties of this decomposition, which turns out to have random multifractal properties. Our outline of the mathematical techniques related to the multifractal method will of course be short and ad rem; see [172, 173] for a more detailed introduction. But this alternative description provides a deeper and unified view of the different learning properties of the perceptron. It highlights some of the more subtle aspects of the thermodynamic limit and its role in the statistical mechanics analysis of perceptron learning. In this way we finish our discussion of the perceptron with an encompassing multifractal description, preparing the way for the application of this approach to the analysis of multilayer networks.
The shattered coupling space
Consider a set of p = αN examples ξ^µ generated independently at random from the uniform distribution on the N-sphere. Each hyperplane perpendicular to one of these inputs cuts the coupling space of a spherical perceptron, which is the very same N-sphere, into two half-spheres according to the two possible classifications of the example.
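For inputs in general position, the number of such cells is given by Cover's classical counting function C(p, N) = 2 Σ_{k=0}^{N−1} (p−1 choose k) (a standard result, quoted here for context rather than taken from this passage). The sketch below evaluates the fraction of the 2^p conceivable classifications that are actually realizable; it equals 1 for α ≤ 1 and collapses around the perceptron capacity α = 2:

```python
# Cover's counting function: the number of cells into which p hyperplanes
# in general position through the origin cut the N-dimensional coupling space.
from math import comb

def cells(p, N):
    return 2 * sum(comb(p - 1, k) for k in range(N))

N = 20
for alpha in [0.5, 1.0, 2.0, 3.0]:
    p = int(alpha * N)
    # fraction of the 2**p conceivable classifications that are realizable
    print(f"alpha = {alpha}: C(p, N) / 2**p = {cells(p, N) / 2**p:.4f}")
```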
In this book we have discussed how various aspects of learning in artificial neural networks may be quantified by using concepts and techniques developed in the statistical mechanics of disordered systems. These methods grew out of the desire to understand some strange low-temperature properties of disordered magnets; nevertheless their usefulness for and efficiency in the analysis of a completely different class of complex systems underlines the generality and strength of the principles of statistical mechanics.
In this final chapter we have collected some additional examples of non-physical complex systems for which an analysis using methods of statistical mechanics similar to those employed for the study of neural networks has given rise to new and interesting results. Compared with the previous chapters, the discussions in the present one will be somewhat more superficial – merely pointing to the qualitative analogies with the problems elucidated previously, rather than working out the consequences in full detail. Moreover, some of the problems we consider are strongly linked to information processing and artificial neural networks, whereas others are not. In all cases quenched random variables are used to represent complicated interactions which are not known in detail, and the typical behaviour in a properly defined thermodynamic limit is of particular interest.
Support vector machines
The main reason the perceptron is not a serious candidate for solving many real-world learning problems is that it can implement only linearly separable Boolean functions.
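The standard illustration of this limitation is the XOR function (the example below is illustrative, not from the text): XOR cannot be separated by any hyperplane in the original input space, but a simple feature map makes it linearly separable, which is precisely the idea exploited by support vector machines:

```python
# XOR with +/-1 coding: the label is -x1*x2, which no linear threshold
# on (x1, x2) can realize. Adding the product feature x1*x2 fixes this.
xor_data = [((+1, +1), -1), ((+1, -1), +1), ((-1, +1), +1), ((-1, -1), -1)]

def phi(x):
    """Quadratic feature map to a 3-dimensional feature space."""
    return (x[0], x[1], x[0] * x[1])

w = (0.0, 0.0, -1.0)   # a separating weight vector in feature space
for x, label in xor_data:
    s = sum(wi * fi for wi, fi in zip(w, phi(x)))
    assert (1 if s > 0 else -1) == label
print("XOR is linearly separable in the feature space (x1, x2, x1*x2).")
```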
The Gibbs rule discussed in the previous chapter characterizes the typical generalization behaviour of the students forming the version space. It is hence well suited for a general theoretical analysis. For a concrete practical problem it is, however, hardly the best choice, and there is a variety of other learning rules that are often more direct and may also show better performance. The purpose of this chapter is to introduce a representative selection of these learning rules, to discuss some of their features, and to compare their properties with those of the Gibbs rule.
The Hebb rule
The oldest and perhaps most important learning rule was introduced by D. Hebb in the late 1940s. It is, in fact, an application at the level of single neurons of the idea behind Pavlov's coincidence training. In his famous experiment, Pavlov showed how a dog that was trained to receive its food while a light was turned on would also start to salivate when the light alone was lit. In some way, the coincidence of the two events, food and light, had established a connection in the brain of the dog such that, even when only one of the events occurred, the memory of the other would be stimulated. The basic idea behind the Hebb rule [32] is quite similar: strengthen the connections between neurons that fire together.
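As a minimal numerical sketch of the Hebb rule in the teacher–student perceptron setting (assuming the standard geometric relation ε = arccos(R)/π between the generalization error and the normalized teacher–student overlap R):

```python
# Hebb rule for a student perceptron: the couplings are the sum of the
# inputs weighted by the teacher's outputs, J = sum_mu sigma_T^mu * xi^mu.
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 1000, 5.0
p = int(alpha * N)

teacher = rng.standard_normal(N)            # teacher coupling vector
xi = rng.standard_normal((p, N))            # p = alpha*N random examples
sigma_T = np.sign(xi @ teacher)             # teacher classifications

J = sigma_T @ xi                            # Hebbian student couplings
R = (J @ teacher) / (np.linalg.norm(J) * np.linalg.norm(teacher))
eps = np.arccos(R) / np.pi                  # generalization error
print(f"overlap R = {R:.3f}, generalization error = {eps:.3f}")
```

Increasing α drives the overlap R towards 1 and hence the generalization error towards zero.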
As a rule, teachers are unreliable. From time to time they mix up questions or answer absentmindedly. How much can a student network learn about a target rule if some of the examples in the training set are corrupted by random noise? What is the optimal strategy for the student in this more complicated situation?
To analyse these questions in detail for the two-perceptron scenario is the aim of the present chapter. Let us emphasize that quite generally a certain robustness with respect to random influences is an indispensable requirement for any information processing system, both in biological and in technical contexts. If learning from examples were possible only for perfectly error-free training sets it would be of no practical interest. In fact, since the noise blurring the correct classifications of the teacher may usually be assumed to be independent of the examples, one expects that it will remain possible to infer the rule, probably at the expense of a larger training set.
A general feature of noisy generalization tasks is that the training set is no longer generated by a rule that can be implemented by the student; the problem is said to be unrealizable. A simple example is a training set containing the same input with different outputs, which is quite possible with noisy teachers. This means that for large enough training sets no student exists who is able to reproduce all classifications, and the version space becomes empty.
So far we have been considering learning scenarios in which generalization shows up as a gradual process of improvement, with the generalization error ε decreasing continuously from its initial pure-guessing value ε = 0.5 to the asymptotic limit ε = 0. In the present chapter we study systems which display a quite different behaviour, with sudden changes of the generalization ability taking place during the learning process. The reason for this new feature is the presence of discrete degrees of freedom among the parameters that are adapted during the learning process. As we will see, discontinuous learning is a rather subtle consequence of this discreteness, and methods of statistical mechanics are well suited to describe the situation. In particular, the abrupt changes which occur in the generalization process can be described as first-order phase transitions, which are well studied in statistical physics.
Smooth networks
The learning scenarios discussed so far have been described in the framework of statistical mechanics as a continuous shift of the balance between energetic and entropic terms. In the case of perfect learning the energy describes how difficult it is for the student vector to stay in the version space (see (2.13)). For independent examples it is naturally given as a sum over the training set and scales for large α as e ∼ αε, since the generalization error ε gives the probability of error, and hence of an additional cost, when a new example is presented.
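Spelled out (a hedged reconstruction of this scaling argument; θ denotes the step function and σ_T^µ the teacher's label, consistent with an error-counting energy such as (2.13)):

```latex
\[
  E(\mathbf{J}) \;=\; \sum_{\mu=1}^{p} \theta\!\bigl(-\sigma_T^{\mu}\,\sigma^{\mu}(\mathbf{J})\bigr),
  \qquad
  e \;=\; \frac{\langle E(\mathbf{J})\rangle}{N}
    \;\approx\; \frac{p\,\varepsilon}{N} \;=\; \alpha\,\varepsilon,
\]
```

since each of the p = αN independently drawn examples is misclassified with probability ε.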
In the present chapter we introduce the basic notions necessary to study learning problems within the framework of statistical mechanics. We also demonstrate the efficiency of learning from examples by the numerical analysis of a very simple situation. Generalizing from this example we will formulate the basic setup of a learning problem in statistical mechanics to be discussed in numerous modifications in later chapters.
Artificial neural networks
The statistical mechanics of learning has been developed primarily for networks of so-called formal neurons. The aim of these networks is to model some of the essential information processing abilities of biological neural networks on the basis of artificial systems with a similar architecture. Formal neurons, the microscopic building blocks of these artificial neural networks, were introduced more than 50 years ago by McCulloch and Pitts as extremely simplified models of the biological neuron [1]. They are bistable linear threshold elements which are either active or passive, to be denoted in the following by a binary variable S = ±1. The state S_i of a given neuron i changes with time because of the signals it receives through its synaptic couplings J_ij from either the “outside world” or other neurons j.
More precisely, neuron i sums up the incoming activity of all the other neurons weighted by the corresponding synaptic coupling strengths to yield the post-synaptic potential ∑_j J_ij S_j and compares the result with a threshold θ_i specific to neuron i.
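In code (a minimal sketch with random placeholder couplings), the McCulloch–Pitts update just described reads:

```python
# Synchronous update of formal neurons: neuron i fires (S_i = +1) iff its
# post-synaptic potential sum_j J_ij * S_j reaches its threshold theta_i.
import numpy as np

def update(S, J, theta):
    return np.where(J @ S >= theta, 1, -1)

rng = np.random.default_rng(1)
n = 5
S = rng.choice([-1, 1], size=n)     # current neuron states S_j = +/-1
J = rng.standard_normal((n, n))     # synaptic couplings J_ij
np.fill_diagonal(J, 0.0)            # no self-coupling
theta = np.zeros(n)                 # thresholds theta_i
print(update(S, J, theta))          # next states
```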