Specifying predicates on the system state provides an important handle to specify, observe, and detect the behavior of a system. This is useful in formally reasoning about the system behavior. By being able to detect a specified predicate in the execution, we gain the ability to monitor the execution. Predicate specification and detection have uses in distributed debugging, in sensor networks used for sensing in various applications, and in industrial process control. As an example from manufacturing, a system may monitor the pressure of Reagent A and the temperature of Reagent B. Only when ψ1 = (PressureA > 240 KPa) ∧ (TemperatureB > 300 °C) should the two reagents be mixed. As another example, consider a distributed execution where variables x, y, and z are local to processes Pi, Pj, and Pk, respectively. An application might be interested in detecting the predicate ψ2 = xi + yj + zk < −125. In a nuclear power plant, sensors at various locations would monitor relevant parameters such as the radioactivity level and the temperature at multiple locations within the reactor.
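As an illustration, the following sketch (in Python; the function names, state layout, and sample values are hypothetical, not from any monitoring API) shows how such predicates can be evaluated once a candidate global state has been collected:

    # Illustrative sketch: evaluating global predicates over a collected
    # candidate global state. All names and sample values are hypothetical;
    # only the thresholds come from the predicates above.

    def psi_1(pressure_a_kpa, temperature_b_c):
        # psi_1 = (PressureA > 240 KPa) AND (TemperatureB > 300 C)
        return pressure_a_kpa > 240 and temperature_b_c > 300

    def psi_2(x_i, y_j, z_k):
        # psi_2 = x_i + y_j + z_k < -125, each variable local to a
        # different process
        return x_i + y_j + z_k < -125

    # A monitor that has gathered one candidate global state:
    state = {"pressure_a_kpa": 250.0, "temperature_b_c": 310.0}
    if psi_1(**state):
        print("psi_1 holds in this state: the reagents may be mixed")

Evaluating the predicate is the easy part; the crux of the detection problem is deciding which collections of local values constitute legitimate global states of the execution.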
Observe that the “predicate detection” problem is inherently different from the global snapshot problem. A global snapshot gives one of the possible states that could have existed during the period of the snapshot execution. Thus, a snapshot algorithm can observe only one of the predicate values that could have existed during the algorithm execution.
Recording the global state of a distributed system on-the-fly is an important paradigm when one is interested in analyzing, testing, or verifying properties associated with distributed executions. Unfortunately, the lack of both a globally shared memory and a global clock in a distributed system, added to the fact that message transfer delays in these systems are finite but unpredictable, makes this problem non-trivial.
This chapter first defines consistent global states (also called consistent snapshots) and discusses the issues that have to be addressed to compute consistent distributed snapshots. Then several algorithms to determine such snapshots on the fly are presented for several types of networks, classified according to the properties of their communication channels, namely, FIFO, non-FIFO, and causal delivery.
Introduction
A distributed computing system consists of spatially separated processes that do not share a common memory and communicate asynchronously with each other by message passing over communication channels. Each component of a distributed system has a local state. The state of a process is characterized by the state of its local memory and a history of its activity. The state of a channel is characterized by the set of messages sent along the channel less the messages received along the channel. The global state of a distributed system is a collection of the local states of its components.
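These definitions translate directly into a small data layout; the sketch below (Python; the representation is illustrative, not a prescribed one) computes a channel's state as the multiset of messages sent on it less those received on it, and assembles a global state from the local and channel states:

    # Illustrative data layout for a global state. The state of a channel
    # is the multiset of messages sent along it less those received.

    from collections import Counter

    def channel_state(sent_msgs, received_msgs):
        # Messages still in transit on the channel.
        return list((Counter(sent_msgs) - Counter(received_msgs)).elements())

    local_states = {"P1": {"x": 5}, "P2": {"y": -3}}
    channel_states = {("P1", "P2"): channel_state(["m1", "m2"], ["m1"])}
    global_state = (local_states, channel_states)
    print(global_state)   # channel (P1, P2) still carries m2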
Recording the global state of a distributed system is an important paradigm and it finds applications in several aspects of distributed system design.
Deadlocks are a fundamental problem in distributed systems and deadlock detection in distributed systems has received considerable attention in the past. In distributed systems, a process may request resources in any order, which may not be known a priori, and a process can request a resource while holding others. If the allocation sequence of process resources is not controlled in such environments, deadlocks can occur. A deadlock can be defined as a condition where a set of processes request resources that are held by other processes in the set.
Deadlocks can be dealt with using any one of the following three strategies: deadlock prevention, deadlock avoidance, and deadlock detection. Deadlock prevention is commonly achieved either by having a process acquire all the needed resources simultaneously before it begins execution or by pre-empting a process that holds the needed resource. In the deadlock avoidance approach to distributed systems, a resource is granted to a process only if the resulting global system state is safe. Deadlock detection requires an examination of the status of the process–resource interactions for the presence of a deadlock condition. To resolve a detected deadlock, we have to abort a deadlocked process.
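Detection is typically phrased in terms of the wait-for graph (WFG), in which an edge Pi → Pj means that Pi is blocked waiting for a resource held by Pj; under the simplest (single-request) model a deadlock corresponds to a cycle in this graph. The following centralized sketch (Python, illustrative) detects such a cycle; the distributed algorithms studied in this chapter achieve the same effect without assembling the whole graph at one site:

    # Centralized sketch: deadlock detection as cycle detection in a
    # wait-for graph. Edge p -> q means p waits for a resource held by q.

    def has_deadlock(wfg):
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {p: WHITE for p in wfg}

        def visit(p):
            color[p] = GRAY
            for q in wfg.get(p, ()):
                if color.get(q, WHITE) == GRAY:      # back edge: cycle found
                    return True
                if color.get(q, WHITE) == WHITE and visit(q):
                    return True
            color[p] = BLACK
            return False

        return any(color[p] == WHITE and visit(p) for p in list(wfg))

    # P1 waits for P2, P2 for P3, P3 for P1: a deadlock.
    print(has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}))  # True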
In this chapter, we study several distributed deadlock detection techniques based on various strategies.
System model
A distributed system consists of a set of processors that are connected by a communication network. The communication delay is finite but unpredictable.
This chapter deals with the design of fault-tolerant distributed systems. It is widely known that the design and verification of fault-tolerant distributed systems is a difficult problem. Consensus and atomic broadcast are two important paradigms in the design of fault-tolerant distributed systems and they find wide applications. Consensus allows a set of processes to reach a common decision or value that depends upon the initial values at the processes, regardless of failures. In atomic broadcast, processes reliably broadcast messages such that they agree on the set of messages delivered and the order of message deliveries.
This chapter focuses on solutions to consensus and atomic broadcast problems in asynchronous distributed systems. In asynchronous distributed systems, there is no bound on the time it takes for a process to execute a computation step or for a message to go from its sender to its receiver. In an asynchronous distributed system, there is no upper bound on the relative processor speeds, execution times, clock drifts, and message transmission delays, although they are finite. This asynchrony is caused mainly by unpredictable loads on the system, and it means that no timing assumptions of any kind can be made. On the other hand, synchronous systems are characterized by strict bounds on the execution times and message transmission delays.
The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems. Usually causality is tracked using physical time. However, in distributed systems, it is not possible to have global physical time; it is possible to realize only an approximation of it. As asynchronous distributed computations make progress in spurts, it turns out that logical time, which advances in jumps, is sufficient to capture the fundamental monotonicity property associated with causality in distributed systems. This chapter discusses three ways to implement logical time (scalar time, vector time, and matrix time) that have been proposed to capture causality between events of a distributed computation.
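As a preview, the following minimal sketch (Python; the class and method names are mine) implements scalar time: a process increments its clock at each event, piggybacks the clock value on outgoing messages, and on a receive advances its clock past the received timestamp:

    # Minimal sketch of scalar (Lamport) logical time. Rule 1: increment
    # the clock before each event. Rule 2: on a receive, take the maximum
    # of the local clock and the piggybacked timestamp, then increment.

    class ScalarClock:
        def __init__(self):
            self.c = 0

        def local_event(self):
            self.c += 1
            return self.c

        def send_event(self):
            self.c += 1
            return self.c            # timestamp piggybacked on the message

        def receive_event(self, msg_ts):
            self.c = max(self.c, msg_ts) + 1
            return self.c

    p, q = ScalarClock(), ScalarClock()
    ts = p.send_event()              # p's clock becomes 1
    print(q.receive_event(ts))       # q's clock jumps to 2

Scalar time guarantees monotonicity: if an event causally precedes another, its clock value is smaller. Vector and matrix time strengthen this property, as discussed in the chapter.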
Causality (or the causal precedence relation) among events in a distributed system is a powerful concept for reasoning, analyzing, and drawing inferences about a computation. The knowledge of the causal precedence relation among the events of processes helps solve a variety of problems in distributed systems. Some examples of these problems are as follows:
Distributed algorithms design The knowledge of the causal precedence relation among events helps ensure liveness and fairness in mutual exclusion algorithms, helps maintain consistency in replicated databases, and helps design correct deadlock detection algorithms to avoid phantom and undetected deadlocks.
Peer-to-peer (P2P) network systems use an application-level organization of the network overlay for flexibly sharing resources (e.g., files and multimedia documents) stored across network-wide computers. In contrast to the client–server model, any node in a P2P network can act as a server to others and, at the same time, act as a client. Communication and exchange of information is performed directly between the participating peers, and all nodes in the network have equal standing. Thus, P2P networks differ from other Internet applications in that they tend to share data from a large number of end users rather than from the more central machines and Web servers. Several well-known P2P networks that allow P2P file-sharing include Napster, Gnutella, Freenet, Pastry, Chord, and CAN.
Traditional distributed systems used DNS (domain name service) to provide a lookup from host names (logical names) to IP addresses. Special DNS servers are required, and manual configuration of the routing information is necessary to allow requesting client nodes to navigate the DNS hierarchy. Further, DNS is confined to locating hosts or services (not data objects that have to be a priori associated with specific computers), and host names need to be structured as per administrative boundary regulations. P2P networks overcome these drawbacks, and, more importantly, allow the location of arbitrary data objects.
In this chapter, we first study a methodical framework in which distributed algorithms can be classified and analyzed. We then consider some basic distributed graph algorithms. We then study synchronizers, which provide the abstraction of a synchronous system over an asynchronous system. Finally, we look at some practical graph problems, to appreciate the necessity of designing efficient distributed algorithms.
Topology abstraction and overlays
The topology of a distributed system can be typically viewed as an undirected graph in which the nodes represent the processors and the edges represent the links connecting the processors. Weights on the edges can represent some cost function we need to model in the application. There are usually three (not necessarily distinct) levels of topology abstraction that are useful in analyzing the distributed system or a distributed application. These are now described using Figure 5.1. To keep the figure simple, only the relevant end hosts participating in the application are shown. The WANs are indicated by ovals drawn using dashed lines. The switching elements inside the WANs, and other end hosts that are not participating in the application, are not shown even though they belong to the physical topological view. Similarly, all the edges connecting all end hosts and all edges connecting to all the switching elements inside the WANs also belong to the physical topology view even though only some edges are shown.
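Such a topology view can be captured with a plain weighted adjacency structure; the sketch below (Python, with illustrative node names and weights) is one common representation:

    # One common representation of the system topology: an undirected
    # weighted graph stored as an adjacency dictionary. Names and
    # weights are illustrative only.

    class Topology:
        def __init__(self):
            self.adj = {}                        # node -> {neighbor: weight}

        def add_link(self, u, v, weight=1.0):
            self.adj.setdefault(u, {})[v] = weight
            self.adj.setdefault(v, {})[u] = weight   # undirected edge

        def neighbors(self, u):
            return self.adj.get(u, {})

    t = Topology()
    t.add_link("A", "B", weight=2.5)
    print(t.neighbors("A"))                      # {'B': 2.5}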
The field of distributed computing covers all aspects of computing and information access across multiple processing elements connected by any form of communication network, whether local or wide-area in coverage. Since the advent of the Internet in the 1970s, there has been a steady growth of new applications requiring distributed processing. This has been enabled by advances in networking and hardware technology, the falling cost of hardware, and greater end-user awareness. These factors have contributed to making distributed computing a cost-effective, high-performance, and fault-tolerant reality. Around the turn of the millennium, there was an explosive growth in the expansion and efficiency of the Internet, which was matched by increased access to networked resources through the World Wide Web, all across the world. Coupled with an equally dramatic growth in the wireless and mobile networking areas, and the plummeting prices of bandwidth and storage devices, we are witnessing a rapid spurt in distributed applications and an accompanying interest in the field of distributed computing in universities, government organizations, and private institutions.
Advances in hardware technology have suddenly made sensor networking a reality, and embedded and sensor networks are rapidly becoming an integral part of everyone's life – from the home network with interconnected gadgets, to the automobile communicating by GPS (global positioning system), to the fully networked office with RFID monitoring. In the emerging global village, distributed computing will be the centerpiece of all computing and information access sub-disciplines within computer science.
Agreement among the processes in a distributed system is a fundamental requirement for a wide range of applications. Many forms of coordination require the processes to exchange information to negotiate with one another and eventually reach a common understanding or agreement, before taking application-specific actions. A classical example is that of the commit decision in database systems, wherein the processes collectively decide whether to commit or abort a transaction that they participate in. In this chapter, we study the feasibility of designing algorithms to reach agreement under various system models and failure models, and, where possible, examine some representative algorithms to reach agreement.
We first state some assumptions underlying our study of agreement algorithms:
Failure models Among the n processes in the system, at most f processes can be faulty. A faulty process can behave in any manner allowed by the failure model assumed. The various failure models – fail-stop, send omission and receive omission, and Byzantine failures – were discussed in Chapter 5. Recall that in the fail-stop model, a process may crash in the middle of a step, which could be the execution of a local operation or processing of a message for a send or receive event. In particular, it may send a message to only a subset of the destination set before crashing. In the Byzantine failure model, a process may behave arbitrarily.
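For intuition about what these models permit, consider the fail-stop case in a synchronous system: a simple strategy is for every process to flood the set of values it has seen for f + 1 rounds and then decide deterministically. The sketch below (Python) simulates this in one address space, with the simplification that a crashed process sends nothing at all (the fail-stop model also allows crashing after sending to only a subset of destinations):

    # Illustrative simulation of (f+1)-round flooding consensus under
    # crash (fail-stop) faults in a synchronous system. Simplification:
    # a crashed process sends nothing; the fail-stop model also allows
    # crashing mid-broadcast, after reaching only some destinations.

    def flooding_consensus(initial_values, f, crashed=frozenset()):
        n = len(initial_values)
        seen = [{v} for v in initial_values]
        for _ in range(f + 1):
            new_seen = [set(s) for s in seen]
            for sender in range(n):
                if sender in crashed:
                    continue                     # crashed: silent this round
                for receiver in range(n):
                    new_seen[receiver] |= seen[sender]
            seen = new_seen
        # Surviving processes decide on the minimum value they have seen.
        return [min(s) for i, s in enumerate(seen) if i not in crashed]

    print(flooding_consensus([3, 1, 2], f=1, crashed={0}))   # both survivors decide 1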
Distributed systems today are ubiquitous and enable many applications, including client–server systems, transaction processing, the World Wide Web, and scientific computing, among many others. Distributed systems are not inherently fault-tolerant, and the vast computing potential of these systems is often hampered by their susceptibility to failures. Many techniques have been developed to add reliability and high availability to distributed systems. These techniques include transactions, group communication, and rollback recovery, and they differ in their tradeoffs and focus. This chapter covers rollback recovery protocols, which restore the system to a consistent state after a failure.
Rollback recovery treats a distributed system application as a collection of processes that communicate over a network. It achieves fault tolerance by periodically saving the state of a process during the failure-free execution, enabling it to restart from a saved state upon a failure to reduce the amount of lost work. The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is called rollback recovery. A checkpoint can be saved on either the stable storage or the volatile storage depending on the failure scenarios to be tolerated.
In distributed systems, rollback recovery is complicated because messages induce inter-process dependencies during failure-free operation. Upon a failure of one or more processes in a system, these dependencies may force some of the processes that did not fail to roll back, creating what is commonly called rollback propagation.
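For a single process, the checkpoint/restart mechanics are simple; the sketch below (Python; the file name and state layout are illustrative, with a file standing in for stable storage) saves a checkpoint and rolls back to it after a failure. The difficulty this chapter addresses is coordinating such rollbacks across processes so that the restored global state is consistent:

    # Minimal single-process checkpoint/rollback sketch. A file models
    # stable storage; the path and state contents are illustrative.

    import pickle

    CHECKPOINT_PATH = "proc_state.ckpt"

    def take_checkpoint(state):
        with open(CHECKPOINT_PATH, "wb") as f:
            pickle.dump(state, f)            # saved to "stable storage"

    def rollback():
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)            # restart from the last checkpoint

    state = {"step": 42, "balance": 100.0}
    take_checkpoint(state)
    state["step"] = 57                        # further failure-free work
    state = rollback()                        # failure: work since checkpoint is lost
    print(state["step"])                      # 42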
In a distributed system, processes make local decisions based on their limited view of the system state. A process learns of new facts when it receives messages from other processes, and can reason only with the additional knowledge available to it. This chapter provides a formal framework in which it is easier to understand the role of knowledge in the system, and how processes can reason with such knowledge. The first three sections are based on the book by Fagin et al. The logic of knowledge, classically termed epistemic logic, is the formal logical analysis of reasoning about knowledge. Epistemic logic first received much attention from philosophers in the mid-twentieth century.
The muddy children puzzle
Consider the classical “muddy children” puzzle of Halpern and Moses and Halpern and Fagin. Imagine there are n children who return from playing outdoors, and k, k ≥ 1, of the n children have mud on their foreheads. Let Ψ denote the fact “at least one child has a muddy forehead.” Assume that each child can see all other children and their foreheads, but not their own forehead. We also assume that the children are intelligent and truthful, and answer any question asked of them, simultaneously. We now consider two scenarios.
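The reasoning in these scenarios can be checked mechanically with a possible-worlds model; in the sketch below (Python, illustrative), a world assigns muddy/clean to each child, a child considers possible every world agreeing with the foreheads it sees, and each public round of "no" answers eliminates worlds. The muddy children first know their status in round k:

    # Simulation of the muddy children puzzle via possible worlds.
    from itertools import product

    def muddy_children(actual):
        n = len(actual)
        assert any(actual), "the parent's announcement must be true"
        # Parent's public announcement: at least one child is muddy.
        worlds = [w for w in product([False, True], repeat=n) if any(w)]
        round_no = 0
        while True:
            round_no += 1
            # Child i knows its status iff all remaining worlds that match
            # what it sees (everyone else's forehead) agree on w[i].
            def knows(i):
                compatible = [w for w in worlds
                              if all(w[j] == actual[j] for j in range(n) if j != i)]
                return len({w[i] for w in compatible}) == 1
            answers = [knows(i) for i in range(n)]
            if any(answers):
                return round_no, answers
            # A public round of "no" eliminates every world in which some
            # child would have known.
            def knows_in(w, i):
                compatible = [v for v in worlds
                              if all(v[j] == w[j] for j in range(n) if j != i)]
                return len({v[i] for v in compatible}) == 1
            worlds = [w for w in worlds if not any(knows_in(w, i) for i in range(n))]

    print(muddy_children((True, True, False)))   # (2, [True, True, False]): k = 2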
Inter-process communication via message-passing is at the core of any distributed system. In this chapter, we will study non-FIFO, FIFO, causal order, and synchronous order communication paradigms for ordering messages. We will then examine protocols that provide these message orders. We will also examine several semantics for group communication with multicast – in particular, causal ordering and total ordering. We will then look at how exact semantics can be specified for the expected behavior in the face of processor or link failures. Multicasts are required at the application layer when superimposed topologies or overlays are used, as well as at the lower layers of the protocol stack. We will examine some popular multicast algorithms at the network layer. An example of such an algorithm is the Steiner tree algorithm, which is useful for setting up multi-party teleconferencing and videoconferencing multicast sessions.
Notation
As before, we model the distributed system as a graph (N, L). The following notation is used to refer to messages and events:
When referring to a message without regard for the identity of the sender and receiver processes, we use mi. For message mi, its send and receive events are denoted as si and ri, respectively.
More generally, send and receive events are denoted simply as s and r. When the relationship between the message and its send and receive events is to be stressed, we also use M, send(M), and receive(M), respectively.
In wireless communications, transmission power is an important resource. Power control, also known as transmit power control, is a significant design problem in modern wireless networks. Power control comprises the techniques and algorithms used to manage and adjust the transmitted power of base stations and handsets. Power control serves several purposes, including reducing cochannel interference (CCI), managing data quality, maximizing cell capacity, and minimizing the handset's mean transmit power. In this chapter, we illustrate the basic power-control problems and some possible solutions.
In wireless communication systems, two important detrimental effects that decrease network performance are the time-varying nature of the channels and CCI. The average channel gain is primarily determined by large-scale path-loss factors such as propagation loss and shadowing. The instantaneous channel gain is also affected by small-scale fading factors such as multipath fading. Because the available bandwidth is limited, the channels are reused for different transmissions. This channel reuse increases the network capacity per area, but, on the other hand, it causes CCI. Because of these effects, the signal-to-interference-plus-noise ratio (SINR) at a receiver output can fluctuate on the order of tens of decibels. Power control is an effective resource-allocation method to combat these detrimental effects. The transmitted power is adjusted according to the channel condition so as to maintain the received signal quality. Power control is no longer a single user's problem, because one user's transmit power appears as interference to the other users.
The objective of power control in wireless networks is to control the transmit power to guarantee a certain link quality and to reduce CCI.
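Concretely, using standard notation (not defined in this chapter): if G_ij denotes the gain from transmitter j to receiver i, p_j the transmit power of link j, and n_i the receiver noise power, then the quality of link i is measured by its SINR,

    SINR_i = G_ii p_i / ( Σ_{j≠i} G_ij p_j + n_i ),

and power control seeks the smallest powers p_1, …, p_n such that SINR_i ≥ γ_i for every link i, where γ_i is the link's quality target.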
Recently, cooperative communications have gained attention as an emerging transmit strategy for future wireless networks. Cooperative communications efficiently take advantage of the broadcasting nature of wireless networks. The basic idea is that users or nodes in a wireless network share their information and transmit cooperatively as a virtual antenna array, thus providing diversity that can significantly improve system performance. In cooperative transmission, relays are assigned to help a sender in forwarding its information to its receiver. Thus the receiver gets several copies of the same information via independent channels.
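As an example of the resulting diversity gain, in AF relaying with maximal-ratio combining at the destination (a standard result, not derived in this chapter), if γ_s,d, γ_s,r, and γ_r,d denote the received SNRs of the source–destination, source–relay, and relay–destination links, the combined output SNR is

    γ = γ_s,d + γ_s,r γ_r,d / ( γ_s,r + γ_r,d + 1 ),

so the destination benefits from the relayed copy whenever both relay links are good.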
The pioneering work on cooperative transmission can be found, e.g., in [53], where a general information-theoretic framework for relaying channels is established. In [272, 273], a CDMA-based two-user cooperative modulation scheme was proposed. The main idea is to allow each user to retransmit estimates of its partner's received information such that each user's information is transmitted to the receiver at the highest possible rate. This work was extended in [175], where the outage and ergodic capacity behavior of various cooperative protocols, e.g., the decode-and-forward (DF) and amplify-and-forward (AF) cooperative protocols, are analyzed for a three-user case under quasi-static fading channels. In [298], the authors provided a symbol-error-rate (SER) performance analysis and optimum power allocation for DF cooperative systems in a narrowband Rayleigh fading environment. The work in [138] analyzes schemes based on the same channel without fading, but with more complicated transmitter cooperation involving dirty-paper coding. In [199], a cooperative broadcast strategy was proposed with the objective of maximizing the network lifetime.
Over the past few decades, increasing demands from military, national security, and commercial customers have been driving the large-scale deployment of ad hoc networks, sensor networks, and personal area networks. Unlike cellular networks or WiMAX, these wireless networks have no sophisticated infrastructure such as base stations. In these scenarios, the mobile users have to set up the network functionality on their own. For example, an individual sensor can sense its immediate environment, process what it senses, communicate its results to others over a wireless link, and possibly take an action in response. Although a single sensor has very limited use, a network of sensors can be used to manage large environments and systems.
These types of wireless networks contain new types of computing machines, run different kinds of network applications, execute in different physical environments, and possess large numbers of nodes. Moreover, there are other challenges such as battery life, maximum power constraints, interference, limited bandwidth, and connectivity for the different applications of such networks. Significant scientific and technical progress is required to realize the potential of these networks. In short, building such networks requires overcoming many challenges, especially from an optimization design point of view. In this chapter, we study three examples for ad hoc networks, sensor networks, and ultrawide-band (UWB) networks, respectively.
Because different users have different channels and locations at different times, resource allocation can take advantage of time, frequency, multiuser, and spatial diversity. Specifically for spatial diversity, transceivers employ antenna arrays and adjust their beam patterns such that they have good channel gain toward the desired directions, whereas the aggregate interference power is minimized at their output. Antenna-array processing techniques such as beam forming can be applied to receive and transmit multiple signals that are separated in space. Hence multiple cochannel users can be supported in each cell to increase the capacity by exploiting spatial diversity. Many works have been reported in the literature. Traditional beam formers such as minimum-mean-square-error (MMSE) and minimum-variance-distortionless-response (MVDR) methods have been commonly employed [130]. Many joint power-control and beam-forming algorithms have been proposed in [183, 203, 248, 250]. The application of antenna arrays has been proposed in [214] to increase the network capacity in CDMA systems.
In this chapter, we first consider a resource-allocation example that jointly considers antenna-array processing. We consider a system with beam-forming capabilities at the receiver. An iterative algorithm is presented that jointly updates the transmission powers and the beam-former weights, converging to the jointly optimal beam-forming and transmission power vector. The algorithm is distributed and uses only local interference measurements. In an uplink transmission scenario, it is shown how base-station assignment can be incorporated in addition to beam forming and power control such that a globally optimal solution is obtained. The network capacity and the saving in mobile power are then evaluated through a numerical study.
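The alternating structure of such an algorithm can be sketched as follows (Python/NumPy; the channels, noise level, and SINR targets are synthetic, and this compresses the idea to its two alternating steps rather than reproducing the cited algorithm exactly):

    # Sketch of iterative joint power control and receive beam forming
    # (illustrative, synthetic data). Step 1: fix powers, update each
    # beam former against the interference-plus-noise covariance.
    # Step 2: fix beam formers, scale each power toward its SINR target.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_antennas = 3, 4
    H = rng.normal(size=(n_users, n_antennas)) + 1j * rng.normal(size=(n_users, n_antennas))
    noise, targets = 0.1, np.array([1.0, 1.0, 1.0])
    p = np.ones(n_users)

    for _ in range(50):
        sinr = np.empty(n_users)
        for i in range(n_users):
            # Interference-plus-noise covariance seen by user i.
            R = noise * np.eye(n_antennas, dtype=complex)
            for j in range(n_users):
                if j != i:
                    R += p[j] * np.outer(H[j], H[j].conj())
            w = np.linalg.solve(R, H[i])          # MVDR-type weights (up to scaling)
            num = p[i] * abs(w.conj() @ H[i]) ** 2
            den = (w.conj() @ R @ w).real
            sinr[i] = num / den
        p = p * targets / sinr                    # standard power-control update

    print(np.round(sinr, 2))   # SINRs approach the targets if they are feasible

When the SINR targets are jointly feasible, iterations of this alternating form are known to converge, and each update needs only quantities measurable locally at the link.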
Over the past few decades, wireless communications and networking have witnessed an unprecedented growth and have become pervasive much sooner than anyone could have imagined. In wireless communication systems, two important detrimental effects that decrease network performance are the channel's time-varying nature and CCI. Because of effects such as multipath fading, shadowing, path loss, propagation delay, and noise level, the SINR at a receiver output can fluctuate on the order of tens of decibels. The other major challenge for the system design is the limited available RF spectrum. Channel reuse is a common method used to increase the wireless system capacity by reusing the same channel beyond some distance. However, this introduces CCI that degrades the link quality.
A general strategy to combat these detrimental effects is the dynamic allocation of resources such as transmitted power, modulation rates, channel assignment, and scheduling based on the channel conditions. Power control is one direct approach toward minimizing CCI. The transmit powers are constantly adjusted: they are increased if the SINRs at the receivers are low and decreased if the SINRs are high. Such a process improves the quality of weak links and reduces unnecessary transmit power. In [2, 92], centralized power-control schemes are proposed to balance the carrier-to-interference ratio (CIR) or maximize the minimum CIR over all links. Those algorithms need global information about all link gains and powers. Distributed power-control algorithms that use only local measurements of SINR are presented in [209, 352].
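A representative distributed update of this kind (in the style of the distributed algorithms cited above; the gain matrix, noise, and targets below are synthetic) multiplies each link's power by the ratio of its SINR target to its measured SINR:

    # Sketch of distributed power control: each link i repeatedly sets
    # p_i <- (target_i / measured_sinr_i) * p_i, using only its own
    # SINR measurement. Gains, noise, and targets are synthetic.

    import numpy as np

    G = np.array([[1.0, 0.1, 0.2],    # G[i][j]: gain from transmitter j
                  [0.2, 1.0, 0.1],    # to receiver i
                  [0.1, 0.2, 1.0]])
    noise = np.full(3, 0.01)
    target = np.array([2.0, 2.0, 2.0])
    p = np.ones(3)

    for _ in range(100):
        interference = G @ p - np.diag(G) * p + noise
        sinr = np.diag(G) * p / interference
        p = p * target / sinr          # purely local update per link

    print(np.round(sinr, 3))           # converges to the targets if feasible

Each link needs only its own measured SINR and target; when the targets are jointly feasible, the powers converge to the minimal power assignment that meets them.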