This paper introduces an endofunctor V_T on the category of frames, parametrised by an endofunctor T on the category Set satisfying certain constraints. This generalises Johnstone's construction of the Vietoris powerlocale in the sense that his construction is obtained by taking for T the finite covariant power set functor. Our construction of the T-powerlocale V_T out of a frame is based on ideas from coalgebraic logic and makes explicit the connection between the Vietoris construction and Moss's coalgebraic cover modality.
We show how to extend certain natural transformations between set functors to natural transformations between T-powerlocale functors. Finally, we prove that the operation V_T preserves some properties of frames, such as regularity, zero-dimensionality, and the combination of zero-dimensionality and compactness.
Since the early days of computer systems, the demand for more performance has been the most important driving force for evolution in computer architecture. In particular, many important applications demand more performance than a single (serial) processor core can provide, and they have historically pushed parallel architecture technology forward. A good example is the numerical programs used in computer simulation to analyze and solve problems in science and engineering, such as climate modeling, weather forecasting, or computer-aided design. Another is commercial systems in which a large pool of independent queries must be executed to meet the growing demands of the information age. Over the years, another driving force for parallel architectures has been the fear of impending technological barriers that would eventually stall the performance growth of serial computers. From the beginning, these two forces have fueled a keen interest in multiprocessor architecture research. While scientific computing needs made these research efforts relevant early on in the marketplace, multiprocessor technology hit the mainstream with the shift to multi-core computers at the beginning of the twenty-first century.
This chapter is devoted to design principles of multiprocessor systems. It focuses on two multiprocessor architectural styles: shared-memory and message-passing multiprocessor systems. Both styles use multiple processors with the goal of achieving a linear speedup of computational power with the number of processors. However, they differ in the method by which the processors exchange data. Processors in shared-memory multiprocessors share the same address space and can exchange data through shared-memory locations by regular load and store instructions.
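To make the shared-memory exchange concrete, the short sketch below is an illustration added here, not taken from the chapter: Python's threading module stands in for hardware threads, and the variable names are made up. One thread stores a value to a shared location and another thread loads it, the software analogue of exchanging data through ordinary load and store instructions.

# Illustrative sketch (not from the chapter): two threads exchange a value
# through a shared location, as shared-memory processors do with loads/stores.
import threading

payload = None               # shared location holding the data
ready = threading.Event()    # stands in for a synchronization flag

def producer():
    global payload
    payload = 42             # "store" to the shared location
    ready.set()              # signal the consumer that the data is ready

def consumer():
    ready.wait()             # wait until the producer's store is visible
    print("consumer loaded", payload)   # "load" from the shared location

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()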
This chapter is dedicated to the correct and reliable communication of values in shared-memory multiprocessors. Relevant correctness properties of the memory system of shared-memory multiprocessors include coherence, the memory consistency model (henceforth also referred to as the memory model), and the reliable execution of synchronization primitives. Since chip multiprocessors are designed as shared-memory multi-core systems, this chapter targets correctness issues not only in symmetric multiprocessors (SMPs) or large-scale cache coherent distributed shared-memory systems (cc-NUMAs and COMAs) covered in Chapter 5, but also in chip multiprocessors with core multi-threading (CMPs) covered in Chapter 8.
The correctness of a shared-memory multi-threaded program must be independent of the relative execution speed of its threads, because of the numerous unpredictable events that can disrupt the execution of any one thread, such as DVFS (dynamic voltage and frequency scaling), thermal emergencies, conflicts for hardware and software resources, interrupts, exceptions, kernel activity, thread scheduling, data allocation delays, and interactions with other running programs. If a multi-threaded program is written for a dedicated machine in which timing is highly predictable, and it takes timing into account for its correctness (as is possible, for example, in real-time systems), many conclusions of this chapter should be revised. In other words, the target software throughout this chapter is portable shared-memory multi-threaded programs written for general-purpose or multi-purpose machines, including the operating system kernel.
In this chapter we sharpen our focus on thread-level parallelism within a single die. Parallelism within a die comes in different forms. Within a single core, multiple threads can be executed to improve resource utilization, an approach called core multi-threading. There are three approaches to core multi-threading, depending on how and when instructions are fetched from multiple ready threads: block multi-threading, interleaved multi-threading, and simultaneous multi-threading. We show the hardware additions and modifications necessary for each of these three multi-threading approaches to work within the contexts of traditional (single-threaded) in-order and out-of-order processors. We use example-driven approaches to show the performance advantages of finer-grain multi-threading over coarse-grain multi-threading. These performance advantages come at additional hardware cost.
The next paradigm to provide on-die parallelism is exploiting multiple cores on the same chip. Chip multiprocessors (CMPs) are fast becoming ubiquitous in all walks of computing, from cell phones to datacenter servers. We explain the fundamental advantages of CMPs over traditional shared-memory multiprocessors (SMPs), which stem mostly from the fact that all cores are tightly integrated on a single die by on-die interconnects. We describe three on-die interconnect topologies commonly used today for building CMPs. When all cores on a CMP are identical, the CMP is said to be homogeneous. The cores in heterogeneous CMPs differ in their capabilities. We describe various heterogeneous CMP designs and the range of performance and functionality they make possible.
Given the widening gaps between processor speed, main memory (DRAM) speed, and secondary memory (disk) speed, it has become more and more difficult in recent years to feed data and instructions at the speed required by the processor while providing the ever-expanding memory space expected by modern applications. Modern systems rely on a memory hierarchy based on speed, size, and cost, as illustrated in Figure 4.1. Left of the dotted line is the cache hierarchy. Right of the dotted line is the virtual memory hierarchy, which may include a disk cache (not shown).
It has been observed over the years that the speed gap between the processor (clocked at multiple gigahertz and executing multiple instructions per clock) and main memory (with access times in the tens or even hundreds of nanoseconds) is growing exponentially. This problem is commonly referred to as the memory wall. A hierarchy of multiple levels of caches with various sizes and access times is employed to bridge the speed gap. Moreover, caches at every level are becoming more and more complex to help reduce or hide the latency of cache misses. To support out-of-order (OoO), dynamically scheduled processors, which may have more than ten memory accesses pending at any time, modern, lockup-free (non-blocking) caches are capable of handling multiple cache hits and misses at a time. Furthermore, data and instructions are prefetched into caches before they are needed. In this chapter we describe these enhancements to cache designs.
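As a rough illustration of how the hierarchy bridges the gap (the symbols and numbers below are illustrative, not taken from the chapter), the average memory access time (AMAT) of a two-level cache hierarchy is
\[\text{AMAT} = t_{L1} + m_{L1}\left(t_{L2} + m_{L2}\,t_{mem}\right),\]
where t denotes access times, m denotes local miss rates, and the subscripts name the two cache levels and main memory. With a 1-cycle L1, a 5% L1 miss rate, a 10-cycle L2, a 20% L2 miss rate, and a 200-cycle memory, AMAT is 1 + 0.05 × (10 + 0.2 × 200) = 3.5 cycles, far below the raw 200-cycle memory latency.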
For the past 20 years we have lived through the information revolution, powered by the explosive growth of semiconductor integration and of the internet. The exponential performance improvement of semiconductor devices was predicted by Moore's law as early as the 1960s. There are several formulations of Moore's law. One of them is directed at the computing power of microprocessors: Moore's law predicts that the computing power of microprocessors will double every 18–24 months at constant cost, so that their cost-effectiveness (the ratio between performance and cost) grows at an exponential rate. It has been observed that the computing power of entire systems also grows at the same pace. This law has endured the test of time and remains valid today. It will keep being tested, now and in the future, as many people today see strong evidence that the “end of the ride” is near, mostly because the miniaturization of CMOS technology is fast reaching its limit, the so-called CMOS endpoint.
Besides semiconductor technology, improved chip designs have also fueled the phenomenal performance growth of microprocessors over the years. Historically, with each new process generation, the logic switching speed and the amount of on-chip logic have both increased dramatically. Faster switching speeds lead to higher clock rates. Aggressive chip designs also contribute to higher clock rates by improving the design of circuits or by pipelining the steps in the execution of an instruction.
Computer architecture is a fast-evolving field, mostly because it is driven by rapidly changing technologies. We have all become accustomed to phenomenal improvements in the speed and reliability of computing systems since the mid-1990s, mostly due to technology improvements, faster clock rates, and deeper pipelines. These improvements have had a deep impact on society by bringing high-performance computing to the masses, by enabling the internet revolution, and by fostering huge productivity gains in all human activities. We are in the midst of an information revolution of the same caliber as the industrial revolution of the eighteenth century, and few would deny that this revolution has been fueled by advances in technology and microprocessor architecture.
Unfortunately, these rapid improvements in computing systems may not be sustainable in the future. Pipeline depths have reached their useful limit, and frequency cannot be cranked up forever because of power constraints. As technology evolves and on-chip feature sizes shrink, reliability, complexity, and power/energy issues have become prime considerations in computer design, besides traditional measures such as cost, area, and performance. These trends have ushered in a renaissance of parallel processing and parallel architectures, because they offer a clear path – some would say the only path – to solving all current and foreseeable problems in architecture. A widespread belief today is that, unless we can unleash and harness the power of parallel processing, the computing landscape will soon look very different, and this dramatic change will have profound societal impacts.
Modern computer systems are becoming increasingly complex as more devices and functionalities are integrated. Throughout the entire design cycle of a system, simulation is a crucial tool for computer architecture researchers to evaluate novel ideas and explore the design space. Compared with hardware prototyping and analytic modeling, simulation strikes a better balance between accuracy, cost, flexibility, and complexity. As the design complexity of state-of-the-art microprocessors keeps growing and manufacturing costs skyrocket, computer architecture simulation has become critical.
Simulations are pervasive in computer architecture research and design and affect the productivity of these activities to a great extent. Productivity is impacted at two levels: (1) the time and effort spent on developing the simulator, and (2) the time consumed running simulations with representative benchmarks. The dramatic growth of the integration density of microprocessor chips provides computer architects with abundant on-chip real estate to enhance computing power with more complex architectural designs. In addition, power and reliability have turned into critical design constraints. Building a simulation infrastructure that allows a designer to consider performance, power, and reliability in a single unified framework leads to significant cost and delays in simulator development. Another direct consequence of a complex infrastructure is that simulation itself slows down, increasing the turnaround time of each design space exploration. Simulation slowdown is becoming particularly acute as computer architecture moves into the chip multiprocessor (CMP) era. The current approach of simulating CMPs with growing numbers of cores in a single thread is not scalable and cannot be sustained over time.
The processor and its instruction set are the fundamental components of any architecture because they drive its functionality. In some sense the processor is the “brain” of a computer system, and therefore understanding how processors work is essential to understanding the workings of a multiprocessor.
This chapter first covers instruction sets, including exceptions. Exceptions, which can be seen as a software extension to the processor instruction set, are an integral component of the instruction set architecture definition and must be adhered to. They impose constraints on processor architecture. Without the need to support exceptions, processors and multiprocessors could be much more efficient but would forgo the flexibility and convenience provided by software extensions to the instruction set in various contexts. A basic instruction set is used throughout the book. This instruction set is broadly inspired by the MIPS instruction set, a rather simple instruction set. We adopt the MIPS instruction set because the fundamental concepts of processor organizations are easier to explain and grasp with simple instruction sets. However, we also explain extensions required for more complex instruction sets, such as the Intel x86, as need arises.
Since this book is about parallel architectures, we do not cover architectures that execute instructions one at a time. Thus the starting point is the 5-stage pipeline, which concurrently processes up to five instructions in every clock cycle. The 5-stage pipeline is a static pipeline in the sense that the order of instruction execution (or the schedule of instruction execution) is dictated by the compiler, an order commonly referred to as the program, thread, or process order, and the hardware makes no attempt to re-order the execution of instructions dynamically.
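As an illustration of this overlap (a sketch using the classic IF/ID/EX/MEM/WB stage names and assuming no stalls; these simplifications are ours, not the book's), the following prints the stage occupied by each instruction in each cycle of a static 5-stage pipeline:

# Illustrative sketch (assumptions: classic IF/ID/EX/MEM/WB stages, no stalls):
# show how a static 5-stage pipeline overlaps up to five instructions per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(num_instructions):
    total_cycles = num_instructions + len(STAGES) - 1
    for cycle in range(total_cycles):
        row = []
        for i in range(num_instructions):
            stage = cycle - i        # instruction i enters IF in cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "--")
        print(f"cycle {cycle + 1:2d}: " +
              "  ".join(f"I{i + 1}:{s:>3}" for i, s in enumerate(row)))

pipeline_schedule(5)   # in cycle 5, five instructions occupy all five stages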
Technology has always played the most important role in the evolution of computer architecture over time and will continue to do so for the foreseeable future. Technological evolution has fostered rapid innovations in chip architecture. We give three examples, motivated by performance, power, and reliability. In the past, architectural designs were dictated by performance/cost tradeoffs. Several well-known architectural discoveries resulted from the uneven progress of different technological parameters. For instance, caches were invented during the era when processor speed grew much faster than main memory speed. Recently, power has become a primary design constraint. Since the invention of the microprocessor, the amount of chip real estate has soared relentlessly, enabling an exponential rise of clock frequencies and ever more complex hardware designs. However, as the supply voltage approached its lower limit and power consumption became a primary concern, chip architecture shifted from high-frequency uniprocessor designs to chip multiprocessor architectures in order to contain power growth. This shift from uniprocessor to multiprocessor microarchitectures is a disruptive event caused by the evolution of technology. Finally, for decades processor reliability was a concern primarily for high-end server systems. As transistor feature sizes have shrunk over time, transistors have become more susceptible to transient faults. Hence radiation-hardened architectures have been developed to protect computer systems from single-event upsets causing soft errors.
These examples of the impact of technology on computer design demonstrate that it is critical for a reader of this book to understand the basic technological parameters and features, and their scaling with each process generation.
Interconnection networks are an important component of every computer system. Central to the design of a high-performance parallel computer is the elimination of serializing bottlenecks that can cripple the exploitation of parallelism at any level. Instruction-level and thread-level parallelism across processor cores demands a memory system that can feed the processor with instructions and data at high speed through deep cache memory hierarchies. However, even with a modest miss rate of one percent and a 100-cycle miss penalty, half of the execution time can be spent bringing instructions and data from memory to the processors. It is imperative to keep the latency of moving instructions and data between main memory and the cache hierarchy short.
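To see where the one-half figure comes from, assume (for illustration only; these assumptions are not stated here in the text) one useful cycle per instruction and one memory reference per instruction. The stall time per instruction is then
\[0.01 \times 100 = 1 \text{ cycle},\]
which equals the useful cycle per instruction, so memory stalls account for half of the total execution time.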
It is also important that memory bandwidth be sufficient. If the memory bandwidth is not sufficient, contention among memory requests elongates the memory-access latency, which, in turn, may affect instruction execution time and throughput. For example, consider a nonblocking cache that has N outstanding misses. If the bus connecting the cache to memory can only transfer one block every T cycles, it takes N × T cycles to service the N misses as opposed to T cycles if the bus can transfer N blocks in parallel.
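The contrast can also be written as a small calculation (an illustrative sketch; N and T are the quantities named above, and the helper name is made up here):

# Illustrative sketch: cycles needed to service N outstanding misses when the
# bus moves one block every T cycles, versus moving several blocks in parallel.
def service_time(n_misses, cycles_per_block, parallel_blocks=1):
    rounds = -(-n_misses // parallel_blocks)   # ceiling division
    return rounds * cycles_per_block

N, T = 8, 20
print(service_time(N, T))                      # serialized bus: N * T = 160 cycles
print(service_time(N, T, parallel_blocks=N))   # fully parallel: T = 20 cycles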
The role of interconnection networks is to transfer information between computer components in general, and between memory and processors in particular. This is important for all parallel computers, whether they are on a single processor chip – a chip multiprocessor or multi-core – or built from multiple processor chips connected to form a large-scale parallel computer.
Any function F: {0, …, N − 1} → {−1, 1} such that F(x) can be computed from the binary digits of x using a bounded depth circuit is orthogonal to the Möbius function μ in the sense that
\[\frac{1}{N} \sum_{0 \leq x \leq N-1} \mu(x)F(x) \to 0 \quad\text{as } N \to \infty.\]
The proof combines a result of Linial, Mansour and Nisan with techniques of Kátai and Harman, used in their work on finding primes with specified digits.
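As a purely numerical illustration of the statement (not part of the paper), the sketch below computes the normalized sum for a sample F that depends only on the two lowest binary digits of x, and is therefore trivially computable by a bounded depth circuit; the average visibly shrinks as N grows.

# Illustrative sketch: numerically check the decay of (1/N) * sum mu(x)*F(x)
# for a sample F depending on boundedly many binary digits of x.
def mobius_sieve(limit):
    """Return mu(0..limit-1) via a linear sieve (mu(0) set to 0 by convention)."""
    mu = [0] * limit
    if limit > 1:
        mu[1] = 1
        primes = []
        is_comp = [False] * limit
        for i in range(2, limit):
            if not is_comp[i]:
                primes.append(i)
                mu[i] = -1
            for p in primes:
                if i * p >= limit:
                    break
                is_comp[i * p] = True
                if i % p == 0:
                    mu[i * p] = 0
                    break
                mu[i * p] = -mu[i]
    return mu

def F(x):
    # depends only on the two lowest binary digits of x
    return -1 if (x & 0b11) == 0b11 else 1

for N in (10**3, 10**4, 10**5, 10**6):
    mu = mobius_sieve(N)
    avg = sum(mu[x] * F(x) for x in range(N)) / N
    print(N, avg)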
Bionic engineering has been a focus of robotics research. Inverse dynamic analysis is significant for the determination of the dynamic parameters of bionic robots. The present paper uses a newly developed robot module named Sambot to construct a caterpillar robotic mechanism and designs a trapezoidal-wave locomotion gait for it. Two open-link models are put forth to simulate the dynamic behavior of such a locomotion. The inverse dynamic differential equations are derived by Kane's method and are then solved numerically by the fourth-order Runge–Kutta method. Based on the numerical solutions of these differential equations, the applied joint torques required to produce the harmonic trapezoidal-wave locomotion are determined, providing important information on the gait control of the caterpillar robotic mechanism. Finally, the theoretical values of the joint torques are applied to the present caterpillar mechanism to perform a locomotion experiment, which verifies the effectiveness of the present dynamics analysis.
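The classical fourth-order Runge–Kutta step used for such integration has the generic form sketched below (an illustration, not code from the paper; the oscillator right-hand side is a placeholder for the Kane-derived inverse-dynamics equations, which are not reproduced here).

# Generic classical fourth-order Runge-Kutta step; the right-hand side f(t, y)
# stands in for the Kane-derived equations (placeholder dynamics used here).
import numpy as np

def rk4_step(f, t, y, h):
    """Advance y(t) by one step of size h using classical RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def oscillator(t, y):
    # placeholder dynamics: y = [angle, angular velocity]
    return np.array([y[1], -y[0]])

y, t, h = np.array([1.0, 0.0]), 0.0, 0.01
for _ in range(100):          # integrate one second of motion
    y = rk4_step(oscillator, t, y, h)
    t += h
print(t, y)                   # angle/velocity after 1 s, close to cos(1), -sin(1)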