5 - Multiprocessor systems
- Michel Dubois, University of Southern California; Murali Annavaram, University of Southern California; Per Stenström, Chalmers University of Technology, Gothenburg
- Book: Parallel Computer Organization and Design
- Published online: 05 November 2012
- Print publication: 30 August 2012, pp 232-308
Summary
CHAPTER OVERVIEW
Ever since the beginning of the history of computer systems, the demand for more performance has been the most important driving force for evolution in computer architecture. In particular, many important applications demand more performance than a single (serial) processor core can provide, and historically they have pushed parallel architecture technology. A good example is numerical programs used in computer simulation to analyze and solve problems in science and engineering, such as climate modeling, weather forecasting, or computer-aided design. Another example is commercial systems in which a large pool of independent queries must be executed to meet the growing demands of the information age. Over the years, another driving force for parallel architectures has been the fear of impending technological barriers that would eventually stall the performance growth of serial computers. From the beginning, these two forces have fueled a keen interest in multiprocessor architecture research. While scientific computing needs made these research efforts relevant early on in the marketplace, multiprocessor technology hit the mainstream with the shift to multi-core computers at the beginning of the twenty-first century.
This chapter is devoted to design principles of multiprocessor systems. It focuses on two multiprocessor architectural styles: shared-memory and message-passing multiprocessor systems. Both styles use multiple processors with the goal of achieving a linear speedup of computational power with the number of processors. However, they differ in the method by which the processors exchange data. Processors in shared-memory multiprocessors share the same address space and can exchange data through shared-memory locations by regular load and store instructions.
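The contrast between the two styles can be sketched in a few lines of code. The sketch below is illustrative and not taken from the book; it uses Python threads, with a lock-guarded shared variable standing in for load/store access to a shared address, and a queue standing in for an explicit message channel.

```python
import queue
import threading

# Shared-memory style: threads exchange data through a shared
# location (here a lock-guarded dictionary entry), the analog of
# load/store instructions to a shared address.
shared = {"value": 0}
lock = threading.Lock()

def producer_shared():
    with lock:
        shared["value"] = 42            # "store" to the shared location

def consumer_shared(result):
    with lock:
        result.append(shared["value"])  # "load" from the shared location

# Message-passing style: no shared location; data moves only through
# explicit send/receive operations on a channel.
channel = queue.Queue()

def producer_msg():
    channel.put(42)                     # explicit send

def consumer_msg(result):
    result.append(channel.get())        # explicit receive

for produce, consume in ((producer_shared, consumer_shared),
                         (producer_msg, consumer_msg)):
    out = []
    p = threading.Thread(target=produce)
    c = threading.Thread(target=consume, args=(out,))
    p.start(); p.join()                 # run the producer to completion first
    c.start(); c.join()
    print(out)                          # [42] in both styles
```

In the shared-memory version both threads name the same location; in the message-passing version data moves only because the sender explicitly transmits it.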
Contents
- pp vii-x
7 - Coherence, synchronization, and memory consistency
- pp 342-424
Summary
CHAPTER OVERVIEW
This chapter is dedicated to the correct and reliable communication of values in shared-memory multiprocessors. Relevant correctness properties of the memory system of shared-memory multiprocessors include coherence, the memory consistency model (henceforth also referred to as the memory model), and the reliable execution of synchronization primitives. Since chip multiprocessors are designed as shared-memory multi-core systems, this chapter targets correctness issues not only in symmetric multiprocessors (SMPs) or large-scale cache coherent distributed shared-memory systems (cc-NUMAs and COMAs) covered in Chapter 5, but also in chip multiprocessors with core multi-threading (CMPs) covered in Chapter 8.
The correctness of a shared-memory multi-threaded program must be independent of the relative execution speed of its threads, because of the numerous unpredictable events that can disrupt the execution of any one thread, such as DVFS (dynamic voltage and frequency scaling), thermal emergencies, conflicts for hardware and software resources, interrupts, exceptions, kernel activity, thread scheduling, data allocation delays, and interactions with other running programs. If a multi-threaded program is written for a dedicated machine in which timing is highly predictable and the program is written in a way that takes timing into account for its correctness (such as, possibly, in real-time systems), many conclusions of this chapter should be revised. In other words, the target software throughout this chapter is portable shared-memory multi-threaded programs written for general-purpose or multi-purpose machines and includes the operating system kernel.
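The requirement that correctness be independent of relative thread speed can be made concrete with a small sketch (an illustration, not code from the book): a producer/consumer handoff that blocks on a real synchronization primitive instead of relying on timing. The names `producer` and `consumer` are hypothetical.

```python
import threading

# A producer/consumer handoff whose outcome does not depend on which
# thread runs first or how fast it runs: the consumer blocks on an
# Event rather than, say, sleeping for a "long enough" interval.
data = []
ready = threading.Event()

def producer():
    data.append("payload")   # write the data first...
    ready.set()              # ...then publish the flag

def consumer(result):
    ready.wait()             # blocks until the flag is published
    result.append(data[0])   # guaranteed to see the producer's write

result = []
c = threading.Thread(target=consumer, args=(result,))
p = threading.Thread(target=producer)
c.start()                    # start the consumer first on purpose:
p.start()                    # correctness must not depend on the order
c.join(); p.join()
print(result)                # ['payload'] regardless of thread speeds
```

Had the consumer used a timed delay instead of `ready.wait()`, any of the disruptions listed above (DVFS, interrupts, scheduling) could make it read stale data.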
8 - Chip multiprocessors
- pp 425-487
Summary
CHAPTER OVERVIEW
In this chapter we sharpen our focus on thread-level parallelism within a single die. Parallelism within a die comes in different forms. Within a single core, multiple threads can be executed to improve resource utilization, an approach called core multi-threading. There are three approaches to core multi-threading, depending on how and when instructions are fetched from multiple ready threads: block multi-threading, interleaved multi-threading, and simultaneous multi-threading. We show the hardware additions and modifications necessary for each of these three multi-threading approaches to work within the context of traditional (single-threaded) in-order and out-of-order processors. We use example-driven approaches to show the performance advantages of finer-grain multi-threading over coarse-grain multi-threading. These performance advantages come at additional hardware cost.
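A back-of-the-envelope utilization model (an illustration, not a formula from the book) shows why interleaving more ready threads hides memory stalls:

```python
def utilization(compute, stall, threads):
    """Fraction of cycles a core does useful work when `threads` ready
    threads are interleaved. Each thread alternates `compute` busy
    cycles with `stall` memory-stall cycles; one thread's stall can be
    hidden behind the compute of the others."""
    period = compute + stall               # one thread's repeating pattern
    busy = min(threads * compute, period)  # useful cycles per period
    return busy / period

# One thread doing 10 busy cycles, then a 40-cycle stall:
print(utilization(10, 40, 1))  # 0.2 -> the core idles 80% of the time
print(utilization(10, 40, 4))  # 0.8 -> four threads hide most stalls
print(utilization(10, 40, 5))  # 1.0 -> enough threads for full utilization
```

The model also shows diminishing returns: beyond five threads in this example, extra thread contexts add hardware cost but no further utilization.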
The next paradigm to provide on-die parallelism is exploiting multiple cores on the same chip. Chip multiprocessors (CMPs) are fast becoming ubiquitous in all walks of computing, from cell phones to datacenter servers. We explain the fundamental advantages of CMPs over traditional shared-memory multiprocessors (SMPs), mostly borne of the fact that all cores are tightly integrated on a single die by on-die interconnects. We describe three on-die interconnect topologies common today for building CMPs. When all cores on a CMP are identical, the CMP is said to be homogeneous. The cores in heterogeneous CMPs differ in their capabilities. We describe various heterogeneous CMP designs and the gamut of performance and functionality they make possible.
4 - Memory hierarchies
- pp 193-231
Summary
CHAPTER OVERVIEW
Given the widening gaps between processor speed, main memory (DRAM) speed, and secondary memory (disk) speed, it has become more and more difficult in recent years to feed data and instructions at the speed required by the processor while providing the ever-expanding memory space expected by modern applications. Modern systems rely on a memory hierarchy based on speed, size, and cost, as illustrated in Figure 4.1. Left of the dotted line is the cache hierarchy. Right of the dotted line is the virtual memory hierarchy, which may include a disk cache (not shown).
It has been observed over the years that the speed gap between the processor (clocked at multiple gigahertz and executing multiple instructions per clock) and main memory (with access times in the tens or even hundreds of nanoseconds) is growing exponentially. This problem is commonly referred to as the memory wall. A hierarchy of multiple levels of caches with various sizes and access times is employed to bridge the speed gap. Moreover, caches at every level are becoming more and more complex to help reduce or hide the latency of cache misses. To support OoO (out-of-order) dynamically scheduled processors, which may have more than ten memory accesses pending at any time, modern lockup-free (non-blocking) caches are capable of handling multiple cache hits and misses at a time. Furthermore, data and instructions are prefetched into caches before they are needed. In this chapter we describe these enhancements to cache designs.
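The benefit of a cache hierarchy can be quantified with the standard average-memory-access-time (AMAT) recurrence; the numbers below are illustrative, not taken from the book.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time (cycles) for one cache level."""
    return hit_time + miss_rate * miss_penalty

# L2 backed by a 100-cycle main-memory access:
l2 = amat(hit_time=10, miss_rate=0.2, miss_penalty=100)  # 30.0 cycles
# L1 backed by the L2's *average* access time:
l1 = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2)   # 2.5 cycles
print(l1)  # 2.5 -> the hierarchy makes memory look nearly L1-fast
```

Even though main memory is 100 cycles away, the two-level hierarchy brings the average access down to a few cycles, which is exactly how the speed gap is bridged.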
Parallel Computer Organization and Design
- Michel Dubois, Murali Annavaram, Per Stenström
- Textbook
Teaching fundamental design concepts and the challenges of emerging technology, this textbook prepares students for a career designing the computer systems of the future. In-depth coverage of complexity, power, reliability and performance, coupled with treatment of parallelism at all levels, including ILP and TLP, provides the state-of-the-art training that students need. The whole gamut of parallel architecture design options is explained, from core microarchitecture to chip multiprocessors to large-scale multiprocessor systems. All the chapters are self-contained, yet concise enough that the material can be taught in a single semester, making it perfect for use in senior undergraduate and graduate computer architecture courses. The book is also teeming with practical examples to aid the learning process, showing concrete applications of definitions. With simple models and codes used throughout, all material is made open to a broad range of computer engineering/science students with only a basic knowledge of hardware and software.
Index
- pp 521-542
1 - Introduction
- pp 1-35
Summary
For the past 20 years we have lived through the information revolution, powered by the explosive growth of semiconductor integration and of the internet. The exponential performance improvement of semiconductor devices was predicted by Moore's law as early as the 1960s. There are several formulations of Moore's law; one of them is directed at the computing power of microprocessors. It predicts that the computing power of microprocessors will double every 18–24 months at constant cost, so that their cost-effectiveness (the ratio between performance and cost) grows at an exponential rate. It has been observed that the computing power of entire systems also grows at the same pace. The law has endured the test of time and still remains valid today, but it will be tested repeatedly, both now and in the future, as many people today see strong evidence that the "end of the ride" is near, mostly because the miniaturization of CMOS technology is fast reaching its limit, the so-called CMOS endpoint.
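The force of the exponential is easy to make concrete; the sketch below simply evaluates the doubling rule stated above, and the ten-year horizon is an illustrative choice, not a figure from the book.

```python
def growth(years, doubling_months):
    """Performance multiplier after `years` when performance doubles
    every `doubling_months` months at constant cost."""
    return 2 ** (years * 12 / doubling_months)

print(growth(10, 24))         # 32.0 -> doubling every two years
print(round(growth(10, 18)))  # 102  -> doubling every 18 months
```

Over a single decade the two ends of the 18–24-month range differ by more than a factor of three, which is why the exact doubling period matters so much in long-range projections.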
Besides semiconductor technology, improved chip designs have also fueled the phenomenal performance growth of microprocessors over the years. Historically, with each new process generation, the logic switching speed and the amount of on-chip logic have both increased dramatically. Faster switching speeds lead to higher clock rates. Aggressive chip designs also contribute to higher clock rates by improving the design of circuits or by pipelining the steps in the execution of an instruction.
Preface
- pp xi-xviii
Summary
Computer architecture is a fast evolving field, mostly because it is driven by rapidly changing technologies. We have all been accustomed to phenomenal improvements in the speed and reliability of computing systems since the mid 1990s, mostly due to technology improvements, faster clock rates, and deeper pipelines. These improvements have had a deep impact on society by bringing high-performance computing to the masses, by enabling the internet revolution and by fostering huge productivity gains in all human activities. We are in the midst of an information revolution of the same caliber as the industrial revolution of the eighteenth century, and few would deny that this revolution has been fueled by advances in technology and microprocessor architecture.
Unfortunately, these rapid improvements in computing systems may not be sustainable in the future. Pipeline depths have reached their useful limit, and frequency cannot be cranked up forever because of power constraints. As technology evolves and on-chip feature sizes shrink, reliability, complexity, and power/energy issues have become prime considerations in computer design, besides traditional measures such as cost, area, and performance. These trends have ushered in a renaissance of parallel processing and parallel architectures, because they offer a clear path – some would say the only path – to solving all current and foreseeable problems in architecture. A widespread belief today is that, unless we can unleash and harness the power of parallel processing, the computing landscape will be very different very soon, and this dramatic change will have profound societal impacts.
9 - Quantitative evaluations
- pp 488-520
Summary
CHAPTER OVERVIEW
Modern computer systems are becoming increasingly complex as more devices and functionalities are integrated. Throughout the entire design cycle of systems, simulation is a crucial tool for computer architecture researchers to evaluate novel ideas and explore the design space. Compared with hardware prototyping and analytic modeling, simulation strikes a better balance between accuracy, cost, flexibility, and complexity. As the design complexity of state-of-the-art microprocessors keeps growing and manufacturing costs skyrocket, computer architecture simulation has become critical.
Simulations are pervasive in computer architecture research and design and affect the productivity of these activities to a great extent. Productivity is impacted at two levels: (1) the time and effort spent on developing the simulator, and (2) the time consumed in running simulations with representative benchmarks. The dramatic growth of the integration density of microprocessor chips provides computer architects with abundant on-chip real estate to enhance computing power with more complex architectural designs. In addition, power and reliability have turned into critical design constraints. Building a simulation infrastructure that allows a designer to consider performance, power, and reliability in a single unified framework leads to significant cost and delays in simulator development. Another direct consequence of a complex infrastructure is that simulation itself slows down, increasing the turnaround time for each design-space exploration. Simulation slowdown is becoming particularly acute as computer architecture moves into the chip multiprocessor (CMP) era. The current approach of simulating CMPs with growing numbers of cores in a single thread is not scalable and cannot be sustained over time.
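To see why single-threaded simulation of CMPs cannot be sustained, a rough model helps; the linear-slowdown assumption and all numbers below are illustrative, not measurements from the book.

```python
def sim_hours(cores, target_seconds, slowdown_per_core):
    """Wall-clock hours to simulate `target_seconds` of execution of a
    `cores`-core CMP on one host thread, assuming total slowdown grows
    linearly with the number of simulated cores."""
    return cores * slowdown_per_core * target_seconds / 3600

# One simulated second of a 64-core chip at a 10000x per-core slowdown:
print(sim_hours(64, 1, 10_000))  # ~177.8 hours on a single host thread
```

Under these assumptions, each doubling of the core count doubles the turnaround time of every design-space experiment, which is the scalability problem the paragraph above describes.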
3 - Processor microarchitecture
- pp 74-192
Summary
CHAPTER OVERVIEW
The processor and its instruction set are the fundamental components of any architecture because they drive its functionality. In some sense the processor is the “brain” of a computer system, and therefore understanding how processors work is essential to understanding the workings of a multiprocessor.
This chapter first covers instruction sets, including exceptions. Exceptions, which can be seen as a software extension to the processor instruction set, are an integral component of the instruction set architecture definition and must be adhered to. They impose constraints on processor architecture. Without the need to support exceptions, processors and multiprocessors could be much more efficient, but they would forgo the flexibility and convenience provided by software extensions to the instruction set in various contexts. A basic instruction set, broadly inspired by the relatively simple MIPS instruction set, is used throughout the book. We adopt it because the fundamental concepts of processor organization are easier to explain and grasp with simple instruction sets. However, we also explain extensions required for more complex instruction sets, such as the Intel x86, as the need arises.
Since this book is about parallel architectures, we do not cover architectures that execute instructions one at a time. Thus the starting point is the 5-stage pipeline, which concurrently processes up to five instructions in every clock cycle. The 5-stage pipeline is a static pipeline in the sense that the order of instruction execution (or the schedule of instruction execution) is dictated by the compiler, an order commonly referred to as the program, thread, or process order; the hardware makes no attempt to re-order the execution of instructions dynamically.
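The throughput of such a pipeline follows from one line of arithmetic: an ideal `stages`-deep pipeline takes `stages` cycles to fill and then completes one instruction per cycle. The sketch below is an illustration that assumes no hazards or stalls.

```python
def pipeline_cycles(instructions, stages=5):
    """Cycles for an ideal static pipeline to execute `instructions`:
    `stages` cycles to fill, then one completion per cycle (no hazards
    or stalls are modeled)."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)

print(pipeline_cycles(1))    # 5 -> a lone instruction takes 5 cycles
print(pipeline_cycles(100))  # 104 -> throughput approaches 1 per cycle
```

For long instruction streams the fill cost is amortized, so throughput approaches one instruction per cycle even though each individual instruction still takes five.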
Frontmatter
- pp i-vi
2 - Impact of technology
- pp 36-73
Summary
CHAPTER OVERVIEW
Technology has always played the most important role in the evolution of computer architecture over time and will continue to do so for the foreseeable future. Technological evolution has fostered rapid innovations in chip architecture. We give three examples, motivated by performance, power, and reliability. In the past, architectural designs were dictated by performance/cost trade-offs. Several well-known architectural discoveries resulted from the uneven progress of different technological parameters. For instance, caches were invented during the era when processor speed grew much faster than main memory speed. Recently, power has become a primary design constraint. Since the invention of the microprocessor, the amount of chip real estate has soared relentlessly, enabling an exponential rise of clock frequencies and ever more complex hardware designs. However, as the supply voltage approached its lower limit and power consumption became a primary concern, chip architecture shifted from high-frequency uniprocessor designs to chip multiprocessor architectures in order to contain power growth. This shift from uniprocessor to multiprocessor microarchitectures is a disruptive event caused by the evolution of technology. Finally, for decades processor reliability was a concern primarily for high-end server systems. As transistor feature sizes have shrunk over time, transistors have become more susceptible to transient faults. Hence radiation-hardened architectures have been developed to protect computer systems from single-event upsets causing soft errors.
These examples of the impact of technology on computer design demonstrate that it is critical for a reader of this book to understand the basic technological parameters and features, and their scaling with each process generation.
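The shift from frequency scaling to multiple cores follows directly from the classic dynamic-power relation P = C * V^2 * f; the normalized numbers below, including the 0.8 reduced-voltage figure, are illustrative assumptions, not values from the book.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Classic switching-power model: P = C * V**2 * f.
    Leakage power is ignored in this sketch."""
    return capacitance * voltage ** 2 * frequency

# Normalized baseline: one core at full voltage and frequency.
one_fast_core = dynamic_power(1.0, 1.0, 1.0)       # 1.0
# Two cores at half frequency; the lower frequency also permits a
# lower supply voltage (0.8 here, an assumed value).
two_slow_cores = 2 * dynamic_power(1.0, 0.8, 0.5)  # 0.64
print(two_slow_cores / one_fast_core)  # ~0.64: comparable nominal
                                       # throughput, about a third
                                       # less power
```

Because voltage enters the model quadratically, trading frequency for core count is a net power win whenever the workload has enough parallelism to keep both cores busy.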
6 - Interconnection networks
- pp 309-341
Summary
CHAPTER OVERVIEW
Interconnection networks are an important component of every computer system. Central to the design of a high-performance parallel computer is the elimination of serializing bottlenecks that can cripple the exploitation of parallelism at any level. Instruction-level and thread-level parallelism across processor cores demands a memory system that can feed the processor with instructions and data at high speed through deep cache memory hierarchies. However, even with a modest miss rate of one percent and a 100-cycle miss penalty, half of the execution time can be spent bringing instructions and data from memory to processors. It is imperative to keep the latency to move instructions and data between main memory and the cache hierarchy short.
It is also important that memory bandwidth be sufficient. If the memory bandwidth is not sufficient, contention among memory requests elongates the memory-access latency, which, in turn, may affect instruction execution time and throughput. For example, consider a nonblocking cache that has N outstanding misses. If the bus connecting the cache to memory can only transfer one block every T cycles, it takes N × T cycles to service the N misses as opposed to T cycles if the bus can transfer N blocks in parallel.
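Both numerical points in this overview (the N x T serialization and the one-percent miss-rate stall example) can be reproduced in a few lines; the helper name and the ceiling-division handling of a partial last batch are illustrative additions, not from the book.

```python
def service_cycles(misses, cycles_per_block, parallel_blocks=1):
    """Cycles to service `misses` outstanding cache misses when the
    memory path moves `parallel_blocks` blocks every `cycles_per_block`
    cycles (a partial last batch still costs a full transfer)."""
    batches = -(-misses // parallel_blocks)  # ceiling division
    return batches * cycles_per_block

# N = 8 outstanding misses, T = 100 cycles per block transfer:
print(service_cycles(8, 100))                     # 800 -> serialized bus
print(service_cycles(8, 100, parallel_blocks=8))  # 100 -> full parallelism

# The one-percent miss rate example: 0.01 * 100 adds one stall cycle
# per instruction, doubling a base CPI of 1.
cpi = 1.0 + 0.01 * 100
print(cpi)  # 2.0 -> half the execution time is memory stalls
```

The first pair of results shows how insufficient bandwidth elongates the effective miss latency by a factor of N; the CPI line shows why even a one-percent miss rate can consume half the execution time.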
The role of interconnection networks is to transfer information between computer components in general, and between memory and processors in particular. This is important for all parallel computers, whether they are on a single processor chip – a chip multiprocessor or multi-core – or built from multiple processor chips connected to form a large-scale parallel computer.