Recent trends in computer hardware have made modern computers parallel computers; that is, the power of today's computers depends on the number of processors rather than on the speed of any single one. To use these powerful machines, we must split the algorithmic tasks into pieces and assign them to a large number of processors. In many cases, this requires nontrivial modifications of the algorithm. In this chapter, we discuss several basic concepts of parallelizing an algorithm and illustrate them in the context of loop/cluster identification.
The key issue in parallel computation is the distribution of computer memory: how much memory is available to each processor, and at what access speed. Memories are organized in a hierarchical structure, such as L1 cache, L2 cache, and main memory, with each level having a different access speed. The access speed also depends on the physical distance between the computing unit and the memory block.
Discussing how to fine-tune computer programs, taking all machine details into account, is clearly beyond the scope of this book. Therefore, in the following, we focus our discussion of parallel computers on two common types of architectures (Fig. 13.1): shared memory and distributed memory. In either case, we assume that the parallel computer has Np processors.
In most parallel computers available today, each local memory block is directly accessible by only a small number of processors, namely those in the processor block that is physically closest to it. To access a remote block of memory, a processor must communicate with another processor that has direct access to that block. In some sense, the shared-memory architecture is a model for a single local processor block. Alternatively, it can be regarded as a model of the whole computer system in which the communication cost is negligible. In this model, we do not have to account for the process of communicating between processors. We simply assume that they all read from and write to the same block of memory, and that each processor immediately sees what another processor has written. The distributed-memory architecture, on the other hand, is a model in which every processor has exclusive access to the block of memory that it owns; for a processor to obtain information stored in another processor's memory, the owner of that information must explicitly send it the content of this memory.
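The distinction between the two models can be sketched in code. The following minimal example, with illustrative names of our own choosing, contrasts a shared-memory computation, where all workers write into one common array that every worker can see, with a distributed-memory computation, where each worker holds its result in its own private memory and must explicitly send it back to the owner of the final list.

```python
import threading
import multiprocessing as mp

# --- Shared memory: all workers read and write one common block. ---
def shared_worker(data, idx):
    # Each thread writes its partial result directly into the shared
    # array; the other threads can see the update immediately.
    data[idx] = idx * idx

def run_shared(n_p):
    data = [0] * n_p  # one memory block, visible to every thread
    threads = [threading.Thread(target=shared_worker, args=(data, i))
               for i in range(n_p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data

# --- Distributed memory: each process owns its memory and must
# --- explicitly send its content to whoever needs it. ---
def distributed_worker(conn, idx):
    local = idx * idx   # exists only in this process's private memory
    conn.send(local)    # the owner explicitly sends the content
    conn.close()

def run_distributed(n_p):
    results = []
    for i in range(n_p):
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=distributed_worker, args=(child_conn, i))
        p.start()
        results.append(parent_conn.recv())  # receive the sent message
        p.join()
    return results

if __name__ == "__main__":
    print(run_shared(4))        # [0, 1, 4, 9]
    print(run_distributed(4))   # [0, 1, 4, 9]
```

Both runs produce the same numbers; the difference lies entirely in how the data reaches the caller, implicitly through a shared block in the first case and through explicit messages in the second. This is precisely the distinction that matters when deciding how much communication a parallelized algorithm will require.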