Lecture 13
Parallel Programs
One approach to writing parallel programs is to develop sequential programs and trust a compiler to find the parallelism. Part of this process involves identifying all of the dependences between references to values in the code in order to determine which operations are independent and thus can be executed in parallel. We'll leave an examination of dependences to an advanced compilers course. However, it should be noted that, in general, sequential programs do not parallelize well.
Some sequential programs can be easily analyzed and converted to parallel form. But any time a programmer has tried to be clever, or to hand-optimize code for a machine, the analysis becomes much more difficult and results can be marginal. Without human assistance, the typical performance improvement from a parallelizing compiler is around a factor of 2.
Parallelizing compilers typically employ an approach that has come to be known as Pavlov's programmers. They begin by promising automatic parallelization with no programming effort. This gets the developer to commit to using a parallel machine, which promises a factor of 1000 or more speedup. But the result of the automatic parallelization is a modest boost. When the developer complains, he or she is told that they have an aberrant program and that they need to use special tools to look for evil dependences in their programs and then insert special comments to help the compiler around these nasty sections of code. Gradually, the developer learns to identify these problems and either rewrite the code or insert comments with greater frequency. In the end, the programmer has essentially learned a parallel programming language and has rewritten the code into a parallel program.
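Bernstein's conditions are a simple dependence test for parallelism. Given two "processes" (which could in an abstract sense be individual statements) P1 and P2 with input sets I1 and I2 and output sets O1 and O2, the two can execute in parallel if:
I1 ∩ O2 = ∅ (P2 does not write anything that P1 reads)
I2 ∩ O1 = ∅ (P1 does not write anything that P2 reads)
O1 ∩ O2 = ∅ (P1 and P2 do not write to the same location)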
While these conditions are necessary to achieve parallelism in the sense of simultaneous execution of two processes, they ignore the possibility of temporally staggered parallelism as is encountered in pipelining.
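A minimal Python sketch of the test, taking the conditions in their standard form (no output of one process is an input of the other, and the two output sets are disjoint). The function name and the three sample statements are hypothetical, chosen only for illustration.

```python
def bernstein_independent(i1, o1, i2, o2):
    """Return True if two processes satisfy all three of Bernstein's conditions."""
    no_flow = not (o1 & i2)    # P2 does not read what P1 writes
    no_anti = not (i1 & o2)    # P2 does not overwrite what P1 reads
    no_output = not (o1 & o2)  # the two do not write the same location
    return no_flow and no_anti and no_output


# Hypothetical statements:  P1: a = b + c    P2: d = b * 2    P3: e = a + 1
# Each process is written as (input set, output set).
p1 = ({"b", "c"}, {"a"})
p2 = ({"b"}, {"d"})
p3 = ({"a"}, {"e"})

print(bernstein_independent(*p1, *p2))  # True: they only share an input
print(bernstein_independent(*p1, *p3))  # False: P3 reads the value P1 writes
```

Sharing an input (both P1 and P2 read b) is harmless; only a write that touches the other process's inputs or outputs creates a dependence.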
Either of the first two conditions can be violated as long as the input set can be processed one step after the preceding output. As with any pipeline, there is a startup cost to fill the pipe, but once full, parallel speedup equivalent to the number of stages is achieved as long as the pipe is kept full.
Even the third condition can be violated if a third process can be created to combine the outputs in a pipelined manner, after they are computed.
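The startup cost and the limiting speedup of a pipeline can be checked with a few lines of arithmetic. This is a sketch under a simplifying assumption that is not in the notes: every stage takes exactly one time step and the pipe never stalls. The function name is illustrative.

```python
def pipeline_speedup(stages, items):
    """Speedup of an ideal pipeline over doing every stage of every item in turn."""
    sequential_steps = stages * items     # one item at a time, all stages
    pipelined_steps = stages + items - 1  # fill the pipe, then one result per step
    return sequential_steps / pipelined_steps


for n in (1, 10, 1000):
    print(n, round(pipeline_speedup(4, n), 2))
# 1 1.0
# 10 3.08
# 1000 3.99
```

With a single item there is no gain at all; as the number of items grows, the speedup approaches the number of stages without ever reaching it.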
Control Parallelism
One definition of control parallelism is an implementation through pipelining or multiple functional units. This definition is narrowly focused on the concept of instruction level parallelism (ILP), in which it is determined that independence between individual operations allows their instructions to be executed in parallel. From the CPU designer's viewpoint, these are separate threads of instructions to be coordinated.
From the programmer's perspective, however, most of this parallelism is happening invisibly within a single control thread. An exception might be when the operations come from multiple processes, possibly under the control of the programmer (a concept known as multithreading, which we explore later).
Most people, however, think of control parallelism as a programming model in which multiple tasks with independent control threads are created. The Communicating Sequential Processes (CSP) model is an example of this, as is the Futures model employed in some parallel LISP implementations and other languages. Other models include the Parallel RAM (PRAM) and Bulk Synchronous Parallel (BSP).
The problem is that many of the popular supercomputing applications do not have massive amounts of control parallelism. Who can think of some applications that naturally have a high degree of control parallelism?
High level vision.
Multi-agent systems.
Controlling an industrial process in a factory.
Handling on-line transactions, such as airline reservations.
Air traffic control.
Control parallelism tends to be more loosely coupled because a significant set of steps is executed between communication events. Communication events can be global or between subsets of the processes, but in either case are usually asynchronous. That is, the processes do not have to synchronize to communicate. However, depending on the protocol, communication can result in synchronization. What are the ways in which processes can communicate?
The essential approaches are:
A process can send a message to another and continue.
A process can send a message to another and wait for a reply.
A process can send a message to another and continue, but request a reply.
A message can be queued for receipt.
A message can interrupt the receiving process (possibly with priorities).
A message can be placed in some common memory, and be read at the discretion of the receiver(s).
A process can lock a shared resource so that it has exclusive access (read, write, or both).
A process can queue its request for locking.
A process can wait for a lock to be released.
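Several of the approaches listed above can be sketched with Python's standard threading and queue modules: a queued message, send-and-continue, send-and-wait-for-reply, and a lock on a shared resource. This is only an illustration using threads inside one program; the message format (a value paired with an optional reply queue) and the stop message are assumptions made for this example.

```python
import queue
import threading

inbox = queue.Queue()          # messages are queued for receipt
shared_total = 0               # a shared resource
total_lock = threading.Lock()  # grants exclusive access to shared_total


def server():
    """Receive messages until told to stop; reply only when asked to."""
    global shared_total
    while True:
        value, reply_to = inbox.get()
        if value is None:      # hypothetical stop message
            break
        with total_lock:       # wait for the lock, then hold it exclusively
            shared_total += value
        if reply_to is not None:
            reply_to.put(value * 2)


worker = threading.Thread(target=server)
worker.start()

inbox.put((1, None))           # send a message and continue (no reply wanted)

reply_box = queue.Queue()
inbox.put((2, reply_box))      # send a message and wait for a reply
answer = reply_box.get()       # blocks until the server responds

inbox.put((None, None))        # ask the server to stop
worker.join()
print(answer, shared_total)    # 4 3
```

Note that the second exchange synchronizes the two threads even though the queue itself is asynchronous, which is the protocol-dependent synchronization mentioned earlier.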
Now let's consider how processes might be created. Here are some mechanisms:
Fork and join.
A parent task spawns child tasks.
Function or procedure calls get executed in parallel.
Branches in a control structure that subdivide a data set can be executed in parallel.
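Fork and join, applied to a subdivided data set, might look like the following sketch. It uses Python threads purely to show the structure; in the standard CPython interpreter these threads will not actually speed up arithmetic. The helper name partial_sum and the data are illustrative.

```python
import threading


def partial_sum(data, lo, hi, results, slot):
    """Child task: sum one slice and store the answer in its own slot."""
    results[slot] = sum(data[lo:hi])


data = list(range(100))
results = [0, 0]
mid = len(data) // 2

# Fork: the parent spawns two children, one per half of the data set.
children = [
    threading.Thread(target=partial_sum, args=(data, 0, mid, results, 0)),
    threading.Thread(target=partial_sum, args=(data, mid, len(data), results, 1)),
]
for child in children:
    child.start()

# Join: the parent waits for both children before combining their results.
for child in children:
    child.join()

total = sum(results)
print(total)  # 4950
```

Each child writes to a different slot, so the children need no lock between them.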
What happens when a process is created? What is its environment? What are the costs or overhead associated with creating a process?
If a process is a fully independent task that is recognized by the OS, then memory and other resources must be allocated, security control information must be established, code must be loaded, etc.
If the process is spawned within a task, then the OS need not be fully in control of it, and the allocation can be made from already assigned resources. Security isn't a problem because we can assume that the programmer does not write adversarial processes (at least not intentionally). But the burden of managing the processes falls more on the programmer.
Lighter weight threads may require even less overhead. An example might be a future function call, in which evaluation takes place in a parallel thread and the result is not returned until some time in the future when it is actually needed.
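A future call of this kind can be sketched with Python's concurrent.futures module. The function slow_square is a hypothetical stand-in for an expensive computation.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for an expensive function whose result is needed later."""
    return x * x


with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(slow_square, 12)  # evaluation starts in another thread
    other_work = sum(range(10))            # the caller keeps going meanwhile
    value = future.result()                # blocks only here, when the value is needed

print(other_work, value)  # 45 144
```

The caller pays the waiting cost only at the point where it asks for the result, and not at all if the result is already available by then.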
Data Parallelism
Data parallelism is a popular parallel programming model because it is essentially a sequential model that operates on parallel data types. Thus, it is easy to learn. At first glance, it seems easy to convert old algorithms to use it, and it appears to be straightforward to scale it up with the size of a data set.
In reality, data parallelism works well on the small scale, such as the parallel graphics operations now found in most microprocessors, which operate on four or eight pixels at a time. When we scale up, various problems arise. For example, keeping processing synchronized across a large collection of processors is constrained by speed-of-light considerations. We either have to slow the clock rate (reduces performance) or we have to relax our definition of synchronization, which leads to overhead for managing asynchrony (reduces performance).
On large data sets, it is common for processing to operate on a subset of the values (e.g., treat values greater than zero one way, then work on the values less than zero). Thus, utilization can be low; in extreme cases, it can be worse than sequential.
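The positive/negative example can be sketched as two passes over the whole data set, each with only some processors enabled. This is a plain Python simulation, not real parallel code; the two update rules (doubling and negating) and the utilization measure are assumptions made for illustration.

```python
def masked_update(values):
    """Apply one rule to positive values and another to negative ones, one pass each."""
    result = list(values)
    active_slots = 0

    # Pass 1: only the processors holding positive values are enabled.
    mask = [v > 0 for v in values]
    result = [v * 2 if m else r for v, m, r in zip(values, mask, result)]
    active_slots += sum(mask)

    # Pass 2: only the processors holding negative values are enabled.
    mask = [v < 0 for v in values]
    result = [-v if m else r for v, m, r in zip(values, mask, result)]
    active_slots += sum(mask)

    # Fraction of processor time slots that did useful work over both passes.
    utilization = active_slots / (2 * len(values))
    return result, utilization


result, utilization = masked_update([3, -1, 0, 5, -2, 0, 0, 7])
print(result, utilization)  # [6, 1, 0, 10, 2, 0, 0, 14] 0.3125
```

Here eight processors spend two passes to do five useful operations, so fewer than a third of the available slots are used.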
There are also subtle algorithmic differences that we must be concerned about. For example, a common parallel operation is a convolution operator that multiplies surrounding elements by numerical weights, sums the products and then divides by a constant. This can be done in parallel for every value in an array. However, when a sequential version accesses surrounding elements, some have already been updated and others have not (usually the elements above and to the left in the array are already processed, and those below and to the right have not yet been updated). In a straight parallel implementation, all of the neighboring elements are yet to be updated. It turns out that this difference can affect the final result, and can even make the algorithm fail to converge in cases where the sequential algorithm does. Users are understandably shy of any method to speed up their applications that produces different answers as a side effect.
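The difference is easy to see in a one-dimensional analogue. The sketch below assumes a three-point average with equal weights (so the constant divisor is 3) and leaves the two end elements fixed; both function names are illustrative.

```python
def smooth_sequential(a):
    """Update in place, left to right: the left neighbor has already been updated."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3
    return a


def smooth_parallel(a):
    """Every interior element reads only old values, as if all updated at once."""
    old = list(a)
    new = list(a)
    for i in range(1, len(a) - 1):
        new[i] = (old[i - 1] + old[i] + old[i + 1]) / 3
    return new


data = [0.0, 27.0, 0.0, 0.0, 0.0]
print(smooth_sequential(data))  # [0.0, 9.0, 3.0, 1.0, 0.0]
print(smooth_parallel(data))    # [0.0, 9.0, 9.0, 0.0, 0.0]
```

In the sequential version the spike leaks rightward through already-updated neighbors within a single sweep; in the parallel version it spreads by only one element per sweep.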
There are three basic architectural models of data-parallel processing. In the simplest model, corresponding to pure single-instruction multiple-data (SIMD) operation, an operation is performed on all elements of a parallel data type except those that are masked from participation. The operations exclude local indexing. This model is what is used in the graphics instruction set extensions of microprocessors, and is also a common approach in special-purpose signal or vector-data processors. The first SIMD systems were proposed in the 1950's and a prototype was built in 1963. The first commercial SIMD system appeared in 1968 (STARAN). In the 1980's and 1990's, this approach gained enough popularity that several companies built general-purpose SIMD array processors, but only a few of these remain in business, mostly supporting highly specialized applications, such as radar tracking.
In the second model, local indexing is added. The reason that local indexing is distinguished is that it greatly increases the silicon area. In a pure SIMD system, the same address is decoded for all processors, so one address decoder can be shared by all the processors on a chip. With local indexing, the address decoder must be replicated for each processor, with a significant increase in area. Unless the processor has a very wide word (32 or 64 bits), the area added to each processor is greater than the area occupied by memory, and may even be greater than that of the entire processor.
Local indexing provides the ability to write algorithms in which the components of the parallel data structure are themselves dynamic structures such as queues. Such algorithms may speed up processing from linear to logarithmic complexity. However, it must be kept in mind that this may be only a small portion of the overall processing, and that greater overall performance may be achieved with more nonindexing parallelism. Only one system has been built and sold commercially with this approach (MasPar), and it is no longer in production.
In the third model, local branching is added. This is the single-program, multiple-data (SPMD) version of data parallelism in which the code and data are replicated and a constrained version of MIMD is executed such that the processors can independently take multiple branches. In a SIMD algorithm, branches must be taken sequentially. If they are expressed as branches, it is up to the compiler to translate them into a sequential series of code segments for a SIMD architecture. Of course, in an SPMD architecture, this presents no such problem.
SPMD machines have been built (Connection Machine 5), taking advantage of the inherent coordination of the tasks, and the common program memory, to provide more processors with faster interprocessor coordination capabilities. However, like SIMD arrays, these machines have not been a marketing success.
The reasons for the failure of stand-alone data parallel architectures are many. They involve custom hardware, and thus are costly. The custom hardware offers an overwhelming performance advantage only to a small subset of the application market, so the customer base is limited. They require applications to be rewritten into a non-portable form, which is costly and unattractive to many users. They are typically a generation or two behind mainstream processors in their use of technology, partly because it takes a generation or more to engineer new processors into a system, and also because smaller companies lack access to the latest chip fabrication technology. Once a system is built, it can take another generation to port the software to it. By that time, mainstream processors have become as much as 10 times faster.
Thus, a data parallel architecture really needs to be at least 100 times faster than processors of the same technology generation, so that it can deliver a 10X boost in performance over its competition. And of course, it must not cost more than ten times as much. Even then, the attraction is minimal, as customers may see that they can avoid the software porting cost by simply waiting a few years for mainstream processors to catch up. To really be viable, a data parallel architecture should deliver a factor of 1000 improvement over contemporary systems, and that is very hard to do at a price point that produces enough sales volume to support the research and development for the next generation of the system.
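The arithmetic behind the 100X and 10X figures can be written out as follows. The symbols are introduced here for illustration only: S_0 is the speedup over same-generation processors, G is the factor by which mainstream processors improve while the system is engineered and its software ported (about 10 in the discussion above), and C is the cost relative to a mainstream system.

```latex
S_{\text{delivered}} = \frac{S_0}{G} = \frac{100}{10} = 10,
\qquad
\frac{S_{\text{delivered}}}{C} \ge 1 \iff C \le 10
```

In other words, a machine that starts with a 100X advantage reaches customers with a 10X advantage, and at ten times the cost it merely breaks even on performance per dollar.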
Most data parallelism today is either done at the small scale (graphics instructions), or with low-cost multiprocessor systems (such as networked clusters of PCs). Only a few companies build large-scale parallel machines, and their customers are those select few for whom the high cost is justified.
Data Flow
Traditional programming models rely on explicit flow of control. Instructions are executed in a particular order to move data from place to place, perform computations, and alter the order of instruction execution. In most processors, control flow imposes a certain amount of sequentiality because the control unit fetches one instruction at a time (note the influence of the von Neumann model).
In a MIMD system, of course, multiple fetches occur simultaneously but each processor still fetches and executes instructions sequentially. Superscalar processors can issue multiple instructions per cycle, but the number of simultaneous instructions is small.
Data flow, on the other hand, has no instructions. A program is a design for a network of functions to be applied to a data set. The data from the set flows into the program's network and, as values become available at the inputs of the functions, they perform computations and pass the results on to succeeding stages of the network.
In theory, the data can flow into the network with maximum parallelism. Because a data flow network is usually asynchronous, the processors are decoupled and can execute at high speed. However, there is considerable overhead in synchronizing units to pass each token, and considerable effort is required to balance the network so that bottlenecks do not clog it.
In a sense, a data flow program can be thought of as expanding the definition of a pipeline to nonlinear topologies. The problem, of course, is that, unlike a simple pipeline, a general-purpose data flow processor must be able to reconfigure into an arbitrary set of functional units and interconnections. Current technology is unable to support such an architecture.
There are two usual solutions to the problem of trying to build a data flow architecture. One is to build a custom processor for specialized applications. There are even CAD tools that partially automate the process.
The other approach is to build a tagged-token architecture, which is more like a conventional processor but has special support for managing tokens: data values with attached destination information. The data flow program specifies what functions are to be performed and what properties their input tokens must have in order for the function to execute. A token matcher checks available tokens and passes them to the function when they match its input requirements. Associative memories, as well as hardware hashing units, have been employed for token matching.
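Token matching can be sketched in software with a hash table standing in for the associative memory or hashing unit. Everything here is an assumption for illustration, not a description of any real machine: the class name, the token fields (destination, tag, port, value), and the rule that a function fires once all of its input ports hold a token with the same tag.

```python
from collections import defaultdict


class TokenMatcher:
    """Hold tokens until every operand of a destination function has arrived."""

    def __init__(self, arity):
        self.arity = arity                # destination -> number of inputs
        self.waiting = defaultdict(dict)  # (destination, tag) -> {port: value}

    def arrive(self, destination, tag, port, value):
        """Store one token; return the matched operands if the set is complete."""
        key = (destination, tag)
        self.waiting[key][port] = value
        if len(self.waiting[key]) == self.arity[destination]:
            operands = self.waiting.pop(key)
            return [operands[p] for p in sorted(operands)]
        return None


matcher = TokenMatcher({"add": 2})
functions = {"add": lambda x, y: x + y}

first = matcher.arrive("add", tag=0, port=0, value=3)   # no partner yet
second = matcher.arrive("add", tag=0, port=1, value=4)  # partner arrives
fired = functions["add"](*second)                       # the function can now execute
print(first, second, fired)  # None [3, 4] 7
```

The order in which the two tokens arrive does not matter; the function fires only when the matcher reports a complete set.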
A few data flow prototype machines have been built, and one company sells an outdated processor chip for a tagged-token dataflow model. These are all fine-grained data flow architectures -- that is, they operate at the level of individual arithmetic instructions.
While fine-grained dataflow has not caught on as a popular model of parallelism, coarse-grained dataflow is used in forms such as Unix pipes. A generalized approach to piping may be the most natural way to express the coarse structure of applications that deal with continuous streams of data, such as vision, speech understanding, sound recognition, radar tracking, etc., where a network of more powerful functions (filters, feature extractors, etc.) operates on the data as it passes through the system.
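The coarse structure of such a stream application can be sketched with Python generators chained like a Unix pipe. The stages run interleaved in one thread here, so this shows only the shape of the program, not parallel execution. The stage names, the two-point averaging filter, and the threshold are hypothetical.

```python
def source(samples):
    """Feed the raw samples into the pipe one at a time."""
    for s in samples:
        yield s


def low_pass(stream):
    """Two-point moving average, a stand-in for a filter stage."""
    previous = None
    for s in stream:
        if previous is not None:
            yield (previous + s) / 2
        previous = s


def detect(stream, threshold):
    """Emit True whenever the filtered signal exceeds the threshold."""
    for s in stream:
        yield s > threshold


pipeline = detect(low_pass(source([0, 2, 8, 10, 2])), threshold=4)
events = list(pipeline)
print(events)  # [False, True, True, True]
```

Each stage consumes values as they become available from the stage before it, so no stage needs to see the whole data set at once.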
Orthogonal Parallel Models
Within a function in a data flow network, or a single thread in a set of parallel control threads, what kind of programming model do we use?
There need be no restriction. Of course, the problem is that the technology of today can barely support one form of parallelism, let alone multiple orthogonal forms. However, that does not mean that we should ignore the potential for parallelism that could result from applying orthogonal models. If we devise a fully parallel algorithm, we can always ignore some part of the parallelism in implementing a program.
Heterogeneous processors are one means of implementing orthogonal parallelism, especially if they support general communication or reconfigurable communication. Researchers explored another approach, however, that allows processors to reconfigure their basic structure -- essentially using programmable logic to reconfigure their gates into different forms of parallel processors as appropriate for the different stages of processing in a problem.
It should be noted, however, that the devil is in the details. Whenever we try to write a parallel algorithm, we have in mind some machine model that guides our thinking. But even a very minor change in architecture within a single model (e.g., two pure SIMD machines, but with different accelerators for reducing arrays to scalar values, such as summing the array elements) can lead to very different algorithms for obtaining maximum efficiency. Thus, writing code for a machine that has multiple models, or can change its model, is quite a different undertaking. Programming such a machine would likely involve beginning with some sort of specification of the programming model, followed by algorithms written for that model.
Parallelism opens a Pandora's box of possibilities for parallel programming. It gives us far greater flexibility for designing architectures than sequential processing. But it also means that we have to deal with the diversity that results from that flexibility.