Lecture 16

Multivector and SIMD Computers

Vector Processing

As noted many times before, the regularity of scientific codes, with their emphasis on matrix computations, provides ample opportunity for accelerating performance.

Vector machines attempt to exploit this by providing explicit parallelism that maps onto one-dimensional arrays.

Of course, multidimensional arrays can be considered as composites of one-dimensional arrays, although their layout in memory does not correspond directly to such a composition because arrays are linearized when they are mapped to memory. Hence, as you might expect, one of the challenges in designing vector machines is to deal with memory access.

Early vector machines such as the CDC Star-100 had memory-memory pipelined architectures. They could handle a vector of any length, but their operational speed was limited by memory access. Even with 16-way interleaving, it was difficult to push speeds very high. And interleaving is usually fixed in its access pattern, so it only fetches vectors quickly when they are aligned with memory.

The distance between elements in memory is called the stride -- as in the length of your step. In this case, it is the length of the processor's steps through memory.
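To see why stride matters for interleaved memory, here is a minimal Python sketch. It assumes the simplest scheme, in which word address a lives in bank a mod n_banks; the function names are illustrative and not taken from any machine's documentation. It counts how many distinct banks a strided vector access touches.

```python
from math import gcd

def banks_touched(base, stride, length, n_banks=16):
    """Return the set of banks hit by a strided vector access.

    Assumes low-order interleaving: word address a lives in bank a % n_banks.
    """
    return {(base + i * stride) % n_banks for i in range(length)}

def distinct_banks(stride, n_banks=16):
    """Closed form for a long vector: n_banks / gcd(stride, n_banks)."""
    return n_banks // gcd(stride, n_banks)

# Compare the simulated count with the closed form for a few strides.
for stride in (1, 2, 8, 16, 17):
    print(stride, len(banks_touched(0, stride, 64)), distinct_banks(stride))
```

Under this model a stride that shares a large factor with the number of banks keeps returning to the same few banks, so the access runs at close to the speed of a single bank, while an odd stride spreads the access over all of them.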

Register-Register Designs

Later vector processors operated register to register, RISC style (note that these designs came before RISC). There are two styles of register-register vector processor: parallel and pipelined. The former is really just a linear SIMD array, and is thus not usually classified as a traditional vector processor.

Pipelined vector units simply take advantage of the fact that there is a known source and sink for data that can transfer blocks of data at high speed. Once the data is in the registers, it can be streamed at a very high rate through the pipelined functional units.

Typical operations are:

Vector-vector: Operations between two vectors with a vector result

Vector-scalar: Operations between a vector and a scalar with a vector result

Vector-memory: Loads and stores to the registers -- specify base address, length, stride

Vector reduction: Operation on a vector that produces a scalar

Gather/Scatter: Specialized loads and stores for sparse arrays, arrays with unusual strides

Masking: Compress zeros out of a vector, and produce a new vector of indexes to the original non-zero elements.
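The last two operations in this list can be sketched in a few lines of Python. This is only an illustration of their semantics on ordinary lists; the function names are made up for the example, and a real vector unit would do each of these as a single instruction.

```python
def gather(memory, indices):
    """Load memory[i] for each index i into a dense vector."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    """Store each value back to the location named by its index."""
    for i, v in zip(indices, values):
        memory[i] = v

def compress_nonzero(vector):
    """Squeeze the zeros out of a vector.

    Returns the dense vector of non-zero values together with the vector
    of indexes that says where each value came from.
    """
    indices = [i for i, v in enumerate(vector) if v != 0]
    return gather(vector, indices), indices
```

Compressing a sparse vector, operating on the dense result, and scattering it back through the index vector reproduces the original layout.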

Interleaved memory access can feed data to the processor in a pipelined manner, in parallel, or in a combination where fetch and load are pipelined while data is transferred in parallel.

Consider the implications of Amdahl's law for vector processors. Note that a fairly high percentage of the execution must be on vectors to show significant speedup. It is precisely for this type of architecture that Amdahl's law is most applicable, because it is really a scalar processor running sequential code with vector support. Thus, there are two modes of operation, and it is clear that a portion of sequential code is being accelerated.
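A small Python sketch makes the point concrete. It uses the usual form of Amdahl's law; the tenfold vector speedup in the loop is an assumed figure chosen for illustration, not a measurement of any machine.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when `vector_fraction` of the run time is accelerated
    by a factor of `vector_speedup` and the rest runs at scalar speed."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

# Assumed: the vector unit is 10 times faster than scalar execution.
for f in (0.5, 0.8, 0.9, 0.99):
    print(f"{f:.2f} vectorized -> {amdahl_speedup(f, 10.0):.2f}x")
```

With half of the execution vectorized, the overall speedup stays below 2 no matter how fast the vector unit is, which is why the vector fraction has to be high.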

Parallel algorithms are rarely employed on vector processors, although the Pavlov's Programmers approach effectively turns sequential codes into cumbersome parallel codes. Many programmers tried to hand-tune their Fortran code to match the optimum vector lengths of their machines, rewriting loops to work in vector-register-sized chunks. These rather convoluted loop structures are difficult to port to other architectures, and are especially hard for compilers to reoptimize.
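The loop restructuring described here is commonly called strip mining. The following Python sketch shows the shape of such a loop. It assumes a 64-element vector register, and `vector_add` is a hypothetical stand-in for a single vector instruction.

```python
VECTOR_LENGTH = 64  # assumed vector register size

def vector_add(a, b):
    """Stand-in for one vector instruction on at most VECTOR_LENGTH elements."""
    assert len(a) == len(b) <= VECTOR_LENGTH
    return [x + y for x, y in zip(a, b)]

def strip_mined_add(a, b):
    """Add two arbitrary-length arrays in vector-register-sized strips."""
    result = []
    for start in range(0, len(a), VECTOR_LENGTH):
        end = min(start + VECTOR_LENGTH, len(a))  # the last strip may be short
        result.extend(vector_add(a[start:end], b[start:end]))
    return result
```

The constant 64 is baked into the loop structure, which is exactly what makes such code awkward to move to a machine with a different register length.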

Multivector Designs

Just as superscalar designs can take advantage of fine-grained parallelism in scalar operations, it is possible to take advantage of similar parallelism in vector codes. Thus, it makes sense to provide multiple vector processors in a system.

Because of the already great demands on memory access, the majority of the effort in designing a multivector system is in the memory subsystem. Interleaved, multiport memory units can be connected to the processors via a crossbar switch (in the Cray, this is actually a variation on a Clos network with crossbars).

Because the operands are 64 bits (72 bits with error correction), you can probably guess that a 32 x 8 crossbar is rather expensive, especially when it is operating with cycle times under 5 ns.

A Case Study: The Cray 1 and Family

The Cray 1 was first delivered in 1976. This was around the same time that 8-bit microprocessors were beginning to gain popularity, and typical memory components were 1K bit SRAM and 4K bit DRAM. Most popular machines operated at about a 1 MHz clock rate and had 32-bit words, and large mainframes had 1 MB to 8 MB of RAM.

The Cray 1 had (Baron and Higbie CS manual)

64-bit words

8 MB of RAM

16-way interleaving on low-order bits

50 ns memory cycle

12.5 ns clock cycle (80 MHz)

12 pipelined functional units

The Cray 1 has 3 basic data types: addresses (24-bit integer), integers (64-bit), floating point (64-bit, 48-bit mantissa). The 12 functional units are divided into four groups.

Group 1 -- Vector units

Vector (integer) Add: 3 stages

Vector Logical: 2 stages

Vector Shift: 4 stages

Group 2 -- Vector and scalar units

Floating Add: 6 stages

Floating Multiply: 7 stages

Floating Reciprocal Approximation: 14 stages

Group 3 -- Scalar units

Integer Add: 3 stages

Logical: 1 stage

Shift: 2 stages

Scalar population count and leading zero count: 3 stages

Group 4 -- Address units

Add: 2 stages

Multiply: 6 stages

The machine itself is divided into six major subsystems

Memory

Instruction component

Address component

Scalar component

Vector component

I/O component

Instruction Component

Cray 1 instructions are 32 or 16 bits, so from 2 to 4 instructions can be packed into a word. Instructions are thus addressed on 16-bit boundaries while data is addressed on 64-bit boundaries.

The instruction unit has four 16-word instruction buffers, three instruction registers, and one instruction counter. Each 16-bit field in a word is called an instruction parcel.

The three instruction registers are

Next Instruction Parcel -- holds first parcel of the next instruction, prefetched from buffer

Current Instruction Parcel -- holds the high-order portion of the instruction to be issued

Lower Instruction Parcel -- holds low-order portion of instruction to be issued

For a 32-bit instruction, the low-order portion is fetched to the NIP and then moved to the LIP.

There is no mechanism for discarding instructions in the pipe -- once in the CIP/LIP, they will be issued. At most they will be delayed for some time.

The instruction buffers are tied to the memory via the 16-way interleaving, so it is possible to fill a buffer in 4 clock cycles (recall that the clock is 12.5 ns and memory is 50 ns). Buffers are filled on a demand basis in a round-robin pattern. They thus act as an instruction cache of 256 parcels (256 instructions if all are 16 bits long), organized into four lines of 64 parcels. Each buffer has its own address comparator, so we would call this a fully associative cache (easy to implement when there are only 4 lines). The buffers cannot be written to -- a write bypasses the instruction cache and only goes to main memory.

Scalar instruction issue requires that all of the instruction's required resources be free -- otherwise the instruction waits. Vector instruction issue in the Cray involves reserving functional units, including memory, operand registers and result registers, and then releasing an instruction once all of its resources are available. In addition, some data paths are shared between the vector and scalar components, and these must be available.

The control unit is able to detect when a result register for one vector operation is an operand for another vector operation and, if the two vector instructions do not conflict in any other resource requirements, it sets up a vector chaining operation between the two instructions.
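The benefit of chaining can be estimated with a rough timing model, sketched below in Python. The model is an assumption made for illustration: a pipelined unit with s stages working on n elements is taken to need s + n cycles, and register access and issue overheads are ignored. The stage counts in the example are the Floating Multiply and Floating Add figures listed earlier.

```python
def unchained_cycles(n, stages_first, stages_second):
    """Second operation starts only after the first has produced all n results."""
    return (stages_first + n) + (stages_second + n)

def chained_cycles(n, stages_first, stages_second):
    """Each result of the first unit is forwarded straight into the second,
    so the two pipelines behave like one longer pipeline."""
    return stages_first + stages_second + n

# A 64-element multiply (7 stages) feeding an add (6 stages).
print(unchained_cycles(64, 7, 6), chained_cycles(64, 7, 6))
```

Under these assumptions, chaining brings the pair of operations close to the time of a single vector operation.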

Address Component

There are 8 24-bit address registers, 64 24-bit spill registers, an adder, and a multiplier in this component. Its purpose is to perform index arithmetic and send the results to the scalar and vector components so that they can fetch the appropriate operands.

Arithmetic is performed on the address registers directly. The spill registers are used to hold address values that do not fit into the address registers. A set of 8 addresses can be transferred between the address registers and their spill registers in a single cycle. Thus, they bear a certain similarity to the register windows of the SPARC (or vice versa). The spill registers can be thought of as an explicitly managed data cache with 8 lines. Their value is that they reduce the traffic to main memory, freeing that resource for vector operations.

Scalar Component

Similar to the address component, the scalar component has 8 64-bit registers and 64 64-bit spill registers. It has sole access to four functional units: Integer Add, Logical, Shift, and Population Count. The Scalar Component also has access to three functional units that are shared with the Vector Component: Floating Add, Multiply, and Reciprocal Approximation.

Because the scalar component has its own integer units, it can always execute integer operations in parallel with a vector operation. However, for floating point, the vector unit takes priority.

Vector Component

There are 8 64-word vector registers in the vector component. It takes four memory loads to fill a vector register. Normally, this would require 16 instruction cycles. However, careful pipelining in the memory unit reduces the time to just 11 cycles.

A vector mask register contains a bit-map of the elements in a register operand that will participate in an instruction. A vector length register determines whether fewer than 64 operands are contained in a set of vector operands. Manipulating these values is the primary reason for the population and leading zeros counter.

Vector loads and stores specify the first location, the length, and the stride.

I/O Component

The I/O component has 24 programmable I/O channel units. I/O has the lowest priority for memory access.

Cray X-MP

Extended the Cray-1 architecture to 4-way multiprocessing.

Cycle reduced to 8.5 ns (117 MHz)

Increased instruction buffers to 32 words

Added a multiport memory system.

Redesigned the vector unit to support arbitrary chaining.

Added Gather/Scatter to support sparse arrays.

Increased memory to 16 M words, 32-way interleave

Provides a set of shared registers to support fine-grained (loop-level) multiprocessing. There are N+1 sets of these registers for an N-processor system. They include 8 address registers, 8 scalar registers, and 32 binary semaphores.

The I/O system was improved and a solid state disk cache was added.

Cray Y-MP

Extends the X-MP architecture to 8 processors.

Cycle reduced to 6 ns (166 MHz)

Extends memory to 128 M words

Cray 2

One foreground and four background processors.

4.1 ns cycle (244 MHz)

Up to 256 M words of memory

64 or 128 way interleave depending on configuration

Eliminates the spill registers in favor of a 16K word cache

Cache feeds all three computational components with 4-cycle access time

Has 8 16-word instruction buffers

Foreground processor controls the I/O subsystem, which has up to 4 high speed communication channels (4 Gb/s).

Practical Considerations in Vector Supercomputer Design

To achieve such high speeds, high-power (i.e. hot) drivers are employed, signals are detected with specialized analog circuits, conductors are all shielded and precisely tuned in both impedance and length, and data is encoded with error-correcting codes so that losses can be recovered.

In addition, the circuits are usually designed to operate in balanced mode so that there is no change in power drawn as drivers switch. As one driver switches from low to high, another switches from high to low, so that the power supply sees a DC load and there is no coupling of switching noise back into the logic via the power supply. In addition, using balanced signal lines can increase the signal to noise ratio by 6dB, although these are not often used. In a design such as the Cray-1, roughly 40% of the transistors supposedly do nothing but balance the power loading.

Even so, these machines dissipate large amounts of heat. The IBM 3090 used special thermal conduction modules in which a multichip substrate is mounted in a carrier with built-in plumbing for a chilled water jacket. CDC used a similar system in its designs, and in one instance a maintenance crew pumped live steam through the building air conditioning system (to clean the lines), which crossed over to the processor, with predictable results. This raises the issue that these machines usually need thermal shut-down systems, and possibly even fire suppression gear.

The Cray-1 series uses piped freon, and each board has a copper sheet to conduct heat to the edges of the cage, where freon lines draw it away. The first Cray-1 was in fact delayed six months due to problems in the cooling system: lubricant that is normally mixed with the freon to keep the compressor running would leak through the seals as a mist and eventually coat the boards with oil until they shorted out.

The Cray-2 is unique in that it uses a liquid bath to cool the processor boards. A special nonconductive liquid (Fluorinert) is pumped through the system and the chips are immersed in this. Special fountains aerate the liquid, and reservoirs are provided for storing the liquid when it is pumped out for service. This is somewhat reminiscent of the oil cooling bath that was sometimes used in magnetic core memory units.

The ETA-10 was originally going to use a liquid nitrogen bath, but I believe this turned out to be too difficult to implement (on a side note, I have known scientific labs where the researchers deal with cooling problems in air-cooled machines by opening a tank of liquid nitrogen at the inlet, so that the supercold vapor is drawn inside, but that's not quite the same).

When Lawrence Livermore National Labs announced that it would henceforth buy no more vector supercomputers, that line of architectural development essentially came to an end. The handwriting was clearly on the wall for this breed of system, and all of the major manufacturers moved to parallel processing.

Parallel Vector Designs

Cray Research shifted its focus to a parallel processor called the T3D that employed a 3-dimensional torus topology to connect up to 1024 DEC (later Compaq) Alpha processors. This is really a major departure for Cray, and is essentially an entry into the multicomputer market. It required a dedicated Y-MP or C-90 to serve as a host. The next generation (T3E) eliminated the need for the expensive front end. In 1996, Cray Research was bought by Silicon Graphics, which used some of the T3E network technology in its Origin scalable shared memory multiprocessor architecture. In 2000, Cray was again sold to Tera Computer, which makes the Multithreaded Architecture supercomputer. Cray/Tera markets their own machines as well as a Fujitsu supercomputer.

Cray Computer is a separate company that was formed by Seymour Cray before the SGI sale. It began building the Cray 3, as a successor to the Cray 2 architecture. In addition, it was to have certain banks of memory that were also SIMD processors. The SIMD processors, called Processor In Memory (PIM), are simple bit-serial devices arranged in a mesh with only a serial port between adjacent chips. The SIMD processor relies on the Gather-Scatter operation of the Cray 3 to provide routing capabilities.

The Cray-3 was not completed due to technical and financial problems. After reorganizing, Cray Computer began the design of the Cray-4, but this work ended with the untimely death of Seymour Cray as a result of injuries sustained in an automobile accident in 1996.

Fujitsu took a less radical approach by building a parallel processor out of up to 222 vector processors assembled via a crossbar. This has the advantage that it retains some software compatibility with existing vector codes, but has limited scalability. NEC also continues to build vector machines, which it aggregates into parallel processors. In 2002, an NEC SX-6 with 5120 processors became the world's fastest supercomputer, with over 40 trillion floating point operations per second. The machine is used for weather and climate modeling.

Vector Accelerators

Not all vector processing systems are supercomputers. One can buy add-on vector coprocessors for workstations. At one point, most of these were custom designs. They then went through a period in which they were largely based on Intel i860 processors, which have vector capability. However, because Intel dropped support for the i860, the designers had to look elsewhere, and many selected the PowerPC as a replacement. The later versions of the PowerPC have a small vector unit, called AltiVec, that is especially useful for signal processing and graphics.

Kai Hwang's architecture book mentions the Stardent in this regard -- a machine whose architecture resulted from a corporate merger between Stellar and Ardent. Stellar had a custom pipelined vector processor, while Ardent had built an image rendering engine. The company has since failed. However, it might be argued that some of the products from SGI fill this particular market niche today. HP also produces machines that offer vector support. The niche is shrinking, however, as mainstream microprocessor makers are adding vector graphics support for game systems, and these offer enough of the same class of performance to satisfy some customers who would have previously been willing to pay a premium for graphics acceleration.

Another tradeoff that must be examined is whether backing off to a slightly slower but much less expensive processing node, with a correspondingly less expensive network can lead to a system with higher performance at lower cost. In the next section, we see this approach taken to its extreme.

SIMD Architectures

In a multicomputer, the nodes each replicate a von Neumann architecture: memory, instruction decoder, address decoder, control unit, ALU. If the same operation is being applied to many data values, as in vector operations, then the duplication of the instruction decoder, control unit, and program memory is redundant.

In a SIMD array, only one control unit and one program memory are employed, and so far more resources can be applied to the actual computational elements (usually increasing their number).

One philosophy of SIMD design has been to provide the maximum number of elements -- up to 256K have been built in one machine, although 16K is more typical. This philosophy can drive the designer to use very simple processors, each with a small amount of data memory, in order to increase the number of elements.

Thus, many machines have been built with one-bit ALUs -- the MPP, DAP, CM-2, GAPP, CLIP-4, CAAPP, Pixel Planes, etc. For these machines, a byte add takes 8 operations. However, if it can be done on 256 values at once in a chip, then the effect is the same as if a microprocessor were doing 32 byte adds in each of those 8 cycles -- a factor of 4 more processing for similar silicon area. Recent microprocessors have approached this shortfall through the use of internal SIMD processing in the form of multimedia instructions that can split a data word into bytes or 2-byte units and operate simultaneously on the separate portions of a word. What is more important, however, is that the SIMD architecture scales -- 100 SIMD chips would deliver 100 times the power, whereas 100 microprocessors would likely fall far short of that figure.
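The eight-operation byte add can be sketched in Python as follows. This is a simulation of the idea rather than the microcode of any of the machines named above: each list element stands for the memory of one processing element (PE), and each pass of the outer loop stands for one instruction broadcast to all PEs.

```python
def bit_serial_add(a_values, b_values, width=8):
    """Add two lists of unsigned integers one bit position at a time.

    Each list element models the data held by one PE. Every iteration of
    the outer loop is one broadcast instruction executed by all PEs.
    """
    n = len(a_values)
    carry = [0] * n
    total = [0] * n
    for bit in range(width):                  # one array instruction per bit
        for pe in range(n):                   # conceptually simultaneous
            a = (a_values[pe] >> bit) & 1
            b = (b_values[pe] >> bit) & 1
            s = a ^ b ^ carry[pe]             # full-adder sum
            carry[pe] = (a & b) | (carry[pe] & (a ^ b))  # full-adder carry
            total[pe] |= s << bit
    return total  # result is modulo 2**width; the final carry is dropped
```

The number of outer iterations depends only on the word width, not on the number of PEs, which is why 256 PEs finishing in 8 steps amount to 32 byte adds per step.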

Other machines, such as the MasPar MP-1 and MP-2, have wider data words (4 and 32 bits, respectively). This reduces the number of processors in the system, but more operations take place on each cycle (assuming that one can use 32-bit operations all of the time -- obviously, a 32-bit SIMD array is no faster than an 8-bit array at processing 8-bit pixel values, for example).

Programming Model

SIMD has one of the simplest programming models among parallel processors. Because there is just one thread of control, the model is similar to sequential processing. There is no synchronization to consider, no coherency to maintain. However, SIMD has been faulted for being difficult to program. Why?

Because to effectively use it, programs must be recoded. Even though the recoding may simplify the program, it still requires more initial human effort than an automatic vectorizer.

SIMD programs employ parallel data types. These are usually arrays, but may also be lists, trees, etc. Operations are specified on the arrays as a whole, on slices of arrays, scalars and arrays, and over relative neighborhoods of elements in arrays.

Thus we might define three parallel arrays A, B, and C and write an operation such as

A := B+C

to add them, or

A := B + B(North) + B(South) + B(East) + B(West)

to create a new array A containing the sum of each element of B with its four neighbors.
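The neighborhood operation can be mimicked on ordinary nested lists, as in this Python sketch. Two assumptions are made for the example: North means the row above, and neighbors that fall outside the array read as zero. The helper names are illustrative.

```python
def neighbor(grid, d_row, d_col, fill=0):
    """Return an array whose (r, c) element is grid[r + d_row][c + d_col],
    or `fill` where that neighbor lies outside the array."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[r + d_row][c + d_col]
             if 0 <= r + d_row < rows and 0 <= c + d_col < cols else fill
             for c in range(cols)]
            for r in range(rows)]

def add(*grids):
    """Elementwise sum of equally shaped arrays."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*grids)]

NORTH, SOUTH, EAST, WEST = (-1, 0), (1, 0), (0, 1), (0, -1)

B = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
# A := B + B(North) + B(South) + B(East) + B(West)
A = add(B, *(neighbor(B, dr, dc) for dr, dc in (NORTH, SOUTH, EAST, WEST)))
```

On a real SIMD array each call to `neighbor` corresponds to one shift over the mesh network, and each addition is a single broadcast instruction.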

Most SIMD arrays also define the notion of Activity -- a binary mask array associated with every operation. The contents of the Activity array specify whether the corresponding locations in the data arrays take part in an operation.

For example, we might write

Activity := Odd_Columns

WHERE A > A(East) THEN Swap(A, A(East))

The elements in the odd numbered columns test to see if they are greater than their neighbors to the East, and if so they swap values. This shows two ways of setting activity: explicit setting of a pattern, and implicit activity resulting from a comparison.
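A Python sketch of this masked compare-and-swap follows. It assumes that columns are numbered from 1, so the odd columns are the 0-based indices 0, 2, 4, and so on; the function name is made up for the example.

```python
def swap_step(grid, active_columns):
    """One masked compare-and-swap with the East neighbor.

    `active_columns` plays the role of the Activity mask: only elements in
    those columns test A > A(East); where the test holds, the pair swaps.
    The active columns must not be adjacent, so that the pairs are disjoint
    and this sequential loop gives the same result as a parallel step.
    """
    result = [row[:] for row in grid]
    for row in result:
        for c in active_columns:
            if c + 1 < len(row) and row[c] > row[c + 1]:
                row[c], row[c + 1] = row[c + 1], row[c]
    return result

grid = [[4, 3, 2, 1, 0]]
odd_columns = range(0, 5, 2)   # columns 1, 3, 5 when counting from 1
stepped = swap_step(grid, odd_columns)
```

Alternating this step between the odd and the even columns gives an odd-even transposition sort along each row.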

Branching

The WHERE statement usually has a two-branch form

WHERE A >= 0 THEN B := A ELSEWHERE B := -A

Because there is just one instruction stream, the processor selects elements where the first condition is true and performs the branch, then selects elements where the condition is false and performs the other branch. Thus, the time to execute a branch in a SIMD system is equal to the sum of the times for the two branches.
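The following Python sketch traces how the two-branch WHERE above would be executed. The step counter is a made-up cost measure that charges one unit for the comparison and one for each branch.

```python
def where_elsewhere(a):
    """B := A where A >= 0, elsewhere B := -A, as a SIMD array would run it."""
    b = [None] * len(a)
    steps = 0

    activity = [x >= 0 for x in a]          # the comparison sets the activity mask
    steps += 1
    for i, active in enumerate(activity):   # THEN branch, broadcast to everyone
        if active:
            b[i] = a[i]
    steps += 1
    for i, active in enumerate(activity):   # ELSEWHERE branch, mask inverted
        if not active:
            b[i] = -a[i]
    steps += 1
    return b, steps
```

The step count is the same whether the elements are all positive, all negative, or mixed, because both branches are always broadcast.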

It is possible to test whether any elements need to be processed in a particular branch, but the odds are high in a massively parallel system that at least some elements require each branch. Only when the cost of executing a branch is great and the likelihood of a null set of elements is high is it worth taking the time to test whether the branch needs to be executed.

For example, if the probability of a null branch set is 0.0001, and the cost of the test is five instructions, then the branch must cost at least 50,000 instructions to make the test worthwhile, because the odds are that the test will fail 10,000 times before the rare case where the branch is skipped is encountered. The larger the processing array, the lower the probability that the set is empty.
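The break-even point is a one-line calculation, sketched here in Python. The test is paid for on every execution and saves the whole branch only with the probability that the branch set is empty, so it pays off once that probability times the branch cost reaches the test cost. Exact fractions are used to avoid rounding; the function name is illustrative.

```python
from fractions import Fraction

def break_even_branch_cost(test_cost, p_empty):
    """Smallest branch cost at which testing for an empty branch pays off.

    The test costs `test_cost` every time and saves the whole branch with
    probability `p_empty`, so it is worthwhile when
    p_empty * branch_cost >= test_cost.
    """
    return Fraction(test_cost) / Fraction(p_empty)

# A five-instruction test at several probabilities of an empty branch set.
for one_in in (1_000, 10_000, 100_000):
    print(one_in, break_even_branch_cost(5, Fraction(1, one_in)))
```

Each factor of ten in the probability moves the break-even branch cost by the same factor.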

Obviously, a structure such as a parallel CASE statement is expensive to execute in SIMD because all of the branches must be done sequentially. The reason, of course, is that the instruction decoder on each chip is shared among all of the processors on the chip.

We could consider a design in which the instruction decoder is duplicated some number of times so that a fixed number of instruction streams can be issued in parallel to the processors. As long as the branches in a SIMD control structure are mutually exclusive, it is simply a matter of determining which instruction stream each processor should "listen to."

The problem, however, is that such a scheme requires the array control unit to issue multiple streams in parallel, making the instruction broadcast path wider, and placing more burden on the compiler to arrange the instruction streams optimally.

An interesting observation in the VLSI design for such a scheme is that it is cheaper to replicate the ALUs than it is to try to somehow reroute the decoded control signals for the multiple streams to a single ALU.

Indexing

Suppose you have a two-dimensional array and you want to use a one dimensional array to index into the second dimension of the 2-D array to extract a 1-D array. In a sequential machine, this is no problem because each access is independent, and the address decoder is used repeatedly.

In a SIMD array, however, there is just one source of addresses -- the control unit. Thus, an indexed lookup such as this must be done by broadcasting each index sequentially. At first you might think that this means broadcasting the index array. However, if the index array is already in the processor array, then we just have to broadcast the index values. That is, if it is an 8-bit index value, then 256 broadcasts are required.
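The broadcast scheme can be sketched in Python as below. It assumes that each PE holds one row of the 2-D array in its local memory along with its own index value; the names are made up for the example.

```python
def simd_indexed_lookup(local_memory, index, index_bits=8):
    """Each PE p ends up with local_memory[p][index[p]].

    The control unit is the only source of addresses, so it broadcasts
    every possible index value in turn; a PE is active only on the
    broadcast that matches its own index.
    """
    result = [None] * len(index)
    broadcasts = 0
    for address in range(2 ** index_bits):       # one broadcast per value
        broadcasts += 1
        for pe, wanted in enumerate(index):      # conceptually simultaneous
            if wanted == address:                # activity = (index == address)
                result[pe] = local_memory[pe][address]
    return result, broadcasts
```

For an 8-bit index the outer loop runs 256 times regardless of how many PEs there are.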

Even so, local indexing is still a very costly operation. But it is also costly to provide support for indexing in hardware. To see why, we return to the VLSI floorplan of the SIMD processor chip. In order to index into the memory, each processor must have an address decoder. Such a circuit structure is usually larger than the individual memory. Thus, the amount of memory and the number of processors on a chip is greatly reduced.

In addition, if indexing is to operate on external memory, then the index must be transferred off of the chip to the external memory, and the outside memories must be addressed separately. In a simple SIMD system, one of the simplifying factors is that the external memory address can be broadcast to all memories. If each memory must receive a broadcast address as well as a local address, the circuitry is further complicated.

Also, indexing may require additional external pins proportional to the number of processors. But as we know, while the number of processors is proportional to chip area, pinout is proportional to chip circumference.

In practice, two approaches have been used to support local indexing. In the CM-2, the address was written out to a fixed memory location and the router, which had access to the external memory, simply took over the memory and sequentially rearranged the data so that another fixed location would contain the indexed values.

In the MasPar processors four addresses could be output at once and the memories operated in a special double-speed fetch mode so that 8 index values could be fetched per cycle. Thus, four cycles were required to perform an indexing operation for the 32 processors on a single chip.

In either case, the operation is still fairly costly. The CM-2 approach is sequential and the MasPar approach works only because the number of processors on a chip is limited.

Communication

By its nature, communication in a SIMD system is synchronous. Every processor simultaneously transfers data to its nearest neighbor in a given topology. Typical topologies for SIMD arrays are meshes, although others have been employed.

CLIP-4 had a hexagonal grid

Pixel Planes and MasPar used an X-net and a variation on a Clos network.

CM-1 used a mesh and hypercube, CM-2 dropped the mesh

ASP uses a comb structure

CAAPP, DAP, CLIP, Illiac III, Polymorphic Torus used a reconfigurable mesh

In the CM and MasPar the SIMD communication network was augmented with a powerful asynchronous router network. In the CM, the network consisted of a 12-D hypercube linking groups of 16 processors. Each router node was an independent finite state machine that could pass data from point to point -- essentially a MIMD computer with a fixed program. The router node also contained an arithmetic unit so that it could combine values during routing. The router and the processors did not operate in parallel, however -- essentially, the router's asynchronous operation was masked as a very long instruction cycle. In the MP-2, three stages of crossbar network were used to provide a similar routing capability. The MP-2 router also connected to a high-speed I/O subsystem.

Reconfigurable mesh networks provide a different form of communication by allowing simultaneous broadcasts within a group of elements. If the network also supports a wired-OR broadcast, then parallel maximum and minimum operations can be supported.
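One commonly described way to obtain a maximum from a wired-OR is to scan the values bit-serially from the most significant bit, as in the Python sketch below. It is offered as an illustration of the idea under that assumption, not as the procedure of any particular machine; `any` stands in for the wired-OR line.

```python
def wired_or_max(values, width=8):
    """Find the maximum of unsigned values using only a global OR.

    Scanning from the most significant bit, every remaining candidate puts
    its current bit on the wired-OR line. If the line reads 1, candidates
    holding a 0 in that position cannot be the maximum and drop out.
    """
    candidates = [True] * len(values)
    maximum = 0
    for bit in reversed(range(width)):
        asserted = [c and bool((v >> bit) & 1)
                    for c, v in zip(candidates, values)]
        if any(asserted):                 # the wired-OR response
            maximum |= 1 << bit
            candidates = asserted         # elements with a 0 here drop out
    return maximum, candidates
```

The loop takes one step per bit of the word, independent of the number of elements, and the surviving candidates mark where the maximum is held. Complementing the bits gives the minimum in the same way.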

Global Summary

While a router or the SIMD network can be used to reduce an array to a scalar value, the time required is often significant. Thus, a dedicated tree network is often used to provide a global OR of a bit array. This Some/None response from the array can be quickly computed and is often the basis for global termination decisions for iterative algorithms.

WHILE Some DO

            Continue processing

In addition, a global count of the bit array can be useful. It provides a population count for compressing sparse arrays; if the bits in a scalar value are counted and summed appropriately, a global add-reduce operation can in some cases be computed more quickly than through a combining network; and it allows a histogram to be computed in time proportional to the number of buckets.
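The histogram use of the global count can be sketched in Python as follows. Here `global_count` is a stand-in for the hardware count of a bit array, and each bucket costs one parallel comparison plus one count.

```python
def global_count(bits):
    """Stand-in for the hardware population count over the whole array."""
    return sum(1 for b in bits if b)

def histogram(values, n_buckets):
    """One compare plus one global count per bucket, whatever the array size."""
    return [global_count([v == bucket for v in values])
            for bucket in range(n_buckets)]
```

The outer loop runs once per bucket, so the number of steps does not grow with the number of elements in the array.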

Associative Processing

One of the simplest forms of SIMD array is the content addressable parallel processor (CAPP). This is a content addressable memory that has the ability to broadcast write through a word-wide trit mask (0/1/don't care) into memory with an activity flag (often the same as the comparison response tag) controlling where the writes occur. In addition, a Some/None response is needed to control some loops.

Even though there are no ALUs in such a system, it can be programmed to perform arithmetic bit-serially. Of course, it can also execute a bit-parallel comparison with full parallelism. Because so many operations in a CAPP end up being bit-serial anyway, it makes more sense to provide a bit-serial ALU and use normal RAM instead of CAM. At that point, the processor essentially becomes a SIMD array without a communication network. Adding a network completes the conversion, and associative processing becomes more of a processing paradigm than an architecture. The only difference may be an emphasis on fast global summary of comparison responses.

In a reconfigurable mesh such as the CAAPP's, which supported wired-OR within network partitions, each partition can be considered as a separate associative processing array. Although all of the partitions operate under the same instruction stream, they can locally broadcast and summarize responses. This mode of processing is called multiassociative, and supplies a spatially distributed form of indexing that allows queries such as

How many red cars are there in each state?

Where the response is a sparse array of 50 counts that have all been computed in parallel. The CM router could perform a similar operation with a combining reduction to an array, but would then have trouble reversing the operation and multicasting the result to all of the elements with corresponding state values.

Furthermore, an operation such as labeling the pixels in each region of an image is straightforward in this scheme, but requires more complexity with a router.

Instruction Issue

One of the critical aspects of SIMD array design, as you might guess from previous observations of broadcast resources, is the distribution of instructions and addresses to all of the processors. By its nature, a SIMD array requires a high degree of fan out for instruction broadcast. This implies multiple levels of buffering, which in turn leads to high latency, and the potential for clock skew. Because the processors in a SIMD array are tightly coupled, they do not tolerate clock skew well.

For example, a machine with a 100 ns clock that can tolerate only 5% skew must keep all clock signals aligned to within 5 ns -- difficult to achieve when the signals must pass through several levels of logic that all have 1 - 2 ns variations.

Thus, most SIMD arrays must operate at a low clock rate. The majority have been in the 4 to 10 MHz range. After much effort, the DAP was eventually driven to 20 MHz, but it had only 1K to 4K processors.

Even without the clock skew problem, SIMD arrays suffer from an instruction dilation problem. In order to issue the instructions to the array, the control unit must compute data addresses and insert them into the array instructions, fetch and decode its own instructions, read status signals from the array, handle branches, and actually transmit the array instructions. This typically requires 20 instructions on a standard microprocessor. Thus, in order to issue instructions at a 20 MHz rate, the control unit would have to execute at 400 MIPS sustained performance, doing nothing else. Even a 2 GHz processor would be challenged to actually deliver such a sustained level of performance. And, of course, with modern technology, it would be desirable to run the SIMD array at a much higher clock rate.

The other alternative is to use a wide microprogrammed processor, something like a VLIW machine, but where each portion of the instruction word drives either a piece of the control unit (instruction fetch, data address calculation, issue, status check, etc.) or the array (embedded array opcodes). These operations take place both in parallel and in a pipelined manner in order to sustain the 20 MHz issue rate without resorting to hot logic such as ECL or GaAs technologies.

Of course, such an architecture is difficult to program, and is costly to implement because of the wide memory interface. Without going to an exotic design, it is difficult to drive such a wide interface at very high speeds.

Asynchronous SIMD

Dealing with the problem of accelerating SIMD processing requires addressing two issues:

Reducing the impact of clock skew

Reducing the instruction issue rate while increasing clock speed

The simplest way to reduce the impact of clock skew is to make the processors asynchronous between chips. Communication between chips is buffered and uses handshaking.

Another way is to reduce the distance between the processors through packaging such as multichip modules, but this has practical limitations.

Reducing the instruction issue rate requires that each instruction be expanded to multiple clock cycles on the chips. One way to do this is to make instructions more CISC-like. Another is to use bit-serial processors and perform the bit-op sequences for other types on chip automatically (e.g., an 8-bit add is one instruction that executes for 8 cycles). Another approach is to use a superscalar or VLIW approach where the operations are done in sequence.
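
As a rough illustration of the bit-serial case, the sketch below models a single add instruction that the processing element expands into one full-adder step per cycle. It is a software model under simple assumptions (unsigned operands, least significant bit first), not a description of any particular machine's ALU.

```python
def bit_serial_add(a, b, width=8):
    """Add two unsigned integers one bit per cycle, least significant bit first.

    Returns (sum modulo 2**width, carry out, cycles used).
    """
    carry = 0
    result = 0
    cycles = 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        sum_bit = abit ^ bbit ^ carry                    # full-adder sum
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # full-adder carry
        result |= sum_bit << i
        cycles += 1                                      # one bit-op per clock
    return result, carry, cycles


total, carry_out, cycles = bit_serial_add(100, 57)
print(total, carry_out, cycles)
```

The controller issues one instruction, and the array stays busy for eight cycles, so the issue rate can be one eighth of the array clock rate for this operation.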

Finally, a way of significantly reducing the instruction issue rate is to recognize that most SIMD arrays must virtualize to the size of the data arrays by processing tiles in sequence. If an instruction buffer captures the instructions for one tile and can then reissue them locally for all other tiles, then it reduces the instruction issue rate.
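
The effect of such a buffer can be sketched with a small simulation. Everything here is illustrative: the program is a list of element-wise Python functions standing in for array instructions, and the function name run_tiled is made up for this example.

```python
def run_tiled(data, num_pes, program, use_buffer):
    """Apply every op in program to every element, one tile of num_pes elements at a time.

    Returns (results, number of instructions broadcast by the global controller).
    """
    tiles = [data[i:i + num_pes] for i in range(0, len(data), num_pes)]
    broadcasts = 0
    buffer = []
    out = []
    for t, tile in enumerate(tiles):
        if use_buffer and t > 0:
            ops = buffer                  # replayed locally, no controller traffic
        else:
            ops = program
            broadcasts += len(program)    # controller issues each instruction
            buffer = list(program)        # captured while the first tile runs
        for op in ops:
            tile = [op(x) for x in tile]
        out.extend(tile)
    return out, broadcasts


program = [lambda x: x + 1, lambda x: x * 2]
data = list(range(16))                    # 16 elements on 4 PEs gives 4 tiles

print(run_tiled(data, 4, program, use_buffer=False)[1])
print(run_tiled(data, 4, program, use_buffer=True)[1])
```

With four tiles, the buffered version needs a quarter of the controller broadcasts while producing the same results.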

If a processor is assumed to virtualize to a minimum degree, then the communication between chips is greatly reduced because it can be orchestrated so that tiles that send data off the chip can be processed first and tiles that receive data from off the chip can be processed last. Thus, the latency of communication can be hidden during the time that tiles which communicate only with others within the chip are processed.
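
One way to picture this ordering is the sketch below. It assumes, for simplicity, that each tile has exactly one role ("send", "local", or "recv") and that all tiles take the same time to process; the names and the timing values are hypothetical.

```python
ROLE_ORDER = {"send": 0, "local": 1, "recv": 2}


def schedule(tiles):
    """Order (tile_id, role) pairs: senders first, local-only tiles next, receivers last.

    sorted() is stable, so tiles with the same role keep their original order.
    """
    return sorted(tiles, key=lambda tile: ROLE_ORDER[tile[1]])


def latency_hidden(tiles, tile_time, link_latency):
    """True if the local-only tiles take at least as long as the off-chip transfer."""
    local_tiles = sum(1 for _, role in tiles if role == "local")
    return local_tiles * tile_time >= link_latency


tiles = [(0, "recv"), (1, "local"), (2, "send"), (3, "local")]
print(schedule(tiles))
print(latency_hidden(tiles, tile_time=10, link_latency=15))
```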

Synchronous MIMD (SPMD)

Asynchronous SIMD is still limited to sequentializing branches and indexed addressing. If these restrictions are lifted so that each node can execute a different instruction sequence, and do local indexing, then a local instruction decoder, program memory, and address decoder must be added to the data memory and ALU.

The result is a node that is much like a microprocessor. However, a global controller still maintains synchronization at a coarse level of granularity among the nodes. Thus, such a machine is a synchronized MIMD multicomputer. The only advantage of this approach over a typical multicomputer is that the programming model is simplified somewhat.

The CM-5 could contain up to 16K SPARC processors, each with 32 MB of memory and an optional vector coprocessor. Sun workstations acted as control processors to manage the program execution in the array. The distinguishing factor between the CM-5 and a multicomputer was that in addition to a data network, the CM-5 had a control network to support global synchronization, broadcast, and coordinated communication such as scan. The data network was a truncated 4-ary fat tree topology. The control network was a complete binary tree.

It operated in either a tightly-coupled SIMD mode, a multiple SIMD mode in which the array is partitioned, or synchronized MIMD mode.

The optional vector units were attached to the bus between memory and the main SPARC processor. Unfortunately, this meant that data had to be routed first to the SPARC and then transferred through the vector unit to memory before it could be processed.

The control network and use of a SIMD model simplify programming and reduce the processor overhead for synchronization. However, the cost of such a system is even greater than that of a multicomputer. None of the SIMD cost savings is realized because all of the units are replicated for every node.
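
To make the scan operation mentioned above more concrete, here is a generic tree-structured exclusive prefix sum (an up-sweep followed by a down-sweep over an implicit binary tree). This is a textbook formulation run sequentially in software; it is not a description of how the CM-5 control network implemented scan, and it assumes the input length is a power of two.

```python
def tree_scan(values):
    """Exclusive prefix sum via up-sweep and down-sweep over an implicit binary tree."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(values)

    # Up-sweep: each internal node accumulates the sum of its subtree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: push prefix totals from the root back toward the leaves.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a


print(tree_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

Each level of the tree can operate on all of its nodes at once, so with parallel hardware the number of steps grows with the depth of the tree rather than with the number of elements.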

TMC went bankrupt in 1994, and later reorganized. They offered a single-chassis version of the CM-5 (called Scale 3), that was compatible with the larger machines. Later they dropped hardware production and became a technology holding company.

Traditional microprocessors have trouble operating in this manner, and there is extra logic and software required to force them into a SIMD mode.

An alternative is to design a special processor node for SPMD operation. This is the idea behind the Analog Devices SHARC processor, which is a DSP-based architecture that can either process instructions from a local memory, or take them from a global controller. The original version of the SHARC included 2 MB of on-chip SRAM to allow it to be configured as a SIMD array consisting only of SHARC processors and a controller. Each SHARC chip also had six 40 MB/s DMA I/O ports to enable connection into a 3D mesh topology. It has separate integer and floating-point adders and multipliers and can operate at 60 MHz for up to 240 MOPS. Newer versions of the SHARC are still in development.

Cluster Supercomputers

As we have seen, many different approaches have been taken to achieving performance through parallelism. The variety of viable approaches has dwindled as microprocessor performance has grown. Partly, the microprocessors have displaced some of the customer base for expensive custom machines. The software base and issues of portability also work against custom architectures. And, the cost of fabrication of state-of-the-art technology makes it difficult for minor players to compete.

Clustering commodity PCs also provides a low-cost form of parallelism that meets the needs of many supercomputer users. Thus, there are just a few critical problems that really need the fine-grained parallelism of vector or SIMD processors, and these are not enough to support an industry to build the hardware. It has been suggested that such systems might eventually be viable only as national-scale projects and research resources, the way that particle accelerators are constructed for nuclear physics research.

Clusters are typically built from off-the-shelf PC boxes. Usually, a node consists of a uniprocessor or a small scale SMP with up to 8 processors. In the simplest systems, built-in Ethernet (e.g., 100 BaseT) is used as the communication mechanism, with IP network routers serving as hubs that are connected via higher-bandwidth links. Public domain software packages such as Beowulf are used to manage the system, and message passing libraries such as MPI in conjunction with C or Fortran provide the programming model.

Home-grown clusters typically have 64 to 256 processors. One of the limitations on scaling up is that commodity PCs are sufficiently unreliable that a collection of 64 four-processor boxes has a very short mean time to failure. Thus, the cost of maintenance and the down-time quickly reach unacceptable levels for an installation that is meant to be low-cost.
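
A back-of-the-envelope model shows why the mean time to failure shrinks so quickly. The sketch assumes that boxes fail independently at a constant rate, so failure rates add and the system figure is the per-box figure divided by the number of boxes. The 10,000-hour per-box value is purely hypothetical, chosen only to illustrate the scaling.

```python
def system_mttf_hours(node_mttf_hours, num_nodes):
    """Mean time until the first failure anywhere in the cluster.

    Assumes independent nodes with constant (exponential) failure rates.
    """
    return node_mttf_hours / num_nodes


NODE_MTTF_HOURS = 10_000  # assumed value for one PC box, not a measured figure

for boxes in (1, 16, 64, 256):
    hours = system_mttf_hours(NODE_MTTF_HOURS, boxes)
    print(f"{boxes:4d} boxes: about {hours:.1f} hours ({hours / 24:.1f} days) between failures")
```

Under these assumptions, a box that fails about once a year on its own becomes a cluster of 64 that needs attention roughly once a week.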

A next level up from the PC cluster is to use better quality PCs or workstations as the nodes. The latter may offer a per-node boost in performance that is likely to be more significant than the gains from parallelism for many applications. These are typically connected via a higher-performance network such as gigabit Ethernet or Myrinet. The software model remains the same, as these are still home-grown configurations. The higher-quality nodes offer greater mean time to failure, and thus configurations with hundreds to thousands of processors are feasible.

Above this intermediate cost/performance configuration are the vendor-supplied multicomputers, machines like the IBM SP and SGI Origin. The former is a message passing cluster-style system, and the latter uses shared memory as its model for communication. In the IBM SP, for example, 64-bit processors are arranged as shared memory multiprocessors with 2 to 8 processors. These are connected via a high-performance, low-latency multistage switching network that is supported with light-weight OS primitives. Extensions of the basic MPI library are supplied that perform common parallel algorithms using code that is tuned to the hardware configuration. Even though the processors are commodity parts (used in workstations and small mainframes), they are still expensive. The network and housings for the processor chassis are also more costly. Thus, such a system may be more than an order of magnitude more costly than one built from PCs. Users of these systems value the higher degree of scalability, the added reliability, the more extensive software support, and the higher performance of the nodes and of the communication network.

Practical Considerations for Large Scale Clusters

Just as vector supercomputers had to overcome aspects of physics that are typically ignored in smaller scales of computing, large scale clusters run into similar obstacles. For example, we think nothing of the power and cooling for a workstation that consumes 100 watts. It is on the order of a light bulb in terms of its impact. But when we assemble 9,612 of them, as in the Intel ASCI Red teraflop system, suddenly we have to think about supplying nearly a megawatt of power, and drawing off a similar amount of waste heat. With peripherals (disk and tape farms, graphics displays, printers, etc.), the ASCI Red installation needed 1.6 megawatts of power, and air conditioning units that were as large as the room-sized machine itself. (As a side note, the cooling requirements for vector supercomputers were not as high because they used chilled fluid coolants, which are more efficient than air cooling -- but commodity processors aren't built this way.)
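
The power figures above follow from simple multiplication, sketched here. The function name is illustrative; the inputs are the numbers quoted in the text, and the remainder is just the difference between the quoted installation total and the estimate for the processors.

```python
def total_power_kw(units, watts_per_unit):
    """Aggregate power draw in kilowatts for identical units."""
    return units * watts_per_unit / 1000.0


processors_kw = total_power_kw(9612, 100)   # 9,612 units at 100 W each
installation_kw = 1600.0                    # the 1.6 MW quoted for the whole installation
remainder_kw = installation_kw - processors_kw

print(f"processors: {processors_kw:.1f} kW")
print(f"everything else: {remainder_kw:.1f} kW")
```

Essentially all of this power ends up as heat that the air conditioning has to remove.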

Large cluster computers thus require large buildings that are either custom-built or heavily modified. The computer sits on a raised floor with two feet of space below for running the thousands of communication and power cables. The air conditioning must be ducted to the positions where the cabinets will stand so that chilled air can be fed directly into them. Special air handlers are needed for the warm-air return, with automatic dampers that help to ensure an even temperature in different sections of the room.

A transformer yard like a small power substation must be built adjacent to the building, and high-capacity power buses (large bars of metal) carry the power into the building, where it is distributed via heavy-duty power conditioning units. The latter are there to stabilize the power and to avoid voltage swings as the power demands of the processors change. For example, when the processors are idle and go into sleep mode, they may draw only microwatts. Launching a task across the cluster can cause an upsurge of 25 watts per node, which translates into almost 250 kilowatts for a machine the size of ASCI Red. Even changing between an integer-oriented program and a floating-point intensive program could result in a 10 kilowatt change in power consumption. For comparison, the maximum amount of power that can go into a home is only about twice that figure. So imagine how the lights would dim if you could simultaneously turn on every major appliance in a house, and you get a sense of why special power conditioning is needed.

With so much power and heat going through a small space, there is a definite risk of fire, and so the computer room must be equipped with special fire suppression gear, and emergency power disconnection capabilities. The fire system should be of a type that does not damage the electronics; otherwise a small fire could lead to a response that destroys the entire system. Such a scenario occurred at the National Hurricane Center, when a power supply in their Cray caught fire. The responding fire fighters ignored all of the signage and special equipment in the room, and rushed in to douse the whole machine with chemicals that destroyed its circuitry. Because this was after Cray had been sold to SGI, it was impossible to repair or replace the machine, and for the next two years the US had diminished capacity for predicting the paths of hurricanes, until the old software could be rewritten for an IBM SP.

Summary

When applications demand more processing power than is available from commodity processors, there are two directions that can be taken to higher performance. One is to build custom architectures, and the other is to use commodity processors in parallel. In the period from about 1965 to 1995, the popular approach was to build custom architectures. This led to vector supercomputers, SIMD arrays, and SPMD systems. Once microprocessors crossed a capability threshold that allowed them to approach the performance of expensive custom systems, the market shifted its focus in the other direction. Thus, we see today mainly machines that are clusters of standard processors in one form or another.

With either type of supercomputer, the physical implementation often faces additional practical considerations that drive up the total cost of the installation. These include power, cooling, space, construction, and safety issues.

There remain a few applications that do not yield well to this model of parallelism. Examples include intelligence analysis, certain types of data mining, and simulation of physical processes in which local events have nonlocal effects. Wherever the computations must be tightly coupled across entire data sets, the relatively high overhead of communication in a cluster limits the degree to which parallelism can be applied. Should the pressure to address those applications grow sufficiently, then there may be a return to research in alternative architectures.