Lecture 14: Networks and Multicomputers

Networks

In order to communicate, the processors in a parallel architecture must be connected in some manner. There are three principal aspects to any such set of connections:

What is the topology (arrangement) of the connections?

What is the configuration of an individual connection?

What is the data transfer mechanism (the programming model) for a connection or for the whole network?

Networks come in many different topologies, including those that can change their topology, each with its proponents. The links in the network themselves can vary from a single bit-serial data line in a synchronous system to paths that can carry multiple data words and control information in parallel and pass through dedicated intermediate nodes that have more computational power than the processors. The transfer mechanism may vary from a single instruction that moves data to an adjacent processor or shared memory to complex protocols involving significant overhead to establish and manage a dedicated link.

There are several terms that we often see used to describe networks:

Node degree -- the number of links connected to a node (processor)

Network diameter -- The longest minimum path between two nodes (i.e. the distance across the network)

Bisection width -- the number of links that get cut when you slice the network in half

Latency -- the time required to transfer a single datum from one place to another

Bandwidth -- the rate at which data can be transferred (including pipelining)

Symmetry -- the property that the network looks the same from every node

Homogeneity -- all the nodes and links are identical

In his 1984 thesis, Levitan showed that measures such as node degree, bisection width, diameter, etc. are irrelevant in many situations and that simple benchmarks were better predictors of performance than are these measures.

Can you think of situations in which degree does not matter? Diameter? Bisection width?

Symmetry and homogeneity affect the programming model of a network and the amount of overhead that must be invested in handling the special or boundary cases. Of course, if the application shares similar boundary conditions, then this overhead is not wasted.

Latency and bandwidth have the strongest influence of any of these measures on the performance of a network. When does latency dominate and when does bandwidth?

Latency dominates when communication involves large numbers of small transfers that cannot be pipelined. Bandwidth dominates when the transfers are large or can be pipelined, so that the overhead of the first transfer is amortized over time.
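
The trade-off can be made concrete with a first-order cost model. The sketch below is only an illustration: the model (a fixed start-up latency plus size divided by bandwidth), the function name transfer_time, and the numbers for latency and bandwidth are all assumptions, not measurements of any machine.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one transfer: fixed start-up cost plus streaming time."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

LATENCY = 1e-6      # assumed: 1 microsecond start-up cost per transfer
BANDWIDTH = 1e9     # assumed: 1 GB/s link

# 1000 separate 8-byte transfers that cannot be pipelined: latency dominates.
many_small = 1000 * transfer_time(8, LATENCY, BANDWIDTH)

# One 8000-byte transfer: the start-up cost is paid once and amortized.
one_large = transfer_time(8000, LATENCY, BANDWIDTH)

print(f"many small: {many_small * 1e6:.1f} us, one large: {one_large * 1e6:.1f} us")
```

The same 8000 bytes cost roughly a hundred times more when sent as unpipelined small transfers, because the start-up cost is paid on every one of them.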

Let's look at some simple topologies.

Bus

Linear

Ring

Note the subtle distinctions between the bus, linear and ring. The bus has a linear topology, but the network does not pass through the nodes -- they connect to it. While this means that messages do not have to take multiple hops to reach distant nodes, it also implies that even messages to adjacent processors must contend with all others -- there is no exploitation of physical locality in a bus. All operations are essentially broadcasts, and the limited bisection bandwidth forces these to occur sequentially or at best a small constant factor faster than sequential.

In addition to diameter and bisection, buses have a measure called fan out. Because a bus is a passive device, the sending processor must drive the lines of the bus with sufficient current to activate all of the receivers. The number of receivers that can be driven is the fan out of the bus. The typical fan out limitations of a high speed bus are 10 to 20.

Partitionable buses have been proposed, but the speed of a high performance bus derives from the fact that the line lengths can be highly tuned, and partitioning destroys this so that all operations are slowed. In addition, it takes time to repartition a bus. Also, a partition is only useful if it arranges processors into groups that need to communicate. It does no good to partition a bus if the communicating processes end up in different partitions.

Hierarchical Buses

The basic idea here is to cluster processors on a local shared bus up to the saturation limit of the bus. Then connect that cluster through a cache or switch to another level of bus.

The presumption is that the application can be partitioned so that most of the accesses are within the clusters, and a smaller percentage go over the next level of bus. In some applications this assumption holds true, but in many it fails and the system merely saturates the higher level buses.

Rings

The ring is like the linear array, but the diameter of the network is cut in half if the links are bidirectional. If they are unidirectional, then the diameter is the same, but there is still a modest performance gain over the line. Why?

The line must be bidirectional, and must have a protocol to decide which direction to send a message. In a unidirectional ring, however, the direction is fixed and an equivalent number of wires could be used to double the bandwidth (or cost could be reduced by using half as many wires).
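
The diameter claim for bidirectional links can be checked with a small breadth-first search. This is a minimal sketch; the helper names line, ring, and diameter are illustrative, and the ring builder assumes at least three nodes.

```python
from collections import deque

def line(n):
    """Adjacency list of a linear array of n nodes."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Adjacency list of a bidirectional ring of n nodes (n >= 3 assumed)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def diameter(adj):
    """Longest shortest path, found by a breadth-first search from every node."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(diameter(line(16)), diameter(ring(16)))
```

For 16 nodes the line has diameter 15 and the bidirectional ring has diameter 8, which is the "cut in half" effect described above.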

Which of these networks are symmetric? (bus, ring)

In order to reduce the diameter of the ring, we can add links that cross the interior. Whether the links hop 1 or k/2-1, we reduce the diameter by half. It takes additional links to do better. A fully connected network has diameter of 1 and is symmetric. We can think of this as a bus with sufficient bisection bandwidth to avoid sequentializing broadcasts.

Chordal Ring

Such networks are called chordal rings. The bisection bandwidth of such a network is computed by summing the distances spanned by the links present in the network and multiplying by 2.

when the distance n/2 links are included, and

when all of the links have distance < n/2.

For example, if a ring is augmented by links jumping 3, 7, and 9, then the bisection bandwidth is (1 + 3 + 7 + 9) × 2 = 40.
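
The example can be checked by brute force: build the links, split the nodes into two contiguous halves, and count the links that cross. This is a sketch under assumptions: the ring size of 32 nodes is chosen here only for illustration (the text does not give one), every node is assumed to have a chord of each listed distance, and crossing_links is a hypothetical helper name.

```python
def crossing_links(n, distances):
    """Count links cut when nodes 0..n/2-1 are separated from nodes n/2..n-1."""
    half = n // 2
    count = 0
    for d in distances:
        for i in range(n):
            j = (i + d) % n              # link from node i to the node d steps ahead
            if (i < half) != (j < half):
                count += 1
    return count

# Distance 1 is the ring itself; 3, 7, and 9 are the added chords.
print(crossing_links(32, [1, 3, 7, 9]))  # prints 40
```

Each distance d contributes d links at each of the two cut points, which is where the factor of 2 comes from when all distances are below n/2.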

Full Connect

What is the degree of such a network? (k-1)

How many links does it have? k * (k-1) / 2  (i.e. O(k^2))

What is its bisection bandwidth? k^2/4

Can we build it? Only for small values of k.

Does it improve performance? No. Why not? Because each communication operation requires the processors to examine k-1 inputs. To make this work, we must add more power to each node to process the link data in parallel. Then it is very powerful. Levitan showed it could reduce many N^2 and N log N algorithms to constant time.
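
The answers above are easy to tabulate. A minimal sketch follows; the function name full_connect_metrics is illustrative, and the bisection count assumes the network is split into two halves of (nearly) equal size.

```python
def full_connect_metrics(k):
    """Degree, link count, and bisection width of a fully connected k-node network."""
    degree = k - 1
    links = k * (k - 1) // 2
    # Every node in one half has a link to every node in the other half.
    bisection = (k // 2) * (k - k // 2)
    return degree, links, bisection

for k in (8, 16, 64):
    print(k, full_connect_metrics(k))
```

At k = 64 there are already 2016 links, which suggests why the network can be built only for small k.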

Let's think of some more topologies. What do we get if we extend the bus, line, and ring to 2 dimensions?

Hierarchical bus (comb)

Mesh

Torus, cylinder, and skewed torus and cylinder are obtained by wrapping one or both axes of the mesh around to the opposite side, with or without a shift factor. The Illiac IV network (an early research parallel processor) was a torus with one axis of the wrap-around skewed by one. Why is this useful? It makes it possible to also treat the entire mesh array as a linear array or ring.
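
The effect of the skew can be seen by following the eastward links. The sketch below is only a model of the idea, not of the Illiac IV hardware: it assumes an n x n array in which the east link from the last column wraps to the first column of the next row, and the names east_neighbor and walk_east are illustrative.

```python
def east_neighbor(node, n):
    """Eastward neighbor in an n x n torus whose row wrap-around is skewed by one."""
    row, col = node
    if col < n - 1:
        return (row, col + 1)
    return ((row + 1) % n, 0)    # wrap to the start of the NEXT row

def walk_east(n):
    """Follow east links from (0, 0) until the walk returns to the start."""
    start = (0, 0)
    path = [start]
    node = east_neighbor(start, n)
    while node != start:
        path.append(node)
        node = east_neighbor(node, n)
    return path

path = walk_east(8)
print(len(path), path[:3], path[-1])
```

The walk visits all 64 nodes in row-major order before returning to the start, so the east links alone form a single ring through the whole array.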

These networks can be extended to higher dimensions, but their advantage is that in two dimensions they map well onto the currently available two-dimensional (or 2.5 dimensional) technology.

The hierarchical bus has the advantage over a large linear bus that it does exploit some locality -- the transfers via the higher level bus are usually distinct from the local bus transfers. However, it has bandwidth limitations (bottleneck) for applications in which nonlocal communication dominates.

Meshes of buses have also been proposed in which each processor has two bus ports, one vertical and the other horizontal. But this requires each processor to manage two buses, and to handle traffic that is to switch from one to the other. Furthermore, the local buses still do not take advantage of locality in communicating with neighbors.

The mesh, on the other hand, enables local neighborhood communication to take place simultaneously so that all of its links can be in use at once. Of course, the mesh has root k diameter that offsets its root k bisection bandwidth. Thus, when nonlocal communication dominates, the latency to move data is proportional to root k. Pipelining can keep the bandwidth up, but that then depends on having enough data to move to keep the pipes full.

It has been proposed that meshes and buses be combined (mesh with row and column broadcast buses) to provide a combination of good local communication and low-diameter communication. The low diameter means that latency is low for sending single values across the network. However, the row and column buses end up being choked when all processors want to send a single value -- it takes root k transfers to get those values over the net, so the latency remains the same for dense communication patterns. In addition, root k quickly exceeds the fan out capabilities of typical technology, and so the buses must run more slowly.

In the period of 1985 to 1995, my research group actually built a SIMD processor that combined a mesh with a partitionable mesh-topology bus for use in image processing. Because communication in image processing tends to be dominated by locality that corresponds to image features, the partitionable mesh's ability to configure into buses that match the shape of image features is valuable. However, performance drops off for nonlocal communication that is dense and has no inherent connectivity that can be exploited.

Lines, rings, meshes, tori, etc. are even-degree networks whose degree grows with the dimension times two. What about odd degree networks and networks whose degree is not related to the dimensionality?

Tree

The tree has nodes of degree 1, 2, and 3. It is clearly not symmetric. For depth k, the number of nodes is 2^k - 1, and the diameter is 2k (or ~ 2 log N). For example, a processor with 262,144 nodes would have diameter 512 in a mesh but only 36 in a tree. More important, however, is that its bisection bandwidth is 1. Thus, it suffers from a serious bottleneck. Where is the bottleneck the worst?
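
The figures in the example follow from the approximations used in the text, as this small sketch shows. The function names are illustrative. Note that both functions are order-of-magnitude estimates: the exact corner-to-corner distance in a square mesh without wrap-around is 2(sqrt(N) - 1), and the estimate for the tree rounds the depth.

```python
import math

def mesh_diameter_estimate(n_nodes):
    """Approximation used in the text: the side length sqrt(N) of a square mesh."""
    return math.isqrt(n_nodes)

def tree_diameter_estimate(n_nodes):
    """Approximation used in the text: 2 * log2(N) for a binary tree."""
    return 2 * round(math.log2(n_nodes))

n = 262_144
print(mesh_diameter_estimate(n), tree_diameter_estimate(n))
```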

One would expect the root, but it is the two nodes that are children of the root. To see why, consider that the root has only degree 2, while these nodes are degree 3. The root has only to pass data from one half of the tree to the other, but its children must handle the same load plus all messages that are travelling between the subtrees that are rooted at the children.

While the tree has lower diameter than the mesh, it does not exploit physical locality as effectively. In particular, many algorithms can make use of direct communication between the leaves. An alternative, called the X-Tree, has been proposed in which cross links are added at each level, so that data can be moved sideways without having to pass through parent nodes (in the plain tree, for instance, a message between leaf 3 and leaf 4 must pass through the root).

Another approach is the fat tree, in which the link bandwidth is expanded at higher levels. This was actually used in the Connection Machine 5.

In some tree architectures, the processor nodes occupy all of the nodes of the tree, while in others, the processors are sited only at the leaves. The latter is more typical of the fat tree, and is really an example of a multistage switching network, which we'll examine shortly.

Hypercube

If we start with a 2 x 2 mesh, expand to the third dimension with a 2 x 2 x 2 cube, and then expand into higher dimensions with size 2, we are forming hypercubes.

Notice that these networks are symmetric, whereas the mesh is not. The mesh can be generalized into an asymmetric hypercube called a k-ary n-cube, where n is the dimensionality of the cube and k is the number of nodes on each edge. Thus a 4-ary cube would be a 4 x 4 mesh in 2 dimensions, a cubic 4 x 4 x 4 mesh in 3 dimensions, and a cube nested within a cube, with corresponding nodes linked, in 4 dimensions. So a mesh of m^2 = M nodes would be an m-ary 2-cube.

What is the degree of a hypercube of dimension d?  (d)

How many nodes does it have? (2^d)

What is its diameter? (d)

What is its bisection bandwidth? (2^(d-1))

How many links does it have? (d * 2^(d-1))

Thus, while the network has many desirable properties, including the ability to map a mesh or tree into it, the hypercube suffers from a complexity that grows exponentially with the dimension. For a 64K node hypercube machine, there would be 512K links. With current technology, it is difficult to scale a hypercube beyond about 4K nodes, with about 24K links. The Connection Machine 1 and 2 both used hypercube networks, although they truncated the lower four dimensions of the physical network by having local memory-based simulation of routing among groups of 16 nodes.
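
The link counts quoted above can be reproduced from the formula d * 2^(d-1). In the sketch below, nodes are assumed to be numbered so that neighbors differ in exactly one address bit (the usual hypercube labeling); the function names are illustrative.

```python
def hypercube_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip one address bit each."""
    return [node ^ (1 << bit) for bit in range(d)]

def hypercube_links(d):
    """Each of the 2^d nodes has d links, and each link is shared by two nodes."""
    return d * 2 ** (d - 1)

K = 1024
print(hypercube_links(16) // K, "K links for 64K nodes")
print(hypercube_links(12) // K, "K links for 4K nodes")
print(hypercube_neighbors(0b000, 3))
```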

One other problem with the hypercube is that as the number of dimensions increases, the nodes must do more work to keep up with the incident message traffic. In fact, because most processors can handle only one I/O transaction at a time, many hypercube algorithms operate on the principle of serializing processing by dimension; i.e., all pairs of nodes in one dimension communicate, then the next dimension, etc.
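
A global sum is a common instance of this dimension-by-dimension pattern. The following is a sequential simulation, offered as a sketch rather than as any particular machine's library routine: the function name all_reduce_sum is hypothetical, and the list index stands in for the node address.

```python
def all_reduce_sum(values):
    """Simulate a hypercube global sum, exchanging along one dimension per step."""
    n = len(values)
    d = n.bit_length() - 1
    if n < 1 or (1 << d) != n:
        raise ValueError("number of nodes must be a power of two")
    vals = list(values)
    for dim in range(d):
        # Every node pairs with the neighbor whose address differs in bit `dim`;
        # both add the partner's partial sum to their own.
        vals = [vals[i] + vals[i ^ (1 << dim)] for i in range(n)]
    return vals

print(all_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

After d exchange steps every node holds the total, and each node has used only one link per step.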

This limitation led to the concept of cube-connected cycles, in which each node in the hypercube is actually a ring of size d, with each element of the ring being responsible for one I/O channel. In that way, I/O can occur in all dimensions simultaneously, and data can be processed in a pipeline fashion as it circulates around the ring on its way to an outgoing link.

Cube connected cycles are difficult to program. For that matter, hypercubes are not especially natural to program either. For example, mapping a 2-D array to a hypercube can be done, but it leads to some nonintuitive implementations with loops that are the dimensionality of the hypercube. Fortunately, at least for arrays, compilers were able to hide this mapping, but there may still be a time penalty over a mesh, even though nodes have the proper adjacency.

In fact, the original CM-1 had a mesh in addition to the hypercube, and although the iPSC-2 (an early foray by Intel into the parallel processors business) was presented as a hypercube, it was really a mesh with circuit switching to support the hypercube simulation. Also, Seitz and Dally found that one can obtain better performance with a lower-dimensional, higher radix network that has higher bandwidth links than with higher dimensionality and lower link bandwidth.

Dynamic Networks

As we have noted before, parallel processors are only efficient when an application maps cleanly onto them. Otherwise there is considerable overhead in emulating the appropriate mapping. Although theoreticians may show that one topology can emulate another with constant slowdown, if the constant is more than about 10, the result is poor performance on an architecture. Because many scientific applications use arrays, many parallel processors such as the Intel Paragon, Cray T3 series, etc. have adopted a mesh network or a variation on the basic mesh.

As we have also seen, however, there are other situations where a mesh is inefficient. Even when processing an array, for example, a Fourier transform requires patterns of communication that are costly on a mesh. The problem is that for any fixed network short of a full-connect, there will be degenerate communication patterns that have low efficiency.

What is the obvious solution to this? Make the network able to reconfigure itself to match the algorithm.

We can view the network as a black box that connects either processors to processors, or processors to memories.

Such a network is usually abstracted as a permutation of the outputs of the processors to their inputs -- that is, a one-to-one mapping. For N processors, there are N! such mappings. This kind of switch is called a crossbar (terminology that dates back to its use in electromechanical phone switching systems). Another term is crosspoint switch -- we may think of this switch as a matrix with inputs on one edge and outputs on an orthogonal edge. The points at which the wires cross have switches that connect an input to an output.

The conceptual crossbar actually wastes switches (it has N^2), and more complex networks have been designed that provide all permutations with fewer than N^2 switches. Some of the more famous of these were discovered by Clos and Benes in research on telephone networks.

Some networks extend beyond the simple permutation to include nonoverlapping broadcasts (multicasts). There are N^N such mappings, which obviously require more links and switches -- in fact, the N^2 switch crossbar is precisely the network that supports permutations with broadcast. Such networks are configured by specifying the input independently for each output. Thus, if multiple outputs select the same input, the effect is a broadcast. Of course, the network must handle the fan out problem in these cases.

On the other hand, some networks do not provide all of these mappings. To conserve wires and switches, they may only support a limited subset of the mappings, usually those that fall into a regular pattern. It is assumed that the entire processor will be employed on a particular computation, which has a regular pattern. Irregular or data dependent operations, and multiple computations involving different patterns are not supported in these restricted networks.

These networks usually employ multiple stages of small switches (typically 2 x 2) with various interstage link patterns. There is a whole industry in computer science and engineering that has sprung up over developing yet another interstage pattern and then proving various properties about it and demonstrating its superiority over other patterns on algorithms that have little significance. The chief product of this industry is journal articles that are carefully constructed as least publishable units!

While a few machines have been built with multistage interconnection networks, most notably the IBM RP-3, and the IBM SP series, the approach is not especially popular for several reasons. Why?

The networks are often not easily partitionable into repeatable blocks (thus requiring different circuit board layouts, and complex backplanes and assembly and service procedures).

They assume regular patterns and no partitioning of the processors, making it difficult to map irregular applications to them or to be used in a shared mode.

They have high latency due to the intervening circuitry.

The cost of setting up links is high.

Because they support only one-to-one mappings, sharing of resources requires sequentialization with reconfiguration between accesses. Thus, sharing is very expensive.

The RP-3 tried to address the sharing issue by employing smart switches that identify duplicate accesses and actually compute the results of colliding replace-add type operations and return appropriate results to the processors. (The RP-3 connected processors to memory through the switches.) The effort was abandoned when it was realized that the switches were going to be more powerful than the processors, and the feature would only be of value in a small percentage of operations.

The SP series instead treats its multistage network routers very much like normal network routers. However, because the switches and the interface hardware are more tightly coupled than in an Ethernet environment, latency is greatly reduced, and bandwidth is higher. Thus, the programming model can be like that of a low-cost cluster, but delivering higher performance. The market for these systems thus rides on top of an established approach to parallelism, offering premium performance for a higher price to customers who can justify the benefits.

Network-Based Multicomputer Interconnects

Multiprocessors provide several opportunities for interconnecting system elements:

            Processor to processor

            Processor to memory

            Processor to I/O

            I/O to memory

In the broadest sense, all such interconnections are networks.

Networks may be synchronous or asynchronous, have central or distributed control, and use circuit or packet switching. Thus, there are 8 broad categories of networks:

Synchronous, central, circuit: e.g. a central crossbar

Synchronous, central, packet:

Synchronous, distributed, circuit:

Synchronous, distributed, packet:

Asynchronous, central, circuit: e.g. a central crossbar

Asynchronous, central, packet:

Asynchronous, distributed, circuit:

Asynchronous, distributed, packet:

Synchronous implies a global clock, while asynchronous implies no global clock, with handshaking used instead.

Central implies that one unit controls the configuration or routing policy of the network. Distributed implies intelligence in the network itself.

Circuit implies that an electrical connection is established from sender to receiver, while packet requires no direct circuit.

Each of these options actually represents a spectrum of choices between extremes. For example, packet and circuit switching can be combined in a network. Synchronization can be at different levels of granularity. Control can be partially distributed.

Circuit-Switched Networks

A circuit-switched network provides a direct electrical connection between two devices. There are several general varieties of circuit-switched network:

Permutation networks -- these provide any one-to-one mapping between sources and destinations.

Strictly non-blocking -- Any attempt to create a valid connection succeeds. These include Clos networks and the crossbar. The Clos network acts like a crossbar except that it supports only the N! permutations -- not the N^N patterns that include multicast.

Wide Sense non-blocking -- In these networks any connection succeeds if a careful routing algorithm is followed. The Benes network is the prime example of this class. Incidentally, it has been shown that 6 stages of Benes networks can support multicast.

Rearrangeably non-blocking -- Any attempt to create a valid connection eventually succeeds, but some existing links may need to be rerouted to accommodate the new connection. Batcher's bitonic sorting network is one example.

Blocking -- Once certain connections are established it may be impossible to create other specific connections. The Banyan and Omega networks are examples of this class.

Single-Stage networks -- Crossbar switches are single-stage, strictly non-blocking, and can implement not only the N! permutations, but also the N^N combinations of non-overlapping broadcast.

If a crossbar has synchronous, central control, then the implementation is straightforward and as many connections as can fit on a chip can be built. (My group has built a 32 x 32 bit-serial crossbar on a single chip in an 84-pin package. With modern packaging, a 256 x 256 crossbar would be quite easy.) Essentially, an array of N, N:1 multiplexers is constructed where the setting of the mux determines which input goes to the particular output. If changes in the network configuration are infrequent, then the central control can set up each connection as it is requested.
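
The multiplexer view of a centrally controlled crossbar can be modeled in a few lines. This is a behavioral sketch only; the function name crossbar and the select list (one entry per output, naming the input that its mux passes through) are illustrative assumptions, not a description of the chip mentioned above.

```python
import math

def crossbar(inputs, select):
    """Model N N:1 multiplexers: output j carries inputs[select[j]]."""
    return [inputs[s] for s in select]

data = ["a", "b", "c", "d"]
permuted = crossbar(data, [2, 0, 3, 1])    # a one-to-one mapping
broadcast = crossbar(data, [1, 1, 1, 1])   # every output selects input 1

n = len(data)
print(permuted, broadcast)
print(math.factorial(n), "permutations;", n ** n, "mappings with broadcast")
```

Because each output chooses its input independently, there are N choices for each of N outputs, which gives the N^N mappings; the N! permutations are the subset in which no input is chosen twice.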

When a crossbar has distributed control, implementation is more complicated because the switches must detect collisions between connection requests and arbitrate among them. Thus, in addition to data signals, the crossbar must support address and control signals. The detection and arbitration is straightforward for a synchronous network, and more complex for an asynchronous network. In these cases, the cost can grow prohibitive.

Designs have been proposed in which the address, data, and control are multiplexed (i.e. serialized), which both reduces performance and makes the switches more complex (having to include local buffering). However, the number of lines is small, so the number of connected devices can be large and thus the aggregate throughput is high.

It is possible to build expandable single-stage crossbars, if a mesh of crosspoints is used in place of an array of multiplexers. However, the number of connections into a block of the crossbar is doubled, because the inputs must be able to pass through to other blocks.

It is also possible to build multistage crossbars out of N x N crossbar modules. The resulting networks behave just like a larger crossbar, but require more modules and waste some inputs and outputs, and have higher latency.

Multi-Stage networks -- Clos networks are strictly non-blocking multistage networks. Benes networks are rearrangeably nonblocking. Others, such as the butterfly, shuffle exchange, omega, baseline, banyan, and delta are blocking networks -- and can be shown to either be topologically equivalent or emulatable by each other in constant time.

Combining does not entirely solve the blocking problem. Rather, it addresses the hot-spot problem.

In the Omega, the interstage connections are the same for all stages, and routing is simply a matter of using the destination's address bits to set switches at each stage. The Omega network is a single-path network -- there is just one path between an input and an output. It is equivalent to the Banyan, Staran Flip Network, Shuffle Exchange Network, and many others that have been proposed.

To remember how the Omega connects, just work from top to bottom connecting each output port to the next available upper port. When you run out of upper ports, start using lower ports.

The Omega can only implement N^(N/2) of the N! permutations between inputs and outputs, so it is possible to have permutations that cannot be provided -- i.e. paths that can be blocked.

For N = 8, the fraction of permutations that can be implemented is 8^4/8! = 4096/40320 = 0.1016, or 10.16%. It can take log N passes of reconfiguration to provide all links. Because there are log N stages, the worst case time to provide all desired connections can be (log N)^2.

Thus, for N = 8 the worst case is 9, for N = 16 it is 16, for N = 32 it is 25, for N = 64 it is 36, and so on.
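
Destination-address routing and blocking in the Omega can be explored with a small simulation. The sketch below rests on stated assumptions: each stage is modeled as a perfect shuffle (a left rotation of the position bits) followed by 2 x 2 switches that select their upper or lower output from the destination bits, most significant bit first, and a permutation counts as blocked if two messages need the same switch output at any stage. The function names are illustrative.

```python
from itertools import permutations

def omega_path(src, dst, n_bits):
    """Positions a message occupies after each stage under destination routing."""
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        # Perfect shuffle between stages: rotate the position left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # The switch output (upper = 0, lower = 1) is the next destination bit.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

def routable(perm, n_bits):
    """True if no two messages need the same switch output at any stage."""
    paths = [omega_path(src, dst, n_bits) for src, dst in enumerate(perm)]
    return all(len({p[stage] for p in paths}) == len(perm)
               for stage in range(n_bits))

def count_routable(n_bits):
    """Count the permutations that pass in a single conflict-free pass."""
    n = 1 << n_bits
    return sum(routable(p, n_bits) for p in permutations(range(n)))

print(routable(list(range(8)), 3))                 # identity permutation
print(routable([0, 4, 2, 6, 1, 5, 3, 7], 3))       # bit-reversal permutation
print(count_routable(3), "of 40320 permutations pass for N = 8")
```

Under these assumptions the identity permutation passes, bit reversal is blocked at the first stage, and the exhaustive count for N = 8 agrees with the 4096 figure given above.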

What is not often noted is that, in addition to these costs, it is not easy to obtain the optimum sequence of passes. The sequence has to be precomputed and the network centrally controlled in some manner (perhaps as simple as global synchronization of the processors).

Without central control, and given only the information that its request hit a collision, a processor is most likely to take a greedy approach and reissue the request on the next cycle. If the network does not contain some additional logic to prevent the injection of new messages by already-served processors (or at least give them a lower priority), it is possible to starve a processor for an arbitrary length of time.

Hot-Spots occur when there is a lot of shared access to one destination, or a combination of source-destination pairs that result in the same collision within the network (the latter is uncommon).

The RP-3 designers claimed that hot spots were not actually a problem. The cynical response was that the performance of the system was so poor that the delay due to hotspot contention was simply masked out. This points out again that multistage switching networks suffer from considerable latency, especially in accessing memory vs. normal memory access speeds.

Fetch&Add can be used to provide N-way synchronization among processors over a multistage network. Basically, if Fetch&Add operations with the same destination collide, then one of the values is saved at the collision point, and the sum of the values is passed on. Then a single memory update retrieves the value at the location, adds the sum to it, stores the result, and returns the value that was originally there.

At each collision point on the return path, the returned value is sent back on the path on which the stored value entered, and the sum of the stored value and the returned value is sent back on the other path.

Thus, only one processor gets the original value from the memory location, and the others get unique sums that include other processors' values. Note that this requires the processors to send ID values that are unique to begin with.
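
The combining scheme can be simulated to see what each processor receives. This is a sketch under assumptions, not a model of the RP-3 hardware: requests are assumed to merge pairwise in a binary tree of switches, the first half of each group is treated as the side whose value is stored at the collision point, and the function names are hypothetical.

```python
def combined_fetch_and_add(memory_value, increments):
    """Return (new memory value, value returned to each requesting processor)."""
    if not increments:
        return memory_value, []

    def returned_values(incs, base):
        # `base` is the value coming back from memory toward this switch.
        if len(incs) == 1:
            return [base]
        mid = len(incs) // 2
        left, right = incs[:mid], incs[mid:]
        stored = sum(left)    # value saved at the collision point on the way in
        # The side whose value was stored gets the returned value unchanged;
        # the other side gets the returned value plus the stored value.
        return (returned_values(left, base)
                + returned_values(right, base + stored))

    replies = returned_values(list(increments), memory_value)
    return memory_value + sum(increments), replies

print(combined_fetch_and_add(10, [1, 2, 3, 4]))
```

Memory is updated once, from 10 to 20, yet the replies 10, 11, 13, and 16 are the same as if the four operations had been performed one after another.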

Recall that in the case of the RP-3, the cost of adding combining made the network more costly than the processors, and memory latency increased as well. Thus, the combining network wasn't actually built.

It should be noted that in spite of the popularity of multistage switching networks in the academic literature, they have rarely been used in actual systems. For high performance, crossbars tend to dominate real designs in connecting processors to memory and to each other. Why? Because the whole point of circuit switching is usually to get low latency and this is precisely what is sacrificed by going to multiple stages.

By the way, it is a common misconception that Harold Stone discovered the perfect shuffle network. It was actually discovered by M.C. Pease and then popularized by Stone, who analyzed it in great detail.

Multiple Path Networks

The Data Manipulator, Augmented Data Manipulator, Gamma, and so on form a class of networks in which there are a limited number of alternate paths, usually 2 or so. In the spectrum from single-path to non-blocking, these networks fall into a useful intermediate range, not unlike the way in which 2-way set-associative caches are an effective compromise between direct-mapped and fully associative.

Recirculating Networks

If a network has the same topology between all stages, it can be built with a single recirculating stage and a memory at each switch. While less expensive than a full network, it is not circuit switched. Note also that it cannot directly pipeline the transfer of data, although it could be augmented to emulate pipelining.

Multicomputers

In spite of all of the work that had been done on networks, there were very few conscious design decisions involved with the first generation of multicomputers. For the most part, these machines (except for the Cosmic Cube and n-Cube) were built from whatever was at hand, with very little analysis of the design's performance.

In the case of the iPSC/1, Intel used 80286 processors together with off-the-shelf Ethernet transceivers wired in a 7-cube. The design was thrown together in less than a year in response to the realization that the group that spun out to form a company called nCube might be onto something and should be beaten to delivery of their first systems.

The original iPSC had horrible performance, with up to 100ms latency for message transmission, and I/O only through the host node via another Ethernet link which did not support broadcast.

The iPSC/2 and Ametek machines were a second commercial generation, implementing wormhole routing on a mesh. The iPSC/2 retained the fiction of a hypercube for upward compatibility, but was implemented with a mesh. The n-Cube was actually more sophisticated than the first generation, given that it supported direct DMA transfers through the hypercube, but used store-and-forward for multi-hop messages. It was thus somewhere between the generations.

All of these machines have since gone to the scrap heap, as low-cost clusters displaced the majority of the market for multicomputer parallelism.

Packet Switching

Store-and-forward

In this scheme, a message passes from node to node being stored at each intermediate point and retransmitted. The scheme is wasteful in that it fails to take advantage of the potential to traverse multiple nodes at once when there is no blockage.

Virtual cut through

In this scheme, store-and-forward is extended to allow circuit-switching (either actual or logical) in the network. A message makes connections through the mesh until it is blocked. The data is then forwarded and stored at that location, and held until the blockage disappears.

Wormhole routing

In this scheme, rather than require each processor to buffer whole messages, each processor buffers just a small amount of data called a flit. The message is injected into the network as a train of flits, with the header establishing the pathway and the trailer releasing the pathway. When a blockage occurs, the flits are buffered at their present locations until they can move again.

This method can deadlock in a plain mesh, but by going to dual (possibly virtual) channels so that messages can flow around a blockage, deadlock can be avoided.

For very large messages, the pipelined nature of wormhole routing hides the latency of transmission through multiple stages. However, it does not hide latency for typical memory-word accesses -- again a candidate for multithreading, in that processors can switch threads while awaiting memory references to complete. This was the design principle behind the Tera supercomputer.
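
A first-order timing model illustrates both halves of this point. The model is an assumption made for illustration (one flit time per hop for the header, no contention, no routing overhead), and the function names and parameter values are not taken from any particular machine.

```python
def store_and_forward_time(msg_bytes, hops, bytes_per_cycle=1):
    """Each hop receives the whole message before retransmitting it."""
    return hops * (msg_bytes / bytes_per_cycle)

def wormhole_time(msg_bytes, hops, flit_bytes, bytes_per_cycle=1):
    """The header flit pays the per-hop cost; the remaining flits stream behind."""
    header = hops * (flit_bytes / bytes_per_cycle)
    body = (msg_bytes - flit_bytes) / bytes_per_cycle
    return header + body

HOPS, FLIT = 10, 4   # assumed path length and flit size in bytes

# Large message: the pipeline hides almost all of the per-hop delay.
print(store_and_forward_time(4096, HOPS), wormhole_time(4096, HOPS, FLIT))

# Single-word message: there is nothing to pipeline, so the hop latency remains.
print(store_and_forward_time(4, HOPS), wormhole_time(4, HOPS, FLIT))
```

In this model a 4096-byte message takes about a tenth as long with wormhole routing over ten hops, while a one-flit message takes the same time either way.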

Virtual Channels

The provision of virtual channels to avoid deadlock in wormhole routing leads to the notion of using more virtual channels to provide lower latency. The basic idea is that once a path has been established, its switch settings can be cached at the router nodes. Then a signal can be used to quickly reactivate the path, so that more data flits can be injected and passed quickly through the network.

It is especially useful when a pattern of non-intersecting connections can be established so that orchestrated data transfers can occur (as in matrix transforms). Then, very high network utilization can be obtained. This approach was used in the iWarp, and then adapted to the Paragon.

Wormhole routing suffers from some degenerate cases in which a high percentage of messages are blocked for a long period. However, these cases are unusual and require a high degree of orchestration among the injecting processors. For most typical communication patterns, and with asynchronous processors, the delays due to blocking are quite small and network utilization can be high. Nonetheless, the inherent limitations of the mesh are not overcome.

Thus, for larger numbers of processors, three-dimensional meshes have been employed with wormhole routing.

Fourth Generation Multicomputers

Designers now realize that uniprocessors make poor multicomputer nodes. However, the cost of developing a competitive node that is more appropriate is quite high. But as integration levels continue to grow, designers are starting to explore single-chip multiprocessors. Thus, research in the area may begin to focus on the nodes within these small-scale parallel configurations. Already, larger multicomputers are being built as clusters of 4-node multiprocessors, and it is likely that we'll see more integration of these architectures. Other major areas of research are in operating systems, compilation and languages.

Multithreading

Lightweight processes

Network access bypassing the OS, or more OS support in hardware

Languages and compilers that handle partitioning and placement of data.