Lecture 17

Multithreaded, VLIW, EPIC, and Other Architectures

Latency Hiding Techniques

Whenever it is desired to share data between two processors that are spatially distant, there will be latency in transferring the data. Either the programmer must plan for the latency, or the compiler must try to compensate for it, or the architecture must hide it (or a combination of those approaches).

There are four approaches to hiding latency:

Prefetching -- Get it before you need it, so you won't have to wait

Caching -- Keep a nearby (lower latency) copy

Relaxed memory consistency -- Reduce the number of high-latency cases

Multithreading -- Find something else to do while waiting

Prefetching

Obviously, if we can issue a fetch for data before it is needed, there is a high probability that it will arrive by the time we actually need it. In a 1992 paper, Gupta showed that prefetching can save nearly 50% of execution time in some cases vs. no prefetching in a distributed-memory directory-based cache coherence scheme (the Dash). Of course, such systems have a high latency anyway, so any hiding of the latency is likely to be dramatic. The authors did not compare performance against a system with zero latency, which would provide an absolute measure of the overhead induced by prefetching.

Nonetheless, prefetching is a means of hiding latency. Its drawbacks are that it can require sophisticated software (or specialized hardware that may generate unnecessary memory traffic due to the use of simplistic fetch-ahead mechanisms that fail to consider branches or non-unit strides), and if prefetch is a separate instruction, it can increase the number of instructions issued for a given computation.

Caching

If data is cached, it has minimal latency to the processor. However, if it is shared, then there is overhead associated with maintaining coherency. As we have seen previously, leaving data in main memory or another processor's memory is more expensive than caching it. The Dash researchers found that 60% of execution time can be saved by caching. Their examples do not show the amount of time spent in maintaining coherency, or even give any idea as to whether shared data required coherency -- presumably at least some did.


Relaxed Memory Consistency

The less we have to maintain consistency, the less we have to deal with latency. Thus, by taking the Gharachorloo approach to consistency, we reduce the effect of critical sections by shortening their entry and exit costs.

Similarly, when we use a mechanism such as snooping to detect sharing (or keeping a directory that records whether a variable is shared), we reduce the number of consistency-maintaining operations.

This approach does not hide latency so much as avoid encountering it. The Dash paper shows examples in which an 8% to 35% gain is obtained with relaxed consistency.

Multithreading

The basic idea behind multithreading is to find something else to do while waiting for memory latency. If a task can be split up so that operations that are independent get scheduled to automatically execute whenever another is stalled due to a memory wait, then the effect of the latency is hidden from the view of the whole task.

Of course, these approaches can be combined, and the Dash paper gives an example in which a combination achieves a factor of 6 speedup over a similar system with no caching, and about a factor of 3 speedup over a system with just coherent caching.

There are several possible policies for when to switch threads:

On a cache miss: only switch when it is known that latency will be high. Reduces context switching, but may stall the pipe because the switch is not started until the miss is detected.

On every load: simpler hardware implementation and is detected at an earlier stage so the pipe can be kept full. Causes a higher number of context switches.

On every instruction: Effectively instruction context interleaving. No decisions need to be made regarding context switching. Has the side benefit that the pipe sees streams of independent instructions, allowing more superscalar issue. Also allows hiding of branch penalties if branches are handled while contexts are inactive. Requires very fast context switching hardware. Also reduces cache locality, requiring a larger cache. Depends heavily on being able to keep more threads active.

On an instruction block: Like switching on every instruction, but increases cache locality and decreases switching speed requirements. Also reduces need for a large number of threads. Harder to hide branch penalties.

Tera

This is really the canonical example of a multithreaded architecture. While other architectures may support a few threads that are driven by local operations (such as sending and receiving over a network), the Tera is explicitly intended to operate with a large number of threads.

The concept for the Tera originated with the CDC-6600 peripheral processing units. To support I/O in the 6600, Cray designed a multithreaded channel controller. There was one I/O processor, but it had ten complete contexts (register/memory sets). A context would be executed until it had either set up an I/O transfer (in which case it would be waiting) or a timer interrupt occurred.

This idea was adapted for computation, rather than I/O, in the Denelcor HEP. It has now been extended into the Tera.

The Tera can have up to 128 threads per processor. Each thread is supported by a complete register set. There is a status word, 32 general purpose registers, and 8 target registers per thread. Thus, each processor has 5248 registers. On each clock cycle, an instruction from a different thread may be issued.

If there are enough active threads at any given time, the latency of memory access can be completely hidden. Because the expected latency is about 70 cycles, at least that many threads must be independently executable at once. The excess number (128) of possible contexts is needed because some portion of the available threads at any one time may not be ready to execute.
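The register count and the thread count needed to hide latency can be checked with a small model. This is only a sketch under simplifying assumptions: each thread issues one instruction and then waits the full memory latency before it can issue again, and the function names are invented for illustration.

```python
# Illustrative model only; the figures come from the notes, the names are assumptions.
THREADS_PER_PROCESSOR = 128
REGS_PER_THREAD = 1 + 32 + 8  # status word + general purpose + target registers


def registers_per_processor(threads=THREADS_PER_PROCESSOR, regs=REGS_PER_THREAD):
    """Total registers needed to hold every thread context on one processor."""
    return threads * regs


def utilization(ready_threads, latency_cycles):
    """Closed-form fraction of cycles in which some thread can issue."""
    return min(1.0, ready_threads / latency_cycles)


def simulate(ready_threads, latency_cycles, cycles=10_000):
    """Cycle-by-cycle check of the closed-form model above."""
    next_ready = [0] * ready_threads  # cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        for t in range(ready_threads):
            if next_ready[t] <= cycle:
                next_ready[t] = cycle + latency_cycles
                issued += 1
                break  # at most one instruction issues per cycle
    return issued / cycles


if __name__ == "__main__":
    print(registers_per_processor())
    for n in (35, 70, 128):
        print(n, utilization(n, 70), simulate(n, 70))
```

With a 70-cycle latency, the model keeps the issue slot full only once about 70 threads are ready; with half that many, roughly half of the cycles go idle.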

Each instruction can cause up to three independent operations to occur (memory, arithmetic, and control/arithmetic). In addition, each instruction specifies the number of following instructions in the thread that can be executed independently of the current instruction. This information, inserted by the compiler, is used to cause the instructions to be issued for parallel execution.

Branch target registers enable prefetching of branch targets and reduce the size of loop branch instructions. Special branches are also provided, which allow the compiler to specify whether a branch is likely to be taken or not.

The goal of the Tera project was to have up to 256 processors, 256 I/O processors, 256 I/O caches, and 512 memory banks connected via a sparse 3-D torus. The planned clock rate was 400 MHz. Each processor was expected to consume 6 kW of power, necessitating liquid cooling. The company subsequently scaled back its goals, to initially deliver systems with up to 16 processors, and physical designs for as many as 32 have been drawn up.

Programming the Tera and compiling for it are still open issues. It is expected to run Fortran, but it is not clear that, even with the HPF extensions, it will be possible to extract enough threads and parallelism from typical codes to take advantage of the potential performance of the machine. However, the company is one of the few in the business that has taken the compilation issue seriously, and it is expected that the initial software will be among the best.

The Tera is clearly a cost-no-object machine, like the largest Cray computers. Given that the limited numbers to be produced also increase the cost, it is not clear that such machines have a future market when competing against multicomputers.

Simultaneous Multithreading

The Tera processors can issue instructions from just one thread per cycle. It has been proposed by several researchers that it may be more efficient to allow instructions from multiple threads to be issued in a single cycle, making opportunistic use of each of the functional units in a superscalar core. Studies by Tullsen have shown the potential to achieve more than 5 instruction issues per cycle in an 8-way superscalar processor using this form of multithreading with independent tasks running on each thread, versus a typical issue rate of about 1.5 for an unenhanced equivalent superscalar design. These studies assume very large caches, and since they are based on partial execution of SPEC92 codes, it is possible that some memory effects were not observed that might reduce the reported performance. Tullsen also showed in an earlier study that an 8-way multiprocessor would be required to achieve a similar issue rate.

In order to implement such a design, it is necessary to replicate the issue units, reservation stations, registers, reorder buffers and renaming logic by the same factor as the number of threads. The result is that the registers are slower (necessitating longer pipelines and higher branch penalties), the chip would consume at least 4 times as much power, the off-chip memory bandwidth must be increased, and although throughput is improved there is no benefit for a single thread (in fact, there is a slight penalty for any thread that increases as the number of threads grows).

Simultaneous multithreading also depends on a change in the instruction set architecture in the supervisor state to allow the OS to take advantage of the multiple threads. It also relies on a high degree of multitasking in the OS. Another drawback is that it thwarts a number of traditional compiler optimization techniques by randomizing the instruction issue schedule and intermixing memory reference streams so that explicit cache placement is impossible. Finally, the fact that individual threads can be unpredictably slowed by side effects from other threads means that such a design will be impossible to use for real-time applications such as native signal processing.

The Compaq Alpha 21464 is designed to support two simultaneous threads. One thread is for user state and the other for supervisor state. This application isn't directly scalable, but it is far simpler to implement in terms of both the hardware and the OS support. Compaq has cancelled further Alpha development, and it is not clear whether the 21464 will go into full production.

VLIW

Superscalar processors issue instructions opportunistically to the available pipes, using a limited range of rescheduling of instructions. They depend on the compiler to generate approximate schedules, and then locally the prefetch and dispatch logic considers the dependences among instructions in deciding which ones to issue in parallel.

An alternative to this approach is to throw all of the scheduling into the lap of the compiler, which has a bigger picture of the dependence relationships within the code. The instructions are then issued in groups that are fixed at compile time. An instruction word may thus contain multiple instructions (typically 4 to 8), and is hence very long in comparison to traditional instruction words. These architectures are thus known as VLIW.

A few such machines have been built over the years. The best known was the Multiflow. None of them was a commercial success, and the reasons for their failures are complex. One is that development of such an architecture, as with multithreading, requires new compiler technology. The Multiflow was the only one to really address this issue, and its trace scheduling compiler was considered one of the best at the time. It became the basis of the compiler used with the Intel IA-64.

A major drawback of VLIW is that it fully exposes the microarchitecture of the machine to the instruction set architecture. A VLIW with 6 pipes has a 6-instruction bundle instruction word. If it is desired to expand to eight pipes in a new generation, then the ISA must change to reflect this, and all of the existing software becomes unusable without recompilation.
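A rough illustration of why the pipe count leaks into the binary: the sketch below packs the same dependence-free groups of operations into fixed-width words for a 6-pipe and an 8-pipe machine. The function name, the operation names, and the use of explicit "nop" padding are assumptions made for the example, not a description of any particular VLIW encoding.

```python
def pack_vliw(groups, width):
    """Pack groups of mutually independent operations into fixed-width words.

    groups: list of lists; operations inside one group have no dependences
            among them, but each group depends on the one before it.
    width:  number of pipes, which is also the number of slots per word.
    """
    words = []
    for group in groups:
        for i in range(0, len(group), width):
            chunk = group[i:i + width]
            # Unused slots must still be encoded, here as explicit no-ops.
            words.append(chunk + ["nop"] * (width - len(chunk)))
    return words


if __name__ == "__main__":
    code = [["a", "b", "c", "d", "e", "f", "g", "h"], ["i", "j"]]
    for width in (6, 8):
        print(width, pack_vliw(code, width))
```

The two packings differ in both word length and word count, so an executable built for one width cannot simply be fetched and decoded by the other.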

Any change that affects the scheduling of instructions, such as a different memory latency or a change in the branch detection point in the pipes, requires recompilation of the code to optimize the fixed schedule. This in turn imposes a complex process on software vendors, who will be delivering executables to different generations of machines at the same time.

Lastly, the compiler does not have full knowledge of the run-time behavior of the code. Because it lacks access to the data, and also lacks access to higher level semantics, there are many cases where it must be conservative in its optimizations, under the assumption that exceptional cases will occur with the same probability as regular cases. So the schedule cannot be fully optimized at compile time.

VLIW is thus, by itself, not sufficient to boost performance to levels that would outweigh its drawbacks. However, many of the architects from Multiflow went to Intel, and heavily influenced the development of their 64-bit architecture.

EPIC

The next step beyond VLIW is called Explicitly Parallel Instruction Computing (EPIC). Like VLIW, it bundles multiple instructions per instruction word. This is mainly a convenience for fetching, and to permit the use of an odd-sized instruction length. The EPIC instruction word contains three 41-bit instructions, and a five-bit control field.
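The field widths add up to a 128-bit word (3 x 41 + 5). The sketch below packs and unpacks such a word with plain integer arithmetic. Placing the control field in the low-order bits, and the function names, are assumptions made for the illustration.

```python
SLOT_BITS = 41
CONTROL_BITS = 5
SLOTS = 3
WORD_BITS = SLOTS * SLOT_BITS + CONTROL_BITS  # 3 * 41 + 5 = 128


def pack_word(control, slots):
    """Combine a control field and three instruction slots into one integer."""
    assert 0 <= control < (1 << CONTROL_BITS)
    assert len(slots) == SLOTS
    word = control  # assumed layout: control field in the low-order bits
    for i, instr in enumerate(slots):
        assert 0 <= instr < (1 << SLOT_BITS)
        word |= instr << (CONTROL_BITS + i * SLOT_BITS)
    return word


def unpack_word(word):
    """Split a packed word back into (control, [slot0, slot1, slot2])."""
    control = word & ((1 << CONTROL_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (CONTROL_BITS + i * SLOT_BITS)) & slot_mask
             for i in range(SLOTS)]
    return control, slots


if __name__ == "__main__":
    w = pack_word(0b10011, [1, 2**41 - 1, 12345])
    print(WORD_BITS, unpack_word(w))
```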

The control field is used to indicate the parallel issue ability of the instructions, and to mark the end of instruction bundles. Thus, unlike VLIW, the bundle is not fixed in length or related to the microarchitecture. Instead, the compiler schedules dependence-free instructions into a bundle, which may span multiple words, and marks the end of the bundle.

The hardware prefetches instruction words, identifies bundles, and then schedules groups of independent instructions from a bundle to issue in parallel, according to the available resources. In this way, EPIC overcomes to some extent the limitations of the compiler's knowledge of the run-time environment.
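The division of labor can be sketched as follows: the compiler marks where each dependence-free bundle ends, and the hardware decides how many cycles each bundle takes given its own pipe count. The example treats all pipes as identical and uses invented names (split_bundles, issue_schedule), so it is a simplification of the idea rather than a model of any real implementation.

```python
def split_bundles(stream):
    """stream: list of (name, stop) pairs; stop=True marks the end of a bundle."""
    bundles, current = [], []
    for name, stop in stream:
        current.append(name)
        if stop:
            bundles.append(current)
            current = []
    if current:  # tolerate a final bundle with no explicit stop
        bundles.append(current)
    return bundles


def issue_schedule(bundles, pipes):
    """Issue each bundle over as many cycles as this machine's pipe count needs."""
    cycles = []
    for bundle in bundles:
        for i in range(0, len(bundle), pipes):
            cycles.append(bundle[i:i + pipes])
    return cycles


if __name__ == "__main__":
    program = [("a", False), ("b", False), ("c", False), ("d", False),
               ("e", True), ("f", False), ("g", True)]
    bundles = split_bundles(program)
    print(issue_schedule(bundles, 2))  # narrow machine: more cycles
    print(issue_schedule(bundles, 6))  # wide machine: fewer cycles, same binary
```

The same encoded program runs on both machines; only the number of issue cycles changes, which is the property fixed-width VLIW lacks.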

The instructions within a bundle pass through the CPU with parallel rather than sequential semantics. This simplifies exception handling, as there is no need to relate an exception to an instruction within a particular sequence. The exception is tied to the bundle, and to an instruction within the bundle, but there is no ordering within the bundle that must be preserved as there are no dependences. All of this can make assembly-level debugging more challenging, as instructions within bundles execute in random order. Thus, relating the assembly language to the sequential semantics of the programming language is difficult.

Parallel semantics also relieve some of the burden for managing the completion of instructions. Completion order is only enforced between bundles.

The IA-64 implementation of EPIC adds various features to enhance the flow of instructions. For example, virtually all instructions can be predicated. A bank of 64 1-bit predicate registers allows computation of conditionals (which can be done as parallel operations) in parallel with execution of dependent instructions. When the instruction is ready to commit its result, it tests its predicate register to determine whether its condition was met. This means that many instructions are executed speculatively with no effect, but it avoids many conditional jumps that could disrupt the flow of the pipelines.
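Predication can be illustrated with a toy interpreter in which both arms of a conditional are executed but only the arm whose predicate is true commits its result. The register names, the instruction format, and the function names are assumptions made for this sketch.

```python
def run_predicated(program, regs, preds):
    """Execute (predicate, dest, fn) triples.

    Every instruction is executed, but its result is committed only if
    its predicate register is true.
    """
    for pred, dest, fn in program:
        result = fn(regs)        # computed regardless of the predicate
        if preds[pred]:
            regs[dest] = result  # commit
    return regs


def abs_diff(a, b):
    """Branch-free |a - b|: a compare sets two complementary predicates."""
    regs = {"a": a, "b": b, "r": 0}
    preds = {"p1": a > b, "p2": not (a > b)}
    program = [
        ("p1", "r", lambda r: r["a"] - r["b"]),  # "then" arm
        ("p2", "r", lambda r: r["b"] - r["a"]),  # "else" arm
    ]
    return run_predicated(program, regs, preds)["r"]


if __name__ == "__main__":
    print(abs_diff(7, 3), abs_diff(3, 7))
```

One of the two subtractions is always wasted work, which is the cost the text describes, but no conditional jump is needed.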

The IA-64 also supports speculative data operations. Loads and stores that are dependent are allowed to proceed out of order, and as they complete, they write their status in an associative memory. If an instruction discovers that it has reached completion without conflicting with another instruction, it goes ahead and commits its result. When a conflict occurs, then "fix-up" logic comes into play to resolve the violated dependence. This is in contrast with out-of-order RISC processors that work to prevent the possibility of any dependence violations from occurring in the first place.
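The check-and-fix-up idea can be sketched loosely as a small table of early loads that later stores invalidate. The class and method names are hypothetical, and the sketch is not meant to reproduce the actual IA-64 mechanism; it only shows detecting a violated dependence after the fact instead of preventing it.

```python
class SpeculativeLoads:
    """Toy model: early loads are recorded, stores invalidate, checks repair."""

    def __init__(self, memory):
        self.memory = memory
        self.table = {}  # dest register -> address of an unchecked early load

    def early_load(self, reg, addr, regs):
        """Perform a load ahead of earlier stores and remember it."""
        regs[reg] = self.memory[addr]
        self.table[reg] = addr

    def store(self, addr, value):
        """Write memory; any early load of the same address is now suspect."""
        self.memory[addr] = value
        for reg in [r for r, a in self.table.items() if a == addr]:
            del self.table[reg]

    def check(self, reg, addr, regs):
        """Return True if the speculation held; otherwise redo the load."""
        if self.table.pop(reg, None) == addr:
            return True
        regs[reg] = self.memory[addr]  # fix-up: reload the current value
        return False


if __name__ == "__main__":
    mem, regs = {100: 1, 200: 2}, {}
    s = SpeculativeLoads(mem)
    s.early_load("r1", 100, regs)
    s.store(200, 9)                          # different address: no conflict
    print(s.check("r1", 100, regs), regs)
    s.early_load("r2", 200, regs)
    s.store(200, 5)                          # same address: conflict
    print(s.check("r2", 200, regs), regs)
```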

The registers of the IA-64 are also more extensive than the typical RISC processors. In addition to the predicate registers, there are 128 integer registers, 128 floating point registers, and 128 control registers. The top 32 integer and FP registers are fixed, and the remainder are part of a register window system, somewhat similar to the SPARC. However, the size of the IA-64 window is variable. On a call, the number of in, out, and local registers is allocated explicitly. The 96 registers act as the top of the register stack and the hardware supports automatic spilling. In early implementations, spilling is done synchronously -- with the same problems we've noted earlier of scheduling blocks of slow memory references in an atomic and tightly-coupled manner. It is said that the architecture will allow asynchronous spilling in the future. Presumably a spill buffer will enable local writing and reading, with actual writing and reading to memory taking place in the background as cycles are available.

Such a large register set may make it difficult to exploit multithreading, which would either require multiple sets or complex control to manage the sharing of a larger set among multiple threads.

The IA-64 also provides special looping instructions, and can pipeline execution of loop iterations in hardware. Compilers support software pipelining of loops, but often with a significant increase in code size. The hardware implementation avoids the need for code expansion, and matches the pipelining of the loop to the available resources.

The McKinley version of the Itanium has a die that is 464 square millimeters in size (roughly a square inch, and about four times larger than contemporary processors). It includes a 3 MB on-chip L3 cache, a 256K L2, and 32K L1 cache. Target speed is 1 GHz. Performance is expected to be similar to the Pentium 4, or slightly better. One of the problems with introducing a new architecture is that it can take several generations to work out all of the kinks and really start to take advantage of its new features. At the same time that the Itanium team will be struggling to make all of the basics operational, the IA-32 team is running fast to keep ahead of its competition, and they are building on established technology. Thus, it may take several more generations for Itanium to pull distinctively ahead of Pentium and other competitors (such as the IBM Power series). A determining factor in its eventual success or failure may be Intel's willingness to stay the course in spite of huge losses, on the promise of eventually beating their competition and locking up the high-end market.

Other notable architectures

J-Machine

The MIT Jellybean Machine is so called because it was built entirely of "jellybean" components, in the sense of there being a large number of low-cost processor chips. The initial version used an 8 x 8 x 16 cube network (1K nodes), with possibilities of expanding to 64K nodes.

The "jellybeans" are message driven processor (MDP) chips, each of which has a 36-bit processor, a 4K word memory, and a router with communication ports for 6 directions. External memory of up to 1 M words can be added per processor. The MDP creates a task for each arriving message. In the prototype, each MDP chip has 4 external memory chips that provide 256K memory words. However, access is through a 12-bit data bus, and with an error correcting cycle, the access time is four memory cycles per word.

Each communication port has a 9-bit data channel. The routers provide support for automatic routing from source to destination. The latency of a transfer is 2 microseconds across the prototype machine, assuming no blocking. When a message arrives, a task is created automatically to handle it in 1 microsecond. Thus, it is possible to implement a shared memory model using message passing, in which a message provides a fetch address and an automatic task sends a reply with the desired data.
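The shared-memory-over-messages idea can be sketched with two node objects and a trivial network. Everything here (the Node and Network classes, the message fields, remote_read) is invented for illustration and ignores routing, timing, and blocking.

```python
from collections import deque


class Node:
    def __init__(self, node_id, memory=None):
        self.node_id = node_id
        self.memory = dict(memory or {})
        self.inbox = deque()
        self.replies = {}  # addr -> value returned by a remote node

    def handle(self, message, network):
        """Stands in for the task created automatically per arriving message."""
        if message["kind"] == "fetch":
            value = self.memory[message["addr"]]
            network.send(message["src"],
                         {"kind": "reply", "addr": message["addr"], "value": value})
        elif message["kind"] == "reply":
            self.replies[message["addr"]] = message["value"]


class Network:
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def send(self, dest, message):
        self.nodes[dest].inbox.append(message)

    def run(self):
        """Deliver messages until every inbox is empty."""
        progress = True
        while progress:
            progress = False
            for node in self.nodes.values():
                while node.inbox:
                    node.handle(node.inbox.popleft(), self)
                    progress = True


def remote_read(network, src, dest, addr):
    """Read addr from node dest on behalf of node src using two messages."""
    network.send(dest, {"kind": "fetch", "src": src, "addr": addr})
    network.run()
    return network.nodes[src].replies[addr]


if __name__ == "__main__":
    net = Network([Node(0), Node(1, {0x10: 42})])
    print(remote_read(net, 0, 1, 0x10))
```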

Mosaic-C

Like the J-Machine, the Mosaic-C attempted to construct a massively parallel array by reducing the cost of a node -- it is a fine-grained multicomputer. Unlike the J-Machine, there is no external memory outside of the node chips. It also uses a two-dimensional mesh rather than 3 dimensions.

Mosaic-C was not multithreaded although future versions had that possibility. However, the Mosaic group left Caltech to start Myricom, which builds the Myrinet LAN technology based on the Mosaic chips.

Stanford Dash

The DASH was a distributed shared memory multiprocessor using directories. The prototype used 16 clusters of 4 MIPS R3000 processors. The clusters were connected by a pair of wormhole routing meshes. One mesh was used to request memory and the other was used to reply to requests.

The DASH kept a directory in main memory for every frame, with a presence bit for every processor cache. In addition, each directory entry had status information indicating whether the frame was uncached, shared, or exclusively cached. Thus, there was no directory chaining, and updates took a fixed amount of time.

Of course, keeping the entire directory in main memory requires that the directory memory be increased as the number of processors increases. In the prototype system, only two words were required to record the presence of each frame in all of the caches. If it were scaled up to thousands of processors, then both the amount of directory memory per frame and the time to access it would be prohibitive.
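A full-map directory entry of the kind described can be sketched as a presence vector plus a state field. The class below is a simplified illustration, not the DASH protocol itself: the method names are invented, and a 32-bit word is assumed when counting presence words.

```python
UNCACHED, SHARED, EXCLUSIVE = "uncached", "shared", "exclusive"


class DirectoryEntry:
    """One directory entry per memory frame: a presence bit per cache plus state."""

    def __init__(self, num_caches):
        self.presence = [False] * num_caches
        self.state = UNCACHED

    def read(self, cache):
        """Record a read; return caches whose exclusive copy must be downgraded."""
        if self.state == EXCLUSIVE and self.presence[cache]:
            return []  # the owner is re-reading its own copy
        downgrade = []
        if self.state == EXCLUSIVE:
            downgrade = [c for c, p in enumerate(self.presence) if p]
        self.presence[cache] = True
        self.state = SHARED
        return downgrade

    def write(self, cache):
        """Record a write; return the caches that must be invalidated."""
        invalidate = [c for c, p in enumerate(self.presence) if p and c != cache]
        self.presence = [c == cache for c in range(len(self.presence))]
        self.state = EXCLUSIVE
        return invalidate


def presence_words(num_caches, word_bits=32):
    """Words of presence bits needed per frame (32-bit words assumed)."""
    return -(-num_caches // word_bits)  # ceiling division


if __name__ == "__main__":
    print(presence_words(64), presence_words(4096))
```

Under these assumptions, 64 caches need two presence words per frame, while 4096 caches would need 128, which is the scaling problem noted above.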

Note that the DASH is not multithreaded, except that multiple memory references can be in the network at once. The lessons learned from the DASH were the basis for the SGI Origin series.

SGI Origin

The Origin series from SGI is a distributed shared memory multiprocessor derived from the Stanford DASH and the Cray T3D. Each of its nodes consists of a pair of MIPS processors (R10K and up) connected by a common memory controller/network interface called a hub. The hub contains a 4x4 crossbar that connects its four I/O ports. The network interface connects to a router (with a 6x6 crossbar, 4.8 GB/s, synchronous, 200 MHz, 20-bits wide, packet-switched, 128 bit data + 8 bit sideband (control) packets) which can directly connect two nodes (4 processors) within a chassis. Beyond this number, the quad-processor dual-node chassis units are connected by a hypercube of up to 5 dimensions, for a maximum of 128 processors in the original version (larger versions were specifically developed for one of the national labs). The hubs include a separate directory memory for managing coherency.
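The 128-processor limit follows from the hypercube size: 2^5 = 32 chassis of 4 processors each. The sketch below computes that figure and shows generic hypercube addressing. The dimension-order routing shown is a textbook scheme used for illustration and is not claimed to be the routing the Origin actually uses.

```python
def max_processors(dims, processors_per_chassis=4):
    """A dims-dimensional hypercube has 2**dims corners, one chassis per corner."""
    return (1 << dims) * processors_per_chassis


def neighbors(node, dims):
    """Corners directly linked to node: flip one address bit per dimension."""
    return [node ^ (1 << d) for d in range(dims)]


def route(src, dst, dims):
    """Dimension-order routing: fix differing address bits from lowest to highest."""
    path = [src]
    cur = src
    for d in range(dims):
        if (cur ^ dst) & (1 << d):
            cur ^= 1 << d
            path.append(cur)
    return path


if __name__ == "__main__":
    print(max_processors(5))
    print(neighbors(0, 5))
    print(route(0b00000, 0b10110, 5))
```

The number of hops between two chassis equals the number of address bits in which they differ, so no route in a 5-dimensional cube needs more than 5 inter-chassis hops.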

The original router chip, called SPIDER, was 160 square mm, 5-layer, 0.5 micron CMOS in a custom 624 pin ceramic package. It operated on 3.3 volts and dissipated 29 watts. Three of its ports were designed for intra-chassis (single-ended) wiring and three were designed for inter-chassis (differential) signaling. Latency through the chip is about 50 ns.

An interesting physical feature of the Origin is that the hypercube interconnect uses a front-plane bus (ribbon cable) to connect separate chassis. This was always thought to be a problem from a marketing perspective (users dislike having ugly cables running across the fronts of their machines) but SGI has incorporated a clean-looking integral inter-chassis cover that hides the ribbon cables and yet fits appropriately with the exterior design of the machine.

KSR-1

This was a cache-only multiprocessor based on a wide ring with a hierarchy of rings for expanding the system. It used a custom 64-bit processor running at 20 MHz. The majority of each processor was a search engine that maintained a distributed directory and controlled the interface to the ring.

The processor contained only caches -- up to 32 MB per node. Of course, there is no attempt here to save cost by using a memory hierarchy.

The basic idea is that, as data is requested, it goes onto the ring and cycles around to its destination. As it passes other processors that also have outstanding requests for it, they may be granted ownership for long enough to access it before it moves on. Once it reaches the requesting node, it is owned by that node until another request arrives.

Data is transmitted in 128-byte blocks (i.e., 1 Kbit).

The KSR was designed as a total system -- each rack of processors had its own power supply, cooling, disk bays, and I/O channels. In fact, even though it was designed to be a supercomputer, its best applications turned out to be in the area of online transaction processing. The company is now out of business.

Dataflow

We have examined static dataflow in which a graph specifies the operations and tokens flow through the graph in a staged pipelined manner. In a dynamic dataflow processor, tokens are tagged with context information. A processing node fires whenever it finds a complete set of input tokens with the same context.

In a dynamic dataflow system, the tokens are stored in a memory, and on each clock the memory is searched for sets of tokens that are ready to fire. This is, of course, a good application for an associative memory. However, more recent designs have used more sophisticated storage mechanisms to avoid the expense of associative memory.

The result of an operation is a new token that is placed back into memory with an appropriate context. The original input tokens are destroyed (consumed).

Thus, it is possible for groups of tokens to fire out of order, depending on when they are ready.
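The matching rule for a single dynamic dataflow node can be sketched as follows. Tokens are held in a dictionary keyed by context tag, which is a hashed stand-in for the associative search described above; the class name, the port numbering, and the tag strings are assumptions made for the example.

```python
import operator


class DynamicDataflowNode:
    """A node that fires when all inputs carrying the same context tag arrive."""

    def __init__(self, fn, arity=2):
        self.fn = fn
        self.arity = arity
        self.waiting = {}  # context tag -> {input port: value}

    def receive(self, tag, port, value):
        """Store a token; return (tag, result) if this completes a set, else None."""
        slots = self.waiting.setdefault(tag, {})
        slots[port] = value
        if len(slots) < self.arity:
            return None
        del self.waiting[tag]  # the input tokens are consumed
        args = [slots[p] for p in range(self.arity)]
        return tag, self.fn(*args)  # new token keeps the same context


if __name__ == "__main__":
    add = DynamicDataflowNode(operator.add)
    print(add.receive("iter0", 0, 1))    # incomplete set: nothing fires
    print(add.receive("iter1", 0, 10))
    print(add.receive("iter1", 1, 20))   # iter1 fires before iter0
    print(add.receive("iter0", 1, 2))
```

In the usage shown, the tokens for the later context complete first, so that context fires first, which is the out-of-order behavior described in the text.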