Lecture 4

Instruction Set Architectures

Architectural Taxonomies

The literature is filled with attempts to describe a design space of architectures as a space of N orthogonal dimensions. Unfortunately, this popular approach oversimplifies a highly interrelated set of factors. The usual goal of such an exercise is to reduce the complexity of architectural design to a comprehensible level so that predictions can be made. As one would expect, it is of limited value, unless used with great care.

For example, one popular graph plots clock rate against cycles per instruction.

These are independent axes and are not especially representative of any useful architectural factor. The graph also tends to be more indicative of commercial offerings than of particular architectural parameters. For example, the scalar unit of a vector supercomputer may actually be a scalar RISC processor running at 1 CPI and a clock rate of 1 GHz, yet it is lumped in with a different category of processor and not shown explicitly (probably because it messes up the graph).

This is the sort of graph that the RISC community uses to support their case by subtly making us think that lower CPI equates to higher performance. If instead we were to plot cycles per operation, where an operation includes memory fetches, then the graph is not so dramatic.

Here we see that CISC is only slightly behind RISC, and there is no change in the performance of the RISC processors because they are already running near their maximum cycles per operation.
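The difference between the two plots can be sketched with the usual relation, time = instructions x cycles per instruction / clock rate. The numbers below are invented purely for illustration and are not measurements of any processor: a hypothetical CISC machine executes fewer instructions that each embed several operations, while a hypothetical RISC machine executes more instructions of one operation each.

```python
def exec_time_s(instructions, cpi, clock_hz):
    """Execution time from instruction count, cycles per instruction, and clock rate."""
    return instructions * cpi / clock_hz

def cycles_per_operation(cpi, ops_per_instruction):
    """CPO: cycles per instruction divided by the operations packed into each instruction."""
    return cpi / ops_per_instruction

# Hypothetical machines, illustrative values only.
cisc = {"instructions": 1.0e9, "cpi": 4.0, "ops_per_instruction": 3.0, "clock_hz": 1.0e9}
risc = {"instructions": 3.0e9, "cpi": 1.2, "ops_per_instruction": 1.0, "clock_hz": 1.0e9}

for name, m in (("CISC", cisc), ("RISC", risc)):
    cpo = cycles_per_operation(m["cpi"], m["ops_per_instruction"])
    t = exec_time_s(m["instructions"], m["cpi"], m["clock_hz"])
    print(f"{name}: CPI={m['cpi']:.2f}  CPO={cpo:.2f}  time={t:.2f} s")
```

With these assumed values the CPI figures differ by more than a factor of three, while the CPO figures and the execution times differ by only about ten percent.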

Future Factors Influencing Instruction Set Architectures

What happens as technology improves?

Here we see that around 20 GHz, technology hits a brick wall beyond which major changes in the technology are required and expense increases exponentially. This brick wall is the fact that at 20 GHz, light travels only about 1.5 cm in a clock cycle, and electricity in aluminum and copper is still slower. Of course, shrinking the size of the chip, as a result of smaller feature sizes, permits signals to continue to propagate across a chip in one cycle. But achieving higher performance requires either new architectures that can tolerate wire latency, or significantly new hardware technology to further shrink features on the chip.
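The 1.5 cm figure follows directly from dividing the speed of light by the clock rate. A minimal sketch of that arithmetic is given here; the velocity_factor parameter is an assumption added for illustration, to model signals that propagate at some fraction of the speed of light, and no particular value for on-chip wires is implied.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre

def distance_per_cycle_cm(clock_hz, velocity_factor=1.0):
    """Distance a signal travels in one clock period, in centimetres.

    velocity_factor is the fraction of c at which the signal propagates;
    1.0 models light in vacuum (an illustrative parameter, not from the lecture).
    """
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor / clock_hz * 100.0

for ghz in (1, 5, 10, 20):
    print(f"{ghz:>2} GHz: {distance_per_cycle_cm(ghz * 1e9):.2f} cm per cycle")
```

At 20 GHz this gives about 1.5 cm, and any velocity_factor below 1.0 shrinks the reachable distance proportionally.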

Assuming that the technology for even smaller chips is too costly, then the majority of processors will operate at around 15 to 20 GHz, and there will be a continuum of cycles per operation to choose from. Generally, cost increases as CPO decreases. The overlap between the classes of machines has increased so that the distinctions have blurred. New architectural approaches may enable even lower CPOs, and may become the more cost effective path forward for increasing performance. A likely scenario is that the shrinkage of chip feature sizes will continue, but at a slower pace than present, with the difference being made up by new architecture.

Economically, a driving force in the shift away from technology as the prime factor in advancing performance is the cost of new fabrication lines. Because production line construction costs have been growing exponentially, it makes sense that at some point their turnover rate will slow to provide more time to pay off the larger investment. Without slowing the turnover rate, the sales volume must grow in proportion to the construction cost, which requires an exponential growth in developing new markets and applications for processors, while maintaining comparable profit margins per unit. Most models for increasing market volume actually depend, however, on reducing the price point and the margin. So the volume would need to increase at an even greater rate. However, improved performance also opens new markets, so when advances in technology slow, there is an increased interest in improving performance through architecture.

The CISC vs. RISC Dichotomy

The CISC/RISC dichotomy, like so many other artificial divisions of the architectural design space, is mostly marketing hyperbole. The first computer architectures were RISC in nature because it was too costly to build complex instruction sets. The fact is that Seymour Cray was building RISC supercomputers back in the 1960s that included most of the features found in today's designs, and even after the move toward more complex instruction sets in mainframes, IBM began to return to the simplified instruction sets before Berkeley and Stanford coined the terms RISC and CISC.

So what really drives the shift from complex to reduced instruction sets?

The recognition that compilers are unable to take advantage of complex instruction sets.

Why is this?

When there are many alternative instruction sequences for performing an operation, it takes a great deal of domain knowledge to decide which to use. That knowledge is not present in typical cacheless von Neumann programming models. The compiler writer can make only so many of the choices based on the very small contexts of programming language statements, basic blocks and interprocedural analysis, and then must simply take a straightforward implementation.

The complex instruction set machine developers based their designs on observations of instruction sequences, and tried to collapse common sequences into single instructions. The trouble was that, although those sequences fell out of execution, the compiler was unable to predict many of those patterns, and so could not take advantage of the collapsed instructions.

In part, the reason for the failure is that the sequences cannot be predicted because they are not fixed. That is, even though a sequence might be executed a million times in a loop, there might be a thousand cases where it varies slightly (perhaps in a data-dependent fashion) and in such a manner that the collapsed instruction cannot be used. If the branch that handles the variation is in the middle of the sequence, then it is difficult for the compiler to recognize that it can use the instruction if it can reorder the branch and construct an alternate sequence.

In addition, some CISC designers tried to be proactive in their designs by anticipating operations that would be useful to collapse. It is unfortunately quite common for architects to think that they can do something better in hardware than it can be done in software. For example, the VAX provided an operating system call instruction that was supposed to accelerate this operation. The decision to include this operation was not based on observations of a specific instruction sequence, but rather on statistics that indicated the frequency of OS calls. The architects then designed an instruction that they thought would become a standard for all OS calls in the machine. But the inflexibility of having the operation done in hardware made it difficult to use for unforeseen cases, and the complexity of the instruction (due to attempts to make it as general as possible) made it slower to execute than the simplest software OS calls (which account for a large portion of all OS calls). And because the semantics of the instruction are so complex, only an assembly language programmer could judge whether its use was the most efficient for a given instance. Given the trend toward compiled OS code, this instruction was never used.

In a CISC architecture, many of the instructions are never executed. In one study of the DEC PDP-10, it was found that around 70 instructions accounted for 99% of all those executed, and something like 50 accounted for about 95%. Similar statistics have been seen in other CISC designs. The argument that was made in the early days of RISC was that reducing the complexity of the instruction set had two benefits:

1. Reduced number of clock cycles per instruction and reduced levels of gate delay in the control unit.

2. Enabled designers to devote more chip area to enhancements that improve CPI.

The former was especially true when CISC architectures were primarily microprogrammed -- that is, each instruction involved execution of a microprogram from an on-chip ROM, using an internal microengine. While it is true that decoding a CISC instruction still requires more gate delays, designers have gotten clever by implementing the commonly used instructions as a RISC core with fast decode and dispatch. The rest of the instructions still execute slowly, but are used so infrequently (if at all) that they have little impact on performance. For example, the Intel Pentium takes this design approach. AMD translates the Intel CISC instructions into RISC operations internally. Even so, the portion of the chip that deals with the CISC instructions still takes up area that could be used for other enhancements -- although this area is actually just a tiny percentage in a large chip like the Pentium.
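The claim that rarely used slow-path instructions have little impact can be checked with a weighted-average CPI. The sketch below assumes an invented mix (99.9% of executed instructions on a 1-cycle fast path, 0.1% on a 20-cycle slow path); these values are illustrative and are not taken from any real processor.

```python
def effective_cpi(mix):
    """Weighted-average CPI for (fraction_of_executed_instructions, cycles) pairs."""
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix)

# Assumed, illustrative mixes.
fast_only = effective_cpi([(1.0, 1.0)])
with_slow_path = effective_cpi([(0.999, 1.0), (0.001, 20.0)])

print(f"fast path only:            {fast_only:.3f} CPI")
print(f"with 0.1% slow-path usage: {with_slow_path:.3f} CPI")
```

Under these assumptions the slow path raises the average CPI by about 2%. The effect grows quickly if the slow instructions are used more often, which is why the argument depends on their being rare.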

So why is it that RISC processors outperform CISC processors? What is it that distinguishes their instruction sets, if we ignore the CISC instructions that go unused? Let's look at some of the characteristics of CISC and RISC processors and see if we can identify a pattern.

The Intel 386/486/Pentium has 12 addressing modes:

Register

Immediate

Direct

Base

Base + Displacement

Index + Displacement

Scaled Index + Displacement

Based Index

Based Scaled Index

Based Index + Displacement

Based Scaled Index + Displacement

Relative

Operands in the Intel processors can be 8, 16, 32, 48, 64, or 80 bits long. The instruction set also supports string operations over the entire address range.

Instructions in the Intel design can be as small as one byte or as long as 12 bytes, and any length in between. The first bytes generally contain opcode, mode specifiers, and register fields, while the remainder are for address displacement and immediate data.

The Motorola 680X0 has 18 addressing modes:

Data register direct

Address register direct

Immediate

Absolute short

Absolute long

Address register indirect

Address register indirect with postincrement

Address register indirect with predecrement

Address register indirect with displacement

Address register indirect with index (8-bit)

Address register indirect with index (base)

Memory indirect postindexed

Memory indirect preindexed

Program counter indirect with index (8-bit)

Program counter indirect with index (base)

Program counter indirect with displacement

Program counter memory indirect postindexed

Program counter memory indirect preindexed

Operands in the Motorola 68K CISC architecture can be 1 to 32 bits, or 1, 2, 4, 8, 10, or 16 bytes, in length.

Instructions in the 680X0 are stored in 16-bit chunks, with the smallest being a single 2-byte quantity and the largest being 5 of these words in length.

Now let's look at some RISC architectures.

The MIPS R4000 family has four addressing modes:

Base + immediate offset (loads and stores)

Register direct (arithmetic)

Immediate (jumps)

PC relative (branches)

Memory accesses in the MIPS architecture can be any whole number of bytes from 1 to 8.

There are three instruction formats, all of which are 32 bits in length.

The SPARC architecture has 5 addressing modes:

Register indirect with immediate displacement

Register indirect indexed by another register

Register direct

Immediate

PC relative

Operands can be 1, 2, 4 or 8 bytes in size.

There are 3 basic instruction formats, with 3 minor variations, all of which are 32 bits.

The PowerPC architecture has 8 addressing modes:

Register direct

Immediate

Register indirect

Register indirect with immediate index (loads and stores)

Register indirect with register index (loads and stores)

Absolute (jumps)

Link register indirect (calls)

Count register indirect (branches)

There are four operand sizes: 1, 2, 4 or 8 bytes.

The instruction set has 15 different formats with many minor variations, but all are 32 bits in length.

The DEC Alpha AXP has four addressing modes:

Register direct

Immediate

Register indirect with displacement

PC-relative

Operands in the Alpha can be 1, 2, 4 or 8 bytes in length.

The instruction set has 7 formats, all of which are 32 bits long.

The HP Precision Architecture has 7 addressing modes:

Register

Immediate

Base with displacement

Base with scaled index and displacement

Predecrement

Postincrement

PC-relative

There are five operand sizes ranging in powers of two from 1 to 16 bytes.

The PA-RISC has 12 instruction formats, all 32 bits in length.

RISC processors explicitly move data between memory and registers with separate instructions, while CISC processors embed the transfers in other instructions. In addition, RISC processors tend to have just a few addressing modes while CISC processors have many.

The result is that memory accesses and address arithmetic are more tightly bound to data processing instructions in a CISC architecture than in a RISC architecture.

Architecturally, this drives the RISC designer to rely on a larger amount of memory at the top of the hierarchy (registers), while the CISC designer uses a smaller amount of register memory and relies more on memory references.

And what trend have we already seen in memory technology versus processor technology?

Memory technology has been increasing in density but not in speed, while processors have doubled their speed every few years. Thus there is a widening gap in performance between memory and processors. But if they still execute the same number of operations, why should RISC be faster? The answer is scheduling.

Schedulability of an Instruction Set

In terms of software, making the loads, stores and address arithmetic distinct from the data processing operations gives the compiler writer the flexibility (and the responsibility) to reorder memory fetches to hide the access latency. When a memory transfer is bound to an operation, that flexibility does not exist for the compiler, and the architect is forced to use clever tricks to try to predict fetches ahead of time and rearrange them during execution.
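The benefit of separating loads from the operations that use them can be sketched with a toy timing model. Everything here is an assumption made for illustration: a single-issue, in-order machine, a 3-cycle load latency, a 1-cycle ALU latency, and invented register names. It is not a model of any real pipeline.

```python
LOAD_LATENCY = 3  # assumed cycles until a loaded value is usable
ALU_LATENCY = 1   # assumed cycles until an ALU result is usable

def run_in_order(program):
    """Cycle at which the last result is ready on a single-issue, in-order machine.

    Each instruction is (kind, dest, sources). An instruction issues no earlier
    than one cycle after its predecessor and stalls until every source is ready.
    """
    ready_at = {}    # register name -> cycle at which its value is available
    next_issue = 0
    finish = 0
    for kind, dest, sources in program:
        issue = max([next_issue] + [ready_at.get(s, 0) for s in sources])
        latency = LOAD_LATENCY if kind == "load" else ALU_LATENCY
        ready_at[dest] = issue + latency
        finish = max(finish, ready_at[dest])
        next_issue = issue + 1
    return finish

# Each load sits immediately before its use, as when the transfer is bound to the operation.
bound = [
    ("load", "r1", []), ("add", "r2", ["r1", "r1"]),
    ("load", "r3", []), ("add", "r4", ["r3", "r3"]),
]
# The same work with both loads hoisted ahead of the arithmetic.
hoisted = [
    ("load", "r1", []), ("load", "r3", []),
    ("add", "r2", ["r1", "r1"]), ("add", "r4", ["r3", "r3"]),
]

print(run_in_order(bound), run_in_order(hoisted))
```

Under these assumptions the bound ordering finishes at cycle 8 and the hoisted ordering at cycle 5, because the second load's latency overlaps with the first. That reordering is only available when loads are separate instructions.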

Because of the frequency of branches encountered in code, the hardware can only reorder operations over a small window of instructions. But a compiler can examine a much larger scope in its reordering efforts.

A CISC architecture is particularly handicapped in this regard by variable length instructions, because it is impossible to decouple these fetches. When an instruction is fetched, there is very little time to determine whether it requires subsequent fetches, and if one of those fetches is a cache miss, then it is difficult to hide the latency. In a RISC architecture, every fetch gets a complete instruction so that no decisions need to be made and the instruction can be dispatched. Although both employ prefetching, the RISC architecture is guaranteed that a prefetch will always load complete instructions while a CISC prefetch can lead to partial instructions that cannot be fully decoded until another prefetch operation can occur.
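The difficulty with variable-length instructions can be seen in how instruction boundaries are found. The sketch below uses a hypothetical toy encoding, in which the first byte of each instruction holds its total length in bytes. This is not the encoding of any real CISC processor; it only illustrates that each boundary depends on decoding the instruction before it.

```python
def fixed_length_starts(code, width=4):
    """Instruction starts follow from arithmetic alone; no bytes need to be examined."""
    return list(range(0, len(code), width))

def variable_length_starts(code):
    """Toy encoding (hypothetical): the first byte of each instruction is its total length.

    Each start is known only after the previous instruction has been decoded,
    so the boundaries must be discovered sequentially.
    """
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]
        if length == 0:
            raise ValueError("invalid length byte")
        offset += length
    return starts

def fetch_splits_instruction(starts, code_len, fetch_bytes):
    """True if a fetch of fetch_bytes bytes from offset 0 ends in the middle of an instruction."""
    return fetch_bytes < code_len and fetch_bytes not in starts

fixed = bytes(16)  # four 4-byte instructions
variable = bytes([1, 3, 0, 0, 6, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0])  # lengths 1, 3, 6, 2, 4

fixed_starts = fixed_length_starts(fixed)
variable_starts = variable_length_starts(variable)
print(fixed_starts, fetch_splits_instruction(fixed_starts, len(fixed), 8))
print(variable_starts, fetch_splits_instruction(variable_starts, len(variable), 8))
```

In this example an 8-byte fetch of the fixed-length stream always ends on an instruction boundary, while the same fetch of the variable-length stream cuts the 6-byte instruction in two, so it cannot be fully decoded until the next fetch arrives.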

As the number of pipelines in an architecture grows, the ability to reschedule operations other than loads and stores becomes more important. Consider a superscalar architecture with four integer pipes, two floating-point pipes, two branch pipes, and four load-store pipes. This is a total of 12 pipelines, all of which need to be fed an instruction on each cycle in order to be 100% efficient. In a CISC instruction set, each instruction may contain multiple operations. For example, a branch could require address arithmetic, floating arithmetic, and multiple operand loads. All of these could enter the pipes at once. But we’d really like to use the rest of the pipes too. That requires an instruction that uses a complementary subset of the pipes. If the next instruction needs a pipe that the first one is using, then it has to wait for the next cycle. We don’t want to release just some of the operations in an instruction because then we could encounter problems if an exception occurs – the pipe would be stopped while containing pieces of an instruction that’s supposed to be atomic.
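The issue constraint described above can be sketched as a small model. The pipe counts come from the example in the text; the issue rule (strictly in-order, with an instruction entering the pipes only when all of its operations fit in the current cycle) and the instruction mixes are simplifying assumptions for illustration.

```python
PIPES = {"int": 4, "fp": 2, "branch": 2, "ls": 4}  # the 12 pipes from the example

def cycles_to_issue(instructions):
    """Issue cycles needed when each instruction must enter the pipes atomically, in order.

    Each instruction is a dict mapping a pipe type to the number of pipes it needs.
    """
    cycles = 0
    free = {}
    for needs in instructions:
        if any(n > PIPES[p] for p, n in needs.items()):
            raise ValueError("instruction can never issue")
        if cycles == 0 or any(n > free[p] for p, n in needs.items()):
            cycles += 1          # begin a new cycle with every pipe free
            free = dict(PIPES)
        for p, n in needs.items():
            free[p] -= n
    return cycles

# Four compound instructions, each bundling three memory operations with arithmetic.
compound = [{"ls": 3, "fp": 1, "int": 1} for _ in range(4)]

# The same operations issued as independent single-operation instructions.
decoupled = []
for _ in range(4):
    decoupled += [{"ls": 1}, {"ls": 1}, {"ls": 1}, {"fp": 1}, {"int": 1}]

print(cycles_to_issue(compound), cycles_to_issue(decoupled))
```

With these assumed mixes the compound instructions need 4 issue cycles, because a second instruction never fits beside the first, while the decoupled operations need 3, which is the limit set by the four load-store pipes.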

The point to take away from all of this is that coupling multiple operations for disparate functional units into atomic instructions reduces the flexibility for scheduling those operations. This makes it harder to take advantage of the available functional units, and to reorder operations to hide latencies that would stall operations. This is why we see CISC architectures providing a fast-path instruction subset that is more like RISC, and supporting remaining instructions in a slower path for backwards compatibility.

Memory Footprint of Code in an ISA

It is interesting to note, however, that as RISC processors have pushed their clock rates higher and CPIs lower, they have begun to suffer an increasing penalty due to the gap in performance between memory and the processor. Partly this effect is due to the fact that RISC designs have outpaced CISC designs in terms of CPI and clock rate. However, there is also the factor that RISC code is 1.3 to 1.6 times larger than CISC code, necessitating a larger number of fetches. Although compression of instructions was a secondary goal of most CISC designs, it turns out to be a possible saving grace of the approach as the memory-processor speed gap widens.

What could RISC designers do to alleviate the problem of increased code size slowing execution by wasting memory bandwidth?

They could build compression into the memory and corresponding decompression into the processor. Because the compression is dynamic, rather than static as in CISC, it might achieve higher levels of bandwidth utilization, although any such advantage will be context dependent. Experimental work in this area has shown, however, that memory compression works only under limited circumstances. The compression factor must be great enough to save a whole cache-line transfer. Otherwise there are no actual savings. That is, if the memory hardware always has to load four words at once (because there are that many wires physically present), it does no good to compress out the equivalent of one word. All that happens is that some of the wires go unused for a cycle – it still takes the same time to transfer the compressed data. In fact, the compression/decompression overhead usually lengthens the cycle so the overall performance is worse.
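The whole-transfer argument is a matter of rounding up. A minimal sketch follows, assuming the four-word transfer width used in the example above; the function name and payload sizes are illustrative.

```python
import math

def line_transfers(payload_words, words_per_transfer=4):
    """Number of fixed-width memory transfers needed to move payload_words words."""
    return math.ceil(payload_words / words_per_transfer)

# Compressing four words down to three saves nothing: one transfer either way.
print(line_transfers(4), line_transfers(3))

# Compressing eight words down to four saves a whole transfer.
print(line_transfers(8), line_transfers(4))
```

Only the second case reduces the transfer count (from 2 to 1), and even then the saving has to outweigh the compression and decompression overhead.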

Thus, rather than RISC and CISC, we should really distinguish between instruction sets with explicit memory transfer and address calculation and those with implicit memory transfer and address calculation.

As a final note, it should be kept in mind that in many applications the code is quite small in comparison to the data. Most applications have extremely high hit rates in cache. Thus, because instruction fetches may represent only a small fraction of main memory references, reducing their cost may not be the best place to expend resources. Remember Amdahl’s law.
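As a reminder, Amdahl's law bounds the overall gain from improving only part of the work. The sketch below applies it with assumed numbers (instruction fetches taken to be 5% of memory traffic time and made 1.5 times cheaper); both figures are invented for illustration.

```python
def amdahl_speedup(fraction_improved, improvement_factor):
    """Overall speedup when only fraction_improved of the time is sped up by improvement_factor."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / improvement_factor)

# Assumed: instruction fetches are 5% of the time and become 1.5x cheaper.
print(f"{amdahl_speedup(0.05, 1.5):.3f}")

# For comparison, the same 1.5x improvement applied to the remaining 95%.
print(f"{amdahl_speedup(0.95, 1.5):.3f}")
```

Under these assumptions, the first case yields an overall speedup of less than 2%, while the same effort applied to the dominant fraction yields far more.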

The Fallacy of Using Code from the Current ISA to Guide a New ISA Design

It is ironic that implicit memory reference, which assumes the compiler is not sophisticated enough to schedule loads and stores, is associated with CISC, which grew out of the assumption that compilers would be smart enough to recognize complex sequences of instructions that could be mapped to a corresponding instruction set.

In a sense, this error was a self-induced bias of the CISC approach. The instruction sequences that were observed were mostly for machines that had few registers and used memory operands. Even had sequences been examined for early RISC supercomputer processors, it is likely that there would have been little decoupling between fetches and operations because at that time the gap between memory and processor speeds was quite small, so programs could be written with a flat memory model in mind. Thus, it was reasonable to assume that a basic instruction sequence involved fetching operands from memory as part of an operation.

In addition, the compilers that generated the instructions were lagging behind the architectures. Hence, we saw studies of instruction sequences on IBM 360 class machines that indicated that FORTRAN would run more efficiently on an architecture that looked strangely like the old IBM 7090 -- because the compiler was still better tuned for the older architecture.

Compilers and the ISA

What we have seen in explicit memory transfer models is that compilers can be smart enough to schedule loads and stores between cache and registers. This is possible because the granularity of the context in the registers corresponds well to the granularity at which the compiler works -- that is, expressions and statements within subroutines.

Beyond this level of granularity, it is more difficult for the compiler to predict the memory access patterns because the language does not carry semantics for operations of larger granularity. Thus, we fall back on the same approach that is used in the implicit memory transfer model for lower levels of the hierarchy, i.e. we assume that scheduling memory access is solely the responsibility of the architect. Hence we have caching schemes that try to dynamically take advantage of the probability of certain forms of locality in programs.

Only recently have compilers become sufficiently sophisticated that they can optimize over larger code segments, and architects are beginning to make provisions for explicit interactions between the code and the memory hierarchy by allowing special prefetch and eviction operations.

Forward Paths for ISAs

Instruction sets today are continuing to advance. In some cases, they still take CISC-like paths, as when graphics instructions are added that process multiple pixels at once. These are typically used only in hand-coded library routines, and are not generated by a compiler. Another path is to provide ways for a compiler to provide hints to the processor, such as indicating the likely direction of a branch or the point at which a data value is no longer needed in cache. Instruction set extensions such as these provide program-related information that the processor can use or ignore.

One particularly dangerous path is to provide instruction set extensions that try to take advantage of features in an implementation of an instruction set architecture. Such an implementation is called the processor microarchitecture. As an example, an early implementation of the HP PA-RISC provided a set of “shadow registers” into which a subset of the architectural registers could be copied in one operation. The designers viewed this as enabling a fast context switch for lightweight tasks such as interrupt handlers. The design took advantage of the arrangement of the registers as a block of four columns on the chip, and placed the additional shadow registers next to the ones on the exterior of the block, enabling all of the registers in the outside columns to be copied directly to adjacent register elements. The problem with this is that henceforth, the PA-RISC must support this arrangement, even if it becomes more effective to place the registers into a single column.

Other examples include arranging for the instruction slots in a pipeline behind a branch to be filled with operations that are independent of the branch. The number of these slots depends on the length of the pipeline and the branch detection point. If the pipeline is lengthened in a manner that changes the branch detection point, then existing code is no longer correct – hence future designs are restricted to keeping the detection point at the same relative place.

Because of long-term implications such as these, architects are reluctant to change an instruction set architecture unless there are compelling reasons to do so. Such reasons may include a demand by an important customer (several architectures include bit population count operations as a result of government requests, because such operations are useful in code breaking), strong evidence that the change will improve performance, and the desire to provide new capabilities (such as improved memory protection).