Lecture 5

Register and Instruction Set Design

The number of registers in an architecture has a profound impact on the remainder of its design. Increasing the number of registers

            • reduces the number of memory references

            • increases the size of an instruction word

            • places greater demands on the compiler to schedule registers

            • increases the cost of a context swap

It has been shown empirically that the majority of arithmetic expressions can be scheduled into 8 or fewer registers, and that 16 are almost always sufficient for a basic block or procedure execution. However, as compilers increase in sophistication and are more able to optimize interprocedurally, they are able to take advantage of a larger number of registers. Thus, we have seen a trend of increasing the number of registers in architectures.
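The effect of register count on instruction word size can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical three-operand instruction format in which every operand is a register specifier; the function name and the default operand count are illustrative only.

```python
def register_field_bits(num_registers: int, operands: int = 3) -> int:
    """Bits consumed by register specifiers in one instruction.

    Each specifier needs enough bits to name any one register.
    """
    if num_registers < 1:
        raise ValueError("need at least one register")
    bits_per_specifier = (num_registers - 1).bit_length()
    return operands * bits_per_specifier


if __name__ == "__main__":
    for n in (8, 16, 32, 128):
        print(f"{n:4d} registers -> {register_field_bits(n):2d} bits of register fields")
```

Under these assumptions, going from 32 to 128 registers raises the register fields from 15 to 21 bits, which leaves noticeably less room for the operation code and immediate data in a 32-bit word.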

Most early architectures had a single computational register called an accumulator. Some machines added special-purpose registers to this (an accumulator extension for supporting multiplication and division was common, then came stack pointers, index registers, etc.). The trouble with special-purpose registers is that they have limited uses, yet they take up as much real estate as any other register.

Thus, it was realized that a better use of the real estate was to make the registers general purpose, giving them all the same capabilities. This has the further advantage that it is simpler for a compiler to maximize their use, because the algorithm for scheduling of the registers has no special rules that account for different types of usage.

General purpose registers also have the property of being scalable. We can simply add more of them if we wish, and the changes to the rest of the architecture and the software are modest. It is much harder to scale up specialized registers except by designating one of the register types (e.g. the accumulator) as general purpose and adding more of them.

A general purpose register set makes it easier to construct an instruction set that is orthogonal and regular. Orthogonality is the property that each operation on each data type is independent of the potential set of operand sources and destinations. Regularity is the property that the instruction set has relatively complete symmetry of operations and operands. For example, we do not see a transfer that can go in one direction but not the direct reverse.

Inevitably there are a few specialized registers in most designs -- a program counter, status register, configuration register, etc. These are usually either accessed by special instructions that are outside of the normal formats, or by mapping the special registers to particular general purpose register addresses or even memory addresses.

As an interesting aside, an early microprocessor, the TI990, kept all but one of its registers in memory. The one hardware register was a register frame pointer, used as a base to which all other register offsets were added. Of course, because memory speeds did not track processor speeds, the design could not keep pace with the performance of other processors. However, it had a major advantage for context switching because it had only one register to save.
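The idea of memory-resident registers addressed through a single frame pointer can be sketched as a toy model. This is not the actual TI design; the class name, the 16-register frame size, and the 16-bit word width are assumptions made for illustration.

```python
NUM_REGS = 16        # assumed number of registers per frame
WORD_MASK = 0xFFFF   # assumed 16-bit words


class RegisterFrameMachine:
    """Toy machine whose 'registers' are just words in main memory."""

    def __init__(self, memory_words: int = 1024):
        self.memory = [0] * memory_words
        self.frame_pointer = 0   # the only true hardware register in this model

    def _address(self, reg: int) -> int:
        if not 0 <= reg < NUM_REGS:
            raise ValueError("no such register")
        return self.frame_pointer + reg   # base plus register offset

    def read_reg(self, reg: int) -> int:
        return self.memory[self._address(reg)]

    def write_reg(self, reg: int, value: int) -> None:
        self.memory[self._address(reg)] = value & WORD_MASK

    def switch_context(self, new_frame: int) -> int:
        """Point at another register frame; return the old pointer to save."""
        old = self.frame_pointer
        self.frame_pointer = new_frame
        return old
```

A context switch in this model saves and restores one value, but every register read or write costs a memory access, which is the trade-off described above.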

Let's look at some current designs and compare their register sets:

Intel 32-bit Architecture

"General Purpose" registers:

EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP

The first four can be treated as 32 or 16 bit registers, with the lower 16 bits being further divided into bytes that are designated separately and which provide limited compatibility with the 8-bit Intel architectures.
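The overlapping 32-, 16-, and 8-bit views of one of these registers can be illustrated with masks and shifts. The sketch below models EAX and its AX, AH, and AL views; the function names are hypothetical.

```python
def sub_registers(eax: int) -> dict:
    """Return the narrower views that alias the low bits of EAX."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF                 # low 16 bits
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,       # high byte of AX
        "AL": ax & 0xFF,              # low byte of AX
    }


def write_al(eax: int, value: int) -> int:
    """Writing AL changes only the lowest byte of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)
```

Because the views alias the same storage, a write through one name is visible through the others, which is part of what makes these registers less than fully general.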

Segment registers

CS, SS, DS, ES, FS, GS, with CS being Code Segment, SS being Stack Segment, DS being Data Segment (along with the other three)

Instruction Pointer (16 or 32 bits), Flags Register (32 bits, with some undefined, although each new generation seems to define more of them)

Memory Management Registers

Global Descriptor Table Register (48 bits)

Interrupt Descriptor Table Register (48 bits)

Task Register (16-bit selector + 32-bit base address + 32-bit limit + 8-bit attributes)

Local descriptor table register (same as Task Register)

Control Registers

CR0 - CR4, each with a different set of control bits.

Floating Point Registers

R0 - R7, each 80 bits plus a 2-bit tag aggregated into the Tag Word.

Control Register (16 bits), Status Register (16 bits), Tag Word (16 bits), Instruction Pointer (48 bits), Data Pointer (48 bits)

Debug Registers

DR0 - DR7. DR0 - DR3 hold breakpoint addresses, while the other four are for control and status.

Intel 64-bit Architecture

128 general 65-bit registers (64 data bits plus a NaT bit), of which 32 are static and 96 are stacked

128 floating point 82-bit registers

128 application-specific 64-bit registers

64 predicate 1-bit registers

8 branch 64-bit registers

Motorola 680X0

Data Registers D0 - D7 (32 bit)

Address Registers A0 - A6 (32 bit)

            A7 -- Stack Pointer (32 bit)

Program Counter (32 bit)

Condition Code (8-bits user, 16-bits supervisor)

Floating Point FP0 - FP7 (80-bits)

FP Control Register (16 bits user, 32 bits supervisor)

FP Status Register (32 bits)

FP Instruction Address (32 bits)

Supervisor Registers: Interrupt Stack pointer, Master Stack Pointer, Status, Vector Base, Alternate Source and Destination Function Code, Cache Control, User Root Pointer, Supervisor Root Pointer, Translation Control, Data Transparent Translation (2), Instruction Transparent Translation (2), MMU Status. All but two are 32 bit.

MIPS R4000

General Purpose R0 - R31, 64 bits. R0 is constant 0, R31 is a link register for jump and link instructions.

Multiply and divide High/Low -- 64 bits each.

Program Counter (64 bits)

Floating Point: FPR0 - FPR31 (32 or 64 bit), Control/Status (32 bit), Implementation/Revision (32 bit)

Control Coprocessor 32 named registers, each 32 bits. Cache and MMU control, exception processing, debugging.

HP Precision Architecture

General Purpose GR0 - GR31, 64 bits. GR0 is constant 0, GR1 is the target of add immediate left, GR31 is a link register for branch and link instructions.

Shadow registers for context switch. Automatically copy GR1, 8, 9, 16, 17, 24, 25.

Space registers SR0 - SR7, used in virtual addressing.

Control registers CR0 - CR31, with 1 - 7 reserved.

Floating point: 32 64-bit (or 64 32-bit) registers. R0 - R3 are for status, control, and exception processing.

Program status (32 bit)

Alpha AXP

General Purpose R0 - R31, with R31 constant 0 and R30 special purpose, all 64 bit

Program counter 64 bit

Floating point F0 - F31, with F31 set to 0.

PowerPC

General Purpose GPR0 - GPR31 (32 or 64 bit, depending on model)

Link register (32 bit)

Condition register (32 bit)

Count register (32 or 64 bit)

Exception register (32 bit)

Floating point FPR0 - FPR31 (64 bit)

FP status and Control (32 bit)

Real-time clock register (64 bit)

The supervisor mode adds 16 segment registers (32 bit), a machine status register, and numerous special purpose control and status registers. (27 out of 1K implemented).

SPARC

Register Window System

Minimum 2 windows, Maximum 32, typically 8

Each window has 8 in, 8 out, and 8 local general purpose integer registers, plus access to the 8 global registers. The global registers (r0 - r7) remain fixed, with r0 being 0. On a subroutine call, the outs become the ins of the new window and 16 new registers are obtained (out, local). R15 is used as a link register.

State, Window invalid mask, trap base, Y, PC/nPC

Floating point: 32 general purpose, state, queue

Coprocessor: 32 general purpose, state, queue

It was thought that the register windows would provide for a rapid context switch in subroutine calls. Unfortunately, they have several drawbacks:

-- The fixed size of the register windows rarely matches exactly the requirements of a given subroutine call, forcing parameters to either be unnecessarily spilled to memory or registers to be wasted for many subroutine calls.

-- The In/Out/Local/Global designations of the registers within a window effectively make each window a collection of four special-purpose register sets. Thus, the compiler must consider special rules for the use of the registers when scheduling them.

-- Even though there may be many free registers at a given time, they cannot be used if they are outside of the current window. Thus unnecessary spills can be generated.

-- Register windows provide little benefit for recursion because the number of windows is fixed, and stack frames must be spilled once the limit is reached.

What we see in the SPARC is a set of semantic constraints on the use of the top level of the storage hierarchy that result in forced memory accesses that are coupled with operations (in this case a subroutine call). Each subroutine call has the potential to force a shower of memory accesses whose temporal relationship to the call is excessively constrained. The lesson to be learned from the SPARC experience is that even with a reduced instruction set, it is still possible to fall prey to the CISC pitfall of creating implicit relationships between memory access and computations.
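The overlap between windows, and the forced spills once the windows run out, can be sketched as follows. This is a simplified model: it assumes the logical numbering r0 - r7 global, r8 - r15 out, r16 - r23 local, r24 - r31 in, and it advances the window index upward on a call purely for readability (the direction is an assumption of the sketch, as are the function names).

```python
WINDOW_STRIDE = 16   # each call obtains 8 locals + 8 outs of its own


def physical_register(window: int, reg: int, num_windows: int = 8):
    """Map (window, logical register r0..r31) to a physical location."""
    if not 0 <= reg < 32:
        raise ValueError("logical registers are r0..r31")
    if reg < 8:
        return ("global", reg)            # shared by every window
    size = num_windows * WINDOW_STRIDE    # circular windowed register file
    base = window * WINDOW_STRIDE
    if reg < 16:                          # outs: alias the next window's ins
        offset = base + WINDOW_STRIDE + (reg - 8)
    elif reg < 24:                        # locals: private to this window
        offset = base + 8 + (reg - 16)
    else:                                 # ins: alias the previous window's outs
        offset = base + (reg - 24)
    return ("windowed", offset % size)


def windows_spilled(active_frames: int, num_windows: int = 8) -> int:
    """Frames that no longer fit and must be spilled to memory.

    Because the file is circular and adjacent windows overlap, only
    num_windows - 1 frames can be resident at once in this model.
    """
    return max(0, active_frames - (num_windows - 1))
```

In this model a recursion ten frames deep on an 8-window machine spills three frames, and each spill moves a fixed 16 registers regardless of how many the procedure actually used.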

The Intel 64-bit architecture uses a mechanism similar to register windows by providing 32 static registers, and 96 stacked registers. However, the IA64 avoids several of the SPARC’s pitfalls by providing instructions to partition the stacked registers arbitrarily on a procedure call. Thus, for example, one call might have 5 local registers and 2 output registers, and the next could take 12 local and 6 output registers. The stacked registers are automatically renamed to start at 32 for a call. They don’t physically move their contents. When the stacked registers are filled, some are spilled automatically to memory. This has both advantages and disadvantages. Intel points out that it avoids the need for explicit load-store instructions, and the hardware can schedule the loads and stores. However, it also means that the compiler, which has a bigger picture of the code, does not control when spills occur. Thus, the hardware may generate spill operations that are poorly scheduled (too late) or excessive (if aggressive spilling is used).
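The variable-sized frames and renaming described above can be sketched with a toy register stack. This is a heavily simplified model, not Intel's register stack engine: it ignores returns and fills, spills only on overflow, and the class and method names are hypothetical.

```python
FIRST_STACKED = 32   # stacked registers are named starting at r32


class RegisterStack:
    """Toy model of variable-sized register frames with overflow spills."""

    def __init__(self, physical: int = 96):
        self.physical = physical      # number of stacked physical registers
        self.frames = []              # (base slot, locals, outs) per call
        self.spilled_below = 0        # slots below this index live in memory

    def call(self, num_locals: int, num_outs: int) -> int:
        """Enter a procedure; return how many registers had to be spilled."""
        size = num_locals + num_outs
        if size > self.physical:
            raise ValueError("frame larger than the stacked register file")
        if self.frames:
            caller_base, caller_locals, _ = self.frames[-1]
            base = caller_base + caller_locals   # callee starts at caller's outs
        else:
            base = 0
        self.frames.append((base, num_locals, num_outs))
        resident = base + size - self.spilled_below
        overflow = max(0, resident - self.physical)
        self.spilled_below += overflow           # oldest slots go to memory
        return overflow

    def slot_of(self, reg: int) -> int:
        """Rename architectural register r32.. to a stack slot."""
        base, num_locals, num_outs = self.frames[-1]
        index = reg - FIRST_STACKED
        if not 0 <= index < num_locals + num_outs:
            raise ValueError("register outside the current frame")
        return base + index
```

With the example from the text, a call with 5 locals and 2 outputs followed by one with 12 locals and 6 outputs, the callee's r32 lands on the caller's first output register, and no contents are moved.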

Instruction Set Design

Given a set of registers, the next step in formulating an ISA is likely to be a first pass at the instruction set design. The instruction set and the register set may go through several design iterations -- special registers may be needed to support certain operations such as setting operational modes, and these in turn may require changes to the fields in the instruction set.

Almost all instructions contain three types of information:

-- instruction type

-- address(es) of operand(s) or jump target

-- operation code

There is usually at least one instruction type that also contains immediate data. Thus, we may consider the basic contents of an instruction to be the operation and the information required to obtain its operand(s), if we consider the type to be part of the operation code. A few instructions such as Halt or Trap may have no explicit operand and thus merely consist of an operation code.
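The three kinds of information can be seen by packing and unpacking a single instruction word. The sketch below assumes a hypothetical 32-bit, three-register format with a 2-bit type, a 6-bit operation code, and three 5-bit register fields; none of these widths come from a real architecture.

```python
TYPE_BITS, OPCODE_BITS, REG_BITS = 2, 6, 5
UNUSED_BITS = 32 - TYPE_BITS - OPCODE_BITS - 3 * REG_BITS   # 9 spare low bits


def encode(itype: int, opcode: int, rd: int, rs1: int, rs2: int) -> int:
    """Pack the fields, most significant first, into a 32-bit word."""
    fields = ((itype, TYPE_BITS), (opcode, OPCODE_BITS),
              (rd, REG_BITS), (rs1, REG_BITS), (rs2, REG_BITS))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        word = (word << width) | value
    return word << UNUSED_BITS


def decode(word: int):
    """Unpack a word produced by encode()."""
    word >>= UNUSED_BITS
    values = []
    for width in (REG_BITS, REG_BITS, REG_BITS, OPCODE_BITS, TYPE_BITS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    rs2, rs1, rd, opcode, itype = values
    return itype, opcode, rd, rs1, rs2
```

An immediate format would reuse the 9 spare bits (and possibly a register field) for data, which is where the packing problem discussed next begins.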

Trying to create an instruction set becomes a problem in bin-packing, where one tries to fit in all of the necessary pieces in an optimal manner. Optimality is not well-defined, however. The regularity (or symmetry) of the instruction set together with its orthogonality (the independence of operands and operations) often results in operations that make little sense (such as assigning a value to the zero register). It is always tempting to try to make some irregular use of these nonsensical operations, so as to fit more operations into the limited number of bits that are available (recall the desirability of a fixed-length instruction and the implicit limitation of fitting all instructions into a single word).
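To get a feel for how much encoding space such nonsensical operations occupy, consider a rough count. The sketch assumes a hypothetical three-register format with 5-bit register fields in which register 0 is hardwired to zero; the function name is illustrative.

```python
def zero_destination_encodings(reg_bits: int = 5, operands: int = 3):
    """Count register-field patterns whose destination is register 0.

    Returns (wasted, total) per operation code.
    """
    total = 2 ** (reg_bits * operands)
    wasted = 2 ** (reg_bits * (operands - 1))   # destination field fixed at 0
    return wasted, total


if __name__ == "__main__":
    wasted, total = zero_destination_encodings()
    print(f"{wasted} of {total} patterns per opcode write to the zero register")
```

Under these assumptions one pattern in 32 is wasted for every operation code, which is exactly the space a designer is tempted to reclaim at the cost of a less regular decoder.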

Opposing this temptation is the fact that more complex combinations of bit patterns result in more complex decoding logic, which either implies a larger decoder or a slower one. In either case, we are paying a cost that is repeated with each copy of the processor that directly stems from the instruction set design.

Instruction Types

Instruction types are a particular source of decoding complexity, as the type field must be decoded before the other fields can be interpreted, which slows the decoding slightly. One way around this is to simultaneously decode all of the possible interpretations of an instruction and then use the type field to gate out the correct one -- but this is expensive when there are more than a few instruction types.
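The decode-everything-then-select approach can be mimicked in software. The sketch below assumes four hypothetical formats selected by a 2-bit type field in the top bits of a 32-bit word; the field positions are invented for illustration.

```python
def decode_all_formats(word: int):
    """Extract the fields for every format, ignoring the type field.

    In hardware these extractions would all happen in parallel.
    """
    rd = (word >> 19) & 0x1F
    return (
        ("reg", rd, (word >> 14) & 0x1F, (word >> 9) & 0x1F),  # type 0
        ("imm", rd, word & 0x7FFFF),                           # type 1
        ("branch", word & 0xFFFFFF),                           # type 2
        ("control",),                                          # type 3
    )


def decode(word: int):
    candidates = decode_all_formats(word)
    # The type field acts like a multiplexer select over the candidates.
    return candidates[(word >> 30) & 0b11]
```

Every candidate is computed whether or not it is used, which is the software analogue of the extra hardware this approach costs as the number of types grows.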

Another approach is to have the type carry specific meaning, rather than simply be a field that identifies formats arbitrarily. If the type identifies the instruction as being a specific type of operation (e.g. arithmetic, jump, load/store) then the instruction can be dispatched to an appropriate unit of the machine without being fully decoded. The decode is thus parallelized through pipelining.
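When the type field carries meaning, dispatch needs to look at nothing else. The following sketch assumes a hypothetical 2-bit type field in the top bits of a 32-bit word and four unit names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical assignment of type codes to execution units.
UNIT_FOR_TYPE = {
    0b00: "arithmetic",
    0b01: "load/store",
    0b10: "branch",
    0b11: "control",
}


def dispatch(words):
    """Route each instruction word to a unit using only its type field."""
    queues = defaultdict(list)
    for word in words:
        itype = (word >> 30) & 0b11      # the only bits examined here
        queues[UNIT_FOR_TYPE[itype]].append(word)
    return dict(queues)
```

The remaining fields are left for the receiving unit to decode, so the full decode is spread across pipeline stages rather than done in one place.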

Another trick is to partially decode instructions as they are loaded into the instruction cache. Because a cache miss is a slow operation, the extra time to decode some aspects of the instruction can be hidden. Branches are particularly useful to predecode, because early detection of a branch outcome reduces the pipeline penalty associated with mispredicting the branch. Some systems keep branch prediction or branch target addresses together with the predecoded branches in the cache. Thus, once a branch has been executed, its target and past behavior are stored back into the cache so that they can be automatically used as soon as the line containing the branch is loaded into the instruction prefetch buffer.
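Predecoding at cache-fill time can be sketched as a cache line that carries extra bits alongside its instruction words. This is a toy model under stated assumptions: the branch type code, the class name, and the single stored target per branch are all hypothetical.

```python
BRANCH_TYPE = 0b10   # hypothetical type code for branches


class PredecodedLine:
    """Instruction cache line annotated when it is filled from memory."""

    def __init__(self, words):
        self.words = list(words)
        # Predecode during the (slow) fill: mark which words are branches.
        self.is_branch = [((w >> 30) & 0b11) == BRANCH_TYPE for w in self.words]
        # Filled in later, once each branch has actually executed.
        self.predicted_target = [None] * len(self.words)

    def record_outcome(self, index: int, taken: bool, target: int) -> None:
        """Store a branch's observed behavior back into the line."""
        if not self.is_branch[index]:
            raise ValueError("word is not a branch")
        self.predicted_target[index] = target if taken else None
```

The next time the line is fetched, the branch markers and any recorded target are available immediately, before the words themselves have been decoded.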

Each additional bit in the type field doubles the number of possible instruction types but also takes one more bit away from the space available for every instruction. A single type bit is probably too limiting and forces the two resulting types to hold instructions with unrelated functions. Four instruction types (two bits) are easily distinguished as arithmetic, load/store, branch/jump, and other control. Often within an "other" type there may be subtypes. Some instruction sets also distinguish floating point as a different type, while others make it a subtype of arithmetic. Because modern designs have separate floating point pipes, it is desirable to be able to identify these operations early so that they can be scheduled independently of other arithmetic operations.

Another way to extend the number of potential instructions is to define operational modes for the processor. For example, rather than having distinct big-endian and little-endian operations, a mode bit in a control register could be set to change the interpretation of the arithmetic instructions. The mode bits can be viewed as an implicit extension of the instruction field. They should be orthogonal to the rest of the instruction set, such as controlling cache protocol or floating point format, so that they do not have to be changed frequently (another example of avoiding implicit relationships between operations and storage transfers).
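The endianness example can be sketched as a single load operation whose meaning depends on a mode bit. The function name and the 4-byte word size are assumptions for illustration.

```python
def load_word(memory: bytes, address: int, big_endian_mode: bool) -> int:
    """Load a 4-byte word; the mode bit selects the byte order.

    The same instruction encoding yields different results depending on
    the mode, so the mode bit behaves like an extra opcode bit.
    """
    raw = memory[address:address + 4]
    if len(raw) != 4:
        raise IndexError("word extends past the end of memory")
    return int.from_bytes(raw, "big" if big_endian_mode else "little")
```

One load encoding serves both byte orders here, at the price of hidden state that must be correct whenever the instruction executes.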