Register and Instruction Set Design
The number of registers in an architecture has a profound impact on the remainder
of its design. Increasing the number of registers
• reduces the number of memory references
• increases the size of an instruction word
• places greater demands on the compiler to schedule registers
• increases the cost of a context swap
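The effect on the instruction word can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical three-operand format in which every operand is a register number; the function names are illustrative and not part of any real toolchain.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed to name one register out of num_registers."""
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2, without floating point.
    return max(1, (num_registers - 1).bit_length())


def operand_bits(num_registers: int, operands: int = 3) -> int:
    """Total bits spent on register specifiers in one instruction
    (assumes every operand is a register, which is a simplification)."""
    return operands * register_field_bits(num_registers)


for n in (8, 16, 32, 128):
    print(n, register_field_bits(n), operand_bits(n))
```

Under these assumptions, moving from 32 to 128 registers raises the register-specifier cost of a three-operand instruction from 15 bits to 21 bits.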
It has been shown empirically that the majority of
arithmetic expressions can be scheduled into 8 or fewer registers, and that 16
are almost always sufficient for a basic block or procedure execution. However,
as compilers increase in sophistication and are more able to optimize
interprocedurally, they are able to take advantage of a larger number of
registers. Thus, we have seen a trend of increasing the number of registers in
architectures.
Most early architectures had a single computational
register called an accumulator. Some machines added special-purpose registers
to this (an accumulator extension for supporting multiplication and division
was common, then came stack pointers, index registers, etc.). The trouble with
special-purpose registers is that they have limited uses, yet they take up as
much real estate as any other register.
Thus, it was realized that a better use of the real
estate was to make the registers general purpose, giving them all the same
capabilities. This has the further advantage that it is simpler for a compiler
to maximize their use, because the algorithm for scheduling of the registers
has no special rules that account for different types of usage.
General purpose registers also have the property of
being scalable. We can simply add more of them if we wish, and the changes to
the rest of the architecture and the software are modest. It is much harder to
scale up specialized registers except by designating one of the register types
(e.g. the accumulator) as general purpose and adding more of them.
A general purpose register set makes it easier to
construct an instruction set that is orthogonal and regular. Orthogonality is
the property that each operation on each data type is independent of the
potential set of operand sources and destinations. Regularity is the property that
the instruction set has relatively complete symmetry of operations and
operands. For example, we do not see a transfer that can go in one direction
but not in the reverse direction.
Inevitably there are a few specialized registers in
most designs -- a program counter, status register, configuration register,
etc. These are usually either accessed by special instructions that are outside
of the normal formats, or by mapping the special registers to particular
general purpose register addresses or even memory addresses.
As an interesting aside, an early microprocessor, the
TI990, kept all but one of its registers actually in memory. The one register
was a register frame pointer and was used as a base to which all other register
offsets were added. Of course, because memory speeds did not keep pace with processor
speeds, the design could not match the performance gains of processors with on-chip registers.
However, it had a major advantage for context switching because it had only one
register to save.
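The idea of keeping the register set in memory behind a single pointer can be sketched as a toy model. The class and method names, the memory size, and the figure of 16 registers per register frame are assumptions made for illustration; this is not a description of the actual hardware.

```python
class WorkspaceMachine:
    """Toy model: registers live in memory and are located by one pointer."""

    NUM_REGS = 16  # assumed number of registers per frame

    def __init__(self, memory_words: int = 256):
        self.memory = [0] * memory_words
        self.frame_pointer = 0  # the only true hardware register in this model

    def _address(self, r: int) -> int:
        if not 0 <= r < self.NUM_REGS:
            raise ValueError("no such register")
        return self.frame_pointer + r  # register offset added to the base

    def read_reg(self, r: int) -> int:
        return self.memory[self._address(r)]

    def write_reg(self, r: int, value: int) -> None:
        self.memory[self._address(r)] = value

    def context_switch(self, new_frame_pointer: int) -> int:
        """Switch register sets by changing one pointer; return the old one."""
        old = self.frame_pointer
        self.frame_pointer = new_frame_pointer
        return old


m = WorkspaceMachine()
m.write_reg(3, 111)          # task A uses register 3
old = m.context_switch(64)   # task B gets a fresh register frame
m.write_reg(3, 222)
m.context_switch(old)        # back to task A: nothing was copied
print(m.read_reg(3))         # prints 111
```

Every register access in this model is a memory access, which is the performance cost described above, while the context switch touches only one value.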
Let's look at some current designs and compare their
register sets:
Intel 32-bit Architecture
"General Purpose" registers:
EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP
The first four can be treated as 32 or 16 bit
registers, with the lower 16 bits being further divided into bytes that are
designated separately and which provide limited compatibility with the 8-bit
Intel architectures.
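The way a 32-bit register is overlaid by 16-bit and 8-bit names can be sketched with masks and shifts. This is a minimal illustration using EAX (with AX as its low 16 bits, and AH and AL as the high and low bytes of AX); the helper names are hypothetical.

```python
def sub_registers(eax: int) -> dict:
    """Return the values seen through each name that aliases EAX."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF                 # low 16 bits
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,       # high byte of AX
        "AL": ax & 0xFF,              # low byte of AX
    }


def write_al(eax: int, value: int) -> int:
    """Writing AL replaces only the lowest byte of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


print({name: hex(v) for name, v in sub_registers(0x12345678).items()})
print(hex(write_al(0x12345678, 0xFF)))
```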
Segment registers
CS, SS, DS, ES, FS, GS, with CS being Code Segment, SS
being Stack Segment, DS being Data Segment (along with the other three)
Instruction Pointer (16 or 32 bits), Flags Register
(32 bits, with some undefined, although each new generation seems to define
more of them)
Memory Management Registers
Global Descriptor Table Register (48 bits)
Interrupt Descriptor Table Register (48 bits)
Task Register (16 selector + 32 base address + 32 limit + 8 attributes)
Local Descriptor Table Register (same as Task Register)
Control Registers
CR0 - CR4, each with a different set of control bits.
Floating Point Registers
R0 - R7, each 80 bits plus a 2-bit tag aggregated into
the Tag Word.
Control Register (16 bits), Status Register (16 bits),
Tag Word (16 bits), Instruction Pointer (48 bits), Data Pointer (48 bits)
Debug Registers
DR0 - DR7. DR0 - DR3 hold breakpoints while the other
four are for control and status.
Intel 64-bit Architecture
128 general 65-bit registers, of which 32 are static
and 96 are stacked
128 floating point 82-bit registers
128 application-specific 64-bit registers
64 predicate 1-bit registers
8 branch 64-bit registers
Motorola 680X0
Data Registers D0 - D7 (32 bit)
Address Registers A0 - A6 (32 bit)
A7 -- Stack Pointer (32 bit)
Program Counter (32 bit)
Condition Code (8-bits user, 16-bits supervisor)
Floating Point FP0 - FP7 (80-bits)
FP Control Register (16 bits user, 32 bits supervisor)
FP Status Register (32 bits)
FP Instruction Address (32 bits)
Supervisor Registers: Interrupt Stack pointer, Master
Stack Pointer, Status, Vector Base, Alternate Source and Destination Function
Code, Cache Control, User Root Pointer, Supervisor Root Pointer, Translation
Control, Data Transparent Translation (2), Instruction Transparent Translation
(2), MMU Status. All but two are 32 bit.
MIPS R4000
General Purpose R0 - R31, 64 bits. R0 is constant 0,
R31 is a link register for link and jump instructions.
Multiply and divide High/Low -- 64 bits each.
Program Counter (64 bits)
Floating Point: FPR0 - FPR31 (32 or 64 bit),
Control/Status (32 bit), Implementation/Revision (32 bit)
Control Coprocessor 32 named registers, each 32 bits.
Cache and MMU control, exception processing, debugging.
HP Precision Architecture
General Purpose GR0 - GR31, 64 bits. GR0 is constant
0, GR1 is the target of add immediate left, GR31 is a link register for branch
and link instructions.
Shadow registers for context switch. Automatically
copy GR1, 8, 9, 16, 17, 24, 25.
Space registers SR0 - SR7, used in virtual addressing.
Control registers CR0 - CR31, with 1 - 7 reserved.
Floating point: 32 64-bit registers (or 64 32-bit registers). R0 - R3 are
for status, control, and exception processing.
Program status (32 bit)
Alpha AXP
General Purpose R0 - R31, with R31 constant 0 and R30
special purpose, all 64 bit
Program counter 64 bit
Floating point F0 - F31, with F31 set to 0.
PowerPC
General Purpose GPR0 - GPR31 (32 or 64 bit, depending
on model)
Link register (32 bit)
Condition register (32 bit)
Count register (32 or 64 bit)
Exception register (32 bit)
Floating point FPR0 - FPR31 (64 bit)
FP status and Control (32 bit)
Real-time clock register (64 bit)
The supervisor mode adds 16 segment registers (32
bit), a machine status register, and numerous special purpose control and
status registers. (27 out of 1K implemented).
SPARC
Register Window System
Minimum 2 windows, Maximum 32, typically 8
Each window gives access to 8 in, 8 out, 8 local, and 8 global general
purpose integer registers. The global registers (r0 - r7) remain fixed, with r0
being 0. On a context switch (a subroutine call), the outs become the ins of the new window and 16
new registers are obtained (out, local). R15 is used as a link register.
State, Window invalid mask, trap base, Y, PC/nPC
Floating point: 32 general purpose, state, queue
Coprocessor: 32 general purpose, state, queue
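The overlap between adjacent integer register windows can be sketched as an index calculation. The physical layout below (globals first, then a ring of 16 registers per window, with a call moving to the next-lower window number) is an assumption made for illustration, and the names `physical_register` and `call` are hypothetical.

```python
NWINDOWS = 8  # assumed; the typical figure quoted above


def physical_register(cwp: int, r: int, nwindows: int = NWINDOWS) -> int:
    """Map logical register r (0-31) of window cwp to a physical index."""
    if not 0 <= r < 32:
        raise ValueError("register number must be 0-31")
    if r < 8:                       # globals are shared by every window
        return r
    ring = 16 * nwindows            # 8 outs + 8 locals per window
    if r < 24:                      # outs (r8-r15) and locals (r16-r23)
        offset = cwp * 16 + (r - 8)
    else:                           # ins (r24-r31) alias the caller's outs
        offset = (cwp + 1) * 16 + (r - 24)
    return 8 + offset % ring


def call(cwp: int, nwindows: int = NWINDOWS) -> int:
    """A call moves to the adjacent window (modeled here as a decrement)."""
    return (cwp - 1) % nwindows


caller = 3
callee = call(caller)
# The caller's first out (r8) and the callee's first in (r24) are one register.
print(physical_register(caller, 8), physical_register(callee, 24))
```

In this model the link register R15 of the caller is the same physical register as R31 of the callee, and only the 16 outs and locals of the new window are fresh.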
It was thought that the register windows would provide
for a rapid context switch in subroutine calls. Unfortunately, they have
several drawbacks:
-- The fixed size of the register
windows rarely matches exactly the requirements of a given subroutine call, so
either values are unnecessarily spilled to memory or registers are
wasted on many subroutine calls.
-- The In/Out/Local/Global designations
of the registers within a window effectively make each window a collection of
four special-purpose register sets. Thus, the compiler must consider special
rules for the use of the registers when scheduling them.
-- Even though there may be many free
registers at a given time, they cannot be used if they are outside of the
current window. Thus unnecessary spills can be generated.
-- Register windows provide little
benefit for recursion because the number of windows is fixed, and stack frames
must be spilled once the limit is reached.
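The recursion drawback in the list above can be illustrated by counting window spills and fills for a trace of calls and returns. This is a simplified sketch: it assumes every window can hold a procedure (real implementations may reserve one), and the function name and the +1/-1 event encoding are hypothetical.

```python
def count_window_traffic(events, nwindows: int = 8):
    """Return (spills, fills) for a sequence of +1 (call) / -1 (return) events."""
    resident = 1     # windows currently held in the register file
    in_memory = 0    # windows that have been spilled to the stack
    spills = fills = 0
    for event in events:
        if event > 0:                        # call: a new window is needed
            if resident == nwindows:         # file is full: spill the oldest
                spills += 1
                in_memory += 1
            else:
                resident += 1
        else:                                # return: go back to the caller
            if resident > 1:
                resident -= 1                # caller's window is still resident
            elif in_memory > 0:
                in_memory -= 1               # caller's window must be reloaded
                fills += 1
    return spills, fills


# Recursion to depth 20 followed by 20 returns, with 8 windows.
print(count_window_traffic([+1] * 20 + [-1] * 20))   # (13, 13)
```

Once the recursion is deeper than the number of windows, every further call in this model costs a full window spill and every matching return costs a fill, so the windows no longer save any memory traffic.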
What we see in the SPARC is a set of semantic
constraints on the use of the top level of the storage hierarchy that result in
forced memory accesses that are coupled with operations (in this case a
subroutine call). Each subroutine call has the potential to force a shower of
memory accesses whose temporal relationship to the call is excessively
constrained. The lesson to be learned from the SPARC experience is that even
with a reduced instruction set, it is still possible to fall prey to the CISC
pitfall of creating implicit relationships between memory access and
computations.
The Intel 64-bit architecture uses a mechanism similar
to register windows by providing 32 static registers, and 96 stacked registers.
However, the IA64 avoids several of the SPARC’s pitfalls by providing
instructions to partition the stacked registers arbitrarily on a procedure
call. Thus, for example, one call might have 5 local registers and 2 output
registers, and the next could take 12 local and 6 output registers. The stacked
registers are automatically renamed to start at 32 for a call. They don’t
physically move their contents. When the stacked registers are filled, some are
spilled automatically to memory. This has both advantages and disadvantages.
Intel points out that it avoids the need for explicit load-store instructions,
and the hardware can schedule the loads and stores. However, it also means that
the compiler, which has a bigger picture of the code, does not control when spills
occur. Thus, the hardware may generate spill operations that are poorly
scheduled (too late) or excessive (if aggressive spilling is used).
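The variable-sized partitioning and renaming described above can be sketched as follows. This is a toy model under stated assumptions: the class and method names are hypothetical, the frame is described only by a local count and an output count, and automatic spilling to memory is not modeled.

```python
class RegisterStack:
    STACKED = 96   # number of stacked registers, from the text
    FIRST = 32     # stacked registers are renamed to start at r32

    def __init__(self):
        self.base = 0          # physical index currently named r32
        self.n_locals = 0      # registers private to the current procedure
        self.frame_size = 0    # locals plus outputs
        self._saved = []       # caller frames, restored on return

    def alloc(self, n_locals: int, n_outs: int) -> None:
        """Partition the current frame into locals and outputs."""
        if n_locals + n_outs > self.STACKED:
            raise ValueError("frame larger than the stacked register file")
        self.n_locals = n_locals
        self.frame_size = n_locals + n_outs

    def physical(self, r: int) -> int:
        i = r - self.FIRST
        if not 0 <= i < self.frame_size:
            raise ValueError(f"r{r} is outside the current frame")
        return (self.base + i) % self.STACKED

    def call(self) -> None:
        """The caller's outputs become the start of the callee's frame."""
        self._saved.append((self.base, self.n_locals, self.frame_size))
        self.base = (self.base + self.n_locals) % self.STACKED
        self.frame_size -= self.n_locals
        self.n_locals = 0

    def ret(self) -> None:
        self.base, self.n_locals, self.frame_size = self._saved.pop()


rs = RegisterStack()
rs.alloc(5, 2)                 # 5 locals (r32-r36), 2 outputs (r37-r38)
first_out = rs.physical(37)
rs.call()
print(first_out == rs.physical(32))   # caller's r37 is the callee's r32
rs.alloc(12, 6)                # the next frame can have a different shape
```

Only the mapping from register names to physical registers changes on a call in this sketch; no register contents are copied.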
Instruction Set Design
Given a set of registers, the next step in formulating
an ISA is likely to be a first pass at the instruction set design. The
instruction set and the register set may go through several design iterations
-- special registers may be needed to support certain operations such as
setting operational modes, and these in turn may require changes to the fields
in the instruction set.
Almost all instructions contain three types of
information:
-- instruction type
-- address(es) of operand(s) or jump target
-- operation code
There is usually at least one instruction type that
also contains immediate data. Thus, we may consider the basic contents of an
instruction to be the operation and the information required to obtain its
operand(s), if we consider the type to be part of the operation code. A few
instructions such as Halt or Trap may have no explicit operand and thus merely
consist of an operation code.
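The contents of an instruction can be pictured as bit fields packed into a word. The format below (a 2-bit type, a 6-bit operation code, three 5-bit register addresses, and 9 unused bits in a 32-bit word) is purely hypothetical and is chosen only to illustrate the packing; it does not correspond to any of the architectures listed earlier.

```python
# (field name, width in bits), listed from the most significant bits down
FIELDS = (("type", 2), ("opcode", 6), ("rd", 5), ("rs1", 5), ("rs2", 5), ("unused", 9))
WORD_BITS = sum(width for _, width in FIELDS)   # 32 for this format


def encode(**values: int) -> int:
    """Pack named field values into one instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word: int) -> dict:
    """Split an instruction word back into its fields."""
    fields = {}
    shift = WORD_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


word = encode(type=0, opcode=1, rd=3, rs1=4, rs2=5)
print(f"{word:032b}")
print(decode(word))
```

An instruction such as Halt would, in this format, use only the type and opcode fields and leave the rest unused.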
Trying to create an instruction set becomes a problem
in bin-packing, where one tries to fit in all of the necessary pieces in an
optimal manner. Optimality is not well-defined, however. The regularity (or
symmetry) of the instruction set together with its orthogonality (the
independence of operands and operations) often results in operations that make
little sense (such as assigning a value to the zero register). It is always
tempting to try to make some irregular use of these nonsensical operations, so
as to fit more operations into the limited number of bits that are available
(recall the desirability of a fixed-length instruction and the implicit
limitation of fitting all instructions into a single word).
Opposing this temptation is the fact that more complex
combinations of bit patterns result in more complex decoding logic, which
implies either a larger decoder or a slower one. In either case, we pay
a cost that stems directly from the instruction set design and is repeated
with every copy of the processor.
Instruction types are a particular source of decoding
complexity, as the type field must first be decoded before the other fields,
which slows the decoding slightly. One way around this is to simultaneously
decode all of the possible interpretations of an instruction and then use the
type field to gate out the correct one -- but this is expensive when there are
more than a few instruction types.
Another approach is to have the type carry specific
meaning, rather than simply be a field that identifies formats arbitrarily. If
the type identifies the instruction as being a specific type of operation (e.g.
arithmetic, jump, load/store) then the instruction can be dispatched to an
appropriate unit of the machine without being fully decoded. The decode is thus
parallelized through pipelining.
Another trick is to partially decode instructions as
they are loaded into the instruction cache. Because a cache miss is a slow operation,
the extra time to decode some aspects of the instruction can be hidden.
Branches are particularly useful to predecode, because early detection of a
branch outcome reduces the pipeline penalty associated with mispredicting the
branch. Some systems keep branch prediction or branch target addresses together
with the predecoded branches in the cache. Thus, once a branch has been
executed, its target and past behavior are stored back into the cache so that they
can be automatically used as soon as the line containing the branch is loaded
into the instruction prefetch buffer.
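Predecoding at cache-fill time can be sketched as extra bits stored beside each instruction word. Everything in this sketch is an assumption made for illustration: the rule that the top two bits 0b10 mark a branch, the class name `PredecodedLine`, and the choice to remember only the most recent outcome.

```python
def looks_like_branch(word: int) -> bool:
    """Hypothetical encoding rule: top two bits 0b10 mark a branch."""
    return ((word >> 30) & 0b11) == 0b10


class PredecodedLine:
    """One instruction-cache line with predecode information attached."""

    def __init__(self, words):
        self.words = list(words)
        # Computed once, while the (slow) cache fill is in progress.
        self.is_branch = [looks_like_branch(w) for w in self.words]
        self.history = {}   # slot -> (taken, target) from the last execution

    def record_outcome(self, slot: int, taken: bool, target: int) -> None:
        """Store a branch's behavior back alongside the cached instruction."""
        if not self.is_branch[slot]:
            raise ValueError("slot does not hold a branch")
        self.history[slot] = (taken, target)

    def prediction(self, slot: int):
        """Return (taken, target) if the branch has executed before, else None."""
        return self.history.get(slot)


line = PredecodedLine([0x00000001, 0x80000010])
print(line.is_branch)                 # [False, True]
line.record_outcome(1, True, 0x400)
print(line.prediction(1))             # (True, 1024)
```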
Each additional bit in the type field doubles the
number of instruction types that can be distinguished but also takes one more bit away from the space
available within every instruction. A single type bit is probably too limiting and
forces the two resulting types to hold instructions with unrelated functions.
Four instruction types (two bits) are easily distinguished as arithmetic,
load/store, branch/jump, and other control. Often within an "other"
type there may be subtypes. Some instruction sets also distinguish floating
point as a different type, while others make it a subtype of arithmetic. Because
modern designs have separate floating point pipes, it is desirable to be able
to identify these operations early so that they can be scheduled independently
of other arithmetic operations.
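A two-bit type field that carries meaning can be sketched as a dispatch on the top bits of the word, leaving the rest of the decode to the receiving unit. The 32-bit word size, the position of the type field, the numbering of the types, and the unit names are all assumptions made for this illustration.

```python
from enum import IntEnum

WORD_BITS = 32   # assumed instruction width
TYPE_BITS = 2    # four instruction types


class InstrType(IntEnum):
    ARITHMETIC = 0
    LOAD_STORE = 1
    BRANCH_JUMP = 2
    OTHER_CONTROL = 3


UNIT = {
    InstrType.ARITHMETIC: "arithmetic unit",
    InstrType.LOAD_STORE: "load/store unit",
    InstrType.BRANCH_JUMP: "branch unit",
    InstrType.OTHER_CONTROL: "control unit",
}


def instruction_type(word: int) -> InstrType:
    """Look only at the type field (assumed to be the top bits of the word)."""
    return InstrType((word >> (WORD_BITS - TYPE_BITS)) & ((1 << TYPE_BITS) - 1))


def dispatch(word: int) -> str:
    """Choose a unit from the type bits alone; the unit finishes the decode."""
    return UNIT[instruction_type(word)]


def remaining_bits(type_bits: int, word_bits: int = WORD_BITS) -> int:
    """Bits left for the opcode and operands once the type is encoded."""
    return word_bits - type_bits


print(dispatch(0x80000000))                   # branch unit
print(remaining_bits(2), remaining_bits(3))   # 30 29
```

Adding a third type bit in this model would allow eight types, for example a separate floating point type, at the price of one fewer bit in every instruction.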
Another way to extend the number of potential
instructions is to define operational modes for the processor. For example,
rather than having distinct big-endian and little-endian operations, a mode bit
in a control register could be set to change how the affected
instructions are interpreted. The mode bits can be viewed as an implicit extension of
the instruction field. They should be orthogonal to the rest of the instruction
set, controlling such things as cache protocol or floating point format, so that they
do not have to be changed frequently (another example of avoiding implicit
relationships between operations and storage transfers).
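A mode bit acting as an implicit extension of the instruction can be sketched with the endianness example. The model below is hypothetical: a single `big_endian` flag stands in for a bit in a control register, and the same `load_word` operation is interpreted differently depending on it.

```python
class Machine:
    """Toy machine whose word load is interpreted through a mode bit."""

    def __init__(self, memory: bytes, big_endian: bool = False):
        self.memory = bytearray(memory)
        self.big_endian = big_endian   # stands in for a control-register bit

    def load_word(self, addr: int) -> int:
        """One load operation; the mode bit selects the byte order."""
        order = "big" if self.big_endian else "little"
        return int.from_bytes(self.memory[addr:addr + 4], order)


m = Machine(bytes([0x12, 0x34, 0x56, 0x78]))
print(hex(m.load_word(0)))   # 0x78563412 in little-endian mode
m.big_endian = True          # set once, not per instruction
print(hex(m.load_word(0)))   # 0x12345678 in big-endian mode
```

The instruction encoding needs no extra opcode for the second byte order; the cost is that the meaning of a load now depends on state held outside the instruction.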