MIPS R4400 Pipeline Case Study
The MIPS R4400 has an 8-stage pipeline with the
following stages:
IF  Instruction First
IS  Instruction Second
RF  Register File
EX  Execute
DF  Data First
DS  Data Second
TC  Data Tag Check
WB  Write Back
In IF, the branch logic selects the instruction
address and the I-cache fetch begins. The instruction TLB begins the
virtual-to-physical address translation.
In IS, the fetch and translation complete.
In RF, the instruction decode occurs and the processor
checks for interlocks. The instruction cache tag is checked against the page
frame from the instruction TLB. Operands are fetched from the register file.
In EX, one of the following occurs: The ALU performs a
register-to-register operation; the ALU calculates the data's virtual address
for a load or store; the ALU determines whether a branch condition is true and
calculates the virtual target address if the operation is a branch.
In DF, one of the following occurs: The data cache
fetch and data TLB translation begin for a load or store; the branch target
instruction address translation and TLB update begin for a branch; nothing
happens for a register-to-register operation.
In DS, one of the following happens: the data cache
fetch and TLB translation complete for a load/store and the shifter aligns data
to a word or double word boundary; the branch instruction address translation
and TLB update complete for branches; nothing happens for a
register-to-register operation.
In TC, for a load/store the data cache performs a tag
check -- the TLB physical address is checked against the cache tag to determine
whether it hit. Nothing happens for a register-to-register operation.
In WB, the result of a register-to-register operation
is written back to the register file. Branches do nothing during this stage.
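The stage sequence described above can be sketched as a small occupancy chart. This is a minimal illustration, assuming an ideal pipeline with no stalls, slips, or exceptions, in which one instruction enters IF per cycle. The names `ideal_schedule` and `render` are hypothetical helpers, not part of any R4400 tool.

```python
# Stage names as listed in the text, in pipeline order.
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def ideal_schedule(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in
    cycle c (0-based), or "" when the instruction is not in the pipe.

    Assumes one instruction enters IF per cycle and nothing ever stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i reaches stage s in cycle i + s
        rows.append(row)
    return rows


def render(rows):
    """Format the schedule as fixed-width text, one instruction per line."""
    return "\n".join(
        " ".join(f"{cell:<2}" for cell in row).rstrip() for row in rows
    )


print(render(ideal_schedule(3)))
```

Each printed row is shifted one cycle to the right of the row above it, which is the pattern that the stall and slip traces later in these notes disturb.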
The following diagram shows what happens at each stage
in the pipeline for load/store and branch operations.

IC1    Instruction cache access part 1
IC2    Instruction cache access part 2
ITLB1  Instruction address translation part 1
ITLB2  Instruction address translation part 2
ITC    Instruction cache tag check
IDEC   Instruction decode
RF     Register operand fetch
ALU    ALU operation (register-to-register)
DVA    Data virtual address calculation
DC1    Data cache access part 1
DC2    Data cache access part 2
LSA    Load/store alignment with shifter
JTLB1  Data/address translation part 1
JTLB2  Data/address translation part 2
DTC    Data cache tag check
IVA    Instruction virtual address calculation
WB     Write back to register file
Notice that the load/store unit and the branch unit
each do their own virtual address calculation, but after that they both use the
same TLB.
Branch Delay
The R4400 detects whether a conditional branch will be
taken in stage 4 (EX). Thus, the appropriate fetch address cannot be determined
until after three subsequent instructions have entered the pipe behind the
branch. Normally these would have to be flushed, but it is also possible to tag
them as independent of the branch and have them proceed normally, in order to
fill the branch delay slots with useful work. It is also possible for the
processor to move instructions into the branch slots.
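The count of three instructions behind the branch can be checked with a line of arithmetic. This is a minimal sketch, assuming one instruction is fetched per cycle and that the corrected fetch address becomes usable in the cycle after the resolving stage. `instructions_behind` is a hypothetical helper name.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def instructions_behind(resolve_stage):
    """Instructions fetched behind a branch by the time it resolves.

    A branch sits in the stage with 0-based index k exactly k cycles after
    its own fetch, and (by assumption) one new instruction entered IF in
    each of those k cycles.
    """
    return STAGES.index(resolve_stage)


print(instructions_behind("EX"))  # EX is stage 4, so 3 instructions follow
```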
Load Delay
A load completes in DS. Thus, its operand cannot be used until the instruction
that follows two slots behind the load reaches its EX stage. The result of the
load at the DS stage is automatically redirected to the RF stage of the
instruction using the operand, so that it is available for that instruction's
EX.
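The two-slot load delay follows from the stage positions. The sketch below is a hedged reading of the text: it assumes the consumer's RF stage must coincide with the load's DS stage, since that is where the result is redirected. `load_delay_slots` is a hypothetical name.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_delay_slots(data_ready_stage="DS", forward_to_stage="RF"):
    """Instructions that must sit between a load and its first consumer.

    Assumption: the load's data_ready_stage and the consumer's
    forward_to_stage fall in the same cycle. The consumer is then
    (index difference) instructions behind the load, which leaves one
    fewer slot between the two.
    """
    distance = STAGES.index(data_ready_stage) - STAGES.index(forward_to_stage)
    return distance - 1


print(load_delay_slots())  # DS is stage 6 and RF is stage 3: 3 - 1 = 2 slots
```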
Pipeline Faults
The R4400 is designed around a hierarchy of potential
fault conditions that are arranged as follows:

Stalls are interlocks that halt the entire pipeline
while slips allow part of the pipeline (usually the part already past the
offending stage) to proceed.
Each exception or interlock condition is detected in
just one stage, permitting the system to identify the offending instruction
uniquely. For example:
In IS, instruction TLB misses are detected.
In RF, instruction cache misses are detected, as are the slip conditions
(load interlock, multiply busy, divide busy, mul/div slip, shift > 32 bits,
and FPU busy) and instruction translation exceptions.
In EX, exceptions such as interrupt, bus error,
illegal instruction, breakpoint, and system call are detected.
When an exception occurs, the offending instruction
and all instructions following it in the pipe are cancelled. Any stalled
instructions or other exceptions referencing this one are also cancelled. The
following are the MIPS R4400 exceptions, with the stage in which they are
detected indicated:
Instruction translation or address exception (RF)
External interrupt (EX)
Instruction bus error (EX)
Instruction virtual address coherent (EX)
Illegal instruction (EX)
Breakpoint (EX)
System call (EX)
Coprocessor unusable (EX)
Instruction error-correcting code error (EX)
Integer overflow (DS)
Floating point interrupt (DS)
Execute stage (programmed) traps (DS)
Data translation or address exception (TC)
Translation lookaside buffer modified (TC)
Data bus error (WB)
Memory reference address debugger comparison (WB)
Data virtual address coherent (WB)
Data error correcting code error (WB)
Non-maskable interrupt (WB)
Hardware reset (WB)
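The cancellation rule and the detection stages listed above can be captured in a small lookup. This is an illustrative sketch only: the table is copied from the list in these notes, and `cancelled` and `count_by_stage` are hypothetical helpers that say nothing about how the hardware implements the rule.

```python
# Exception -> stage in which it is detected, copied from the list above.
EXCEPTION_STAGE = {
    "Instruction translation or address exception": "RF",
    "External interrupt": "EX",
    "Instruction bus error": "EX",
    "Instruction virtual address coherent": "EX",
    "Illegal instruction": "EX",
    "Breakpoint": "EX",
    "System call": "EX",
    "Coprocessor unusable": "EX",
    "Instruction error-correcting code error": "EX",
    "Integer overflow": "DS",
    "Floating point interrupt": "DS",
    "Execute stage (programmed) traps": "DS",
    "Data translation or address exception": "TC",
    "Translation lookaside buffer modified": "TC",
    "Data bus error": "WB",
    "Memory reference address debugger comparison": "WB",
    "Data virtual address coherent": "WB",
    "Data error correcting code error": "WB",
    "Non-maskable interrupt": "WB",
    "Hardware reset": "WB",
}


def cancelled(in_flight, offender):
    """Instructions cancelled when `offender` raises an exception.

    `in_flight` lists the instructions in the pipe in program order,
    oldest first. The offender and everything younger are cancelled.
    """
    return in_flight[in_flight.index(offender):]


def count_by_stage(table):
    """Number of exception types detected in each stage."""
    counts = {}
    for stage in table.values():
        counts[stage] = counts.get(stage, 0) + 1
    return counts


print(cancelled(["load", "add", "sub", "or"], "add"))
print(count_by_stage(EXCEPTION_STAGE))
```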
For a stall, the entire pipeline is frozen until the
interlock is resolved. A restart sequence, starting two cycles before the
pipeline is released, inserts corrected information into the pipe.
Stalls are caused by the following events and detected
in the stage indicated:
Instruction TLB miss (IS)
Instruction cache miss (RF)
Coprocessor possible exception (DF)
Integer sign extend (DF)
Store interlock (DF)
Data cache miss (TC)
Watch address exception (TC)
In a slip, pipeline stages that depend on the
condition being resolved are held, and the rest are allowed to continue. Slips
all occur in the RF stage. They are:
Load interlock
Multiply unit busy
Divide unit busy
Multiply/Divide single cycle slip
Variable shift or shift > 32 bits
FPU busy
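The difference between a stall and a slip can be sketched as a question of which stages are held. The tables below are copied from the two lists above. The slip rule is an assumption based on the earlier remark that the part of the pipe "already past the offending stage" usually proceeds: stages up to and including RF hold, and later stages advance. `held_stages` is a hypothetical helper.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

# Stall condition -> stage in which it is detected.
STALLS = {
    "Instruction TLB miss": "IS",
    "Instruction cache miss": "RF",
    "Coprocessor possible exception": "DF",
    "Integer sign extend": "DF",
    "Store interlock": "DF",
    "Data cache miss": "TC",
    "Watch address exception": "TC",
}

# Slip conditions; all are detected in RF.
SLIPS = [
    "Load interlock",
    "Multiply unit busy",
    "Divide unit busy",
    "Multiply/Divide single cycle slip",
    "Variable shift or shift > 32 bits",
    "FPU busy",
]


def held_stages(condition):
    """Stages that do not advance while `condition` is outstanding."""
    if condition in STALLS:
        return list(STAGES)  # a stall freezes the whole pipeline
    if condition in SLIPS:
        # Assumed slip behaviour: RF and everything before it is held.
        return STAGES[: STAGES.index("RF") + 1]
    raise KeyError(condition)


print(held_stages("Data cache miss"))
print(held_stages("Load interlock"))
```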
In the following pipeline trace, the ALU operation
depends on the Load and is scheduled two operations behind it. However, the
cache miss is not detected until the ALU operation has already executed on
incorrect data (EX-). Thus the
pipeline stalls and must be backed up before it is restarted.
Cycle     |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  |
Run/Stall | R   | R   | R   | R   | R   | R   | R   | S   | S   | S   | S   | S   | R   | R   | R   | R   | R   |
Restart   | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | R   | R   | --  | --  | --  | --  | --  |
Load      | IF  | IS  | RF  | EX  | DF  | DS  | TC  | --  | --  | DF  | DS  | TC  | WB  |     |     |     |     |
???       |     | IF  | IS  | RF  | EX  | DF  | DS  | --  | --  | --  | DF  | DS  | TC  | WB  |     |     |     |
???       |     |     | IF  | IS  | RF  | EX  | DF  | --  | --  | --  | --  | DF  | DS  | TC  | WB  |     |     |
ALU op    |     |     |     | IF  | IS  | RF  | EX- | --  | --  | --  | RF  | EX+ | DF  | DS  | TC  | WB  |     |
???       |     |     |     |     | IF  | IS  | RF  | --  | --  | --  | --  | --  | EX  | DF  | DS  | TC  | WB  |
Notice that the instructions ahead of the ALU
operation where the stall was detected all restart with their DF stage because
of the miss. The ALU operation, however, must repeat its RF stage at the same
time as the missed load completes its DS (the data is redirected). The EX+
represents a recomputation of the result with the corrected data. Note that,
because the miss was just serviced, the TC for the load should not detect any
problems.
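The claims in the preceding paragraph can be checked mechanically against the trace. The sketch below transcribes the trace by hand ("--" marks a held cycle and "." an empty cell) and looks up the cycles in which given stages occur. The `cycles` helper and the row encoding are illustrative choices, not a simulator of the R4400.

```python
RUN_STALL = "R R R R R R R S S S S S R R R R R".split()
RESTART = "- - - - - - - - - - R R - - - - -".split()

# One row per instruction, oldest first; 17 cycles each.
ROWS = [
    ("Load",   "IF IS RF EX  DF DS TC  -- -- DF DS TC  WB .  .  .  ."),
    ("???",    ".  IF IS RF  EX DF DS  -- -- -- DF DS  TC WB .  .  ."),
    ("???",    ".  .  IF IS  RF EX DF  -- -- -- -- DF  DS TC WB .  ."),
    ("ALU op", ".  .  .  IF  IS RF EX- -- -- -- RF EX+ DF DS TC WB ."),
    ("???",    ".  .  .  .   IF IS RF  -- -- -- -- --  EX DF DS TC WB"),
]


def cycles(row, stage):
    """1-based cycles in which the row shows the given stage."""
    return [c + 1 for c, cell in enumerate(row.split()) if cell == stage]


load = ROWS[0][1]
alu = ROWS[3][1]

stall_cycles = [c + 1 for c, state in enumerate(RUN_STALL) if state == "S"]
restart_cycles = [c + 1 for c, state in enumerate(RESTART) if state == "R"]

print("stall cycles:", stall_cycles)
print("restart cycles:", restart_cycles)
print("load DS:", cycles(load, "DS"), "ALU RF:", cycles(alu, "RF"))
print("ALU recomputes (EX+) in cycle", cycles(alu, "EX+")[0])
```

The last occurrence of the load's DS and of the ALU operation's RF fall in the same cycle, which is the redirection the text describes, and the restart marks occupy the final two stall cycles.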
Instruction Abort After Interlock
Suppose that an ALU operation results in an overflow,
but the next instruction also has an I-cache miss. The miss is detected in that
instruction's RF stage, whereas the overflow is not detected until the ALU
operation's DF stage. The overflow is therefore detected only after the
interlock for a later instruction has been serviced. Thus, several instructions
have entered the pipe, stalled, and been restarted, but must now be cancelled
because the overflow calls for an exception handler to execute.
Cycle     |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  | 10  | 11  | 12  | 13  | 14  | 15  | 16  |
Run/Stall | R   | R   | R   | R   | S   | S   | S   | S   | S   | R   | R   | R   | R   | R   | R   | R   |
Stall     |     |     |     |     | ICM | ICM | ICM | ICM | ICM |     |     |     |     |     |     |     |
Restart   |     |     |     |     |     |     |     | R   | R   |     |     |     |     |     |     |     |
ALU       | IF  | IS  | RF  | EX  |     |     |     |     |     | DF  | DS  | TC  | WB  |     |     |     |
          |     |     |     |     |     |     |     |     |     | OVF |     |     |     |     |     |     |
???       |     | IF  | IS  | RF  |     |     | IF  | IS  | RF  | EX  | DF  | DS  | TC  | WB  |     |     |
          |     |     |     | ICM |     |     |     |     |     |     |     |     |     |     |     |     |
???       |     |     | IF  | IS  |     |     |     | IF  | IS  | RF  | EX  | DF  | DS  | TC  | WB  |     |
???       |     |     |     | IF  |     |     |     |     | IF  | IS  | RF  | EX  | DF  | DS  | TC  | WB  |
All four of the write back operations must be
cancelled before the exception handler is called. It may seem wasteful to
complete the servicing of the miss, only to jump to the exception handler, but
assuming that the exception handler is short, it probably won't force this line
from cache before its return. At that point, the cancelled instructions can be
reissued. Even if the line is flushed before return, it is important to keep in
mind that exceptions only occur rarely anyway.
Fault Handling
One problem that a pipeline designer faces is that, at
every cycle of the pipe, one or more faults could potentially occur. These must
be detected, the control logic must prioritize them and choose one, and then
the control signals must be distributed to the pipe. These steps can easily
take longer than a single pipeline cycle to complete. Thus, it becomes
necessary to pipeline the handling of faults.
In the MIPS R4400, the fault handling pipe is 3 stages
long, with the time for these three stages corresponding to 2 stages of the
instruction pipe.

As an example of this process, consider that an
Execute Trap does not cause the pipeline to enter exception mode until the DS
stage. Another example is the Instruction Cache Miss, which is detected in the
RF stage, even though the miss occurs in IF. Notice that all of the slip
conditions are detected in the RF stage, which probably means that they are
checked as soon as an instruction cycle begins.
Special Cases
Address acceleration is a situation that bypasses the
normal pipeline state machine and sends a miss directly to the secondary cache,
rather than waiting for the primary cache to forward the miss. Thus the
secondary cache has already started to fetch the data or process its own miss
when the primary cache issues its load.
Address prediction increments and transmits the miss
address under the assumption that the next address will also be a miss. Address
prediction is done during the stall time of the original miss. The increment
and issue actually extends the stall by 3 cycles, but it is assumed that this
will save a subsequent longer miss. (Typical miss time for a stall is 5 cycles
and can be longer when the processor is running with a multiplied internal
clock. For example, it would be 9 cycles for a 2X clock.)
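The trade-off behind address prediction can be put as a break-even calculation. This is a rough sketch under a strong assumption that the text does not state: a correct prediction hides the whole of the subsequent miss, and a wrong one saves nothing. `break_even_probability` and `expected_saving` are hypothetical names.

```python
from fractions import Fraction


def break_even_probability(extra_stall_cycles, miss_cycles):
    """Probability that the next address also misses at which the
    extra stall is exactly repaid by the miss it saves."""
    return Fraction(extra_stall_cycles, miss_cycles)


def expected_saving(p_next_miss, extra_stall_cycles, miss_cycles):
    """Expected cycles saved per original miss (negative means a loss)."""
    return p_next_miss * miss_cycles - extra_stall_cycles


# 3 extra cycles against the 5-cycle and 9-cycle miss times from the text.
print(break_even_probability(3, 5))
print(break_even_probability(3, 9))
print(expected_saving(Fraction(4, 5), 3, 5))
```

Under these assumptions the prediction pays off only when the next address misses more than 60% of the time for a 5-cycle miss, and more than a third of the time for a 9-cycle miss.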
FPU Pipeline
The R4400 has a second 8-stage pipeline for floating
point, with the same stages. However, the FPU pipe can also stall for FPU
operations. A typical stall is 3 cycles with as few as 0 and as many as 112
(double-precision square root).
The FP pipe can overlap load/store and move with other
operations.
There are separate FP multiply and divide units, each
with their own arithmetic pipelines, although they share the same add unit in
their last cycle. The divider can handle just one operation at a time, whereas
the multiplier can handle two operations at once as long as they are separated
by 2 cycles (for 32-bit operands) or 3 cycles (for 64-bit operands).
There are 8 floating point operational units that can
be associated with the stages of each arithmetic pipeline:
A   Adder Mantissa Add
E   Adder Exception Test
EX  CPU Exception
M   Multiplier First Stage
N   Multiplier Second Stage
R   Adder Result Round
S   Adder Operand Shift
U   Unpack
For example, a multiply might use these operations in
sequence: U M M M N N/A R, while an add uses U S+A A+R R+S.
Note that the A and R units are used in the last two
stages of a multiply. Thus, resource conflicts can occur between multiply and
add. However, adds can be issued after the multiply because they complete in
fewer cycles than the multiply. For example:
Mult | U   | M   | M   | M   | N   | N/A | R   |
Add  |     | U   | S+A | A+R | R+S |     |     |
Add  |     |     | U   | S+A | A+R | R+S |     |
Add  |     |     |     | U   | S+A | A+R | R+S |
The last Add in this example conflicts with the Mult
in its last two cycles and thus cannot legally be issued in that cycle. Note that
the number of cycles taken by the multiply depends on the instruction and the
operand sizes, so varying numbers of adds can follow different multiplies.
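The resource conflict in this example can be found by overlaying the two unit sequences at a given issue offset. The sketch below uses only the example sequences from the text (U M M M N N/A R for the multiply, U S+A A+R R+S for the add). It treats both "+" and "/" as meaning that two units are busy in the same cycle, which is an assumption about the notation. `units` and `conflicts` are hypothetical helpers.

```python
MULT = ["U", "M", "M", "M", "N", "N/A", "R"]
ADD = ["U", "S+A", "A+R", "R+S"]


def units(cell):
    """Set of unit letters busy in one cycle, e.g. "N/A" -> {"N", "A"}."""
    return set(cell.replace("/", "+").split("+"))


def conflicts(first, second, offset):
    """Cycles in which `second`, issued `offset` cycles after `first`,
    needs a unit that `first` is using.

    Returns (cycle, shared_units) pairs, with cycles numbered from 1 at
    the issue of `first`.
    """
    clashes = []
    for i, cell in enumerate(second):
        j = i + offset
        if j < len(first):
            shared = units(first[j]) & units(cell)
            if shared:
                clashes.append((j + 1, sorted(shared)))
    return clashes


for offset in (1, 2, 3):
    print(offset, conflicts(MULT, ADD, offset))
```

Adds issued one or two cycles after the multiply report no clashes, while the add issued three cycles later collides on the A unit and then on the R unit, matching the last two cycles of the multiply.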
The floating point unit is fully bypassed, meaning
that when one instruction depends on the result of another, the result can be
sent directly to the other unit without having to save it to a register first.
Another Case Study: UltraSPARC
The UltraSPARC 1 has a nine-stage integer pipeline
with a parallel four-stage floating-point execution pipeline. Instructions pass
through the first three stages of the integer pipe and may then be sent to the
FP pipe or continue through the integer pipe, depending on their operation. The
FP instructions rejoin the integer pipe in its next to last stage.
The nine stages of the integer pipe are as follows:
Fetch -- instructions fetched from I-cache
Decode -- instructions decoded and sent to the I-buffer
Group -- Up to four instructions are grouped for
dispatch, and the register file is accessed
Execute -- Integer instructions executed and virtual
addresses calculated for other types
Cache access -- D-cache/TLB is accessed and branches
are resolved
N1 -- D-cache miss detected, deferred load enters load
buffer
N2 -- Integer pipe waits for FP pipe
N3 -- Traps resolved
Write back -- Results written to register file
The four stages of the floating point pipe are:
Register -- Decode completes and register file is
accessed
X1 -- Execute stage 1
X2 -- Execute stage 2
X3 -- Execute stage 3 (complete)
Instructions are issued in order to this pipeline;
the design does not use dynamic reordering of the instruction stream. The
UltraSPARC design team made this choice based on multiple factors.
Out-of-order execution boosts performance by 30% on
standard integer code. However, they found that aggressive compiler
optimization to schedule the pipeline would cut this difference to 15% on
typical (SPECint92) code, and that there was no difference on many scientific
codes.
They also determined that out-of-order execution could
require an extra pipeline stage, resulting in a 2% to 4% loss due to additional
branch penalty. With the increase in the size and complexity of the chip, they
estimated approximately a 20% increase in cycle time. They also determined that
the time to develop the chip would increase by 3 to 6 months, resulting in a
slip of 12% to 24% with respect to the performance of competing chips. Thus,
they concluded that there would be no advantage to incorporating out-of-order
execution, and that an in-order processor with a highly optimizing compiler
would actually deliver more performance.
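The team's conclusion can be reproduced approximately by combining the quoted figures. How they combined them is not stated here, so the sketch below makes an assumption: the effects multiply, with the cycle-time increase dividing performance and the schedule slip treated as a fractional loss relative to competitors. `net_relative_performance` is a hypothetical name.

```python
def net_relative_performance(ooo_gain, branch_loss, cycle_time_increase,
                             schedule_slip):
    """Performance of the out-of-order design relative to the in-order
    design (1.0 means no difference), assuming independent effects that
    combine multiplicatively."""
    return ((1 + ooo_gain) * (1 - branch_loss)
            / (1 + cycle_time_increase) * (1 - schedule_slip))


# 15% gain with an optimizing compiler; best and worst ends of the ranges.
best = net_relative_performance(0.15, 0.02, 0.20, 0.12)
worst = net_relative_performance(0.15, 0.04, 0.20, 0.24)
# Even the full 30% gain on unoptimized integer code, in the best case.
generous = net_relative_performance(0.30, 0.02, 0.20, 0.12)

print(f"best {best:.3f}, worst {worst:.3f}, generous {generous:.3f}")
```

Under these assumptions every case comes out below 1.0, which is consistent with the conclusion that out-of-order execution offered no net advantage.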
What they don't explain is that they were under a
limited budget, and that a competitor with more resources would not suffer a
slippage in schedule, but could rather spend more on the development to
accelerate it. In addition, exposing the pipeline structure to the compiler for
more aggressive optimization results in machine specific optimizations that may
not port to the next generation. Further, they ignore the cost of developing
such a compiler and the delay in its deployment that could result in
sub-optimal performance during the initial period that the processor is used.
The UltraSPARC thus issues the instructions in the
order that they are fetched. However, instructions may complete out of order,
so that shorter instructions do not have to wait for longer ones to finish
first. It is a four-way superscalar architecture feeding nine functional units:
2 64-bit integer ALUs
1 Load/Store unit
1 FP adder
1 FP multiplier
1 FP divider/square root
1 Graphics adder (4 16-bit integer ops in parallel)
1 Graphics multiplier (same)
Their simulations showed that the improvement due to
instruction level parallelism fell off at around 4-way issue, even with
compiler optimization. The following table shows their results, comparing
superscalar issue to scalar issue.
         Unoptimized   Optimized
1-way    1             1
2-way    1.14          1.46
3-way    1.26          1.78
4-way    1.33          1.83
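The fall-off is easier to see as the marginal gain from each additional issue slot. The sketch below simply differences the speedups in the table; `marginal_gains` is a hypothetical helper, and rounding to two decimals matches the precision of the table.

```python
# Speedups over scalar issue for 1-way through 4-way, from the table.
SPEEDUP = {
    "unoptimized": [1.0, 1.14, 1.26, 1.33],
    "optimized": [1.0, 1.46, 1.78, 1.83],
}


def marginal_gains(speedups):
    """Speedup added by each extra issue slot (2-way, 3-way, 4-way)."""
    return [round(b - a, 2) for a, b in zip(speedups, speedups[1:])]


for name, values in SPEEDUP.items():
    print(name, marginal_gains(values))
```

With optimization, the step from 3-way to 4-way adds far less than either of the earlier steps.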
They also found only modest gains for adding an extra
integer ALU to the design, and then only with proper compiler optimization and
at least 3-way issue. Thus, they elected to use 4-way issue and to provide two
integer ALUs.
         Unoptimized   Optimized
2-way    1.06          1.12
3-way    1.13          1.25
In order to address control hazards in the pipeline,
the UltraSPARC stores additional information in its instruction cache. Each
line of the cache contains four instructions. It also contains a pair of 2-bit
branch prediction bits (at most two branches should be placed into a single
cache line). They note that binding the branch prediction bits to the cache
avoids the problem of aliasing that results with a branch history buffer that
is smaller than the number of branches that may be in the cache. Each line also
contains a single Next Fetch Address, which typically indicates the next line,
but can also store a branch target address if the line contains a branch that
is likely to be taken. They report that these mechanisms result in prediction
success rates of 88% for the SPECint92 codes, 94% for the SPECfp92 codes, and
90% for database applications.
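The text says only that each branch gets two prediction bits; it does not give the update rule. The sketch below assumes the classic 2-bit saturating counter, which may differ in detail from the UltraSPARC's actual state machine. `TwoBitCounter` and `accuracy` are hypothetical names, and the branch pattern is synthetic.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes, initial_state=2):
    """Fraction of outcomes predicted correctly by a single counter."""
    counter = TwoBitCounter(initial_state)
    correct = 0
    for taken in outcomes:
        correct += counter.predict() == taken
        counter.update(taken)
    return correct / len(outcomes)


# A loop branch: taken nine times, then falls through, repeated ten times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(loop_branch))
```

Because a single loop exit moves the counter only one step, the branch is still predicted taken when the loop is next entered, so only the exit itself is mispredicted. The figure this synthetic pattern produces is unrelated to the measured success rates quoted above.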