MIPS R4400 Pipeline Case Study

The MIPS R4400 has an 8-stage pipeline with the following stages:

Instruction First     IF
Instruction Second    IS
Register File         RF
Execute               EX
Data First            DF
Data Second           DS
Data Tag Check        TC
Write Back            WB

In IF, the branch logic selects the instruction address and the I-cache fetch begins. The instruction TLB begins the virtual-to-physical address translation.

In IS, the fetch and translation complete.

In RF, the instruction decode occurs and the processor checks for interlocks. The instruction cache tag is checked against the page frame from the instruction TLB. Operands are fetched from the register file.

In EX, one of the following occurs: The ALU performs a register-to-register operation; the ALU calculates the data's virtual address for a load or store; the ALU determines whether a branch condition is true and calculates the virtual target address if the operation is a branch.

In DF, one of the following occurs: The data cache fetch and data TLB translation begin for a load or store; the branch target instruction address translation and TLB update begin for a branch; nothing happens for a register-to-register operation.

In DS, one of the following occurs: the data cache fetch and TLB translation complete for a load or store, and the shifter aligns data to a word or doubleword boundary; the branch instruction address translation and TLB update complete for a branch; nothing happens for a register-to-register operation.

In TC, for a load/store the data cache performs a tag check -- the TLB physical address is checked against the cache tag to determine whether it hit. Nothing happens for a register-to-register operation.

In WB, the result of a register-to-register operation is written back to the register file. Branches do nothing during this stage.
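To illustrate how instructions overlap in these eight stages, the following Python sketch builds an idealized trace with no stalls, slips, or exceptions. The function names are hypothetical, and issuing exactly one instruction per cycle is an assumption of the sketch rather than a description of the R4400 control logic.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def ideal_trace(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c ('' if none)."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(row)
    return rows

def format_trace(rows):
    """Render the trace as fixed-width text, one instruction per line."""
    return "\n".join("".join(f"{cell:<4}" for cell in row).rstrip() for row in rows)

if __name__ == "__main__":
    print(format_trace(ideal_trace(3)))
```

Each row is shifted one cycle to the right of the row above it, which is the staircase shape that the traces later in these notes depart from when a stall occurs.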

The following diagram shows what happens at each stage in the pipeline for load/store and branch operations.

IC1      Instruction cache access part 1
IC2      Instruction cache access part 2
ITLB1    Instruction address translation part 1
ITLB2    Instruction address translation part 2
ITC      Instruction cache tag check
IDEC     Instruction decode
RF       Register operand fetch
ALU      ALU operation (register-to-register)
DVA      Data virtual address calculation
DC1      Data cache access part 1
DC2      Data cache access part 2
LSA      Load/store alignment with shifter
JTLB1    Data/address translation part 1
JTLB2    Data/address translation part 2
DTC      Data cache tag check
IVA      Instruction virtual address calculation
WB       Write back to register file

Notice that the load/store unit and the branch unit each do their own virtual address calculation, but after that they both use the same TLB.


Branch Delay

The R4400 detects whether a conditional branch will be taken in stage 4 (EX). Thus, the appropriate fetch address cannot be determined until after three subsequent instructions have entered the pipe behind the branch. Normally these would have to be flushed, but it is also possible to tag them as independent of the branch and have them proceed normally, in order to fill the branch delay slots with useful work. It is also possible for the processor to move instructions into the branch slots.
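The count of three can be checked with a small Python sketch. It assumes one instruction enters IF per cycle and that the branch outcome is known only once the branch is in EX; the helper name is hypothetical.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def instructions_behind(resolve_stage):
    """Count the instructions that enter IF after a given instruction,
    up to and including the cycle in which it occupies resolve_stage."""
    # An instruction is in STAGES[k] exactly k cycles after its own IF,
    # and (by assumption) one new instruction enters IF in each of those cycles.
    return STAGES.index(resolve_stage)

if __name__ == "__main__":
    print(instructions_behind("EX"))  # 3
```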

Load Delay

A load completes in DS. Thus, two delay slots follow the load: the first instruction that can use the loaded operand is the third one after the load, whose EX stage falls in the cycle after the load's DS. The result of the load at the DS stage is automatically redirected to the RF of the instruction using the operand so that it is available for its EX.
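The load delay can be worked out the same way. The sketch below assumes that data produced at the end of the load's DS is usable by a consumer whose EX falls in any later cycle, and that one instruction issues per cycle; the function name is hypothetical.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def load_delay_slots(produce_stage="DS", consume_stage="EX"):
    """Number of instructions that must separate a load from the first user of its result."""
    # A consumer issued k instructions after the load is in consume_stage
    # k + index(consume_stage) cycles after the load's IF. That cycle must come
    # strictly after the load's produce_stage cycle.
    first_user_distance = STAGES.index(produce_stage) - STAGES.index(consume_stage) + 1
    return first_user_distance - 1

if __name__ == "__main__":
    print(load_delay_slots())  # 2
```

Under these assumptions the consumer's RF coincides with the load's DS, which matches the redirection described above.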

Pipeline Faults

The R4400 is designed around a hierarchy of potential fault conditions that are arranged as follows:

Stalls are interlocks that halt the entire pipeline while slips allow part of the pipeline (usually the part already past the offending stage) to proceed.

Each exception or interlock condition is detected in just one stage, permitting the system to identify the offending instruction uniquely. For example:

In IS, Instruction TLB misses are detected.

In RF, instruction cache misses are detected, as are load interlock, multiply busy, divide busy, multiply/divide slip, shift > 32 bits, FPU busy, and instruction translation exceptions.

In EX, exceptions such as interrupt, bus error, illegal instruction, breakpoint, system call, etc. are detected.

When an exception occurs, the offending instruction and all instructions following it in the pipe are cancelled. Any stalled instructions or other exceptions referencing this one are also cancelled. The following are the MIPS R4400 exceptions, with the stage in which they are detected indicated:

Instruction translation or address exception (RF)

External interrupt (EX)

Instruction bus error (EX)

Instruction virtual address coherent (EX)

Illegal instruction (EX)

Breakpoint (EX)

System call (EX)

Coprocessor unusable (EX)

Instruction error-correcting code error (EX)

Integer overflow (DS)

Floating point interrupt (DS)

Execute stage (programmed) traps (DS)

Data translation or address exception (TC)

Translation lookaside buffer modified (TC)

Data bus error (WB)

Memory reference address debugger comparison (WB)

Data virtual address coherent (WB)

Data error correcting code error (WB)

Non-maskable interrupt (WB)

Hardware reset (WB)
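The list above can be held as a lookup table, together with the cancellation rule stated earlier (the offending instruction and everything behind it in the pipe is cancelled). This is only a bookkeeping sketch in Python; the names `EXCEPTIONS_BY_STAGE`, `detection_stage`, and `cancel_from` are hypothetical, and it does not model how the hardware prioritizes simultaneous exceptions.

```python
EXCEPTIONS_BY_STAGE = {
    "RF": ["Instruction translation or address exception"],
    "EX": ["External interrupt", "Instruction bus error",
           "Instruction virtual address coherent", "Illegal instruction",
           "Breakpoint", "System call", "Coprocessor unusable",
           "Instruction error-correcting code error"],
    "DS": ["Integer overflow", "Floating point interrupt",
           "Execute stage (programmed) traps"],
    "TC": ["Data translation or address exception",
           "Translation lookaside buffer modified"],
    "WB": ["Data bus error", "Memory reference address debugger comparison",
           "Data virtual address coherent", "Data error correcting code error",
           "Non-maskable interrupt", "Hardware reset"],
}

def detection_stage(exception_name):
    """Return the stage in which the named exception is detected."""
    for stage, names in EXCEPTIONS_BY_STAGE.items():
        if exception_name in names:
            return stage
    raise KeyError(exception_name)

def cancel_from(in_flight, offending):
    """in_flight is in program order, oldest first.
    Return (instructions that complete, instructions that are cancelled)."""
    idx = in_flight.index(offending)
    return in_flight[:idx], in_flight[idx:]
```

For example, `cancel_from(["i1", "i2", "i3", "i4"], "i2")` keeps only `i1` and cancels the rest.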

For a stall, the entire pipeline is frozen until the interlock is resolved. A restart sequence begins two cycles before the pipeline is released and inserts corrected information into the pipe.

Stalls are caused by the following events and detected in the stage indicated:

Instruction TLB miss (IS)

Instruction cache miss (RF)

Coprocessor possible exception (DF)

Integer sign extend (DF)

Store interlock (DF)

Data cache miss (TC)

Watch address exception (TC)

In a slip, pipeline stages that depend on the condition being resolved are held, and the rest are allowed to continue. Slips all occur in the RF stage. They are:

Load interlock

Multiply unit busy

Divide unit busy

Multiply/Divide single cycle slip

Variable shift or shift > 32 bits

FPU busy
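The difference between a stall and a slip can be sketched as a one-cycle update of the pipeline contents. The Python model below is a simplification under stated assumptions: a slip holds IF, IS, and RF, lets the later stages advance, and inserts a bubble (`None`) into EX; a stall moves nothing. The function and argument names are hypothetical.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def advance(pipe, mode="run", next_instr=None):
    """pipe[i] is the instruction in STAGES[i], or None for a bubble.
    Return the pipeline contents one cycle later."""
    if mode == "stall":
        # The whole pipeline is frozen.
        return list(pipe)
    if mode == "slip":
        rf = STAGES.index("RF")
        held = pipe[:rf + 1]                 # IF, IS, RF keep their instructions
        moved = [None] + pipe[rf + 1:-1]     # bubble enters EX; EX..TC move on
        return held + moved
    # Normal run: every instruction moves one stage; the WB instruction retires.
    return [next_instr] + pipe[:-1]

if __name__ == "__main__":
    pipe = ["i7", "i6", "i5", "i4", "i3", "i2", "i1", "i0"]
    print(advance(pipe, "slip"))
```

In the slip case the instructions already past RF keep draining while the instruction waiting in RF, and those behind it, stay put.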

 

In the following pipeline trace, the ALU operation depends on the Load and is scheduled two operations behind it. However, the cache miss is not detected until the ALU operation has already executed on incorrect data (EX-).  Thus the pipeline stalls and must be backed up before it is restarted.

 

Cycle      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Run/Stall  R   R   R   R   R   R   R   S   S   S   S   S   R   R   R   R   R
Restart    -   -   -   -   -   -   -   -   -   -   R   R   -   -   -   -   -
Load       IF  IS  RF  EX  DF  DS  TC  -   -   DF  DS  TC  WB
???            IF  IS  RF  EX  DF  DS  -   -   -   DF  DS  TC  WB
???                IF  IS  RF  EX  DF  -   -   -   -   DF  DS  TC  WB
ALU op                 IF  IS  RF  EX- -   -   -   RF  EX+ DF  DS  TC  WB
???                        IF  IS  RF  -   -   -   -   -   EX  DF  DS  TC  WB

Notice that the instructions ahead of the ALU operation where the stall was detected all restart with their DF stage because of the miss. The ALU operation, however, must repeat its RF stage at the same time as the missed load completes its DS (the data is redirected). The EX+ represents a recomputation of the result with the corrected data. Note that, because the miss was just serviced, the TC for the load should not detect any problems.

Instruction Abort After Interlock

Suppose that an ALU operation overflows and the next instruction has an I-cache miss. The miss is detected when the following instruction is in RF, whereas the overflow is not detected until the ALU operation reaches DF, so the overflow is detected only after the interlock for the instruction behind it has been serviced. Thus, several instructions have entered the pipe, stalled, and been restarted, but must now be cancelled because the overflow calls for an exception handler to execute.

 

Cycle      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Run/Stall  R   R   R   R   S   S   S   S   S   R   R   R   R   R   R   R
Stall                      ICM ICM ICM ICM ICM
Restart                                R   R
ALU        IF  IS  RF  EX                      DF  DS  TC  WB
                                               OVF
???            IF  IS  RF          IF  IS  RF  EX  DF  DS  TC  WB
                       ICM
???                IF  IS              IF  IS  RF  EX  DF  DS  TC  WB
???                    IF                  IF  IS  RF  EX  DF  DS  TC  WB

All four of the write back operations must be cancelled before the exception handler is called. It may seem wasteful to complete the servicing of the miss, only to jump to the exception handler, but assuming that the exception handler is short, it probably won't force this line from cache before its return. At that point, the cancelled instructions can be reissued. Even if the line is flushed before return, it is important to keep in mind that exceptions only occur rarely anyway.

Fault Handling

One problem that a pipeline designer faces is that, at every cycle of the pipe, one or more faults could potentially occur. These must be detected, the control logic must prioritize them and choose one, and then the control signals must be distributed to the pipe. These steps can easily take longer than a single pipeline cycle to complete. Thus, it becomes necessary to pipeline the handling of faults.

In the MIPS R4400, the fault handling pipe is 3 stages long, with the time for these three stages corresponding to 2 stages of the instruction pipe.

As an example of this process, consider that an Execute Trap does not cause the pipeline to enter exception mode until the DS stage. Another example is the Instruction Cache Miss, which is detected in the RF stage, even though the miss occurs in IF. Notice that all of the slip conditions are detected in the RF stage, which probably means that they are checked as soon as an instruction cycle begins.

Special Cases

Address acceleration is a situation that bypasses the normal pipeline state machine and sends a miss directly to the secondary cache, rather than waiting for the primary cache to forward the miss. Thus the secondary cache has already started to fetch the data or process its own miss when the primary cache issues its load.

Address prediction increments and transmits the miss address under the assumption that the next address will also be a miss. Address prediction is done during the stall time of the original miss. The increment and issue actually extends the stall by 3 cycles, but it is assumed that this will save a subsequent longer miss. (Typical miss time for a stall is 5 cycles and can be longer when the processor is running with a multiplied internal clock. For example, it would be 9 cycles for a 2X clock.)

FPU Pipeline

The R4400 has a second 8-stage pipeline for floating point, with the same stages. However, the FPU pipe can also stall for FPU operations. A typical stall is 3 cycles with as few as 0 and as many as 112 (double-precision square root).

The FP pipe can overlap load/store and move with other operations.

There are separate FP multiply and divide units, each with their own arithmetic pipelines, although they share the same add unit in their last cycle. The divider can handle just one operation at a time, whereas the multiplier can handle two operations at once as long as they are separated by 2 cycles (for 32-bit operands) or 3 cycles (for 64-bit operands).

There are 8 floating point operational units that can be associated with the stages of each arithmetic pipeline:

Adder Mantissa Add         A
Adder Exception Test       E
CPU Exception              EX
Multiplier First Stage     M
Multiplier Second Stage    N
Adder Result Round         R
Adder Operand Shift        S
Unpack                     U

For example, a multiply might use these operations in sequence: U M M M N N/A R, while an add uses U S+A A+R R+S.

Note that the A and R units are used in the last two stages of a multiply. Thus, resource conflicts can occur between multiply and add. However, adds can be issued after the multiply because they complete in fewer cycles than the multiply. For example:

 

Mult  U    M    M    M    N    N/A  R
Add        U    S+A  A+R  R+S
Add             U    S+A  A+R  R+S
Add                  U    S+A  A+R  R+S

The last Add in this example conflicts with the Mult in its last two cycles, so it cannot be issued in that slot. Note that the number of cycles taken by the multiply depends on the instruction and the operand sizes, so varying numbers of adds can follow different multiplies.
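The conflict in the example can be found mechanically by comparing the units each operation needs in each cycle. The Python sketch below makes two simplifying assumptions that are not stated in these notes: a unit named in a cycle is treated as busy for that whole cycle, and an entry such as N/A or S+A means both named units are in use. It checks one add against the multiply at a time; it does not model adds against each other. The function names are hypothetical.

```python
MULT = ["U", "M", "M", "M", "N", "N/A", "R"]
ADD = ["U", "S+A", "A+R", "R+S"]

def units(step):
    """Split an entry like 'S+A' or 'N/A' into the set of unit names it uses."""
    return set(step.replace("/", "+").split("+"))

def conflict_cycles(first, second, offset):
    """Return the cycles (1-based, counted from the issue of `first`) in which
    `second`, issued `offset` cycles later, needs a unit `first` is using."""
    clashes = []
    for i, step in enumerate(second):
        j = i + offset
        if j < len(first) and units(first[j]) & units(step):
            clashes.append(j + 1)
    return clashes

if __name__ == "__main__":
    for offset in (1, 2, 3):
        print(offset, conflict_cycles(MULT, ADD, offset))
```

With the sequences given above, an add issued one or two cycles after the multiply reports no shared units, while an add issued three cycles after it collides in cycles 6 and 7, the multiply's last two cycles.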

The floating point unit is fully bypassed, meaning that when one instruction depends on the result of another, the result can be sent directly to the other unit without having to save it to a register first.

Another Case Study: UltraSPARC

The UltraSPARC 1 has a nine-stage integer pipeline with a parallel four-stage floating-point execution pipeline. Instructions pass through the first three stages of the integer pipe and may then be sent to the FP pipe or continue through the integer pipe, depending on their operation. The FP instructions rejoin the integer pipe in its next to last stage.

The nine stages of the integer pipe are as follows:

Fetch -- instructions fetched from I-cache

Decode -- instructions decoded and sent to the I-buffer

Group -- Up to four instructions are grouped for dispatch, and the register file is accessed

Execute -- Integer instructions executed and virtual addresses calculated for other types

Cache access -- D-cache/TLB is accessed and branches are resolved

N1 -- D-cache miss detected, deferred load enters load buffer

N2 -- Integer pipe waits for FP pipe

N3 -- Traps resolved

Write back -- Results written to register file

The four stages of the floating point pipe are:

Register -- Decode completes and register file is accessed

X1 -- Execute stage 1

X2 -- Execute stage 2

X3 -- Execute stage 3 (complete)

Instructions are issued in order to this pipeline, rather than being dynamically reordered. The UltraSPARC design team made this choice based on multiple factors.

Out-of-order execution boosts performance by 30% on standard integer code. However, the team found that aggressive compiler optimization to schedule the pipeline would cut this advantage to 15% on typical (SPECint92) code, and that there was no difference on many scientific codes.

They also determined that out-of-order execution could require an extra pipeline stage, resulting in a 2% to 4% loss due to additional branch penalty. With the increase in the size and complexity of the chip, they estimated approximately a 20% increase in cycle time. They also determined that the time to develop the chip would increase by 3 to 6 months, resulting in a slip of 12% to 24% with respect to the performance of competing chips. Thus, they concluded that there would be no advantage to incorporating out-of-order execution, and that an in-order processor with a highly optimizing compiler would actually deliver more performance.
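The figures in this argument can be combined into a rough net estimate. The Python sketch below multiplies the factors together, which is an assumption of the sketch (the notes do not say how the team combined them); the function and parameter names are hypothetical. A result below 1.0 means the out-of-order design would come out slower than the in-order one.

```python
def net_relative_performance(ooo_gain, branch_loss, cycle_time_increase, schedule_loss=0.0):
    """Relative performance of the out-of-order design versus the in-order design.

    ooo_gain            -- fractional speedup from out-of-order execution (0.15 with an optimizing compiler)
    branch_loss         -- fractional loss from the extra pipeline stage (0.02 to 0.04)
    cycle_time_increase -- fractional increase in cycle time (about 0.20)
    schedule_loss       -- fractional loss from the schedule slip (0.12 to 0.24)
    """
    return ((1 + ooo_gain) * (1 - branch_loss) / (1 + cycle_time_increase)
            * (1 - schedule_loss))

if __name__ == "__main__":
    # Most favourable case for out-of-order, ignoring the schedule slip.
    print(round(net_relative_performance(0.15, 0.02, 0.20), 3))
    # Least favourable case, including the largest schedule slip.
    print(round(net_relative_performance(0.15, 0.04, 0.20, 0.24), 3))
```

Even before the schedule slip is counted, the 20% cycle-time penalty alone outweighs the 15% gain under this model, which is consistent with the team's conclusion.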

What they don't explain is that they were under a limited budget, and that a competitor with more resources would not suffer a slippage in schedule, but could rather spend more on the development to accelerate it. In addition, exposing the pipeline structure to the compiler for more aggressive optimization results in machine specific optimizations that may not port to the next generation. Further, they ignore the cost of developing such a compiler and the delay in its deployment that could result in sub-optimal performance during the initial period that the processor is used.

The UltraSPARC thus issues the instructions in the order that they are fetched. However, instructions may complete out of order, so that shorter instructions do not have to wait for longer ones to finish first. It is a four-way superscalar architecture feeding nine functional units:

2 64-bit integer ALUs

1 Load/Store unit

1 FP adder

1 FP multiplier

1 FP divider/square root

1 Graphics adder (4 16-bit integer ops in parallel)

1 Graphics multiplier (same)

Their simulations showed that the improvement due to instruction level parallelism fell off at around 4-way issue, even with compiler optimization. The following table shows their results, comparing superscalar issue to scalar issue.

        Unoptimized    Optimized

1-way       1              1

2-way       1.14           1.46

3-way       1.26           1.78

4-way       1.33           1.83
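The fall-off is easier to see as the incremental gain from each added issue slot. The Python sketch below recomputes it from the table; the dictionary and function names are hypothetical, and the numbers are copied from the table without any further data.

```python
# Speedup over scalar issue: width -> (unoptimized, optimized), from the table above.
SPEEDUP = {1: (1.00, 1.00), 2: (1.14, 1.46), 3: (1.26, 1.78), 4: (1.33, 1.83)}

def marginal_gain(width, optimized=True):
    """Fractional improvement of `width`-way issue over (`width` - 1)-way issue."""
    col = 1 if optimized else 0
    return SPEEDUP[width][col] / SPEEDUP[width - 1][col] - 1

if __name__ == "__main__":
    for width in (2, 3, 4):
        print(width, f"{marginal_gain(width):.1%}", f"{marginal_gain(width, optimized=False):.1%}")
```

For optimized code the step from 3-way to 4-way issue adds under 3%, compared with roughly 22% for the step from 2-way to 3-way.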

They also found only modest gains for adding an extra integer ALU to the design, and then only with proper compiler optimization and at least 3-way issue. Thus, they elected to use 4-way issue and to provide two integer ALUs.

        Unoptimized    Optimized

2-way       1.06           1.12

3-way       1.13           1.25

In order to address control hazards in the pipeline, the UltraSPARC stores additional information in its instruction cache. Each line of the cache contains four instructions. It also contains two 2-bit branch prediction fields (at most two branches should be placed into a single cache line). They note that binding the branch prediction bits to the cache avoids the problem of aliasing that results with a branch history buffer that is smaller than the number of branches that may be in the cache. Each line also contains a single Next Fetch Address, which typically indicates the next line, but can also store a branch target address if the line contains a branch that is likely to be taken. They report that these mechanisms result in prediction success rates of 88% for the SPECint92 codes, 94% for the SPECfp92 codes, and 90% for database applications.
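A common way to use two prediction bits per branch is a saturating counter. The Python sketch below shows that scheme attached to a cache line with a next-fetch address. It is an illustration only: the notes do not say how the UltraSPARC encodes or updates its two bits, and the class names, fields, and initial state are all hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


class ICacheLine:
    """Hypothetical line layout: four instructions, two predictors, one next-fetch address."""

    def __init__(self, instructions, next_line_address):
        assert len(instructions) == 4
        self.instructions = instructions
        self.predictors = [TwoBitPredictor(), TwoBitPredictor()]
        self.next_fetch_address = next_line_address


def count_correct(predictor, outcomes):
    """Run the predictor over a sequence of actual outcomes and count correct predictions."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct
```

Because each predictor lives in the cache line that holds its branch, two different branches cannot share a counter, which is the aliasing problem the designers wanted to avoid.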