Lecture 6

Pipelining, Hazards, and Branch Prediction

Linear Pipeline Processors

We are all familiar with the basic notion of pipelining to increase throughput via staged temporal parallelism. We now consider how the basic idea can be modified to be more effective in certain situations.

Pipelines can be synchronous or asynchronous. Within a processor we mostly focus on synchronous designs. Why? Because:

• Processors are generally synchronous already.

• The stages are tightly coupled and in close proximity, making clock distribution easy and signal delays short and equal.

• The technology within the processor is homogeneous and under the control of the designer.

When are asynchronous pipelines useful?

• When we are dealing with asynchronous stages (i.e., stage times can vary).

• When the stages are not tightly coupled.

• When the signals must cross technology boundaries, or the stages are not in close proximity.

Systolic arrays are examples of asynchronous pipelines. Dataflow is an example of a nonlinear asynchronous pipeline. A possible direction for architectures in an era where wire delay is a major concern is to introduce asynchrony. However, the area of the typical pipeline is sufficiently small that it is unlikely to need to switch to asynchronous control. Forwarding of values and dependence information between pipelines is a more attractive option, and a potential area for architecture research.

Synchronous Linear Pipelines

Now let's consider how we would determine the clock period for a pipeline. In the following formulas we assume that the times can be determined through some appropriate means such as simulation or measurement.

The clock period for a pipeline is the maximum time for any stage (Tmax) plus the time to latch the data into the register that separates it from the next stage (TL). TL is the time during which the clock transitions. Because signals do not transition instantly, the clock takes some time to switch state. This sum is the minimum clock period for the pipeline.

Clock = Tmax + TL

Because there can also be skew in the arrival time (early or late) of the clock pulse at the different stages, there needs to be some compensation added to the clock period of the pipe.

Thus, we must add the skew time (S) to get the minimum clock period:

Clock >= Tmax + TL + S

A given stage may have multiple logic paths that are of different lengths. Tmax represents the maximum path among all of the stages. However, it is possible for a stage to have a short path that is much smaller than this maximum. In designing the clock, we have to be certain that the time to latch a value (TL) plus the skew (S) is less than the time of the shortest path. Otherwise it would be possible for an input from a prior stage to propagate through the shortest path and affect its stage's output before its current output can be latched. Note that this is an easy problem to correct, simply by adding some gates to the short path that merely pass the signals through with the necessary delay. The real problem is that it can be difficult to get the clock to transition fast enough. In that case, the simple solution would be to lower the clock rate (lengthen the clock period), but for obvious marketing reasons, this isn't a good option. Thus, engineers work very hard to ensure that the clock transitions fast enough to avoid the stage overrun problem.

Thus the length of TL must be less than or equal to the minimum delay through a stage (Tmin), minus the possible skew.

TL <= Tmin - S

 

So the possible clock periods range from Tmax + S plus a pulse length TL chosen to be at most Tmin - S, up to Tmax plus the maximum pulse length (Tmin - S):

TL + Tmax + S <= Clock <= Tmax + Tmin - S

Essentially there is no upper bound on Tmax so the maximum could be any length (although whatever value is chosen for Tmax, the total time will be no more than that plus Tmin - S).

We try to push the clock period as close to the minimum as we can in aggressive designs and keep it near the middle of the range for more conservative designs. The variability of the fabrication process determines how closely we can approach the minimum period -- if the variability is too great, then the yield will be reduced and cost increases.
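As a rough numerical illustration of these bounds, the sketch below evaluates them for made-up timings. The function name `clock_period_bounds` and the example values (taken to be in nanoseconds) are assumptions for illustration only, not figures from any real design.

```python
def clock_period_bounds(t_max, t_min, t_latch, skew):
    """Return (lower, upper) bounds on the clock period, per the inequalities above.

    t_max   -- longest logic delay through any stage (Tmax)
    t_min   -- shortest logic delay through any stage (Tmin)
    t_latch -- time to latch data into the stage register (TL)
    skew    -- worst-case clock skew between stages (S)
    """
    # The latch pulse must finish before the shortest path can overrun it.
    if t_latch > t_min - skew:
        raise ValueError("TL must not exceed Tmin - S")
    lower = t_max + t_latch + skew   # TL + Tmax + S
    upper = t_max + t_min - skew     # Tmax + Tmin - S
    return lower, upper


# Hypothetical timings, assumed to be in nanoseconds.
low, high = clock_period_bounds(t_max=10.0, t_min=4.0, t_latch=1.0, skew=0.5)
print(f"clock period between {low} ns and {high} ns")
```

An aggressive design would aim near the lower value; a conservative one would sit nearer the middle of the range.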

Speedup

The best possible processing time from a pipeline of K stages on an input stream of N values with a clock period of C is

Tk = (K + (N - 1))C

That is, the first one takes K cycles to complete and after that, successive results emerge with each clock cycle.

A nonpipelined processor where the delay for one operation is KC would take time NKC. Thus the speedup is

Speedup = NKC / ((K + (N - 1))C) = NK / (K + N - 1)

As N drops to 1, we get a speedup factor of unity. As N grows toward infinity, the speedup approaches K.

The efficiency of the pipeline is the speedup divided by the number of stages

Efficiency = Speedup / K = N / (K + N - 1)

The throughput is the efficiency times the rate (frequency (F) -- inverse of clock period) at which the pipe operates

Throughput = Efficiency * F = NF / (K + N - 1)

Thus, as N approaches infinity, efficiency approaches 1 and throughput approaches F.
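The limiting behavior is easy to check numerically. The following is a minimal sketch of the speedup, efficiency and throughput definitions from this section; the function names and the 5-stage example are assumptions chosen for illustration.

```python
def pipeline_time(k, n, c):
    """Best-case time for n inputs through a k-stage pipe with clock period c."""
    return (k + (n - 1)) * c


def speedup(k, n):
    """Nonpipelined time n*k*c divided by pipelined time (k + n - 1)*c."""
    return (n * k) / (k + n - 1)


def efficiency(k, n):
    """Speedup divided by the number of stages."""
    return speedup(k, n) / k


def throughput(k, n, f):
    """Efficiency times the clock frequency f."""
    return efficiency(k, n) * f


# Hypothetical 5-stage pipeline: speedup climbs from 1 toward 5 as n grows.
for n in (1, 10, 100, 10_000):
    print(n, round(speedup(5, n), 3), round(efficiency(5, n), 3))
```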

Nonlinear Pipelines

A pipeline need not be a simple linear chain of stages. There are instances where it is useful to have a collection of functional units that can be wired into a particular pattern of flow, even with loops and skips in the chain. This may allow more than one function to be computed with the same pipeline. A typical case would be built-in floating-point square root, which chains together the floating-point adder and multiplier, rather than having separate functional units for this rarely used operation. Depending upon how the square root operation operates, it might leave holes in the schedule that would admit independent floating adds or multiplies.

The problem with trying to utilize a nonlinear pipeline is that it is difficult to keep it full unless the functions do not collide with each other or themselves.

For example, given the following pipeline from Hwang's Advanced Computer Architecture book:

These reservation tables show the sequence in which each function utilizes each stage. (For example, think of X as being a floating square root, and Y as being a floating cosine. A simple floating multiply might occupy just S1 and S2 in sequence.) We could also denote multiple stages being used in parallel, or a stage being drawn out for more than one cycle with these diagrams.

We determine the next start time for one or the other of the functions by lining up the diagrams and sliding one with respect to another to see where one can fit into the open slots.

Once an X function has been scheduled, another X function can start after 1, 3 or 6 cycles. A Y function can start after 2 or 4 cycles.

Once a Y function has been scheduled, another Y function can start after 1, 3 or 5 cycles. An X function can start after 2 or 4 cycles.

After two functions have been scheduled, no more can start until both are complete.
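The sliding procedure can be mechanized by recording, for each stage, the cycles in which a function occupies it. The sketch below does this for same-function starts only. The two reservation tables are assumed for illustration (the originals are not reproduced in these notes); they were chosen so that the computed latencies agree with the X-after-X and Y-after-Y values quoted above, and all names are hypothetical.

```python
# Assumed reservation tables: stage -> set of 0-based cycles in which the
# function occupies that stage. These are illustrative, not the book's figures.
X_TABLE = {"S1": {0, 5, 7}, "S2": {1, 3}, "S3": {2, 4, 6}}
Y_TABLE = {"S1": {0, 4}, "S2": {2}, "S3": {1, 3, 5}}


def forbidden_latencies(first, second):
    """Start delays at which `second` would collide with an earlier `first`."""
    forbidden = set()
    for stage, first_cycles in first.items():
        for t1 in first_cycles:
            for t2 in second.get(stage, ()):
                # Starting (t1 - t2) cycles later puts both in `stage` at once.
                if t1 > t2:
                    forbidden.add(t1 - t2)
    return forbidden


def permissible_latencies(first, second):
    """Collision-free start delays shorter than the length of `first`."""
    length = max(max(cycles) for cycles in first.values()) + 1
    blocked = forbidden_latencies(first, second)
    return [d for d in range(1, length) if d not in blocked]


print(permissible_latencies(X_TABLE, X_TABLE))  # [1, 3, 6]
print(permissible_latencies(Y_TABLE, Y_TABLE))  # [1, 3, 5]
```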

Instruction Pipelines

When instructions are flowing through a pipeline, the assumption is that the order of execution follows the linear order of the instructions in memory. A typical program, however, includes branches to other locations. Whether a conditional jump is taken may not be recognized until late in the pipeline (usually at the execute stage). At that point, all of the instructions behind the jump must be flushed and the pipeline must be restarted, resulting in wasted cycles called a branch penalty.

Given that an instruction takes unit time to pass through a stage of the pipeline, that the penalty for a branch is B stage times, that the probability that a given instruction is a jump is Pj and that the probability that the jump is taken is Pt , then the average time for an instruction (with the pipeline operating in steady state -- ignoring start-up cost) is

1 + B * Pj * Pt

and the efficiency of the instruction pipe is

1 / (1 + B * Pj * Pt)

For a branch penalty of 3 (i.e., branches are detected in stage 4), a probability of a branch of 0.2 and a probability of being taken of 0.4, the average instruction time is 1.24 cycles and the efficiency is only 80.6%. Thus, we have lost nearly 20% of the potential performance of the instruction pipe when one in five instructions is a branch and 40% of the time it is taken (which is actually better than the average case).
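The arithmetic in this example can be reproduced with a few lines. This is a minimal sketch of the two formulas above; the function names are assumptions.

```python
def average_instruction_time(branch_penalty, p_branch, p_taken):
    """Steady-state cycles per instruction: 1 + B * Pj * Pt."""
    return 1 + branch_penalty * p_branch * p_taken


def instruction_pipe_efficiency(branch_penalty, p_branch, p_taken):
    """Ideal time (1 cycle) divided by the average instruction time."""
    return 1 / average_instruction_time(branch_penalty, p_branch, p_taken)


avg = average_instruction_time(3, 0.2, 0.4)
eff = instruction_pipe_efficiency(3, 0.2, 0.4)
print(f"{avg:.2f} cycles, {eff:.1%} efficient")  # 1.24 cycles, 80.6% efficient
```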

Branches are one of three types of hazards (control hazards) that can occur in a pipeline. The other two types are structural hazards in which the hardware runs out of resources to handle all of the operations in the pipe simultaneously, and data hazards in which instructions depend on each other's results such that they cannot be overlapped.

Another example from Hwang:

Given a 7-stage pipe with the following stages:

Fetch               (F)

Decode            (D)

Issue                (I)

Execute 1        (1)

Execute 2        (2)

Execute 3        (3)

Write Back      (W)

If two assignment statements are executed:

X = Y + Z;

A = B * C;


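Then the pipeline operates as follows. Each stall lasts three cycles, shown as - sN -, where N is the number of the stall in the list below; a dot means the instruction is not in the pipe in that cycle.

Cycle            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
R1 <- Y          F  D  I  1  2  3  W  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
R2 <- Z          .  F  D  I  1  2  3  W  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
R3 <- R1 + R2    .  .  F  D  - s1  -  I  1  2  3  W  .  .  .  .  .  .  .  .  .  .  .  .  .  .
X <- R3          .  .  .  F  - s2  -  D  - s3  -  I  1  2  3  W  .  .  .  .  .  .  .  .  .  .
R4 <- B          .  .  .  .  .  .  .  F  - s4  -  D  I  1  2  3  W  .  .  .  .  .  .  .  .  .
R5 <- C          .  .  .  .  .  .  .  .  .  .  .  F  D  I  1  2  3  W  .  .  .  .  .  .  .  .
R6 <- R4 * R5    .  .  .  .  .  .  .  .  .  .  .  .  F  D  - s5  -  I  1  2  3  W  .  .  .  .
A <- R6          .  .  .  .  .  .  .  .  .  .  .  .  .  F  - s6  -  D  - s7  -  I  1  2  3  W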

Notice the gaps in this execution sequence. The pipeline takes 26 cycles, compared with its optimal performance of 14 cycles for the 8 instructions, an efficiency of 54%. Where do the gaps come from? Let's look at each one in turn (D indicates a data hazard, S indicates a structural hazard):

1    The Add cannot be issued until the second operand is written into the register (D)

2    Because the decoder is stalled, the Store cannot be decoded (S)

3    Because the Store depends on the result of the Add, it can't be issued before the write back of the add (D)

4    Even though loading B to R4 is independent, the decoder is stalled waiting to issue the Store, which is waiting for the Add to write back. (S)

5    The multiply can't be issued until the operands are ready. (D)

6    The decoder is stalled waiting to issue the multiply (S)

7    The multiply result is not available yet. (D)

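Reordering the instructions, however, can greatly improve the efficiency of the pipeline. Here sN marks stall N in the list below, and a dot means the instruction is not in the pipe in that cycle.

Cycle            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
R1 <- Y          F  D  I  1  2  3  W  .  .  .  .  .  .  .  .  .  .  .
R2 <- Z          .  F  D  I  1  2  3  W  .  .  .  .  .  .  .  .  .  .
R4 <- B          .  .  F  D  I  1  2  3  W  .  .  .  .  .  .  .  .  .
R5 <- C          .  .  .  F  D  I  1  2  3  W  .  .  .  .  .  .  .  .
R3 <- R1 + R2    .  .  .  .  F  D s1  I  1  2  3  W  .  .  .  .  .  .
R6 <- R4 * R5    .  .  .  .  .  F s2  D s3  I  1  2  3  W  .  .  .  .
X <- R3          .  .  .  .  .  .  .  F s4  D s5  I  1  2  3  W  .  .
A <- R6          .  .  .  .  .  .  .  .  .  F s6  D s7  I  1  2  3  W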

The pipeline takes 18 cycles to execute the same sequence, improving its efficiency to 78%. Here are the reasons for the remaining stalls:

1    The Add must wait for the write back of the second operand. (D)

2    The decoder is busy holding back the Add (S)

3    The multiply has to wait for the write back of its second operand (D)

4    The decoder is busy holding the multiply (S)

5    The store must wait for the write into R3 (D)

6    The decoder is busy holding the first store (S)

7    The second store cannot be issued until the multiply writes back its result. (D)

Note that if we had a third operation, we could issue its loads just prior to each of the other two operations, and we could issue it just before the second store. The pipeline would then be at 94% efficiency (one wasted cycle). Of course, the store for the third operation would have a two-cycle delay waiting for the write back of the result, but presumably we could also find something to fill those cycles (or perhaps the value would not be written back immediately anyway).
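The cycle counts for the two instruction orderings can be reproduced with a small scheduling model. The sketch below is built on assumptions inferred from the diagrams rather than stated in the notes: an instruction may issue in the same cycle that its last operand is written back, and an instruction may enter a stage only in the cycle that the preceding instruction moves on from it. The function and variable names are hypothetical.

```python
STAGES = ["F", "D", "I", "1", "2", "3", "W"]
ISSUE = STAGES.index("I")
LAST = len(STAGES) - 1


def schedule(program):
    """Return, per instruction, the cycle in which it enters each stage.

    program is a list of (dest_register_or_None, source_registers) pairs
    in program order. Cycles are numbered from 1.
    """
    entry = []   # entry[i][s]: cycle in which instruction i enters stage s
    ready = {}   # register -> cycle in which its producer writes back
    for i, (dest, sources) in enumerate(program):
        cycles = []
        for s in range(len(STAGES)):
            earliest = cycles[-1] + 1 if cycles else 1
            if i > 0:
                # Structural constraint: the previous instruction must have
                # advanced out of this stage (in-order pipe, one per stage).
                prev = entry[i - 1]
                earliest = max(earliest, prev[s + 1] if s < LAST else prev[s] + 1)
            if s == ISSUE:
                # Data constraint: wait for the write back of every operand.
                for reg in sources:
                    earliest = max(earliest, ready.get(reg, 0))
            cycles.append(earliest)
        entry.append(cycles)
        if dest is not None:
            ready[dest] = cycles[LAST]
    return entry


def total_cycles(program):
    return schedule(program)[-1][LAST]


LOADS = {r: (r, []) for r in ("R1", "R2", "R4", "R5")}
ADD = ("R3", ["R1", "R2"])
MUL = ("R6", ["R4", "R5"])
STORE_X = (None, ["R3"])
STORE_A = (None, ["R6"])

ORIGINAL = [LOADS["R1"], LOADS["R2"], ADD, STORE_X,
            LOADS["R4"], LOADS["R5"], MUL, STORE_A]
REORDERED = [LOADS["R1"], LOADS["R2"], LOADS["R4"], LOADS["R5"],
             ADD, MUL, STORE_X, STORE_A]

for name, prog in (("original", ORIGINAL), ("reordered", REORDERED)):
    ideal = len(STAGES) + len(prog) - 1
    total = total_cycles(prog)
    print(f"{name}: {total} cycles, efficiency {ideal / total:.0%}")
```

Under these assumptions the model gives 26 cycles for the original order and 18 for the reordered one, in agreement with the diagrams.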

Forwarding and Data Hazards

Sometimes it is possible to avoid data hazards by noting that a value that results from one instruction is not needed until a late stage in a following instruction, and sending the data directly from the output of the first functional unit back to the input of the second one (which is sometimes the same unit). In the general case, this would require the output of every functional unit to be connected through switching logic to the input of every functional unit.

Data hazards can take three forms:

Read after write (RAW): Attempting to read a value that hasn't been written yet. This is the most common type, and can be overcome by forwarding.

Write after write (WAW): Writing a value before a preceding write has completed. This can only happen in complex pipes that allow instructions to proceed out of order, or that have multiple write-back stages (mostly CISC), or when we have multiple pipes that can write (superscalar).

Write after read (WAR): Writing a value before a preceding read has completed. These also require a complex pipeline that can sometimes write in an early stage, and read in a later stage. It is also possible when multiple pipelines (superscalar) or out-of-order issue are employed.

The fourth situation, read after read (RAR) does not produce a hazard.
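The three cases can be told apart by comparing the register sets that two instructions read and write. The following is a minimal sketch; representing an instruction as a (reads, writes) pair of sets is an assumption made for illustration.

```python
def classify_hazards(first, second):
    """Return the data hazards between two instructions in program order.

    Each instruction is a (reads, writes) pair of register-name sets;
    `first` precedes `second` in the program.
    """
    first_reads, first_writes = first
    second_reads, second_writes = second
    hazards = set()
    if first_writes & second_reads:
        hazards.add("RAW")   # second reads what first writes
    if first_writes & second_writes:
        hazards.add("WAW")   # both write the same register
    if first_reads & second_writes:
        hazards.add("WAR")   # second overwrites what first still reads
    return hazards           # shared reads alone (RAR) yield no hazard


add = ({"R1", "R2"}, {"R3"})   # R3 <- R1 + R2
store = ({"R3"}, set())        # X <- R3
print(classify_hazards(add, store))  # {'RAW'}
```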

Forwarding does not solve every RAW hazard situation. For example, if a functional unit is merely slow and fails to produce a result that can be forwarded in time, then the pipeline must stall. A simple example is the case of a load, which has a high latency. This is the sort of situation where compiler scheduling of instructions can help, by rearranging independent instructions to fill the delay slots. If an architecture supports data prefetching, then the compiler can also try to force the value to be promoted to the L1 cache before it is loaded, so that the Load doesn't stall.

The processor can also rearrange the instructions at run time, if it has access to a window of prefetched instructions (called a prefetch buffer). It must perform much the same analysis as the compiler to determine which instructions are dependent on each other, but because the window is usually small, the analysis is more limited in scope. The small size of the window is due to the cost of providing a wide enough datapath to predecode multiple instructions at once, and the complexity of the dependence testing logic.

Out of order execution introduces another level of complexity in the control of the pipeline, because it is desirable to preserve the abstraction of in-order issue, even in the presence of exceptions that could flush the pipe at any stage. But we'll defer this to later.

Branch Penalty Hiding

We've noted that control hazards due to branches can cause a large part of the pipeline to be flushed, greatly reducing its performance. One way of hiding the branch penalty is to fill the pipe behind the branch with instructions that would be executed whether or not the branch is taken. If we can find the right number of instructions that precede the branch and are independent of the test, then the compiler can move them immediately following the branch and tag them as branch delay filling instructions. The processor can then execute the branch, and when it determines the appropriate target, the instruction is fetched into the pipeline with no penalty.

Of course, this scheme depends on how the pipeline is designed. It effectively binds part of the pipeline design for a specific implementation to the instruction set architecture. As we've seen before, it is generally a bad idea to bind implementation details to the ISA because we may need to change them later on. For example, if we decide to lengthen the pipeline so that the number of delay slots increases, we have to recompile our code to have it execute efficiently -- we no longer have strict binary compatibility among models of the same "ISA".

The filling of branch delays can be done dynamically in hardware by reordering instructions out of the prefetch buffer. But as we'll see, this leads to other problems.

Another way to hide branch penalties is to avoid certain kinds of branches. For example, if we have

IF A < 0

  THEN A = -A

we would normally implement this with a nearby branch. However, we could instead use an instruction that performs the arithmetic conditionally (skips the write back if the condition fails). The advantage of this scheme is that, although one pipeline cycle is wasted, we do not have to flush the rest of the pipe (also, for a dynamic branch prediction scheme, we need not put an extra branch into the prediction unit). These are called predicated instructions, and the concept can be extended to other sorts of operations, such as conditional loading of a value from memory.
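A software model of a predicated negate shows the idea: the result is always computed, and only the write back depends on the condition. The register-file dictionary and the function name below are assumptions for illustration; real hardware gates the write-enable signal rather than executing a test.

```python
def predicated_negate(registers, dest, src, predicate):
    """Model of a predicated instruction: compute always, write back only if predicate holds."""
    result = -registers[src]      # the functional unit computes unconditionally
    if predicate:                 # models the gated write back, not a program branch
        registers[dest] = result


regs = {"A": -5}
predicated_negate(regs, "A", "A", regs["A"] < 0)   # IF A < 0 THEN A = -A
print(regs["A"])  # 5
```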

Branch Prediction

Branches are the bane of any pipeline, causing a potentially large decrease in performance as we saw earlier. There are several ways to reduce this loss by predicting the action of the branch ahead of time.

Simple static prediction assumes that all branches behave the same way: either all are taken or all are not taken. The designer decides which way is predicted from instruction trace statistics. Once the choice is made, the compiler can help by properly ordering local jumps.

A slightly more complex static branch prediction heuristic is that backward branches are usually taken and forward branches are not (backwards taken, forwards not or BTFN). This assumes that most backward branches are loop returns and that most forward branches are the less likely cases of a conditional branch.

Compiler static prediction involves the use of special branches that indicate the most likely choice (taken or not, or more typically taken or other, since the most predictable branches are those at the ends of loops that are mostly taken). If the prediction fails in this case, then the usual cancellation of the instructions in the delay slots occurs and a branch penalty results.

Local Dynamic Branch Prediction

Dynamic prediction remembers the action taken by recent jumps and uses this information to predict the likely action on the next encounter. A small (4-state) finite state machine can be used to keep track of the two most recent jump outcomes and use this information to predict with fair accuracy what will happen on the next jump.
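One common 4-state scheme is a two-bit saturating counter. The notes do not say which state machine is intended, so the sketch below is one possible instance, with hypothetical names, applied to an assumed loop branch that is taken nine times and then falls through.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(outcomes, counter=None):
    """Run the predictor over a sequence of actual outcomes (True = taken)."""
    if counter is None:
        counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Assumed workload: a 10-iteration loop executed three times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_mispredictions(loop_branch))  # 5
```

After the two warm-up misses, the only mispredictions are the three loop exits; a single not-taken outcome does not flip the prediction for the next pass through the loop.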

The storage for these finite state machines can be kept in a shared pool, or it can be attached to the instruction cache. The latter case ensures that there are unique history bits for each branch, but is inefficient as many cache lines may have just a single branch or none at all (schemes that implement this usually compromise by supporting history for just one or two branches per line). The former case is thus more efficient, as branch instructions can be mapped to a smaller pool of history bits. However, the mapping function must be fast and thus simple, so it cannot make the extra effort to ensure that branches do not collide in their use of the history bits. Such collisions are known as aliasing. They result in mispredictions if aliased branches have different behaviors. However, execution still proceeds correctly, just with a performance penalty.

The popular approach taken in modern microprocessors is to have a cache called a branch target buffer. This cache is indexed by the address of each branch and it stores its target address. Note that the branch target buffer cannot be aliased: like a normal cache, it must record the remainder of the branch's address as a tag that can be checked. When the tag misses, then the target address is invalid and must be recalculated. The new address then replaces the old address in the cache.
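The tag check can be sketched as a small direct-mapped table. The organization (direct mapped, low address bits as index, remaining bits as tag) and all names below are assumptions for illustration; real designs vary in size and associativity.

```python
class BranchTargetBuffer:
    """Direct-mapped branch target buffer with a tag check on every lookup."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each entry: (tag, target)

    def _split(self, branch_address):
        index = branch_address & ((1 << self.index_bits) - 1)
        tag = branch_address >> self.index_bits     # remainder of the address
        return index, tag

    def lookup(self, branch_address):
        """Return the stored target, or None when the tag misses."""
        index, tag = self._split(branch_address)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, branch_address, target):
        """Install a freshly calculated target, replacing any old entry."""
        index, tag = self._split(branch_address)
        self.entries[index] = (tag, target)


btb = BranchTargetBuffer()
btb.update(0x40, 0x100)
print(btb.lookup(0x40), btb.lookup(0x50))  # 256 None
```

The two addresses in the example share an index but differ in tag, so the second lookup misses instead of returning another branch's target.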

Most branch prediction units are designed to reduce the size of hardware by indexing only on partial branch addresses. It turns out that in a small section of code, the addresses of branches are sufficiently different that many of their bits can be discarded without affecting the pattern of predictions. Of course, in a heavily branch-laden code, branches can end up aliased to the same partial address, but the only effect is that a misprediction may occur, which forces the usual pipeline flush and a restart with the target address calculation.

After a prediction is made by any of these schemes, it must be verified once the instruction executes. If the prediction is wrong, then the pipe must be flushed and restarted, just as with any other unpredicted branch.

Global Shared History Branch Prediction

The simple address-indexed two-bit history can be augmented with a second level of history information. It has been found that an individual branch may have multiple distinct behaviors, which interact with each other in a manner that isn't captured by simply recording the most recent actions. In many cases, these behaviors can be distinguished by identifying the control path that preceded the branch. For example, an earlier branch may set a value that controls the outcome of the current branch. That is, if branch A is taken, branch B is more likely to also be taken. We could imagine all sorts of complex schemes for identifying the control paths of a program, but because we are looking at hardware on a critical path, we must keep our scheme simple.

The easiest way to record the preceding path is to simply keep a bit-vector of taken/not-taken branch outcomes. If we arrive at the current branch via taken/taken/not-taken/taken, then this 1101 bit vector indicates one of 16 different paths that could have brought us here. This is, of course, a crude approximation. The bit positions in different vectors do not necessarily indicate the same set of branch points. Many paths could be converging, and there could be impossible combinations. But the bit vector is really just a convenient hash function that is indirectly related to arrival path. It is typically combined with the branch's address via an XOR operation to generate a hash into the table of two-bit predictors.

The result is that some branches use multiple locations in the table, while others may occupy just one. Using multiple table locations enables separate two-bit counters to be trained for each of the arrival paths. The disadvantage of using multiple counters is that it takes more encounters with a branch to train them, and begin to predict effectively. In addition, a branch that uses many counters is more likely to suffer from aliasing. There is also no direct correlation between the behavior of a branch and the number of arrival paths. For example, an exception check at the end of a function may follow the convergence of many paths, but is essentially always not taken. On the other hand, a branch with one arrival path may be very unpredictable because it is purely dependent on input data. The two-level predictor is thus an improvement that addresses a particular kind of behavior correlation, and is not a panacea for branch prediction.

The standard two-level predictor is known as G-share, meaning global history is used to index into a shared pool of two-bit predictors. The name comes from research that explored various combinations of using global history, local history, shared counters, and separate sets of counters. The G-share predictor was not found to be optimal among the options, but it is the best that has reasonable resource requirements. For example, more accurate prediction can be obtained by having a separate set of counters for each branch, to avoid aliasing. But, suppose we are keeping 16 bits of global history (outcomes from the last 16 branches), we then need 64K counters (16K bytes) into which we can index. Replicating this space for every branch is simply not possible, even though it eliminates aliasing.
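A global-history predictor of this kind can be sketched in a few lines: the global history is XORed with the branch address to pick a two-bit counter from a shared table. The table size, the initial counter value, the example address and the class name are all assumptions for illustration.

```python
class GShare:
    """Global history XOR branch address, indexing a shared pool of two-bit counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = 0                             # most recent outcome in the low bit
        self.counters = [1] * (1 << history_bits)    # start weakly not taken

    def _index(self, branch_address):
        return (branch_address ^ self.history) & self.mask

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


# Assumed workload: one branch whose outcome alternates taken / not taken.
predictor = GShare()
late_misses = 0
for step in range(200):
    taken = step % 2 == 0
    if step >= 100 and predictor.predict(0x2C) != taken:
        late_misses += 1
    predictor.update(0x2C, taken)
print(late_misses)  # 0
```

An alternating branch defeats a single two-bit counter, but here the two history patterns select different counters, each of which settles on one direction after a short training period.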

With a G-share predictor, aliasing is the most significant source of misprediction for most applications, although not the only one (some applications, with very simple control flow graphs, have little aliasing and get very good performance from G-share). We can reduce aliasing by increasing the size of the pool of counters, but this is costly and may also make the branch predictor too slow. Another way to reduce aliasing is to keep some branches out of the pool. For example, a branch that is always taken doesn't need dynamic prediction.

Hybrid Branch Predictors

The latter approach has led to research in "hybrid" predictors that seek to identify easy branches and assign them a fixed behavior, thereby keeping them out of the shared pool. This can be done, for example, by setting a bit associated with the branch in the instruction cache that indicates "simple" and another bit that indicates "taken" or "not-taken." Thus, the pool is reserved for the more complex branches.

Deciding which branches are "simple," however, is not so easy. The compiler can identify some likely candidates, but without run-time information it cannot definitively partition branches. Among other problems, the definition of "simple" may shift with respect to the level of aliasing in the pool. Consider that we are trying to maximize the overall accuracy of branch prediction by splitting branches among two or more modes of prediction. Thus, in a situation where there are many complex branches that all use many of the counters, the level of aliasing can be so high that none of them are well predicted. Overall prediction can thus be enhanced by changing the definition of "simple" to keep some of these branches out of the pool. The prediction accuracy for the branches that are kept out is not likely to be much worse, because it was already bad. However, the branches that remain in the pool see a significant increase in accuracy.

At run time, a hybrid predictor thus needs a variable threshold for deciding how to classify branches. We still need to identify the metric that we are going to threshold. A simple metric is just to use the misprediction rate for the branch. The rate can be estimated by merely counting the number of mispredictions over a fixed number of references to the branch (when the reference counter overflows, the miss counter is compared to the threshold). If the miss rate exceeds the threshold, then the branch is promoted into the G-share predictor.
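The counting scheme can be sketched as a pair of counters per branch. The window length, the threshold value and the class name below are assumptions for illustration; the notes do not give specific numbers.

```python
class BranchClassifier:
    """Per-branch miss-rate estimate used to decide on promotion to the shared pool."""

    def __init__(self, window=16, threshold=4):
        self.window = window          # references counted before each check
        self.threshold = threshold    # misses tolerated per window
        self.references = 0
        self.misses = 0
        self.use_gshare = False       # starts with the simple (static) prediction

    def record(self, mispredicted):
        """Record one execution of the branch and whether it was mispredicted."""
        self.references += 1
        self.misses += int(mispredicted)
        if self.references == self.window:     # the reference counter overflows
            if self.misses > self.threshold:
                self.use_gshare = True         # promote the branch
            self.references = 0
            self.misses = 0


hard = BranchClassifier(window=8, threshold=2)
for mispredicted in (True, False, True, False, True, False, False, False):
    hard.record(mispredicted)
print(hard.use_gshare)  # True
```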

Setting the threshold is yet another problem. One approach is to use profiling, in which a program is executed and the amount of aliasing is empirically determined. When the program is recompiled, it then includes instructions to set the threshold for the entire application. Another approach is to set the threshold on-line by gathering aliasing statistics. Either technique sets a static threshold for the entire application, which is often adequate. A few applications change their behavior in a gross manner while running, which is known as a phase change. In that case, it can be useful to extend the hybrid predictor with a phase change detector that adjusts the threshold. However, care must be taken in changing the threshold because the predictor tables take time to become trained, and it will be counterproductive to change the threshold with a frequency that disrupts the training.

Hybrid predictors add little in the case of applications that have few branches or branches with only simple behavior. They only improve performance on applications that are heavily aliased in the G-share table. In those cases, the prediction accuracy can rise significantly (say, from 65% to 90%).

Up to this point, we have considered a hybrid that selects static or G-share for each branch. We could also include a separate pool of local 2-bit counters, and a branch could start out with a static prediction, be promoted to using one of these local predictors if it crosses one threshold, and then move up to the G-share predictor if it crosses a second threshold. It turns out that having the third option can improve prediction rates by another 5% or so in the case of difficult codes (e.g., 95% vs. 90% for just a two-way hybrid).

Speculative Execution of Branches

Another approach is to fetch two streams of instructions after a branch and execute them in parallel, throwing away the one that fails to be taken. This is somewhat costly because it requires the first few stages of the pipeline to be duplicated, and possibly an extra set of register file ports and buffers so that the data can be fetched from the registers and be ready for use if either branch proceeds. In the case of wide superscalar processors, which may have multiple pipelines of each type, the speculation merely starts to fill the additional parallel pipes with the instructions from the alternate branch. Once the branch is resolved, then the instructions in the other pipe(s) are squashed.

Even the use of existing pipes is clearly a costly approach as it generates extra instructions into the pipelines. It should only be used for branches that cannot otherwise be accurately predicted, and that execute with sufficient frequency that the disruption of the pipes is going to be significant.

Other Branch Prediction Approaches

An interesting research approach to prediction has been explored in which a neural network is used to detect patterns of branching at run time and to further guide the prediction. It showed the ability to significantly improve prediction accuracy in some cases. However, the cost (in terms of hardware and delay) probably isn't justified by the performance improvement that would result. A more likely approach would incorporate the neural network into the run time profiler, and the compiler would use the patterns that were detected to rewrite the code to improve static prediction accuracy.

Another technique that has been explored in the literature to address branches with repeated, but complex, patterns is to use profiling to identify specific patterns of branching. The patterns can then be efficiently coded through a table lookup. Each branch can then be assigned its pattern index in the table, which enables the predictor to track the behavior of the branch and predict it accurately.

Data mining has been employed to try to explore patterns of branching. The interesting result from this work is that it can identify patterns in the data that is driving data-dependent branches. Of course, that is not especially useful for prediction, as a new data set won't have the same pattern. It does illustrate a potential pitfall of using data mining in architectural research.

Some architectures, such as the PowerPC, include loop count registers that are connected to the branch predictors. This enables the predictor to recognize when a loop is on its last iteration and to flip the prediction direction to achieve 100% accuracy even for the loop exit. This branch can also be kept out of the shared predictor to reduce aliasing.

Exceptions

Thus far we've considered only control hazards that originate from branches. These are actually the simpler control hazards to handle because we have some control over when they occur. With exceptions, we have far less control. Events such as overflow, page faults, breakpoints, and I/O interrupts can occur almost anywhere. Some of them happen asynchronously, and except for the most dire (such as power failure) we convert them into synchronous operation by delaying their recognition to the end of a clock cycle.

Some exceptions can also occur within an instruction (rather than between instructions). For example, a load instruction may generate a page fault that has to be serviced. The usual situation is to roll back the pipeline so that the load (and all subsequent instructions in the pipe) can be restarted cleanly once the fault has been serviced.

However, note that if a processor issues instructions out of order (as in the case of filling branch delay slots -- and what better place to generate a page fault than at a branch?), it is possible to have an exception reach the pipe when a simple roll-back won't restore the state in a manner that is consistent with in-order execution of the instructions. Thus, we have to keep additional state information that allows us to unravel the out-of-order issue and return to a proper pipeline state that can pick up after the exception is handled.

It might seem that we could simply save the state of the pipe in a temporary buffer. But there is no guarantee that multiple exceptions won't occur before we can restore the saved state. As a result, we really do have to back out the pipeline.