Lecture 6
Pipelining, Hazards, and Branch Prediction
Linear Pipeline Processors
We are all familiar with the basic notion of pipelining to increase throughput via staged temporal parallelism. We now consider how the basic idea can be modified to be more effective in certain situations.
Pipelines can be synchronous or asynchronous. We mostly focus on synchronous designs within a processor --Why? -- because:
• Processors are generally synchronous already.
• The stages are tightly coupled and in close proximity, making clock distribution easy and signal delays short and equal.
• The technology within the processor is homogeneous and under the control of the designer.
When are asynchronous pipelines useful?
• When we are dealing with asynchronous stages (i.e., stage times can vary).
• When the stages are not tightly coupled.
• When the signals must cross technology boundaries, or the stages are not in close proximity.
Systolic arrays are examples of asynchronous pipelines. Dataflow is an example of a nonlinear asynchronous pipeline. A possible direction for architectures in an era where wire delay is a major concern is to introduce asynchrony. However, the area of the typical pipeline is sufficiently small that it is unlikely to need asynchronous control. Forwarding of values and dependence information between pipelines is a more attractive option, and a potential area for architecture research.
Synchronous Linear Pipelines
Now let's consider how we would determine the clock period for a pipeline. In the following formulas we assume that the times can be determined through some appropriate means such as simulation or measurement.
The clock period for a pipeline is the maximum time for any stage (Tmax) plus the time to latch the data into the register that separates it from the next stage (TL). TL is the time during which the clock transitions. Because signals do not transition instantly, the clock takes some time to switch state. This sum is the minimum clock period for the pipeline.
Clock = Tmax + TL
Because there can also be skew in the arrival time (early or late) of the clock pulse at the different stages, there needs to be some compensation added to the clock period of the pipe.
Thus, we must add the skew time (S) to get the minimum clock period:
Clock >= Tmax + TL + S
A given stage may have multiple logic paths of different lengths. Tmax represents the longest path among all of the stages. However, it is possible for a stage to have a short path that is much shorter than this maximum. In designing the clock, we have to be certain that the time to latch a value (TL) plus the skew (S) is less than the time of the shortest path. Otherwise it would be possible for an input from a prior stage to propagate through the shortest path and affect its stage's output before the current output can be latched. Note that this is an easy problem to correct, simply by adding gates to the short path that merely pass the signals through with the necessary delay. The real problem is that it can be difficult to get the clock to transition fast enough. In that case, the simple solution would be to lower the clock rate (lengthen the clock period), but for obvious marketing reasons, this isn't a good option. Thus, engineers work very hard to ensure that the clock transitions fast enough to avoid the stage overrun problem.
Thus the length of TL must be less than or equal to the minimum delay through a stage (Tmin), minus the possible skew.
TL <= Tmin - S

So the range of possible clock periods runs from Tmax + S plus a pulse length TL chosen to be no more than Tmin - S, up to Tmax plus the maximum pulse length (Tmin - S):
TL + Tmax + S <= Clock <= Tmax + Tmin - S
Essentially there is no upper bound on Tmax, so the maximum could be any length (although whatever value is chosen for Tmax, the total time will be no more than that plus Tmin - S).
We try to push the clock period as close to the minimum as we can in aggressive designs and keep it near the middle of the range for more conservative designs. The variability of the fabrication process determines how closely we can approach the minimum period: if the variability is too great, the yield drops and the cost increases.
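The clock-period constraints in this section can be checked with a few lines of arithmetic. The following is a minimal sketch, not a design tool: the function names and the stage delays (in picoseconds) are hypothetical values chosen only for illustration.

```python
def min_clock_period(t_max, t_latch, skew):
    """Lower bound on the clock period: Tmax + TL + S."""
    return t_max + t_latch + skew


def latch_time_ok(t_min, t_latch, skew):
    """True when the latch time satisfies TL <= Tmin - S."""
    return t_latch <= t_min - skew


def short_path_padding(t_min, t_latch, skew):
    """Extra delay the shortest path needs so that TL <= Tmin - S holds."""
    return max(0, t_latch + skew - t_min)


# Hypothetical per-stage delays in picoseconds (assumed, not from the lecture).
longest_paths = [2000, 3500, 2800, 3100]
shortest_paths = [1200, 1500, 900, 1400]
T_LATCH = 300
SKEW = 100

t_max = max(longest_paths)    # slowest path over all stages
t_min = min(shortest_paths)   # fastest path over all stages

print(min_clock_period(t_max, T_LATCH, SKEW))   # 3900
print(latch_time_ok(t_min, T_LATCH, SKEW))      # True
print(short_path_padding(300, T_LATCH, SKEW))   # 100
```

The last call models a stage whose shortest path is only 300 ps: the padding it reports is the delay that pass-through gates would have to add to that path.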
Speedup
The best possible processing time from a pipeline of K stages on an input stream of N values with a clock period of C is
Tk = (K + (N - 1))C
That is, the first one takes K cycles to complete and after that, successive results emerge with each clock cycle.
A nonpipelined processor where the delay for one operation is KC would take time NKC. Thus the speedup is
Speedup = NKC / ((K + (N - 1))C) = NK / (K + N - 1)
As N drops to 1, we get a speedup factor of unity. As N grows toward infinity, the speedup approaches K.
The efficiency of the pipeline is the speedup divided by the number of stages
Efficiency = Speedup / K = N / (K + N - 1)
The throughput is the efficiency times the rate (frequency (F) -- inverse of clock period) at which the pipe operates
Throughput = Efficiency * F = NF / (K + N - 1)
Thus, as N approaches infinity, efficiency approaches 1 and throughput approaches F.
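As a quick numerical check of these limits, the sketch below computes speedup as the nonpipelined time NKC divided by the pipelined time (K + (N - 1))C, then derives efficiency and throughput as defined in the text. The function names are illustrative only.

```python
def pipeline_time(k, n, c):
    """Time for n inputs through a k-stage pipeline with clock period c."""
    return (k + (n - 1)) * c


def speedup(k, n):
    """Nonpipelined time n*k*c divided by pipelined time; c cancels."""
    return (n * k) / (k + n - 1)


def efficiency(k, n):
    """Speedup divided by the number of stages."""
    return speedup(k, n) / k


def throughput(k, n, f):
    """Efficiency times the clock frequency f."""
    return efficiency(k, n) * f


print(speedup(5, 1))               # 1.0: a single input gains nothing
print(speedup(5, 1_000_000))       # just under 5
print(efficiency(5, 1_000_000))    # just under 1
```

For example, a 7-stage pipe fed 8 inputs finishes in 14 clock periods, a speedup of 4 over the 56 periods the nonpipelined version would need.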
Nonlinear Pipelines
A pipeline need not be a simple linear chain of stages. There are instances where it is useful to have a collection of functional units that can be wired into a particular pattern of flow, even with loops and skips in the chain. This may allow more than one function to be computed with the same pipeline. A typical case would be built-in floating-point square root, which chains together the floating-point adder and multiplier, rather than having separate functional units for this rarely used operation. Depending upon how the square root operation operates, it might leave holes in the schedule that would admit independent floating adds or multiplies.
The problem with trying to utilize a nonlinear pipeline is that it is difficult to keep it full unless the functions do not collide with each other or themselves.
For example, given the following pipeline from Hwang's Advanced Computer Architecture book:

These reservation tables show the sequence in which each function utilizes each stage. (For example, think of X as being a floating square root, and Y as being a floating cosine. A simple floating multiply might occupy just S1 and S2 in sequence.) We could also denote multiple stages being used in parallel, or a stage being drawn out for more than one cycle with these diagrams.
We determine the next start time for one or the other of the functions by lining up the diagrams and sliding one with respect to another to see where one can fit into the open slots.
Once an X function has been scheduled, another X function can start after 1, 3 or 6 cycles. A Y function can start after 2 or 4 cycles.
Once a Y function has been scheduled, another Y function can start after 1, 3 or 5 cycles. An X function can start after 2 or 4 cycles.
After two functions have been scheduled, no more can start until both are complete.
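The sliding procedure can be automated for the simplest case, a function following another instance of itself. A latency is forbidden whenever two marks in the same row of the reservation table are that many cycles apart. The sketch below uses an assumed reservation table for X, chosen so that it reproduces the start times 1, 3 and 6 quoted above; it is not copied from the figure, and it does not cover an X following a Y.

```python
from itertools import combinations


def forbidden_latencies(table):
    """Latencies at which a second copy of the function would collide.

    table maps a stage name to the 0-based cycles in which the function
    occupies that stage.
    """
    forbidden = set()
    for cycles in table.values():
        # Two uses of one stage d cycles apart forbid a start offset of d.
        for earlier, later in combinations(sorted(cycles), 2):
            forbidden.add(later - earlier)
    return forbidden


def permitted_latencies(table):
    """Start offsets shorter than the function's length that do not collide."""
    length = 1 + max(c for cycles in table.values() for c in cycles)
    bad = forbidden_latencies(table)
    return [d for d in range(1, length) if d not in bad]


# Assumed 8-cycle reservation table for function X (hypothetical).
ASSUMED_X = {
    "S1": [0, 5, 7],
    "S2": [1, 3],
    "S3": [2, 4, 6],
}

print(sorted(forbidden_latencies(ASSUMED_X)))   # [2, 4, 5, 7]
print(permitted_latencies(ASSUMED_X))           # [1, 3, 6]
```

Any offset at least as long as the function itself is trivially permitted, which is why the search stops at the table length.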
Instruction Pipelines
When instructions are flowing through a pipeline, the assumption is that the order of execution follows the linear order of the instructions in memory. A typical program, however, includes branches to other locations. Whether a conditional jump is taken may not be recognized until late in the pipeline (usually at the execute stage). At that point, all of the instructions behind the jump must be flushed and the pipeline must be restarted, resulting in wasted cycles called a branch penalty.
Given that an instruction takes unit time to pass through a stage of the pipeline, that the penalty for a branch is B stage times, that the probability that a given instruction is a jump is Pj and that the probability that the jump is taken is Pt , then the average time for an instruction (with the pipeline operating in steady state -- ignoring start-up cost) is
1 + B * Pj * Pt
and the efficiency of the instruction pipe is
1 / (1 + B * Pj * Pt)
For a branch penalty of 3 (i.e., branches are detected in stage 4), a probability of a branch of 0.2 and a probability of being taken of 0.4, the average instruction time is 1.24 cycles and the efficiency is only 80.6%. Thus, we have lost nearly 20% of the potential performance of the instruction pipe when one in five instructions is a branch and 40% of the time it is taken (which is actually better than the average case).
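The same arithmetic in code makes it easy to try other branch statistics. This is a small sketch with illustrative function names; it simply evaluates the two formulas above.

```python
def average_instruction_time(penalty, p_jump, p_taken):
    """Steady-state cycles per instruction: 1 + B * Pj * Pt."""
    return 1 + penalty * p_jump * p_taken


def instruction_pipe_efficiency(penalty, p_jump, p_taken):
    """Ideal time (1 cycle) divided by the average instruction time."""
    return 1 / average_instruction_time(penalty, p_jump, p_taken)


avg = average_instruction_time(3, 0.2, 0.4)
eff = instruction_pipe_efficiency(3, 0.2, 0.4)
print(f"{avg:.2f} cycles, {eff:.1%} efficient")   # 1.24 cycles, 80.6% efficient
```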
Branches are one of three types of hazards (control hazards) that can occur in a pipeline. The other two types are structural hazards in which the hardware runs out of resources to handle all of the operations in the pipe simultaneously, and data hazards in which instructions depend on each other's results such that they cannot be overlapped.
Another example from Hwang:
Given a 7-stage pipe with the following stages:
Fetch (F)
Decode (D)
Issue (I)
Execute 1 (1)
Execute 2 (2)
Execute 3 (3)
Write Back (W)
If two assignment statements are executed:
X = Y + Z;
A = B * C;
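Then the pipeline operates as follows. Columns are clock cycles; a dash marks a stall cycle, and a number in parentheses labels a gap for reference:

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R1 <- Y | F | D | I | 1 | 2 | 3 | W | | | | | | | | | | | | | | | | | | | |
| R2 <- Z | | F | D | I | 1 | 2 | 3 | W | | | | | | | | | | | | | | | | | | |
| R3 <- R1 + R2 | | | F | D | - | (1) | - | I | 1 | 2 | 3 | W | | | | | | | | | | | | | | |
| X <- R3 | | | | F | - | (2) | - | D | - | (3) | - | I | 1 | 2 | 3 | W | | | | | | | | | | |
| R4 <- B | | | | | | | | F | - | (4) | - | D | I | 1 | 2 | 3 | W | | | | | | | | | |
| R5 <- C | | | | | | | | | | | | F | D | I | 1 | 2 | 3 | W | | | | | | | | |
| R6 <- R4 * R5 | | | | | | | | | | | | | F | D | - | (5) | - | I | 1 | 2 | 3 | W | | | | |
| A <- R6 | | | | | | | | | | | | | | F | - | (6) | - | D | - | (7) | - | I | 1 | 2 | 3 | W |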
Notice the gaps in this execution sequence. The sequence takes 26 cycles, compared with the optimal 14 cycles for the 8 instructions (7 + 8 - 1), so the efficiency of the pipeline is only 54%. Where do the gaps come from? Let's look at each one in turn (D indicates a data hazard, S indicates a structural hazard):
1. The Add cannot be issued until the second operand is written into the register. (D)
2. Because the decoder is stalled, the Store cannot be decoded. (S)
3. Because the Store depends on the result of the Add, it can't be issued before the write back of the Add. (D)
4. Even though loading B to R4 is independent, the decoder is stalled waiting to issue the Store, which is waiting for the Add to write back. (S)
5. The multiply can't be issued until the operands are ready. (D)
6. The decoder is stalled waiting to issue the multiply. (S)
7. The multiply result is not available yet. (D)
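The stalls listed above can be reproduced with a small timing model. The sketch below rests on rules inferred from the timing diagram rather than stated in the lecture, so treat them as assumptions: each stage holds one instruction, instructions issue in order, there is no forwarding, and an instruction may issue in the same cycle in which its last operand is written back. The function name and the tuple encoding of instructions are illustrative.

```python
EXECUTE_STAGES = 3  # Execute 1, 2 and 3 sit between Issue and Write Back


def schedule(program):
    """Return (fetch, decode, issue, write_back) cycles per instruction.

    program is a list of (destination, sources) pairs in program order.
    """
    ready = {}  # register name -> cycle of its write back
    prev_fetch = prev_decode = prev_issue = 0
    timeline = []
    for dest, sources in program:
        # The fetch stage frees up when the previous instruction is decoded.
        fetch = max(prev_fetch + 1, prev_decode)
        # The decoder frees up when the previous instruction issues.
        decode = max(fetch + 1, prev_issue)
        # Issue waits for in-order issue and for every source register.
        operands = max((ready.get(s, 0) for s in sources), default=0)
        issue = max(decode + 1, prev_issue + 1, operands)
        write_back = issue + EXECUTE_STAGES + 1
        ready[dest] = write_back
        timeline.append((fetch, decode, issue, write_back))
        prev_fetch, prev_decode, prev_issue = fetch, decode, issue
    return timeline


# The eight instructions in their original order; memory operands such as
# "Y" never appear as destinations, so they are treated as always ready.
ORIGINAL_ORDER = [
    ("R1", ["Y"]),
    ("R2", ["Z"]),
    ("R3", ["R1", "R2"]),
    ("X", ["R3"]),
    ("R4", ["B"]),
    ("R5", ["C"]),
    ("R6", ["R4", "R5"]),
    ("A", ["R6"]),
]

print(schedule(ORIGINAL_ORDER)[-1][3])   # 26
```

Under these assumptions the last write back lands in cycle 26, and passing the same instructions in a different order shows how the total changes.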
Reordering the instructions, however, can greatly improve the efficiency of the pipeline:
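| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R1 <- Y | F | D | I | 1 | 2 | 3 | W | | | | | | | | | | | |
| R2 <- Z | | F | D | I | 1 | 2 | 3 | W | | | | | | | | | | |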