CmpSci 535 Notes from Lecture 8

Pipelining

In the previous chapter we saw that a datapath can be divided into multiple cycles. The advantage of this division is that instructions that do not use the entire datapath can complete in fewer cycles. In addition, the clock rate can be increased. However, each instruction must wait for the one ahead of it to complete before it can begin to execute. This approach could be likened to a washing machine: different types of loads require different sequences of steps, but each load of clothes must be completely washed and removed before we can put in another load.



Suppose we wanted to be able to wash more clothes in a given time. Notice that, at any given time, the machine could be doing one of four different steps. Why can't it do these all at once for multiple loads?

Because it can only hold one load at a time.

So how can we enable it to perform multiple steps at once?

If we have separate units for each of the steps, we can transfer the clothes from one step to the next and start a new load of clothes right away.



What is it about this change that enables us to run multiple loads at once?

The steps are independent, and they each can hold a load of clothes.

In a datapath, the same basic idea can be used to process multiple instructions at once. It requires that the hardware for each of the steps be independent -- able to operate separately from the hardware of the other steps -- and that it have storage for intermediate state information.

There is one other aspect of this approach that isn't obvious at first. The loads of clothes, and the instructions, can only start if they are independent of each other. For example, if you know you need to run a sweatshirt through twice to get rid of a stain, you can't start the second load until the first one is completed. To make this approach worthwhile, you also need to have enough loads of wash to keep all of the units busy. Fortunately, in a datapath, there are usually plenty of instructions, and we take different (independent) instructions from memory in a simple sequential order.







Notice that the first example requires 10 cycles to complete while the second example takes only 6 cycles. This is a 40 percent reduction in execution time. Notice also that if more and more instructions were to be processed, the average time per instruction remains the average number of steps in an instruction (in this case, something between 3 and 4) but the rate at which instructions are completed is much closer to one per cycle (roughly a 60 to 75 percent improvement).

Thus, instructions flow through the datapath in a steady stream, something like fluid flowing through a pipe. Hence this style of datapath is called a pipeline.

When we pour fluid into a pipe, it takes some time (proportional to the length of the pipe) for the first drops to emerge from the other end. But if we pour steadily, then fluid continues to emerge from the pipe at whatever rate it enters.

Pipeline Performance

For an infinite sequence of instructions that keep the pipeline full, the rate at which instructions are executed is one per cycle, where a cycle is a pipeline stage time. Thus, if we were to compare the instruction execution rate of a pipelined vs. a non-pipelined version of the same datapath on such an instruction stream, the pipelined processor would execute at N times the rate of the non-pipelined processor, where N is the number of stages in the pipe.

For a finite sequence of S instructions, the total execution time is N + S - 1 cycles, because the first instruction takes N cycles and the remaining S - 1 instructions each take another cycle to be completed. The average time per instruction is thus (N + S - 1)/S cycles. A non-pipelined datapath would need SN cycles for the same sequence, so the speedup is SN/(N + S - 1). As you can see, as S approaches infinity, the execution time approaches 1 cycle per instruction and the speedup approaches N. Unfortunately, we can't keep a pipeline full forever.
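The finite-sequence speedup can be checked numerically. The following is a minimal sketch, assuming that the non-pipelined datapath spends N cycles on every instruction and that the pipeline never stalls; the function names are illustrative and do not come from the lecture.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for S instructions to pass through an N-stage pipe with no bubbles."""
    # The first instruction needs N cycles; each later one finishes a cycle after.
    return num_stages + num_instructions - 1


def pipeline_speedup(num_instructions: int, num_stages: int) -> float:
    """Speedup over a non-pipelined datapath (assumed to need N cycles per instruction)."""
    unpipelined_cycles = num_instructions * num_stages
    return unpipelined_cycles / pipeline_cycles(num_instructions, num_stages)


if __name__ == "__main__":
    for s in (1, 10, 100, 10_000):
        print(s, pipeline_cycles(s, 5), round(pipeline_speedup(s, 5), 3))
```

For a 5-stage pipe the speedup starts at 1 for a single instruction and creeps toward 5 as the sequence grows.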

If we stop pouring fluid for a while, then air enters the pipe and a bubble forms in the fluid flow. If we start pouring again, the other end of the pipe will have a pause in its output that equals the length of the pause in the pouring.

The pipelined datapath in a computer behaves the same way -- it shows the same initial delay in emitting the first result, and any gaps in the input produce corresponding gaps in the output that are referred to as "bubbles".

What might cause bubbles?

Gaps in the instruction stream due to memory delays.

Jumps -- notice that the jump occurs at the end of an instruction, but it invalidates the fetches that went on after it was fetched. The jump operation is not an independent instruction -- the instruction that follows it depends on its execution.

Jumps account for a large fraction of the inefficiency seen in actual pipelines. In some programs, especially those written in object-oriented languages, jumps account for 25% to 30% of the instructions executed. How does this affect the efficiency of a pipeline?

When a jump is executed, it takes some number of cycles for the control unit to detect that the instruction is a jump. With an unconditional jump, it can immediately switch to fetching from the target location, and the gap is just one or two pipeline stages (whatever stage the decode occurs plus one more to transfer the target address to the PC and simultaneously start the next fetch).

With a conditional jump (or branch), the number of cycles before it is detected is typically greater. Even though the control unit can identify the instruction as a branch, it can't tell whether it will be taken until some computation is performed. Thus, the delay can be almost the full length of the pipeline. For example, in a 5-stage pipeline, a branch might not be determined until the fourth stage. Thus, the three stages behind it in the pipeline are invalid and another instruction must enter the pipe at that point. The three invalid stages represent three instructions that were partially executed on the speculation that the branch would not be taken. They also represent three wasted pipeline cycles that are called the jump penalty.



If the jump penalty in a pipeline is P, the probability that an instruction is a jump is J, and the probability that the jump is taken is T, then the average instruction time (in cycles) for the pipeline is 1 + PJT. The efficiency is the inverse of this quantity: 1/(1 + PJT).

The probabilities in this formula are dependent on the specific program being executed. Tight loops and frequent subroutine calls increase the probabilities while long stretches of straight-line code decrease the probabilities.

Let's look at some examples to see the magnitude of this effect. Suppose we have a pipe as above, and a code has 30% branches and 25% of them are taken. What is the average instruction time and the efficiency?

Average time = 1 + 3 x 0.3 x 0.25 = 1.225

Efficiency = 1/average time = 81.6%

Speedup = Efficiency x Number of stages = 4.08

If the jumps are taken 50% of the time

Average time = 1 + 3 x 0.3 x 0.5 = 1.45

Efficiency = 1/average time = 68.9%

Speedup = Efficiency x Number of stages = 3.45

So, for fairly conservative scenarios, we see reductions in performance of nearly a third. Suppose the probability that a jump is taken rises to 70%.

Average time = 1 + 3 x 0.3 x 0.7 = 1.63

Efficiency = 1/average time = 61.3%

Speedup = Efficiency x Number of stages = 3.07
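The three scenarios above can be reproduced with a few lines of code. This is a small sketch of the 1 + PJT model only; the function and parameter names are made up for illustration, and the 5-stage, penalty-3 pipe is the one assumed in the examples.

```python
def average_time(penalty: float, jump_fraction: float, taken_fraction: float) -> float:
    """Average cycles per instruction: 1 + P * J * T."""
    return 1 + penalty * jump_fraction * taken_fraction


def efficiency(penalty: float, jump_fraction: float, taken_fraction: float) -> float:
    """Fraction of the ideal one-instruction-per-cycle rate that is achieved."""
    return 1 / average_time(penalty, jump_fraction, taken_fraction)


def speedup(penalty: float, jump_fraction: float, taken_fraction: float,
            num_stages: int) -> float:
    """Speedup over a non-pipelined datapath for an infinite instruction stream."""
    return efficiency(penalty, jump_fraction, taken_fraction) * num_stages


if __name__ == "__main__":
    for taken in (0.25, 0.5, 0.7):
        print(taken,
              round(average_time(3, 0.3, taken), 3),
              round(100 * efficiency(3, 0.3, taken), 1),
              round(speedup(3, 0.3, taken, 5), 2))
```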

Branch Prediction and Changing the Detection Point

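What simple change could we make in this case to improve the efficiency?

Always assume that a branch is taken. The penalty is then paid only by the 30% of branches that are not taken, so the average time is 1 + 3 x 0.3 x 0.3 = 1.27 and the efficiency rises to 78.7%.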

Making such an assumption (one way or the other) is called static branch prediction. We are using statistical analysis of our programs to make a design decision regarding the treatment of branches in order to reduce the loss of efficiency due to branches. Of course, if our sample set of programs isn't representative, then our static strategy fails. On the other hand, once we make such a decision, we can often have a compiler provide some help.

Let's look at what happens if the branch detection point changes. What would you expect to be the effect if the detection point occurs later? Let's say it happens in stage 7 of an 8-stage pipe, so that the six stages behind the branch must be flushed, and we have 30% branches with a 50% probability of being taken.

Average time = 1 + 6 x 0.3 x 0.5 = 1.90

Efficiency = 1/average time = 52.6%

Speedup = Efficiency x Number of stages = 4.21

As we could guess, moving the detection point later increases the penalty and decreases the efficiency. With a penalty of 3, the efficiency is 68.9% but it drops to 52.6% when the penalty is 6. Note however that, because we increased the length of the pipeline, the speedup still increases. If the penalty is reduced to 2 in a 5-stage pipe, then

Average time = 1 + 2 x 0.3 x 0.5 = 1.30

Efficiency = 1/average time = 76.9%

Speedup = Efficiency x Number of stages = 3.85

So earlier detection is better, as we would expect.

Several techniques can be used to improve the efficiency of a pipeline in the presence of jumps. One is to tag jumps as being likely to be taken or not. For example, in a FOR loop, the jump that returns to the top of the loop could be tagged as likely to be taken. That is, a bit in the opcode could be reserved to indicate the likelihood of the jump being taken. This allows us to move the detection point up to the decode stage (perhaps reducing the penalty to one). Of course, when we guess wrong (the loop does eventually exit) then we pay a larger penalty, but hopefully we do so infrequently.

With a true FOR loop, we can even use a special counter that signals when the loop is about to exit, and thus we pay no penalty at all for the jump at the end of the loop. Unfortunately, C and C++ do not contain true FOR loops (they can have additional conditions) so this applies mostly to Pascal, Ada and Fortran programs.

Another approach that is used is to tag instructions as being independent of the branch and move them after the branch. For example, if we have some calculation to perform prior to the branch, but the branch does not depend on its outcome, we can move some instructions to follow the branch and tag them as not having to be discarded if the branch is taken. Thus, the penalty gap can be filled with useful work no matter what the branch does.

These are all static methods as they rely on analysis of the code prior to execution. At execution time we can also use the past behavior of a branch to help predict its future actions. Most individual branches exhibit a behavior in which there is a typical action and an exceptional case. Thus, once a branch is executed, we can guess that it will behave the same way again.

A processor can be built with a table that stores the locations and actions of branches as it encounters them (called a branch history table). When a branch is fetched, its address causes the history table to recall its prior action and the control unit can assume that it will behave the same way and take the appropriate action.

Of course, if the branch takes the exceptional action, then the full penalty must be paid. In addition, the first time a branch is encountered, the full branch penalty must be taken. This approach is called dynamic branch prediction because it takes place at run time.
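The "predict the same action as last time" scheme can be sketched in a few lines. This is an illustrative model only: the class and function names are hypothetical, the table is an unbounded dictionary rather than a fixed-size hardware table, and a branch that has never been seen is counted as a misprediction, following the statement above that the first encounter pays the full penalty.

```python
class BranchHistoryTable:
    """Remembers the most recent action of each branch, keyed by its address."""

    def __init__(self):
        self._last_taken = {}

    def predict(self, branch_address):
        # Returns True/False for a known branch, or None if it was never seen.
        return self._last_taken.get(branch_address)

    def update(self, branch_address, taken):
        self._last_taken[branch_address] = taken


def count_mispredictions(outcomes, branch_address=0x40):
    """Count how many executions of one branch would pay the branch penalty."""
    table = BranchHistoryTable()
    misses = 0
    for taken in outcomes:
        if table.predict(branch_address) != taken:
            misses += 1  # unseen branch, or it took the exceptional action
        table.update(branch_address, taken)
    return misses


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # taken nine times, then the loop exits
    print(count_mispredictions(loop_branch))
```

A loop-closing branch does well under this scheme, while a branch that alternates between taken and not taken is mispredicted every time.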

By combining static and dynamic methods, it is quite possible to achieve a 90% prediction rate, so that the effective probability of a branch incurring the penalty is just 10%. With our previous example, we have:

Average time = 1 + 3 x 0.3 x 0.1 = 1.09

Efficiency = 1/average time = 91.7%

So, for a 5-stage pipeline, we manage to get effectively a speedup of 4.6 over a non-pipelined system for an infinite instruction stream. Note that this is better performance from a 5-stage pipe than we got from our example of an 8-stage pipe with less aggressive prediction.

Timing

Because all of the stages of a pipeline advance in lock-step, the basic cycle time of a pipe is determined by the slowest stage. There is thus a significant incentive for the designer to equalize the time through all of the stages. Otherwise, speed in some stages is wasted. As we've seen, speed is often related to power consumption. Thus, power may be wasted in an imbalanced pipeline.

There are several ways that we could go about balancing the timing of individual stages. Can you think of any?

Use faster or slower components as needed.

Shift some functionality between stages.

Split long stages into two shorter stages.

Combine two fast stages into one slower stage.
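The effect of an unbalanced stage can be illustrated numerically. The sketch below assumes made-up stage delays and an idealized split in which a stage divides into two equal halves with no extra buffer overhead; none of these numbers or names come from the lecture.

```python
def cycle_time(stage_delays):
    """All stages advance in lock-step, so the slowest stage sets the clock."""
    return max(stage_delays)


def idle_time_per_cycle(stage_delays):
    """Time each stage spends waiting for the slowest stage in every cycle."""
    clock = cycle_time(stage_delays)
    return [clock - delay for delay in stage_delays]


def split_stage(stage_delays, index):
    """Split one stage into two equal halves (idealized: no added latch delay)."""
    half = stage_delays[index] / 2
    return stage_delays[:index] + [half, half] + stage_delays[index + 1:]


if __name__ == "__main__":
    delays = [2, 1, 2, 4, 1]  # hypothetical delays, arbitrary time units
    print(cycle_time(delays), idle_time_per_cycle(delays))
    balanced = split_stage(delays, 3)
    print(cycle_time(balanced), idle_time_per_cycle(balanced))
```

With these assumed delays, splitting the slow stage halves the cycle time at the cost of one more pipeline stage.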

If, instead of having the stages operate in lockstep, we have them pass data to each other as they finish with it and the successor unit is available, then we have a different kind of system called a "dataflow" processor. Dataflow designs can have pipelines with nonlinear topologies (that is, branches), and they are perhaps most commonly employed in signal processing, where a long and complex calculation must be performed on a stream of data. The calculation is broken into stages and these are activated on a demand basis as data enters the system.

Implementing the Pipelined Datapath

Ironically, when we converted the single-cycle datapath into a multicycle datapath we made it more difficult to implement as a pipelined datapath in at least one way. What way is that?

We reduced the independence of the stages by reusing the ALU in different cycles.

Even though a pipelined datapath's stages correspond more to the cycles in a multicycle datapath, it prohibits the reuse of elements in different stages because every stage (cycle) is active at the same time. So, in order to build a pipelined version of the datapath, we go back to the single-cycle design and add memory elements at the points corresponding to the breaks between cycles. Thus, we might consider the pipelined datapath to be a combination of the two earlier designs that is greater than the sum of its parts.



The purpose of the memory elements is to preserve the state of the previous stage for input to the next stage, thereby enabling the previous stage to start work on the next instruction. This is conceptually similar to the approach taken in the Master-Slave flip-flop to preserve the output of the master stage so that the master can start sampling the input signal again.

Notice that this datapath again has the separate adders for the PC increment and the jump address addition. Although it might appear at first that this is a 4-stage pipe it is actually 5 stages:

Instruction fetch

Instruction decode

Execute

Memory access

Write back

To make the diagram more representative of the 5 stages, we might repeat the register box at the right side, because the registers are actually capable of both reading and writing at once. Thus, they could be considered as two independent units. Another way to recognize this as a five-stage pipeline is to notice that four memory buffers are needed to separate five stages.

Another aspect of this design that isn't obvious at first is that the Destination Register value that emerges from the instruction fetch stage must be buffered through the rest of the stages. Why? Because it determines the place where this instruction's result will be written back to, and that result isn't ready until the write back stage. Thus, the register number must be saved until the instruction has passed through the remaining stages of the pipe.

Let's follow an example through the pipeline. Suppose we want to execute

Load B

Load C

Load D

Load E

Add B, C

Sub C, D

Brz E






















Notice how each instruction reaches the write-back stage before its result is needed. The instruction sequence was carefully selected to ensure that there were no dependences between any of the instructions. Later on we will examine the effects of instruction sequences where a computation depends on a value that isn't available yet.
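The cycle-by-cycle occupancy of the pipe for this sequence can also be generated with a short program. This is a sketch of an ideal 5-stage pipe with no stalls; the stage abbreviations (IF, ID, EX, MEM, WB) and the function names are assumptions chosen for the illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = ["Load B", "Load C", "Load D", "Load E",
           "Add B, C", "Sub C, D", "Brz E"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a cycle (1-based).

    Returns None when the instruction is not in the pipe during that cycle.
    """
    position = cycle - 1 - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def pipeline_diagram(program):
    """One text row per instruction, one column per clock cycle."""
    total_cycles = len(STAGES) + len(program) - 1
    rows = []
    for i, instr in enumerate(program):
        cells = [(stage_in_cycle(i, c) or ".").ljust(4)
                 for c in range(1, total_cycles + 1)]
        rows.append(instr.ljust(10) + "".join(cells).rstrip())
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(PROGRAM):
        print(row)
```

Each row is shifted one cycle to the right of the row above it, which is the staircase pattern that pipeline diagrams usually show.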

Control in a Pipelined Datapath

In a non-pipelined datapath, we saw that control signals are generated from the instruction by a central FSM and then distributed to all parts of the datapath at once. In a pipelined datapath, however, there are multiple instructions active at once. Thus, the control unit must be pipelined as well.

In effect, the control unit generates the signals that would direct the actions of the single-cycle processor, but instead of being distributed, these signals are buffered along with the intermediate results between the stages. That is, we take a snapshot of the signals being emitted by the control unit and pass the snapshot through the pipeline. As these frozen control signals are transferred from stage to stage, some of them are connected from the output of the interstage buffer to the components they control. These signals are generally not forwarded to the next interstage buffer. Only the control signals that are needed in subsequent stages are passed on.



Notice that only the stages after decoding require pipelining. The control unit's signals to the Fetch and Decode stages are independent of the instruction. It is only the instruction-dependent control signals that need to be captured and passed from stage to stage.

If we look at the datapath, we can see which of the signals fall into each of these groups simply by looking at the stage that the signal is needed in. For example, the execute stage needs signals to control the ALU function and the two multiplexers that are in that stage. In the memory stage, memory read and write control are needed and so is the signal that indicates whether this instruction is a branch on zero. In the write back stage, we have to select the appropriate mux input and send the register write control signal.
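The way the control snapshot shrinks as it moves down the pipe can be sketched as follows. The signal names here are hypothetical stand-ins for the signals just described (ALU function, the two execute-stage multiplexers, memory read and write, branch, the write-back mux, and register write); the real names depend on the datapath diagram.

```python
# Hypothetical grouping of control signals by the stage that consumes them.
SIGNALS_BY_STAGE = {
    "EX": ["alu_op", "alu_src", "reg_dst"],
    "MEM": ["mem_read", "mem_write", "branch"],
    "WB": ["mem_to_reg", "reg_write"],
}


def advance(control_word, stage):
    """Split a buffered control word at one stage.

    Returns (signals used by this stage, signals passed to the next buffer).
    """
    used = {name: control_word[name] for name in SIGNALS_BY_STAGE[stage]}
    forwarded = {name: value for name, value in control_word.items()
                 if name not in used}
    return used, forwarded


if __name__ == "__main__":
    # Assumed control snapshot for a load instruction.
    word = {"alu_op": "add", "alu_src": 1, "reg_dst": 0,
            "mem_read": 1, "mem_write": 0, "branch": 0,
            "mem_to_reg": 1, "reg_write": 1}
    for stage in ("EX", "MEM", "WB"):
        used, word = advance(word, stage)
        print(stage, "uses", sorted(used), "and forwards", sorted(word))
```

After the write back stage nothing is left to forward, which mirrors the fact that each interstage buffer needs to be only as wide as the signals still to be consumed.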

Data Hazards

Thus far we have looked at the operation of a pipeline when the instructions are independent of each other. But what if they depend on each other? The answer is that the pipeline becomes much more complex. How can instructions depend on one another? As with most aspects of computers, there are two forms that dependences can take: data and control. First we'll look at data dependences.

Data can enter the pipeline from two places: memory or registers. Thus, not surprisingly, data dependences can be split into two forms, depending on their source. Register dependences take the form of one instruction needing a value in a register that has not yet been put there by another instruction that has gone before it in the pipeline. To see how this can happen, consider the following sequence:

Add $1, $2, $3 -- Add registers 2 and 3, storing result in 1

Sub $4, $5, $1 -- Subtract register 5 from register 1, storing result in 4

Looking at the following diagram, we can see that once the second instruction is fetched, it is trying to read the value in register $1 to be passed to the ALU during its next stage. The trouble is, this register does not yet contain the result of the first instruction because that instruction has not yet reached the write back stage. In fact it won't reach its write-back until two more stages have elapsed.

One solution to this problem is to have the compiler rearrange instructions so that dependent operations are always separated by two independent operations. In some cases the compiler can find other useful instructions to insert between the dependent ones. For example, if we have a series of assignment statements that are independent of each other but that have internally dependent operations, we might interleave the assembly language for their calculations to achieve the necessary separation.



In the worst case, the compiler inserts no-op instructions that fill the necessary pipeline stages. Obviously, this is a waste of potentially productive cycles. However, it has not required us to make any changes to the pipeline design, which keeps it very simple. The other unfortunate side effect, however, is that it requires the assembly language programmer to view the ISA differently. Thus, assembly programs must reflect the timing of the pipeline. If, in the future, we change the design of the pipeline, these assembly language programs might no longer be valid (or at least not efficient). The implementation of the pipeline has become visible at the ISA level, which is precisely what we hope to avoid by specifying an ISA.

So, how might we solve the problem in hardware? We could detect the dependence and insert no-op instructions into the pipe ourselves. How do we detect the dependence?

Compare the SrcReg1 and SrcReg2 values to the DestReg values in the next three buffer stages. If any of them are equal, then substitute a no-op at the outputs of the Fetch/Decode buffer.

There is one more bit of housekeeping that has to be done to make this scheme work. What is it?

We have to stop the flow of instructions into the pipe -- to do this we inhibit the increment of the PC and the Read signal to the instruction memory. Once the instruction with the dependent register reference has reached Write Back, the output of the comparator shows that it is safe to proceed, and the instruction in the Fetch/Decode buffer is released and the PC is allowed to increment and fetch the next instruction.
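The comparator and stall behavior described here can be modeled in software. This is a simplified sketch, assuming three downstream buffers whose DestReg fields are listed from nearest to farthest, with None standing for a no-op; the function names are illustrative, and details such as a hard-wired zero register are ignored.

```python
def must_stall(src_regs, pending_dest_regs):
    """True if a source register matches a destination that is still in flight."""
    return any(dest is not None and dest in src_regs
               for dest in pending_dest_regs)


def stall_cycles(src_regs, pending_dest_regs):
    """Count the no-ops inserted before the waiting instruction may proceed."""
    pending = list(pending_dest_regs)
    cycles = 0
    while must_stall(src_regs, pending):
        # A no-op enters the nearest buffer; the oldest instruction leaves the
        # last buffer, which models the completion of its write back.
        pending = [None] + pending[:-1]
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Sub $4, $5, $1 directly behind Add $1, $2, $3
    print(stall_cycles({5, 1}, [1, None, None]))
    # The same Sub with two independent instructions in between
    print(stall_cycles({5, 1}, [2, 3, 1]))
```

While the count is nonzero, the PC increment and the instruction memory read would be inhibited, as described above.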

If we build our comparator to also recognize the different instruction formats (register vs. memory), then the same scheme can also deal with the other type of dependence: memory dependence. Clearly, because a memory load writes back at the same point as a register reference instruction, it can't be followed immediately by instructions that try to use the value being fetched.

This hardware solution does not entirely hide the pipeline implementation from the ISA. A clever assembly language programmer can still take advantage of knowledge of the pipeline to improve efficiency through instruction reordering (as can the compiler writer). However, it is no longer possible to produce erroneous results with code that should be correct according to the ISA.

In the terminology of the book, these dependences are called data hazards, and the insertion of no-op bubbles into the pipeline is called stalling the pipeline. Obviously, a stall reduces the efficiency of a pipeline. We can have stalls of length 1, 2, and 3 (ignoring memory loads that take more than one cycle). If P1, P2, and P3 represent the probability that an instruction stalls for this long, respectively, then the average execution time for an instruction in an infinite sequence is

1 + P1 + P2 * 2 + P3 * 3

So, if we have 10% each of these types of stalls in a stream, the average execution time is

1 + 0.1 + 0.1 * 2 + 0.1 * 3 = 1.6

In other words the pipeline is only operating at 62.5% efficiency. Given the inevitability of data dependences, is there anything we can do to improve on this level of efficiency?
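The stall formula generalizes directly to any set of stall lengths. A minimal sketch, with an illustrative function name, where the k-th entry of the list is the probability of a k-cycle stall:

```python
def average_time_with_stalls(stall_probabilities):
    """Average cycles per instruction: 1 + P1 + 2*P2 + 3*P3 + ..."""
    return 1 + sum(length * probability
                   for length, probability in enumerate(stall_probabilities, start=1))


if __name__ == "__main__":
    avg = average_time_with_stalls([0.1, 0.1, 0.1])
    print(round(avg, 3), round(100 / avg, 1))
```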

Forwarding

Notice in the previous example that the result of the Add instruction has been computed, but is simply not yet written into the appropriate register. What if we could detect this and feed the result directly back into the appropriate ALU input for the next instruction (in addition to sending it forward in the pipeline for eventual writeback)? In that case, we wouldn't have to stall the pipeline at all.

This technique is called "forwarding" because the values are effectively sent forward in time to achieve the same effect as if we waited for the writeback. It is also called "bypassing" because the values bypass the later stages and go directly to a point of reuse. Note that, whatever the technique is called, the values still proceed as always to an eventual writeback. It is simply that they are simultaneously made available to the ALU inputs.

Forwarding is implemented by detecting a dependence and then setting a multiplexer at the ALU inputs to read the appropriate value from the ALU output, the Execute/Memory buffer output, or the Memory/Writeback buffer output. The choice of which output to input to the ALU depends on where the match is detected.

What should we do if there are multiple matches?

Multiple matches indicate that a series of writes to the same register are occurring, so we should take the last (most recent) of these writes as input to the ALU.
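The multiplexer choice, including the rule for multiple matches, can be sketched as a priority search. This is an illustrative model, not the actual control logic: the in-flight results are assumed to be listed from the most recent producer to the oldest, and None marks a slot with no destination register.

```python
def select_alu_input(src_reg, register_value, in_flight):
    """Pick the value for one ALU input.

    in_flight is a list of (dest_reg, value) pairs ordered from the most
    recent producer to the oldest, so the first match is the latest write.
    """
    for dest_reg, value in in_flight:
        if dest_reg is not None and dest_reg == src_reg:
            return value  # forward the in-flight result
    return register_value  # no match: use the value read from the register file


if __name__ == "__main__":
    # Two earlier instructions both write $1; the more recent result (99) wins.
    print(select_alu_input(1, 10, [(1, 99), (1, 42)]))
    # No in-flight write to $2, so the register file value is used.
    print(select_alu_input(2, 10, [(1, 99), (None, 0)]))
```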

Load Hazards and Forwarding

There is one situation that we can't significantly improve upon with forwarding. When we have a memory load hazard, the result isn't available until the next-to-last stage. We can forward the value back to the ALU at the same time it is being written to the register. Effectively, all this does is make it appear that the register outputs the value at its input whenever it is being written. In fact, building the registers to do this is easier than actually implementing forwarding for a load.

Thus, while register data hazards can be hidden entirely by forwarding, we cannot avoid every possible data hazard. The pipeline must therefore retain the dependence detection and stall logic together with the forwarding logic. In terms of efficiency, the formula given above is still valid. We have simply found a way to reduce the percentages of instructions that must stall.

Control Hazards

Now that we have looked at data hazards, let's return to the other major form of hazard: the control hazard. Control hazards come from two sources: instructions and exceptions. As we have seen, instructions such as jumps and branches are the bane of any pipeline designer because they are detected after other instructions have entered the pipeline, thus requiring the successive contents of the pipe to be flushed.

Unconditional jumps can be detected early in the pipeline, so it is usually only necessary to cancel the instruction behind the jump, which is easily accomplished by substituting a no-op at the output of the Fetch/Decode buffer until the target instruction is fetched.

The outcome of a branch is not known until the execute stage, at which point two more instructions have entered the pipeline. If they are the next instructions, then they proceed through, but if they were fetched in error, then they have to be cancelled. Let's look at an example.

Move $7 $3

BrZ

Add $1 $2 $3

Sub $4 $5 $1



In the execute stage, the BrZ operation detects that the branch is taken and in the next stage it will load the PC with the target address. The following two instructions are invalid and must be flushed. This can be accomplished by setting their successive inter-stage buffers to effective no-op instructions. One easy way to create a no-op is to write to register 0, so we could simply substitute Mov $0 $0 for these two.
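The flush can be pictured as overwriting every slot behind the branch with a no-op. The sketch below models the pipe as a list ordered from the fetch stage toward the back of the pipe, which is an assumption made purely for illustration, as are the function and constant names.

```python
NOP = "Mov $0 $0"


def flush_after_branch(pipe, branch_position):
    """Replace every instruction younger than the branch with a no-op.

    pipe[0] is the most recently fetched instruction; branch_position is the
    index of the taken branch within the list.
    """
    return [NOP if i < branch_position else instr
            for i, instr in enumerate(pipe)]


if __name__ == "__main__":
    # BrZ is in the execute stage; Add and Sub were fetched behind it.
    pipe = ["Sub $4 $5 $1", "Add $1 $2 $3", "BrZ", "Move $7 $3"]
    print(flush_after_branch(pipe, 2))
```

The instruction ahead of the branch is untouched, since it was fetched before the branch and must still complete.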

Notice that there are data dependences in this stream as well. Those that follow the branch are simply flushed. However, the dependence between the first instruction and the third instruction crosses the branch. Normally, this would cause the result of the Mov to be forwarded to the ALU input, but as soon as the no-op is substituted in the Fetch/Decode buffer for the Add, the dependence vanishes and the forward does not occur.

In a more complex pipeline that divides some of these stages into smaller steps, it is possible that a forward could occur before the branch is detected. As long as the write-back has not occurred, however, there is still time to substitute a no-op result into the appropriate interstage buffer. This is why the write-back is a separate stage that comes at the end of the pipeline, rather than having results be written back as soon as they are available.

Branch Prediction Revisited

Suppose that we want to change the pipe to employ static branch prediction in which it is assumed that the branch is always taken. We would have to place a decoder at the output of the instruction memory that specifically detects a branch, and then sends the PC and the sign-extended and shifted branch address to an adder whose output is sent back to the PC. In other words, we take the branch address generation logic from the single-cycle datapath except that we don't make it conditional on the status output of the ALU. The PC gets its next value from this circuit whenever the op-code indicates a branch.

If it turns out that the branch isn't taken, then the hazard detection logic comes into play and flushes the instructions that were fetched from the target location. Note that the condition that causes the hazard is the inverse of the earlier version of the pipeline (i.e., the hazard occurs when the value is not 0).

There is one other detail that we have to take care of. Can you guess what it is?

We need to restore the original next instruction PC when the branch isn't taken. Thus, the next PC becomes one of the items to be buffered through the stages, and may be routed back to the PC if the branch prediction is in error.
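The bookkeeping for predict-taken can be sketched as two small functions. The address arithmetic here is an assumption (4-byte instructions, with the branch offset counted in words relative to the next instruction, as in MIPS-style ISAs), and the function names are hypothetical.

```python
def predict_taken(pc, word_offset):
    """Return (predicted target, fall-through PC to buffer through the stages)."""
    fall_through = pc + 4                        # the original next-instruction PC
    target = fall_through + (word_offset << 2)   # shifted offset added to it
    return target, fall_through


def recovery_pc(fall_through, actually_taken):
    """PC to reload if the prediction was wrong, or None if it was right."""
    return None if actually_taken else fall_through


if __name__ == "__main__":
    target, saved = predict_taken(100, -3)
    print(target, saved)              # fetch continues from the target
    print(recovery_pc(saved, False))  # branch not taken: reload the saved PC
```

When the branch turns out not to be taken, the instructions fetched from the target are flushed and fetching resumes from the saved fall-through address.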

There are many more complex dynamic branch prediction mechanisms that are in use, many of which involve keeping some history of the behavior of a branch. As you can see from this discussion, any such technique would have to produce a decision in less than a clock cycle so that the target address or the next PC can be loaded into the PC. Most modern microprocessors employ a separate branch prediction unit that is fed any branch instructions early so that the decision can be made in time. But branch prediction is beyond the scope of this course and will be covered in CmpSci 635.

Exceptions and Interrupts

The other source of control hazards is the nonprogrammed jump that occurs when an exception or an interrupt takes place. Interrupts are most easily implemented by inserting a subroutine jump to the interrupt handler in the input stream to the pipeline. We simply multiplex the input to the Fetch/Decode buffer with a subroutine call, so that the PC is saved and the address of the interrupt routine becomes the new PC value.

Exceptions are implemented much like a branch in that they flush any operations that have entered the pipeline after the instruction that caused the exception. Exceptions can occur in different stages. The most common, such as arithmetic errors, occur in the execute stage. However, we could also have a protection violation be detected in the fetch stage, an invalid instruction be detected in the decode stage, or a data memory error be detected in the memory stage.

Wherever the exception occurs, it is important that write backs following the detection point be suppressed so that the cause of the offending condition can be identified. Otherwise we might overwrite a value that contributed to the exception.

One way to implement exceptions is to insert a subroutine jump as we did for the interrupt. The only difference is that we want the return address to point to the instruction that caused the error, which is not the same as the next instruction. We thus have to either calculate the address of the offending instruction (difficult in the presence of jumps) or we have to pass the PC for the instruction along with it through all of the stage buffers. Thus, whatever stage detects the exception sends its copy of the PC to be used as the return address.

In addition to saving the PC, we have to cancel the operations between the offending stage and the start of the pipe by substituting no-ops into all of the inter-stage buffers that are to be flushed. We are then ready to start filling the pipe from the start of the exception handler.

Difficulty of Detecting Hazards and Controlling a Pipeline in a CISC ISA

Throughout this discussion of the pipelined datapath we have had the benefit that the RISC ISA simplifies control by allowing us to carry a minimal amount of information between the interstage buffers. Detection of hazards is straightforward and the single-word, single-cycle nature of instructions results in a pipeline with simple, uniform stages.

In a CISC ISA where individual instructions can occupy multiple stages of the pipeline at once and can require multiple instruction fetches to be fully decoded, it is more difficult to balance the pipeline and to detect hazards. Even when a hazard is detected, it may not be easy to compensate for it. Suppose that a multiword memory to memory move operation is in progress and crosses a protection boundary -- part of the operation has already written back to memory and so the exception handler must deal with the consequences of a partially completed instruction. The memory writes cannot be undone, and the counter that is used by the instruction is an internal register that is hidden from programmer access.

In addition, complex addressing modes can make life difficult for the pipeline designer, requiring many more arithmetic units in order to accelerate address calculation so that it can fit into a single stage. Alternatively, the designer may use multiple stages that are bypassed when simpler addressing modes are employed. Of course, this produces a new kind of hazard in that the simpler instructions cannot be allowed to jump ahead of others that are using the more complex modes.

The bottom line of this discussion is that pipeline design in a CISC ISA is more difficult. For example, the Intel architecture introduced a superscalar pipeline design with the Pentium, several years after much smaller competitors in the RISC ISA camp had brought such designs to market. The CISC ISA of the Intel architecture held it back because of the added complexity of going to this next level of pipelining.

Superscalar Pipelines

The term superscalar is a relatively recent one that actually just refers to a concept that was introduced in supercomputers of the 1960's. Many times in executing a program, calculations take place that are independent of each other (at least for the length of the pipeline). These operations can be executed in parallel if multiple pipelines are built into the machine.

In the Intel Pentium, for example, there are two integer pipelines, one of which also feeds the floating point unit. It is thus possible, when conditions are just right, to have two integer operations and one floating point operation complete in a single cycle. It turns out that this is not a common occurrence in the Intel design (partly because of the way the floating point unit interacts with the integer pipeline that feeds it), but it still happens often enough to provide a significant performance boost.

In RISC architectures, it is now commonplace to have two integer pipelines, a floating point pipe, a load-store pipeline, and a branch unit. Under perfect conditions, such a design can complete five instructions in one cycle.

With so many pipelines active at once, it is difficult to detect and manage dependences between the instructions in the different pipes. It is believed that the amount of gain from this sort of parallelism is probably limited to less than a factor of ten for general applications. However, for certain very regular computations it has been shown that there is much greater parallelism to be exploited. On the other hand, these naturally parallel applications tend also to be easily implemented on more traditional parallel processors, so it is unclear whether the market will support a more significant level of superscalar parallelism.
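To make the dependence problem concrete, here is a minimal Python sketch of the check an issue stage might make before sending two adjacent instructions down separate pipes in the same cycle. The Instr record and the can_issue_together function are hypothetical names. The sketch assumes both instructions read their registers in the same stage, so only read-after-write and write-after-write conflicts matter, and it ignores structural limits such as which pipe can handle which operation.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    dest: Optional[str]      # register written, or None (e.g. a store or branch)
    srcs: Tuple[str, ...]    # registers read

def can_issue_together(first, second):
    """True if 'second' may issue in the same cycle as the earlier 'first'."""
    if first.dest is None:
        return True
    raw = first.dest in second.srcs    # second reads what first writes
    waw = first.dest == second.dest    # both write the same register
    return not (raw or waw)

a = Instr("r1", ("r2", "r3"))
b = Instr("r4", ("r1", "r5"))   # needs r1 from a, so it must wait
c = Instr("r6", ("r2", "r7"))   # independent of a
```

With five pipes the same comparison has to be made between every pair of candidate instructions, which is one reason the checking logic grows quickly as more pipes are added.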

Pipeline Examples

Intel Pentium

This is the first in the Intel architecture family to make real use of pipelines. Previously, a prefetch buffer and an instruction predecode buffer took the place of the early stages of a traditional pipeline. The Pentium uses a 5-stage pipeline that splits in two to achieve superscalar instruction execution. It also ties an additional 3 stages onto one of the branches to handle floating point.

The instruction decoder determines whether instructions can execute in parallel and dispatches them to the appropriate pipe. Floating point instructions must go to the U pipe, which begins the floating point operation. Floating point operations cannot be executed in parallel with any other operation. Thus, the Pentium is superscalar for integer operations only.

Motorola 68060

The latest processor in the Motorola CISC line combines a 4-stage instruction fetch pipeline with a dual (superscalar) 6-stage execution pipeline. One of the two pipelines can perform either integer or floating point operations. The other pipeline is restricted to integer operations. There are no restrictions on executing floating point operations with other operations, so the 68060 is a true superscalar processor.

Motorola 68040

The prior version of the Motorola 68K line, the 68040, is a scalar pipeline with 6 stages, expanding to 7 stages for floating point operations.



Notice that the Motorola designs include a stage simply for calculating the operand address. This stage is needed to handle the complex addressing modes supported by the 68K ISA. It is also interesting to note that Motorola was able to introduce a pipelined processor well before Intel. This is in part due to the uniformity and symmetry of the 68K ISA. Even though it is thoroughly CISC in nature, the 68K instruction set is much simpler to decode and execute than the Intel 80X86 ISA.

MIPS R4000

Unlike the pipeline described for the MIPS in the text, the actual MIPS R4000 processor uses an eight-stage pipeline divided as follows:

IF - Instruction fetch first stage

IS - Instruction fetch second stage

RF - Register fetch

EX - Execute

DF - Data fetch first stage

DS - Data fetch second stage

TC - Tag check

WB - Write back

The instruction and data fetch stages are split because the use of virtual memory in the actual processor delays accesses so that they take 2 cycles. The TC stage is introduced to verify that data has actually been fetched -- it checks the tag information from the cache and the virtual memory address translation unit to see whether the cache fetch and translation were successful. If the value is not in the cache, or the address translation is taking longer than expected, then the TC stage signals that the data has not been fetched and stalls the pipeline until it arrives.
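The timing of the eight-stage pipe can be worked out with a small Python sketch. It assumes an idealized pipe with no hazards, in which instruction i enters IF in cycle i. The stage_at and total_cycles functions are hypothetical helpers, and stall_cycles is simply a lump sum standing in for any cycles lost to TC-stage stalls.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def stage_at(instr_index, cycle, stages=R4000_STAGES):
    """Stage occupied by instruction instr_index in the given cycle (both 0-based).

    Returns None if the instruction has not yet entered, or has already left, the pipe.
    """
    position = cycle - instr_index
    if 0 <= position < len(stages):
        return stages[position]
    return None

def total_cycles(num_instructions, stall_cycles=0, stages=R4000_STAGES):
    """Cycles to complete a run of instructions: fill the pipe once, then one per cycle."""
    if num_instructions == 0:
        return 0
    return len(stages) + (num_instructions - 1) + stall_cycles
```

Under these assumptions ten instructions take 8 + 9 = 17 cycles, and each cycle spent stalled in TC adds one more to the total.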

The R4000 has a matching pipeline for floating point operations that operates independently. In addition, the floating point unit has three separate units that can operate simultaneously: Add/Root, Multiply, Divide. Thus, the pipeline dispatches operations to these units (which may take different amounts of time to complete) at the execute stage. The adder and multiplier are also pipelined so that one new operation can begin while a current operation is completing.

DEC Alpha AXP

DEC's RISC processor architecture employs a pair of very deep pipelines (that is, they have many stages) that classify it as a "superpipelined superscalar" processor. The integer pipeline in the Alpha is 7 stages long and the floating point pipeline is 10 stages in length. These short stages allow the Alpha to execute with clock rates of 300 MHz, which is approaching the clock rate of current Cray supercomputers.

The integer pipeline's seven stages are as follows. The first four stages not only fetch the instruction, but they check for dependences between instructions. These stages can be stalled, but after that the pipeline proceeds without interruption.

Cache Access

Swap predict

Decode

Issue / Register File Read

ALU First Stage / PC Generate / Virtual Address Generate (depending on instruction type).

ALU Second Stage / Instr. Address Translation / Data Address Translation

Write back / Instr. Cache Tag Check / Data Cache Tag Check

The Alpha floating point pipeline has essentially the same first four stages as the integer pipe. The next five stages are simply called F1 - F5 and the last stage is floating write back (FWR). Part of the reason for the uninformative names of the floating point stages is that they do not correspond to specific operations. Rather, the five stages are responsible for just four major processing steps that are split up so as to fit the five cycles. This is a good example of a design in which the pipeline's workload has been redistributed to balance the timing of the stages. The four operations that are performed depend on the type of floating point operation:

Exponent Difference / Multiply by 3

Leading 1 Detection / Multiply Part 1

Shift Alignment / Multiply Part 2

Addition and Rounding

The multiplication in the Alpha is done with yet another unconventional technique called the Radix-8 method. It is a pipelined circuit and one of the inputs that it requires is 3 times the multiplicand.
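To see why 3 times the multiplicand is needed, consider the textbook radix-8 Booth recoding, sketched below in Python. This is an assumption about the general technique, not a description of the Alpha's actual circuit. The multiplier is consumed three bits at a time and each group becomes a digit between -4 and 4. The multiples 1, 2, and 4 of the multiplicand are just shifts, but 3 times the multiplicand requires a real addition, so it is computed once ahead of time. The function names are hypothetical, and the sketch handles non-negative multipliers only.

```python
def booth_radix8_digits(multiplier):
    """Recode a non-negative multiplier into digits in -4..4, least significant first."""
    if multiplier < 0:
        raise ValueError("this sketch handles non-negative multipliers only")
    # Use enough 3-bit groups to leave at least one leading zero bit.
    groups = (multiplier.bit_length() + 3) // 3
    digits = []
    previous_top = 0   # top bit of the previous group (0 for the first group)
    for g in range(groups):
        b0 = (multiplier >> (3 * g)) & 1
        b1 = (multiplier >> (3 * g + 1)) & 1
        b2 = (multiplier >> (3 * g + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + previous_top)
        previous_top = b2
    return digits

def multiply_radix8(multiplicand, multiplier):
    """Multiply by summing one partial product per radix-8 digit."""
    # 3M is the only multiple that cannot be formed by shifting alone.
    three_m = multiplicand + (multiplicand << 1)
    multiples = [0, multiplicand, multiplicand << 1, three_m, multiplicand << 2]
    product = 0
    for position, digit in enumerate(booth_radix8_digits(multiplier)):
        partial = multiples[abs(digit)]
        if digit < 0:
            partial = -partial
        product += partial << (3 * position)
    return product
```

Each digit retires three multiplier bits, so only about a third as many partial products are added as in a bit-at-a-time multiplier.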

The Alpha can process two instructions at once, within the following constraints:

Load or store can execute in parallel with any non-memory operation

Integer and floating point operations can execute in parallel

Floating point operations and floating point branches can execute in parallel

Integer operations and integer branches can execute in parallel
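The four pairing constraints listed above can be written as a small lookup in Python. This is a sketch of those rules only: the instruction kinds ("load", "store", "int", "fp", "int_branch", "fp_branch") are labels invented for the example, and the check ignores register dependences between the two instructions, which a real issue stage would also have to examine.

```python
MEMORY = {"load", "store"}

# Pairs of non-memory kinds that the list above allows to issue together.
ALLOWED_PAIRS = {
    frozenset({"int", "fp"}),
    frozenset({"fp", "fp_branch"}),
    frozenset({"int", "int_branch"}),
}

def alpha_can_pair(kind_a, kind_b):
    """True if the two instruction kinds may issue in the same cycle."""
    a_is_memory = kind_a in MEMORY
    b_is_memory = kind_b in MEMORY
    if a_is_memory or b_is_memory:
        # A load or store pairs with any non-memory operation, but not with another one.
        return a_is_memory != b_is_memory
    return frozenset({kind_a, kind_b}) in ALLOWED_PAIRS
```

Two operations of the same kind never pair under these rules, since a set built from two equal kinds has only one member and matches none of the allowed pairs.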

PowerPC 601

The first processor in the Motorola/IBM/Apple consortium's line of RISC microprocessors has three distinct pipelines, one of which serves two functions. One pipeline is dedicated to identifying and dynamically predicting branches, another handles load/store operations in five stages, or integer operations in four stages. The third is the floating-point pipeline, and occupies six stages. These pipelines are fed by a prefetch unit that partially decodes the instructions, identifies dependences, and issues them to the appropriate pipeline. The 601 is thus a 3-way superscalar issue processor. The pipelines are organized as follows:



One of the interesting aspects of the PowerPC 601 is that it can reorder instructions within a window of four instruction slots so as to take advantage of the availability of functional units in the pipelines. It thus has to check the dependences between instructions within this window and ensure that the results emerge and are eventually stored into the registers in the proper order. To accomplish this reordering, it has extra "temporary" registers that are hidden from the user and the ability to rename the registers so that the results appear in the proper places at the proper times. It is interesting to note that Intel will only be able to introduce this feature to a limited degree in the P6 successor to the Pentium.
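The idea behind renaming can be illustrated with a short Python sketch. It is not the 601's actual mechanism: the rename_registers function, the temporary names t0, t1, ..., and the unlimited supply of temporaries are all assumptions made for the example. Each result is given a fresh hidden register, and later instructions that read that result are pointed at the new name.

```python
def rename_registers(instructions):
    """Give every result a fresh temporary register.

    instructions: list of (dest, srcs) pairs using programmer-visible names.
    Returns the same list with hidden temporaries substituted.
    """
    current_name = {}   # visible register -> temporary holding its latest value
    renamed = []
    for count, (dest, srcs) in enumerate(instructions):
        # Sources read whichever temporary most recently stood in for the register.
        new_srcs = tuple(current_name.get(s, s) for s in srcs)
        temp = "t%d" % count
        current_name[dest] = temp
        renamed.append((temp, new_srcs))
    return renamed

program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),
    ("r1", ("r6", "r7")),   # reuses r1: conflicts with the first two instructions
    ("r8", ("r1", "r4")),
]
renamed = rename_registers(program)
```

After renaming, the third instruction writes a different temporary than the first, so it no longer has to wait for the second instruction to read the old r1. Only the true dependences remain. A real design must also copy the temporaries back to the visible registers in program order, which this sketch leaves out.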

PowerPC 604

The successor to the 601 is a 4-way superscalar issue processor. Its four pipelines are arranged as follows:



Note that the integer pipeline is now six stages deep and is separate from the load/store pipeline. The load/store pipeline handles eight different stages of processing in seven pipeline cycles -- again, an example of finely dividing the work between adjacent stages in order to balance their timing.



© Copyright 1995, 1996 Charles C. Weems Jr. All rights reserved.

