In the previous chapter we saw that a datapath can be divided into multiple
cycles. The advantage of this division is that instructions that do not
use the entire datapath can complete in fewer cycles. In addition, the clock
rate can be increased. However, each instruction must wait for the one ahead
of it to complete before it can begin to execute. This approach could be
likened to a washing machine: different types of loads require different
sequences of steps, but each load of clothes must be completely washed and
removed before we can put in another load.
Suppose we wanted to be able to wash more clothes in a given time. Notice
that, at any given time, the machine could be doing one of four different
steps. Why can't it do these all at once for multiple loads?
Because it can only hold one load at a time.
So how can we enable it to perform multiple steps at once?
If we have separate units for each of the steps, we can transfer the clothes
from one step to the next and start a new load of clothes right away.

What is it about this change that enables us to run multiple loads at once?
The steps are independent, and they each can hold a load of clothes.
In a datapath, the same basic idea can be used to process multiple instructions
at once. It requires that the hardware for each of the steps be independent
-- able to operate separately from the hardware of the other steps -- and
that it have storage for intermediate state information.
There is one other aspect of this approach that isn't obvious at first.
The loads of clothes, and the instructions, can only start if they are independent
of each other. For example, if you know you need to run a sweatshirt through
twice to get rid of a stain, you can't start the second load until the first
one is completed. To make this approach worthwhile, you also need to have
enough loads of wash to keep all of the units busy. Fortunately, in a datapath,
there are usually plenty of instructions, and we take different (independent)
instructions from memory in a simple sequential order.


Notice that the first example requires 10 cycles to complete while the second
example takes only 6 cycles. This is a 40 percent reduction in execution time
(a speedup of about 1.67). Notice also that if more and more instructions were
to be processed, the time each instruction spends in the datapath remains the
average number of steps in an instruction (in this case, something between 3
and 4), but the rate at which instructions are completed approaches one per
cycle (roughly a 67 to 75 percent reduction in the time between completions).
Thus, instructions flow through the datapath in a steady stream, something
like fluid flowing through a pipe. Hence this style of datapath is called
a pipeline.
When we pour fluid into a pipe, it takes some time (proportional to the
length of the pipe) for the first drops to emerge from the other end. But
if we pour steadily, then fluid continues to emerge from the pipe at whatever
rate it enters.
For an infinite sequence of instructions that keep the pipeline full,
the rate at which instructions are executed is one per cycle, where a cycle
is a pipeline stage time. Thus, if we were to compare the instruction execution
rate of a pipelined vs. a non-pipelined version of the same datapath on
such an instruction stream, the pipelined processor would execute at N times
the rate of the non-pipelined processor, where N is the number of stages
in the pipe.
For a finite sequence of S instructions, the total execution time is
N + S - 1 cycles because the first instruction takes N cycles and the remaining
S - 1 instructions each take another cycle to be completed. The average time
per instruction is therefore (N + S - 1)/S cycles. A non-pipelined datapath
needs SN cycles for the same sequence, so the speedup is SN/(N + S - 1). As
you can see, as S approaches infinity, the execution time approaches 1 cycle
per instruction and the speedup approaches N. Unfortunately,
we can't keep a pipeline full forever.
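The finite-stream relationship can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative, and it assumes (as in the comparison above) that the non-pipelined datapath spends N cycles on every instruction.

```python
def pipelined_cycles(n_stages, n_instr):
    """Cycles to finish n_instr instructions: N to fill the pipe, then one each."""
    return n_stages + n_instr - 1


def speedup(n_stages, n_instr):
    """Speedup over a non-pipelined datapath that needs N cycles per instruction."""
    return (n_instr * n_stages) / pipelined_cycles(n_stages, n_instr)


# Speedup of a 5-stage pipe for increasingly long instruction streams
for s in (1, 10, 100, 10_000):
    print(s, pipelined_cycles(5, s), round(speedup(5, s), 3))
```

For a single instruction there is no gain at all, and the speedup only creeps toward N as the stream gets long.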
If we stop pouring fluid for a while, then air enters the pipe and a bubble
forms in the fluid flow. If we start pouring again, the other end of the
pipe will have a pause in its output that equals the length of the pause
in the pouring.
The pipelined datapath in a computer behaves the same way -- it shows the
same initial delay in emitting the first result, and any gaps in the input
produce corresponding gaps in the output that are referred to as "bubbles".
What might cause bubbles?
Gaps in the instruction stream due to memory delays.
Jumps -- notice that the jump occurs at the end of an instruction, but it
invalidates the fetches that went on after it was fetched. The jump operation
is not an independent instruction -- the instruction that follows it depends
on its execution.
Jumps account for a large fraction of the inefficiency seen in actual pipelines.
In some programs, especially those written in object oriented languages,
jumps account for 25% to 30% of the instructions executed. How does this
affect the efficiency of a pipeline?
When a jump is executed, it takes some number of cycles for the control
unit to detect that the instruction is a jump. With an unconditional jump,
it can immediately switch to fetching from the target location, and the
gap is just one or two pipeline stages (whatever stage the decode occurs
plus one more to transfer the target address to the PC and simultaneously
start the next fetch).
With a conditional jump (or branch), the number of cycles before it is detected
is typically greater. Even though the control unit can identify the instruction
as a branch, it can't tell whether it will be taken until some computation
is performed. Thus, the delay can be almost the full length of the pipeline.
For example, in a 5-stage pipeline, a branch might not be determined until
the fourth stage. Thus, the three stages behind it in the pipeline are invalid
and another instruction must enter the pipe at that point. The three invalid
stages represent three instructions that were partially executed on the
speculation that the branch would not be taken. They also represent three
wasted pipeline cycles that are called the jump penalty.

If the jump penalty in a pipeline is P, the probability that an instruction
is a jump is J, and the probability that the jump is taken is T, then the
average instruction time (in cycles) for the pipeline is 1 + PJT. The efficiency
is the inverse of this quantity: 1/(1 + PJT).
The probabilities in this formula are dependent on the specific program
being executed. Tight loops and frequent subroutine calls increase the probabilities
while long stretches of straight-line code decrease the probabilities.
Let's look at some examples to see the magnitude of this effect. Suppose
we have a pipe as above, and a code has 30% branches and 25% of them are
taken. What is the average instruction time and the efficiency?
Average time = 1 + 3 x 0.3 x 0.25 = 1.225
Efficiency = 1/average time = 81.6%
Speedup = Efficiency x Number of stages = 4.08
If the jumps are taken 50% of the time
Average time = 1 + 3 x 0.3 x 0.5 = 1.45
Efficiency = 1/average time = 68.9%
Speedup = Efficiency x Number of stages = 3.45
So, for fairly conservative scenarios, we see reductions in performance
of nearly a third. Suppose the probability that a jump is taken rises to 70%.
Average time = 1 + 3 x 0.3 x 0.7 = 1.63
Efficiency = 1/average time = 61.3%
Speedup = Efficiency x Number of stages = 3.07
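What simple change could we make in this case to improve the efficiency?
Always assume that a branch is taken. The penalty is then paid only by the 30%
of branches that are not taken, so the average time is 1 + 3 x 0.3 x 0.3 = 1.27
and the efficiency rises to 78.7% (assuming that a correctly predicted taken
branch costs no extra cycles).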
Making such an assumption (one way or the other) is called static branch
prediction. We are using statistical analysis of our programs to make a design
decision regarding the treatment of branches in order to reduce the loss
of efficiency due to branches. Of course, if our sample set of programs
isn't representative, then our static strategy fails. On the other hand,
once we make such a decision, we can often have a compiler provide some
help.
Let's look at what happens if the branch detection point changes. What would
you expect to be the effect if the detection point occurs later? Let's say
it happens in stage 6 of an 8-stage pipe, and we have 30% branches with
a 50% probability of being taken.
Average time = 1 + 6 x 0.3 x 0.5 = 1.90
Efficiency = 1/average time = 52.6%
Speedup = Efficiency x Number of stages = 4.21
As we could guess, moving the detection point later increases the penalty
and decreases the efficiency. With a penalty of 3, the efficiency is 68.9%
but it drops to 52.6% when the penalty is 6. Note however that, because
we increased the length of the pipeline, the speedup still increases. If
the penalty is reduced to 2 in a 5-stage pipe, then
Average time = 1 + 2 x 0.3 x 0.5 = 1.30
Efficiency = 1/average time = 76.9%
Speedup = Efficiency x Number of stages = 3.85
So earlier detection is better, as we would expect.
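The worked examples above all apply the same three formulas, so they can be reproduced with a few lines of Python. This is only a sketch: the function names are illustrative, and the printed values are rounded, so the last digit can differ from a figure that was truncated by hand.

```python
def average_time(penalty, p_jump, p_taken):
    """Average cycles per instruction for an infinite stream: 1 + P*J*T."""
    return 1 + penalty * p_jump * p_taken


def efficiency(penalty, p_jump, p_taken):
    """Fraction of the ideal one-instruction-per-cycle rate that is achieved."""
    return 1 / average_time(penalty, p_jump, p_taken)


def speedup(stages, penalty, p_jump, p_taken):
    """Speedup over a non-pipelined datapath with the same number of stages."""
    return stages * efficiency(penalty, p_jump, p_taken)


# (stages, penalty, fraction of branches, fraction taken) for the cases above
CASES = [(5, 3, 0.3, 0.25), (5, 3, 0.3, 0.5), (5, 3, 0.3, 0.7),
         (8, 6, 0.3, 0.5), (5, 2, 0.3, 0.5)]

for stages, p, j, t in CASES:
    print(f"{stages}-stage, P={p}, T={t}: "
          f"time={average_time(p, j, t):.3f}, "
          f"efficiency={efficiency(p, j, t):.1%}, "
          f"speedup={speedup(stages, p, j, t):.2f}")
```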
Several techniques can be used to improve the efficiency of a pipeline in
the presence of jumps. One is to tag jumps as being likely to be taken or
not. For example, in a FOR loop, the jump that returns to the top of the
loop could be tagged as likely to be taken. That is, a bit in the opcode
could be reserved to indicate the likelihood of the jump being taken. This
allows us to move the detection point up to the decode stage (perhaps reducing
the penalty to one). Of course, when we guess wrong (the loop does eventually
exit) then we pay a larger penalty, but hopefully we do so infrequently.
With a true FOR loop, we can even use a special counter that signals when
the loop is about to exit, and thus we pay no penalty at all for the jump
at the end of the loop. Unfortunately, C and C++ do not contain true FOR
loops (they can have additional conditions) so this applies mostly to Pascal,
Ada and Fortran programs.
Another approach that is used is to tag instructions as being independent
of the branch and move them after the branch. For example, if we have some
calculation to perform prior to the branch, but the branch does not depend
on its outcome, we can move some instructions to follow the branch and tag
them as not having to be discarded if the branch is taken. Thus, the penalty
gap can be filled with useful work no matter what the branch does.
These are all static methods as they rely on analysis of the code prior
to execution. At execution time we can also use the past behavior of a branch
to help predict its future actions. Most individual branches exhibit a behavior
in which there is a typical action and an exceptional case. Thus, once a
branch is executed, we can guess that it will behave the same way again.
A processor can be built with a table that stores the locations and actions
of branches as it encounters them (called a branch history table). When
a branch is fetched, its address causes the history table to recall its
prior action and the control unit can assume that it will behave the same
way and take the appropriate action.
Of course, if the branch takes the exceptional action, then the full penalty
must be paid. In addition, the first time a branch is encountered, the
full branch penalty must be taken. This approach is called dynamic branch
prediction because it takes place at run time.
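The idea of "guess that the branch does what it did last time" can be sketched in Python as follows. This is a simplified model, not a description of any real processor: the class and function names are made up for illustration, the table is an unbounded dictionary (a hardware table would have a fixed number of entries), and a first encounter is counted as a miss, as described above.

```python
class BranchHistoryTable:
    """Remembers the most recent outcome of each branch, keyed by its address."""

    def __init__(self):
        self.last_taken = {}  # branch address -> True (taken) / False (not taken)

    def predict(self, address):
        # None means "never seen before", so no prediction can be made
        return self.last_taken.get(address)

    def update(self, address, taken):
        self.last_taken[address] = taken


def count_mispredictions(outcomes, address=0x40):
    """Run one branch through the table and count how often the guess is wrong."""
    table = BranchHistoryTable()
    misses = 0
    for taken in outcomes:
        if table.predict(address) != taken:
            misses += 1  # wrong guess, or first encounter: full penalty
        table.update(address, taken)
    return misses


# A loop branch that is taken nine times and then falls through
print(count_mispredictions([True] * 9 + [False]))
```

The loop branch pays the penalty on its first execution and on the exit; a branch that alternates on every execution would be mispredicted every time, which is why this simple scheme relies on branches having a typical action.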
By combining static and dynamic methods, it is quite possible to achieve
a 90% prediction rate, so that the branch penalty is paid on just 10% of
branches (that is, T is effectively 0.1). With our previous example, we have:
Average time = 1 + 3 x 0.3 x 0.1 = 1.09
Efficiency = 1/average time = 91.7%
So, for a 5-stage pipeline, we manage to get effectively a speedup of 4.6
over a non-pipelined system for an infinite instruction stream. Note that
this is better performance from a 5-stage pipe than we got from our example
of an 8-stage pipe with less aggressive prediction.
Because all of the stages of a pipeline advance in lock-step, the basic
cycle time of a pipe is determined by the slowest stage. There is thus a
significant incentive for the designer to equalize the time through all
of the stages. Otherwise, speed in some stages is wasted. As we've seen,
speed is often related to power consumption. Thus, power may be wasted in
an imbalanced pipeline.
There are several ways that we could go about balancing the timing of individual
stages. Can you think of any?
Use faster or slower components as needed.
Shift some functionality between stages.
Split long stages into two shorter stages.
Combine two fast stages into one slower stage.
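To make the cost of imbalance concrete, here is a small Python sketch. The stage delays are hypothetical numbers chosen only for illustration; the point is that the cycle time is the maximum stage delay, and every faster stage sits idle for the difference.

```python
def cycle_time(stage_delays):
    """All stages advance in lock-step, so the slowest stage sets the cycle."""
    return max(stage_delays)


def idle_time_per_cycle(stage_delays):
    """Total time per cycle that faster stages spend waiting on the slowest."""
    t = cycle_time(stage_delays)
    return sum(t - d for d in stage_delays)


unbalanced = [2, 1, 4, 2, 1]      # hypothetical delays in ns
after_split = [2, 1, 2, 2, 2, 1]  # the 4 ns stage split into two 2 ns stages

for name, delays in (("unbalanced", unbalanced), ("after split", after_split)):
    print(name, "cycle:", cycle_time(delays), "ns,",
          "idle:", idle_time_per_cycle(delays), "ns")
```

Splitting the long stage halves the cycle time in this example, at the cost of one more stage in the pipe.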
If, instead of having the stages operate in lockstep, we have them pass
data to each other as they finish with it and the successor unit is available,
then we have a different kind of system called a "dataflow" processor.
Dataflow designs can have pipelines with nonlinear topologies (that is,
branches), and they are perhaps most commonly employed in signal processing,
where a long and complex calculation must be performed on a stream of data.
The calculation is broken into stages and these are activated on a demand
basis as data enters the system.
Ironically, when we converted the single-cycle datapath into a multicycle
datapath we made it more difficult to implement as a pipelined datapath
in at least one way. What way is that?
We reduced the independence of the stages by reusing the ALU in different
cycles.
Even though a pipelined datapath's stages correspond more to the cycles
in a multicycle datapath, it prohibits the reuse of elements in different
stages because every stage (cycle) is active at the same time. So, in order
to build a pipelined version of the datapath, we go back to the single-cycle
design and add memory elements at the points corresponding to the breaks
between cycles. Thus, we might consider the pipelined datapath to be a combination
of the two earlier designs that is greater than the sum of its parts.

The purpose of the memory elements is to preserve the state of the previous
stage for input to the next stage, thereby enabling the previous stage to
start work on the next instruction. This is conceptually similar to the
approach taken in the Master-Slave flip-flop to preserve the output of the
master stage so that the master can start sampling the input signal again.
Notice that this datapath again has the separate adders for the PC increment
and the jump address addition. Although it might appear at first that this
is a 4-stage pipe it is actually 5 stages:
Instruction fetch
Instruction decode
Execute
Memory access
Write back
To make the diagram more representative of the 5 stages, we might repeat
the register box at the right side, because the registers are actually capable
of both reading and writing at once. Thus, they could be considered as two
independent units. Another way to recognize this as a five-stage pipeline
is to notice that four memory buffers are needed to separate five stages.
Another aspect of this design that isn't obvious at first is that the Destination
Register value that emerges from the instruction fetch stage must be buffered
through the rest of the stages. Why? Because it determines the place where
this instruction's result will be written back to, and that result isn't
ready until the write back stage. Thus, the register number must be saved
until the instruction has passed through the remaining stages of the pipe.
Let's follow an example through the pipeline. Suppose we want to execute
Load B
Load C
Load D
Load E
Add B, C
Sub C, D
Brz E










Notice how each instruction reaches the write-back stage before its result
is needed. The instruction sequence was carefully selected to ensure that
there were no dependences between any of the instructions. Later on we
will examine the effects of instruction sequences where a computation depends
on a value that isn't available yet.
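The progress of this sequence through the five stages can be tabulated with a short Python sketch. It assumes an ideal pipeline in which every instruction advances one stage per cycle; it ignores hazards entirely and only shows the timing. The stage abbreviations and function name are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = ["Load B", "Load C", "Load D", "Load E",
           "Add B, C", "Sub C, D", "Brz E"]


def occupancy(program, stages=STAGES):
    """Return one row per cycle listing the instruction held in each stage."""
    total_cycles = len(stages) + len(program) - 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for s in range(len(stages)):
            i = cycle - s  # instruction i occupies stage s during cycle i + s
            row.append(program[i] if 0 <= i < len(program) else "-")
        rows.append(row)
    return rows


print("    " + "  ".join(f"{name:<9}" for name in STAGES))
for cycle, row in enumerate(occupancy(PROGRAM), start=1):
    print(f"{cycle:>2}  " + "  ".join(f"{cell:<9}" for cell in row))
```

Seven instructions in a five-stage pipe occupy 11 cycles in total, which matches the N + S - 1 count for a finite stream.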
In a non-pipelined datapath, we saw that control signals are generated
from the instruction by a central FSM and then distributed to all parts
of the datapath at once. In a pipelined datapath, however, there are multiple
instructions active at once. Thus, the control unit must be pipelined as
well.
In effect, the control unit generates the signals that would direct the
actions of the single-cycle processor, but instead of being distributed,
these signals are buffered along with the intermediate results between the
stages. That is, we take a snapshot of the signals being emitted by the
control unit and pass the snapshot through the pipeline. As these frozen
control signals are transferred from stage to stage, some of them are connected
from the output of the interstage buffer to the components they control.
These signals are generally not forwarded to the next interstage buffer.
Only the control signals that are needed in subsequent stages are passed
on.

Notice that only the stages after decoding require pipelining. The control
unit's signals to the Fetch and Decode stages are independent of the instruction.
It is only the instruction-dependent control signals that need to be captured
and passed from stage to stage.
If we look at the datapath, we can see which of the signals fall into each
of these groups simply by looking at the stage that the signal is needed
in. For example, the execute stage needs signals to control the ALU function
and the two multiplexers that are in that stage. In the memory stage, memory
read and write control are needed and so is the signal that indicates whether
this instruction is a branch on zero. In the write back stage, we have to
select the appropriate mux input and send the register write control signal.
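The way the control snapshot shrinks as it moves down the pipe can be sketched in Python. The signal names below are hypothetical labels for the signals just described (an ALU function and two mux selects for execute, and so on); they are not taken from the datapath diagram.

```python
# Hypothetical signal names, grouped by the stage that consumes them
SIGNALS_BY_STAGE = {
    "EX": ["alu_op", "alu_src", "reg_dst"],
    "MEM": ["mem_read", "mem_write", "branch_on_zero"],
    "WB": ["mem_to_reg", "reg_write"],
}
ORDER = ["EX", "MEM", "WB"]


def advance(bundle, stage):
    """Split a control bundle into the signals used here and those passed on."""
    used = {name: bundle[name] for name in SIGNALS_BY_STAGE[stage]}
    passed_on = {name: value for name, value in bundle.items()
                 if name not in used}
    return used, passed_on


# Snapshot emitted by the control unit for a load (values are illustrative)
load_controls = {
    "alu_op": "add", "alu_src": 1, "reg_dst": 0,
    "mem_read": 1, "mem_write": 0, "branch_on_zero": 0,
    "mem_to_reg": 1, "reg_write": 1,
}

bundle = load_controls
for stage in ORDER:
    used, bundle = advance(bundle, stage)
    print(stage, "uses", sorted(used), "and passes on", sorted(bundle))
```

Each interstage buffer therefore needs to be only as wide as the signals still waiting to be used.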
Thus far we have looked at the operation of a pipeline when the instructions
are independent of each other. But what if they depend on each other? The
answer is that the pipeline becomes much more complex. How can instructions
depend on one another? As with most aspects of computers, there are two
forms that dependences can take: data and control. First we'll look at data
dependences.
Data can enter the pipeline from two places: memory or registers. Thus,
not surprisingly, data dependences can be split into two forms, depending
on their source. Register dependences take the form of one instruction needing
a value in a register that has not yet been put there by another instruction
that has gone before it in the pipeline. To see how this can happen, consider
the following sequence:
Add $1, $2, $3 -- Add registers 2 and 3, storing result in 1
Sub $4, $5, $1 -- Subtract register 1 from register 5, storing result in 4
Looking at the following diagram, we can see that once the second instruction
is fetched, it is trying to read the value in register $1 to be passed to
the ALU during its next stage. The trouble is, this register does not yet
contain the result of the first instruction because that instruction has
not yet reached the write back stage. In fact it won't reach its write-back
until two more stages have elapsed.
One solution to this problem is to have the compiler rearrange instructions
so that dependent operations are always separated by two independent operations.
In some cases the compiler can find other useful instructions to insert
between the dependent ones. For example, if we have a series of assignment
statements that are independent of each other but that have internally dependent
operations, we might interleave the assembly language for their calculations
to achieve the necessary separation.

In the worst case, the compiler inserts no-op instructions that fill the
necessary pipeline stages. Obviously, this is a waste of potentially productive
cycles. However, it has not required us to make any changes to the pipeline
design, which keeps it very simple. The other unfortunate side effect, however,
is that it requires the assembly language programmer to view the ISA differently.
Thus, assembly programs must reflect the timing of the pipeline. If, in
the future, we change the design of the pipeline, these assembly language
programs might no longer be valid (or at least not efficient). The implementation
of the pipeline has become visible at the ISA level, which is precisely
what we hope to avoid by specifying an ISA.
So, how might we solve the problem in hardware? We could detect the dependence
and insert no-op instructions into the pipe ourselves. How do we detect
the dependence?
Compare the SrcReg1 and SrcReg2 values to the DestReg values in the next
three buffer stages. If any of them are equal, then substitute a no-op at
the outputs of the Fetch/Decode buffer.
There is one more bit of housekeeping that has to be done to make this scheme
work. What is it?
We have to stop the flow of instructions into the pipe -- to do this we
inhibit the increment of the PC and the Read signal to the instruction memory.
Once the instruction with the dependent register reference has reached Write
Back, the output of the comparator shows that it is safe to proceed, and
the instruction in the Fetch/Decode buffer is released and the PC is allowed
to increment and fetch the next instruction.
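The comparison and the resulting stall can be modelled with a short Python sketch. It is a simplification under stated assumptions: each of the three later buffers is represented only by its destination register number, None stands for a no-op (or any instruction that writes no register), and the function names are illustrative.

```python
def must_stall(src_regs, dest_regs_ahead):
    """True if a source register matches a destination still in the pipe."""
    pending = {d for d in dest_regs_ahead if d is not None}
    return any(s in pending for s in src_regs)


def stall_cycles(src_regs, dest_regs_ahead):
    """Count the no-ops inserted before the waiting instruction may proceed.

    dest_regs_ahead lists the DestReg fields of the three buffers after
    Fetch/Decode, nearest first.
    """
    ahead = list(dest_regs_ahead)
    stalls = 0
    while must_stall(src_regs, ahead):
        # A no-op enters the nearest buffer; older instructions move along
        # and the oldest one completes its write back.
        ahead = [None] + ahead[:-1]
        stalls += 1
    return stalls


# Sub $4, $5, $1 waiting directly behind Add $1, $2, $3
print(stall_cycles([5, 1], [1, None, None]))
```

In this model the stall length depends on how far ahead the producing instruction is: three cycles when it is immediately ahead, one when it is about to write back, and none when no register matches.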
If we build our comparator to also recognize the different instruction formats
(register vs. memory), then the same scheme can also deal with the other
type of dependence: memory dependence. Clearly, because a memory load writes
back at the same point as a register reference instruction, it can't be
followed immediately by instructions that try to use the value being fetched.
This hardware solution does not entirely hide the pipeline implementation
from the ISA. A clever assembly language programmer can still take advantage
of knowledge of the pipeline to improve efficiency through instruction reordering
(as can the compiler writer). However, it is no longer possible to produce
erroneous results with code that should be correct according to the ISA.
In the terminology of the book, these dependences are called data hazards,
and the insertion of no-op bubbles into the pipeline is called stalling
the pipeline. Obviously, a stall reduces the efficiency of a pipeline. We
can have stalls of length 1, 2, and 3 (ignoring memory loads that take more
than one cycle). If P1, P2, and P3 represent the probability that an instruction
stalls for this long, respectively, then the average execution time for
an instruction in an infinite sequence is
1 + P1 + P2 * 2 + P3 * 3
So, if we have 10% each of these types of stalls in a stream, the average
execution time is
1 + 0.1 + 0.1 * 2 + 0.1 * 3 = 1.6
In other words the pipeline is only operating at 62.5% efficiency. Given
the inevitability of data dependences, is there anything we can do to improve
on this level of efficiency?
Notice in the previous example that the result of the Add instruction
has been computed, but is simply not yet written into the appropriate register.
What if we could detect this and feed the result directly back into the
appropriate ALU input for the next instruction (in addition to sending it
forward in the pipeline for eventual writeback)? In that case, we wouldn't
have to stall the pipeline at all.
This technique is called "forwarding" because the values are effectively
sent forward in time to achieve the same effect as if we waited for the
writeback. It is also called "bypassing" because the values bypass
the later stages and go directly to a point of reuse. Note that, whatever
the technique is called, the values still proceed as always to an eventual
writeback. It is simply that they are simultaneously made available to the
ALU inputs.
Forwarding is implemented by detecting a dependence and then setting a multiplexer
at the ALU inputs to read the appropriate value from the ALU output, the
Execute/Memory buffer output, or the Memory/Writeback buffer output. The
choice of which output to input to the ALU depends on where the match is
detected.
What should we do if there are multiple matches?
Multiple matches indicate that a series of writes to the same register are
occurring, so we should take the last of these writes as input to the ALU.
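The selection made by the forwarding multiplexer can be sketched in Python. This is an illustrative model only: the three candidate sources are passed as (destination register, value) pairs ordered from the most recent instruction to the oldest, and the function name is made up.

```python
def select_operand(src_reg, reg_file_value, stage_outputs):
    """Pick the value an ALU input should see for source register src_reg.

    stage_outputs holds (dest_reg, value) for the ALU output, the
    Execute/Memory buffer, and the Memory/Writeback buffer, in that order,
    so the first match is the most recent write to the register.
    """
    for dest_reg, value in stage_outputs:
        if dest_reg is not None and dest_reg == src_reg:
            return value  # forward instead of reading the stale register
    return reg_file_value  # no match: the register file is up to date


# Two pending writes to register 1: the newer one (30) wins
print(select_operand(1, 7, [(1, 30), (1, 20), (None, 0)]))
```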
There is one situation that we can't significantly improve upon with
forwarding. When we have a memory load hazard, the result isn't available
until the next-to-last stage. We can forward the value back to the ALU at
the same time it is being written to the register. Effectively, all this
does is make it appear that the register outputs the value at its input
whenever it is being written. In fact, building the registers to do this
is easier than actually implementing forwarding for a load.
Thus, while register data hazards can be hidden entirely by forwarding,
we cannot avoid every possible data hazard. The pipeline must therefore
retain the dependence detection and stall logic together with the forwarding
logic. In terms of efficiency, the formula given above is still valid. We
have simply found a way to reduce the percentages of instructions that must
stall.
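The stall formula given earlier can be evaluated for both situations with a few lines of Python. The first set of probabilities is the 10% / 10% / 10% case from the text. The second set is purely hypothetical: it assumes that forwarding leaves only 5% of instructions with a one-cycle stall, a number chosen for illustration rather than measured.

```python
def average_time_with_stalls(stall_probs):
    """1 + P1 + 2*P2 + 3*P3, where stall_probs[k-1] is P(stall of k cycles)."""
    return 1 + sum(length * p
                   for length, p in enumerate(stall_probs, start=1))


without_forwarding = [0.1, 0.1, 0.1]  # the case worked out in the text
with_forwarding = [0.05, 0.0, 0.0]    # HYPOTHETICAL: only short stalls remain

for name, probs in (("without forwarding", without_forwarding),
                    ("with forwarding", with_forwarding)):
    t = average_time_with_stalls(probs)
    print(f"{name}: time={t:.2f}, efficiency={1 / t:.1%}")
```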
Now that we have looked at data hazards, let's return to the other major
form of hazard: the control hazard. Control hazards come from two sources:
instructions and exceptions. As we have seen, instructions such as jumps
and branches are the bane of any pipeline designer because they are detected
after other instructions have entered the pipeline, thus requiring the successive
contents of the pipe to be flushed.
Unconditional jumps can be detected early in the pipeline, so it is usually
only necessary to cancel the instruction behind the jump, which is easily
accomplished by substituting a no-op at the output of the Fetch/Decode
buffer until the target instruction is fetched.
The outcome of a branch is not known until the execute stage, at which point
two more instructions have entered the pipeline. If they are the next instructions,
then they proceed through, but if they were fetched in error, then they
have to be cancelled. Let's look at an example.
Move $7 $3
BrZ
Add $1 $2 $3
Sub $4 $5 $1

In the execute stage, the BrZ operation detects that the branch is taken
and in the next stage it will load the PC with the target address. The following
two instructions are invalid and must be flushed. This can be accomplished
by setting their successive inter-stage buffers to effective no-op instructions.
One easy way to create a no-op is to write to register 0, so we could simply
substitute Mov $0 $0 for these two.
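The flush can be pictured with a small Python sketch. It models the pipeline as a list of instruction strings ordered from the fetch end to the write-back end, which is an assumption made for illustration; the function name is likewise made up.

```python
NOP = "Mov $0 $0"  # writing register 0 has no effect, so this acts as a no-op


def flush_younger(stages, branch_stage):
    """Replace every instruction fetched after the branch with a no-op.

    stages[0] is the most recently fetched instruction; branch_stage is the
    index of the stage holding the taken branch.
    """
    return [NOP if i < branch_stage else instr
            for i, instr in enumerate(stages)]


# BrZ has reached execute (index 2); Add and Sub were fetched behind it
pipe = ["Sub $4 $5 $1", "Add $1 $2 $3", "BrZ", "Move $7 $3", "-"]
print(flush_younger(pipe, 2))
```

The instructions ahead of the branch are untouched and complete normally.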
Notice that there are data dependences in this stream as well. Those that
follow the branch are simply flushed. However, the dependence between the
first instruction and the third instruction crosses the branch. Normally,
this would cause the result of the Mov to be forwarded to the ALU input,
but as soon as the no-op is substituted in the Fetch/Decode buffer for the
Add, the dependence vanishes and the forward does not occur.
In a more complex pipeline that divides some of these stages into smaller
steps, it is possible that a forward could occur before the branch is detected.
As long as the write-back has not occurred, however, there is still time
to substitute a no-op result into the appropriate interstage buffer. This
is why the write-back is a separate stage that comes at the end of the pipeline,
rather than having results be written back as soon as they are available.
Suppose that we want to change the pipe to employ static branch prediction
in which it is assumed that the branch is always taken. We would have to
place a decoder at the output of the instruction memory that specifically
detects a branch, and then sends the PC and sign extended and shifted branch
address to an adder whose output is sent back to the PC. In other words,
we take the branch address generation logic from the single-cycle datapath
except that we don't make it conditional on the status output of the ALU.
The PC gets its next value from this circuit whenever the op-code indicates
a branch.
If it turns out that the branch isn't taken, then the hazard detection logic
comes into play and flushes the instructions that were fetched from the
target location. Note that the condition that causes the hazard is the inverse
of the earlier version of the pipeline (i.e., the hazard occurs when the
value is not 0).
There is one other detail that we have to take care of. Can you guess what
it is?
We need to restore the original next instruction PC when the branch isn't
taken. Thus, the next PC becomes one of the items to be buffered through
the stages, and may be routed back to the PC if the branch prediction is
in error.
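The predict-taken scheme and its recovery path can be sketched in Python. Several details here are assumptions made for the sake of a runnable example: instructions are 4 bytes wide, the branch offset is already sign extended and shifted, and the target is computed relative to the incremented PC. The function names are illustrative.

```python
def fetch_next(pc, is_branch, offset, word=4):
    """Return (PC to fetch from next, fall-through PC to buffer down the pipe).

    ASSUMPTION: the target is the incremented PC plus the offset.
    """
    fall_through = pc + word
    predicted = fall_through + offset if is_branch else fall_through
    return predicted, fall_through


def resolve(saved_fall_through, is_branch, taken):
    """Once the outcome is known, return the PC to restore, or None if the
    prediction was right and fetching can simply continue."""
    if is_branch and not taken:
        return saved_fall_through  # mispredicted: flush and refetch from here
    return None


predicted, saved = fetch_next(100, is_branch=True, offset=40)
print(predicted, saved)
print(resolve(saved, is_branch=True, taken=False))
```

The saved fall-through address is the extra item that has to travel through the interstage buffers with the branch.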
There are many more complex dynamic branch prediction mechanisms that are
in use, many of which involve keeping some history of the behavior of a
branch. As you can see from this discussion, any such technique would have
to produce a decision in less than a clock cycle so that the target address
or the next PC can be loaded into the PC. Most modern microprocessors employ
a separate branch prediction unit that is fed any branch instructions early so
that the decision can be made in time. But branch prediction is beyond the
scope of this course and will be covered in CmpSci 635.
The other source of control hazards is the nonprogrammed jump that occurs
when an exception or an interrupt takes place. Interrupts are most easily
implemented by inserting a subroutine jump to the interrupt handler in the
input stream to the pipeline. We simply multiplex the input to the Fetch/Decode
buffer with a subroutine call, so that the PC is saved and the address
of the interrupt routine becomes the new PC value.
Exceptions are implemented much like a branch in that they flush any operations
that have entered the pipeline after the instruction that caused the exception.
Exceptions can occur in different stages. The most common, such as arithmetic
errors, occur in the execute stage. However, we could also have a protection
violation be detected in the fetch stage, an invalid instruction be detected
in the decode stage, or a data memory error be detected in the memory stage.
Wherever the exception occurs, it is important that write backs following
the detection point be suppressed so that the cause of the offending condition
can be identified. Otherwise we might overwrite a value that contributed
to the exception.
One way to implement exceptions is to insert a subroutine jump as we did
for the interrupt. The only difference is that we want the return address
to point to the instruction that caused the error, which is not the same
as the next instruction. We thus have to either calculate the address of
the offending instruction (difficult in the presence of jumps) or we have
to pass the PC for the instruction along with it through all of the stage
buffers. Thus, whatever stage detects the exception sends its copy of the
PC to be used as the return address.
In addition to saving the PC, we have to cancel the operations between the
offending stage and the start of the pipe by substituting no-ops into all
of the inter-stage buffers that are to be flushed. We are then ready to
start filling the pipe from the start of the exception handler.
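Carrying the PC along with each instruction makes the bookkeeping simple, as the following Python sketch suggests. The buffer layout, the NOP encoding, and the function name are all hypothetical, and the sketch assumes that the offending instruction itself is cancelled along with everything fetched after it.

```python
NOP = ("no-op", None)  # hypothetical encoding of a flushed buffer entry


def raise_exception(buffers, faulting_stage):
    """Return (return address for the handler, buffers after the flush).

    buffers[i] is an (instruction, pc) pair; index 0 is the fetch end.
    ASSUMPTION: the faulting instruction is flushed too.
    """
    _, return_address = buffers[faulting_stage]
    flushed = [NOP if i <= faulting_stage else entry
               for i, entry in enumerate(buffers)]
    return return_address, flushed


pipe = [("Sub", 112), ("Add", 108), ("Div", 104), ("Load", 100)]
address, pipe_after = raise_exception(pipe, 2)  # Div faults in its stage
print(address, pipe_after)
```

The older Load is left alone so that it can still complete its write back.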
Throughout this discussion of the pipelined datapath we have had the
benefit that the RISC ISA simplifies control by allowing us to carry a minimal
amount of information between the interstage buffers. Detection of hazards
is straightforward and the single-word, single cycle nature of instructions
results in a pipeline with simple, uniform stages.
In a CISC ISA where individual instructions can occupy multiple stages of
the pipeline at once and can require multiple instruction fetches to be
fully decoded, it is more difficult to balance the pipeline and to detect
hazards. Even when a hazard is detected, it may not be easy to compensate
for it. Suppose that a multiword memory to memory move operation is in progress
and crosses a protection boundary -- part of the operation has already written
back to memory and so the exception handler must deal with the consequences
of a partially completed instruction. The memory writes cannot be undone,
and the counter that is used by the instruction is an internal register
that is hidden from programmer access.
In addition, complex addressing modes can make life difficult for the pipeline
designer, requiring many more arithmetic units in order to accelerate address
calculation so that it can fit into a single stage. Alternatively, the designer
may use multiple stages that are bypassed when simpler addressing modes
are employed. Of course, this produces a new kind of hazard in that the
simpler instructions cannot be allowed to jump ahead of others that are
using the more complex modes.
The bottom line of this discussion is that pipeline design in a CISC ISA
is more difficult. For example, the Intel architecture introduced a superscalar
pipeline design with the Pentium, several years after much smaller competitors
in the RISC ISA camp had brought such designs to market. The CISC ISA of
the Intel architecture held it back because of the added complexity of going
to this next level of pipelining.
The term superscalar is relatively recent, but it refers to a concept that
was introduced in the supercomputers of the 1960s. Many times
in executing a program, calculations take place that are independent of
each other (at least for the length of the pipeline). These operations can
be executed in parallel if multiple pipelines are built into the machine.
In the Intel Pentium, for example, there are two integer pipelines, one
of which also feeds the floating point unit. It is thus possible, when conditions
are just right, to have two integer operations and one floating point operation
complete in a single cycle. It turns out that this is not a common occurrence
in the Intel design (partly because of the way the floating point unit interacts
with the integer pipeline that feeds it), but it still happens often enough
to provide a significant performance boost.
In RISC architectures, it is now commonplace to have two integer pipelines,
a floating point pipe, a load-store pipeline, and a branch unit. Under perfect
conditions, such a design can complete five instructions in one cycle.
With so many pipelines active at once, it is difficult to detect and manage
dependences between the instructions in the different pipes. It is believed
that the amount of gain from this sort of parallelism is probably limited
to less than a factor of ten for general applications. However, for certain
very regular computations it has been shown that there is much greater parallelism
to be exploited. On the other hand, these naturally parallel applications
tend also to be easily implemented on more traditional parallel processors,
so it is unclear whether the market will support a more significant level
of superscalar parallelism.
The Pentium is the first processor in the Intel architecture family to make real use of
pipelines. Previously, a prefetch buffer and an instruction predecode buffer
took the place of the early stages of a traditional pipeline. The Pentium
uses a 5-stage pipeline that splits in two to achieve superscalar instruction
execution. It also ties an additional 3 stages onto one of the branches
to handle floating point.
The instruction decoder determines whether instructions can execute in parallel
and dispatches them to the appropriate pipe. Floating point instructions
must go to the U pipe, which begins the floating point operation. Floating
point operations cannot be executed in parallel with any other operation.
Thus, the Pentium is superscalar for integer operations only.

The latest processor in the Motorola CISC line, the 68060, combines a 4-stage
instruction fetch pipeline with a dual (superscalar) 6-stage execution pipeline.
One of the two pipelines can perform either integer or floating point operations.
The other pipeline is restricted to integer operations. There are no restrictions
on executing floating point operations with other operations, so the 68060
is a true superscalar processor.

The prior version of the Motorola 68K line, the 68040, is a scalar pipeline
with 6 stages, expanding to 7 stages for floating point operations.

Notice that the Motorola designs include a stage simply for calculating
the operand address. This stage is needed to handle the complex addressing
modes supported by the 68K ISA. It is also interesting to note that Motorola
was able to introduce a pipelined processor well before Intel. This is in
part due to the uniformity and symmetry of the 68K ISA. Even though it is
thoroughly CISC in nature, the 68K instruction set is much simpler to decode
and execute than the Intel 80X86 ISA.
Unlike the pipeline described for the MIPS in the text, the actual MIPS
R4000 processor uses an eight-stage pipeline divided as follows:
IF - Instruction fetch first stage
IS - Instruction fetch second stage
RF - Register fetch
EX - Execute
DF - Data fetch first stage
DS - Data fetch second stage
TC - Tag check
WB - Write back
The instruction and data fetch stages are split because the use of virtual
memory in the actual processor delays accesses so that they take 2 cycles.
The TC stage is introduced to verify that data has actually been fetched
-- it checks the tag information from the cache and the virtual memory address
translation unit to see whether the cache fetch and translation were successful.
If the value is not in the cache, or the address translation calculation
is taking longer than expected, then the TC stage signals that the data
has not been fetched and stalls the pipeline until it arrives.
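To see how instructions overlap in the eight stages listed above, the following Python sketch builds a simple occupancy table, one row per instruction and one column per cycle. It assumes an ideal pipe with no stalls (so the TC stage never reports a miss) and one instruction entering per cycle; the function name `pipeline_diagram` and the `--` filler are hypothetical.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def pipeline_diagram(num_instructions, stages=R4000_STAGES):
    """Return one row per instruction showing its stage in each cycle.

    Assumes no stalls: instruction i enters the first stage in cycle i
    and advances one stage per cycle.
    """
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles          # "--" means not in the pipe
        for s, name in enumerate(stages):
            row[i + s] = name                # stage s is reached in cycle i + s
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(3):
        print(" ".join(row))
```

Under these assumptions, n instructions need 8 + n - 1 cycles in total, so three instructions finish in ten cycles rather than the twenty-four that unpipelined execution of the same stages would take.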
The R4000 has a matching pipeline for floating point operations that operates
independently. In addition, the floating point unit has three separate units
that can operate simultaneously: Add/Root, Multiply, Divide. Thus, the pipeline
dispatches operations to these units (which may take different amounts of
time to complete) at the execute stage. The adder and multiplier are also
pipelined so that one new operation can begin while a current operation
is completing.
DEC's Alpha RISC processor employs a pair of very deep pipelines (that is,
pipelines with many stages), which classifies it as a "superpipelined
superscalar" processor. The integer pipeline in the Alpha is 7 stages
long and the floating point pipeline is 10 stages in length. These short
stages allow the Alpha to execute with clock rates of 300 MHz, which is
approaching the clock rate of current Cray supercomputers.
The integer pipeline's seven stages are as follows. The first four stages
not only fetch the instruction, but they check for dependences between instructions.
These stages can be stalled, but after that the pipeline proceeds without
interruption.
Cache Access
Swap predict
Decode
Issue and Register File Read
ALU First Stage / PC Generate / Virtual Address Generate (depending on instruction
type).
ALU Second Stage / Instr. Address Translation / Data Address Translation
Write back / Instr. Cache Tag Check / Data Cache Tag Check
The Alpha floating point pipeline has essentially the same first four stages
as the integer pipe. The next five stages are simply called F1 - F5 and
the last stage is floating write back (FWR). Part of the reason for the
uninformative names of the floating point stages is that they do not correspond
to specific operations. Rather, the five stages are responsible for just
four major processing steps that are split up so as to fit the five cycles.
This is a good example of a design in which the pipeline's workload has
been redistributed to balance the timing of the stages. The four operations
that are performed depend on the type of floating point operation:
Exponent Difference / Multiply by 3
Leading 1 Detection / Multiply Part 1
Shift Alignment / Multiply Part 2
Addition and Rounding
The multiplication in the Alpha is done with yet another unconventional
technique called the Radix-8 method. It is a pipelined circuit and one of
the inputs that it requires is 3 times the multiplicand.
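The need for 3 times the multiplicand can be illustrated with a short Python sketch. The text does not say which radix-8 scheme the Alpha uses, so this example assumes radix-8 Booth recoding, a common approach in which the multiplier is consumed three bits at a time and each group selects a multiple of the multiplicand between -4 and +4. The multiples 1, 2, and 4 are simple shifts, but 3 requires an addition, which is why it would be prepared ahead of time. The function names and the default width are hypothetical.

```python
def booth_radix8_digits(multiplier, bits):
    """Recode a signed multiplier of the given width into digits in -4..4.

    Digit i is -4*y[3i+2] + 2*y[3i+1] + y[3i] + y[3i-1], with y[-1] = 0.
    Python's >> sign-extends negative integers, so no padding is needed.
    """
    groups = (bits + 2) // 3          # number of 3-bit groups, rounded up
    digits = []
    prev = 0                          # y[3i-1], the top bit of the previous group
    for i in range(groups):
        b0 = (multiplier >> (3 * i)) & 1
        b1 = (multiplier >> (3 * i + 1)) & 1
        b2 = (multiplier >> (3 * i + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2
    return digits


def multiply_radix8(multiplicand, multiplier, bits=16):
    """Multiply by summing one selected multiple per radix-8 digit."""
    triple = multiplicand + (multiplicand << 1)   # the precomputed 3x input
    multiples = {0: 0, 1: multiplicand, 2: multiplicand << 1,
                 3: triple, 4: multiplicand << 2}
    product = 0
    for i, digit in enumerate(booth_radix8_digits(multiplier, bits)):
        partial = multiples[abs(digit)]
        if digit < 0:
            partial = -partial
        product += partial << (3 * i)             # weight of digit i is 8**i
    return product
```

With this recoding a 16-bit multiplier produces only six partial products instead of sixteen, at the cost of forming the 3x multiple once per multiplication.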
The Alpha can process two instructions at once, within the following constraints:
Load or store can execute in parallel with any non-memory operation
Integer and floating point operations can execute in parallel
Floating point operations and floating point branches can execute in parallel
Integer operations and integer branches can execute in parallel
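The four constraints above can be written down as a small pairing check. This Python sketch is an illustration only: the class names are hypothetical, branches are assumed to count as non-memory operations, and any pairing not listed above is assumed to be disallowed.

```python
# Instruction classes (hypothetical names chosen for this example).
MEMORY = "memory"            # load or store
INT_OP = "int_op"
FP_OP = "fp_op"
INT_BRANCH = "int_branch"
FP_BRANCH = "fp_branch"

# Unordered pairs that may issue together, taken from the four rules above.
ALLOWED_PAIRS = {
    # Load or store with any non-memory operation
    frozenset({MEMORY, INT_OP}),
    frozenset({MEMORY, FP_OP}),
    frozenset({MEMORY, INT_BRANCH}),
    frozenset({MEMORY, FP_BRANCH}),
    # Integer with floating point operations
    frozenset({INT_OP, FP_OP}),
    # Floating point operations with floating point branches
    frozenset({FP_OP, FP_BRANCH}),
    # Integer operations with integer branches
    frozenset({INT_OP, INT_BRANCH}),
}


def can_dual_issue(first, second):
    """Return True if the two instruction classes may issue in the same cycle."""
    # Two instructions of the same class collapse to a one-element set,
    # which is never in ALLOWED_PAIRS, so they are rejected.
    return frozenset({first, second}) in ALLOWED_PAIRS
```

Because the pairs are stored as unordered sets, the check gives the same answer regardless of which instruction comes first in program order.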
The first processor in the Motorola/IBM/Apple consortium's line of RISC
microprocessors has three distinct pipelines, one of which serves two functions.
One pipeline is dedicated to identifying and dynamically predicting branches,
another handles load/store operations in five stages, or integer operations
in four stages. The third is the floating-point pipeline, and occupies six
stages. These pipelines are fed by a prefetch unit that partially decodes
the instructions, identifies dependences, and issues them to the appropriate
pipeline. The 601 is thus a 3-way superscalar issue processor. The pipelines
are organized as follows:

One of the interesting aspects of the PowerPC 601 is that it can reorder
instructions within a window of four instruction slots so as to take advantage
of the availability of functional units in the pipelines. It thus has to
check the dependences between instructions within this window and ensure
that the results emerge and are eventually stored into the registers in
the proper order. To accomplish this reordering, it has extra "temporary"
registers that are hidden from the user and the ability to rename the registers
so that the results appear in the proper places at the proper times. It
is interesting to note that Intel will only be able to introduce this feature
to a limited degree in the P6 successor to the Pentium.
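The general idea of register renaming can be sketched in a few lines of Python. This is a generic illustration, not the 601's actual mechanism: it assumes an unlimited supply of hidden physical registers, instructions of the form (destination, source 1, source 2), and a hypothetical function name `rename`.

```python
def rename(instructions, num_arch_regs=4):
    """Rewrite architectural register numbers as physical register numbers.

    instructions is a list of (dest, src1, src2) tuples in program order.
    Every result is given a fresh physical register, so a later write to
    the same architectural register no longer conflicts with earlier
    readers or writers of it.
    """
    # Initially each architectural register maps to the same-numbered
    # physical register.
    mapping = {r: r for r in range(num_arch_regs)}
    next_free = num_arch_regs        # first hidden "temporary" register
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are looked up before the destination mapping changes,
        # so an instruction such as r1 = r1 + r2 reads the old r1.
        p1, p2 = mapping[src1], mapping[src2]
        mapping[dest] = next_free
        renamed.append((next_free, p1, p2))
        next_free += 1
    return renamed
```

In a program where the first and third instructions both write r1, the two results land in different physical registers, so the third instruction can complete early without overwriting a value that the second instruction still needs.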
The successor to the 601 is a 4-way superscalar issue processor. Its
four pipelines are arranged as follows:

Note that the integer pipeline is now six stages deep and is separate from
the load/store pipeline. The load/store pipeline handles eight different
stages of processing in seven pipeline cycles -- again, an example of finely
dividing the work between adjacent stages in order to balance their timing.