Because of the complexity of the implementation of a datapath, we'll
go over the example in the text and take a look at the datapaths in some
other architectures.
The basic design of any datapath involves an instruction fetch unit (program
counter, memory), data load/store unit (address register, memory, address
ALU), and an execution unit (registers, ALU, control unit).
The instruction fetch unit is basically a device that takes the program
counter, presents it to the memory as an address, signals a read cycle on
the memory, and latches the memory output to the instruction register. In
addition it must handle the increment of the PC to get the next instruction,
and the addition of a relative jump address for PC relative jumps, or the
substitution of a branch address for direct branches.

In the MIPS and other RISC processors, the instruction increment adder has
a fixed input (4 if byte addressing, 1 if word addressing). The advantage
of knowing this is that a specialized adder can be used -- there is just
a single bit input to the second operand, and thus the bits below that point
can pass through unaltered and the bits above need only propagate a carry
(if necessary).
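As a rough illustration of the specialized incrementer just described, the following Python sketch adds 4 using only a chain of half adders starting at bit 2. The function name and the 32-bit default width are assumptions made for this example and are not taken from the MIPS design itself.

```python
def increment_by_four(pc, width=32):
    """Add 4 to pc without a full adder (illustrative sketch)."""
    result = pc & 0b11      # bits 0 and 1 pass through unaltered
    carry = 1               # the single fixed input enters at bit 2
    for bit in range(2, width):
        a = (pc >> bit) & 1
        result |= (a ^ carry) << bit   # half-adder sum
        carry = a & carry              # half-adder carry to the next bit
    return result           # any carry out of the top bit is discarded


print(hex(increment_by_four(0x0000FFFC)))
```

Each bit position above bit 2 needs only an XOR and an AND, which is the saving the fixed operand makes possible.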
With a CISC processor, the lengths of instructions vary and so the PC must
be incremented by different amounts. This requires that the instruction
register be decoded to the point that the length of the instruction can
be determined and then that value is passed to the adder. Note, however,
that if the instruction spans multiple words, then the fetch unit must cycle
that many times to fetch all of the parts of the instruction. It must also
fill the appropriate portion of a longer instruction register, and so the
control unit must generate an address into the instruction register for
each of these cycles.
In a processor such as the 80486, instructions can also be less than a word
in length. Thus we might see two instructions in one word, in which case
the fetch unit must be able to realign the second instruction once the first
is executed (or the control unit must be able to selectively read part of
the IR).
In an actual implementation of the Intel architecture, the fetch unit fills
a buffer several words long with instructions and it simply watches as the
control unit consumes them, refilling the buffer whenever there is a spare
memory bus cycle. This gives the control unit more flexible access to the
variable width instructions, but also introduces the need for logic to flush
the buffer in the event of a branch.
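The behavior of such a prefetch buffer can be sketched in a few lines of Python. This is only a toy model: the class name, the four-entry capacity, and the word-per-entry memory are assumptions chosen for the example, not details of any Intel implementation.

```python
from collections import deque


class PrefetchBuffer:
    """Toy model of an instruction prefetch queue (hypothetical)."""

    def __init__(self, memory, capacity=4):
        self.memory = memory        # list of instruction words
        self.capacity = capacity
        self.queue = deque()
        self.fetch_addr = 0         # next address to prefetch

    def refill(self):
        """Use one spare bus cycle to fetch the next word, if there is room."""
        if len(self.queue) < self.capacity and self.fetch_addr < len(self.memory):
            self.queue.append(self.memory[self.fetch_addr])
            self.fetch_addr += 1

    def consume(self):
        """The control unit takes the oldest word, or None if the buffer is empty."""
        return self.queue.popleft() if self.queue else None

    def flush(self, target):
        """On a taken branch, discard prefetched words and restart at target."""
        self.queue.clear()
        self.fetch_addr = target
```

Note that after a flush the control unit finds the buffer empty and must wait for at least one refill, which is the cost of a taken branch in this scheme.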
The data load/store unit is similar to the instruction fetch unit in
that it provides an interface between memory and register storage. However,
it must handle writes in addition to reads.

The data address may come directly from an instruction or be the result
of some operation on the instruction and/or the registers. When the address
must be computed, it may be generated by a special address arithmetic ALU,
or by the same ALU used for normal computation. In the design shown in the
book, the latter approach is taken.
Note that the MIPS design shows separate data and instruction memories.
This is because the MIPS design uses a cache that is divided in this manner.
So even though data and instructions are intermixed within main memory,
they are in separate memories at the level of the hierarchy closest to the
registers. This is a common design approach among many RISC processors now
that there is room on the chip for two caches of sufficient size. However,
there are still designs that use a "unified" cache, and in this
situation the instruction fetch unit and the load/store unit must compete
for access to the memory.
The execution unit contains the data registers and the ALU. It selects
the registers to operate on, feeds them to an ALU, selects the appropriate
result, and sends it back to the registers. In a load or store, the ALU
may be bypassed so that the data moves between the registers and memory
unchanged. The ALU may also serve to compute a branch address.

With a RISC instruction set, the selection of the source and destination
registers is straightforward because they are fixed fields in the IR, and
it is simply a matter of connecting them to the register file's address
inputs. In a CISC instruction set, the source and destination registers
may be specified in different parts of the IR, and even in different words.
Thus, the IR fields must pass through a steering selector that is controlled
by the control unit.
Notice that for a load, the SRC1 supplies the value to be added to the instruction's
address, and the data itself comes from SRC2. The multiplexer ensures that
the address gets steered to the ALU in place of SRC2 in this case.
In a RISC ISA, there are just a few types of instructions, and they are
either pure register to register, or load/store in their treatment of data.
With a CISC architecture, the source inputs to the ALU can include memory
(actually a hidden memory data register) and a register, so there would
be an extra input to the lower multiplexer, and a second multiplexer on
the SRC1 input, to handle memory data. The result data also can go directly
to memory, so a demultiplexer would be needed to steer the ALU result either
back to the registers or out to memory.
A CISC ISA has more addressing modes than a RISC ISA, which would result
in greater complexity in the address calculation step. The RISC processor
can basically use a register together with the address in an instruction.
A CISC address may be generated from an indirect reference (requiring another
memory fetch to the memory data register), or multiple registers (e.g.,
base + offset), or an extra instruction word (the extended IR), and possibly
with a pre- or post-increment (or decrement) of a register (requiring a
source value to pass through an increment/decrement unit either before or
after the ALU, and be passed back to that register). Given this specialized
use of the ALU, it is common in CISC architectures to have a separate addressing
ALU.
The control unit is a finite state machine that takes as its inputs the
IR, the status register (which is partly filled by the status output from
the ALU), and the current major state of the cycle. Its rules are encoded
either in random logic or in a PLA, and its outputs are sent across the
processor to each point requiring coordination or direction from the control
unit.
For example the outputs needed for the preceding units are Jump/Branch/NextPC,
IR Latch, Read Control, Load Control, ALU Function Select, Load/Reg-Reg,
Reg R/W.

The ALU function select takes the instruction op code and translates it
into a given function of the ALU (either one line per ALU function or a
compact binary code for the function). Jump/Branch/NextPC depends on the
instruction type, and in a RISC architecture these may be directly coded
in the op code. Read control occurs at the start of an instruction cycle.
IR latch occurs at the end of the fetch state. Load control happens
at the end of the data fetch state of a load instruction. Load/Reg-Reg again
depends on the op code. Reg R/W is asserted at the start of the data fetch stage
and at the write back stage of an operation. It thus depends on the major
state and the instruction.
A CISC architecture typically uses a more complex control unit. As noted
before, the IR is often multiple words, and the control unit has to look
at different parts of the IR at different stages of execution. In fact,
the entire IR may not be available at once, requiring interlocks with fetch
logic to ensure the contents of the IR are valid.
There are many more control signals coming out of a CISC control unit, partly
to control the more complex addressing logic, but also to directly connect
to the many special purpose registers. In a RISC architecture, the registers
are accessed uniformly in a block so a simple decoder in the register file
can select the particular register. In a CISC architecture, there are restrictions
on the particular registers that can be used by a given instruction and
these are enforced by the control unit.
Executing an instruction in a single clock cycle as shown in the text
is an unrealistic approach. In the preceding discussion we saw that at least
an instruction register is required to deal with the fact that an instruction
fetch may not be completed in a single cycle, and also because it is unreasonable
to expect the memory to drive the entire data path. The text assumes a memory
that always responds in time, and that has enough power to drive the datapath.
From our prior discussions we know that memory is generally slower than
the computational components. Thus, if we are forced to run a processor
with a cycle time corresponding to a memory cycle, we will be holding back
the potential speed of computation.
(Side note: this may raise the question, "How can a processor ever
run at full speed when it has to fetch instructions from memory on every
cycle?" The answer is that it takes a fast cache memory with a wide
data path that delivers multiple instructions on each cycle to an instruction
prefetch buffer that is built with registers. The buffer can feed the IR
at full speed, and even though the cache is slower, the fact that it outputs
multiple instructions at once allows it to keep up. Notice that this is
only possible with a multicycle design.)
Without the IR, the design is also forced to employ the split instruction
and data memory because a combined memory could not provide both the instruction
and data for an operation in a single cycle. The use of an IR allows us
to combine the two memories, but if we do so, then we also need to split
an operation into distinct cycles because we must clock the memory to cause
the read and then the IR to cause it to store the fetched instruction.
In addition, the single cycle design requires that all instructions take
the same length of time to execute. In a RISC processor, most of the instructions
are 1 CPI, but not all. For example, floating point instructions may take
an order of magnitude longer to execute. Thus, if there can be only one
length of instruction cycle, it would have to be perhaps 10 times longer
than the majority of instructions require.
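To get a feel for the cost, here is a small worked example in Python. The instruction mix (95 simple instructions and 5 floating-point instructions) is purely hypothetical, and the 10x factor is the "perhaps 10 times" figure from the paragraph above.

```python
# Hypothetical mix: 95 simple instructions and 5 floating-point instructions.
SIMPLE_COUNT, FP_COUNT = 95, 5
SIMPLE_TIME, FP_TIME = 1, 10    # FP assumed to take 10 time units


def single_cycle_time(simple, fp):
    """Every instruction is charged the longest instruction's time."""
    cycle = max(SIMPLE_TIME, FP_TIME)
    return (simple + fp) * cycle


def multicycle_time(simple, fp):
    """Each instruction is charged only the time it needs."""
    return simple * SIMPLE_TIME + fp * FP_TIME


single = single_cycle_time(SIMPLE_COUNT, FP_COUNT)
multi = multicycle_time(SIMPLE_COUNT, FP_COUNT)
print(single, multi)   # total time units for each design
```

Under these assumptions the single-cycle design needs 1000 time units where the variable-length design needs 145, even though only 5 percent of the instructions are slow.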
For a CISC ISA it would be impossible to build a single-cycle datapath because
many of the instructions involve multiple memory fetch operations, each
of which takes a clock signal to drive the memory. Our earlier comparisons
of CISC and RISC datapaths were only conceptually possible because an IR
was introduced.
There is one class of machines for which a single-cycle datapath has worked
reasonably well. These are the bit-serial array processors. In such machines,
there are thousands of processors that are only one bit wide. To add an
8-bit number, one of these processors must perform a series of 8 1-bit adds
with a carry being saved each time for the next instruction. Because all
of the instructions operate on just one bit, there is just a single execution
time for any instruction. Every operation of greater complexity is broken
down into these 1-bit instructions. However, this is an extreme instance
of a datapath. Even in these processors, operations such as I/O and communication
with each other have to be given multiple cycles.
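The 8-step addition performed by one of these 1-bit processors can be sketched as follows. The function name is an assumption for the example; each loop iteration stands for one 1-bit add instruction, with the carry saved between iterations.

```python
def bit_serial_add(a, b, width=8):
    """Add two width-bit numbers one bit at a time, saving the carry."""
    carry = 0
    result = 0
    for i in range(width):              # one 1-bit add instruction per pass
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry               # sum bit of a full adder
        carry = (x & y) | (carry & (x ^ y))
        result |= s << i
    return result, carry                # final carry is the overflow bit


print(bit_serial_add(100, 55))
```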
In addition to being able to run simpler instructions in less time, the
use of multiple cycles allows us to reuse the computational ALU for address
calculation. To see why this is possible, consider that the relative branch
offset and PC are available as soon as the IR is filled. But any register
operations that go to the ALU require register data to be fetched -- which
can't occur until one cycle after the IR is loaded. Thus, we can feed the
branch offset and PC to the ALU during this cycle and store the result in
a temporary register (target) for later use. Then, on the next cycle, the
register values arrive at the ALU and the computation is carried out.
There is one problem with this approach. What is it?
The instruction is only just being decoded during the same cycle! So how
do we know whether we have a branch or not?
The answer is that we don't care. If we don't have a branch, we can just
ignore the value in target and there is no harm done (the ALU would simply
have been idle for that cycle anyway).
But what if we do have a branch? We have this useless computational result
that has been produced from misinterpreting the fields of the IR. (Recall
that in the MIPS as in any RISC ISA, the branch and register instructions
have different formats.) Actually, the computational result has not been
entirely in vain. Because of the regularity of the ISA, the two register
fields in a branch are in the same place as the two source register fields
of a register operation. Thus, the correct values have gone through the
ALU, and we can use the result of comparing them to decide whether to take
the branch (and simply ignore the other ALU output). Note that this is another
clever trick that would be difficult to make use of in a CISC ISA.
Going back to the case where we don't have a branch, the same two register
fields feed the ALU and we simply keep the output from the ALU instead of
transferring the value in target to the PC.

Register-register operation:
Instruction Fetch to IR, PC <- PC + 4
Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address
from IR shifted 2 left)
Execution, ALU operation complete
Write back to destination register

Load or store:
Instruction Fetch to IR, PC <- PC + 4
Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address
from IR shifted 2 left)
Execution, ALU computed register + sign-extended address from IR
Write Src2 into Memory(ALU result) OR Read Memory(ALU result)
Write back Read data to Destination Register

Branch:
Instruction Fetch to IR, PC <- PC + 4
Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address
from IR shifted 2 left)
IF Src1 = Src2 then PC gets target

Jump:
Instruction Fetch to IR, PC <- PC + 4
Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address
from IR shifted 2 left)
PC gets high order bits of PC concatenated with jump address part of IR
shifted 2 left
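
The address arithmetic in these sequences can be written out as a short Python sketch. It assumes the usual MIPS field widths (a 16-bit branch offset, a 26-bit jump address, and 32-bit addresses), which the steps above do not state explicitly. The function names and the cycle-count table are introduced here only for illustration, with the counts read off the step lists.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, imm16):
    """Target = PC + (sign-extended address shifted 2 left); PC is already PC + 4."""
    return (pc + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, addr26):
    """High-order bits of PC concatenated with the jump field shifted 2 left."""
    return (pc & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)


# Number of cycles (major states) each class uses, counted from the steps above.
CYCLES = {"register": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

print(hex(branch_target(0x1004, 0xFFFF)))   # offset of -1 word
```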
The text shows how this datapath can be controlled by a finite state
machine with just nine states.

Unlike the way this is drawn in the text, it is quite obvious here that
each level of the controller's finite state machine corresponds to a clock
cycle (or major state). This view clearly shows the commonality of the first
two major states, prior to the decoding of the instruction. The third stage
is a typical fanning out of the finite state machine to deal with the different
cases of the instruction types. Thus, we can see that the machine is arranged
in a matrix where the rows are major states, and the columns are major
instruction types. Because the new PC values have been computed, the Jump
and Branch types can be finished early. The memory read takes the longest.
Each of these states produces a set of control signals that cause the multiplexers
to select the appropriate inputs, and the various registers to latch at
the proper time. This is determined by looking at each device requiring
control and determining for each state whether it is necessary to issue
a signal on that particular control line.
The controller is implemented by a block of combinational logic that takes
as its inputs the opcode, the status output from the ALU, and the
state register. Note that in the text, the circuitry controlled by the status
from the ALU is shown external to the control for simplicity. But in a more
general controller that must handle multiple conditions, this logic would
not be separated. Because there are nine distinct states, the state register
for this machine must have four bits.
One common approach to implementing a more complex controller is to have
the major state be represented by a counter and within each of these states
the different columns are represented by the state register. The major state
counter simply increments on each clock cycle, and when a column of states
finishes early, its last state generates a reset signal to the major state
counter.
Another issue that is not addressed here is what happens when there is a
memory delay. We are still assuming that memory returns in a single cycle.
The simplest case is to stall the finite state machine until the fetch is
complete. This is done by having a memory wait signal that is input to the
finite state machine. Each memory fetch state has a next state arc that
loops back to itself whenever memory wait is asserted.
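A minimal sketch of this stalling behavior is shown below. The state names and the single chain of states are hypothetical simplifications (the controller in the text fans out by instruction type); the point is only the self-loop on states that access memory.

```python
# Hypothetical five-state chain for a load; not the text's full controller.
NEXT_STATE = {
    "FETCH": "DECODE",
    "DECODE": "EXECUTE",
    "EXECUTE": "MEM_READ",
    "MEM_READ": "WRITE_BACK",
    "WRITE_BACK": "FETCH",
}
MEMORY_STATES = {"FETCH", "MEM_READ"}   # states that wait on memory


def next_state(state, memory_wait):
    """Loop back to the same state while memory wait is asserted."""
    if state in MEMORY_STATES and memory_wait:
        return state
    return NEXT_STATE[state]


def run(wait_pattern):
    """Step the machine once per entry in wait_pattern, starting at FETCH."""
    state = "FETCH"
    trace = [state]
    for wait in wait_pattern:
        state = next_state(state, wait)
        trace.append(state)
    return trace


print(run([True, False, False]))
```

States that do not touch memory ignore the wait input, so only the fetch and memory read states are lengthened.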
The DEC PDP-8 is a frequently cited example of an almost trivial datapath.
We'll quickly take a look at it and note some differences in the implementation
approach. The PDP-8 is a 12-bit word machine with a single register (called
the Accumulator). It thus falls into the class of single address computers.
The majority of the instructions are specified by the upper 3 bits of a 12-bit
word. There are thus 8 major instruction classes:
0xxx Logical AND location xxx with Accumulator
1xxx Add location xxx to Accumulator
2xxx Increment location xxx and skip next instruction if the result is 0
3xxx Store Accumulator in location xxx and clear Accumulator
4xxx Subroutine jump to xxx + 1 and store return address in xxx
5xxx Unconditional jump to xxx
6xxx I/O value in Accumulator with device according to xxx
7xxx Subinstruction code specified by xxx
In the DEC scheme, the high order bit is number 0 and the low order bit
is 11. Operations 0 through 5 use an addressing scheme in which bit 3 determines
whether the address is direct or indirect, and bit 4 determines whether
it refers to an address in the current 128-word "page" or the
page that starts at address 0.
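The field layout can be made concrete with a small decoding sketch in Python. It uses the DEC bit numbering described above (bit 0 is the high order bit). Treating bits 5 through 11 as the in-page offset is an inference from the 128-word page size, and the function and key names are invented for the example.

```python
def bit(word, n):
    """Return bit n of a 12-bit word in DEC numbering (bit 0 is the MSB)."""
    return (word >> (11 - n)) & 1


def decode(word):
    """Split a memory-reference instruction into its fields."""
    return {
        "opcode": (word >> 9) & 0o7,      # bits 0..2
        "indirect": bit(word, 3),         # 1 = indirect address
        "current_page": bit(word, 4),     # 0 = page starting at address 0
        "offset": word & 0o177,           # bits 5..11 (assumed 7-bit offset)
    }


print(decode(0o1077))
```

For the octal word 1077 this gives opcode 1 (Add), a direct address, page 0, and offset 77 octal.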
The processor organization is as follows. Note that for the sake of further
simplification, the Current Page mode logic has been omitted:

The processor has three major states: Fetch, Defer (Indirect address fetch),
and Execute. Here is an example of an instruction execution:
1077 -- add location 77 directly to the accumulator
Major State 1 (Fetch)
Minor State 1 MAR <- PC
Minor State 2 MBR <- Mem(MAR), PC <- PC + 1
Minor State 3 Latch MBR into IR
Minor State 4 Decode Instruction = Add, Mode = Direct (no defer), Page =
0, Max St, No defer
Major State 3 (Execute; Major State 2, Defer, is skipped)
Minor State 1 MAR <- 00000 + IR5..IR11
Minor State 2 MBR <- Mem(MAR), ALU Function = Add
Minor State 3 Acc <- ALU result, Max St
Now let's look at some of these control lines to see what logic expressions
drive them:
No defer = (Major State = 1) AND (IR3 = 0)
Latch MBR = (Major State = 1) AND (Minor State = 3)
Increment PC = (Major State = 1) AND (Minor State = 2)
IR address/MBR = (IR0..IR2 = 4 or 5) AND (IR3 = 1)
Load PC = (Major State = 3) AND (Minor State = 3) AND (IR0..IR2 = 4 or 5)
Memory Data/ALU = (Major State = 3) AND (IR0..IR2 = 3)
MBR Latch = ((Major State = 1) AND (Minor State = 2)) OR ((Major State =
2) AND (Minor State = 2)) OR ((Major State = 3) AND (Minor State = 2) AND
(IR0..IR2 < 4))
And so on. Essentially for each of the control signals we identify all of
the conditions that could cause it to be asserted and add them to the expression
for the given signal.
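Expressions of this kind translate directly into boolean functions of the major state, the minor state, and the IR. The sketch below encodes three of the signals listed above. The function name and the dictionary keys are made up for the example, and the opcode is taken from the upper three bits of the 12-bit IR.

```python
def control_signals(major, minor, ir):
    """Evaluate a few PDP-8 control lines for the given state and IR."""
    opcode = (ir >> 9) & 0o7          # IR0..IR2 in DEC bit numbering
    return {
        # Increment PC = (Major State = 1) AND (Minor State = 2)
        "increment_pc": major == 1 and minor == 2,
        # Latch MBR = (Major State = 1) AND (Minor State = 3)
        "latch_mbr": major == 1 and minor == 3,
        # Load PC = (Major State = 3) AND (Minor State = 3) AND (IR0..IR2 = 4 or 5)
        "load_pc": major == 3 and minor == 3 and opcode in (4, 5),
    }


print(control_signals(3, 3, 0o5200))
```

In hardware each of these would be a small AND/OR gate network; the function simply evaluates all of them for one state.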
In many CISC architectures the control unit can feed back to the major
(and minor) states and has internal registers and a ROM. It can thus cause
portions of an instruction to be extended or repeated as necessary. The
minor states, registers, additional control logic, and ROM form a finite
state machine called a microcode engine. In the microcode engine, the op-code
from the IR becomes a jump address into the ROM. A micro-PC can be used
to step through a series of fetches from the ROM starting at that point,
with each fetch resulting in control signals being sent out and feedback
to the major and minor state values.
An alternative to using a micro-PC is to have each instruction explicitly
specify the address of its successor. Thus, one of the fields in the micro-op
may be the address of the next instruction. This allows jumps to be used
anywhere in the microcode with no time penalty -- consider that if a separate
instruction had to be employed for a jump, it would add a cycle to the execution
of the ISA instruction.
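The next-address scheme can be sketched as a tiny microcode engine. Everything in this example is hypothetical: the ROM contents, the signal names, and the dispatch table are invented to show the mechanism and do not describe any real machine.

```python
# Each microinstruction is (control signals, address of its successor).
# A successor of None means "dispatch on the op-code in the IR".
ROM = [
    (("mar_from_pc",), 1),            # 0: start of fetch
    (("mem_read", "inc_pc"), 2),      # 1
    (("latch_ir",), None),            # 2: dispatch on op-code
    (("mar_from_ir",), 4),            # 3: ADD routine
    (("mem_read", "alu_add"), 5),     # 4
    (("acc_load",), 0),               # 5: back to fetch
    (("load_pc_from_ir",), 0),        # 6: JMP routine
]
DISPATCH = {1: 3, 5: 6}               # op-code -> routine start address


def run_instruction(opcode):
    """Return the control words issued while executing one ISA instruction."""
    address, trace = 0, []
    while True:
        signals, next_address = ROM[address]
        trace.append(signals)
        if next_address is None:
            next_address = DISPATCH[opcode]
        if next_address == 0:         # routine has returned to fetch
            return trace
        address = next_address


print(len(run_instruction(1)), len(run_instruction(5)))
```

Because every word names its successor, the jump back to fetch at the end of each routine costs no extra microinstruction.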
The microcode instruction set can contain a subroutine jump so that common
sequences of control outputs can be reused. This is typically present in
systems that employ a microPC rather than a next instruction field. There
is typically just one subroutine return register, so nesting of subroutines
is not allowed. Thus, the subroutine jump may be implemented as a normal
jump with a signal issued that stores the current micro-PC value into the
return register. And the subroutine ends with a Return operation rather
than a jump to another location. The Return operation issues a signal that
loads the return register back into the micro-PC. In a system with an explicit
next instruction field, the current address plus one is stored into the
return register and it is implicit that the next location holds the next
instruction following the subroutine call. (An alternative is to have both
a subroutine jump address and a next instruction address, and simply return
to the same instruction, but this requires the instruction format to be
larger than necessary for most operations.)
From the preceding discussion of how sequential execution, jumps, and
calls are executed we can gather that the microinstruction needs to include
control information for the microengine itself. But what else does it store?
In the simplest microinstruction formats, each bit represents a control
signal that is sent out to the datapath. Thus, when a microinstruction is
fetched, the bits in the instruction are connected to control wires and
cause actions to occur in the datapath.
For example, in our PDP-8 example, there are 14 control signals that are
driven by a single bit, and three others that each require multiple bits,
for a total of 23 bits. Thus, we might use a microinstruction format such
as:

Each of the first 20 bits of this microinstruction corresponds to a control
signal in the PDP-8. For example, bit 0 might be the Halt, bit 1 the No
defer, bit 2 the Max St., bit 3 Acc. Load, bit 4 Acc. clear, etc. This simple
representation is effective but inefficient. For example, it uses 4 bits
to select the ALU function but only one of those bits should be asserted
at any time. Thus, we can save some memory by storing the number of the
active bit and using an external "decoder" circuit to translate
the two bits into the four lines that control the ALU.
In a simple design such as the PDP-8, it may seem that this (two bit) savings
is trivial, but in a CISC ISA, the number of control signals can be quite
large and it is important to minimize the number of bits in the microinstruction
format. Every location in the microcode memory has to store the same set
of bits, so any waste of bits is multiplied by the number of words.
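The decoder idea and the multiplication of waste can be illustrated with a short sketch. The 256-word microcode store is a made-up size for the example; the 23-bit and 21-bit widths follow from the PDP-8 format discussed above with and without the two-bit encoding of the ALU select.

```python
def decode_alu_select(code):
    """2-to-4 decoder: turn a 2-bit code into four one-hot control lines."""
    return [1 if code == line else 0 for line in range(4)]


def rom_bits(word_width, words):
    """Total storage for a microcode ROM."""
    return word_width * words


WORDS = 256                                   # hypothetical microcode size
saving = rom_bits(23, WORDS) - rom_bits(21, WORDS)
print(decode_alu_select(2), saving)
```

Two bits saved per word become 512 bits saved over 256 words, and the saving grows in proportion to the size of the microcode.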
Microinstruction format designs are often classified by the width of the
word employed. One approach is to have a very wide word that contains all
of the control signals necessary to drive the system. Such a design is referred
to as a "horizontal" microinstruction format.
Another approach is to use a narrower word, with a sequence of microinstructions
being required to drive all of the control signals. This is called a "vertical"
microinstruction format.
At first glance, it appears that a vertical microinstruction is inherently
slower -- it takes a sequence of operations to accomplish what the horizontal
microinstruction can do in a single cycle. But consider that in many cases,
individual control lines are asserted only in certain minor states. If all
of these are grouped together by minor state, then they can reuse some of
the bits of the microword by having their outputs first fed to a "demultiplexer"
that steers them to the proper signal lines according to the current minor
state (which may itself be part of the instruction). In effect the microinstruction
format is using multiple instruction formats to reduce redundancy.

Horizontal microcode has been employed in massively parallel array processors
where every processor in the system shares a single controller. Often the
controller is itself a full-fledged computer, and so the microinstruction
contains both traditional machine code for the controller itself as well
as the control signals that are distributed to the processors that make
up the array. A typical horizontal microinstruction for this type of machine
is 128 bits wide (16 bytes). Thus, every effort is made to reduce the number
of words required. It is important to note, however, that this is an unusual
application of microcode and that an implementation of such a microcode
engine would not be on a single chip -- thus, standard ROM parts would be
used to hold the microcode (instead of building the ROM as part of the processor
chip).
One other problem with horizontal microcode is that it is difficult to drive
such a large number of signals simultaneously. The switching of so many
drivers at once can cause the power supply voltage to sag momentarily (the
same as when your lights dim as you turn on a big appliance). This in turn
causes noise to appear on signal lines that can cause erroneous behavior
in other parts of the computer. Avoiding this requires careful circuit
design and sometimes clever tricks, such as ensuring that the signals are
asserted in a series of slightly offset time steps.
Of course, the whole reason for using microcode is to manage the complexity
of a CISC ISA's control unit. Most RISC designs, even those that have fairly
complex implementations, are still sufficiently regular that their control
units can be directly constructed from a FSM built with combinational logic.
(The advantage of using combinational logic is that it is easier to build
a fast decoder with it than with a microcode ROM.)
However, for a CISC ISA, the speed decrease resulting from the use of microcode
is often outweighed by the need to manage the complexity of controlling
the architecture (a slow processor is, after all, more useful than a faster
one that doesn't work). In a CISC architecture there may be a large number
of instruction types, each with different fields referring to a wide range
of registers that have asymmetrical functions, or referring to one or more
memory operands with as many as 20 different addressing modes. (The DEC
VAX, Intel 80X86, and Intel iAPX 432 are prime examples).
A control unit for a CISC ISA may thus have to deal with instructions that
involve tens and even hundreds of minor states. There would thus be thousands
of logic expressions to generate the control signals. CAD tools can simplify
and minimize these and even lay them out on the silicon automatically, but
it is still a large block of irregular logic on the chip.
More importantly, if a mistake is discovered later (i.e., one of the logic
expressions is wrong), then it may be necessary to resimplify the entire
design and lay it out again, which could mean a redesign of the rest of the
chip to accommodate a change in the size of the control unit. This is obviously
a very costly error. Unfortunately, it is also common. Even with the best
design and simulation tools, several commercial chips have gone into production
with errors that were discovered later. An early 68000 design had a bug
that would cause the processor to hang in certain cases. Intel shipped half
a million Pentium processors with an error in the floating-point division
instruction before it was caught, and has had bugs in earlier processors.
Since errors do occur, manufacturers of CISC processors use microcode to
reduce the cost of correcting the errors (and to help simplify the initial
design, which in itself helps to reduce errors). A bug-fix in a microcoded
controller is just a matter of changing the ROM, which does not affect the
size of the controller at all. It is a very low-cost correction.
In addition, a microcoded design is easier to enhance because unused op-codes
can be turned into new instructions by simply extending the microcode. This
simplicity of extension may be another factor that has led to the increasingly
complex ISAs produced by CISC manufacturers.
At one point, it was even thought that allowing the user to add to the microcode
was a good idea. If a user has a particular operation that they want to
accelerate, they can code it up as a new instruction in the microcode. Such
machines were said to have "writeable control stores." The VAX
11/780, the first model in the VAX line, was one of the most widely sold
of these machines. However, the feature was rarely used for two reasons:
first, the compilers could not take advantage of the custom instructions
so the user had to program in assembly language; second, the microcode is
tied to the machine implementation (it refers to the particular control
signals in the design) so it is not even portable to another model of the
same architecture.
In some cases, such as the Motorola 68000, there is also a nanocode engine.
The 68000 uses 544 17-bit words in its microengine and 336 68-bit words
in its nanocode engine. It thus has 32,096 bits of ROM. If everything had
been done with 68-bit words, it would have required 36,992 bits.
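The arithmetic behind these figures is easy to check. The word counts and widths in the sketch below are the ones quoted above; the variable names are introduced only for the example.

```python
MICRO_WORDS, MICRO_WIDTH = 544, 17    # microcode ROM
NANO_WORDS, NANO_WIDTH = 336, 68      # nanocode ROM

two_level_bits = MICRO_WORDS * MICRO_WIDTH + NANO_WORDS * NANO_WIDTH
one_level_bits = MICRO_WORDS * NANO_WIDTH   # every microword 68 bits wide
saved_bits = one_level_bits - two_level_bits

print(two_level_bits, one_level_bits, saved_bits)
```

The two-level organization stores 9,248 + 22,848 = 32,096 bits, which is 4,896 bits fewer than the 36,992 bits a single 68-bit-wide store would need.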
The M68000 microcode is very unusual in that the microcode implicitly calls
the nanocode. Each microcode instruction causes a corresponding nanocode
instruction to be fetched automatically. The nanocode bits are actually the control
signals that get distributed across the machine. The microcode instructions
thus have only to determine what the next instruction will be. They have
two formats, one for an unconditional jump (perhaps just to the next location)
and one for a conditional jump (two bits of the jump address are reserved for the
result of the conditional test). This would seem to imply that there are
as many nanocode instructions as microcode instructions. Yet we can see
that there are 208 fewer. This is accomplished by carefully assigning the
addresses so that common nanocode operations can have multiple microcode
locations corresponding to them. The address space allows for 1024 instructions,
and they are arranged so that if a bit (or several) is ignored, the same
nanocode address is produced. Essentially the engineers mapped the microcode
operations into locations so that certain of the address bits are "don't
cares" and all of those locations are then mapped to the same nanocode
address. The don't cares are achieved by removing selected transistors in
the address decoders of the nanocode ROM.
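The effect of don't-care bits in an address decoder can be sketched as a mask-and-match lookup. The masks, patterns, and nanoword names below are invented for illustration and are not the actual 68000 assignments; only the 10-bit address width follows from the 1024-entry address space mentioned above.

```python
# Each decoder row is (care_mask, pattern, nanoword). A 0 in care_mask marks a
# don't-care address bit, like a removed transistor in the decoder.
NANO_ROWS = [
    (0b1111111100, 0b0000010000, "nano_A"),   # low two address bits ignored
    (0b1111111111, 0b0000010100, "nano_B"),   # fully decoded address
]


def select_nanoword(micro_address):
    """Return the nanoword selected by a 10-bit microcode address, if any."""
    for care_mask, pattern, nanoword in NANO_ROWS:
        if (micro_address & care_mask) == pattern:
            return nanoword
    return None


print([select_nanoword(a) for a in range(0b0000010000, 0b0000010101)])
```

Here four different microcode addresses share one nanoword, which is how the nanocode store can be smaller than the microcode store.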
Intel 8088: 504 21-bit microwords
Motorola 68000: 544 17-bit microwords, plus 336 68-bit nanowords
DEC LSI-11: 2048 22-bit microwords
IBM 3033 Mainframe: 2048 108-bit microwords plus 2048 126-bit microwords
Texas Instruments 8800: 32K 128-bit microwords in user-programmable RAM
UMass/Hughes IUA-2: 64K 128-bit microwords in RAM