CmpSci 535 Notes from Lecture 7

Data Path

Because of the complexity of the implementation of a datapath, we'll go over the example in the text and take a look at the datapaths in some other architectures.

The basic design of any datapath involves an instruction fetch unit (program counter, memory), data load/store unit (address register, memory, address ALU), and an execution unit (registers, ALU, control unit).

Instruction Fetch

The instruction fetch unit is basically a device that takes the program counter, presents it to the memory as an address, signals a read cycle on the memory, and latches the memory output to the instruction register. In addition it must handle the increment of the PC to get the next instruction, and the addition of a relative jump address for PC relative jumps, or the substitution of a branch address for direct branches.



In the MIPS and other RISC processors, the instruction increment adder has a fixed input (4 if byte addressing, 1 if word addressing). The advantage of knowing this is that a specialized adder can be used -- there is just a single bit input to the second operand, and thus the bits below that point can pass through unaltered and the bits above need only propagate a carry (if necessary).
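As a rough illustration of why a fixed increment simplifies the adder, the following Python sketch models a byte-addressed PC incrementer that adds 4. The function name and the 32-bit width are assumptions made for this example, not details from the text. The two low bits pass through untouched, and each higher bit needs only a half adder (an XOR for the sum and an AND for the carry) rather than a full adder.

```python
def increment_pc_by_4(pc, width=32):
    """Model a specialized PC + 4 incrementer bit by bit (hypothetical helper)."""
    result = pc & 0b11            # bits 0 and 1 pass through unaltered
    carry = 1                     # the single 1 bit of the constant 4 enters at bit 2
    for bit in range(2, width):
        a = (pc >> bit) & 1
        result |= (a ^ carry) << bit   # half-adder sum
        carry = a & carry              # half-adder carry out
    return result                      # a carry out of the top bit is dropped


print(hex(increment_pc_by_4(0x0040000C)))
```

Once the carry becomes 0, every remaining bit is simply copied, which is the sense in which the upper bits "need only propagate a carry."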

With a CISC processor, the lengths of instructions vary and so the PC must be incremented by different amounts. This requires that the instruction register be decoded to the point that the length of the instruction can be determined and then that value is passed to the adder. Note, however, that if the instruction spans multiple words, then the fetch unit must cycle that many times to fetch all of the parts of the instruction. It must also fill the appropriate portion of a longer instruction register, and so the control unit must generate an address into the instruction register for each of these cycles.

In a processor such as the 80486, instructions can also be less than a word in length. Thus we might see two instructions in one word, in which case the fetch unit must be able to realign the second instruction once the first is executed (or the control unit must be able to selectively read part of the IR).

In an actual implementation of the Intel architecture, the fetch unit fills a buffer several words long with instructions and it simply watches as the control unit consumes them, refilling the buffer whenever there is a spare memory bus cycle. This gives the control unit more flexible access to the variable width instructions, but also introduces the need for logic to flush the buffer in the event of a branch.

Load and Store

The data load/store unit is similar to the instruction fetch unit in that it provides an interface between memory and register storage. However, it must handle writes in addition to reads.



The data address may come directly from an instruction or be the result of some operation on the instruction and/or the registers. When the address must be computed, it may be generated by a special address arithmetic ALU, or by the same ALU used for normal computation. In the design shown in the book, the latter approach is taken.

Note that the MIPS design shows separate data and instruction memories. This is because the MIPS design uses a cache that is divided in this manner. So even though data and instructions are intermixed within main memory, they are in separate memories at the level of the hierarchy closest to the registers. This is a common design approach among many RISC processors now that there is room on the chip for two caches of sufficient size. However, there are still designs that use a "unified" cache, and in this situation the instruction fetch unit and the load/store unit must compete for access to the memory.

Execution Unit

The execution unit contains the data registers and the ALU. It selects the registers to operate on, feeds them to an ALU, selects the appropriate result, and sends it back to the registers. In a load or store, the ALU may be bypassed so that the data moves between the registers and memory unchanged. The ALU may also serve to compute a branch address.



With a RISC instruction set, the selection of the source and destination registers is straightforward because they are fixed fields in the IR, and it is simply a matter of connecting them to the register file's address inputs. In a CISC instruction set, the source and destination registers may be specified in different parts of the IR, and even in different words. Thus, the IR fields must pass through a steering selector that is controlled by the control unit.

Notice that for a load, the SRC1 supplies the value to be added to the instruction's address, and the data itself comes from SRC2. The multiplexer ensures that the address gets steered to the ALU in place of SRC2 in this case.

In a RISC ISA, there are just a few types of instructions, and they are either pure register to register, or load/store in their treatment of data. With a CISC architecture, the source inputs to the ALU can include memory (actually a hidden memory data register) and a register, so there would be an extra input to the lower multiplexer, and a second multiplexer on the SRC1 input, to handle memory data. The result data also can go directly to memory, so a demultiplexer would be needed to steer the ALU result either back to the registers or out to memory.

A CISC ISA has more addressing modes than a RISC ISA, which would result in greater complexity in the address calculation step. The RISC processor can basically use a register together with the address in an instruction. A CISC address may be generated from an indirect reference (requiring another memory fetch to the memory data register), or multiple registers (e.g., base + offset), or an extra instruction word (the extended IR), and possibly with a pre- or post-increment (or decrement) of a register (requiring a source value to pass through an increment/decrement unit either before or after the ALU, and be passed back to that register). Given this specialized use of the ALU, it is common in CISC architectures to have a separate addressing ALU.
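To make the difference in address calculation concrete, here is a minimal Python sketch of a few of the addressing modes mentioned above. The mode names, the register and memory dictionaries, and the increment size of 1 are all assumptions chosen for illustration; they do not describe any particular ISA.

```python
def effective_address(mode, regs, memory, base=None, index=None, disp=0):
    """Return the effective address for a few illustrative addressing modes."""
    if mode == "displacement":        # register + address field (the RISC case)
        return regs[base] + disp
    if mode == "base_index":          # two registers added together
        return regs[base] + regs[index]
    if mode == "indirect":            # extra memory fetch to obtain the real address
        return memory[regs[base] + disp]
    if mode == "post_increment":      # use the register, then write back value + 1
        address = regs[base]
        regs[base] = address + 1
        return address
    if mode == "pre_decrement":       # update the register first, then use it
        regs[base] -= 1
        return regs[base]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"r1": 100, "r2": 8}
memory = {104: 500}
print(effective_address("indirect", regs, memory, base="r1", disp=4))
```

Only the first case is needed for a typical RISC load or store. The others each add a memory access, a second register read, or a register write-back, which is where the extra datapath hardware comes from.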

Control Unit

The control unit is a finite state machine that takes as its inputs the IR, the status register (which is partly filled by the status output from the ALU), and the current major state of the cycle. Its rules are encoded either in random logic or in a PLA, and its outputs are sent across the processor to each point requiring coordination or direction from the control unit.

For example the outputs needed for the preceding units are Jump/Branch/NextPC, IR Latch, Read Control, Load Control, ALU Function Select, Load/Reg-Reg, Reg R/W.



The ALU function select takes the instruction op code and translates it into a given function of the ALU (either one line per ALU function or a compact binary code for the function). The Jump/Branch/NextPC signal depends on the instruction type and in a RISC architecture these may be directly coded in the op code. Read control occurs at the start of an instruction cycle. IR latch occurs at the end of the fetch state. Load control happens at the end of the data fetch state of a load instruction. Load/Reg-Reg again depends on the op-code. Register R/W is in the start of the data fetch stage and at the write back stage of an operation. It thus depends on the major state and the instruction.

A CISC architecture typically uses a more complex control unit. As noted before, the IR is often multiple words, and the control unit has to look at different parts of the IR at different stages of execution. In fact, the entire IR may not be available at once, requiring interlocks with fetch logic to ensure the contents of the IR are valid.

There are many more control signals coming out of a CISC control unit, partly to control the more complex addressing logic, but also to directly connect to the many special purpose registers. In a RISC architecture, the registers are accessed uniformly in a block so a simple decoder in the register file can select the particular register. In a CISC architecture, there are restrictions on the particular registers that can be used by a given instruction and these are enforced by the control unit.

Using Multiple Clock Cycles to Implement the Datapath
Disadvantages of the Single Cycle Datapath

Executing an instruction in a single clock cycle as shown in the text is an unrealistic approach. In the preceding discussion we saw that at least an instruction register is required to deal with the fact that an instruction fetch may not be completed in a single cycle, and also because it is unreasonable to expect the memory to drive the entire data path. The text assumes a memory that always responds in time, and that has enough power to drive the datapath.

From our prior discussions we know that memory is generally slower than the computational components. Thus, if we are forced to run a processor with a cycle time corresponding to a memory cycle, we will be holding back the potential speed of computation.

(Side note: this may raise the question, "How can a processor ever run at full speed when it has to fetch instructions from memory on every cycle?" The answer is that it takes a fast cache memory with a wide data path that delivers multiple instructions on each cycle to an instruction prefetch buffer that is built with registers. The buffer can feed the IR at full speed, and even though the cache is slower, the fact that it outputs multiple instructions at once allows it to keep up. Notice that this is only possible with a multicycle design.)

Without the IR, the design is also forced to employ the split instruction and data memory because a combined memory could not provide both the instruction and data for an operation in a single cycle. The use of an IR allows us to combine the two memories, but if we do so, then we also need to split an operation into distinct cycles because we must clock the memory to cause the read and then the IR to cause it to store the fetched instruction.

In addition, the single cycle design requires that all instructions take the same length of time to execute. In a RISC processor, most of the instructions are 1 CPI, but not all. For example, floating point instructions may take an order of magnitude longer to execute. Thus, if there can be only one length of instruction cycle, it would have to be perhaps 10 times longer than the majority of instructions require.
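A small worked calculation shows the size of this penalty. The instruction mix below (95% simple instructions taking 1 time unit, 5% floating point instructions taking 10) is a made-up example, not measured data, and the function names are invented for this sketch.

```python
def average_time_single_cycle(mix):
    """Every instruction takes as long as the slowest one in the mix."""
    return max(latency for _, latency in mix.values())


def average_time_variable(mix):
    """Each instruction takes only as long as it needs."""
    return sum(fraction * latency for fraction, latency in mix.values())


# Hypothetical mix: name -> (fraction of instructions, latency in time units)
mix = {"simple": (0.95, 1), "floating_point": (0.05, 10)}
single = average_time_single_cycle(mix)
variable = average_time_variable(mix)
print(f"single-cycle: {single}, variable: {variable:.2f}, ratio: {single / variable:.1f}")
```

Under these assumed numbers the fixed-length cycle costs 10 units per instruction on average, against 1.45 when each instruction takes only the time it needs.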

For a CISC ISA it would be impossible to build a single-cycle datapath because many of the instructions involve multiple memory fetch operations, each of which takes a clock signal to drive the memory. Our earlier comparisons of CISC and RISC datapaths were only conceptually possible because an IR was introduced.

There is one class of machines for which a single-cycle datapath has worked reasonably well. These are the bit-serial array processors. In such machines, there are thousands of processors that are only one bit wide. To add an 8-bit number, one of these processors must perform a series of 8 1-bit adds with a carry being saved each time for the next instruction. Because all of the instructions operate on just one bit, there is just a single execution time for any instruction. Every operation of greater complexity is broken down into these 1-bit instructions. However, this is an extreme instance of a datapath. Even in these processors, operations such as I/O and communication with each other have to be given multiple cycles.

Advantages of the Multiple Cycle Datapath

In addition to being able to run simpler instructions in less time, the use of multiple cycles allows us to reuse the computational ALU for address calculation. To see why this is possible, consider that the relative branch offset and PC are available as soon as the IR is filled. But any register operations that go to the ALU require register data to be fetched -- which can't occur until one cycle after the IR is loaded. Thus, we can feed the branch offset and PC to the ALU during this cycle and store the result in a temporary register (target) for later use. Then, on the next cycle, the register values arrive at the ALU and the computation is carried out.

There is one problem with this approach. What is it?

The instruction is only just being decoded during the same cycle! So how do we know whether we have a branch or not?

The answer is that we don't care. If we don't have a branch, we can just ignore the value in target and there is no harm done (the ALU would simply have been idle for that cycle anyway).

But what if we do have a branch? We have this useless computational result that has been produced from misinterpreting the fields of the IR. (Recall that in the MIPS as in any RISC ISA, the branch and register instructions have different formats.) Actually, the computational result has not been entirely in vain. Because of the regularity of the ISA, the two register fields in a branch are in the same place as the two source register fields of a register operation. Thus, the correct values have gone through the ALU, and we can use the result of comparing them to decide whether to take the branch (and simply ignore the other ALU output). Note that this is another clever trick that would be difficult to make use of in a CISC ISA.

Going back to the case where we don't have a branch, the same two register fields feed the ALU and we simply keep the output from the ALU instead of transferring the value in target to the PC.
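The trick can be sketched in a few lines of Python. This is only a model under stated assumptions: a 32-bit PC, a 16-bit offset in the low bits of the instruction that is sign-extended and shifted left by 2 (as in MIPS), and a branch-on-equal test. The function names are invented for this sketch.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode_cycle_target(pc_plus_4, instruction):
    """Second cycle: compute the target before we know whether this is a branch."""
    offset = sign_extend_16(instruction & 0xFFFF)
    return (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF


def next_pc(pc_plus_4, target, is_branch, src1, src2):
    """Third cycle: target is used only for a taken branch, otherwise ignored."""
    if is_branch and src1 == src2:
        return target
    return pc_plus_4


target = decode_cycle_target(0x1004, 0x10000003)
print(hex(next_pc(0x1004, target, True, 5, 5)))
```

Note that decode_cycle_target runs for every instruction. For a non-branch the value is simply never selected, which is the "we don't care" case described above.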

A Multicycle Datapath

Trace execution of register operations (4 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

Execution, ALU operation complete

Write back to destination register

Trace execution for load/store operations (4 or 5 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

Execution, ALU computed register + sign-extended address from IR

Write Src2 into Memory(ALU result) OR Read Memory(ALU result)

Write back Read data to Destination Register

Trace execution for branch (3 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

IF Src1 = Src2 then PC gets target

Trace execution for jump (3 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

PC gets high order bits of PC concatenated with jump address part of IR shifted 2 left
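The four traces can be tabulated to count cycles for a short instruction sequence. The sketch below just restates the steps listed above as data. The class names, the split of load/store into separate "load" and "store" entries, and the sample program are assumptions for illustration.

```python
# The first two cycles are shared by every instruction class.
COMMON = [
    "fetch: IR <- Mem(PC), PC <- PC + 4",
    "decode: read Src1 and Src2, Target <- PC + (sign-extended address << 2)",
]

STEPS = {
    "register": COMMON + ["execute: ALU operation",
                          "write back: destination register <- ALU result"],
    "store": COMMON + ["execute: ALU <- register + sign-extended address",
                       "memory: Mem(ALU result) <- Src2"],
    "load": COMMON + ["execute: ALU <- register + sign-extended address",
                      "memory: read Mem(ALU result)",
                      "write back: destination register <- read data"],
    "branch": COMMON + ["if Src1 = Src2 then PC <- Target"],
    "jump": COMMON + ["PC <- high-order PC bits : (jump address << 2)"],
}


def cycles(kind):
    """Number of clock cycles for one instruction of the given class."""
    return len(STEPS[kind])


def total_cycles(program):
    """Total cycles for a sequence of instruction classes."""
    return sum(cycles(kind) for kind in program)


print(total_cycles(["load", "register", "store", "branch"]))
```

A single-cycle design would have to stretch every instruction to the length of the load, the longest path through the table.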

Control for the Multicycle Datapath

The text shows how this datapath can be controlled by a finite state machine with just nine states.



Unlike the way this is drawn in the text, it is quite obvious here that each level of the controller's finite state machine corresponds to a clock cycle (or major state). This view clearly shows the commonality of the first two major states, prior to the decoding of the instruction. The third stage is a typical fanning out of the finite state machine to deal with the different cases of the instruction types. Thus, we can see that the machine is arranged in a matrix where the rows are major states, and the columns are major instruction types. Because the new PC values have been computed, the Jump and Branch types can be finished early. The memory read takes the longest.

Each of these states produces a set of control signals that cause the multiplexers to select the appropriate inputs, and the various registers to latch at the proper time. This is determined by looking at each device requiring control and determining for each state whether it is necessary to issue a signal on that particular control line.

The controller is implemented by a block of combinational logic that takes as its inputs the opcode, the status output from the ALU, and the state register. Note that in the text, the circuitry controlled by the status from the ALU is shown external to the control for simplicity. But in a more general controller that must handle multiple conditions, this logic would not be separated. Because there are nine distinct states, the state register for this machine must have four bits.
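The state register size follows from the number of states: n bits can distinguish 2^n states, so nine states need four bits (three bits cover only eight). A minimal check, using a helper name invented for this example:

```python
def state_register_bits(num_states):
    """Smallest number of bits that can encode num_states distinct states."""
    return max(1, (num_states - 1).bit_length())


print(state_register_bits(9))
```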

One common approach to implementing a more complex controller is to have the major state be represented by a counter and within each of these states the different columns are represented by the state register. The major state counter simply increments on each clock cycle, and when a column of states finishes early, its last state generates a reset signal to the major state counter.

Another issue that is not addressed here is what happens when there is a memory delay. We are still assuming that memory returns in a single cycle. The simplest case is to stall the finite state machine until the fetch is complete. This is done by having a memory wait signal that is input to the finite state machine. Each memory fetch state has a next state arc that loops back to itself whenever memory wait is asserted.
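The stall can be sketched as a next-state function with a self-loop. The fragment below is not the controller from the text. It is a hypothetical three-state machine (FETCH, DECODE, EXECUTE) in which only FETCH accesses memory, chosen to keep the example short.

```python
def next_state(state, memory_wait):
    """Next-state logic for a hypothetical FETCH -> DECODE -> EXECUTE loop."""
    if state == "FETCH":
        # Memory fetch state: loop back to itself while memory wait is asserted.
        return "FETCH" if memory_wait else "DECODE"
    if state == "DECODE":
        return "EXECUTE"
    return "FETCH"


def run(wait_pattern, start="FETCH"):
    """Apply one memory_wait value per clock cycle and record the states visited."""
    state = start
    trace = [state]
    for memory_wait in wait_pattern:
        state = next_state(state, memory_wait)
        trace.append(state)
    return trace


print(run([True, True, False, False, False]))
```

With memory wait asserted for two cycles, the machine simply sits in FETCH for two extra cycles and then proceeds as usual. No other state needs to know that a stall occurred.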

Another Simple Datapath Example: the PDP-8

The DEC PDP-8 is a frequently cited example of an almost trivial datapath. We'll quickly take a look at it and note some differences in the implementation approach. The PDP-8 is a 12-bit word machine with a single register (called the Accumulator). It thus falls into the class of single address computers. The majority of the instructions are specified by the upper 3 bits of a 12-bit word. There are thus 8 major instruction classes:

0xxx Logical AND location xxx with Accumulator

1xxx Add location xxx to Accumulator

2xxx Increment location xxx and skip next instruction if the result is 0

3xxx Store Accumulator in location xxx and clear Accumulator

4xxx Subroutine jump to xxx + 1 and store return address in xxx

5xxx Unconditional jump to xxx

6xxx I/O value in Accumulator with device according to xxx

7xxx Subinstruction code specified by xxx

In the DEC scheme, the high order bit is number 0 and the low order bit is 11. Operations 0 through 5 use an addressing scheme in which bit 3 determines whether the address is direct or indirect, and bit 4 determines whether it refers to an address in the current 128-word "page" or the page that starts at address 0.
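Because DEC numbers bits from the high-order end, it is easy to get the shifts wrong when decoding by hand. The sketch below splits a 12-bit word into the fields just described. The function and field names are invented for this example, and the 7-bit offset field (bits 5 through 11) is inferred from the 128-word page size. The addressing fields are only meaningful for operations 0 through 5.

```python
def decode_pdp8(word):
    """Split a 12-bit PDP-8 word into fields, using DEC numbering (bit 0 = MSB)."""
    word &= 0o7777
    return {
        "opcode": (word >> 9) & 0o7,       # bits 0..2: major instruction class
        "indirect": (word >> 8) & 1,       # bit 3: 0 = direct, 1 = indirect
        "current_page": (word >> 7) & 1,   # bit 4: 0 = page at address 0, 1 = current page
        "offset": word & 0o177,            # bits 5..11: address within a 128-word page
    }


print(decode_pdp8(0o1077))
```

For the octal word 1077 this yields op-code 1 (add), direct addressing, page 0, and offset 77 octal.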

The processor organization is as follows. Note that for the sake of further simplification, the Current Page mode logic has been omitted:



The processor has three major states: Fetch, Defer (Indirect address fetch), and Execute. Here is an example of an instruction execution:

1077 -- add location 77 directly to the accumulator

Major State 1 (Fetch)

Minor State 1 MAR <- PC

Minor State 2 MBR <- Mem(MAR), PC <- PC + 1

Minor State 3 Latch MBR into IR

Minor State 4 Decode Instruction = Add, Mode = Direct (no defer), Page = 0, Max St, No defer

Major State 2 (Execute)

Minor State 1 MAR <- 00000 + IR6..IR11

Minor State 2 MBR <- Mem(MAR), ALU Function = Add

Minor State 3 Acc <- ALU result, Max St

Now let's look at some of these control lines to see what logic expressions drive them:

No defer = (Major State = 1) AND (IR3 = 0)

Latch MBR = (Major State = 1) AND (Minor State = 3)

Increment PC = (Major State = 1) AND (Minor State = 2)

IR address/MBR = (IR0..IR2 = 4 or 5) AND (IR3 = 1)

Load PC = (Major State = 3) AND (Minor State = 3) AND (IR0..IR2 = 4 or 5)

Memory Data/ALU = (Major State = 3) AND (IR0..IR2 = 3)

MBR Latch = ((Major State = 1) AND (Minor State = 2)) OR ((Major State = 2) AND (Minor State = 2)) OR ((Major State = 3) AND (Minor State = 2) AND (IR0..IR2 < 4))

And so on. Essentially for each of the control signals we identify all of the conditions that could cause it to be asserted and add them to the expression for the given signal.
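A few of these expressions translate directly into code. The sketch below encodes three of the signals listed above (Increment PC, Latch MBR, and Load PC) and returns the set that is asserted in a given state. The function names and the use of a Python set are choices made for this example. The major state numbers follow the expressions above, with 1 for Fetch and 3 for Execute.

```python
FETCH, DEFER, EXECUTE = 1, 2, 3   # major state numbers used in the expressions


def opcode(ir):
    """IR0..IR2 in DEC numbering are the top three bits of the 12-bit word."""
    return (ir >> 9) & 0o7


def control_signals(major, minor, ir):
    """Return the names of the signals asserted in this state (partial list)."""
    signals = set()
    if major == FETCH and minor == 2:
        signals.add("Increment PC")
    if major == FETCH and minor == 3:
        signals.add("Latch MBR")
    if major == EXECUTE and minor == 3 and opcode(ir) in (4, 5):
        signals.add("Load PC")
    return signals


print(control_signals(EXECUTE, 3, 0o5200))
```

Each "if" corresponds to one product term. A signal with several causes, such as MBR Latch, would have several such terms ORed together.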

Microcode

In many CISC architectures the control unit can feed back to the major (and minor) states and has internal registers and a ROM. It can thus cause portions of an instruction to be extended or repeated as necessary. The minor states, registers, additional control logic, and ROM form a finite state machine called a microcode engine. In the microcode engine, the op-code from the IR becomes a jump address into the ROM. A micro-PC can be used to step through a series of fetches from the ROM starting at that point, with each fetch resulting in control signals being sent out and feedback to the major and minor state values.

An alternative to using a micro-PC is to have each instruction explicitly specify the address of its successor. Thus, one of the fields in the micro-op may be the address of the next instruction. This allows jumps to be used anywhere in the microcode with no time penalty -- consider that if a separate instruction had to be employed for a jump, it would add a cycle to the execution of the ISA instruction.
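The next-address scheme can be modeled with a small table. Everything in the sketch below is invented for illustration: the ROM addresses, the signal names (loosely borrowed from the PDP-8 example), and the two op-code entry points. Each microword holds a list of control signals and the address of its successor, so no separate micro-PC or jump microinstruction is needed.

```python
DISPATCH = "dispatch"   # marker: take the next address from the op-code
END = None              # marker: the ISA instruction is finished

# address -> (control signals to assert, address of the next microword)
MICRO_ROM = {
    0:  (["MAR <- PC"], 1),
    1:  (["MBR <- Mem(MAR)", "PC <- PC + 1"], 2),
    2:  (["IR <- MBR"], DISPATCH),
    10: (["MAR <- IR address"], 11),              # op-code 1 (add)
    11: (["MBR <- Mem(MAR)", "ALU = Add"], 12),
    12: (["Acc <- ALU result"], END),
    20: (["PC <- IR address"], END),              # op-code 5 (jump)
}
ENTRY_POINT = {1: 10, 5: 20}   # op-code -> ROM address of its microroutine


def run_instruction(opcode):
    """Step through the ROM for one ISA instruction and collect the signals issued."""
    address, issued = 0, []
    while address is not END:
        signals, next_address = MICRO_ROM[address]
        issued.append(signals)
        address = ENTRY_POINT[opcode] if next_address == DISPATCH else next_address
    return issued


for step in run_instruction(1):
    print(step)
```

Because every microword names its successor, the routines for different op-codes can sit anywhere in the ROM, at the cost of an address field in every word.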

The microcode instruction set can contain a subroutine jump so that common sequences of control outputs can be reused. This is typically present in systems that employ a micro-PC rather than a next instruction field. There is typically just one subroutine return register, so nesting of subroutines is not allowed. Thus, the subroutine jump may be implemented as a normal jump with a signal issued that stores the current micro-PC value into the return register. And the subroutine ends with a Return operation rather than a jump to another location. The Return operation issues a signal that loads the return register back into the micro-PC. In a system with an explicit next instruction field, the current address plus one is stored into the return register and it is implicit that the next location holds the next instruction following the subroutine call. (An alternative is to have both a subroutine jump address and a next instruction address, and simply return to the same instruction, but this requires the instruction format to be larger than necessary for most operations.)

Microinstruction Format Design

From the preceding discussion of how sequential execution, jumps, and calls are executed we can gather that the microinstruction needs to include control information for the microengine itself. But what else does it store?

In the simplest microinstruction formats, each bit represents a control signal that is sent out to the datapath. Thus, when a microinstruction is fetched, the bits in the instruction are connected to control wires and cause actions to occur in the datapath.

For example, in our PDP-8 example, there are 14 control signals that are driven by a single bit, and three others that each require multiple bits, for a total of 23 bits. Thus, we might use a microinstruction format such as:



Each of the first 20 bits of this microinstruction corresponds to a control signal in the PDP-8. For example, bit 0 might be the Halt, bit 1 the No defer, bit 2 the Max St., bit 3 Acc. Load, bit 4 Acc. clear, etc. This simple representation is effective but inefficient. For example, it uses 4 bits to select the ALU function but only one of those bits should be asserted at any time. Thus, we can save some memory by storing the number of the active bit and using an external "decoder" circuit to translate the two bits into the four lines that control the ALU.
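The decoder itself is simple. As a sketch (the function name and list representation are choices made for this example), a 2-bit code selects exactly one of four output lines:

```python
def decode_one_hot(code, lines=4):
    """Translate a binary code into one-hot control lines, as a decoder circuit would."""
    if not 0 <= code < lines:
        raise ValueError("code out of range for this decoder")
    return [1 if line == code else 0 for line in range(lines)]


for code in range(4):
    print(code, decode_one_hot(code))
```

The encoded form also makes it impossible to store a microword that asserts two ALU functions at once, which the one-bit-per-line format would allow.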

In a simple design such as the PDP-8, it may seem that this (two bit) savings is trivial, but in a CISC ISA, the number of control signals can be quite large and it is important to minimize the number of bits in the microinstruction format. Every location in the microcode memory has to store the same set of bits, so any waste of bits is multiplied by the number of words.

Microinstruction format designs are often classified by the width of the word employed. One approach is to have a very wide word that contains all of the control signals necessary to drive the system. Such a design is referred to as a "horizontal" microinstruction format.

Another approach is to use a narrower word, with a sequence of microinstructions being required to drive all of the control signals. This is called a "vertical" microinstruction format.

At first glance, it appears that a vertical microinstruction is inherently slower -- it takes a sequence of operations to accomplish what the horizontal microinstruction can do in a single cycle. But consider that in many cases, individual control lines are asserted only in certain minor states. If all of these are grouped together by minor state, then they can reuse some of the bits of the microword by having their outputs first fed to a "demultiplexer" that steers them to the proper signal lines according to the current minor state (which may itself be part of the instruction). In effect the microinstruction format is using multiple instruction formats to reduce redundancy.
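The steering idea can be sketched as follows. The grouping of signals by minor state is hypothetical (the signal names are only loosely based on the PDP-8 example), and the function name is invented. The point is that the same two microword bits drive different control lines depending on the minor state field.

```python
# Hypothetical grouping: which control lines the shared bits drive in each minor state.
SIGNAL_GROUPS = {
    1: ["MAR load", "PC to bus"],
    2: ["Memory read", "PC increment"],
    3: ["IR latch", "Acc load"],
}


def steer(minor_state, shared_bits):
    """Demultiplex the shared microword bits onto the lines for this minor state."""
    names = SIGNAL_GROUPS[minor_state]
    return {name for i, name in enumerate(names) if (shared_bits >> i) & 1}


print(steer(2, 0b11))
```

In this toy case six control lines are covered by two shared bits plus a two-bit minor state field, instead of six dedicated bits.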



Horizontal microcode has been employed in massively parallel array processors where every processor in the system shares a single controller. Often the controller is itself a full-fledged computer, and so the microinstruction contains both traditional machine code for the controller itself and the control signals that are distributed to the processors that make up the array. A typical horizontal microinstruction for this type of machine is 128 bits wide (16 bytes). Thus, every effort is made to reduce the number of words required. It is important to note, however, that this is an unusual application of microcode and that an implementation of such a microcode engine would not be on a single chip -- thus, standard ROM parts would be used to hold the microcode (instead of building the ROM as part of the processor chip).

One other problem with horizontal microcode is that it is difficult to drive such a large number of signals simultaneously. The switching of so many drivers at once can cause the power supply voltage to sag momentarily (the same as when your lights dim as you turn on a big appliance). This in turn causes noise to appear on signal lines that can cause erroneous behavior in other parts of the computer. Avoiding this requires careful circuit design and sometimes clever tricks, such as ensuring that the signals are asserted in a series of slightly offset time steps.

Why Microcode?

Of course, the whole reason for using microcode is to manage the complexity of a CISC ISA's control unit. Most RISC designs, even those that have fairly complex implementations, are still sufficiently regular that their control units can be directly constructed from a FSM built with combinational logic. (The advantage of using combinational logic is that it is easier to build a fast decoder with it than with a microcode ROM.)

However, for a CISC ISA, the speed decrease resulting from the use of microcode is often outweighed by the need to manage the complexity of controlling the architecture (a slow processor is, after all, more useful than a faster one that doesn't work). In a CISC architecture there may be a large number of instruction types, each with different fields referring to a wide range of registers that have asymmetrical functions, or referring to one or more memory operands with as many as 20 different addressing modes. (The DEC VAX, Intel 80X86, and Intel iAPX 432 are prime examples.)

A control unit for a CISC ISA may thus have to deal with instructions that involve tens and even hundreds of minor states. There would thus be thousands of logic expressions to generate the control signals. CAD tools can simplify and minimize these and even lay them out on the silicon automatically, but it is still a large block of irregular logic on the chip.

More importantly, if a mistake is discovered later (i.e., one of the logic expressions is wrong), then it may be necessary to resimplify the entire design and lay it out again, which could mean a redesign of the rest of the chip to accommodate a change in the size of the control unit. This is obviously a very costly error. Unfortunately, it is also common. Even with the best design and simulation tools, several commercial chips have gone into production with errors that were discovered later. An early 68000 design had a bug that would cause the processor to hang in certain cases, and Intel shipped half a million Pentium processors with an error in the floating-point division instruction before it was caught, and has had bugs in earlier processors.

Since errors do occur, manufacturers of CISC processors use microcode to reduce the cost of correcting the errors (and to help simplify the initial design, which in itself helps to reduce errors). A bug-fix in a microcoded controller is just a matter of changing the ROM, which does not affect the size of the controller at all. It is a very low-cost correction.

In addition, a microcoded design is easier to enhance because unused op-codes can be turned into new instructions by simply extending the microcode. This simplicity of extension may be another factor that has led to the increasingly complex ISAs produced by CISC manufacturers.

At one point, it was even thought that allowing the user to add to the microcode was a good idea. If a user has a particular operation that they want to accelerate, they can code it up as a new instruction in the microcode. Such machines were said to have "writeable control stores." The VAX 11/780, the first model in the VAX line, was one of the most widely sold of these machines. However, the feature was rarely used for two reasons: first, the compilers could not take advantage of the custom instructions so the user had to program in assembly language; second, the microcode is tied to the machine implementation (it refers to the particular control signals in the design) so it is not even portable to another model of the same architecture.

Nanocode

In some cases, such as the Motorola 68000, there is also a nanocode engine. The 68000 uses 544 17-bit words in its microengine and 336 68-bit words in its nanocode engine. It thus has 32,096 bits of ROM. If everything had been done with 68-bit words, it would have required 36,992 bits.
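The ROM sizes quoted above can be checked with a few lines of arithmetic. The word counts and widths come from the preceding paragraph. The variable names are just labels for this calculation.

```python
def rom_bits(words, width):
    """Total storage, in bits, of a ROM with the given word count and word width."""
    return words * width


micro = rom_bits(544, 17)          # microcode ROM
nano = rom_bits(336, 68)           # nanocode ROM
two_level = micro + nano           # what the 68000 uses
single_level = rom_bits(544, 68)   # if every microword were 68 bits wide
print(two_level, single_level, single_level - two_level)
```

The two-level organization saves 4,896 bits, roughly 13% of the single-level total.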

The M68000 microcode is very unusual in that the microcode implicitly calls the nanocode. Each microcode instruction causes a corresponding nanocode instruction to be fetched automatically. The nanocode bits are actually the control signals that get distributed across the machine. The microcode instructions thus have only to determine what the next instruction will be. They have two formats, one for an unconditional jump (perhaps just to the next location) and one for a conditional jump (two bits of the jump address are reserved for the result of the conditional test). This would seem to imply that there are as many nanocode instructions as microcode instructions. Yet we can see that there are 208 fewer. This is accomplished by carefully assigning the addresses so that common nanocode operations can have multiple microcode locations corresponding to them. The address space allows for 1024 instructions, and they are arranged so that if a bit (or several) is ignored, the same nanocode address is produced. Essentially the engineers mapped the microcode operations into locations so that certain of the address bits are "don't cares" and all of those locations are then mapped to the same nanocode address. The don't cares are achieved by removing selected transistors in the address decoders of the nanocode ROM.
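The effect of the don't-care bits can be modeled with a match value and a mask per nanocode row. The two rows and their addresses below are invented for illustration and are not taken from the actual 68000 ROM. A 0 bit in the mask plays the role of a removed decoder transistor.

```python
# (match value, care mask, nanoword name) for a 10-bit address space; all made up.
NANO_ROWS = [
    (0b0000000100, 0b1111111100, "nanoword A"),   # low two address bits ignored
    (0b0000001000, 0b1111111111, "nanoword B"),   # every address bit must match
]


def select_nanoword(micro_address):
    """Return the nanoword whose decoder row matches this microcode address."""
    for value, care_mask, nanoword in NANO_ROWS:
        if (micro_address & care_mask) == value:
            return nanoword
    return None


print([select_nanoword(address) for address in (4, 5, 6, 7, 8)])
```

Here four microcode addresses (4 through 7) share one nanoword, so four microinstructions cost only one 68-bit row of nanocode.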

Comparison of Some Microcode Engines

Intel 8088: 504 21-bit microwords

Motorola 68000: 544 17-bit microwords, plus 336 68-bit nanowords

DEC LSI-11: 2048 22-bit microwords

IBM 3033 Mainframe: 2048 108-bit microwords plus 2048 126-bit microwords

Texas Instruments 8800: 32K 128-bit microwords in user-programmable RAM

UMass/Hughes IUA-2: 64K 128-bit microwords in RAM



© Copyright 1995, 1996 Charles C. Weems Jr. All rights reserved.