Using Multiple Clock Cycles to Implement the Datapath

Disadvantages of the Single Cycle Datapath

Executing an instruction in a single clock cycle as shown in the text is an unrealistic approach. For example, we know that at least an instruction register is required to deal with the fact that an instruction fetch from memory may not be completed in a single cycle, and also because it is unreasonable to expect the memory to drive the entire data path. The text assumes a memory that always responds in time, and that has enough power to drive the datapath. We also know that some computations, such as multiplication, typically take multiple cycles.

From our prior discussions we know that memory is generally slower than the computational components. Thus, if we are forced to run a processor with a cycle time corresponding to a memory cycle, we will be holding back the potential speed of computation.

(Side note: this may raise the question, "How can a processor ever run at full speed when it has to fetch instructions from memory on every cycle." The answer is that it takes a fast cache memory with a wide data path that delivers multiple instructions on each cycle to an instruction prefetch buffer that is built with registers. The buffer can feed the IR at full speed, and even though the cache is slower, the fact that it outputs multiple instructions at once allows it to keep up. Notice that this is only possible with a multicycle design.)

Without the IR, the design is also forced to employ a split instruction and data memory because a combined memory could not provide both the instruction and data for an operation in a single cycle. The use of an IR allows us to combine the two memories, but if we do so, then we also need to split an operation into distinct cycles because we must clock the memory to cause the read and then the IR to cause it to store the fetched instruction. (Without even considering that a Load or Store instruction requires additional clock cycles to execute a second memory access cycle.)

In addition, the single cycle design requires that all instructions take the same length of time to execute. In RISC processor, most of the instructions are 1 CPI, but not all. For example, floating point instructions may take an order of magnitude longer to execute. Thus, if there can be only one length of instruction cycle, it would have to be perhaps 10 times longer than the majority of instructions require.

For a CISC ISA it would be impossible to build a single-cycle datapath because many of the instructions involve multiple memory fetch operations, each of which takes a clock signal to drive the memory. Our earlier comparisons of CISC and RISC datapaths were only conceptually possible because an IR was introduced.

There are some machines for which a single-cycle datapath has worked reasonably well. These are narrow data path processors such as microcontrollers and digital signal processors. In such machines, the data width of the processor is perhaps one to 8 bits wide. Because all of the instructions operate on just a few bits, and the processor tends to have only a very simple instruction set, there is just a single execution time for any instruction. Every operation of greater complexity is broken down into multiple instructions. For example, multiplication is programmed using shifts and adds.

Advantages of the Multiple Cycle Datapath

In addition to being able to run simpler instructions in less time, the use of multiple cycles allows us to reuse the computational ALU for address calculation. To see why this is possible, consider that the relative branch offset and PC are available as soon as the IR is filled. But any register operations that go to the ALU require register data to be fetched -- which can't occur until one cycle after the IR is loaded. Thus, we can feed the branch offset and PC to the ALU during this cycle and store the result in a temporary register (Target) for later use. Then, on the next cycle, the register values arrive at the ALU and the computation is carried out.

There is one problem with this approach. What is it?

The instruction is only just being decoded during the same cycle! So how do we know whether we have a branch or not?

The answer is that we don't care. If we don't have a branch, we can just ignore the value in Target and there is no harm done (the ALU would simply have been idle for that cycle anyway).

But what if we do have a branch? We have this useless computational result that has been produced from misinterpreting the fields of the IR. (Recall that in the MIPS as in any RISC ISA, the branch and register instructions have different formats.) Actually, the calculation of the computational result has not been entirely in vain. As long as we’ve been careful in the design of the ISA, the two register fields in a branch can be put in the same place as the two source register fields of a register operation. Thus, the correct values will go through the ALU for either a branch or an execute instruction, and we can use the result of comparing the register values to decide whether to take the branch (that is, we can look at the output that would go to a flag register, and simply ignore the main ALU output). Note that this is another clever trick that would be difficult to make use of in a CISC ISA.

Going back to the case where we have an execution instruction rather than a branch, the same two register fields from the IR direct the register file to feed a pair of values to the ALU. We simply store the main output from the ALU back into a register, instead of doing what we would for a branch (transferring the value in target to the PC).

A Multicycle Datapath

Let’s take a look at how we organize a multicycle datapath. The PC, via a multiplexer, goes to the memory. After being clocked, the output of the memory then goes to the IR. Fields within the IR directly go to the register file and the jump address calculation. A portion of the IR containing op-code bits that determine the ALU operation are also sent to the ALU. On the next clock, the registers output their selected values. Assuming that the ALU operations can be done within a clock cycle, the result emerges and is stored back into the destination register (or the PC) at the end of that cycle. Thus, for a branch or execute instruction, only two clock cycles are required. Obviously, if an execute operation takes multiple cycles (such as multiply) then more than two cycles are required. For a load or store, an address goes back to memory, which must fetch a value that is then stored into a register. Thus, assuming a single-cycle memory, at least one extra clock cycle is required.

These are minimum times, and assume that operations occur on both the rising and falling edges of each clock cycle. A more reasonable estimate would be to assume a whole clock cycle to fetch, another cycle to get the values from the registers, and third cycle to compute, and a fourth cycle to write back to the registers. Under that model, let’s look at how the different instructions proceed through the datapath.

Trace execution of register operations (4 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

Execution, ALU operation complete

Write back to destination register

Trace execution for load/store operations (4 or 5 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

Execution, ALU computed register + sign-extended address from IR

Write Src2 into Memory(ALU result) OR Read Memory(ALU result)

Write back Read data to Destination Register

Trace execution for branch (3 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

IF Src1 = Src2 then PC gets target

Trace execution for jump (3 cycles):

Instruction Fetch to IR, PC <- PC + 4

Instruction Decode, Src1 and Src2 Fetch, Target <- PC + (Sign-extended address from IR shifted 2 left)

PC gets high order bits of PC concatenated with jump address part of IR shifted 2 left

 

We can thus see that the shortest operations are branches and jumps, and the longest are the loads and stores (assuming execution operations all take a single cycle for computation).

 

We have now seen how to construct every part of the processor except the control unit. Many of the components of the instruction an data paths have control inputs – ALU operation, multiplexer select inputs, activation of shifters to align operands, and so on. All of these control signals must be given values for each clock cycle. As we see in the next discussion, the control unit is a finite state machine that takes in various inputs and generates all of these signals at the proper times with respect to instruction execution.

 

 


© Copyright 1995, 1996, 2001 Charles C. Weems Jr. All rights reserved.


Back to Chip Weems' home page.


Back to courses index page.


Back to Computer Science Department home page.