Lecture 25: Sequential execution + Pipelines (by Trek Palmer)
================================================

Execution of a few ARM instructions
-----------------------------------

Of course, the ARM itself is considerably more complicated inside, but
we can pretend that it is organized in a single-bus manner.  Let's
examine the execution of a few instructions, starting with
ADDS R1, R2, R3.

ADDS R1, R2, R3:
1) write R2 onto the bus
2) store bus value into Y
3) write R3 onto the bus
4) select Y and ADD
5) write ALU output into Z
6) select condition code write
7) write Z onto the bus
8) read bus into R1

LDR R1, [R2, -R3]:
1) write R2 onto the bus
2) store bus value into Y
3) write R3 onto the bus
4) select Y and SUB
5) write ALU output into Z
6) write Z onto the bus
7) read bus into MAR
8) start memory read
9) wait for MFC
10) write MDR out to bus
11) read bus into R1

B 0xFEED:
1) write 0xFEED onto the bus (from decode)
2) read bus into Y
3) write 2 onto the bus (from decode)
4) select Y and LSL
5) write ALU output to Z
6) write Z onto the bus
7) read bus into Y
8) write PC onto bus
9) select Y and ADD
10) write ALU output to Z
11) write Z out to bus
12) read bus into PC

Branch was complicated because the encoding actually stores the offset
right-shifted by two, so it has to be shifted left by two before it can
be added to the PC.  Note also that because these values come from the
instruction itself, they were written onto the bus by the decoder
rather than from a register.

Multiple Buses
---------------

It is also possible to have multiple parallel buses operating in the
core of the processor.  For instance, to speed up register-register
operations, you could have two buses to hold register values, so that
in one clock tick you could read out both operands.  And if you had a
third bus that carried the ALU output to all the system registers, you
could speed up operations that write to registers (that is, most of
them).
So, in this 3-bus scheme an ADDS would happen thus:

ADDS R1, R2, R3:
1) write R2 onto bus A, and write R3 onto bus B
2) select bus A input and ADD
3) read bus C into R1

That's a reduction from 8 steps to 3.  That's pretty good.  In modern
processors there are many parallel buses.  Some carry data from
user-visible registers, but many of them are for communication between
scattered functional units and the complicated out-of-order control
logic.

Decoding and Control
---------------------

In the systems we have looked at so far the control was basically
hardwired.  The logic that dictated what to do when a branch
instruction was encountered was built out of unchanging gates and
wires.  This is not the only solution.  In a microprogrammed system,
there is actually a lower level of programmable control that tells the
system how to behave when it encounters instructions.

Microprogramming isn't like standard assembly programming.  It is
usually much more restricted: your only data are the system registers
and (usually) the IR.  You can mask and examine instruction bits, and
then set various control signals based on those values.  Some
microprogramming systems allow for limited control flow, which
complicates things.

Microprogramming is not something your average assembly programmer
deals with much anymore, but the microprogramming concept is alive and
well in the heart of the Pentium series of processors.  Ever since the
Pentium Pro, Intel has implemented the core logic with their own
proprietary RISC instruction set (called micro-ops, or uops).  But the
processor still executes x86 instructions, so how do they do it?  The
trick is to have a really beefy microcode engine that will translate
the nasty x86 instructions into a series of uops that will then
execute RISC-style in the superscalar core.  This is a colossal hack
and illustrates just how crucial backwards compatibility is in
architecture.
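
As a rough illustration of the microprogramming idea, here is a small
Python sketch in which the control sequence for an instruction is
stored as data (a list of micro-steps) instead of being hardwired.  The
step list is the single-bus ADDS R1, R2, R3 sequence from earlier in
this lecture.  Everything else (the MICROCODE table, the helper names,
the dict used as machine state, and modeling only the N and Z flags) is
made up for illustration and doesn't correspond to any real microcode
format.

```python
# Sketch only: micro-steps stored as data, in the spirit of a microprogram.
# The helper names and the dict-based machine state are hypothetical.

def bus_from(reg):
    """Micro-step: write <reg> onto the bus."""
    def step(s):
        s["bus"] = s[reg]
    return step

def bus_to(reg):
    """Micro-step: read the bus into <reg>."""
    def step(s):
        s[reg] = s["bus"]
    return step

def alu_add(s):
    """Micro-step: select Y and ADD (the ALU computes Y + bus)."""
    s["alu"] = s["Y"] + s["bus"]

def alu_to_z(s):
    """Micro-step: write ALU output into Z."""
    s["Z"] = s["alu"]

def write_flags(s):
    """Micro-step: condition code write (only N and Z flags are modeled)."""
    s["flag_N"] = s["Z"] < 0
    s["flag_Z"] = s["Z"] == 0

# The "microprogram store": one list of micro-steps per instruction.
MICROCODE = {
    "ADDS R1, R2, R3": [
        bus_from("R2"), bus_to("Y"),
        bus_from("R3"), alu_add, alu_to_z, write_flags,
        bus_from("Z"), bus_to("R1"),
    ],
}

def run(instruction, state):
    """Execute one instruction, one micro-step per clock tick; return ticks."""
    steps = MICROCODE[instruction]
    for step in steps:
        step(state)
    return len(steps)

state = {"R1": 0, "R2": 2, "R3": 3}
ticks = run("ADDS R1, R2, R3", state)
print(ticks, state["R1"])
```

With R2 = 2 and R3 = 3 this takes 8 ticks and leaves 5 in R1.
Supporting another instruction means adding another entry to the
table, which is roughly the appeal of microprogramming over hardwired
control.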

Pipelining
===========

We saw how going from one shared bus to multiple buses was able to cut
the execution of an ADDS from 8 steps to 3.  Ideally, we would like to
be able to finish an instruction every single cycle.  This is actually
possible using a concept called pipelining.  It may seem
counter-intuitive, but by introducing more steps you can speed up the
execution of instructions.

2-stage pipeline
----------------

For this simple example, we divide the execution of an instruction into
two phases: fetch and execute.  Fetch is the part of the execution that
gets the instruction and its operands out to the ALU, and execute is
the part that computes the result and writes it back to the register
file.  So, if we have a stream of 4 instructions, in a sequential
processor their execution would occur thus:

        F0 | E0 || F1 | E1 || F2 | E2 || F3 | E3
cycle:  1    2     3    4     5    6     7    8

Assuming a single cycle per phase, this stream of instructions would
take 2*4 = 8 cycles to execute.

Now, pipelining relies on the following observation: while instruction
0 is in the execute phase, the fetch hardware is sitting idle.  In a
pipeline, the fetch phase for the second instruction occurs at the same
time as the execute phase for the first instruction.  The execution
would look something like:

        F0 | E0
             F1 | E1
                  F2 | E2
                       F3 | E3
cycle:  1    2    3    4    5

Now the same stream of instructions takes only 5 cycles!  A pipeline
implements a kind of temporal parallelism, so in fact, the pipeline
above is actually executing two instructions at the same time.

Terminology and layout
----------------------

A pipeline is divided into stages.  Each stage implements some phase of
an instruction's execution.  The stages are arranged linearly and each
stage gets its input from the previous stage and feeds its output to
the succeeding stage.
In hardware each stage is separated from the next by a latch/buffer
that stores the preceding stage's output and feeds it into the next
stage on the next clock tick.  For our two-stage pipeline above, the
pipeline would look like:

+----+              +-----+
| IF |---Buffer---->|  E  |
+----+              +-----+

In an n-stage pipeline, it is possible to have (at most) n instructions
executing simultaneously.  And if the pipeline is full, you can
complete one instruction per cycle.

Pipelines also allow you to build processors with higher clock
frequencies.  In a sequential processor, the clock speed is limited by
the largest chunk of logic that has to execute in one cycle, so the
only way to increase the clock speed is to either reduce the logic (not
always possible) or re-implement the chip with different technologies
and materials so the logic will execute faster (very expensive).  In a
pipelined processor, each stage has to complete in one cycle, so if you
divide the work evenly (or as close as you can manage) amongst your
stages you can crank up the clock by roughly your number of stages.  In
the two-stage example, the pipelined version could have a clock twice
as fast as the sequential processor (because it only has to do half as
much work on each tick).

The 4-stage pipe
----------------

The two-stage example is highly contrived; the canonical pipeline is
the so-called 4-stage pipeline.  Instruction execution is divided into
a Fetch stage (F), a Decode stage (D), an Execute stage (E), and a
Writeback stage (W).  The pipeline is organized thus:

F--B1--D--B2--E--B3--W

Now we can have 4 instructions being executed at the same time!  Note
that the buffers may have to hold more information than is needed by
the following stage.  For instance, the decode stage will ascertain
what the source operands are as well as the operation to be performed.
All this info is needed by the E stage, but the decoder will also
extract the destination from the instruction.
E has no need for this info, but W needs it in order to write the
output back to the correct register.  Therefore B2 actually holds
information needed by both E and W.  In general, it may be necessary
for buffers to hold info for many stages down the line.

Performance issues
-------------------

A pipeline can only keep up its breakneck pace if it's kept fed with
instructions.  Its performance starts to degrade if it's not
maintaining a certain rate of computation.  Consider the initial case
where a 4-stage pipeline is just starting up:

cycle:  1  2  3  4  5  6  7  8  9  10
inst1   F  D  E  W
inst2      F  D  E  W
inst3         F  D  E  W
inst4            F  D  E  W

Because the pipeline was initially empty, some of the stages were idle
for multiple cycles.  In the end, it took 7 cycles to execute 4
instructions through an empty pipe.  This is only 1 cycle better than
the 8 cycles the sequential example above needed for its 4
instructions.  In the worst case, an underfed pipeline will have worse
performance than a sequential system.  Modern systems are almost
exclusively pipelined, so lots (and I do mean LOTS) of effort is
expended to keep their massive pipelines (the latest Pentium 4 has over
30 stages) fed.

Hazards and stalls
------------------

So, in a pipeline, a stall is when a stage has to wait one or more
extra cycles before it can complete its work.  This will cause all the
upstream stages to have to wait as well.  Stalls are bad (duh).  A
hazard is a condition in the instruction stream that causes a stall.
So, where do these hazards come from?  There are several different
classes of hazards: data hazards, structural hazards, and control
hazards.

Data hazards
------------

A data hazard occurs when a stage doesn't have the inputs it needs to
compute at the start of the clock cycle.  There are a variety of
reasons why this could be the case.

-variable-length operations

A given stage may occasionally take more than one cycle to complete.
Most ALU operations are simple (ADD, SUB, AND, ORR, etc.) but some
operations are very expensive.
Multiplication isn't cheap, but division is by far the most expensive
standard ALU op (probably why the ARM folks left it out).  It routinely
takes multiple cycles to divide two integers, so the execute stage
would be occupied for more than one cycle, thereby depriving the W
stage of valid input, and thereby causing a data hazard.  Consider the
following code sequence:

ADD R1, R2, R2
MUL R5, R5, R5
MOV R3, R4, LSL #2

Assuming that MUL takes 3 cycles, on a four-stage pipe, the
instructions would look like:

clock   1  2  3  4  5  6  7  8
ADD     F  D  E  W
MUL        F  D  E  E  E  W
MOV           F  D  *  *  E  W

Without the hazard, these three instructions would get through the pipe
in 6 cycles, but the stall causes it to stretch out to 8.

-Data dependencies

The pipeline exploits a kind of parallelism, but it is only possible to
fully parallelise things that have no dependencies between them.
Consider the following instruction stream:

ADD R2, R3, R2
ADD R4, R2, R2
MOV R3, R4

Now, if R2 is originally 2, R3 = 3, and R4 = 4, then after the MOV, R2
should be 5, R3 = 10, and R4 = 10.  Let's run this through a naive
pipeline (the register rows show the value in the register file at the
start of each cycle):

clock:  1  2  3  4  5  6  7  8
ADD     F  D  E  W
ADD        F  D  E  W
MOV           F  D  E  W
R2      2  2  2  2  5  5  5
R3      3  3  3  3  3  3  4
R4      4  4  4  4  4  4  4

What happened?  Because instruction execution is staged, the registers
in the register file don't get updated till after the W stage is
finished.  So, when the second ADD fetches its operands (in D) it sees
the old values.  Whoops.  This is an example of a data dependency.  The
output of the second ADD is dependent upon the output of the first ADD,
and the MOV is dependent upon the output of the second ADD.

A simple solution to this problem would be to insert logic in the
decode stage that will detect a data dependency and cause a stall until
the values are written back into the register file.  In that case the
execution would look like this (clock ticks and register values are in
hex, so A = 10):

clock:  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
ADD     F  D  E  W
ADD        F  D  *  *  E  W
MOV           F  D  *  *  *  *  E  W
R2      2  2  2  2  5  5  5  5  5  5  5
R3      3  3  3  3  3  3  3  3  3  3  A
R4      4  4  4  4  4  4  4  A  A  A  A
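
To make the two stall diagrams above easier to experiment with, here
is a rough Python sketch of the scheduling rule they appear to follow.
It is a model worked backwards from the diagrams, not a description of
real hardware.  It assumes one fetch per cycle, in-order use of the E
stage, and that a dependent instruction enters E two cycles after its
producer's W (one cycle for the value to land in the register file, one
to re-read it).  The names Inst and schedule are made up for
illustration.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, source
# registers, and the number of cycles the instruction spends in E.
Inst = namedtuple("Inst", "text dest srcs latency")

def schedule(program):
    """Return (F, D, first E, last E, W) cycle numbers per instruction."""
    last_write = {}   # register -> cycle in which its latest producer is in W
    prev_e_end = 0    # last cycle in which the E stage was busy
    rows = []
    for i, inst in enumerate(program):
        f = i + 1                      # one fetch per cycle, as drawn above
        d = f + 1
        # E cannot start before decode is done or while E is still occupied.
        e_start = max(d + 1, prev_e_end + 1)
        for reg in inst.srcs:
            if reg in last_write:
                # Assumption: the value is readable the cycle after W, and
                # E starts the cycle after the operands are re-read.
                e_start = max(e_start, last_write[reg] + 2)
        e_end = e_start + inst.latency - 1
        w = e_end + 1
        last_write[inst.dest] = w
        prev_e_end = e_end
        rows.append((f, d, e_start, e_end, w))
    return rows

# The variable-length example: MUL occupies E for 3 cycles.
mul_program = [
    Inst("ADD R1, R2, R2", "R1", ("R2",), 1),
    Inst("MUL R5, R5, R5", "R5", ("R5",), 3),
    Inst("MOV R3, R4, LSL #2", "R3", ("R4",), 1),
]

# The data-dependency example.
dep_program = [
    Inst("ADD R2, R3, R2", "R2", ("R3", "R2"), 1),
    Inst("ADD R4, R2, R2", "R4", ("R2",), 1),
    Inst("MOV R3, R4", "R3", ("R4",), 1),
]

mul_rows = schedule(mul_program)
dep_rows = schedule(dep_program)
print("MUL example finishes in cycle", mul_rows[-1][4])
print("dependency example finishes in cycle", dep_rows[-1][4])
```

Under these assumptions the MUL example finishes in cycle 8 and the
dependency example in cycle 10 (A in the hex diagram), matching the two
stalled diagrams.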