Lecture 26: Pipelines (by Trek Palmer)
================================================

Data hazards
------------

A data hazard occurs when a stage doesn't have the inputs it needs to compute at the start of the clock cycle. There are a variety of reasons why this could be the case.

Data dependencies
------------------

The pipeline exploits a kind of parallelism, but it is only possible to fully parallelise things that have no dependencies between them. Consider the following instruction stream:

    ADD R2, R3, R2
    ADD R4, R2, R2
    MOV R3, R4

Now, if R2 is originally 2, R3 = 3, and R4 = 4, then after the MOV, R2 should be 5, R3 = 10, and R4 = 10. Let's run this through a naive pipeline:

    clock: 1 2 3 4 5 6 7 8
    ADD    F D E W
    ADD      F D E W
    MOV        F D E W
    R2     2 2 2 2 5 5 5
    R3     3 3 3 3 3 3 4
    R4     4 4 4 4 4 4 4

What happened? Because instruction execution is staged, the registers in the register file don't get updated until after the W stage is finished. So, when the second ADD fetches its operands (in D), it sees the old values. Whoops. This is an example of a data dependency. The output of the second ADD is dependent upon the output of the first ADD, and the MOV is dependent upon the output of the second ADD.

A simple solution to this problem would be to insert logic in the decode stage that will detect a data dependency and cause a stall until the values are written back into the register file. In that case the execution would look like this (cycle numbers and register values are written in hex, so A is 10):

    clock: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    ADD      F D * * E W
    MOV        F D * * * * E W
    R2     2 2 2 2 5 5 5 5 5 5 5
    R3     3 3 3 3 3 3 3 3 3 3 A
    R4     4 4 4 4 4 4 4 A A A A

Everything is now correct, but it has slowed down our pipeline. There is a more involved hardware solution called forwarding that fixes this problem and doesn't hurt performance as much. In a forwarding pipeline, there is an extra set of wires running from the E|W buffer back to E. This means that last cycle's output can be used as input on this cycle.
You will need to introduce more logic to decide when to switch between forwarded inputs and the inputs that D got for you, but it changes your execution profile to:

    clock: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    ADD      F D E W
    MOV        F D E W
    R2     2 2 2 2 5 5 5
    R3     3 3 3 3 3 3 A
    R4     4 4 4 4 4 A A

Now that's nice. Of course, this kind of forwarding only works if the data dependencies are between adjacent instructions. Consider the following variation:

    ADD R2, R3, R2
    ADD R9, R10, R10
    ADD R4, R2, R2
    SUB R10, R9, R10
    MOV R3, R4

Now the forwarding logic is useless, but the non-dependent instructions have given us some breathing room. Only 1 cycle of stall will be introduced for each of ADD R4, R2, R2 and the MOV.

There is actually a more general hardware solution to these problems called out-of-order execution. But this is a big topic and out of the scope of this class.

Fixing data dependencies with software (instruction scheduling)
---------------------------------------------------------------

Because we know ahead of time that the processor we're executing on is pipelined, we may be able to fix some of these problems during compilation. Because the compiler has to emit an instruction stream, it can reorder the instructions to avoid data hazards. So, if we have our old instruction sequence, we can avoid stalling by inserting 3 unrelated instructions between each of the dependent ones.

    ADD R2, R3, R2
    ADD R4, R2, R2
    MOV R3, R4

Could become:

    ADD R2, R3, R2
    MUL R10, R8, R8
    ADD R9, R9, R9
    CMP R8, R7
    ADD R4, R2, R2
    MOV R3, R4

This would avoid the first stall. Of course, in order to rearrange instructions, you must have unrelated instructions handy to fill in the gaps. The larger the body of code being compiled (usually a loop or function body), the easier this is. A lot of work has gone into efficiently finding an optimal (or sufficiently good approximate) schedule for a block of code.
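
The stall and forwarding diagrams in the data-dependency section can be reproduced with a small model. What follows is a minimal Python sketch, not a description of any real core: the function name `schedule`, the tuple layout, and the two timing rules in the docstring are assumptions read off the diagrams in these notes.

```python
def schedule(program, forwarding=False):
    """Return one dict of stage cycles (F, D, E, W) per instruction.

    program is a list of (text, dest, sources) tuples. Timing assumptions,
    taken from the lecture's diagrams rather than from real hardware:
      * a written register is readable the cycle after W, and a stalled
        instruction needs one more cycle to re-read it, so E >= W + 2;
      * with forwarding, a result can be picked up only in the cycle right
        after its producer's E stage (the E|W -> E wires).
    """
    rows = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for index, (text, dest, sources) in enumerate(program):
        fetch = index + 1
        decode = fetch + 1
        earliest = decode + 1
        if rows:
            earliest = max(earliest, rows[-1]["E"] + 1)  # execute in order
        producers = [rows[last_writer[r]] for r in sources if r in last_writer]
        execute = earliest
        while True:  # repeat until no producer forces a later E cycle
            needed = execute
            for prod in producers:
                if forwarding and execute == prod["E"] + 1:
                    continue  # operand arrives over the forwarding wires
                needed = max(needed, prod["W"] + 2)
            if needed == execute:
                break
            execute = needed
        rows.append({"text": text, "F": fetch, "D": decode, "E": execute,
                     "W": execute + 1, "stall": execute - earliest})
        if dest is not None:
            last_writer[dest] = index
    return rows


# The two instruction streams from the notes, as (text, dest, sources).
DEPENDENT = [
    ("ADD R2, R3, R2", "R2", ["R3", "R2"]),
    ("ADD R4, R2, R2", "R4", ["R2", "R2"]),
    ("MOV R3, R4", "R3", ["R4"]),
]
VARIATION = [
    ("ADD R2, R3, R2", "R2", ["R3", "R2"]),
    ("ADD R9, R10, R10", "R9", ["R10", "R10"]),
    ("ADD R4, R2, R2", "R4", ["R2", "R2"]),
    ("SUB R10, R9, R10", "R10", ["R9", "R10"]),
    ("MOV R3, R4", "R3", ["R4"]),
]

if __name__ == "__main__":
    for row in schedule(VARIATION, forwarding=True):
        print(row)
```

The "stall" field counts only the extra cycles an instruction's own operands add on top of waiting behind its predecessor, which is how the notes count the single-cycle stalls in the five-instruction variation.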

Subtle dependencies
-------------------

Dependencies are not always as cut and dried as just looking for input and output registers. For example, what are the dependencies in this block of code?

    SUBS R0, R4, R5
    BNE foo

Here, there's an implicit dependence on the condition codes. It can get even worse. For example, what are the dependencies in this block of code?

    ADDS R4, R5, R6
    ADC R3, R4, R6

ADC is add with carry, which means that it computes the result of adding R4, R6, and the carry bit from the condition codes. Here again there is an implicit dependence through the condition codes, in addition to the explicit one on R4. On systems that support memory addresses as arguments (like x86), dependencies must be traced through memory.

Structural hazards
------------------

A structural hazard is when the pipeline must stall because some instruction is still using a stage in the pipeline when the next clock tick arrives. The instructions behind it in the pipeline cannot advance, and this causes a stall. We've already seen an example of this with expensive ALU ops (MUL). But memory can also cause a stall. Consider the following code:

    LDR R4, [R5]
    LDR R6, [R7]
    ADD R5, R4, R6

Ignoring data dependence issues, this can cause structural hazards if retrieving the values at the addresses held in R5 or R7 takes more than one cycle. Although we have not discussed caching yet, it is possible in modern architectures for one memory access to take a couple of cycles and another access to take a couple of hundred cycles. For the purposes of this example, let's assume that the memory value at R5 takes 5 cycles to access and the value at R7 takes 1.

    cycle: 1 2 3 4 5 6 7 8 9 A B C D E
    LDR    F D * * * * * E W
    LDR      F * * * * * D E W
    ADD        * * * * * F D E W

Because operand fetching occurs at decode, the memory access occurs then (in our particular 4-stage pipe). Therefore the first LDR stalls the pipe at D for 5 cycles, stretching out something that should take 6 cycles into something that takes 11. How to fix it?
Well, instruction scheduling can help in some cases, but much of the problem with LDR (and STR) is that the acts of accessing memory and writing a value to a register are rolled up into one instruction. Some more exotic architectures have actually decoupled these two operations into a memory request instruction and a register-load instruction. This can really complicate writing assembly code, so these systems are basically compile-only.

Control Hazards
----------------

A control hazard is when the fetch unit can't figure out which instruction to get next, usually because there's a control dependency that isn't resolved yet. Consider the following code:

    0xCAFE0000  ADD R5, R6, R6
                CMP R7, R8
                BNE func1       ;func1 is at 0xCAFEBAB0
                MOV R8, R7

Now, if we put the code through our naive 4-stage pipe:

    cycle: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    CMP      F D E W
    BNE        F D E W  <-- if we don't forward will stall til CMP is done
    MOV          ??????????

Here's the problem: if R7 == R8, then the next instruction to be fetched should be the MOV (at address 0xCAFE000C, since ARM instructions are 4 bytes each), but if R7 != R8, then the next instruction to be fetched is the one at address 0xCAFEBAB0. The fetch logic doesn't know which address to grab, so it stalls. Note that the system won't know which address is the correct one until cycle 7, when BNE is finished. You could argue that you would know at cycle 6, but that still doesn't fix the problem; you still have a two-cycle delay.

Branches are common, more so in compiled code, and even more so in object-oriented code (something like every 5 instructions or so), so control hazards are a serious issue.

Branches and prediction
-----------------------

There are a couple of quick hacks that could alleviate the problem somewhat.

-unconditional branches

Some branches are unconditional, that is to say that they don't really have to inspect the condition codes. In ARM these branches are B (branch always) and BNV (branch never).
In this case we can just hardwire the decode logic to inform the fetch unit that this is an unconditional branch and that it needs to fetch from an address that it (the decode logic) has already calculated. Consider the following variation of the previous code:

    0xCAFE0000  ADD R5, R6, R6
                CMP R7, R8
                B func1         ;func1 is at 0xCAFEBAB0
                MOV R8, R7
    func1:      SUB R8, R7, R8

Now, if we put the code through our naive 4-stage pipe:

    cycle: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    CMP      F D E W
    B          F D E W  <-- if we don't forward will stall til CMP is done
    SUB          * F D E W
                 ^
                 |
                 Caused because we wait on the decode logic

This is good: we've reduced a 3-4 cycle stall to a single-cycle stall. But this doesn't do us any good if we're dealing with conditional branches.

Prediction
----------

The solution used by modern pipelines is to have a piece of hardware that predicts (ahead of the actual branch calculation) which way the branch is going to go, and then feeds that information to the fetch unit. This is not as mysterious as it sounds. Consider a so-called static predictor that always assumes the branch is taken (the always-taken predictor).

    0xCAFE0000  ADD R5, R6, R6
                CMP R7, R8
                BNE func1       ;func1 is at 0xCAFEBAB0
                MOV R8, R7
    func1:      SUB R7, R8, R7

Now, if we put the code through our always-taken 4-stage pipe:

    cycle: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    CMP      F D E W
    BNE        F D E W  <-- if we don't forward will stall til CMP is done
    SUB          F D E W

So, if R7 != R8, then this system will have correctly predicted the branch and will have no stalls whatsoever. Cool. But what if R7 == R8? In that case, we have what is known as a mis-prediction. The results of the instructions executed after the branch will have to be discarded (the technical term is quashed).
So now if R7 == R8:

    0xCAFE0000  ADD R5, R6, R6
                CMP R7, R8
                BNE func1       ;func1 is at 0xCAFEBAB0
                MOV R8, R7
    func1:      SUB R7, R8, R7
                ADD R4, R6, R8
                CMP R6, R8

Now, if we put the code through our always-taken 4-stage pipe:

    cycle: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    CMP      F D E W
    BNE        F D E W  <-- if we don't forward will stall til CMP is done
    SUB          F D E W      \
    ADD            F D E W    |---Quash these three insts on mispredict
    CMP              F D E W  /
    MOV                F D E W  <-- first good inst after branch

After writeback for BNE (on cycle 7), we'll know that we've had a misprediction, and so we'll have to disable register writes for the next three cycles (to stop the bogus instructions from affecting our state). In this case we pay a misprediction penalty of 3 cycles. In general, the longer your pipeline, the higher your misprediction penalty (so mispredicts on current Intel processors can cost upwards of 20 cycles).

Static predictors are pretty bad actually, so real systems use dynamic predictors. They are considerably more complicated and use hashing functions and past histories to approximate branch behavior with an approximation of a Markov process (which may mean something to the math nerds among you).

Why does this work?
--------------------

Why does branch prediction work? Well, it turns out that if you take some code samples (known as a benchmark), compile them, and then run them while doing measurements on them, you'll find that a lot of the branches in the system are highly biased. This means that most of the time they are either taken or not taken. You can see this happening in the case of error conditions or exceptions. Most of the time, there will be no exception, so the branch that leads to the exception handling code is strongly biased to be not-taken. This is also the case for loops. Consider the following pseudo-code:

    for(int i = 0 ; i < 1000 ; i++) {
        do something;
    }

The branch that exits the loop will be not-taken 1000 times and taken just once. Therefore it is strongly biased.
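
As a rough check of that count, here is a minimal Python sketch that lists the outcomes of the loop-exit branch and scores a fixed guess against them. The function names are hypothetical, and the sketch assumes the exit test sits at the top of the loop, so it runs once more than the body does.

```python
def loop_exit_outcomes(iterations):
    """Outcomes (True = taken) of the branch that leaves the loop.

    Models `for (i = 0; i < iterations; i++)` with the exit test at the top:
    the test runs iterations + 1 times and the exit branch is taken only on
    the final one.
    """
    return [i >= iterations for i in range(iterations + 1)]


def static_accuracy(outcomes, predict_taken):
    """Fraction of branches that a fixed taken / not-taken guess gets right."""
    correct = sum(1 for taken in outcomes if taken == predict_taken)
    return correct / len(outcomes)


outcomes = loop_exit_outcomes(1000)

if __name__ == "__main__":
    print("not-taken:", outcomes.count(False), "taken:", outcomes.count(True))
    print("never-taken accuracy:", static_accuracy(outcomes, predict_taken=False))
    print("always-taken accuracy:", static_accuracy(outcomes, predict_taken=True))
```

For this one branch, a static guess of not-taken is wrong only on the final test, while always-taken is wrong on every test but the last.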
Also, depending on how you compile the loop, there could be a conditional branch to the top of the loop, which would be very strongly biased towards taken. Because code is often littered with these strongly-biased branches, it is possible, in practice, to have branch predictors with 90+% accuracy.

Indirect branching
------------------

Another source of control hazards is indirect branching. On the ARM, there is no way to have a B or BL instruction take a register as an argument, but many architectures support such an instruction. On the ARM, to do indirect branching, you just overwrite PC. Consider the following code sequence:

    ADD R4, R5, R5
    LDMIA R13!, {R8, R14}
    MOV PC, R14

This should be painfully familiar to many of you. It's the archetypal function epilogue we've been using in projects and labs for over a month now. But what happens when we pass this little gem through our pipeline (even if it has prediction)?

    cycle: 1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    LDMIA    F D E W      <-- assuming that mem. access is in one cycle
    MOV        F D * E W  <-- data hazard with R14
    ????         ???????????

Here, we have no idea where to fetch the next instruction from, because the value is stored in R14, which we are, I might add, retrieving from memory! Gah! This is the sort of thing that causes hardware architects to wake up in the middle of the night, covered in cold sweat. In our case, this means that every function return (and therefore every function call) will cause a 3-cycle stall. In Pentium 4 land, this means that a function call could cause a 20+ cycle stall. This is bad, because it is not uncommon to have a function call every 20 instructions.

There is actually a hardware solution for this. Many processors have a small stack of return addresses, and the fetch unit will pop the top of the stack to get the next address on a return instruction. Because this stack is of fixed size, there is a limit to its effectiveness, but in practice this greatly reduces the function return penalty.
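
The return-address stack just described can be sketched in a few lines of Python. This is an illustrative model only: the class and function names are hypothetical, and dropping the oldest entry when the stack is full is an assumption, since the notes only say that the stack has a fixed size.

```python
from collections import deque


class ReturnAddressStack:
    """Fixed-size stack of predicted return addresses (illustrative model)."""

    def __init__(self, capacity):
        # Assumption: once full, pushing silently discards the oldest entry.
        self.entries = deque(maxlen=capacity)

    def on_call(self, return_address):
        """Record where the matching return should resume fetching."""
        self.entries.append(return_address)

    def predict_return(self):
        """Pop the predicted fetch address, or None if nothing is recorded."""
        return self.entries.pop() if self.entries else None


def count_correct_returns(depth, capacity):
    """Nest `depth` calls, unwind them all, and count correct predictions."""
    ras = ReturnAddressStack(capacity)
    addresses = [0xCAFE0000 + 4 * i for i in range(depth)]  # made-up addresses
    for address in addresses:
        ras.on_call(address)
    correct = 0
    for address in reversed(addresses):  # returns happen innermost first
        if ras.predict_return() == address:
            correct += 1
    return correct


if __name__ == "__main__":
    print(count_correct_returns(depth=3, capacity=4))
    print(count_correct_returns(depth=6, capacity=4))
```

When the call depth fits in the stack, every return is predicted. Once nesting goes deeper than the capacity, the outermost returns find nothing to pop and fall back to the stall described above, which is the limit on effectiveness that the notes mention.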