Lecture 26: Pipelines (by Trek Palmer)
================================================

Control Hazards
----------------

Branches and prediction
-----------------------

Prediction
----------

The solution used by modern pipelines is to have a piece of hardware that
predicts (ahead of the actual branch calculation) which way the branch is
going to go, and then feeds that information to the fetch unit.  This is not
as mysterious as it sounds.  Consider a so-called static predictor that always
assumes the branch is taken (the always-taken predictor).

    0xCAFE0000  ADD R5, R6, R6
                CMP R7, R8
                BNE func1       ;func1 is at 0xCAFEBAB0
                MOV R8, R7
    func1:      SUB R7, R8, R7

Now, if we put the code through our always-taken 4-stage pipe

    cycle  1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    CMP      F D E W
    BNE        F D E W    <-- if we don't forward will stall til CMP is done
    SUB          F D E W

So, if R7 != R8, then this system will have correctly predicted the branch and
will have no stalls whatsoever.  Cool.  But what if R7 == R8?  In that case,
we have what is known as a mis-prediction.  The results of the instructions
executed after the branch will have to be discarded (the technical term is
quashed).  So now if R7 == R8:

    0xCAFE0000  ADD R5, R6, R6
                CMP R7, R8
                BNE func1       ;func1 is at 0xCAFEBAB0
                MOV R8, R7
    func1:      SUB R7, R8, R7
                ADD R4, R6, R8
                CMP R6, R8

Now, if we put the code through our always-taken 4-stage pipe

    cycle  1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    CMP      F D E W
    BNE        F D E W          <-- if we don't forward will stall til CMP is done
    SUB          F D E W        \
    ADD            F D E W      |---Quash these three insts on mispredict
    CMP              F D E W    /
    MOV                F D E W  <-- first good inst after branch

After writeback for BNE (that is, by cycle 7), we'll know that we've had a
misprediction, and so we'll have to disable register writes for the next three
cycles (to stop the bogus instructions from affecting our state).  In this
case we pay a misprediction penalty of 3 cycles.
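As a rough sketch of where the 3 comes from, the Python below redraws the
staircase diagram and counts the wrong-path instructions.  It assumes (a
simplification for illustration, based only on the diagram above) that one
instruction is fetched per cycle and that the fetch unit is only redirected
once the branch has left the last stage.  The names STAGES, diagram and
misprediction_penalty are made up for this example.

```python
STAGES = ["F", "D", "E", "W"]  # the 4-stage pipe used in the diagrams


def misprediction_penalty(num_stages):
    """Count wrong-path instructions fetched behind a mispredicted branch.

    Assumption: one fetch per cycle, and the branch outcome is only acted on
    after the branch has passed through its final stage.  The branch itself
    occupies num_stages cycles, and every cycle after its own fetch brings in
    one more instruction from the predicted (wrong) path.
    """
    return num_stages - 1


def diagram(instructions):
    """Return one text row per instruction, each starting one cycle later."""
    rows = []
    for i, name in enumerate(instructions):
        # Each cycle column is two characters wide, so shift by 2 * i.
        rows.append(f"{name:<6}" + "  " * i + " ".join(STAGES))
    return rows


if __name__ == "__main__":
    for row in diagram(["ADD", "CMP", "BNE", "SUB", "ADD", "CMP", "MOV"]):
        print(row)
    print("penalty:", misprediction_penalty(len(STAGES)), "cycles")
```

Calling misprediction_penalty(len(STAGES)) gives 3, matching the three quashed
instructions in the diagram.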
In general, the longer your pipeline, the higher your misprediction penalty
(so mispredicts on current Intel procs can cost upwards of 20 cycles).

Static predictors are pretty bad actually, so real systems use dynamic
predictors.  They are considerably more complicated and use hashing functions
and past histories to model branch behavior as an approximate Markov process
(which may mean something to the math nerds among you).

Why does this work?
--------------------

Why does branch prediction work?  Well, it turns out that if you take some
code samples (known as a benchmark), compile them, and then run them while
doing measurements on them, you'll find that a lot of the branches in the
system are highly biased.  This means that a given branch goes the same way
most of the time: it is either almost always taken or almost always not
taken.  You can see this happening in the case of error conditions or
exceptions.  Most of the time, there will be no exception, so the branch that
leads to the exception handling code is strongly biased to be not-taken.  This
is also the case for loops.  Consider the following pseudo-code:

    for(int i = 0 ; i < 1000 ; i++) {
        do something;
    }

The branch that exits the loop will be not-taken 1000 times and taken just
once.  Therefore it is strongly biased.  Also, depending on how you compile
this code, there could be a conditional branch to the top of the loop which
would be very strongly biased towards taken.  Because code is often littered
with these strongly-biased branches, it is possible, in practice, to have
branch predictors with 90+% accuracy.

Indirect branching
------------------

Another source of control hazards is indirect branching.  On the ARM, there is
no way to have a B or BL instruction take a register as an argument, but many
architectures support such an instruction.  On the ARM, to do indirect
branching, you just overwrite PC.  Consider the following code sequence:

    ADD R4, R5, R5
    LDMIA R13!, {R8, R14}
    MOV PC, R14

This should be painfully familiar to many of you.
It's the archetypal function epilogue we've been using in projects and labs
for over a month now.  But what happens when we pass this little gem through
our pipeline (even if it has prediction)?

    cycle  1 2 3 4 5 6 7 8 9 A B C D E F
    ADD    F D E W
    LDMIA    F D E W        <-- assuming that mem. access is in one cycle
    MOV        F D * E W    <-- data hazard with R14
    ????         ???????????

Here, we have no idea where to fetch the next instruction from, because the
address is stored in R14, which we are, I might add, retrieving from memory!
Gah!  This is the sort of thing that causes hardware architects to wake up in
the middle of the night, covered in cold sweat.  In our case, this means that
every function return (and therefore every function call) will cause a 3-cycle
stall.  In Pentium 4 land, this means that a function call could cause a 20+
cycle stall.  This is bad, because it is not uncommon to have a function call
every 20 instructions.

There is actually a hardware solution for this.  Many processors have a small
stack of return addresses, and the fetch unit will pop the top of the stack to
get the next address on a return instruction.  Because this stack is of fixed
size, there is a limit to its effectiveness, but, in practice, this greatly
reduces the function return penalty.

Prediction Hints
-----------------

Some instruction sets have incorporated branch prediction into the design of
their ISAs.  Branches can have prediction bits set to inform the processor's
branch prediction hardware which way the branch is likely to go.  These range
from a single bit (taken, not taken) to multiple bits (indicating which
direction and how strongly biased the branch is).  The utility in this is that
dynamic branch predictors have some default starting state.  If this default
is wrong, it may take several (maybe dozens of) mispredicts to get the
predictor trained on this particular branch.  If the programmer (or more
usually, the compiler) knows the tendency ahead of time, they can avoid these
mispredictions.
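To see the training cost concretely, here is a minimal sketch of one very
simple dynamic predictor, a 2-bit saturating counter.  This particular scheme
is an assumption chosen for illustration (the dynamic predictors mentioned
earlier use hashing and longer histories), and the function name simulate and
the state encoding are hypothetical.  It runs the loop-exit branch from the
earlier for loop (not-taken 1000 times, then taken once) from a good and a bad
starting state.

```python
def simulate(outcomes, start_state):
    """Run a 2-bit saturating counter over a sequence of branch outcomes.

    Assumed encoding: states 0 and 1 predict not-taken, states 2 and 3
    predict taken.  Returns the number of mispredictions.
    """
    state = start_state
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredicts += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts


# Loop-exit branch: not-taken on 1000 checks, taken on the final check.
loop_exit = [False] * 1000 + [True]

if __name__ == "__main__":
    for start in (0, 3):
        misses = simulate(loop_exit, start)
        accuracy = 100.0 * (len(loop_exit) - misses) / len(loop_exit)
        print(f"start state {start}: {misses} mispredicts, {accuracy:.1f}% accurate")
```

Starting in the strongly-not-taken state, the sketch mispredicts only the
final exit; starting in the strongly-taken state, it pays two extra mispredicts
before it is trained.  A hint bit would let the compiler pick the good starting
state.  Real predictors keep more state per branch, so the retraining cost can
be higher than the two mispredicts seen here.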
The hard part is figuring out which branches are biased and how.  Compilers
use two techniques: 1) ad-hoc rules (like exceptions are strongly biased
towards not taken) and 2) profiling (where you run the program on a couple of
test inputs and mark biased branches).

The 3-stage ARM pipeline
-------------------------

The first generation of 32-bit ARM processors (ARM6) had a 3-stage pipeline,
and many of the embedded versions of the ARM are ARM7-based, which means that
they also have the same 3-stage pipeline.  Later ARM architecture versions
have pipelines with more stages, but because the PC is exposed, they actually
have to support the 3-stage semantics (for backwards compatibility).  The
three stages are: fetch, decode, execute.  Although this is simple, the
execute stage is considerably more complicated than the first two and this can
lead to implementation headaches.  This also explains why the PC is always the
address of the currently executing instruction + 8, because the PC is being
used by the fetch unit to grab the instruction two ticks ahead.

A Few Examples
---------------

    0x0:  ADD R3, R3, R2
          ADD R4, R2, R2
          MOV R3, R2

    clock  1   2   3   4   5   6   7   8   9   A   B   C
    ADD    F   D   E
    ADD        F   D   E
    MOV            F   D   E
    ----
    R2     2   2   2   2   2   2
    R3     3   3   3   5   5   2
    R4     4   4   4   4   4   4
    PC     0   4   8   C   10  14

    0x0:  ADD R5, R6, R6
          CMP R7, R8
          MOVEQ R8, R7
          BNE func1       ;func1 is at 0xCAFEBAB0

    clock  1   2   3   4   5   6   7   8   9
    ADD    F   D   E
    CMP        F   D   E
    MOVEQ          F   D   E              <--- doesn't do anything
    BNE                F   D   E
    ----
    R5     5   5   5   C   C   C   C
    R6     6   6   6   6   6   6   6
    R7     7   7   7   7   7   7   7
    R8     8   8   8   8   8   8   8
    PC     0   4   8   C   10  14  0xCAFEBAB8

That's about it for simple single-issue in-order pipelines.

Advanced Architecture
----------------------

-Superscalar

A superscalar chip is one that has multiple functional units.  For instance, a
chip can have 4 integer units and 2 floating point units, allowing up to 6
instructions to be executing at the same time.
Note that this rarely happens, and much of the benefit of superscalar
organization is to hide the latency of long-running operations (such as memory
accesses and floating-point divide).  For instance, the Pentium 4 has 2 memory
units (known as load-store units), 2 simple ALUs (no mul or div among other
things), 1 full ALU, and 2 floating point units (FPUs).  Although this means
that theoretically the Pentium 4 can execute 7 instructions per cycle
(actually 7 uOps), in practice the number of instructions per cycle floats
around 2.  Superscalar processors are also known as multi-issue processors.  A
processor with 4 integer units can be said to be a 4-issue or a 4-wide
machine.

-Out-of-order

An out-of-order chip is one where the hardware may re-order instructions on
the fly to execute more efficiently.  The hardware (if it's correct, that is)
will respect the dependencies and semantics of instructions, but it can
usually re-arrange them into a more efficient schedule.  For instance, if an
out-of-order ARM was handed the following three instructions:

    ADD R2, R3, R2
    ADD R4, R2, R2
    MOV R3, R5

It may execute the MOV instruction right after the first ADD (ahead of the
second ADD), and thereby eliminate a stall cycle caused by the ADD-ADD
dependency.  Out-of-order cores also allow instruction results to be broadcast
around the entire chip.  This is effectively like allowing every stage of the
pipeline to forward its results to every other stage.  As you can imagine, the
out-of-order logic is intensely complicated.  Many modern processors are both
superscalar and out-of-order (like the Pentium 4 and the G4, G5).

-Speculation

Branch prediction is technically a kind of speculation.  The processor is
"guessing" which way a branch will go and betting its performance on that
guess.  It is possible to speculate over other things, including: memory
conflicts, thread locks, register values (very tricky), basically anything
where you can assume a common case.
The idea is that you assume that things will often enough be a certain way,
march ahead with the computation, and then check yourself when you can and
maybe abort the speculative work and start over (if you guessed wrong).

-Threads

Threaded processors are processors that can actually be executing multiple
programs at the same time.  Note that this is different from superscalar
(which executes multiple instructions, which are usually from the same
program), and different from the multi-tasking effect of operating systems
(which is actually an illusion caused by context switching).  On a threaded
CPU two programs can have their instructions executing in parallel, and the
hardware ensures that they won't screw each other up.  There are many ways of
implementing threading and of presenting it to the user.  In some systems
there are explicit thread creation and control instructions, whereas in other
systems, the execution mode of the instructions dictates which thread they
will execute in (for example, user code on one thread, supervisor code on
another).  Intel Hyper-Threading takes yet another approach: each hardware
thread shows up as a separate logical processor, and the operating system
schedules programs onto them.

An interesting idea is to combine threading and speculation.  Consider branch
prediction: if we have a thread sitting idle, why don't we execute both sides
of the branch in parallel?  That way, whenever we find out which side is the
correct one, we can just switch to it immediately.  In the best case, this can
completely eliminate the branch penalty.

-Vector processing

This is what made the Cray supercomputers so super.  A vector processor is a
SIMD (single-instruction multiple-data) processor.  What this means is that a
single instruction can cause multiple registers full of distinct data to be
operated on.  An example would be a vector multiply: in a vector unit with 4
registers, a vector multiply would perform the same multiply operation on all
4 registers in one go.  So, this speeds up certain kinds of code.
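A rough Python model of the idea, assuming a hypothetical 4-lane vector unit:
vmul stands in for a single vector multiply instruction (in real hardware all
lanes are multiplied at once; the list comprehension here only models the
result).  LANES, vmul, scalar_multiply and vector_multiply are names invented
for this sketch.  What it shows is the loop iteration count, not real speed.

```python
LANES = 4  # hypothetical vector width: 4 elements handled per instruction


def vmul(a, b):
    """Model of ONE vector multiply: the same multiply applied to every lane."""
    return [x * y for x, y in zip(a, b)]


def scalar_multiply(xs, ys):
    """Scalar loop: one multiply per iteration."""
    out, iterations = [], 0
    for x, y in zip(xs, ys):
        out.append(x * y)
        iterations += 1
    return out, iterations


def vector_multiply(xs, ys):
    """Vectorized loop: one vmul per iteration, LANES elements at a time."""
    out, iterations = [], 0
    for i in range(0, len(xs), LANES):
        # The slices are shorter than LANES only for a final partial chunk.
        out.extend(vmul(xs[i:i + LANES], ys[i:i + LANES]))
        iterations += 1
    return out, iterations


if __name__ == "__main__":
    xs, ys = list(range(8)), [2] * 8
    print(scalar_multiply(xs, ys))
    print(vector_multiply(xs, ys))
```

Both functions return the same products; for 8 elements the scalar loop runs 8
iterations and the vector loop runs 2.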
Instead of having a loop that performs four multiplies per iteration, or
worse, iterates four times as much, you can have a loop that does one vector
multiply per iteration.  Most vector units have all kinds of odd instructions
for making loops particularly efficient, so a vectorized loop can execute many
times faster than its scalar (the opposite of vector) equivalent.  Vector
instructions are becoming routine.  Both the Pentium 4 (SSE2) and the G{4,5}
(Altivec) have vector instructions and a separate vector unit.

There are many other topics in modern advanced architecture (VLIW, dataflow,
reconfigurable hardware, clustered processors, typed caches, code-morphing,
transactional memory, locking memory), but they're either too researchy or too
obscure to describe in detail in an intro course like this.

Disks and Storage
==================