Lecture 28: Advanced Arch. Overview, Cache Intro (by Trek Palmer)
=================================================================

Advanced Architecture
----------------------

-Superscalar

A superscalar chip is one that has multiple functional units. For instance, a chip can have 4 integer units and 2 floating point units, allowing up to 6 instructions to be executing at the same time. Note that this rarely happens, and much of the benefit of superscalar organization is that it hides the latency of long-running operations (such as memory accesses and floating-point divide). For instance, the Pentium 4 has 2 memory units (known as load-store units), 2 simple ALUs (no mul or div, among other things), 1 full ALU, and 2 floating point units (FPUs). Although this means that, theoretically, the Pentium 4 can execute 7 instructions per cycle (actually 7 uOps), in practice the number of instructions per cycle floats around 2.

Superscalar processors are also known as multi-issue processors. A processor with 4 integer units can be said to be a 4-issue or a 4-wide machine.

-Out-of-order

An out-of-order chip is one where the hardware may re-order instructions on the fly to execute more efficiently. The hardware (if it's correct, that is) will respect the dependencies and semantics of instructions, but it can usually re-arrange them into a more efficient schedule. For instance, if an out-of-order ARM were handed the following three instructions:

    ADD R2, R3, R2
    ADD R4, R2, R2
    MOV R3, R5

it may execute the MOV instruction right after the first ADD (before the second ADD), and thereby eliminate a stall cycle caused by the ADD-ADD dependency.

Out-of-order cores also allow instruction results to be broadcast around the entire chip. This is effectively like allowing every stage of the pipeline to forward its results to every other stage. As you can imagine, the out-of-order logic is intensely complicated.

Many modern processors are both superscalar and out-of-order (like the Pentium 4 and the G4, G5).
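The reordering decision in the ADD/ADD/MOV example above comes down to checking register dependencies between pairs of instructions. Here is a minimal sketch of that check in Java. The class and method names (ReorderCheck, Instr, canSwap) are made up for illustration, and real out-of-order hardware does this with renaming and wakeup logic rather than anything resembling this code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ReorderCheck {
    /** Hypothetical model of an instruction: one destination register, some source registers. */
    static final class Instr {
        final String text;
        final String dest;
        final Set<String> srcs;

        Instr(String text, String dest, String... srcs) {
            this.text = text;
            this.dest = dest;
            this.srcs = new HashSet<>(Arrays.asList(srcs));
        }
    }

    /** Read-after-write: 'later' reads the register that 'earlier' writes. */
    static boolean raw(Instr earlier, Instr later) {
        return later.srcs.contains(earlier.dest);
    }

    /** Write-after-read: 'later' overwrites a register that 'earlier' still has to read. */
    static boolean war(Instr earlier, Instr later) {
        return earlier.srcs.contains(later.dest);
    }

    /** Write-after-write: both instructions write the same register. */
    static boolean waw(Instr earlier, Instr later) {
        return earlier.dest.equals(later.dest);
    }

    /** Two adjacent instructions may trade places only if none of the three hazards exists. */
    static boolean canSwap(Instr earlier, Instr later) {
        return !raw(earlier, later) && !war(earlier, later) && !waw(earlier, later);
    }

    public static void main(String[] args) {
        Instr add1 = new Instr("ADD R2, R3, R2", "R2", "R3", "R2");
        Instr add2 = new Instr("ADD R4, R2, R2", "R4", "R2");
        Instr mov  = new Instr("MOV R3, R5", "R3", "R5");

        System.out.println("second ADD depends on first ADD: " + raw(add1, add2));
        System.out.println("MOV may move ahead of second ADD: " + canSwap(add2, mov));
        System.out.println("MOV may move ahead of first ADD:  " + canSwap(add1, mov));
    }
}
```

Under this model the MOV can trade places with the second ADD, but not with the first: the MOV overwrites R3, which the first ADD still has to read. That is why the reordered schedule puts the MOV between the two ADDs rather than in front of both.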

-Speculation

Branch prediction is technically a kind of speculation. The processor is "guessing" which way a branch will go and betting its performance on that guess. It is possible to speculate over other things, including memory conflicts, thread locks, and register values (very tricky): basically anything where you can assume a common case. The idea is that you assume things will be a certain way often enough, march ahead with the computation, and then check yourself when you can. If you guessed wrong, you abort the speculative work and start over.

-Threads

Threaded processors are processors that can actually be executing multiple programs at the same time. Note that this is different from superscalar (which executes multiple instructions, usually from the same program), and different from the multi-tasking effect of operating systems (which is actually an illusion caused by context switching). On a threaded CPU, two programs can have their instructions executing in parallel, and the hardware ensures that they won't screw each other up. There are many ways of implementing threading and of presenting it to the user. In some systems there are explicit thread creation and control instructions, whereas in other systems the hardware simply presents itself as several logical processors and leaves it to the operating system to schedule threads onto them (this is how Intel Hyper-Threading works).

An interesting idea is to combine threading and speculation. Consider branch prediction: if we have a thread sitting idle, why don't we execute both paths of the branch in parallel? That way, as soon as we find out which path is the correct one, we can just switch to it immediately. In the best case, this can completely eliminate the branch penalty.

-Vector processing

This is what made the Cray supercomputers so super. A vector processor is a SIMD (single-instruction multiple-data) processor.
What this means is that a single instruction can cause multiple registers full of distinct data to be operated on. An example would be a vector multiply: in a vector unit with 4 registers, a vector multiply would perform the same multiply operation on all 4 registers in one go. This speeds up certain kinds of code. Instead of having a loop that performs four multiplies per iteration (or worse, iterates four times as often), you can have a loop that does one vector multiply per iteration. Most vector units have all kinds of odd instructions for making loops particularly efficient, so a vectorized loop can execute many, many times faster than its scalar (the opposite of vector) equivalent. Vector instructions are becoming routine. Both the Pentium 4 (SSE2) and the G{4,5} (Altivec) have vector instructions and a separate vector unit.

There are many other topics in modern advanced architecture (VLIW, dataflow, reconfigurable hardware, clustered processors, typed caches, code-morphing, transactional memory, locking memory), but they're either too researchy or too obscure to describe in detail in an intro course like this.

Caches
==================

Throughout the semester you've heard me mention that in modern systems, memory accesses can take hundreds of cycles. Let me first demonstrate this. According to the good folks at newegg.com, a decently fast Pentium 4 operates at 3.4GHz. That's 3.4*10^9 cycles/sec. Inverting that, we find that each clock tick on the P4 takes .294 nanoseconds. After further consultation at newegg, we find that SDRAM access times are all over the map, but basically the high end (DDR2-700) has a 3.75ns access time and the low end (PC-133) has a 7.5ns access time. After some division (3.75/.294 and 7.5/.294), we find that it will take between 13 and 26 cycles for our P4 to access memory.

This is actually the best case. In practice, it's much worse. You see, memory has this attribute known as CAS.
It stands for column address strobe, and basically represents the number of cycles between when an address is requested and when the data actually becomes available. CAS ranges from 2 or 3 (memory) cycles up to 12 for the high end. In the presence of this access latency, in the best case it takes about 36 cycles to access memory, and in the worst up to 156 cycles!

Hopefully our investigation of pipelining has convinced you that waiting 156 cycles for the completion of a memory access is bad. In fact, very bad. Even if we got lucky and all our memory accesses took 13 cycles, it would still seriously slow down our CPU. Suppose memory accesses occur only once every 20 instructions (which is low), memory takes 13 cycles to access, and all our other instructions execute in 1 cycle. Then every 20 instructions take 19 + 13 = 32 cycles instead of 20, so our code takes 1.6 times as long to execute as it would if memory were only 1 cycle away!

So, it seems obvious that memory access speed is of crucial importance to performance. How is this fixed? One solution is to design your out-of-order superscalar CPU to execute instructions around a blocked memory operation. This can help, but consider the 13 cycle case. It means that you need to find 13 instructions that are not dependent upon the memory op in order to avoid a stall. This is probably not going to be the usual case. So, we need to do something else to decrease the average cost of accessing memory. The solution used in modern processors is known as memory caching.

Caching, conceptually
======================

Caching is actually a general concept in computer science. The basic idea is to keep local (fast) copies of regularly used data so that most of your accesses are local (and therefore fast), and to wrap the cache structure in code so that it can transparently fetch needed data from slow storage as necessary. This can be made clearer with an analogy. Say you wanted to cook something.
The food items on the stove (or in the oven) are like data in your registers: they're right there where you (the processor) can operate on them. The grocery store is like main memory. It has all the foodstuffs (data) you will need to make your meal. But if you find you need something, you don't want to have to drive (or walk) all the way to the store to buy it and bring it back. This is a high latency operation; it takes a lot of time to go food shopping.

So, we all have food caches. We call them refrigerators. Our fridges can hold more than our stove tops, but it takes a little longer to get at stuff in the fridge. Our fridges are also much smaller than the grocery store. So we cache needed food in the fridge, where we can get at it much more quickly than at the store while we're cooking. Now, our fridges may not have everything that we need to make a meal, and in that case (known as a cache miss) we'll still have to go to the store to fetch the stuff we don't have. But if we are good at filling our fridge (caching our food), we won't have to go to the store often. This should reduce our average food preparation time significantly.

So, in the reality of CPU organization, we have small, fast memories close to the CPU (2-5 cycle access time) that can store needed data that won't fit in the registers. But this memory is flip-flop based, so each bit is physically larger and more expensive than a bit of DRAM. It also has to be small to keep it fast, so caches are much smaller than main memory (usual sizes are 16K for the really fast ones, up to 512K for the slower ones). Therefore, a modern CPU can keep about 16-512K of needed data on the chip and only pay a 2-5 cycle penalty for accessing it!

Average memory-access time
===========================

To actually see the benefit of caching we need to do a little math. Without a cache, the average memory-access time is just the access time. With a cache, it gets more complicated.
If you hit in the cache (that is, the data you're looking for is in the cache), you only have to wait for the cache access time, but if you miss you also have to wait for main memory to return the data. Because we care about the average access time, we need to weight these two access times by their frequency. An important parameter in caches, then, is the miss rate. This is the percentage of accesses for which the data won't be in the cache. If we have the miss rate and the two access times, our average access time is then:

    Avg = hit time + miss rate * miss penalty

Depending on the cache organization, the miss penalty may be higher than the cost of directly accessing the next level of memory.

So, using the numbers derived above, with a RAM access time of 36 cycles, a cache access time of 3 cycles, and a miss rate of 10% (not at all unreasonable), our average access time would be:

    Avg = 3c + .1*36c = 6.6 cycles

So, with a 90% hit-rate cache, our average access time has dropped by about 29 cycles (from 36 to 6.6)! This is why caches are so important. This cache has sped up memory accesses by more than 5 times!

The Memory Hierarchy
=====================

Now that we've introduced caches, we can talk about the memory hierarchy. The hierarchy basically describes two features of memory: size and access time. As memory size increases, access time also increases. So, for a single layer of cache, the memory hierarchy would be:

    Registers (fastest, smallest)    16-256 bytes        single cycle
        |
    Cache (slower, larger)           16-512K             3-8 cycles
        |
    RAM (slower, larger)             128MB - 4GB         30-150 cycles
        |
    Disk (slowest, largest)          6GB - several TB    ~20M cycles

Moving values between memory and registers is explicitly dictated by the user program, and moving values between RAM and disk (a process called swapping) is managed transparently by the OS. Moving values between RAM and cache is managed by dedicated hardware (cache control logic).

This is the simplest picture of the memory hierarchy. Many systems have many levels of cache.
For instance, the L1 (level-1) cache is often small (8-16K) but very fast (3 cycles). On a miss, the L1 queries the L2 cache (not main memory), which is larger (128-512K) but slower (8 cycles). Some systems even have an L3 cache before you actually hit RAM.

How caches work (the short form)
---------------------------------

So, a cache is basically a piece of hardware with dedicated fast memory that sits between the processor and the actual memory system. When the processor asks for an address, the cache first looks to see if it has the data. If it does, it returns it; otherwise it forwards the request to memory. When memory eventually responds, the cache both returns the value to the processor and stores it, so that the next time the processor asks for it, it will be in the cache.

This sounds simple, but there are many, many issues. One is: how do you cover up to 4GB of memory with 16 or 512K of cache? This is a question of associativity, and is a big subject to be discussed in detail by Mr Maxwell on Friday. Another question is: what do we do when the cache is full? If the processor asks for data and we don't have it, we need to evict some data in order to make room for the new stuff. The rule for choosing what to throw out is known as the cache's eviction policy, and is an equally large subject. There are other design points too. How large should the cache lines be? How large should the cache itself be? Should the cache update main memory on all writes, or only on eviction? These are all serious design questions, and there's no single right answer. It's all a question of tradeoffs: hit rate vs. speed, speed vs. cost, etc.

Locality (or why caching works)
================================

All this time, we haven't really discussed how or why caching works. All we really know is that caches are fast memory near the CPU. To explain how caches work, we actually need to understand why they work. To do that we need to explain locality.
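
As a brief aside before defining locality: the average-access-time formula from the previous section extends naturally to the L1/L2 arrangement described above, because the L1's miss penalty is just the average access time of the L2. The sketch below works this through in Java. The 3, 8, and 36 cycle figures and the 10% miss rate come from these notes; the 20% L2 miss rate is an invented number used purely for illustration, and the class name Amat is hypothetical.

```java
final class Amat {
    /** Avg = hit time + miss rate * miss penalty (all times in cycles). */
    static double amat(double hitTime, double missRate, double missPenalty) {
        return hitTime + missRate * missPenalty;
    }

    public static void main(String[] args) {
        // Single level of cache: 3-cycle hit, 10% miss rate, 36 cycles to RAM.
        double singleLevel = amat(3, 0.10, 36);

        // Two levels: an L1 miss goes to the L2 (8 cycles), and only an L2 miss
        // goes on to RAM. The 20% L2 miss rate is an assumption, not a measurement.
        double l2Average = amat(8, 0.20, 36);
        double twoLevel = amat(3, 0.10, l2Average);

        System.out.println("single-level average (cycles): " + singleLevel);
        System.out.println("two-level average (cycles):    " + twoLevel);
    }
}
```

The single-level call reproduces the 6.6 cycle figure worked out earlier (up to floating-point rounding). Under the assumed L2 miss rate, adding the second level lowers the average further, because most L1 misses are now satisfied in 8 cycles instead of 36.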
Locality is a feature of programs that describes the way they access memory. Temporal locality means that a program's accesses to a piece of data are clustered together in time (lots of accesses, very few instructions apart). Spatial locality means that a program accesses data items stored near each other in memory.

Temporal example
-----------------

Consider the following Java code:

    Integer foo = Integer.valueOf(4096);
    int val = foo.intValue();
    int powerOfTwo = 0;
    // halve val until it reaches 1; for 4096 the loop runs 12 times
    while (val > 1) {
        powerOfTwo++;
        val = val / 2;
    }
    foo = Integer.valueOf(powerOfTwo);

Just before we enter the while loop, foo, val, and powerOfTwo have each been accessed once. With each iteration of the while loop, we read val twice and write it once, and we read powerOfTwo once and write it once. The loop runs 12 times, so after the loop (not counting the final failed test) we've accessed foo once, val 37 times, and powerOfTwo 25 times. See how we accessed val, did a comparison, some arithmetic, and then overwrote val? This is a standard kind of behavior for a loop. This is an illustration of temporal locality. We access val and powerOfTwo once, and then a bunch more times in rapid succession. Note, too, that foo isn't very temporally local (at least in this part of the code).

So, it seems that a good cache should store things that are accessed, in the hope that they'll be temporally local and will be accessed a lot more in the near future.

Spatial example
----------------

Consider the following code that counts the number of odd integers in an array:

    int[] nums;   // assume this has been filled in elsewhere
    int num_odds = 0;
    for (int i = 0; i < nums.length; i++) {
        // != 0 rather than == 1, because in Java a negative odd number % 2 is -1
        if ((nums[i] % 2) != 0)
            num_odds++;
    }

This code traverses an array, one element at a time, left to right. This code has spatial locality because accesses to one element of an array presage accesses to subsequent elements of the array. If the cache only brings in integers when they're requested, then there will be a cache miss with every array access. This basically means that caches buy you nothing.
But if you recognize that spatial locality is important, and you bring in the integers around the requested integer, then some of them will be accessed later. And because they're already in the cache, those accesses will be at cache speed, not main memory speed.

In caches, spatial locality is exploited by bringing in whole cache lines at a time. A line is a group of words. Common line sizes are 4, 8, even 16 words. These lines are aligned. So, on a system with an 8 word line size (that's 32 bytes), when any address in a line is requested, the whole line is brought in. Consider an 8 word line that begins at 0xCAFE0000 (and ends at 0xCAFE001F). If the integer at 0xCAFE0004 is requested, then it, the integer before it in memory, and the 6 integers after it will all be brought into the cache. So, if the neighboring data is needed, it will already be in the cache.

Complex spatial example
------------------------

Consider the following Java code to do matrix multiplication:

    int[][] y;   // N x N input matrix
    int[][] z;   // N x N input matrix
    int[][] x;   // N x N result, x = y*z
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }

By default, multi-dimensional arrays are stored in what is known as row-major order. That is, y[i][j+1] is at the memory address right after y[i][j]. What this means is that accesses to y[i][k] get the spatial locality benefits (because they're crawling along a row one element at a time). But accesses to z[k][j] are different: they walk down a column, one row per step. If a row of the matrix is longer than a cache line, there's no way that the cache will accidentally fetch the element one row lower. So, the z[k][j] access causes a cache miss each time. This is bad. It exposes spatial locality for a bit of a hack. It's really just a heuristic that depends upon array and object access idioms.

So, if you (or your compiler) are aware that the code is running on a machine with a cache of a given size and line size, you can change the code to compute sub-matrices that will each fit in the cache.
Given a blocking factor, B, which is some value smaller than the cache size, you can rewrite the code thus:

    // x must start out as all zeros: each (jj, kk) block adds a partial sum to it
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < Math.min(jj + B, N); j++) {
                    int r = 0;
                    for (int k = kk; k < Math.min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }

So now, if the line size and B are comparable, most (if not all) of the rows of the sub-matrix will be in the cache when the multiplication is done. This greatly speeds up the running time of the code. Of course, it also makes it considerably less legible.

A simple address trace
----------------------

Now for a simple example, assume an empty cache with 4-word lines and a size of 16K. Consider the following sequence of memory accesses from the CPU:

    0xCAFE0000    miss
    0xCAFE0001    hit (spatial)
    0xCAFE0003    hit (spatial)
    0xCAFE0000    hit (temporal)
    0xCAFEBAB2    miss
    0xCAFEBAB1    hit (spatial)
    0xCAFEBAB1    hit (temporal)
    0xCAFEBAB2    hit (temporal)
    0xCAFEF000    miss
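
The trace above can be replayed with a few lines of Java. This is a minimal sketch under some assumptions that the notes don't state: addresses are treated as word addresses (so a 4-word line covers 4 consecutive addresses), and nothing is ever evicted, which is safe here only because the trace touches three lines and the cache holds far more than that. The class name CacheTrace and the method classify are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CacheTrace {
    static final int WORDS_PER_LINE = 4;

    /** Labels each access "miss", "hit (spatial)" or "hit (temporal)". */
    static List<String> classify(long[] addresses) {
        Set<Long> residentLines = new HashSet<>();  // aligned lines already brought in
        Set<Long> seenAddresses = new HashSet<>();  // exact addresses already requested
        List<String> labels = new ArrayList<>();

        for (long addr : addresses) {
            long line = addr / WORDS_PER_LINE;      // which aligned line holds this word
            if (!residentLines.contains(line)) {
                residentLines.add(line);            // a miss brings in the whole line
                labels.add("miss");
            } else if (seenAddresses.contains(addr)) {
                labels.add("hit (temporal)");       // this very word was requested before
            } else {
                labels.add("hit (spatial)");        // a neighbor brought the line in
            }
            seenAddresses.add(addr);
        }
        return labels;
    }

    public static void main(String[] args) {
        long[] trace = {
            0xCAFE0000L, 0xCAFE0001L, 0xCAFE0003L, 0xCAFE0000L,
            0xCAFEBAB2L, 0xCAFEBAB1L, 0xCAFEBAB1L, 0xCAFEBAB2L,
            0xCAFEF000L
        };
        List<String> labels = classify(trace);
        for (int i = 0; i < trace.length; i++) {
            System.out.println("0x" + Long.toHexString(trace[i]).toUpperCase() + "    " + labels.get(i));
        }
    }
}
```

A hit counts as temporal when that exact address has been requested before, and as spatial when the line is present only because a neighboring word was requested. With that rule the sketch gives the same labels as the hand trace. A real cache would also need an eviction policy and a mapping from addresses to cache slots, both of which are left out here.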