Lecture 9 Memory Hierarchy and Caching

Memory Hierarchy

There are many different technologies for storing data. In the early years of computing, cathode ray tubes, acoustical delay lines, magnetic drums, magnetic cores, punched cards, and open-reel magnetic tape were popular.

Today, these technologies have essentially been replaced by semiconductor memory, magnetic disk, optical disk, and helical-scan magnetic tape cartridges. Other technologies have been proposed, such as CCD and magnetic bubble memory, but have not succeeded in gaining a significant part of the market. Part of the reason that these devices failed, even though they had great potential, is that there is a tremendous amount of infrastructure behind the current technologies, which were adopted with (and in part enabled) the explosion in computer usage.

Given the huge amount of capital available to support development of the next RAM, disk, or tape technology, it is difficult for a radically new technology to enter the field. In the case of optical disk technology, the development capital came from entertainment companies, and it was realized that the new storage medium had a secondary market for non-volatile data and software storage, especially at high density and low duplication cost, and with random access -- a niche not filled by any other storage medium. However, the development cost for CD technology was probably in the $100M - $500M range and people are still struggling to find uses for the new medium. (Collections of shovelware are common.) The biggest applications are in video-rich computer games, and the use of writable CD media for backup of magnetic disk. DVD-ROM is replacing CD-ROM in most new computers, but there are few computer applications that can take advantage of the extra space. They can be used to turn laptop computers into portable video players, and with writable DVD, the capacity for disk backup is greater.

Each of the technologies has a place in the market because it covers a range of access times with inversely proportional cost per unit, which in turn is proportional to density. CD-ROM filled a gap in access time between disk and tape that had emerged as disks got faster and tapes did not. The price of CD-ROM and DVD-ROM is basically determined by a market analysis that places it between disk and tape -- it is actually much cheaper to produce than the retail price would indicate.

Bubble memory tried to provide slightly faster access rates than disk with a promise of slightly higher density. However, the gap was not wide enough, and disk density, access rate and cost caught up with it by the time it was developed. CCD memory was likewise intended to fill the same gap, but with faster speeds and higher density. However, its cost was close to that of semiconductor memory so there was little advantage over simply using slower RAM. This gap has essentially closed due to cheap DRAM and Redundant Arrays of Inexpensive Disks (RAID) that provide higher bandwidth transfers, especially when they contain RAM buffers.

There appears to be a wide gap between the main and secondary memory, but we are used to the fact that this gap exists, and both systems and languages are structured to work around it. That is, it is part of the programming model.

In a similar manner, transfers between the top two levels can be programmed because they are at a level of granularity that corresponds to the linguistic elements that a compiler can easily extract and manipulate.

Currently, the gap that seems to open a possibility for new memory devices is the one emerging between the on-chip and the off-chip memory levels. The problem is that when a priori or empirically estimated locality assumptions break down, the cost of going to main memory is relatively high. That is, a cache miss is very costly. Thus, we have seen some new RAM technologies, such as Rambus, enter the market to provide a high-performance off-chip memory.

The memory system architect's approach is to try to reduce that cost by speeding up the main memory. Thus, we see schemes that fetch many words at once in order to support a new region of locality in cache. In other words, if we get a miss, the assumption is that we are seeing a switch in the region of locality accessed by the program, and thus should try to load that region quickly so that we do not suffer further misses.

On the other hand, if the language made it possible for the compiler to consider a larger context, it might be possible to schedule the transfers between these levels so that their effects are lessened. In addition, it might be possible for the compiler to reorder the processing to maximize the locality of reference, thereby giving the memory system time to transfer the data.

As an example, consider that if a program runs sequentially through a terabyte of data, performing one operation on each value, the caches will effectively miss continuously. But if the program does this over and over, it may be possible to reorder the statements so that more operations are performed for each data reference, providing time to prefetch the next value. This is already done for loops on arrays, but is hard to do for linked lists and other non-array structures.

Hwang defines five parameters associated with memory technologies arranged in a hierarchy:

Access Time: Time for the CPU to fetch a value from memory -- including delays through any intermediate levels.

Memory Size: The amount of memory of a given type in a system.

Cost Per Unit (byte): Cost per unit times size roughly equals total cost.

Transfer bandwidth: Units (bytes) per second transferred to the next level.

Unit of transfer: Number of units moved between adjacent levels in a single move.

He also defines three properties of a memory hierarchy:

Inclusion: If a value is found at one level, it is present at all of the levels below it.

Coherence: The copies at all of the levels are consistent.

Locality: Programs access a restricted portion of their address space in any time window.

Obviously, none of these is strictly true. Most hierarchies are inclusive from the registers to the main memory (although we could imagine a multi-level cache that skips a level on loading, and only copies out to the lower level when writing back). However, most tape units do not spool to disk before going to main memory -- they are DMA devices, just like the disks.

Coherence would only be true for a write-through cache, but not for a write-back cache.

Locality is entirely program-dependent. For example, LISP programs have logical locality that does not correspond to physical address locality. Most caches assume array type data access and sequential code. The book identifies three aspects of this form of locality:

Temporal locality: Recently accessed items tend to be accessed again in the near future.

Spatial locality: Accesses are clustered in the address space.

Sequential locality: Instructions tend to be accessed in sequential memory locations.

Temporal locality tends to hold under all conditions for code (even parallel or recursive codes) although it may break down in a rule-based system. For data, it is most effective in array-based applications.

There is less spatial locality to exploit in the data accesses when they are being processed in parallel. For example, if a cache has 8 words per line, and the data has perfect spatial locality, then a scalar machine exploits this by getting 7 hits for every miss. But a superscalar machine that is consuming the values two at a time gets just 3 hits for each miss. The presence of spatial locality implies some temporal locality (i.e., if all accesses are within a small block of memory, then it is probable that many of them are being accessed repeatedly). However, temporal locality does not imply spatial locality (e.g., a pointer-oriented program might repeatedly run through a short list whose elements are scattered).
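The arithmetic behind the 8-word line example can be checked with a few lines of Python. This is only a sketch: the function name hits_per_miss is made up for illustration, and it assumes line-aligned streaming accesses in which the number of words consumed per access divides the line size evenly.

```python
def hits_per_miss(words_per_line: int, words_per_access: int = 1) -> int:
    """Hits that follow each miss when streaming with perfect spatial locality.

    Assumes accesses are line-aligned and that words_per_access divides
    words_per_line evenly (an assumption made for this illustration).
    """
    if words_per_line % words_per_access != 0:
        raise ValueError("words_per_access must divide words_per_line")
    # The first access to a line misses; the remaining accesses to it hit.
    accesses_per_line = words_per_line // words_per_access
    return accesses_per_line - 1


print(hits_per_miss(8, 1))  # scalar machine: 7
print(hits_per_miss(8, 2))  # two values at a time: 3
```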

Sequential locality holds for instructions and certain array access patterns. Interestingly, sequential locality can actually increase for programs on massively parallel processors because some loops are eliminated. For example, if an array is distributed so that one of its dimensions is spread across the processors, there is no need for the loop that would normally step through that dimension.

One of the implications of the differences in locality that we may encounter between data and instructions is the benefit of having separate instruction and data caches. The main benefit of separate caches, however, is that instructions and operands can be fetched simultaneously -- a design known as a Harvard architecture, after the Harvard Mark series of electromechanical machines, in which the instructions were supplied by a separate unit.

The hit ratio, h, is an important measure of the performance of a memory level: it is the probability that a reference is to a value already in a given level of the hierarchy. The miss ratio is 1 - h.

The access frequency is the product of the hit ratio for the given level with the miss ratios of all higher levels.

The effective access time of the memory system is the sum of the access frequencies times their corresponding access times.
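As a rough illustration, the two definitions above can be turned into a short calculation. The three-level hierarchy, its hit ratios, and its access times below are invented numbers, not measurements, and the function names are hypothetical. The last level is given a hit ratio of 1.0 so that every reference is eventually satisfied.

```python
def access_frequencies(hit_ratios):
    """Access frequency of each level: its hit ratio times the miss
    ratios of all higher (faster) levels. Levels are listed fastest first."""
    freqs = []
    miss_so_far = 1.0  # probability that every higher level missed
    for h in hit_ratios:
        freqs.append(h * miss_so_far)
        miss_so_far *= 1.0 - h
    return freqs


def effective_access_time(hit_ratios, access_times):
    """Sum of access frequency times access time over all levels."""
    if len(hit_ratios) != len(access_times):
        raise ValueError("need one access time per level")
    freqs = access_frequencies(hit_ratios)
    return sum(f * t for f, t in zip(freqs, access_times))


# Hypothetical hierarchy: cache, main memory, disk (times in nanoseconds).
hit_ratios = [0.95, 0.99, 1.0]
access_times_ns = [1.0, 100.0, 10_000_000.0]

print(access_frequencies(hit_ratios))
print(effective_access_time(hit_ratios, access_times_ns))
```

With these made-up numbers the disk is referenced only about 0.05% of the time, yet it contributes roughly 5000 ns of the approximately 5006 ns effective access time, which shows how strongly the slowest level can dominate.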

This formula is overly simplistic, as it assumes there is some probability that you have to go to tape to get data, for example. It also ignores short-circuit mechanisms, such as TLBs, that might have a variable effect on access time. However, one point should not be minimized -- proper memory hierarchy design depends on extensive simulation.

Cache Memory Design

Basic Cache Structures

A cache consists of lines of data, usually containing values from two or more consecutive addresses of main memory. Each line has associated with it a tag that stores the high order bits of the address for the data in the line.

The differences in the cache mostly have to do with how the comparison is implemented.

In a fully associative cache, there is a comparator for each tag, and any line can contain any block of memory.

In a direct-mapped cache, there is a block number in addition to the tag field of the address. The block number specifies a line in the cache, and the tag field of the address is compared with the tag field attached to the line. If they are equal, then there is a hit.

Direct mapping is very fast, but it suffers from a problem: two main memory lines that map to the same cache line collide, and they can thrash in and out of the cache even when there is plenty of space left to hold them both. With increased cache size, the probability of such collisions is reduced, and thus many modern machines have used direct-mapped caches.
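A minimal sketch of a direct-mapped lookup follows. It assumes word addressing and power-of-two sizes, and it adds a word-offset field below the block number (the source describes only the tag and block number). The names split_address and DirectMappedCache are invented for this example. The model tracks tags only, not data.

```python
def split_address(addr, words_per_line, num_lines):
    """Split a word address into (tag, block number, word offset).

    Assumes words_per_line and num_lines are powers of two.
    """
    offset_bits = words_per_line.bit_length() - 1
    index_bits = num_lines.bit_length() - 1
    offset = addr & (words_per_line - 1)
    block = (addr >> offset_bits) & (num_lines - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, block, offset


class DirectMappedCache:
    """Tag-only model: one tag per line, no data and no valid bits."""

    def __init__(self, words_per_line, num_lines):
        self.words_per_line = words_per_line
        self.num_lines = num_lines
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, replace the line and return False."""
        tag, block, _ = split_address(addr, self.words_per_line, self.num_lines)
        hit = self.tags[block] == tag
        if not hit:
            self.tags[block] = tag
        return hit


print(split_address(45, words_per_line=8, num_lines=4))  # (1, 1, 5)

# Addresses 0 and 32 share block number 0 in a 32-word cache, so they thrash
# even though the other three lines stay empty.
cache = DirectMappedCache(words_per_line=8, num_lines=4)
print([cache.access(a) for a in (0, 32, 0, 32)])  # [False, False, False, False]
```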

In a K-way set associative memory, the cache is divided into sets of K lines. There are S = L/K sets, where L is the number of lines in the cache. The address contains a set field that selects a set. The K tags for the set are read out and compared to the tag for the address. If there is a match, then the corresponding line is selected and the data is fetched from it.

Up to K collisions between memory lines with the same set number can occur before replacement must occur. It has been shown that for reasonably small K (e.g. 2 or 4), the vast majority of conflicts can be handled without thrashing replacements.

A set associative memory requires only K comparators, rather than the L comparators of the fully associative memory. When S = 1, the cache is fully associative (K = L). When K = 1, the cache is direct mapped. Thus, set associativity can provide an economical midpoint between these two extremes. The majority of caches today are set associative. Typical levels of set associativity are 2-way, 4-way, and 8-way. The Alpha 21164 provided a 3-way set associative on-chip second-level cache, and it was noted that performance was significantly better than with 2-way set associativity. The designers believed that this is because there are many array processing operations in which two arrays are input to a formula, and the result is assigned to a third array. With two-way set associativity, there are not enough ways to accommodate this triad of values, but with 3 ways, it is possible to operate on all of the values without evicting some.
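The three-array effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not a model of any real processor: the class SetAssociativeCache and the helper triad_hits are hypothetical, replacement is LRU, the number of sets is held fixed while K varies, and the three arrays are deliberately placed so that corresponding elements fall in the same set.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """K-way set associative cache with LRU replacement (tags only)."""

    def __init__(self, num_lines, ways, words_per_line=1):
        if num_lines % ways != 0:
            raise ValueError("ways must divide num_lines")
        self.ways = ways
        self.num_sets = num_lines // ways  # S = L / K
        self.words_per_line = words_per_line
        # One ordered dict of tags per set, least recently used first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        line = addr // self.words_per_line
        set_index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = None
        return False


def triad_hits(ways, num_sets=4, words_per_line=8, n=64):
    """Count hits for c[i] = a[i] + b[i] when a, b and c collide in every set."""
    cache = SetAssociativeCache(num_sets * ways, ways, words_per_line)
    stride = 1 << 20  # array bases chosen so a[i], b[i], c[i] share a set
    hits = 0
    for i in range(n):
        for base in (0, stride, 2 * stride):
            hits += cache.access(base + i)
    return hits


print(triad_hits(ways=2))  # 0: the three lines keep evicting each other
print(triad_hits(ways=3))  # 168 of 192 accesses hit
```

Setting ways=1 gives the direct-mapped case and setting num_sets=1 gives the fully associative case, matching the two extremes named in the text.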

Auxiliary Cache Structures

Caches are often augmented by adding other small structures to support their operation. These provide added resources that are, in effect, available to use in critical cases.

Between the cache and the CPU there is usually a set of write buffers. These form a small, fully associative cache that is built like a set of registers to be very fast. When the CPU writes to memory, it tends to do so in bursts (e.g., spilling registers, pushing stack frames, etc.). Without the buffers, the CPU would have to stall as the slower cache tries to write these values. The problem is made worse by the higher probability that bursty writes such as these will miss in the cache. Thus, the write buffer provides a bit of slack in the write path to compensate for this mismatch in speeds. Of course, if the CPU tries to read recently written data, then the write buffer must respond in place of the cache. Also, the write buffer must constantly monitor the status of the cache so that it can immediately start writing the next value when the cache is free. And, when there is a coherency check in a multiprocessor, the write buffer may have to respond to it as well. That is, when another processor attempts to access a value that has been changed, the new value may still be in the write buffer.
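The write path described above might be modeled roughly as follows. The WriteBuffer class, its capacity, and the use of a plain dict to stand in for the cache are all assumptions made for this sketch. Coherency probes from other processors are not modeled.

```python
from collections import deque


class WriteBuffer:
    """Small FIFO of pending writes sitting between the CPU and the cache."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()  # (address, value) pairs, oldest first

    def write(self, addr, value):
        """Queue a write. Returns False when full (the CPU would stall)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((addr, value))
        return True

    def read(self, addr):
        """Return (True, value) for the newest pending write to addr,
        or (False, None) if the cache must supply the value instead."""
        for pending_addr, value in reversed(self.entries):
            if pending_addr == addr:
                return True, value
        return False, None

    def drain_one(self, cache):
        """Retire the oldest write into the cache once the cache is free."""
        if self.entries:
            addr, value = self.entries.popleft()
            cache[addr] = value


cache = {}
wb = WriteBuffer(capacity=2)
wb.write(0x10, 1)
wb.write(0x10, 2)
print(wb.write(0x20, 3))  # False: the buffer is full
print(wb.read(0x10))      # (True, 2): the newest pending value wins
wb.drain_one(cache)
print(cache)              # {16: 1}
```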

Between the cache and the next lower level in the hierarchy it is common to find another small, fully associative cache: a victim buffer. When a line is evicted from the cache, it is placed in the buffer. Accesses to the cache also check the tags in the victim buffer in parallel, and if a needed line is found there, then it supplies the value and swaps the line with the line in the cache that is now being evicted due to the miss. Lines in the victim buffer are usually replaced with an LRU policy. Thus, in a four-entry victim buffer, if a line is evicted from the cache and not accessed again, it remains in the buffer until four more lines have been evicted. In essence, the victim buffer provides a mechanism for selectively increasing the associativity of a small number of sets when they are under excessive pressure.
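The swap behavior of a victim buffer can be sketched with a direct-mapped cache and a small ordered buffer. The class name VictimCachedDirectMapped and the three result strings are invented for this example. Because a line leaves the buffer whenever it is hit there, dropping the oldest entry behaves as the LRU policy mentioned above.

```python
from collections import OrderedDict


class VictimCachedDirectMapped:
    """Direct-mapped cache of line numbers backed by a small victim buffer."""

    def __init__(self, num_lines, victim_entries=4):
        self.num_lines = num_lines
        self.victim_entries = victim_entries
        self.lines = [None] * num_lines  # resident line number per cache line
        self.victims = OrderedDict()     # evicted line numbers, oldest first

    def access(self, line):
        """Return 'hit', 'victim' (found in the buffer), or 'miss'."""
        index = line % self.num_lines
        if self.lines[index] == line:
            return "hit"
        if line in self.victims:
            del self.victims[line]  # the line moves back into the cache
            result = "victim"
        else:
            result = "miss"         # the line must come from the next level
        evicted = self.lines[index]
        self.lines[index] = line
        if evicted is not None:
            self._add_victim(evicted)
        return result

    def _add_victim(self, line):
        if self.victim_entries == 0:
            return
        if len(self.victims) >= self.victim_entries:
            self.victims.popitem(last=False)  # drop the oldest victim
        self.victims[line] = None


# Lines 0 and 8 collide in an 8-line direct-mapped cache.
with_buffer = VictimCachedDirectMapped(num_lines=8, victim_entries=4)
print([with_buffer.access(n) for n in (0, 8, 0, 8)])
# ['miss', 'miss', 'victim', 'victim']

without_buffer = VictimCachedDirectMapped(num_lines=8, victim_entries=0)
print([without_buffer.access(n) for n in (0, 8, 0, 8)])
# ['miss', 'miss', 'miss', 'miss']
```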

Miss status holding registers (MSHRs) are used to keep track of in-progress misses. Many systems support non-blocking cache reads, in which instructions are allowed to proceed while one or more loads await responses from a cache miss. Often, instructions that are independent of a load can be scheduled behind it and can proceed without waiting for it to complete. However, in a significant number of cases, a load miss will be followed by additional references to words in the same line. These also generate misses, and the MSHRs catch them and prevent them from generating additional transactions in the memory hierarchy. When the line becomes available, the MSHR responds to the pending loads, and processing proceeds. Architects refer to this merging of misses as coalescing.
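Coalescing can be illustrated with a simple table keyed by line number. This is a simplified sketch: the MSHRFile class and its method names are hypothetical, the number of registers is unbounded here (real hardware has a small fixed number), and loads are identified by arbitrary integer ids.

```python
class MSHRFile:
    """Tracks outstanding line misses and merges later misses to the same line."""

    def __init__(self, words_per_line=8):
        self.words_per_line = words_per_line
        self.pending = {}         # line number -> ids of loads waiting on it
        self.memory_requests = 0  # transactions actually sent to memory

    def load_miss(self, load_id, addr):
        """Record a load that missed in the cache.

        Returns True if a new memory request was issued, False if the miss
        was coalesced with one already in flight.
        """
        line = addr // self.words_per_line
        if line in self.pending:
            self.pending[line].append(load_id)
            return False
        self.pending[line] = [load_id]
        self.memory_requests += 1
        return True

    def line_arrived(self, line):
        """The fill completed: return every load that was waiting on the line."""
        return self.pending.pop(line, [])


mshrs = MSHRFile(words_per_line=8)
for load_id, addr in enumerate((64, 65, 70, 128)):
    mshrs.load_miss(load_id, addr)

print(mshrs.memory_requests)  # 2: three of the misses share line 8
print(mshrs.line_arrived(8))  # [0, 1, 2]
```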

Prefetch buffers may be used to enable a cache to fetch more than one line at a time, in case spatial locality is especially high. For example, when a cache takes a miss, it typically fetches just the line containing the missing word. But it is usually less expensive to fetch the next line from main memory immediately, as part of the same operation, than to read it separately later. With a prefetch buffer, the next line is put into a special register, and if the CPU runs off the end of the current line and generates a new miss, then the data is immediately available to be moved into the cache and a new prefetch operation is initiated. Before going out to prefetch from main memory, however, it is important to check the tags in the cache and victim buffer to avoid generating a memory transaction for a line that is already available.
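A one-entry next-line prefetch buffer might behave roughly as in the sketch below. Several simplifications are assumed: the cache is modeled as an unbounded set of line numbers, there is no victim buffer to check, and the class name NextLinePrefetcher and its result strings are invented for illustration.

```python
class NextLinePrefetcher:
    """Cache of line numbers with a single-entry next-line prefetch buffer."""

    def __init__(self):
        self.cache = set()      # resident lines (unbounded, for simplicity)
        self.prefetched = None  # the one line held in the prefetch buffer
        self.memory_reads = 0   # demand fetches plus prefetches

    def access(self, line):
        """Return 'hit', 'prefetch hit', or 'miss' for a line number."""
        if line in self.cache:
            return "hit"
        if line == self.prefetched:
            # The data is already on hand: move it into the cache.
            self.prefetched = None
            result = "prefetch hit"
        else:
            self.memory_reads += 1  # demand fetch from main memory
            result = "miss"
        self.cache.add(line)
        self._prefetch(line + 1)    # start fetching the following line
        return result

    def _prefetch(self, line):
        # Check the tags first so no memory transaction is generated for a
        # line that is already available.
        if line in self.cache or line == self.prefetched:
            return
        self.prefetched = line
        self.memory_reads += 1


p = NextLinePrefetcher()
print([p.access(n) for n in range(4)])
# ['miss', 'prefetch hit', 'prefetch hit', 'prefetch hit']
```

In this sequential pattern only the first access waits on a demand fetch. The total number of memory reads does not go down, since each line is still read once, but later reads overlap with use of the current line.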

An interesting variation on this scheme is to expand the prefetch buffer into a parallel cache with long lines -- a spatial locality buffer. Then the main cache can be reserved for values with strong temporal locality, and its lines can be as short as a single word. Misses load into the spatial buffer, which is fully associative with an LRU policy. The buffer keeps track of which words in the long line are then accessed, and when the line is finally evicted, these words are promoted to the main cache. Thus, the main cache contains only words that have shown strong temporal locality. In effect, the spatial buffer provides a mechanism for monitoring the access pattern of a set of words for a short period, and using the observed behavior to make decisions regarding promotion for longer tenancy in the cache. This mechanism is especially useful when the main cache is small -- as is the case in embedded processors.

As CPU clock rates rise, it becomes more difficult to keep the level 1 cache access operating in a single cycle. We may see a need to start shrinking the size of L1, or to partition L1 into smaller sub-caches, or to insert an L0 cache that is smaller. The difference between L0 and the other levels is that it probably cannot load every value that is missed. It must be more selective in what it chooses to store for fast access. This is an open area for research.