We have now seen the two extremes for tag comparison – check every tag in the cache (fully associative) or check just one tag (direct mapped). As is often the case with extremes, there are obvious drawbacks to either. What we really want is something in the middle. We want better cache utilization and fewer conflicts, but we also want speed and low cost. The solution is to organize the cache so that each block of memory maps to a small number of cache lines.
Suppose we allow each block to map to four places in cache. We can then tolerate up to four conflicting references at a time – each has a unique place in cache. But we also have to compare the tags for each of the four locations on every reference. However, a four-way comparison is not too costly, and the fan-in of 4:1 for the results of the comparisons is easy to implement in a single level of logic. So we still retain most of our speed. Such an organization is called a 4-way set associative cache.

In a K-way set associative memory, the cache is divided into sets of K lines. There are S = L/K sets, where L is the number of lines in the cache. The address contains a set field (a truncated version of the block number) that selects a set. The K tags for the set are read out and compared to the tag for the address. If there is a match, then the corresponding line is selected and the data is fetched from it.

Up to K collisions between memory lines with the same set number can occur before a value in the cache must be returned to memory. It has been shown that for reasonably small K (e.g. 2 or 4), the vast majority of conflicts can be handled without thrashing (repeated cache line replacements).

A set associative memory requires only K comparators, rather than the L comparators of the fully associative memory. When S = 1, then the cache is fully associative (K = L). When K = 1, the cache is direct mapped. Thus, set associativity can provide an economical midpoint between these two extremes.
In experimental analyses of the level of associativity necessary in a cache, it has been found that there is only about a 10% difference in utilization and performance between a 4-way seta associative cache and a fully associative cache. Thus, the majority of the benefit is obtained with just four comparators, and the returns diminish quickly if more comparators are added. There is, however, a significant jump between 2-way and 4-way associativity, and this can be partly attributed to the need to support the two operands and the result of a dyadic array operation. With just two ways, it is possible that the three arrays involved do not fit into the cache because of line conflicts, whereas with more than two ways the arrays can all be mapped to the cache at once. The DEC Alpha has used a 3-way cache to good effect in this regard.
In addition to the tag and data, each cache line contains a valid bit that indicates whether it contains valid data or not. Initially all of the valid bits indicate that the cache is empty. Eventually the cache is filled and the valid bits become true. However, there is a condition that can cause a valid bit to once again become false. What that might be?
When a DMA I/O operation changes the value of the corresponding location in main memory, the value in the cache is no longer valid.
Thus, the cache includes logic that watches every memory write that originates from someplace other than the CPU, and if an address is seen that is in the cache, the cache line is marked as invalid. A subsequent read or write with that location will miss in the cache and go to main memory to retrieve the recently input value.
Replacement
In a direct-mapped cache, if a new value is being read that conflicts with a value already in the cache, then there is only one possible action: the conflicting value must be returned to main memory to make room for the new value. This process is called replacement.
When a cache is associative, there are K possible cache locations where the new value can be stored. If all of them are full, then one must be replaced. The problem is, how do we determine which one should be replaced? The algorithm for determining replacement is called the replacement policy.
Given that we would like to keep values that we will need again soon, we want to get rid of the one that won't be needed for the longest time. Since we can't look into the future, we have to make a best guess.
There are several popular replacement policies that try to approximate this optimal algorithm. For example, we can consider temporal locality and guess that any value that has not been used recently is unlikely to be needed again soon. Thus, we can keep track of the last time each value was accessed, and which ever was least recently used (LRU) is the one we will replace. Unfortunately, LRU requires us to keep a history of access for every cache line, which takes up a lot of space and also slows down the operation of the cache. Even so, modern processors mostly use this scheme because of its superiority.
Another approach is to approximate LRU. We can hypothesize that the value fetched first is now the oldest reference and thus that, if it hasn't been accessed, it is the LRU value. Obviously, if spatial locality is good, this is a poor approximation.
Yet another tack is to take a line at random. The problem with LRU and FIFO is that there are degenerate referencing situations in which they can be made to thrash. Some researchers have argued that random replacement, while it sometimes throws out data that will be needed soon, never thrashes. Unfortunately, it is difficult to have truly random replacement, and it does decrease average performance in order to improve rare worst cases.
Writing
Thus far we have mostly discussed what happens when a value is read. When a value is written to the cache by the CPU, there are two things we can do: we can write it to the cache and simultaneously write it through to the main memory so that the master copy is kept up to date. Or, we can write it to the cache and not write it back to the main memory until it is replaced. The former approach is called write through while the latter is called write back.
Write through has the disadvantage that it can slow the cache down to the speed of the main memory for a string of writes. Write back only slows the cache for a replacement, which is already a main-memory speed operation. However, write back requires keeping a bit to remember that a value has been written (the dirty bit), and it also presents a problem for DMA I/O where an I/O device gets output data directly from memory. Because the memory contains stale information, the I/O device could output garbage. A write back cache must also detect when I/O is taking place (by watching the address lines into the main memory for matches to its tags) and then save the data before the I/O device reads it, or alternatively it can simply respond in place of memory to the I/O device's read request.
Write through is thus more complex and costly to implement, and the constant checking of DMA addresses can slow the cache down by tying up the comparators. Many modern systems thus provide both write back and write through modes of operation. It is also noteworthy that instruction caches do not have to take this into consideration because they cannot be written to.
We also need to consider what happens in the case where the CPU writes to an invalid location in the cache. This can either be because the CPU is generating a write address for a location that wasn't previously read, or because it is writing to a location that was invalidated by DMA input.
So if we try to read from a location in that line other than the one just written we may get garbage. There are two solutions to this problem. One is to declare a write miss, and read the rest of the line from main memory, filling it with valid data. The other is to gamble that the processor will write more values into this line before reading it, and instead mark the individual words in the line as valid or invalid. If a subsequent read attempts to access an invalid word in the line, we have a miss and read in the invalid part(s) of the line. However, if the processor eventually fills the line with writes, we avoid taking the miss.
We also have to handle conflicts for writes. Consider that the CPU may write to a set in which all of the ways are full. The mechanism that handles replacement on reads also handles replacement on writes -- essentially it watches for anything being stored into a valid line in the cache and reacts accordingly regardless of the source of the store.
Write Buffer
It can be observed in many programs that writes occur in clusters rather than being evenly distributed in time. This presents a problem for the cache when it is operating in write-through mode or if the writes cause a series of replacement operations. In either case, the access involves main memory (or next level of cache), which may be a factor of 10 to 100 slower. Thus, the processor will have to wait while the data is being written before it can proceed. Many processors address this problem with a small set of registers called a write buffer. The data and its destination address are written to the write buffer where they wait for memory to become available so the data can be actually stored. In the meantime, the processor can continue on just as if the data has actually been written. What is the one remaining problem that needs to be addressed in order for this to work?
What happens if the processor reads from that location before the data is actually written to it?
The write buffer must include logic that checks whether a read address matches any of the addresses waiting in the write buffer. If there is a match, then the data is read back from the write buffer instead of from the cache or from memory. The write buffer also helps to address the problem of the cache being busy (such as completing a line-fill operation) when a write is issued.
Split vs. Unified Caches
We have already noted that instruction fetches and data fetches have different patterns of access and locality. In particular, instructions are not written and so they do not require write buffers or other write-related logic. These differences encourage the use of caches that are split between instructions and data. In addition, having two separate caches provides the potential to fetch from both at once (or fetch from one while writing to the other), thereby doubling the rate at which values are transferred between memory and the CPU.
On the other hand, when caches are split, they are usually each half of the size of a single unified cache. Different programs may have larger or smaller working sets in data versus instructions. That is, one program might process a very small amount of data with a complex algorithm while another might process a large amount of data with a simple algorithm. In a unified cache there is added flexibility to accommodate these variations. Thus, the hit ratio of a unified cache is always larger than that of an equivalent split cache.
Furthermore, when a cache is external to the processor chip, there may not be enough pins to carry the values from two caches at once into the CPU chip. Thus, it may not be possible to take advantage of the potential bandwidth.
The trend in technology, however, has been toward fitting larger and larger caches onto the processor chip. As the sizes of the split caches have grown, the difference in hit ratio between split and unified designs has decreased to around 1%. Because an on-chip cache is not restricted by the number of pins at the edge of the chip, the doubled bandwidth can be fully exploited. Hence we see a period in microprocessor design where unified caches dominated that leads to a nearly universal use of split cache designs for the cache nearest the CPU.
However, for many processors there are additional levels of cache both internal and external to the chip. At the levels below the first (called level-1 cache), caches are generally unified. There are several reasons for this choice. One is that, the larger the cache, the harder it is to split it in a manner that balances program space against data space. This may seem contradictory to the reasoning used for the level-1 cache (where the cache needed to grow to a certain size before we could split it), but it is not.
At very small sizes, we cannot afford the loss in hit rate that results from wasting a fraction of the resources. The small amount of space results in dramatic and dynamic swings between the need for caching instructions and data. As the cache grows in size to the point that each half can hold most of the working code and data for an active portion of a program, we can safely split it. There is enough capacity to absorb the dynamism in the storage requirement. Yet it is still small enough that we are guaranteed to use the resource fully. If we grow the cache beyond that point, we are starting to use it as a holding area for chunks of code and data that are waiting to be promoted to the cache. The balance between code and data at this level is more affected by the ratio of code locality to data locality in the overall program. By splitting the cache, we would be favoring a certain ratio, and other ratios would be suboptimal.
It may help to envision this through an analogy. Suppose we have a road that must carry traffic east and west. (east = instructions, west = data). If the road is very narrow, say just one lane, then it has to carry both types of traffic, under some protocol for sharing the limited space. In the morning, perhaps most traffic is east, and in the evening it returns west – its dynamic nature is best handled by the unified nature of the road. But suppose we widen the road to two lanes – we still allow passing, so the road is still unified, although the traffic capacity is more balanced. Widening the road still more, we can turn it into a divided highway. The road is split in two, with completely separate paths for each type of traffic. Going beyond the road, we have parking lots where vehicles sit idle, waiting to be used. It makes no sense to divide these lots into east and west because we don’t know at any given time how many of the vehicles will be bound in either direction. Thus, the parking lots are unified.
As the preceding two paragraphs indicate, there is a range of sizes in which it makes sense to split the cache. Outside of that range, either larger or smaller, the cache should be unified.
In addition, the decision to unify the cache is also influenced by the presence of a virtual memory system. We will look at virtual memory later in the class, but for now it should be noted that virtual memory maps a name space (the program’s references) to an address space (the physical addresses of RAM, I/O devices, etc.). The translation usually occurs between level 1 and level 2 caches, because we do not want to introduce any delay in the L1 access. So L1 is said to by “virtually mapped” and L2 and below are “physically mapped.” It is easier to build the virtual memory system when the memory is unified in the physical address space – otherwise, we would have two physical address spaces to map to.
Multi-Level On-Chip Caches
As technology enables us to place more transistors on a chip, it would seem logical to continue to increase the size of the cache. A larger cache has a higher hit ratio, and so we should be able to operate at cache access speeds for a greater percentage of the execution time. However, as we saw earlier in the course, as a circuit grows larger it often grows slower. If we continue to increase the size of the cache, the time required to read data from it increases and eventually we have to start slowing the clock cycle of the processor to compensate (or else the processor waits frequently).
The solution is to employ a hierarchy of caches on the chip. A small, fast cache is used that can keep up with the clock rate of the processor, and a larger secondary cache is made as large as we can fit on the chip. The result is that we get a lower hit ratio for the first level cache (especially since it is probably split) but we also get relatively fast access to the next level (perhaps just two cycles instead of the usual ten or more). And at that second level we see a hit ratio that is very close to 100% because of its size. Accesses to an external cache are then very infrequent and those to main memory are primarily due to unusual events such as context switches or I/O.
Some researchers hypothesize that eventually processor chips will be almost entirely memory, and that external RAM will be treated much like we currently treat secondary storage devices -- it will be used for file storage and will be explicitly read and written in large blocks with programmed I/O. A more radical view says that all memory will eventually contain processors and that operations on data stored in external memory will send code to accomplish the processing remotely rather than fetching the data. However, this view neglects processing in which large amounts of separate data must be combined, essentially forcing the data to be moved to a common processor where it can be acted upon.
There is also concern that processor clock rates are going to grow so fast that there is no way to continue to build caches large enough to hold the active code and data, and still avoid processor delays. Various ideas are in the experimental stage, such as adding an L0 cache that is very small, and which only holds the most frequently used data and instructions (this could be thought of as a bank of additional registers that are managed by the hardware). Another possibility is to split the L1 cache even more (space is not the limit, but distance from the pipelines is a major contributor to delay). Thus, we might see separate caches for loads and stores, branches, execute instructions, integers, and floating point values.
Case Study Processors
|
|
PPC 601 |
PPC 603 |
PPC 604 |
PPC 620 |
SPARC |
R10000 |
R4400 |
Pentium |
P6 |
|
Blocking |
1 |
0 |
4 |
|
|
|
|
|
|
|
Split? |
No |
Yes |
Yes |
Yes |
Yes |
Yes |
Yes |
Yes |
Yes |
|
Data |
32K |
8K |
16K |
32K |
16K |
32K |
16K |
8K |
8K |
|
Instruct. |
Unified |
8K |
16K |
32K |
16K |
32K |
16K |
8K |
8K |
|
Associativity |
8 |
2 |
4 |
|
D: 1 I: 2 |
2 |
1 |
2 |
D: 4 I: 2 |
|
Line size |
64 B |
32 B |
32 B |
|
|
|
16 or 32 B |
32 B |
|
|
Data Write |
Back (through) |
Back (through) |
Back (through) |
Back (through) |
|
|
|
Back (through) |
Back (through) |
|
Replacement |
|
|
LRU |
|
|
|
|
|
|
|
Notes: |
|
|
|
I-cache predecodes instr. |
|
I-cache predecodes instr. |
|
|
256KB 4-way L2 cache |
© Copyright 1995, 1996, 2001 Charles C. Weems Jr.
All rights reserved.
Back to Chip Weems' home page.
Back to Computer Science Department home page.