Lecture 10: Cache Simulation

The entire basis for gaining increased performance from a cache is that the distribution of memory locations accessed by a typical application has statistically significant biases -- i.e., locality of reference that can be exploited with a smaller, faster memory.

While we can hypothesize what these biases will be, the only way to determine them with any accuracy is to measure them. As in any empirical measurement methodology, one must take care to construct the experiment so that the method itself does not bias the measurements. This means that the sample must accurately reflect the population being measured, that the measurements must gauge the correct aspects of the system, and that the measurements must not affect the behavior of the system.

Define a valid sample.

It is very difficult to define a population space of programs. While a theoretician may be able to define an enumeration for all possible programs, we are interested only in those that are useful in some sense. Depending on our goals, we may further restrict the population to some subset that are specific to certain application domains, or even specific applications.

Most benchmark suites of complete programs that are used today were developed on the basis of what somebody had available, which is clearly an inadequate sampling strategy. The Perfect Club benchmark is probably one of the more carefully considered suites. Manufacturers often have their own internal suites of applications that are more representative than these public benchmarks.

Other benchmarks are kernels of code that somebody judged to be representative of what was used frequently in an application domain (e.g. Livermore Loops). While useful for testing specific aspects of performance, these codes do not provide a valid statistical sample of the behavior of real programs whose access behavior is much more complex.

Measure the correct information.

In order to simplify the measurement process and to make the experiment repeatable, it is attractive to measure the behavior of an application in isolation. That is, the effect of the operating system and of other concurrent processes is factored out. This is usually done by running the application on an instruction set architecture simulation of a bare machine, and turning off measurement for system call simulations. In most cases, supervisor mode instructions and memory space are not even simulated. Instead, the simulator provides operations that emulate the specific system calls that are encountered in the benchmark code, and no other elements of the operating system are simulated. Such simulations are also limited to static binaries, and preclude research on the behavior of dynamically linked code, or runtime code recompilation.

Unfortunately, the complex mix of behavior encountered in a loaded system is more representative of the population we are trying to sample. In effect, even though the population of applications may be carefully selected to be representative, the actual population to be sampled is the dynamic combinations of executions, rather than some set of static codes (unless, of course, the goal is to provide optimum performance in a uniprocessing mode).

It can be very difficult to simulate a complete system that is subject both to multiple configurations and asynchronous events. In particular it is difficult to provide such a simulation in a manner that has the repeatability necessary for scientific research.

Many of the early RISC designs were based on simplified simulation environments, and their performance turned out to match predictions only very roughly because of this. Modern processors, with their many pipelines, functional units, branch predictors, memory hierarchy, and so on are even more difficult to simulate accurately.

Another aspect of measuring the correct information is to be sure that the measurements are not being overwhelmed by other factors (such as a bias in a compiler).

Once the systematic biases and sources of experimental error have been identified, it should be possible to determine the error range in our measurements, and to establish the statistical significance of our results. Unfortunately, this is almost never done in architectural research today.

There was an embarrassing period in the early years of cache research when it was discovered that many of the most widely cited papers had been based on samples that were too small, and too biased by cache initialization, to justify the claims that were being made. Cache researchers are now careful to run simulations with enough accesses to ensure statistical significance and a reasonable level of error. They also wait to begin sampling until after a program has passed through its initializations. Of course, ignoring the initialization period is another bias, but the argument for this approach is that the initialization effect would become insignificant if the main part of the program runs long enough. Thus, we can get similar results by ignoring initialization, and running the main part of the program for a shorter time. This technique is known as "fast forwarding."

While the use of fast forwarding is convenient, few researchers bother to quantify the error that is induced by its use. For example, if the main part of the program would not actually run for long enough to overcome the effects of initialization, it is inaccurate to say that the behavior after fast-forwarding is representative of the whole program. Even if the initialization effect would change the results by just a few percent, this could be significant if the experiments are measuring small changes in performance.

Outside of the use of larger sample sizes as pioneered in cache research, it is still common to see architecture papers that report improvements of just a few percent with respect to different schemes, while saying nothing about sources of bias or experimental error.

Don't let the measurements affect the system.

The obvious way to get around the difficulty of simulating a complete system is to measure the performance of real hardware. However, if the measurements are done in software, their overhead can bias the results to a modest degree. In this case, a hardware monitor might be needed to measure without bias.

Most modern processors now provide performance monitoring hardware. However, this hardware is usually designed only for engineering analysis by the designers of the processor. Accessing it requires special system calls that are usually unsupported by the manufacturer. Each model of the processor may have different configurations of the hardware, making code that relies on them non-portable. In some cases, the hardware doesn't even work (since it isn't part of the public ISA, the processor can go into production without having the circuits work). Even when it does work, it may not be completely accurate. For example, one processor doesn't account correctly for transitions between user and supervisor state, and thus, during intensive floating point processing it appears that the operating system is making some use of the floating point unit, when it contains no floating point instructions at all.

Of course, hardware is also inflexible, so it is difficult to experiment with new designs in this manner. Software simulation provides that flexibility, and can be made immune to measurement biases. But it is also quite costly in terms of experimental runtime. Because one instruction can affect so many parts of the simulated processor, it can take from tens of instructions to several hundred instructions to simulate execution of each one. Simulations also tend to have poor data memory locality, and thus they run even slower. Given that achieving statistical significance can require simulation of billions of instructions to obtain enough load and store operations, it can take hours to days of simulation time to run one experiment. We thus need to find ways to reduce this time, or to make better use of the data that we can obtain from it.

Trace Generation

In designing a cache, the measurements of interest are the pattern of memory accesses, and especially those that generate misses. Thus, we can record a trace of just the memory accesses from a simulation of a full execution. The trace can then be reused in new experiments, saving the time to run a full simulation again. There are, of course, limits to the reusability of a trace as we shall see below.

There are two issues that we must consider in generating memory access traces. One is the initialization, or cache warm-up, period; the other is obtaining a statistically significant number of references.

The warm-up transient results from the fact that when a cache starts out empty, a larger proportion of the references generate misses. This can be factored out by noting which loads are replacing cache lines that were initially empty and not counting these as true misses.
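The bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch for a direct mapped cache; the function name and the choice to represent the trace as a list of byte addresses are assumptions made for the example.

```python
def simulate_direct_mapped(trace, num_lines, line_size):
    """Return (cold_fill_misses, true_misses, hits) for a direct mapped cache.

    A miss that fills a line which has never held data is counted as a
    warm-up (cold fill) miss rather than as a true miss.
    """
    tags = [None] * num_lines          # None marks a line that is still empty
    cold = true = hits = 0
    for addr in trace:
        line_addr = addr // line_size  # which memory line is referenced
        index = line_addr % num_lines  # which cache line it maps to
        if tags[index] == line_addr:
            hits += 1
        elif tags[index] is None:
            cold += 1                  # filling an initially empty line
            tags[index] = line_addr
        else:
            true += 1                  # replacing a previously filled line
            tags[index] = line_addr
    return cold, true, hits


# Hypothetical five-reference trace: 4 lines of 16 bytes each.
print(simulate_direct_mapped([0, 16, 0, 64, 0], num_lines=4, line_size=16))
```

Only the second count would be reported as misses once the warm-up transient is factored out.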

Another approach is to warm up the cache with some initial set of references prior to starting the measurements, but this may be subject to biases in the pattern of access in the code. For example, the start of the trace may be where the code itself is doing initializations that are not representative of its steady-state behavior. Fast forwarding may thus need to advance through the program's initialization code and run enough of the main computation to refill the cache with non-initialization values.

It is natural to think that once a cache has been filled and begins to replace lines that were previously filled, then it has passed the warm-up transient and the pattern of misses can be recorded and analyzed. But if the initial filling is entirely the initialization of large numerical arrays to zero, then the cache may not really be warmed up.

In a set-associative cache, each of the sets can suffer an independent warm-up transient, so the reloading must be monitored for each one. For example, in a 4K-line 4-way set-associative cache, there are 1K (1024) sets, each of which must suffer 4 misses before it is filled.

Of course, a fully associative cache must have been completely filled to be fully warmed, just as is the case for a direct mapped cache. However, the fully associative cache will fill with K misses, where K is the number of lines in the cache, whereas a direct mapped cache could take much longer to fill.

To appreciate the significance of this initialization, we must keep in mind that the number of misses per reference may be a very small fraction. For example, in a 100,000 reference trace, we may generate only 600 misses [Smith and Goodman, 1985]. If the cache has 512 lines, then in the best case we have warmed it up with 512 of the 600 misses and we have 88 misses to analyze -- hardly a statistically significant number.

In order to attain statistical significance, we must use far longer traces. To achieve a high enough level of confidence may require 100 misses per set after initialization. For a 4-way cache with 16-byte lines, Stone has calculated the following minimum trace lengths to reach this number of misses with a 1 percent miss ratio:

Cache size   Lines   Sets   Minimum trace length
32 KB        2K      512    5 M references
128 KB       8K      2K     40 M references
512 KB       32K     8K     320 M references
2 MB         128K    32K    2.56 B references
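The arithmetic behind such a table can be sketched as sets x misses per set x references per miss. Note that the tabulated lengths grow by a factor of 8 for each factor of 4 in cache size, which a fixed 1 percent miss ratio would not produce. The sketch below therefore assumes (this is an inference from the numbers, not something stated here) that the miss ratio starts at 1 percent for 32 KB and halves each time the cache size quadruples. The function name is hypothetical.

```python
def min_trace_length(cache_bytes, refs_per_miss, line_bytes=16, ways=4,
                     misses_per_set=100):
    """Return (sets, references needed for misses_per_set misses in every set)."""
    sets = cache_bytes // (line_bytes * ways)
    return sets, sets * misses_per_set * refs_per_miss


KB = 1024
refs_per_miss = 100  # 1 percent miss ratio at 32 KB
for size in (32 * KB, 128 * KB, 512 * KB, 2048 * KB):
    sets, refs = min_trace_length(size, refs_per_miss)
    print(f"{size // KB:5d} KB  {sets:6d} sets  {refs / 1e6:10.2f} M references")
    refs_per_miss *= 2  # ASSUMPTION: miss ratio halves per 4x cache size
```

Under that assumption the computed lengths come out close to the rounded figures in the table.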

It turns out that much of the early work on RISC architectures used traces that were far too small even to have warmed up the cache. On the other hand, the cost of generating traces of the appropriate length is high.

Remember that a RISC processor operates mostly out of its registers, so a simulation is likely to have to execute 10 instructions for every reference, and probably has a dilation factor of more than 100 per instruction. Thus, on a 1000 MIPS system, about an hour of CPU time would be required to generate the 2.56B reference trace, not to mention about 20 GB of disk storage.

Each time a change to a cache is made, the reference trace is passed through it, saving the time for simulation of the individual instructions. But it will still take over fifteen minutes just to read the data off of the disk at maximum throughput (20 MB/s), and thus more likely several hours to process the trace.
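As a rough check of these estimates, the figures can be multiplied out directly. The sketch below uses the numbers quoted above; the 8 bytes per trace record is an assumption inferred from the ratio of roughly 20 GB to 2.56 B references, and the function name is hypothetical.

```python
def trace_costs(references=2_560_000_000, instr_per_ref=10, dilation=100,
                host_ips=1_000_000_000, bytes_per_ref=8,
                disk_bytes_per_s=20_000_000):
    """Return (simulation seconds, trace bytes, seconds to read the trace)."""
    # Host instructions needed = references * simulated instructions per
    # reference * host instructions per simulated instruction.
    sim_seconds = references * instr_per_ref * dilation / host_ips
    trace_bytes = references * bytes_per_ref   # ASSUMPTION: 8-byte records
    read_seconds = trace_bytes / disk_bytes_per_s
    return sim_seconds, trace_bytes, read_seconds


sim_s, size, read_s = trace_costs()
print(f"simulate: {sim_s / 60:.0f} min, trace: {size / 1e9:.1f} GB, "
      f"read: {read_s / 60:.0f} min")
```

With these inputs the simulation takes 2560 seconds (somewhat under an hour) and reading the trace takes 1024 seconds (a little over fifteen minutes), consistent with the rounded figures in the text.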

It is important to understand that traces do not preserve timing information. We can only determine hit/miss ratios using traces. They do not generate cycle-accurate performance information. To see why this is so, consider that a modern cache can allow a certain number of misses to be resolved in parallel. If the misses come in rapid succession, then the limit for this parallelism may be reached and the load/store unit stalls. But when the misses are separated by enough non-memory access operations, then all of the misses can be issued without waiting, so that they respond sooner. However, our trace does not keep track of the time between references, so we cannot accurately determine the execution time.

Trace Compaction (Stripping)

We've now seen how to amortize the cost of the full simulation across multiple experiments by saving a memory reference trace. Even so, running through the trace is still expensive. There clearly needs to be a way to reduce the cost of processing the traces.

One way is to perform multiple analyses per pass. For example, if we are simulating a set-associative cache of a particular degree with an LRU replacement policy, then it turns out that it is easy to simulate set-associative caches of lower degree at the same time. This is called the inclusion principle: under an LRU replacement policy, a simulation of a K-way associative cache also yields the misses of every K'-way associative cache (K' < K) with the same number of sets and the same line size (this does not hold for random replacement).
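One way to see why a single pass suffices is to keep, for each set, a list of lines ordered from most to least recently used. A reference found at depth d in that list hits in every LRU cache with more than d ways and misses in the rest. The sketch below illustrates this idea only; the function name and the trace representation (a list of byte addresses) are assumptions.

```python
def misses_by_associativity(trace, num_sets, line_size, max_ways):
    """Return {ways: miss count} for ways = 1..max_ways in one pass (LRU)."""
    stacks = [[] for _ in range(num_sets)]   # per-set LRU list, MRU first
    misses = {k: 0 for k in range(1, max_ways + 1)}
    for addr in trace:
        line = addr // line_size
        stack = stacks[line % num_sets]
        if line in stack:
            depth = stack.index(line)        # 0 means most recently used
            stack.remove(line)
        else:
            depth = max_ways                 # deeper than any simulated cache
        # A K-way LRU cache holds exactly the lines at depths 0..K-1.
        for k in range(1, max_ways + 1):
            if depth >= k:
                misses[k] += 1
        stack.insert(0, line)
        del stack[max_ways:]                 # keep only what the largest cache holds
    return misses


# Hypothetical trace of line numbers, one set, line size 1.
print(misses_by_associativity([0, 1, 0, 2, 1, 0], num_sets=1, line_size=1,
                              max_ways=2))
```

The same approach would not work for random replacement, because the contents of the smaller cache are then no longer guaranteed to be a subset of the larger one.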

However a more effective technique is to reduce the size of the trace itself.

If a cache with a particular total size and line size is to be simulated, then output a trace for a direct mapped cache in which all of the hit references are counted and then discarded. Because a typical hit ratio is at least 90%, that percentage of the trace references are discarded.

If this reduced trace is executed, it produces the same pattern of misses in the cache. If we know the number of references in the original trace, we can still calculate the miss ratio. Puzak showed that such a stripped trace can be used to simulate caches with any combination of size and associativity provided that

1. The line size remains the same.

2. The number of sets remains the same or increases.

Note that in the direct mapped cache the number of sets (S) equals the number of lines (L). In a K-way associative cache, the number of sets is S = L/K, so the size of the cache must increase by a factor of K to compensate if the same trace is to be used.

Essentially, all caches that meet these criteria will have the same number of misses in the reduced trace as they would in the full trace. That number may be less than the number of misses recorded, but will be no greater.
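The stripping idea can be tried out on a synthetic trace. The sketch below is illustrative only: the function names, the uniformly random trace, and the assumption of LRU replacement with simple modulo indexing are choices made for the example. It strips a trace through a direct mapped filter and then checks that caches with the same line size and at least as many sets see the same number of misses on the full and the stripped trace.

```python
import random


def strip_trace(trace, num_sets, line_size):
    """Keep only the references that miss in a direct mapped cache."""
    tags = [None] * num_sets
    kept = []
    for addr in trace:
        line = addr // line_size
        index = line % num_sets
        if tags[index] != line:       # a miss: record it and fill the line
            tags[index] = line
            kept.append(addr)
    return kept, len(trace)           # original length is needed for miss ratios


def lru_misses(trace, num_sets, ways, line_size):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [[] for _ in range(num_sets)]   # per-set list, MRU first
    misses = 0
    for addr in trace:
        line = addr // line_size
        s = sets[line % num_sets]
        if line in s:
            s.remove(line)
        else:
            misses += 1
            if len(s) == ways:
                s.pop()                    # evict the least recently used line
        s.insert(0, line)
    return misses


rng = random.Random(0)                     # synthetic trace, for illustration only
full = [rng.randrange(4096) for _ in range(20000)]
stripped, total_refs = strip_trace(full, num_sets=64, line_size=16)
for num_sets, ways in [(64, 1), (64, 2), (128, 4)]:
    assert lru_misses(full, num_sets, ways, 16) == lru_misses(stripped, num_sets, ways, 16)
print(len(stripped), "of", total_refs, "references kept")
```

The miss ratio for any of these caches is then the miss count from the stripped trace divided by total_refs.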

Changing the line size, however, disrupts the basic pattern of misses. One can simply run the trace simulation for a variety of line sizes to generate a set of traces. Wang and Baer noted that many of these misses are duplicated between the traces. Thus, by creating a trace that is the union of these traces, they generate a single trace that is about 50% larger but is applicable for five different line sizes.

Set Sampling

If the patterns of reference are reasonably regular, as they often are with instruction and array references, then the misses between sets are highly correlated. Statistical techniques can be used to determine the number of sets required to achieve a particular accuracy at a given confidence level for a given trace. Often this turns out to be roughly 10% of the sets in the cache.

When sampling and stripping are combined, a factor of 100 reduction in the trace size can be obtained. Of course, if there is little correlation between the sets (as might be expected with list-oriented codes), then sampling has a much smaller effect. Yet even by itself, compaction could reduce the time for our example to less than an hour, which is still useful.
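A minimal sketch of set sampling follows, assuming LRU replacement and a hypothetical function name. Only references that map to the sampled sets are simulated, and the miss ratio is estimated from those alone. The example trace is a deliberately regular array sweep, which is the favorable case described above; irregular traces would need the statistical treatment to choose the sample size.

```python
def sampled_miss_ratio(trace, num_sets, ways, line_size, sampled_sets):
    """Estimate the miss ratio by simulating only the sets in sampled_sets."""
    contents = {s: [] for s in sampled_sets}   # per-set LRU list, MRU first
    refs = misses = 0
    for addr in trace:
        line = addr // line_size
        index = line % num_sets
        if index not in contents:
            continue                           # reference to an unsampled set
        refs += 1
        s = contents[index]
        if line in s:
            s.remove(line)
        else:
            misses += 1
            if len(s) == ways:
                s.pop()
        s.insert(0, line)
    return misses / refs if refs else 0.0


# Two sweeps over an 8 KB array of 4-byte words; cache is 64 sets x 2 ways x 16 B.
sweep = [addr for _ in range(2) for addr in range(0, 8192, 4)]
full_ratio = sampled_miss_ratio(sweep, 64, 2, 16, range(64))
sample_ratio = sampled_miss_ratio(sweep, 64, 2, 16, range(0, 64, 10))
print(full_ratio, sample_ratio)
```

Here the sample covers 7 of the 64 sets, so roughly a tenth of the references are simulated.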

Types of Misses

There are three broad classes of misses that are recognized in the architecture community.

Compulsory or first-reference -- a miss on the first access to a value, which cannot yet be in the cache.

Capacity -- a miss in which a value would still be present if it had not been previously evicted due to insufficient cache size.

Conflict -- a miss due to a value having been evicted by another with the same mapping, regardless of whether the cache was full. Can also be thought of as associativity capacity.
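In simulation, one common way to separate the three classes is to run a fully associative LRU cache of the same total size alongside the cache being studied. The sketch below follows that convention, which is an assumption of the example and not something prescribed above: a first reference is compulsory, a miss that the fully associative cache would also suffer is capacity, and a miss that the fully associative cache would have avoided is conflict. The function name is hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, num_sets, ways, line_size):
    """Count compulsory, capacity, and conflict misses for an LRU cache."""
    total_lines = num_sets * ways
    seen = set()                       # lines referenced at least once
    fully = OrderedDict()              # fully associative LRU reference cache
    sets = [[] for _ in range(num_sets)]
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0, "hits": 0}
    for addr in trace:
        line = addr // line_size
        # Update the fully associative cache of the same total size.
        fa_hit = line in fully
        if fa_hit:
            fully.move_to_end(line)
        else:
            if len(fully) == total_lines:
                fully.popitem(last=False)   # evict least recently used
            fully[line] = True
        # Update the cache under study.
        s = sets[line % num_sets]
        hit = line in s
        if hit:
            s.remove(line)
        elif len(s) == ways:
            s.pop()
        s.insert(0, line)
        # Classify the reference.
        if hit:
            counts["hits"] += 1
        elif line not in seen:
            counts["compulsory"] += 1
        elif fa_hit:
            counts["conflict"] += 1
        else:
            counts["capacity"] += 1
        seen.add(line)
    return counts


# Hypothetical trace of line numbers for a 2-line direct mapped cache.
print(classify_misses([0, 2, 0, 1, 3, 2], num_sets=2, ways=1, line_size=1))
```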

As the size of a cache grows, compulsory misses make up a growing proportion of all misses because capacity misses diminish. The number of compulsory misses itself, however, depends only on the code and the organization of the cache (length of lines). Smaller lines result in more compulsory misses. Larger lines result in fewer compulsory misses, but of course we can't arbitrarily increase the size of lines because the cost of a miss increases with line size, and the total number of misses eventually starts to increase as conflict misses grow.

Conflict misses steadily decrease with increased cache size, but their relative percentages stay about the same because the capacity misses are decreasing at a rate that roughly matches. Conflict misses decrease significantly with an increase in associativity for a given cache size. They are sensitive to placement of code and data in memory. As noted previously, it is costly in terms of both speed and real estate to excessively increase associativity.

Capacity misses decrease monotonically with increased cache size. If associativity is increased, then the proportion of all misses due to capacity increases because the number of conflict misses decreases.

This classification for misses was extended by Temam and McKinley to consider whether they resulted from intra-loop-nest conflicts, or inter-nest conflicts, as a way of relating the miss rates to specific loop behaviors. They showed that, for example, a significant fraction of conflict misses are due to optimizations that consider a single loop nest, and thus evict data from the cache that could be reused in a following loop nest. By being somewhat less aggressive in optimizing the first loop's reuse, the miss rate for the combined loops could be reduced.

What this shows is that simple statistics that are gathered for complete execution of a program can mask locally significant features in the data. Conversely, trying to optimize those statistics too locally can also lead to lower overall levels of optimization.

Advanced Cache Organizations

Shadow Cache

Cache behavior can sometimes be divided into two categories: Transient loads and repeated loads. A transient load is one in which a value is loaded, operated upon, and then never referenced again. For example, many initialization references are transient. A repeated load is one that occurs many times. The hope is that repeated loads will always be hits, but sometimes a burst of transient activity can force many of the repeatedly used values out of cache, so that they must again be reloaded.

Although the effect of transients is modest, it is desirable to avoid the effect if possible. Pomerene proposed the use of a shadow cache that stores the address information for any value that is replaced in cache. Old shadow values get overwritten, usually with an LRU policy within some set size. Thus, the shadow cache remembers the most recently replaced addresses.

When a replacement is to occur in the main cache, the replacement policy can include a step that checks the shadow cache and gives priority to replacing values that are not logged there. That is, if a value has been written out and loaded again, it is less likely to be written out again because it is probably a repeatedly accessed value. If a value has not been recently written out, it is more likely to be a transient.

Split Cache

As we've noted, instructions and data may have different patterns of access, and thus could benefit from being loaded into separate caches. Empirical studies have shown that this has a reasonable benefit when caches are large, but that performance may actually decrease for small caches.

The reason is that, when the cache is small, splitting it in half may increase the miss ratio for data or instructions disproportionately because each has a smaller cache to work with. By keeping the two combined, there is greater flexibility in how the cache is allocated between the two.

Once the caches become large, however, the decrease in size due to splitting has a smaller effect, and the improvement in the instruction hit ratio especially can increase performance.

In addition, split caches enable the use of simultaneous instruction and data fetch by the CPU. If the caches are exposed to the outside world (i.e. do not appear unified to lower levels of the hierarchy), then there is also the option to have a secondary cache dedicated to data.

Write Buffer

In some processors, if a miss is followed by an independent write, the write is allowed to proceed. However, if the cache is busy loading the missing line, the write may be forced to wait. In that case, a write buffer may be used to temporarily hold the write until the cache is free.

The buffer must contain logic to know when it is full, know when the cache is free, and be able to initiate a cache write followed by clearing its full status. If the buffer can contain multiple values in a queue, then it is even more complex.

The cache controller must also contain logic to recognize when a read is attempting to access a value that is in the write buffer and fetch from there instead of the cache.

In fact, to minimize the control logic for the most common case, it may be that all writes pass through the write buffer on the way to cache.

The write buffer effectively acts to smooth the irregular frequency of writes into a more regular rate. It should be noted that there are usually fewer writes than reads because most of the results of processing are stored in registers, and many of those results are temporary, having no requirement to be saved in memory. Before the advent of large sets of general purpose registers, the statistical split of reads vs. writes was not quite as large, but was still significant. In RISC processors, the figures today are about 26% of instructions being reads (loads) vs. 9% being writes (stores).

Write Miss Policies

When a write misses in the cache, there are two possible actions. We can fetch the missing line and then complete the write, or we can bypass the cache and store directly to a lower level memory. The former is called write-allocate because it requires us to allocate space in the cache as part of the write. The latter is called write no-allocate because it doesn't allocate any space in the cache.

These two policies often correspond to write back and write through, respectively. In a write-back cache, we try to keep the current copy closest to the processor, so write-allocate is a logical extension. Likewise, with write-through, we store the data all the way out to the level in common with all consumers of data, and so bypassing the cache is consistent.

Victim Buffer

Many of the conflict misses seen in direct-mapped caches are due to short term thrashing between a small set of aligned values. Introducing a small, fully associative cache (<= 5 entries) that holds recently evicted values, and which can quickly refill a line in the direct mapped cache, can greatly reduce the thrashing-induced conflict misses.

The victim cache can be useful even in set-associative caches as a means to quickly recover from transient cache evictions such as exception handlers and interrupts. Such a small fully associative cache is not particularly expensive to implement.

The effect of the victim buffer is thus to provide a small pool of extra lines that are used to temporarily increase the associativity of individual sets in response to transient spikes in capacity misses.
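The effect can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the victim buffer is modeled as a short LRU list, a hit in it is treated as a swap with the line being evicted, and only accesses that go all the way to memory are counted. The function name is hypothetical.

```python
def misses_with_victim_buffer(trace, num_lines, line_size, victim_entries):
    """Count memory fetches for a direct mapped cache plus a victim buffer."""
    tags = [None] * num_lines
    victims = []                        # recently evicted lines, MRU first
    misses = 0
    for addr in trace:
        line = addr // line_size
        index = line % num_lines
        if tags[index] == line:
            continue                    # hit in the main cache
        evicted = tags[index]
        if line in victims:
            victims.remove(line)        # refill quickly from the victim buffer
        else:
            misses += 1                 # must fetch from the next level
        tags[index] = line
        if evicted is not None:
            victims.insert(0, evicted)  # the displaced line becomes a victim
            del victims[victim_entries:]
    return misses


# Two lines that map to the same entry of a 4-line cache, accessed alternately.
thrash = [0, 4] * 10
print(misses_with_victim_buffer(thrash, 4, 1, victim_entries=0))
print(misses_with_victim_buffer(thrash, 4, 1, victim_entries=1))
```

With no victim buffer every reference in this thrashing pattern goes to memory; with a single entry only the first two do.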

Pseudo-Associative Caches

An alternative to associative retrieval in many instances is hashing. The direct mapped cache uses a simple address-based hash, and does not rehash if a miss is detected. In a pseudo-associative cache, the hashing scheme is extended with a single level of rehashing (usually with a simple bit-flip) on a miss. The advantage, of course, is that the same effect as a two-way associative cache is obtained without having to add a comparator and the wider data paths out of the cache circuitry block. Multiple levels could be used, but the benefit diminishes quickly as the time to read and check the tags adds up. Furthermore, having a variable-length miss time greatly complicates the timing of the processor, so at least the first level cache is unlikely to use this technique.
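A minimal sketch of the lookup sequence follows. Several details are assumptions made for the example and are not specified above: the rehash flips the top index bit, a hit in the alternate location swaps the two lines so the next access is fast, and on a full miss the displaced line is moved to the alternate location. The number of lines is assumed to be a power of two, and the function name is hypothetical.

```python
def pseudo_associative(trace, num_lines, line_size):
    """Return (fast_hits, slow_hits, misses) for a one-rehash cache."""
    flip = num_lines >> 1               # flipping this bit gives the alternate slot
    tags = [None] * num_lines           # full line address stored as the tag
    fast = slow = misses = 0
    for addr in trace:
        line = addr // line_size
        primary = line % num_lines
        alternate = primary ^ flip
        if tags[primary] == line:
            fast += 1                   # found on the first probe
        elif tags[alternate] == line:
            slow += 1                   # found on the rehash probe
            tags[primary], tags[alternate] = tags[alternate], tags[primary]
        else:
            misses += 1
            tags[alternate] = tags[primary]   # ASSUMPTION: demote the old line
            tags[primary] = line
    return fast, slow, misses


# Two conflicting lines in a 4-line cache: after two misses, all slow hits.
print(pseudo_associative([0, 4] * 3, num_lines=4, line_size=1))
```

The split between fast and slow hits is what produces the variable hit time mentioned above.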

Prefetch (Hardware or Software)

While increasing the line size has a negative effect on conflict misses, it does reduce compulsory misses. One way around this tradeoff is to provide a prefetch buffer that holds an additional line beyond the one most recently fetched. If the pattern of access then needs the next line, there is no miss penalty and another prefetch can be started. Although there are still compulsory misses in that the data must still enter the cache for the first time, they do not cause a penalty. Unlike the increased line size, however, the prefetching of data does not evict existing cached data unless it is necessary.

It is also possible to include instructions that specify a prefetch. The compiler can then insert these instructions in advance of a change in working set in order to cause the data to be precached. However, if the prefetch operation requires a virtual memory translation, it can be quite costly, especially if the use of the prefetched data is branch dependent. It may be desirable to ignore a prefetch, and so the instruction set architecture may not guarantee that a prefetch will always occur.

Non-Blocking Cache

In a superscalar processor, it is possible that different pipelines will simultaneously need to load data. If one of them misses, then the others would normally be forced to wait. However, it is possible to let them proceed while the miss is being serviced. As long as the pipeline control maintains sequential semantics, then this isn't a problem (although the other pipelines might eventually stall, if they are dependent on the missing data).

It is even possible to support multiple misses at once by pipelining them and carefully retiring them as the data is loaded. However, this can be complicated by virtual memory faults and exceptions.

Miss Status Holding Registers

In a non-blocking cache, we typically have support for a limited number of in-flight misses. Once this number of misses is pending, no more loads can be issued. However, misses tend to come in groups that are to a common line. Consider that a loop may miss on reading from the first word of a line. Its next access is to the second word on the line, and also misses. Thus, one actual miss could use up all of the slots in the non-blocking logic.

An MSHR is used to keep track of the pending misses. If another miss is detected that is going to be serviced by an already pending miss, then the MSHR logs this information and causes the miss to respond to all of the loads that are waiting for it. Thus, the non-blocking logic need only handle truly independent misses, and the pipelines are allowed to run further ahead while awaiting data from memory.
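The merging behavior can be sketched as a small table keyed by line address. The class and method names below are hypothetical, and real MSHRs track more state (destination registers, word offsets, and so on); this only illustrates how secondary misses to a pending line avoid using up another slot.

```python
class MSHRFile:
    """Tracks pending misses, merging loads that wait on the same line."""

    def __init__(self, entries, line_size):
        self.entries = entries
        self.line_size = line_size
        self.pending = {}               # line address -> list of waiting load ids

    def miss(self, load_id, addr):
        line = addr // self.line_size
        if line in self.pending:
            self.pending[line].append(load_id)
            return "merged"             # serviced by an already pending miss
        if len(self.pending) == self.entries:
            return "stall"              # no free slot for an independent miss
        self.pending[line] = [load_id]
        return "allocated"

    def fill(self, line):
        """The line has arrived: return every load that was waiting for it."""
        return self.pending.pop(line, [])


mshrs = MSHRFile(entries=2, line_size=16)
results = [mshrs.miss(1, 0), mshrs.miss(2, 4), mshrs.miss(3, 16), mshrs.miss(4, 32)]
woken = mshrs.fill(0)
print(results, woken)
```

In this example the second load merges with the first because both fall in line 0, so two slots are enough for three of the four loads.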

Tags On Data Off

When reading data from a cache, the step that must happen first is the tag check. Thus, the actual data read can be slightly delayed. This opens up the possibility of keeping the tags on the processor chip (for faster access) and the data off the chip (for a larger cache than can fit on the chip). While this scheme works, it has the disadvantage of constraining the system-level design options. For example, if a chip has on-board tags to support 16K lines of level 3 cache, then the system designer cannot easily make the L3 cache larger than this. Given that most processors now have two levels of cache entirely on the chip, the frequency of access to L3 is small, and doesn't justify the added on-chip hardware.


Physical vs. Logical (Virtual) Cache

Caches fall in the memory hierarchy between the CPU and the main memory. Unfortunately, so does the memory management unit that handles the virtual memory mapping. This forms a relationship that can complicate the design of both. We must choose where the MMU falls in this hierarchy, i.e., above or below the cache. If it is above the cache, then the cache is said to be physical, because it sees physical memory address references coming from the CPU+MMU. If the MMU is below the cache, then it is said to be virtual or logical, as it sees these types of addresses coming from the CPU. And if there are multiple levels of cache, then it is possible for the MMU to appear between them, so we might have a logical primary cache and a physical secondary cache (as in the MIPS R4400).

Because a logical cache precedes the delay-inducing MMU, it can operate with lower latency. However, it also must deal with what is known as the synonym problem. Due to the nature of virtual memory, it is possible for a single physical address to be assigned multiple virtual addresses. When the CPU writes to the logical cache, it writes to just one of these addresses, and that write may propagate back to the physical address in the main memory. If the CPU then reads from a different logical address assigned to that same location, it may fetch an obsolete cached copy.
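The synonym problem can be reproduced with a toy model. Everything in this sketch is hypothetical: the cache is a dictionary keyed by virtual address, writes go through to memory, and the page table maps two virtual pages onto one physical page.

```python
PAGE = 4096


class VirtualCacheDemo:
    """Toy write-through cache indexed and tagged by virtual address."""

    def __init__(self, page_table):
        self.page_table = page_table    # virtual page number -> physical page number
        self.memory = {}                # physical address -> value
        self.cache = {}                 # virtual address -> cached value

    def translate(self, vaddr):
        return self.page_table[vaddr // PAGE] * PAGE + vaddr % PAGE

    def write(self, vaddr, value):
        self.cache[vaddr] = value                    # update only this alias
        self.memory[self.translate(vaddr)] = value   # and propagate to memory

    def read(self, vaddr):
        if vaddr not in self.cache:
            self.cache[vaddr] = self.memory.get(self.translate(vaddr), 0)
        return self.cache[vaddr]


demo = VirtualCacheDemo({0: 7, 5: 7})   # virtual pages 0 and 5 are synonyms
a, b = 0 * PAGE + 8, 5 * PAGE + 8       # two virtual names for one location
demo.write(a, 1)
first = demo.read(b)                    # loads the value into the cache under b
demo.write(a, 2)                        # memory is updated, the copy under b is not
stale = demo.read(b)
current = demo.memory[demo.translate(b)]
print(first, stale, current)
```

The second read through the alias returns the old cached value even though main memory already holds the new one.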

In the R4400 the secondary cache is physically addressed and has tags that note when a location is shared. A miss that tries to read a shared location into a different logical address causes the secondary cache to emulate a snooping operation by a second processor so that the primary cache's earlier copy of the value is invalidated. Thus, the logical cache can have just one copy of the shared variable active at a time, but it remains in the secondary cache so it is only moderately costly to switch between copies frequently.

Case Study Processors

A dash marks an entry that is blank in the table.

Processor  Blocking  Split?  Data  Instruct.  Associativity  Line size   Data Write      Replacement
PPC 601    1         No      32K   Unified    8              64 B        Back (through)  -
PPC 603    0         Yes     8K    8K         2              32 B        Back (through)  -
PPC 604    4         Yes     16K   16K        4              32 B        Back (through)  LRU
PPC 620    -         Yes     32K   32K        -              -           Back (through)  -
SPARC      -         Yes     16K   16K        D: 1, I: 2     -           -               -
R10000     -         Yes     32K   32K        2              -           -               -
R4400      -         Yes     16K   16K        1              16 or 32 B  -               -
Pentium    -         Yes     8K    8K         2              32 B        Back (through)  -
P6         -         Yes     8K    8K         D: 4, I: 2     -           Back (through)  -

Notes:

PPC 620, R10000: I-cache predecodes instr.

P6: 245KB 4-way L2 cache