The entire basis for gaining increased performance
from a cache is that the distribution of memory values accessed by a typical
application has statistically significant biases -- i.e. locality of reference
that can be exploited with a smaller, faster memory.
While we can hypothesize what these biases will be,
the only way to determine them with any accuracy is to measure them. As in any
empirical measurement methodology, one must take care to construct the
experiment so that the method itself does not bias the measurements. This means
that the sample must accurately reflect the population being measured, that the
measurements must gauge the correct aspects of the system, and the measurements
must not affect the behavior of the system.
Define a valid sample.
It is very difficult to define a population space of
programs. While a theoretician may be able to define an enumeration for all
possible programs, we are interested only in those that are useful in some
sense. Depending on our goals, we may further restrict the population to some
subset that are specific to certain application domains, or even specific
applications.
Most benchmark suites of complete programs that are
used today were developed on the basis of what somebody had available, which is
clearly an inadequate sampling strategy. The Perfect Club benchmark is probably one of
the more carefully considered suites. Manufacturers often have their own
internal suites of applications that are more representative than these public
benchmarks.
Other benchmarks are kernels of code that somebody
judged to be representative of what was used frequently in an application
domain (e.g. Livermore Loops). While useful for testing specific aspects of
performance, these codes do not provide a valid statistical sample of the
behavior of real programs whose access behavior is much more complex.
Measure the correct information.
In order to simplify the measurement process and to
make the experiment repeatable, it is attractive to measure the behavior of an
application in isolation. That is, the effect of the operating system and of
other concurrent processes is factored out. This is usually done by running the
application on an instruction set architecture simulation of a bare machine,
and turning off measurement for system call simulations. In most cases,
supervisor mode instructions and memory space are not even simulated. Instead,
the simulator provides operations that emulate the specific system calls that
are encountered in the benchmark code, and no other elements of the operating
system are simulated. Such simulations are also limited to static binaries, and
preclude research on the behavior of dynamically linked code, or runtime code
recompilation.
Unfortunately, the complex mix of behavior encountered
in a loaded system is more representative of the population we are trying to
sample. In effect, even though the population of applications may be carefully
selected to be representative, the actual population to be sampled is the
dynamic combinations of executions, rather than some set of static codes
(unless, of course, the goal is to provide optimum performance in a
uniprocessing mode).
It can be very difficult to simulate a complete system
that is subject both to multiple configurations and asynchronous events. In
particular it is difficult to provide such a simulation in a manner that has
the repeatability necessary for scientific research.
Many of the early RISC designs were based on
simplified simulation environments, and their performance turned out to match
predictions only very roughly because of this. Modern processors, with their
many pipelines, functional units, branch predictors, memory hierarchy, and so
on are even more difficult to simulate accurately.
Another aspect of measuring the correct information is
to be sure that the measurements are not being overwhelmed by other factors
(such as a bias in a compiler).
Once the systematic biases and sources of experimental
error have been identified, it should be possible to determine the error range
in our measurements, and to establish the statistical significance of our
results. Unfortunately, this is almost never done in architectural research
today.
There was an embarrassing period in the early years of
cache research when it was discovered that many of the most widely cited papers
had been based on samples that were too small, and too biased by cache initialization,
to justify the claims that were being made. Cache researchers are now careful
to run simulations with enough accesses to ensure statistical significance and
a reasonable level of error. They also wait to begin sampling until after a
program has passed through its initializations. Of course, ignoring the
initialization period is another bias, but the argument for this approach is
that the initialization effect would become insignificant if the main part of
the program runs long enough. Thus, we can get similar results by ignoring
initialization, and running the main part of the program for a shorter time.
This technique is known as "fast forwarding."
While the use of fast forwarding is convenient, few
researchers bother to quantify the error that is induced by its use. For
example, if the main part of the program would not actually run for long enough
to overcome the effects of initialization, it is inaccurate to say that the behavior
after fast-forwarding is representative of the whole program. Even if the
initialization effect would change the results by just a few percent, this
could be significant if the experiments are measuring small changes in
performance.
Outside of the use of larger sample sizes as pioneered
in cache research, it is still common to see architecture papers that report
improvements of just a few percent with respect to different schemes, while
saying nothing about sources of bias or experimental error.
Don't let the measurements affect the system.
The obvious way to get around the difficulty of
simulating a complete system is to measure the performance of real hardware.
However, if the measurements are done in software, their overhead can bias the
results to a modest degree. In this case, a hardware monitor might be needed to
measure without bias.
Most modern processors now provide performance monitoring
hardware. However, this hardware is usually designed only for engineering
analysis by the designers of the processor. Accessing it requires special
system calls that are usually unsupported by the manufacturer. Each model of
the processor may have different configurations of the hardware, making code
that relies on them non-portable. In some cases, the hardware doesn't even work
(since it isn't part of the public ISA, the processor can go into production
without having the circuits work). Even when it does work, it may not be completely
accurate. For example, one processor doesn't account correctly for transitions
between user and supervisor state, and thus, during intensive floating point
processing it appears that the operating system is making some use of the
floating point unit, when it contains no floating point instructions at all.
Of course, hardware is also inflexible, so it is
difficult to experiment with new designs in this manner. Software simulation
provides that flexibility, and can be made immune to measurement biases. But it
is also quite costly in terms of experimental runtime. Because one instruction
can affect so many parts of the simulated processor, it can take from tens of
instructions to several hundred instructions to simulate execution of each one.
Simulations also tend to have poor data memory locality, and thus they run even
slower. Given that achieving statistical significance can require simulation of
billions of instructions to obtain enough load and store operations, it can
take hours to days of simulation time to run one experiment. We thus need to
find ways to reduce this time, or to make better use of the data that we can
obtain from it.
Trace Generation
In designing a cache, the measurements of interest are
the pattern of memory accesses, and especially those that generate misses.
Thus, we can record a trace of just the memory accesses from a simulation of a
full execution. The trace can then be reused in new experiments, saving the
time to run a full simulation again. There are, of course, limits to the
reusability of a trace as we shall see below.
There are two issues that we must consider in
generating memory access traces. One is the initialization, or cache warm-up, period; the other is
obtaining a statistically significant number of references.
The warm-up transient results from the fact that when
a cache starts out empty, a larger proportion of the references generate
misses. This can be factored out by noting which loads are replacing cache
lines that were initially empty and not counting these as true misses.
Another approach is to warm up the cache with some initial
set of references prior to starting the measurements, but this may be subject
to biases in the pattern of access in the code. For example, the start of the
trace may be where the code itself is doing initializations that are not
representative of its steady-state behavior. Fast forwarding may thus need to
advance through the program's initialization code and run enough of the main
computation to refill the cache with non-initialization values.
It is natural to think that once a cache has been
filled and begins to replace lines that were previously filled, then it has
passed the warm-up transient and the pattern of misses can be recorded and
analyzed. But if the initial filling is entirely the initialization of large
numerical arrays to zero, then the cache may not really be warmed up.
In a set-associative cache, each of the sets can
suffer an independent warm-up transient, so the reloading must be monitored for
each one. For example, in a 4K-line, 4-way set-associative cache, there are 1K (1024)
sets, each of which must suffer 4 misses before it is filled.
Of course, a fully associative cache must have been
completely filled to be fully warmed, just as is the case for a direct mapped
cache. However, the fully associative cache will fill with K misses, where K is
the number of lines in the cache, whereas a direct mapped cache could take much longer to
fill.
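The per-set bookkeeping described above can be sketched as follows. This is only a minimal illustration: the class name WarmupCache, its counters, and the 16-byte default line size are assumptions made for the example, not part of any particular simulator. A miss that fills a way that was still empty is counted as a warm-up miss; only misses that evict a resident line are counted as true misses.

```python
from collections import OrderedDict


class WarmupCache:
    """Set-associative LRU cache that separates warm-up fills from true misses."""

    def __init__(self, num_lines, ways, line_bytes=16):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = num_lines // ways
        # One LRU-ordered dict of resident line numbers per set.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.warmup_misses = 0   # fills of a way that was still empty
        self.true_misses = 0     # misses that evicted a resident line

    def access(self, address):
        line = address // self.line_bytes
        lru = self.sets[line % self.num_sets]
        if line in lru:
            lru.move_to_end(line)      # mark as most recently used
            self.hits += 1
        elif len(lru) < self.ways:
            lru[line] = True           # set not yet full: warm-up fill
            self.warmup_misses += 1
        else:
            lru.popitem(last=False)    # evict the least recently used line
            lru[line] = True
            self.true_misses += 1

    def warm_sets(self):
        """Number of sets in which every way has been filled."""
        return sum(1 for lru in self.sets if len(lru) == self.ways)


cache = WarmupCache(num_lines=4096, ways=4)
print(cache.num_sets)  # 4096 lines / 4 ways = 1024 sets
```

Comparing warm_sets() against num_sets shows how much of the cache is still in its warm-up transient. As noted above, a full cache is not necessarily a representative one if it was filled by initialization code.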
To appreciate the significance of this initialization,
we must keep in mind that the number of misses per reference may be a very
small fraction. For example, in a 100,000 reference trace, we may generate only
600 misses [Smith and Goodman, 1985]. If the cache has 512 lines, then in the
best case we have warmed it up with 512 of the 600 misses and we have 88 misses
to analyze -- hardly a statistically significant number.
In order to attain statistical significance, we must
use far longer traces. To achieve a high enough level of confidence may require
100 misses per set after initialization. For a 4-way cache with 16-byte lines,
Stone has calculated the following minimum trace lengths to reach this number
of misses with a 1 percent miss ratio:
Cache size   Lines   Sets   Minimum trace length
32 KB        2K      512    5 M references
128 KB       8K      2K     40 M references
512 KB       32K     8K     320 M references
2 MB         128K    32K    2.56 B references
It turns out that much of the early work on RISC
architectures used traces that were far too small even to have warmed up the
cache. On the other hand, the cost of generating traces of the appropriate
length is high.
Remember that a RISC processor operates mostly out of
its registers, so a simulation is likely to have to execute 10 instructions for
every reference, and probably has a dilation factor of more than 100 per
instruction. Thus, on a 1000 MIPS system, about an hour of CPU time would be
required to generate the 2.56B reference trace, not to mention about 20 GB of
disk storage.
Each time a change to a cache is made, the reference
trace is passed through it, saving the time for simulation of the individual
instructions. But it will still take over fifteen minutes just to read the data
off of the disk at maximum throughput (20 MB/s), and thus more likely several
hours to process the trace.
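The arithmetic behind these estimates can be checked with a few lines. The figures come from the text, except for the 8 bytes per trace record, which is an assumption chosen here because it reproduces the quoted 20 GB. With a dilation factor of exactly 100 the generation time comes out a little under an hour; the text says the factor is probably more than 100.

```python
references = 2.56e9          # trace length for the 2 MB cache
instr_per_reference = 10     # a RISC processor works mostly out of registers
dilation = 100               # host instructions per simulated instruction
host_ips = 1000e6            # 1000 MIPS host
bytes_per_record = 8         # ASSUMED trace record size (not stated in the text)
disk_bytes_per_s = 20e6      # 20 MB/s maximum disk throughput

generate_s = references * instr_per_reference * dilation / host_ips
trace_bytes = references * bytes_per_record
read_s = trace_bytes / disk_bytes_per_s

print(f"generate: {generate_s / 60:.0f} minutes")
print(f"storage:  {trace_bytes / 1e9:.2f} GB")
print(f"read:     {read_s / 60:.0f} minutes")
```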
It is important to understand that traces do not
preserve timing information. We can only determine hit/miss ratios using
traces. They do not generate cycle-accurate performance information. To see why
this is so, consider that a modern cache can allow a certain number of misses
to be resolved in parallel. If the misses come in rapid succession, then the
limit for this parallelism may be reached and the load/store unit stalls. But
when the misses are separated by enough non-memory access operations, then all
of the misses can be issued without waiting, so that they respond sooner.
However, our trace does not keep track of the time between references, so we
cannot accurately determine the execution time.
Trace Compaction (Stripping)
We've now seen how to amortize the cost of the full
simulation across multiple experiments by saving a memory reference trace. Even
so, running through the trace is still expensive. There clearly needs to be a
way to reduce the cost of processing the traces.
One way is to perform multiple analyses per pass. For
example, if we are simulating a set associative cache of a particular degree
with an LRU replacement policy, then it turns out that it is easy to simulate
set associative caches of lower degree at the same time. This is called the inclusion
principle: under an LRU replacement policy, the lines held by a K-way associative
cache include all of the lines held by a K'-way associative cache (K' < K) with the
same number of sets and the same line size, so the misses of the smaller cache can be
derived in the same pass (this does not hold for random replacement).
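One way to exploit inclusion is to keep an LRU stack for each set and record the depth at which each reference is found. The sketch below is an illustration under stated assumptions: the function name and the representation of the trace as a list of line numbers are choices made for the example. A k-way cache holds the k most recently used lines of a set, so it misses exactly when the depth is k or more.

```python
def misses_by_associativity(line_trace, num_sets, max_ways):
    """One pass over the trace; returns {ways: miss count} for ways = 1..max_ways.

    Every simulated cache has the same number of sets and uses LRU replacement.
    """
    stacks = [[] for _ in range(num_sets)]     # per-set LRU stack, MRU first
    misses = {k: 0 for k in range(1, max_ways + 1)}
    for line in line_trace:
        stack = stacks[line % num_sets]
        if line in stack:
            depth = stack.index(line)          # 0 means most recently used
            stack.remove(line)
        else:
            depth = max_ways                   # not resident in any of the caches
        # A k-way cache misses exactly when the line sits at depth k or deeper.
        for k in range(1, max_ways + 1):
            if depth >= k:
                misses[k] += 1
        stack.insert(0, line)
        del stack[max_ways:]                   # keep only what the largest cache holds
    return misses


print(misses_by_associativity([0, 1, 0, 2, 0, 1], num_sets=1, max_ways=2))
# {1: 6, 2: 4}
```

In the small example, the one-line cache misses on every reference, while the two-way cache hits on the two re-references to line 0.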
However a more effective technique is to reduce the
size of the trace itself.
If a cache with a particular total size and line size
is to be simulated, then output a trace for a direct mapped cache in which all
of the hit references are counted and then discarded. Because a typical hit
ratio is at least 90%, that percentage of the trace references are discarded.
If this reduced trace is executed, it produces the
same pattern of misses in the cache. If we know the number of references in the
original trace, we can still calculate the miss ratio. Puzak showed that such a
stripped trace can be used to simulate caches with any combination of size and
associativity provided that
1. The line size remains the same.
2. The number of sets remains the same or increases.
Note that in the direct mapped cache the number of
sets (S) equals the number of lines (L). In a K-way associative cache, the
number of sets is S = L/K, so the size of the cache must increase by a factor
of K to compensate if the same trace is to be used.
Essentially, all caches that meet these criteria will
have the same number of misses in the reduced trace as they would in the full
trace. That number may be less than the number of misses recorded, but will be
no greater.
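The stripping procedure can be tried out on a synthetic trace. The sketch below is illustrative only: the function names, the random trace, and the use of line numbers with modulo indexing are assumptions for the example, and the set counts tried are all power-of-two multiples of the filter cache's set count.

```python
import random
from collections import OrderedDict


def strip_trace(line_trace, num_sets):
    """Keep only the references that miss in a direct-mapped cache."""
    resident = {}                          # set index -> line currently held
    stripped = []
    for line in line_trace:
        index = line % num_sets
        if resident.get(index) != line:    # miss: record it and fill the slot
            resident[index] = line
            stripped.append(line)
    return stripped


def count_misses(line_trace, num_sets, ways):
    """Misses of a set-associative LRU cache on a trace of line numbers."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for line in line_trace:
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)
            lru[line] = True
    return misses


# Synthetic trace: mostly a small hot region, occasionally a far reference.
rng = random.Random(0)
full = [rng.randrange(64) if rng.random() < 0.9 else rng.randrange(4096)
        for _ in range(20000)]
stripped = strip_trace(full, num_sets=16)

for num_sets, ways in [(16, 1), (16, 4), (32, 2), (64, 4)]:
    same = count_misses(full, num_sets, ways) == count_misses(stripped, num_sets, ways)
    print(num_sets, ways, same)
```

The miss ratio is then the miss count on the stripped trace divided by len(full), which is why the length of the original trace must be kept alongside the stripped one.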
Changing the line size, however, disrupts the basic
pattern of misses. One can simply run the trace simulation for a variety of
line sizes to generate a set of traces. Wang and Baer noted that many of these
misses are duplicated between the traces. Thus, by creating a trace that is the
union of these traces, they generate a single trace that is about 50% larger
but is applicable for five different line sizes.
Set Sampling
If the patterns of reference are reasonably regular,
as they often are with instruction and array references, then the misses
between sets are highly correlated. Statistical techniques can be used to
determine the number of sets required to achieve a particular accuracy at a
given confidence level for a given trace. Often this turns out to be roughly
10% of the sets in the cache.
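A minimal sketch of set sampling follows. The function name, the choice of every tenth set, and the synthetic array sweep are assumptions for illustration; a real study would choose the sample size from the desired accuracy and confidence level as described above.

```python
from collections import OrderedDict


def sampled_miss_ratio(line_trace, num_sets, ways, sampled_sets):
    """Estimate the miss ratio by simulating only the sets in sampled_sets."""
    lru_by_set = {s: OrderedDict() for s in sampled_sets}
    refs = misses = 0
    for line in line_trace:
        lru = lru_by_set.get(line % num_sets)
        if lru is None:
            continue                       # reference falls outside the sample
        refs += 1
        if line in lru:
            lru.move_to_end(line)
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)
            lru[line] = True
    return misses / refs if refs else 0.0


# A regular, array-like sweep: 4 passes over 4096 lines, 4 references per line.
trace = [(i // 4) % 4096 for i in range(4 * 4 * 4096)]
estimate = sampled_miss_ratio(trace, 256, 4, range(0, 256, 10))
exact = sampled_miss_ratio(trace, 256, 4, range(256))
print(estimate, exact)
```

For this perfectly regular trace every set behaves identically, so the sample reproduces the exact ratio. An irregular trace would show a sampling error that depends on how well the sets are correlated.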
When sampling and stripping are combined, a factor of
100 reduction in the trace size can be obtained. Of course, if there is little
correlation between the sets (as might be expected with list-oriented codes),
then sampling has a much smaller effect. Yet even by itself, compaction could
reduce the time for our example to less than an hour, which is still useful.
Types of Misses
There are three broad classes of misses that are
recognized in the architecture community.
Compulsory or first-reference -- each time a value
that isn't in the cache is accessed for the first time.
Capacity -- a miss in which a value would still be
present if it had not been previously evicted due to insufficient cache size.
Conflict -- a miss due to a value having been evicted
by another with the same mapping, regardless of whether the cache was full. Can
also be thought of as associativity capacity.
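The three classes can be separated mechanically during simulation. The sketch below uses a common convention that is an assumption here rather than something stated in these notes: a miss is compulsory if the line has never been referenced, a capacity miss if a fully associative LRU cache of the same total size would also miss, and a conflict miss otherwise.

```python
from collections import OrderedDict


def classify_misses(line_trace, num_sets, ways):
    """Return counts of compulsory, capacity and conflict misses."""
    capacity_lines = num_sets * ways
    sets = [OrderedDict() for _ in range(num_sets)]   # the cache under study
    fully_assoc = OrderedDict()                       # same size, fully associative LRU
    seen = set()                                      # lines referenced so far
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for line in line_trace:
        # Reference model: fully associative LRU cache of equal capacity.
        fa_hit = line in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(line)
        else:
            if len(fully_assoc) == capacity_lines:
                fully_assoc.popitem(last=False)
            fully_assoc[line] = True
        # Cache under study.
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)
        else:
            if line not in seen:
                counts["compulsory"] += 1
            elif fa_hit:
                counts["conflict"] += 1    # only the mapping caused this miss
            else:
                counts["capacity"] += 1    # even full associativity would miss
            if len(lru) == ways:
                lru.popitem(last=False)
            lru[line] = True
        seen.add(line)
    return counts


# Lines 0 and 2 collide in a 2-line direct-mapped cache.
print(classify_misses([0, 2, 0, 2], num_sets=2, ways=1))
# {'compulsory': 2, 'capacity': 0, 'conflict': 2}
```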
As the size of a cache grows, compulsory misses make up a growing
proportion of all misses because capacity misses diminish.
However, the number of compulsory misses depends only on the code and the
organization of the cache (length of lines). Smaller lines result in more
compulsory misses. Larger lines result in fewer compulsory misses, but of
course we can't arbitrarily increase the size of lines because the cost of a
miss increases with line size, and the total number of misses eventually starts
to increase because conflict misses grow.
Conflict misses steadily decrease with increased cache
size, but their relative percentages stay about the same because the capacity
misses are decreasing at a rate that roughly matches. Conflict misses decrease
significantly with an increase in associativity for a given cache size. They
are sensitive to placement of code and data in memory. As noted previously, it
is costly in terms of both speed and real estate to excessively increase
associativity.
Capacity misses decrease monotonically with increased
cache size. If associativity is increased, then the proportion of all misses
due to capacity increases because the number of conflict misses decreases.
This classification for misses was extended by Temam
and McKinley to consider whether they resulted from intra-loop-nest conflicts,
or inter-nest conflicts, as a way of relating the miss rates to specific loop
behaviors. They showed that, for example, a significant fraction of conflict
misses are due to optimizations that consider a single loop nest, and thus
evict data from the cache that could be reused in a following loop nest. By
being somewhat less aggressive in optimizing the first loop's reuse, the miss
rate for the combined loops could be reduced.
What this shows is that simple statistics that are
gathered for complete execution of a program can mask locally significant features
in the data. Conversely, trying to optimize those statistics too locally can
also lead to lower overall levels of optimization.
Advanced Cache Organizations
Shadow Cache
Cache behavior can sometimes be divided into two
categories: Transient loads and repeated loads. A transient load is one in
which a value is loaded, operated upon, and then never referenced again. For
example, many initialization references are transient. A repeated load is one
that occurs many times. The hope is that repeated loads will always be hits,
but sometimes a burst of transient activity can force many of the repeatedly
used values out of cache, so that they must again be reloaded.
Although the effect of transients is modest, it is
desirable to avoid the effect if possible. Pomerene proposed the use of a
shadow cache that stores the address information for any value that is replaced
in cache. Old shadow values get overwritten, usually with an LRU policy within
some set size. Thus, the shadow cache remembers the most recently replaced
addresses.
When a replacement is to occur in the main cache, the
replacement policy can include a step that checks the shadow cache and gives
priority to replacing values that are not logged there. That is, if a value has
been written out and loaded again, it is less likely to be written out again
because it is probably a repeatedly accessed value. If a value has not been
recently written out, it is more likely to be a transient.
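The replacement step can be sketched for a single set. This is a hypothetical illustration of the idea, not Pomerene's actual design: the class name, the LRU-ordered shadow directory, and the fallback to plain LRU when every resident line is logged are all assumptions.

```python
from collections import OrderedDict


class ShadowedSet:
    """One cache set with LRU order plus a shadow directory of evicted lines."""

    def __init__(self, ways, shadow_entries):
        self.ways = ways
        self.shadow_entries = shadow_entries
        self.lines = OrderedDict()    # resident lines, least recently used first
        self.shadow = OrderedDict()   # recently replaced line addresses, oldest first

    def _choose_victim(self):
        # Prefer the least recently used line that is NOT logged in the shadow:
        # it has not been evicted and reloaded recently, so it is more likely
        # to be a transient.
        for line in self.lines:
            if line not in self.shadow:
                return line
        return next(iter(self.lines))  # all logged: fall back to plain LRU

    def access(self, line):
        """Return True on a hit, False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) == self.ways:
            victim = self._choose_victim()
            del self.lines[victim]
            self.shadow[victim] = True          # remember the replaced address
            self.shadow.move_to_end(victim)
            if len(self.shadow) > self.shadow_entries:
                self.shadow.popitem(last=False)  # overwrite the oldest shadow entry
        self.lines[line] = True
        return False


s = ShadowedSet(ways=2, shadow_entries=4)
results = [s.access(line) for line in [1, 2, 3, 1, 4, 5, 1]]
print(results)
```

In this trace line 1 is the repeatedly used value and lines 3, 4 and 5 are transients. Plain LRU would evict line 1 when line 5 arrives; here the shadow entry for line 1 steers the replacement to line 4 instead, so the final access to line 1 hits.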
Split Cache
As we've noted, instructions and data may have
different patterns of access, and thus could benefit from being loaded into
separate caches. Empirical studies have shown that this has a reasonable
benefit when caches are large, but that performance may actually decrease for
small caches.
The reason is that, when the cache is small, splitting
it in half may increase the miss ratio for data or instructions
disproportionately because each has a smaller cache to work with. By keeping
the two combined, there is greater flexibility in how the cache is allocated
between the two.
Once the caches become large, however, the decrease in
size due to splitting has a smaller effect, and the improvement in the
instruction hit ratio especially can increase performance.
In addition, split caches enable the use of
simultaneous instruction and data fetch by the CPU. If the caches are exposed
to the outside world (i.e. do not appear unified to lower levels of the
hierarchy), then there is also the option to have a secondary cache dedicated
to data.
Write Buffer
In some processors, if a miss is followed by an
independent write, the write is allowed to proceed. However, if the cache is
busy loading the missing line, the write may be forced to wait. In that case, a
write buffer may be used to temporarily hold the write until the cache is free.
The buffer must contain logic to know when it is full,
know when the cache is free, and be able to initiate a cache write followed by
clearing its full status. If the buffer can contain multiple values in a queue,
then it is even more complex.
The cache controller must also contain logic to
recognize when a read is attempting to access a value that is in the write
buffer and fetch from there instead of the cache.
In fact, to minimize the control logic for the most
common case, it may be that all writes pass through the write buffer on the way
to cache.
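The buffer logic described above (full detection, draining when the cache is free, and read forwarding) can be sketched as follows. The class and method names are hypothetical, and the cache is modeled as a plain dictionary from address to value purely for illustration.

```python
from collections import deque


class WriteBuffer:
    """FIFO write buffer sitting between the processor and the cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()          # (address, value) pairs, oldest first

    def is_full(self):
        return len(self.pending) >= self.capacity

    def write(self, address, value):
        """Queue a write; returns False (the processor stalls) when full."""
        if self.is_full():
            return False
        self.pending.append((address, value))
        return True

    def read(self, address, cache):
        """Forward the newest buffered value for address, else read the cache."""
        for buffered_address, value in reversed(self.pending):
            if buffered_address == address:
                return value
        return cache.get(address)

    def drain_one(self, cache):
        """Called when the cache is free: retire the oldest write into it."""
        if self.pending:
            address, value = self.pending.popleft()
            cache[address] = value


cache = {0x10: 1}
wb = WriteBuffer(capacity=2)
wb.write(0x10, 5)
print(wb.read(0x10, cache), cache[0x10])   # buffered value 5, cache still holds 1
```

Searching from the newest entry backwards matters when the queue holds more than one write to the same address, which is part of the added complexity of a multi-entry buffer.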
The write buffer effectively acts to smooth the
irregular frequency of writes into a more regular rate. It should be noted that
there are usually fewer writes than reads because most of the results of
processing are stored in registers, and many of those results are temporary,
with no requirement to be saved in memory. Before the advent of large sets of
general purpose registers, the statistical split of reads vs. writes was not
quite as large, but was still significant. In RISC processors, the figures
today are 26% reads vs. 9% writes.
Write Miss Policies
When a write misses in the cache, there are two
possible actions. We can fetch the missing line and then complete the write, or
we can bypass the cache and store directly to a lower level memory. The former
is called write-allocate because
it requires us to allocate space in the cache as part of the write. The latter
is called write no-allocate because it doesn't allocate any space in the cache.
These two policies often correspond to write back and
write through, respectively. In a write-back cache, we try to keep the current
copy closest to the processor, so write-allocate is a logical extension.
Likewise, with write-through, we store the data all the way out to the level in
common with all consumers of data, and so bypassing the cache is consistent.
Victim Buffer
Many of the conflict misses seen in direct-mapped
caches are due to short term thrashing between a small set of aligned values.
Introducing a small, fully associative cache (<= 5 entries) that holds
recently evicted values, and which can quickly refill a line in the direct
mapped cache, can greatly reduce the thrashing-induced conflict misses.
The victim cache can be useful even in set-associative
caches as a means to quickly recover from transient cache evictions such as
exception handlers and interrupts. Such a small fully associative cache is not
particularly expensive to implement.
The effect of the victim buffer is thus to provide a
small pool of extra lines that are used to temporarily increase the associativity
of individual sets in response to transient spikes in capacity misses.
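The effect on thrashing can be seen in a small model. The sketch below is a simplified illustration with assumed names; it tracks only line numbers and counts how many references have to go to memory.

```python
from collections import OrderedDict


class VictimCache:
    """Direct-mapped cache backed by a small fully associative victim buffer."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.victim_entries = victim_entries
        self.lines = [None] * num_lines     # line number held by each slot
        self.victims = OrderedDict()        # recently evicted lines, oldest first
        self.misses = 0                     # references that must go to memory
        self.victim_hits = 0                # conflict misses caught by the buffer

    def access(self, line):
        index = line % self.num_lines
        if self.lines[index] == line:
            return                          # ordinary hit
        if line in self.victims:
            del self.victims[line]          # swap: the victim returns to the cache
            self.victim_hits += 1
        else:
            self.misses += 1
        evicted = self.lines[index]
        self.lines[index] = line
        if evicted is not None and self.victim_entries > 0:
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)


# Lines 0 and 8 map to the same slot of an 8-line cache and thrash.
thrash = [0, 8] * 10
plain = VictimCache(num_lines=8, victim_entries=0)
buffered = VictimCache(num_lines=8, victim_entries=4)
for line in thrash:
    plain.access(line)
    buffered.access(line)
print(plain.misses, buffered.misses, buffered.victim_hits)   # 20 2 18
```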
Pseudo-Associative Caches
An alternative to associative retrieval in many
instances is hashing. The direct mapped cache uses a simple address-based hash,
and does not rehash if a miss is detected. In a pseudo-associative cache, the
hashing scheme is extended with a single level of rehashing (usually with a
simple bit-flip) on a miss. The advantage, of course, is that the same effect
as a two-way associative cache is obtained without having to add a comparator
and the wider data paths out of the cache circuitry block. Multiple levels
could be used, but the benefit diminishes quickly as the time to read and check
the tags adds up. Furthermore, having a variable-length miss time greatly
complicates the timing of the processor, so at least the first level cache is
unlikely to use this technique.
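A single level of rehashing with a bit flip might look like the following. This is a sketch under assumptions: flipping the top index bit, swapping on a second-probe hit, and demoting the displaced line to the alternate slot are illustrative policy choices, and full line numbers stand in for tags.

```python
class PseudoAssociativeCache:
    """Direct-mapped array probed a second time at a bit-flipped index."""

    def __init__(self, index_bits):
        self.index_bits = index_bits
        self.lines = [None] * (1 << index_bits)
        self.fast_hits = 0      # found on the first probe
        self.slow_hits = 0      # found on the rehash probe
        self.misses = 0

    def access(self, line):
        primary = line & ((1 << self.index_bits) - 1)
        alternate = primary ^ (1 << (self.index_bits - 1))   # flip the top index bit
        if self.lines[primary] == line:
            self.fast_hits += 1
        elif self.lines[alternate] == line:
            # Found on the rehash: swap so the next access hits on the first probe.
            self.lines[primary], self.lines[alternate] = line, self.lines[primary]
            self.slow_hits += 1
        else:
            # Miss: demote the primary occupant to the alternate slot.
            self.lines[alternate] = self.lines[primary]
            self.lines[primary] = line
            self.misses += 1


pac = PseudoAssociativeCache(index_bits=3)
for line in [0, 8, 0, 0, 8]:        # lines 0 and 8 share primary index 0
    pac.access(line)
print(pac.misses, pac.slow_hits, pac.fast_hits)   # 2 2 1
```

The separate fast and slow hit counters reflect the variable hit time mentioned above: a slow hit costs an extra probe even though it avoids a miss.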
Prefetch (Hardware or Software)
While increasing the line size has a negative effect
on conflict misses, it does improve compulsory misses. One way around this
tradeoff is to provide a prefetch buffer that holds an additional line beyond
the one most recently fetched. If the pattern of access then needs the next
line, there is no miss penalty and another prefetch can be started. Although
there are still compulsory misses in that the data must still enter the cache
for the first time, they do not cause a penalty. Unlike the increased line
size, however, the prefetching of data does not evict existing cached data
unless it is necessary.
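The behavior of a one-line prefetch buffer can be sketched by counting how many references actually pay a miss penalty. The model below makes simplifying assumptions that are not in the text: the cache is unbounded so that only compulsory misses appear, a prefetch always completes before the next line is needed, and the function name is hypothetical.

```python
def count_stalls(line_trace, prefetch=True):
    """Count references that pay a miss penalty, with an optional next-line buffer."""
    cache = set()           # unbounded cache: isolates compulsory misses
    buffer_line = None      # the single line held by the prefetch buffer
    stalls = 0
    for line in line_trace:
        if line in cache:
            continue                    # ordinary hit
        if prefetch and line == buffer_line:
            cache.add(line)             # supplied by the buffer: no penalty
        else:
            stalls += 1                 # fetched from memory: full penalty
            cache.add(line)
        if prefetch:
            buffer_line = line + 1      # start fetching the next sequential line
    return stalls


sequential = list(range(10))
print(count_stalls(sequential, prefetch=False), count_stalls(sequential))   # 10 1
```

All ten lines still enter the cache for the first time, but with the buffer only the first one stalls the processor. A strided pattern that skips lines gets no benefit from this simple scheme.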
It is also possible to include instructions that
specify a prefetch. The compiler can then insert these instructions in advance
of a change in working set in order to cause the data to be precached. However,
if the prefetch operation requires a virtual memory translation, it can be
quite costly, especially if the use of the prefetched data is branch dependent.
It may be desirable to ignore a prefetch, and so the instruction set
architecture may not guarantee that a prefetch will always occur.
Non-Blocking Cache
In a superscalar processor, it is possible that
different pipelines will simultaneously need to load data. If one of them
misses, then the others would normally be forced to wait. However, it is
possible to let them proceed while the miss is being serviced. As long as the
pipeline control maintains sequential semantics, then this isn't a problem
(although the other pipelines might eventually stall, if they are dependent on
the missing data).
It is even possible to support multiple misses at once
by pipelining them and carefully retiring them as the data is loaded. However,
this can be complicated by virtual memory faults and exceptions.
In a non-blocking cache, we typically have support for
a limited number of in-flight misses. Once this number of misses is pending, no
more loads can be issued. However, misses tend to come in groups that are to a
common line. Consider that a loop may miss on reading from the first word of a
line. Its next access is to the second word on the line, and also misses. Thus,
one actual miss could use up all of the slots in the non-blocking logic.
A miss status holding register (MSHR) is used to keep track of the pending misses.
If another miss is detected that is going to be serviced by an already pending
miss, then the MSHR logs this information and causes the miss to respond to all
of the loads that are waiting for it. Thus, the non-blocking logic need only
handle truly independent misses, and the pipelines are allowed to run further ahead
while awaiting data from memory.
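The merging behavior can be sketched as a small table keyed by line number. The class name, the return values, and the 32-byte default line size are assumptions for illustration; real MSHRs also record details such as the destination register and the word offset of each waiting load.

```python
class MSHRFile:
    """Tracks in-flight misses, merging loads that target the same line."""

    def __init__(self, entries, line_bytes=32):
        self.entries = entries
        self.line_bytes = line_bytes
        self.pending = {}               # line number -> ids of waiting loads

    def load_miss(self, load_id, address):
        """Record a missing load. Returns 'merged', 'allocated' or 'stall'."""
        line = address // self.line_bytes
        if line in self.pending:
            self.pending[line].append(load_id)   # secondary miss: reuse the entry
            return "merged"
        if len(self.pending) == self.entries:
            return "stall"                       # every entry is busy
        self.pending[line] = [load_id]           # primary miss: new memory request
        return "allocated"

    def fill(self, address):
        """Memory returned the line: wake every load that was waiting for it."""
        return self.pending.pop(address // self.line_bytes, [])


mshr = MSHRFile(entries=2)
print(mshr.load_miss("a", 0))    # allocated
print(mshr.load_miss("b", 4))    # merged: same 32-byte line as load "a"
print(mshr.load_miss("c", 64))   # allocated
print(mshr.load_miss("d", 128))  # stall: both entries are in use
print(mshr.fill(0))              # ['a', 'b']
```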
Tags On Data Off
When reading data from a cache, the step that must
happen first is the tag check. Thus, the actual data read can be slightly
delayed. This opens up the possibility of keeping the tags on the processor chip
(for faster access) and the data off the chip (for a larger cache than can fit
on the chip). While this scheme works, it has the disadvantage of constraining
the system-level design options. For example, if a chip has on-board tags to
support 16K lines of level 3 cache, then the system designer cannot easily make
the L3 cache larger than this. Given that most processors now have two levels
of cache entirely on the chip, the frequency of access to L3 is small, and
doesn't justify the added on-chip hardware.
Physical vs. Logical (Virtual) Cache
Caches fall in the memory hierarchy between the CPU
and the main memory. Unfortunately, so does the memory management unit that
handles the virtual memory mapping. This forms a relationship that can
complicate the design of both. We must choose where the MMU falls in this
hierarchy, i.e., above or below the cache. If it is above the cache, then the
cache is said to be physical, because it sees physical memory address
references coming from the CPU+MMU. If the MMU is below the cache, then the cache is
said to be virtual or logical, as it sees virtual (logical) addresses coming directly from
the CPU. And if there are multiple levels of cache, then it is possible for the
MMU to appear between them, so we might have a logical primary cache and a
physical secondary cache (as in the MIPS R4400).
Because a logical cache precedes the delay inducing
MMU, it can operate with lower latency. However, it also must deal with what is
known as the synonym problem. Due to the nature of virtual memory, it is
possible for a single physical address to be assigned multiple virtual
addresses. When the CPU writes to the logical cache, it writes to just one of
these addresses, and that write may propagate back to the physical address in the
main memory. If the CPU then reads from a different logical address assigned to
that same location, it may fetch an obsolete cached copy.
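The synonym problem can be reproduced in a toy model. Everything here is hypothetical and chosen only for illustration: the addresses, the one-entry-per-address cache, and the write-through store.

```python
page_table = {0x1000: 0x8000, 0x2000: 0x8000}   # two virtual addresses, one physical
memory = {0x8000: 1}                             # physical main memory
vcache = {}                                      # logical cache: {virtual address: value}


def load(vaddr):
    if vaddr not in vcache:                      # miss: translate and fetch
        vcache[vaddr] = memory[page_table[vaddr]]
    return vcache[vaddr]


def store(vaddr, value):
    vcache[vaddr] = value                        # update only this alias in the cache
    memory[page_table[vaddr]] = value            # the write propagates to main memory


load(0x2000)            # caches the old value under the second alias
store(0x1000, 2)        # updates the first alias and main memory
stale = load(0x2000)    # hits in the cache on the obsolete copy
print(stale, memory[0x8000])   # 1 2
```

Main memory holds the new value, but the cache still answers the second alias with the old one because nothing ties the two virtual addresses together.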
In the R4400 the secondary cache is physically
addressed and has tags that note when a location is shared. A miss that tries
to read a shared location into a different logical address causes the secondary
cache to emulate a snooping operation by a second processor so that the primary
cache's earlier copy of the value is invalidated. Thus, the logical cache can
have just one copy of the shared variable active at a time, but it remains in
the secondary cache so it is only moderately costly to switch between copies
frequently.
Case Study Processors
|               | PPC 601        | PPC 603        | PPC 604        | PPC 620        | SPARC      | R10000 | R4400      | Pentium        | P6             |
| Blocking      | 1              | 0              | 4              |                |            |        |            |                |                |
| Split?        | No             | Yes            | Yes            | Yes            | Yes        | Yes    | Yes        | Yes            | Yes            |
| Data          | 32K            | 8K             | 16K            | 32K            | 16K        | 32K    | 16K        | 8K             | 8K             |
| Instruct.     | Unified        | 8K             | 16K            | 32K            | 16K        | 32K    | 16K        | 8K             | 8K             |
| Associativity | 8              | 2              | 4              |                | D: 1 I: 2  | 2      | 1          | 2              | D: 4 I: 2      |
| Line size     | 64 B           | 32 B           | 32 B           |                |            |        | 16 or 32 B | 32 B           |                |
| Data Write    | Back (through) | Back (through) | Back (through) | Back (through) |            |        |            | Back (through) | Back (through) |
| Replacement   |                |                | LRU            |                |            |        |            |                |                |
Notes: PPC 620: I-cache predecodes instr. R10000: I-cache predecodes instr. P6: 245KB 4-way L2 cache.