Lecture 9 Memory Hierarchy and Caching
Memory Hierarchy
There are many different technologies for storing
data. In the early years of computing, cathode ray tubes, acoustical delay
lines, magnetic drums, magnetic cores, punched cards, and open-reel magnetic
tape were popular.
Today, these technologies have essentially been
replaced by semiconductor memory, magnetic disk, optical disk, and helical-scan
magnetic tape cartridges. Other technologies have been proposed, such as CCD
and magnetic bubble memory, but have not succeeded in gaining a significant
part of the market. Part of the reason that these devices failed, even though
they had great potential, is that there is a tremendous amount of
infrastructure behind the current technologies, which were adopted with (and in
part enabled) the explosion in computer usage.
Given the huge amount of capital available to support
development of the next RAM, disk, or tape technology, it is difficult for a
radically new technology to enter the field. In the case of optical disk
technology, the development capital came from entertainment companies, and it
was realized that the new storage medium had a secondary market for
non-volatile data and software storage, especially at high density and low
duplication cost, and with random access -- a niche not filled by any other
storage medium. However, the development cost for CD technology was probably in
the $100M - $500M range and people are still struggling to find uses for the
new medium. (Collections of shovelware are common.) The biggest applications
are in video-rich computer games, and the use of writable CD media for backup
of magnetic disk. DVD-ROM is replacing CD-ROM in most new computers, but there
are few computer applications that can take advantage of the extra space. They
can be used to turn laptop computers into portable video players, and with
writable DVD, the capacity for disk backup is greater.
Each of the technologies has a place in the market
because it covers a range of access times with inversely proportional cost per
unit, which in turn is proportional to density. CD-ROM filled a gap in access
time between disk and tape that had emerged as disks got faster and tapes did
not. The price of CD-ROM and DVD-ROM is basically determined by a market
analysis that places it between disk and tape -- it is actually much cheaper to
produce than the retail price would indicate.

Bubble memory tried to provide slightly faster access
rates than disk with a promise of slightly higher density. However, the gap was
not wide enough, and disk density, access rate and cost caught up with it by
the time it was developed. CCD memory was intended to fill the same
gap, but with faster speeds and higher density. However, its cost was close to
that of semiconductor memory so there was little advantage over simply using
slower RAM. This gap has essentially closed due to cheap DRAM and Redundant
Arrays of Inexpensive Disks (RAID) that provide higher bandwidth transfers,
especially when they contain RAM buffers.
There appears to be a wide gap between the main and
secondary memory, but we are used to the fact that this gap exists, and both
systems and languages are structured to work around it. That is, it is part of
the programming model.
In a similar manner, transfers between the top two
levels can be programmed because they are at a level of granularity that
corresponds to the linguistic elements that a compiler can easily extract and
manipulate.
The gap that now seems to open a possibility for new memory devices is the
one emerging between the on-chip and the off-chip memory levels. The problem is
that when a priori or empirically estimated locality assumptions break down, the
cost of going to main memory is relatively high. That is, a cache miss is very
costly. Thus, we have seen some new RAM technologies, such as Rambus, enter the
market to provide a high-performance off-chip memory.
The memory system architect's approach is to try to
reduce that cost by speeding up the main memory. Thus, we see schemes that
fetch many words at once in order to support a new region of locality in cache.
In other words, if we get a miss, the assumption is that we are seeing a switch
in the region of locality accessed by the program, and thus should try to load
that region quickly so that we do not suffer further misses.
On the other hand, if the language made it possible
for the compiler to consider a larger context, it might be possible to schedule
the transfers between these levels so that their effects are lessened. In
addition, it might be possible for the compiler to reorder the processing to
maximize the locality of reference, thereby giving the memory system time to
transfer the data.
As an example, consider that if a program runs
sequentially through a terabyte of data, performing one operation on each
value, the caches will effectively miss continuously. But if the program does
this over and over, it may be possible to reorder the statements so that more
operations are performed for each data reference, providing time to prefetch
the next value. This is already done for loops on arrays, but is hard to do for
linked lists and other non-array structures.
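As a minimal sketch of this kind of reordering (in Python, with hypothetical function names and two arbitrary operations chosen only for illustration), the two functions below compute the same result. The first traverses the data twice, performing one operation per data reference; the second fuses the passes so that both operations are applied on a single reference. Python lists do not model cache behavior, so this shows only the restructuring, not the timing benefit.

```python
def two_passes(data):
    # Pass 1: one operation per data reference (scale each value).
    scaled = [x * 2 for x in data]
    # Pass 2: the whole data set is traversed again for the second operation.
    return [x + 1 for x in scaled]


def fused_pass(data):
    # Both operations are applied while the value is at hand, so each
    # element is referenced once and more work is done per reference.
    return [x * 2 + 1 for x in data]


sample = [1, 2, 3]
same_result = two_passes(sample) == fused_pass(sample)  # True
```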
Hwang defines five parameters associated with memory
technologies arranged in a hierarchy:
Access Time: Time for the CPU to fetch a value from
memory -- including delays through any intermediate levels.
Memory Size: The amount of memory of a given type in a
system.
Cost Per Unit (byte): Cost per unit times size roughly
equals total cost.
Transfer bandwidth: Units (bytes) per second
transferred to the next level.
Unit of transfer: Number of units moved between
adjacent levels in a single move.
He also defines three properties of a memory
hierarchy:
Inclusion: If a value is found at one level, it is
present at all of the levels below it.
Coherence: The copies at all of the levels are
consistent.
Locality: Programs access a restricted portion of
their address space in any time window.
Obviously, none of these is strictly true. Most
hierarchies are inclusive from the registers to the main memory (although we
could imagine a multi-level cache that skips a level on loading, and only
copies out to the lower level when writing back). However, most tape units do
not spool to disk before going to main memory -- they are DMA devices, just
like the disks.
Coherence would hold for a write-through cache, but not for a write-back
cache, where a modified line differs from the copy in the level below until it
is written back.
Locality is entirely program-dependent. For example,
LISP programs have logical locality that does not correspond to physical
address locality. Most caches assume array type data access and sequential
code. The book identifies three aspects of this form of locality:
Temporal locality: Recently accessed items tend to be
accessed again in the near future.
Spatial locality: Accesses are clustered in the
address space.
Sequential locality: Instructions tend to be accessed
in sequential memory locations.
Temporal locality tends to hold under all conditions
for code (even parallel or recursive codes) although it may break down in a
rule-based system. For data, it is most effective in array-based applications.
There is less spatial locality to exploit in the data
accesses when they are being processed in parallel. For example, if a cache has
8 words per line, and the data has perfect spatial locality, then a scalar
machine exploits this by getting 7 hits for every miss. But a superscalar
machine that is consuming the values two at a time gets just 3 hits for each
miss. The presence of spatial locality implies some temporal locality (i.e., if
all accesses are within a small block of memory, then it is probable that many
of them are being accessed repeatedly). However, temporal locality does not
imply spatial locality (e.g., a pointer-oriented program might repeatedly run
through a short list whose elements are scattered).
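The hits-per-miss arithmetic in the line-size example can be checked with a short calculation. This is a sketch under the same idealized assumptions as the text (perfect spatial locality, and a line size that is a multiple of the number of words consumed per access); the function name is hypothetical.

```python
def hits_per_miss(words_per_line, words_per_access):
    """Hits that follow each miss under perfect spatial locality.

    Assumes each access consumes `words_per_access` consecutive words and
    that the line size is a multiple of that width.
    """
    accesses_per_line = words_per_line // words_per_access
    return accesses_per_line - 1  # the first access to each line is the miss


scalar = hits_per_miss(8, 1)       # 7 hits per miss
superscalar = hits_per_miss(8, 2)  # 3 hits per miss
```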
Sequential locality holds for instructions and certain
array access patterns. Interestingly, sequential locality can actually increase
for programs on massively parallel processors because some loops are
eliminated. For example, if an array is distributed so that one of its
dimensions is spread across the processors, there is no need for the loop that
would normally step through that dimension.
One of the implications of the differences in locality
that we may encounter between data and instructions is the benefit of having
separate instruction and data caches. The main benefit of separate caches,
however, is that instructions and operands can be fetched simultaneously -- a
design known as a Harvard architecture, after the Harvard Mark series of electromechanical
machines, in which the instructions were supplied by a separate unit.
The hit ratio h is an important measure of the
performance of a memory level: it is the probability that a reference is to a
value already in a given level of the hierarchy. The miss ratio is 1 - h.
The access frequency is the product of the hit ratio
for the given level with the miss ratios of all higher levels.
The effective access time of the memory system is the
sum of the access frequencies times their corresponding access times.
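The two definitions above can be turned into a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the three-level hit ratios and access times are made-up values, not measurements. Levels are listed from highest (fastest) to lowest, and the last level is assumed to always hit.

```python
def access_frequencies(hit_ratios):
    """f_i = h_i times the miss ratios (1 - h_j) of all higher levels."""
    freqs = []
    miss_so_far = 1.0  # probability that every higher level missed
    for h in hit_ratios:
        freqs.append(miss_so_far * h)
        miss_so_far *= 1.0 - h
    return freqs


def effective_access_time(hit_ratios, access_times):
    """Sum of the access frequencies times their access times."""
    freqs = access_frequencies(hit_ratios)
    return sum(f * t for f, t in zip(freqs, access_times))


# Hypothetical hierarchy: cache, main memory, disk (times in cycles).
hit_ratios = [0.95, 0.90, 1.0]
access_times = [1, 10, 100]
t_eff = effective_access_time(hit_ratios, access_times)  # about 1.9 cycles
```

Even with a 95% hit ratio at the top level, the two lower levels contribute about half of the effective access time in this example, which is why a miss is described as costly.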
This formula is overly simplistic, as it assumes there
is some probability that you have to go to tape to get data, for example. It
also ignores short-circuit mechanisms, such as TLBs, that might have a variable
effect on access time. However, one point should not be minimized -- proper
memory hierarchy design depends on extensive simulation.
Cache Memory Design
Basic Cache Structures
A cache consists of lines of data, usually containing
values from two or more consecutive addresses of main memory. Each line has
associated with it a tag that stores the high order bits of the address for the
data in the line.

The differences in the cache mostly have to do with
how the comparison is implemented.
In a fully associative cache, there is a comparator
for each tag, and any line can contain any block of memory.
In a direct-mapped cache, there is a block number in
addition to the tag field of the address. The block number specifies a line in
the cache, and the tag field of the address is compared with the tag field
attached to the line. If they are equal, then there is a hit.
Direct mapping is very fast, but it suffers from the problem that two main
memory lines that map to the same line in cache collide and can thrash in and
out, even when there is plenty of space left in the cache to hold them both.
With increased cache size, the probability of such collisions is reduced, and
thus many modern machines have used direct-mapped caches.
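The lookup and the thrashing problem can be sketched in a few lines of Python. This is a simplified model that tracks tags only (no data), with a hypothetical class name and arbitrary sizes; the address is split into a word offset, a block-number field that selects the line, and a tag.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache."""

    def __init__(self, num_lines, words_per_line):
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.tags = [None] * num_lines  # one stored tag per cache line

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        memory_line = address // self.words_per_line  # drop word-offset bits
        index = memory_line % self.num_lines          # block-number field
        tag = memory_line // self.num_lines           # remaining high-order bits
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag                        # replace the old occupant
        return False


cache = DirectMappedCache(num_lines=4, words_per_line=8)
# Addresses 0 and 32 lie in memory lines 0 and 4, which both map to cache
# line 0, so alternating between them misses every time even though the
# other three cache lines stay empty.
results = [cache.access(a) for a in (0, 32, 0, 32)]
```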

In a K-way set associative memory, the cache is
divided into sets of K lines. There are S = L/K sets, where L is the number of
lines in the cache. The address contains a set field that selects a set. The K
tags for the set are read out and compared to the tag for the address. If there
is a match, then the corresponding line is selected and the data is fetched
from it.
Up to K memory lines with the same set number can be held at once before a
replacement must occur. It has been shown that for
reasonably small K (e.g. 2 or 4), the vast majority of conflicts can be handled
without thrashing replacements.

A set associative memory requires only K comparators,
rather than the L comparators of the fully associative memory. When S = 1, then
the cache is fully associative (K = L). When K = 1, the cache is direct mapped.
Thus, set associativity can provide an economical midpoint between these two
extremes. The majority of caches today are set associative. Typical levels of
set associativity are 2-way, 4-way, and 8-way. The Alpha 21164 provided a 3-way
set associative cache, and it was noted that performance was significantly
better than 2-way set associativity. The designers believed that this is because
there are many array processing operations in which two arrays are input to a
formula, and the result is assigned to a third array. With two-way set
associativity, there are not enough ways to accommodate this triad of values,
but with 3 ways, it is possible to operate on all of the values without
evicting some.
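The set associative lookup, and the three-array case just described, can be sketched as follows. This is a tag-only model with LRU replacement assumed within each set (the text does not specify the policy), and the class name, cache sizes, and array base addresses are all hypothetical. The three array bases are chosen so that corresponding elements fall in the same set.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only model of a K-way set associative cache with LRU replacement."""

    def __init__(self, num_lines, ways, words_per_line):
        if num_lines % ways != 0:
            raise ValueError("number of lines must be a multiple of K")
        self.ways = ways                    # K
        self.num_sets = num_lines // ways   # S = L / K
        self.words_per_line = words_per_line
        # Each set holds up to K tags, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        memory_line = address // self.words_per_line
        set_index = memory_line % self.num_sets   # set field of the address
        tag = memory_line // self.num_sets
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)                # mark most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)             # evict least recently used
        lines[tag] = None
        return False


def count_misses(cache, addresses):
    return sum(1 for a in addresses if not cache.access(a))


# Interleaved accesses a[i], b[i], c[i] for i = 0..3, with the arrays
# assumed to start at addresses 0, 64, and 128 (all mapping to set 0).
triad = [base + i for i in range(4) for base in (0, 64, 128)]
two_way = SetAssociativeCache(num_lines=8, ways=2, words_per_line=4)
three_way = SetAssociativeCache(num_lines=12, ways=3, words_per_line=4)
misses_two_way = count_misses(two_way, triad)      # 12: every access misses
misses_three_way = count_misses(three_way, triad)  # 3: one per array
```

Both caches here have four sets, so the difference comes only from the number of ways. Setting ways=1 gives direct-mapped behavior, and setting ways equal to num_lines gives a single, fully associative set.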
Caches are often augmented by adding other small
structures to support their operation. These provide added resources that are, in
effect, available to use in critical cases.
Between the cache and the CPU there is usually a set
of write buffers. These form a small, fully associative cache that is built
like a set of registers to be very fast. When the CPU writes to memory, it
tends to do so in bursts (e.g., spilling registers, pushing stack frames, etc.).
Without the buffers, the CPU would have to stall as the slower cache tries to write these values.
The problem is made worse by the higher probability that bursty writes such as
these will miss in the cache. Thus, the write buffer provides a bit of slack in
the write path to compensate for this mismatch in speeds. Of course, if the CPU
tries to read recently written data, then the write buffer must respond in
place of the cache. Also, the write buffer must constantly monitor the status
of the cache so that it can immediately start writing the next value when the
cache is free. When there is a coherency check in a multiprocessor, the
write buffer may have to respond to it as well. That is, when another processor
attempts to access a value that has been changed, the new value may still be in
the write buffer.
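A minimal sketch of the write buffer behavior is given below. The class and method names are hypothetical, the cache is modeled as a plain dictionary from address to value, and merging a second write to the same address into the existing entry is an assumption made for simplicity rather than something the text specifies.

```python
from collections import OrderedDict


class WriteBuffer:
    """Small fully associative buffer of pending writes, oldest first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = OrderedDict()  # address -> value

    def write(self, address, value):
        """Buffer a CPU write. Return False if the CPU would have to stall."""
        if address in self.pending:
            self.pending[address] = value  # assumed: merge with pending write
            return True
        if len(self.pending) == self.capacity:
            return False                   # buffer full: no slack left
        self.pending[address] = value
        return True

    def read(self, address, cache):
        """Recently written data must come from the buffer, not the cache."""
        if address in self.pending:
            return self.pending[address]
        return cache.get(address)

    def drain_one(self, cache):
        """Called when the cache is free: retire the oldest pending write."""
        if self.pending:
            address, value = self.pending.popitem(last=False)
            cache[address] = value


cache = {0x10: 1}
wb = WriteBuffer(capacity=2)
wb.write(0x10, 5)
forwarded = wb.read(0x10, cache)  # 5, while the cache still holds 1
wb.drain_one(cache)               # now the cache holds 5
```

A coherency probe from another processor would have to consult `pending` in the same way that `read` does.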
Between the cache and the next lower level in the
hierarchy it is common to find another small, fully associative cache: a victim
buffer. When a line is evicted from the cache, it is placed in the buffer.
Accesses to the cache also check the tags in the victim buffer in parallel, and
if a needed line is found there, then it supplies the value and swaps the line with
the line in the cache that is now being evicted due to the miss. Lines in the victim
buffer are usually replaced with an LRU policy. Thus, in a four-entry victim
buffer, if a line is evicted from the cache and not accessed again, it remains
in the buffer until four more lines have been evicted. In essence, the victim
buffer provides a mechanism for selectively increasing the associativity of a
small number of sets when they are under excessive pressure.
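The swap and the LRU behavior can be sketched with a direct-mapped cache backed by a small victim buffer. This is a tag-only model with hypothetical names; it works in units of memory line numbers rather than byte addresses, and the choice of a direct-mapped main cache is an assumption made to keep the example short.

```python
from collections import OrderedDict


class CacheWithVictimBuffer:
    """Direct-mapped cache plus a fully associative LRU victim buffer."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # memory line held by each cache line
        self.victim_entries = victim_entries
        self.victims = OrderedDict()      # evicted lines, least recent first

    def access(self, memory_line):
        """Return 'hit', 'victim-hit', or 'miss'."""
        index = memory_line % self.num_lines
        if self.lines[index] == memory_line:
            return "hit"
        evicted = self.lines[index]
        if memory_line in self.victims:
            # Found in the victim buffer: it will swap places with the
            # line that this miss is evicting from the cache.
            del self.victims[memory_line]
            outcome = "victim-hit"
        else:
            outcome = "miss"              # must go to the next lower level
        self.lines[index] = memory_line
        if evicted is not None:
            if len(self.victims) == self.victim_entries:
                self.victims.popitem(last=False)  # drop the oldest victim
            self.victims[evicted] = None
        return outcome


vc = CacheWithVictimBuffer(num_lines=4, victim_entries=4)
# Memory lines 0 and 4 collide in cache line 0; after the first two
# compulsory misses, the victim buffer absorbs the conflict.
outcomes = [vc.access(n) for n in (0, 4, 0, 4)]
```

Here `outcomes` is `['miss', 'miss', 'victim-hit', 'victim-hit']`, where a plain direct-mapped cache would miss all four times.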
Miss status holding registers are used as a means of
keeping track of in-progress misses. Many systems support non-blocking cache
reads in which instructions are allowed to proceed while one or more loads
await responses from a cache miss. Often, instructions that are independent of
a load can be scheduled behind it, and can proceed without waiting for it to
complete. However, in a significant number of cases, a load miss will be
followed by additional references to words in the same line. These also
generate misses, and the MSHR catches these and prevents them from generating
additional transactions in the memory hierarchy. When the line becomes
available, then the MSHR responds to the pending loads, and processing
proceeds. Architects refer to this merging of misses as coalescing.
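Coalescing can be sketched as a table keyed by line number. The names below are hypothetical and the model is deliberately minimal: it has an unlimited number of registers and counts only how many transactions are sent down the hierarchy.

```python
class MissStatusHoldingRegisters:
    """Tracks in-progress misses and coalesces loads to the same line."""

    def __init__(self, words_per_line):
        self.words_per_line = words_per_line
        self.pending = {}         # line number -> ids of loads waiting on it
        self.memory_requests = 0  # transactions actually sent to memory

    def load_miss(self, load_id, address):
        line = address // self.words_per_line
        if line in self.pending:
            # The line is already on its way: record the load, send nothing.
            self.pending[line].append(load_id)
        else:
            self.pending[line] = [load_id]
            self.memory_requests += 1

    def line_arrived(self, line):
        """Return the loads that can now complete, in arrival order."""
        return self.pending.pop(line, [])


mshr = MissStatusHoldingRegisters(words_per_line=8)
mshr.load_miss("ld1", 16)           # line 2: new miss, one request sent
mshr.load_miss("ld2", 17)           # line 2 again: coalesced
mshr.load_miss("ld3", 40)           # line 5: a second request
completed = mshr.line_arrived(2)    # ['ld1', 'ld2']
```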
Prefetch buffers may be used to enable a cache to
fetch more than one line at a time, in case spatial locality is especially
high. For example, when a cache
takes a miss, it typically fetches just the line containing the missing word.
But it is usually a less expensive read operation to immediately fetch the next
line from main memory as well. With a prefetch buffer, the next line is put
into a special register, and if the CPU runs off the end of the current line
and generates a new miss, then the data is immediately available to be moved
into the cache and a new prefetch operation is initiated. Before going out to
prefetch from main memory, however, it is important to check the tags in the
cache and victim buffer to avoid generating a memory transaction for a line
that is already available.
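A single-entry next-line prefetch buffer can be sketched as follows. The names are hypothetical, the cache and victim buffer are modeled as unbounded sets of line numbers (so nothing is ever evicted), and every fetch from main memory is counted as one read.

```python
class NextLinePrefetcher:
    """Cache model with a one-line prefetch buffer (line numbers only)."""

    def __init__(self):
        self.cache = set()          # lines resident in the cache
        self.victim_buffer = set()  # lines resident in the victim buffer
        self.prefetched = None      # the single prefetch register
        self.memory_reads = 0

    def access(self, line):
        """Return 'hit', 'prefetch-hit', or 'miss'."""
        if line in self.cache:
            return "hit"
        if line == self.prefetched:
            outcome = "prefetch-hit"   # data is immediately available
        else:
            self.memory_reads += 1     # demand fetch from main memory
            outcome = "miss"
        self.cache.add(line)
        self.prefetched = None
        self._prefetch(line + 1)       # start fetching the next line
        return outcome

    def _prefetch(self, line):
        # Check the tags first so that no memory transaction is generated
        # for a line that is already available.
        if line in self.cache or line in self.victim_buffer:
            return
        self.memory_reads += 1
        self.prefetched = line


p = NextLinePrefetcher()
outcomes = [p.access(n) for n in (0, 1, 2, 3)]
```

For this sequential pattern only the first access is a demand miss; the remaining three are satisfied from the prefetch buffer.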
An interesting variation on this scheme is to expand
the prefetch buffer into a parallel cache with long lines -- a spatial locality
buffer. Then the main cache can be reserved for values with strong temporal
locality, and its lines can be as short as a single word. Misses load into the
spatial buffer, which is fully associative with an LRU policy. The buffer keeps
track of which words in the long line are then accessed, and when the line is
finally evicted, these words are promoted to the main cache. Thus, the main
cache contains only words that have shown strong temporal locality. In effect,
the spatial buffer provides a mechanism for monitoring the access pattern of a
set of words for a short period, and using the observed behavior to make decisions
regarding promotion for longer tenancy in the cache. This mechanism is
especially useful when the main cache is small -- as is the case in embedded processors.
As CPU clock rate rises, it becomes more difficult to
keep the level 1 cache access operating in a single cycle. We may see a need to
start shrinking the size of L1, or to partition L1 into smaller sub-caches, or
to insert an L0 cache that is smaller. The difference in L0 from other levels
is that it probably cannot load every value that is missed. It must be more
selective in what it chooses to store for fast access. This is an open area for
research.