Given a bus that supports multiple masters, especially
one that does so symmetrically, it is natural to try to increase performance by
placing multiple processors on the bus. However, the limited bandwidth of the
bus greatly reduces the performance of the system because the processors must
share this single path to memory.
To reduce bus traffic to memory, private caches are
employed. With a miss ratio of 10%, only one reference in ten reaches the bus,
so in principle the bus can support ten times as many processors. In reality,
however, this is rarely achieved when
data-parallelism is being employed because the processors are all loosely
synchronized, so they tend to have bursts of misses at the same times, which
slows memory access. Caches only really improve memory access performance over
the bus when the processors are running in true MIMD mode, with independent
programs.
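The factor-of-10 estimate above is simple arithmetic, sketched here. The
function name and the traffic figures are illustrative assumptions, not
measurements, and the sketch assumes misses are spread evenly in time (which,
as just noted, loosely synchronized processors do not achieve).

```python
def supported_processors(bus_refs_per_cycle, refs_per_proc_per_cycle, miss_ratio):
    """Upper bound on the number of processors before the bus saturates.

    Assumes every miss costs one bus transaction and that misses are
    spread evenly in time (the optimistic case).
    """
    bus_load_per_proc = refs_per_proc_per_cycle * miss_ratio
    return bus_refs_per_cycle / bus_load_per_proc


# Hypothetical numbers: the bus serves 1 reference per cycle and each
# processor issues 0.5 references per cycle.
uncached = supported_processors(1.0, 0.5, miss_ratio=1.0)   # every reference uses the bus
cached = supported_processors(1.0, 0.5, miss_ratio=0.10)    # only 10% of references miss
improvement = cached / uncached
```

With these numbers the uncached bus saturates at two processors, and the
improvement factor is simply the reciprocal of the miss ratio.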
But if data is shared, the presence of multiple caches
implies that there can be multiple copies of the data. If different processors
can write to the shared variable, then its copies may be inconsistent with each
other, and either a means must be found to restore consistency, or invalid
results may be computed. This is known as the cache coherency problem.
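A minimal sketch of how the inconsistency arises, assuming two write-back
caches with no coherence mechanism at all. The class and variable names are
hypothetical and exist only for this illustration.

```python
class PrivateCache:
    """A private write-back cache with no coherence mechanism."""

    def __init__(self, memory):
        self.memory = memory      # shared main memory (a dict)
        self.lines = {}           # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # updates only the local copy


memory = {"x": 0}
cache_a = PrivateCache(memory)
cache_b = PrivateCache(memory)

cache_a.read("x")          # both processors now cache x == 0
cache_b.read("x")
cache_a.write("x", 42)     # processor A updates its private copy

stale = cache_b.read("x")  # B hits in its own cache and sees the old value
```

After the write there are three views of x: the new value in A's cache, and
the old value in both B's cache and main memory.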
Cache Coherence and Synchronization
As the book points out, there are three common sources
of cache inconsistency:
Writing shared variables
Process migration
DMA I/O -- present even in a uniprocessor
Prevention
One way to avoid incoherency is to prevent it from
happening. If writable, shared values are prohibited from being cached, then
there is only one copy of these values and inconsistency is prevented.
This does not prevent one processor from overwriting a
value stored by another processor, but this can be handled in the same
manner as in a multiprocessing OS executing on a uniprocessor.
Of course, this means that all accesses to these
values are at cache-miss speeds. If a large amount of data is involved, then
performance decreases significantly. The access latency can be reduced somewhat
by permitting values to be cached in critical sections that end with a flush of
the values from the cache. All other processors are locked out during the
critical section. Of course, the locks cannot be cached under any
circumstances, and must be accessed through atomic operations. Furthermore,
this means that sharing of writable data is effectively sequentialized.
It is precisely this sequentialization of access that
cache coherency mechanisms try to preserve while at the same time providing
greater flexibility for caching data.
Sequential Consistency Semantics
The demand for a shared memory model arises from the
desire to preserve the sequential model of programming, so it is only natural
to also impose sequential semantics on writable shared values.
That is, we want it to always appear that every
program executes its instructions in sequential order from the point of view of
every other program.
But modern uniprocessors sometimes execute
instructions out of order. For example, if there is a read-miss followed by an
independent write, then the write may proceed to cache while the read does a
main memory fetch. In a uniprocessor, this makes no difference, and improves
performance.
In a multiprocessor, however, this can cause a problem. Suppose the
read is fetching a shared value and the write is releasing a lock on that
value. Depending on the cache mechanism, the write may actually reach main
memory first. Depending on the memory access arbitration mechanism, another
processor may then be able to write to the value before the read completes.
(It may seem that this could happen in a multitasking
environment on the uniprocessor, but there a context switch can easily be
forced to follow the completion of outstanding accesses.)
Strong sequential consistency requires that all loads
and stores to shared memory be atomic. That is, before another shared access
can occur, all preceding shared accesses must have been completed and all
copies updated. Atomic load-store operations must complete without any
intervening loads or stores to the same value.
Note that this does not require that loads and stores
occur immediately to main memory, just that they occur before any other
processor attempts a shared access. Thus, if one processor caches shared data,
it does not have to write it back until another processor has to access it. Of
course, if two processors are actively sharing data, then they are likely to
have to continually update the main memory copy.
Weak consistency relaxes this constraint, but places
more of the burden for maintaining consistency on the software.
For example, the Dubois, Scheurich, and Briggs (DSB) model
basically says that:
1. synchronizations must be atomic and consistent with
each other across all processors (i.e. they have a consistent global ordering),
2. loads and stores that precede the synchronization
point in the individual processors must be completed before they synchronize,
and
3. all synchronizations that precede a shared load or
store must complete before the operations occur.
Essentially, this requires that the software maintain
consistency by forcing synchronizations to occur so that loads and stores
update main memory as necessary. It provides the programmer with the ability to
abandon sequential semantics when they are not required, but may be more costly
for each operation.
The DSB model defines "completion" to occur
in several stages (from Stone). For writing:
1. A value is written to the write buffer. The write
has completed with respect to the processor. If it fetches the value back, it
must come from the write buffer.
2. The write buffer is written to the cache. It has
now completed with respect to storage. This point in time determines the
position that the write should occupy in the global ordering.
3. The cache coherency mechanism sends an update or
invalidate message to all other caches with respect to this location. As each
other processor receives this message, the write is complete with respect to
that processor.
4. When all processors finish updating, the write is
globally complete.
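The four write stages above can be tracked with a small state model. This is
only a sketch: the class name, method names, and processor labels are
hypothetical, and the coherency messages are reduced to a set of processors
that have not yet responded.

```python
class WriteProgress:
    """Tracks how far a single write has progressed through the four stages."""

    def __init__(self, other_processors):
        self.in_write_buffer = False            # stage 1
        self.in_cache = False                   # stage 2
        self.pending = set(other_processors)    # stage 3: caches not yet notified

    def buffer(self):
        self.in_write_buffer = True             # complete w.r.t. the issuing processor

    def store(self):
        assert self.in_write_buffer, "the write must be buffered first"
        self.in_cache = True                    # complete w.r.t. storage

    def acknowledge(self, processor):
        assert self.in_cache, "messages go out only after the cache is written"
        self.pending.discard(processor)         # complete w.r.t. that processor

    def complete_wrt(self, processor):
        return self.in_cache and processor not in self.pending

    def globally_complete(self):                # stage 4
        return self.in_cache and not self.pending


w = WriteProgress(other_processors={"P1", "P2"})
w.buffer()
w.store()
w.acknowledge("P1")
after_one_ack = (w.complete_wrt("P1"), w.complete_wrt("P2"), w.globally_complete())
w.acknowledge("P2")
after_all_acks = w.globally_complete()
```

After the first acknowledgement the write is complete with respect to P1 only;
it becomes globally complete once P2 has responded as well.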
For reading:
1. When the read reaches the cache and is granted
access, no other processor can change that value before the fetch completes.
2. If there is a miss, the fetch passes to main memory
and to the other processors. As it reaches each processor, it is complete with
respect to that processor, and it cannot be altered.
3. When it has reached all processors, and they are in
states where they won't change the value, then the read is complete with
respect to all processors.
4. Later, no processor that has issued a read to the
same address will see an earlier value, and the read is globally complete.
In other words, during a read miss, other processors
may be changing the value and/or starting to read it at the same time. Once
notified of the read, changes must stop. The read then takes place. Any other
reads in progress that follow the notification must fetch the same or later
value (i.e., if a change has occurred, then their cache copy is invalid, and
they must go to main memory).
The weak consistency model requires global completion
for all synchronization operations (1) and that normal loads and stores must be
globally complete before a synchronization and not be issued until after a
synchronization is complete (2, 3). The latter cases make sure that the regular
accesses don't get out of order with the synchronizations. (As Stone puts it,
"leak into or out of a critical section".)
Implementation
For a write, the write operation must wait for
acknowledgement of receipt of invalidate messages from all of the other
processors sharing the data. This can be done with a wired AND circuit, where
the other processors raise their connection to the acknowledge line when they
have received the message.
A shared read halts execution until all writes are
complete, but this is costly.
Release Consistency
Gharachorloo et al. propose a model using Release and
Acquire for synchronization. Acquire is a synchronizing read, and Release is a
synchronizing write.
1. Reads and writes preceding a Release must complete
with respect to all processors.
2. Acquire must complete before successive reads and
writes begin.
3. Writes to memory from Releases must occur in the
same order that they are issued by their processor, but do not have to be
sequentially consistent with other processors.
Acquire is used to synchronize a locking operation
upon entry into a critical section. That is, if a lock is granted, and an
Acquire then reads the lock, entry into the critical section is delayed until
the lock's value is globally consistent.
If a Release is used to write the unlocking value into
the lock, then the operations inside of the critical section are forced to
complete globally before the section is exited.
In the Release consistency model, Acquire must
complete before everything that follows, and everything that precedes a Release
must complete.
The DSB synchronization is like a Release - Acquire
combination. In the DSB model, everything must complete before the
synchronization, and everything that follows must wait until the
synchronization completes before starting. Thus, DSB requires completion of
normal reads and writes to be verified before entering a critical section.
In Release consistency, the Acquire simply prevents
the accesses in the critical section from starting until it is complete (no
testing of normal operations is required). Thus, shared reads need not be
verified, and shared writes need only be verified at the Release (as in the DSB
model).
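The difference between the two models can be sketched as a pair of ordering
predicates. This is an assumption-laden simplification: it encodes only the
rules as stated in these notes, for two operations issued by one processor,
and it ignores data dependences and accesses to the same address. All names
are hypothetical.

```python
SYNC = "sync"          # DSB synchronization operation
ACQUIRE = "acquire"    # synchronizing read
RELEASE = "release"    # synchronizing write
ORDINARY = "ordinary"  # normal shared load or store


def weak_may_reorder(first, second):
    """DSB weak consistency: nothing may move across a synchronization."""
    return first != SYNC and second != SYNC


def release_may_reorder(first, second):
    """Release consistency: may `second` complete before `first`?

    `first` precedes `second` in program order on one processor.
    """
    if first == ACQUIRE:      # rule 2: an Acquire completes before what follows
        return False
    if second == RELEASE:     # rules 1 and 3: what precedes a Release completes first
        return False
    return True


pairs = [
    (ORDINARY, RELEASE),   # access inside the critical section, then the unlock
    (ACQUIRE, ORDINARY),   # the lock, then an access inside the critical section
    (RELEASE, ORDINARY),   # the unlock, then an access after the section
    (ORDINARY, ACQUIRE),   # an access before the section, then the lock
]
release_result = [release_may_reorder(a, b) for a, b in pairs]
```

The last two pairs are the ones Release consistency frees up: ordinary
accesses may overlap a Release that precedes them or an Acquire that follows
them, whereas a DSB synchronization blocks reordering in both directions.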
Hardware Support for Shared Memory
Cache Flush
(From Jim Handy's Cache Memory Book)
When a DMA write to memory is detected, the entire
cache directory is invalidated. There are three simple ways of achieving this.
1) Use special hardware to write Invalid into the
state bit of every line
2) Use a special cache-tag RAM with hardware reset
3) Reset a main Valid tag for the cache
Cache flush is simple and easy to implement, at least
for a simple write-through cache, but forces a cache refill after every DMA
transfer. This is just a time penalty for a single-tasking OS such as MS-DOS or early
versions of Windows, but for a multitasking OS, which tries to switch to
another task while DMA occurs, the result is an unnecessary increase in context
switch overhead.
Snoopy Protocols
Rather than flush the cache completely, hardware can
be provided to "snoop" on the bus, watching for writes to main memory
locations that are cached.
In a write-back cache, the snooping logic must also
watch for reads that access main memory locations corresponding to dirty
locations in the cache (locations that have been changed by the processor but
not yet written back).
The snooping logic may keep its own copy of the cache
directory, which is used to check against main memory accesses, or the actual
cache directory may be dual-ported.
When dual porting is used, there are several
approaches:
Snooping logic implementation:
1) The snooping logic stops the processor so that it
can't collide with its access to the cache directory (more appropriate to CISC
than RISC designs).
2) The snooping logic stalls cache accesses only.
Action on an outside access:
A) The address is compared and if there is a hit, it
is marked Invalid in the cache
B) The address is compared and if there is a hit, it
is written in parallel to the cache and remains valid
C) The corresponding cache line is marked invalid
without comparing
In (A) the assumption is that there is a reasonable
probability that the processor won't access that line again, and so it should
not be updated until it is actually read. Any cache access that collides must
be stalled for the time it takes to change the valid bit (although it may be
possible to dual-port the valid bits so that it only stalls if accessing the
same line).
In (B) the assumption is the opposite -- that the data
will be read, and thus it should be updated now. Any cache access must be
stalled while the update occurs (it would be costly to dual-port the
data-portion of the cache).
Both (A) and (B) stall cache access during their
comparison stages. Because (C) does no comparison, the cache stalls only during
the invalidate (and maybe not even then). However, the cost is an increased
number of invalid misses.
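The three actions (A), (B), and (C) can be compared with a small model of a
direct-mapped cache. This is a sketch under simplifying assumptions: the full
address is stored as the tag, each line holds a single value, and stall
timing is not modeled. The class and function names are hypothetical.

```python
class SnoopedCache:
    """Direct-mapped cache whose snooping logic reacts to outside writes."""

    def __init__(self, num_lines, policy):
        self.num_lines = num_lines
        self.policy = policy                    # "A", "B", or "C"
        self.tags = [None] * num_lines
        self.data = [None] * num_lines
        self.valid = [False] * num_lines

    def fill(self, addr, value):
        index = addr % self.num_lines
        self.tags[index], self.data[index], self.valid[index] = addr, value, True

    def lookup(self, addr):
        index = addr % self.num_lines
        if self.valid[index] and self.tags[index] == addr:
            return self.data[index]
        return None                             # miss

    def snoop_write(self, addr, value):
        index = addr % self.num_lines
        if self.policy == "C":                  # invalidate without comparing tags
            self.valid[index] = False
            return
        hit = self.valid[index] and self.tags[index] == addr
        if hit and self.policy == "A":          # invalidate on a snoop hit
            self.valid[index] = False
        elif hit and self.policy == "B":        # update in place on a snoop hit
            self.data[index] = value


def after_outside_write(policy, written_addr):
    """Cache address 1, snoop an outside write, then look address 1 up again."""
    cache = SnoopedCache(num_lines=4, policy=policy)
    cache.fill(1, 10)
    cache.snoop_write(written_addr, 99)
    return cache.lookup(1)


same_line = {p: after_outside_write(p, 1) for p in "ABC"}    # write to the cached address
other_line = {p: after_outside_write(p, 5) for p in "ABC"}   # same index, different tag
```

The second case shows the cost of (C): an outside write to address 5, which
merely maps to the same line as address 1, still invalidates the cached copy
of address 1.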
Another approach is to have the DMA go through the
cache, as if the processor were writing the data to memory. This leaves every
cache location valid. However, any processor cache accesses are stalled during that
time, and it clearly does not work well in a multiprocessor, as it would
require copies being written to all caches and a protocol for write-back to
memory that avoids inconsistency.
Duplicate directories can be expensive to implement,
and there is a problem with keeping them consistent when processor and bus
accesses are asynchronous. For a write-through cache, consistency is not a
problem because the cache has to go out to the bus anyway, precluding any other
master from colliding with its access.
But in a write-back cache, care must be taken to stall
processor cache writes that change the directory while other masters have
access to the main memory.
On the other hand, if the system includes a secondary
cache that is inclusive of the primary cache, a copy of the directory already
exists. Thus, the snooping logic can use the secondary cache directory to
compare with the main memory access, without stalling the processor in the main
cache. If a match is found, then the comparison must be passed up to the
primary cache, but the number of such stalls is greatly reduced due to the
filtering action of the secondary cache comparison.
A variation on this approach that is used with
write-back caches is called dirty inclusion, and simply requires that when a primary
cache line first becomes dirty, the secondary line is similarly marked. This
saves writing through the data, and writing status bits on every write cycle,
but still enables the secondary cache to be used by the snooping logic to
monitor the main memory accesses. This is especially important for a read-miss,
which must be passed to the primary cache to be satisfied.
Multiprocessor Coherence
The preceding is applicable to uniprocessors as well
as multiprocessors. Here we focus on issues unique to multiprocessors.
One approach to maintaining coherence is to recognize
that not every location needs to be shared (and in fact most don't), and simply
reserve some space for non-cacheable data such as semaphores, called a
coherency domain.
Using a fixed area of memory, however, is very
restrictive. Restrictions can be reduced by allowing the MMU to tag segments or
pages as non-cacheable. However, that requires the OS, compiler, and programmer
to be involved in specifying data that is to be coherently shared. For example,
it would be necessary to distinguish between the sharing of semaphores and
simple data so that the data can be cached once a processor owns its semaphore,
but the semaphore itself should never be cached.
In general, coherency must be managed in an oblivious
manner because designers don't have the luxury of getting to start over with
the OS and compilers in going to a multiprocessor configuration (i.e. the
mechanisms are add-ons to existing designs). This is especially cumbersome when
the processor has been designed with a logical caching scheme (that is, the
cache is accessed with virtual locations prior to MMU translation into physical
space). The trouble is that coherency problems can usually be detected only in
the physical address domain, because each processor can have its own virtual
mapping of the shared locations.
MIPS R4000
The R4000 requires the use of an inclusive secondary
cache to support efficient coherence protocols. The primary cache is a logical
cache -- that is, it is tagged with virtual locations prior to MMU translation.
The secondary cache is a physical cache -- its address tags correspond to the
physical address output by the MMU.
The primary cache is 8 pages in size, and the MMU is
designed so that there can be up to 8 aliases in primary cache for a given page
offset address. Thus, the secondary cache must have its tag extended with three
extra bits that allow it to determine the actual address in the primary cache
to be invalidated (otherwise it would have to invalidate all 8 of the potential
locations).
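The three extra tag bits follow from the ratio of primary cache size to page
size. The sketch below assumes a direct-mapped primary cache, so that the
cache index spans the whole cache; the byte sizes are hypothetical values
chosen only so that the cache covers 8 pages.

```python
def alias_index_bits(primary_cache_bytes, page_bytes):
    """Number of virtual index bits the secondary-cache tag must record."""
    pages_in_cache = primary_cache_bytes // page_bytes
    # For a power of two n, (n - 1).bit_length() equals log2(n); it is 0 for one page.
    return (pages_in_cache - 1).bit_length()


bits = alias_index_bits(32 * 1024, 4 * 1024)        # cache spans 8 pages
no_alias = alias_index_bits(4 * 1024, 4 * 1024)     # cache fits within one page
```

When the cache is no larger than a page, the index comes entirely from the
page offset, so no aliases are possible and no extra bits are needed.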
The R4000 also has two modes of operation, depending
on whether a secondary cache is present. When the cache is not present, the TLB
supports two coherency attributes:
uncached -- data cannot be cached, only the TLB tag --
accesses go to main memory
noncoherent -- coherency need not be maintained --
accesses are not coherent
If the secondary cache is present, three additional
attributes are supported:
sharable
update
exclusive
Note that in the following, a miss implies invalidation
has occurred, because normally a TLB hit would indicate a cache hit.
For the sharable attribute, a coherent block read
request is issued for a load miss to a location within the page. For a store
miss, a coherent block read request is issued that also requests exclusivity.
If exclusivity is not granted, then processing may be less efficient because
coherence must be maintained on subsequent accesses. Coherency is maintained by
forcing an external protocol to execute for each access, and this may involve a
transfer of ownership.
When the processor writes to a line with the update
attribute, a request is also issued to update copies in other caches and main
memory with the write update protocol. That is, coherency is maintained by
broadcasting writes.
For either a load or store miss to a location within a
page for which a cache line has the exclusive attribute, the processor issues a
coherent block read request that requests exclusivity. The usual situation here
is that exclusivity was given up and now must be regained. Coherency is
maintained by explicitly transferring ownership among the processors.
Write-through vs. Write-back
At first it would seem that the simplest way to
maintain coherence is to use a write-through policy so that every cache can
snoop every write. However, the number of extra writes can easily saturate a
bus. The solution to this problem is to use a write-back policy, but that leads
to additional problems because there can be multiple writes that do not go to
the bus, leading to incoherent data.
One approach is called write-once. In this scheme, the
first write is a write-through to signal invalidation to other caches. After
that, further writes can occur in write-back mode as long as there is no
invalidation. Essentially, the first write takes ownership of the data, and
another write from another processor must first deal with the invalidation and
may then take ownership. Thus, a cache line has four states:
Invalid
Valid unwritten (valid)
Valid written once (reserved)
Valid written multiple (dirty)
The last two states indicate ownership. The trouble
with this scheme arises when a non-owner frequently accesses an owned shared
value. Its accesses can slow to main memory speed or worse, and they generate
excessive bus traffic, because every access must go to the owning cache, and
the owning cache must then broadcast on its next write to signal that the
line is again invalid.
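The four-state scheme can be sketched as two transition functions, one for
local accesses and one for snooped bus activity. The notes do not say how a
write miss is handled, so the treatment of a write to an Invalid line below is
an assumption, as are the function names and the action strings.

```python
INVALID, VALID, RESERVED, DIRTY = "invalid", "valid", "reserved", "dirty"


def on_processor_access(state, op):
    """Return (next_state, bus_action) for a local "read" or "write"."""
    if op == "read":
        if state == INVALID:
            return VALID, "bus read"
        return state, None                      # read hit: no bus traffic
    # op == "write"
    if state == INVALID:
        # Assumption: a write miss fetches the line, then counts as the first write.
        return RESERVED, "bus read + write-through"
    if state == VALID:
        return RESERVED, "write-through"        # first write invalidates other copies
    return DIRTY, None                          # later writes stay in the cache


def on_snoop(state, bus_op):
    """Return (next_state, action) when another master's access is snooped."""
    if state == INVALID:
        return INVALID, None
    if bus_op == "write":
        return INVALID, None                    # another cache has taken ownership
    # bus_op == "read"
    if state == DIRTY:
        return VALID, "supply data and update memory"
    return VALID, None                          # memory is already up to date


state, trace = INVALID, []
for op in ["read", "write", "write", "write"]:
    state, action = on_processor_access(state, op)
    trace.append((state, action))
```

Only the first write in the trace reaches the bus. A snooped read of the
now-dirty line forces the owner to supply the data, which is the source of the
extra traffic described above.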
One solution is to grant ownership to the first
processor to write to the location and not allow reading directly from the
cache. This eliminates the extra read cycles, but then the cache must
write-through all cycles in order to update the copies.
We can change the scheme so that when a write is
broadcast, if any other processor has a snoop hit, it signals this back to the
owner. Then the owner knows it must write through again. However, if no other
processor signals a snoop hit (that is, none holds a copy), it can proceed to write privately. The
processor's cache must then snoop for read accesses from other processors and
respond to these with the current data, and by marking the line as snooped. The
line can return to private status once a write-through results in a no-snoop
response.
One interesting side effect of ownership protocols is
that they can sometimes result in a speedup greater than the number of
processors because the data resides in faster memory. Thus, other processors
gain some speed advantage on misses because instead of fetching from the slower
main memory, they get data from another processor's fast cache. However, it takes
a fairly unusual pattern of access for this to actually be observed in real
system performance.
Directory Based Coherence
The previous schemes have all relied heavily on
broadcast operations, which are easy to implement on a bus. However, buses are
limited in their capacity and thus other structures are required to support
sharing for more than a few processors. These structures may support broadcast,
but even so, broadcast-based protocols are limited.
The problem is that broadcast is an inherently limited
means of communication. It implies a resource that all processors have access
to, which means that either they contend to transmit, or they saturate on
reception, or they have a factor of N hardware for dealing with the N potential
broadcasts.
In a system where memory is a shared resource and
processors have only cache as local memory, then each bank of main memory can
keep a directory of all caches that have copied a particular line (block).
Then, when a processor writes to a location in the block, individual messages
are sent to any other caches that have copies. Thus, network traffic is limited
to only essential updates.
The memory must keep a bit-vector for each line that
has one bit per processor, plus a bit to indicate ownership (in which case there
is only one bit set in the processor vector).
This full-map protocol is extremely expensive in terms
of memory for more than a few processors. It thus defeats the purpose of
leaving a bus-based architecture.
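A minimal sketch of a full-map directory and its storage cost. The class,
method names, and sizes are hypothetical; the sketch ignores the write-back
that a real protocol would require when an owned line is read by another
processor.

```python
class FullMapDirectory:
    """One presence bit per processor per line, plus an ownership bit."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.presence = {}     # line address -> bit-vector stored as an int
        self.owned = {}        # line address -> ownership bit

    def record_read(self, line, proc):
        self.presence[line] = self.presence.get(line, 0) | (1 << proc)
        self.owned[line] = False

    def record_write(self, line, proc):
        """Return the processors that must receive an invalidate message."""
        sharers = self.presence.get(line, 0) & ~(1 << proc)
        targets = [p for p in range(self.num_processors) if (sharers >> p) & 1]
        self.presence[line] = 1 << proc     # the writer is now the only holder
        self.owned[line] = True
        return targets


def directory_overhead(num_processors, line_bytes):
    """Directory bits per line as a fraction of the data bits in the line."""
    return (num_processors + 1) / (line_bytes * 8)


directory = FullMapDirectory(num_processors=8)
for reader in (0, 3, 5):
    directory.record_read(0x40, reader)
targets = directory.record_write(0x40, 3)   # processor 3 writes the shared line

overhead_64 = directory_overhead(64, 32)    # 64 processors, 32-byte lines
```

Only processors 0 and 5 receive messages, which is the traffic saving. The
cost is the directory itself: with 64 processors and 32-byte lines, the
directory adds 65 bits to every 256 bits of data, about 25%.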
A limited-map protocol stores a small number of
processor ID tags with each line in main memory. The assumption here is that
only a few processors share data at one time. If there is a need for more
processors to share the data than there are slots provided in the directory,
then broadcast is used instead.
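A sketch of a single limited-map directory entry with the broadcast fallback.
The overflow flag is an assumption about how an implementation might remember
that the slots were exhausted; the notes do not specify the mechanism. All
names are hypothetical.

```python
class LimitedDirectoryEntry:
    """Directory entry with a fixed number of processor-ID slots."""

    def __init__(self, slots):
        self.slots = slots
        self.sharers = []          # processor IDs, at most `slots` of them
        self.overflow = False      # True once the sharers no longer fit

    def add_sharer(self, proc):
        if self.overflow or proc in self.sharers:
            return
        if len(self.sharers) < self.slots:
            self.sharers.append(proc)
        else:
            self.overflow = True   # the exact sharer set is no longer known

    def invalidate_targets(self, writer, all_processors):
        """Return who must be invalidated when `writer` writes the line."""
        if self.overflow:
            targets = [p for p in all_processors if p != writer]   # broadcast
        else:
            targets = [p for p in self.sharers if p != writer]
        self.sharers, self.overflow = [writer], False
        return targets


few = LimitedDirectoryEntry(slots=2)
for proc in (1, 4):
    few.add_sharer(proc)
few_targets = few.invalidate_targets(writer=4, all_processors=range(8))

many = LimitedDirectoryEntry(slots=2)
for proc in (1, 4, 6):
    many.add_sharer(proc)
many_targets = many.invalidate_targets(writer=4, all_processors=range(8))
```

With two sharers the write costs one message. With three sharers the entry
overflows, and the write must be broadcast to all seven other processors.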
Chained directories have the main memory store a
pointer to a linked list that is itself stored in the caches. Thus, an access
that invalidates other copies goes to memory and then traces a chain of
pointers from cache to cache, invalidating along the chain. The actual write
operation stalls until the chain has been traversed. Obviously this is a slow
process (and a prime candidate for use of microthreading to hide the latency).
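The chain traversal can be sketched as a singly linked list whose head
pointer lives in main memory and whose links live in the caches. Inserting
new sharers at the head is an assumption made for this illustration, as are
the names.

```python
class ChainedLine:
    """Sharer chain for one memory line."""

    def __init__(self):
        self.head = None           # pointer kept in main memory
        self.next = {}             # per-cache pointer to the next sharer

    def add_sharer(self, cache_id):
        if cache_id in self.next:          # already on the chain
            return
        self.next[cache_id] = self.head    # new sharer points at the old head
        self.head = cache_id

    def invalidate_all(self):
        """Walk the chain; return the caches visited, in order."""
        visited = []
        current = self.head
        while current is not None:
            visited.append(current)
            current = self.next.pop(current)   # follow the link, then drop it
        self.head = None
        return visited


line = ChainedLine()
for cache_id in (2, 7, 5):
    line.add_sharer(cache_id)
visited = line.invalidate_all()
```

Each element of the result is one sequential cache-to-cache hop, so the write
stalls for a time proportional to the number of sharers.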