Lecture 15: Shared Memory

Given a bus that supports multiple masters, especially one that does so symmetrically, it is natural to try to increase performance by placing multiple processors on the bus. However, the limited bandwidth of the bus greatly reduces the performance of the system because the processors must share this single path to memory.

To reduce bus traffic to memory, private caches are employed. With a miss ratio of 10%, only one reference in ten reaches the bus, so in principle the bus can support ten times as many processors. In practice, however, this is rarely achieved when data-parallelism is being employed, because the processors are all loosely synchronized and tend to have bursts of misses at the same time, which slows memory access. Caches only really improve memory access performance over the bus when the processors are running in true MIMD mode, with independent programs.
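The factor-of-10 estimate can be checked with a little arithmetic. The sketch below is only an idealized model: the function name, the traffic figures, and the assumption that references are evenly spread (no bursts) are all hypothetical and chosen for illustration.

```python
def max_processors(bus_refs_per_sec, refs_per_proc_per_sec, miss_percent=100):
    """Estimate how many processors the bus can carry before saturating.

    Idealized model (an assumption): references are spread evenly in time
    and only cache misses generate bus traffic.
    """
    bus_refs_per_proc_x100 = refs_per_proc_per_sec * miss_percent
    return (bus_refs_per_sec * 100) // bus_refs_per_proc_x100


# Hypothetical numbers: the bus serves 100M references/s and each
# processor issues 50M references/s.
no_cache = max_processors(100_000_000, 50_000_000)        # every reference uses the bus
with_cache = max_processors(100_000_000, 50_000_000, 10)  # 10% miss ratio
print(no_cache, with_cache)  # 2 20
```

Because the model ignores bursts of simultaneous misses, it gives the best case; loosely synchronized data-parallel code would fall short of it.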

But if data is shared, the presence of multiple caches implies that there can be multiple copies of the data. If different processors can write to the shared variable, then its copies may be inconsistent with each other, and either a means must be found to restore consistency, or invalid results may be computed. This is known as the cache coherency problem.

Cache Coherence and Synchronization

As the book points out, there are three common sources of cache inconsistency:

Writing shared variables.

Process migration

DMA I/O -- present even in a uniprocessor

Prevention

One way to avoid incoherency is to prevent it from happening. If writable, shared values are prohibited from being cached, then there is only one copy of these values and inconsistency is prevented.

This does not prevent one processor from overwriting a value stored by another processor, but this can be handled in the same manner as in a multiprocessing OS executing on a uniprocessor.

Of course, this means that all accesses to these values are at cache-miss speeds. If a large amount of data is involved, then performance decreases significantly. The access latency can be reduced somewhat by permitting values to be cached in critical sections that end with a flush of the values from the cache. All other processors are locked out during the critical section. Of course, the locks cannot be cached under any circumstances, and must be accessed through atomic operations. Furthermore, this means that sharing of writable data is effectively sequentialized.

It is precisely this sequentialization of access that cache coherency mechanisms try to preserve while at the same time providing greater flexibility for caching data.

Sequential Consistency Semantics

The desire for a shared memory model arises from the desire to preserve the sequential model of programming, so it is only natural to also impose sequential semantics on writable shared values.

That is, we want it to always appear that every program executes its instructions in sequential order from the point of view of every other program.

But modern uniprocessors sometimes execute instructions out of order. For example, if there is a read-miss followed by an independent write, then the write may proceed to cache while the read does a main memory fetch. In a uniprocessor, this makes no difference, and improves performance.

In a multiprocessor, however, this reordering can cause trouble. Suppose the read is fetching a shared value and the write is releasing a lock on that value. Depending on the cache mechanism, the write may actually reach main memory first. Depending on the memory access arbitration mechanism, another processor may then be able to write to the value before the read completes.

(It may seem that this could happen in a multitasking environment on the uniprocessor, but there a context switch can easily be forced to follow the completion of outstanding accesses.)

Strong sequential consistency requires that all loads and stores to shared memory be atomic. That is, before another shared access can occur, all preceding shared accesses must have been completed and all copies updated. Atomic load-store operations must complete without any intervening loads or stores to the same value.

Note that this does not require that loads and stores occur immediately to main memory, just that they occur before any other processor attempts a shared access. Thus, if one processor caches shared data, it does not have to write it back until another processor has to access it. Of course, if two processors are actively sharing data, then they are likely to have to continually update the main memory copy.
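The effect of letting accesses complete out of program order can be seen with a small enumeration. The sketch below uses a standard two-processor litmus test (not taken from the lecture): each processor stores to one shared variable and then loads the other. All names are hypothetical. With program order preserved and every access atomic, at least one load must see a 1; once accesses may be reordered, both loads can return 0.

```python
from itertools import permutations

# Each operation: (processor, kind, variable, value or destination register)
P0 = [("P0", "store", "x", 1), ("P0", "load", "y", "r0")]
P1 = [("P1", "store", "y", 1), ("P1", "load", "x", "r1")]


def run(order):
    """Execute the operations atomically, one at a time, on one memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for _, kind, var, arg in order:
        if kind == "store":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r0"], regs["r1"]


def outcomes(keep_program_order):
    """Collect every (r0, r1) result over all global interleavings."""
    results = set()
    for order in permutations(P0 + P1):
        if keep_program_order:
            # Sequential consistency: each processor's operations appear
            # in the global order in the order the program issued them.
            if order.index(P0[0]) > order.index(P0[1]):
                continue
            if order.index(P1[0]) > order.index(P1[1]):
                continue
        results.add(run(order))
    return results


sc = outcomes(keep_program_order=True)
relaxed = outcomes(keep_program_order=False)
print(sorted(sc))       # (0, 0) does not appear
print(sorted(relaxed))  # (0, 0) appears once a load may pass a store
```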

Weak consistency relaxes this constraint, but places more of the burden for maintaining consistency on the software.

For example, the Dubois, Scheurich, Briggs model basically says that:

1. synchronizations must be atomic and consistent with each other across all processors (i.e. they have a consistent global ordering),

2. loads and stores that precede the synchronization point in the individual processors must be completed before they synchronize, and

3. all synchronizations that precede a shared load or store must complete before the operations occur.

Essentially, this requires that the software maintain consistency by forcing synchronizations to occur so that loads and stores update main memory as necessary. It provides the programmer with the ability to abandon sequential semantics when they are not required, but may be more costly for each operation.

The DSB model defines "completion" to occur in several stages (from Stone).

1. A value is written to the write buffer. The write has completed with respect to the processor. If it fetches the value back, it must come from the write buffer.

2. The write buffer is written to the cache. It has now completed with respect to storage. This point in time determines the position that the write should occupy in the global ordering.

3. The cache coherency mechanism sends an update or invalidate message to all other caches with respect to this location. As each other processor receives this message, the write is complete with respect to that processor.

4. When all processors finish updating, the write is globally complete.

For reading

1. When the read reaches the cache and is granted access, no other processor can change that value before the fetch completes.

2. If there is a miss, the fetch passes to main memory and to the other processors. As it reaches each processor, it is complete with respect to that processor, and it cannot be altered.

3. When it has reached all processors, and they are in states where they won't change the value, then the read is complete with respect to all processors.

4. Later, no processor that has issued a read to the same address will see an earlier value, and the read is globally complete.

In other words, during a read miss, other processors may be changing the value and/or starting to read it at the same time. Once notified of the read, changes must stop. The read then takes place. Any other reads in progress that follow the notification must fetch the same or later value (i.e., if a change has occurred, then their cache copy is invalid, and they must go to main memory).

The weak consistency model requires global completion for all synchronization operations (1) and that normal loads and stores must be globally complete before a synchronization and not be issued until after a synchronization is complete (2, 3). The latter cases make sure that the regular accesses don't get out of order with the synchronizations. (As Stone puts it, "leak into or out of a critical section".)
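A minimal sketch of the ordering rule just summarized, for one processor's instruction stream. The function name and the string encoding of operations are assumptions for illustration, and the sketch ignores ordinary data dependences, which the processor must still respect.

```python
def may_reorder(program, i, j):
    """Return True if operations i and j (i < j) of one processor's
    program may be performed out of program order under weak consistency.

    `program` is a list of strings: "load", "store", or "sync".
    Ordinary accesses may be reordered only if no synchronization lies
    at or between them (rules 2 and 3); a synchronization is never
    reordered with anything.
    """
    if not 0 <= i < j < len(program):
        raise ValueError("need 0 <= i < j < len(program)")
    return "sync" not in program[i:j + 1]


program = ["store", "load", "sync", "store", "load"]
print(may_reorder(program, 0, 1))  # True: both precede the sync
print(may_reorder(program, 1, 3))  # False: the sync separates them
print(may_reorder(program, 1, 2))  # False: a load may not pass the sync
```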

Implementation

For a write, the write operation must wait for acknowledgement of receipt of invalidate messages from all of the other processors sharing the data. This can be done with a wired AND circuit, where the other processors raise their connection to the acknowledge line when they have received the message.
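In software terms, the wired AND behaves like the following sketch, in which the write is held until every sharer has acknowledged. The function and variable names are hypothetical.

```python
def write_may_complete(sharers, acked):
    """Wired AND of acknowledgements: true only when every sharer has
    raised its connection to the acknowledge line.

    `sharers` is the set of processor IDs holding a copy; `acked` is the
    set that has acknowledged the invalidate message so far.
    """
    return all(p in acked for p in sharers)


sharers = {1, 2, 3}
acked = {1, 3}
print(write_may_complete(sharers, acked))  # False: processor 2 is still pending
acked.add(2)
print(write_may_complete(sharers, acked))  # True: the write may now proceed
```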

A shared read halts execution until all writes are complete, but this is costly.

Release Consistency

Gharachorloo et al. propose a model using Release and Acquire for synchronization. Acquire is a synchronizing read, and Release is a synchronizing write.

1. Reads and writes preceding a Release must complete with respect to all processors.

2. Acquire must complete before successive reads and writes begin.

3. Writes to memory from Releases must occur in the same order that they are issued by their processor, but do not have to be sequentially consistent with other processors.

Acquire is used to synchronize a locking operation upon entry into a critical section. That is, if a lock is granted, and an Acquire then reads the lock, entry into the critical section is delayed until the lock's value is globally consistent.

If a Release is used to write the unlocking value into the lock, then the operations inside of the critical section are forced to complete globally before the section is exited.

In the Release consistency model, Acquire must complete before everything that follows, and everything that precedes a Release must complete.

The DSB synchronization is like a Release - Acquire combination. In the DSB model, everything must complete before the synchronization, and everything that follows must wait until the synchronization completes before starting. Thus, DSB requires completion of normal reads and writes to be verified before entering a critical section.

In Release consistency, the Acquire simply prevents the accesses in the critical section from starting until it is complete (no testing of normal operations is required). Thus, shared reads need not be verified, and shared writes need only be verified at the Release (as in the DSB model).
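The difference between the two models can be written as a pairwise ordering rule. This is only a sketch of the direct constraints described above, not a full memory model (it does not compute transitive orderings), and the function name and string encoding are assumptions.

```python
def must_order(first, second, model):
    """Must `first` complete before the later operation `second` begins?

    Operations are "load", "store", "sync" (DSB model), or
    "acquire"/"release" (Release consistency).
    """
    if model == "dsb":
        # A synchronization is a two-way barrier.
        return first == "sync" or second == "sync"
    if model == "release":
        # Acquire holds back what follows; Release waits for what precedes.
        return first == "acquire" or second == "release"
    raise ValueError("unknown model: " + model)


# Entering a critical section: DSB waits for earlier stores, Acquire does not.
print(must_order("store", "sync", "dsb"))         # True
print(must_order("store", "acquire", "release"))  # False
# Leaving a critical section: both wait for the stores inside it.
print(must_order("store", "release", "release"))  # True
# Accesses after the Release need not wait for it.
print(must_order("release", "load", "release"))   # False
```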

Hardware Support for Shared Memory

Cache Flush

(From Jim Handy's Cache Memory Book)

When a DMA write to memory is detected, the entire cache directory is invalidated. There are three simple ways of achieving this.

1) Use special hardware to write Invalid into the state bit of every line

2) Use a special cache-tag RAM with hardware reset

3) Reset a main Valid tag for the cache

Cache flush is simple and easy to implement, at least for a simple write-through cache, but forces a cache refill after every DMA transfer. This is just a time penalty for a single-tasking OS like MS-DOS or early versions of Windows, but for a multitasking OS, which tries to switch to another task while DMA occurs, the result is an unnecessary increase in context switch overhead.

Snoopy Protocols

Rather than flush the cache completely, hardware can be provided to "snoop" on the bus, watching for writes to main memory locations that are cached.

In a write-back cache, the snooping logic must also watch for reads that access main memory locations corresponding to dirty locations in the cache (locations that have been changed by the processor but not yet written back).

The snooping logic may keep its own copy of the cache directory, which is used to check against main memory accesses, or the actual cache directory may be dual-ported.

When dual porting is used, there are several approaches:

Snooping logic implementation:

1) The snooping logic stops the processor so that it can't collide with its access to the cache directory (more appropriate to CISC than RISC designs).

2) The snooping logic stalls cache accesses only.

Action on an outside access:

A) The address is compared and if there is a hit, it is marked Invalid in the cache

B) The address is compared and if there is a hit, it is written in parallel to the cache and remains valid

C) The corresponding cache line is marked invalid without comparing

In (A) the assumption is that there is a reasonable probability that the processor won't access that line again, and so it should not be updated until it is actually read. Any cache access that collides must be stalled for the time it takes to change the valid bit (although it may be possible to dual-port the valid bits so that it only stalls if accessing the same line).

In (B) the assumption is the opposite -- that the data will be read, and thus it should be updated now. Any cache access must be stalled while the update occurs (it would be costly to dual-port the data-portion of the cache).

Both (A) and (B) stall cache access during their comparison stages. Because (C) does no comparison, the cache stalls only during the invalidate (and maybe not even then). However, the cost is an increased number of invalid misses.
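The three actions can be compared on a toy direct-mapped cache. This is a sketch under stated assumptions: the four-line cache, the word-sized lines, and all function names are hypothetical, and timing (stalls) is not modeled.

```python
NUM_LINES = 4  # hypothetical tiny direct-mapped cache, one word per line


def make_cache():
    return [{"tag": None, "valid": False, "data": None} for _ in range(NUM_LINES)]


def fill(cache, addr, data):
    """Load `addr` into the cache, as after a processor read miss."""
    cache[addr % NUM_LINES] = {"tag": addr // NUM_LINES, "valid": True, "data": data}


def snoop_write(cache, addr, data, policy):
    """Apply action A, B, or C when another master writes `addr`."""
    line = cache[addr % NUM_LINES]
    if policy == "C":
        line["valid"] = False        # no compare: invalidate whatever is there
        return
    hit = line["valid"] and line["tag"] == addr // NUM_LINES
    if hit and policy == "A":
        line["valid"] = False        # compare, then invalidate
    elif hit and policy == "B":
        line["data"] = data          # compare, then update in place


cache = make_cache()
fill(cache, 5, "old")                # address 5 maps to line 1
snoop_write(cache, 1, "new", "A")    # address 1 also maps to line 1, tag differs
print(cache[1]["valid"])             # True: (A) leaves the unrelated line alone
snoop_write(cache, 1, "new", "C")
print(cache[1]["valid"])             # False: (C) causes an extra invalid miss
```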

Another approach is to have the DMA go through the cache, as if the processor were writing the data to memory. This leaves every affected cache location valid. However, any processor cache accesses are stalled during that time, and it clearly does not work well in a multiprocessor, as it would require copies being written to all caches and a protocol for write-back to memory that avoids inconsistency.

Duplicate directories can be expensive to implement, and there is a problem with keeping them consistent when processor and bus accesses are asynchronous. For a write-through cache, consistency is not a problem because the cache has to go out to the bus anyway, precluding any other master from colliding with its access.

But in a write-back cache, care must be taken to stall processor cache writes that change the directory while other masters have access to the main memory.

On the other hand, if the system includes a secondary cache that is inclusive of the primary cache, a copy of the directory already exists. Thus, the snooping logic can use the secondary cache directory to compare with the main memory access, without stalling the processor in the main cache. If a match is found, then the comparison must be passed up to the primary cache, but the number of such stalls is greatly reduced due to the filtering action of the secondary cache comparison.

A variation on this approach that is used with write-back caches is called dirty inclusion, and simply requires that when a primary cache line first becomes dirty, the secondary line is similarly marked. This saves writing through the data, and writing status bits on every write cycle, but still enables the secondary cache to be used by the snooping logic to monitor the main memory accesses. This is especially important for a read-miss, which must be passed to the primary cache to be satisfied.

Multiprocessor Coherence

The preceding is applicable to uniprocessors as well as multiprocessors. Here we focus on issues unique to multiprocessors.

One approach to maintaining coherence is to recognize that not every location needs to be shared (and in fact most don't), and simply reserve some space for non-cacheable data such as semaphores, called a coherency domain.

Using a fixed area of memory, however, is very restrictive. Restrictions can be reduced by allowing the MMU to tag segments or pages as non-cacheable. However, that requires the OS, compiler, and programmer to be involved in specifying data that is to be coherently shared. For example, it would be necessary to distinguish between the sharing of semaphores and simple data so that the data can be cached once a processor owns its semaphore, but the semaphore itself should never be cached.

In general, coherency must be managed in an oblivious manner because designers don't have the luxury of getting to start over with the OS and compilers in going to a multiprocessor configuration (i.e. the mechanisms are add-ons to existing designs). This is especially cumbersome when the processor has been designed with a logical caching scheme (that is, the cache is accessed with virtual locations prior to MMU translation into physical space). The trouble is that coherency problems can usually be detected only in the physical address domain, because each processor can have its own virtual mapping of the shared locations.

MIPS R4000

The R4000 requires the use of an inclusive secondary cache to support efficient coherence protocols. The primary cache is a logical cache -- that is, it is tagged with virtual locations prior to MMU translation. The secondary cache is a physical cache -- its address tags correspond to the physical address output by the MMU.

The primary cache is 8 pages in size, and the MMU is designed so that there can be up to 8 aliases in primary cache for a given page offset address. Thus, the secondary cache must have its tag extended with three extra bits that allow it to determine the actual address in the primary cache to be invalidated (otherwise it would have to invalidate all 8 of the potential locations).
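The three extra bits follow directly from the cache-to-page size ratio. The sketch below assumes a direct-mapped, virtually indexed primary cache and a hypothetical page size; the function name is also an assumption.

```python
def alias_index_bits(primary_cache_bytes, page_bytes):
    """Return (number of possible aliases, extra tag bits needed).

    Assumes a direct-mapped, virtually indexed primary cache, so the
    index bits above the page offset are the only source of aliases.
    """
    aliases = max(1, primary_cache_bytes // page_bytes)
    return aliases, (aliases - 1).bit_length()


PAGE = 4096  # hypothetical page size in bytes
print(alias_index_bits(8 * PAGE, PAGE))  # (8, 3): 8 aliases need 3 extra bits
```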

The R4000 also has two modes of operation, depending on whether a secondary cache is present. When the cache is not present, the TLB supports two coherency attributes:

uncached -- data cannot be cached, only the TLB tag -- accesses go to main memory

noncoherent -- coherency need not be maintained -- accesses are not coherent

If the secondary cache is present, three additional attributes are supported:

sharable

update

exclusive

Note that in the following, a miss implies invalidation has occurred, because normally a TLB hit would indicate a cache hit.

For the sharable attribute, a coherent block read request is issued for a load miss to a location within the page. For a store miss, a coherent block read request is issued that also requests exclusivity. If exclusivity is not granted, then processing may be less efficient because coherence must be maintained on subsequent accesses. Coherency is maintained by forcing an external protocol to execute for each access, and this may involve a transfer of ownership.

When the processor writes to a line with the update attribute, a request is also issued to update copies in other caches and main memory with the write update protocol. That is, coherency is maintained by broadcasting writes.

For either a load or store miss to a location within a page for which a cache line has the exclusive attribute, the processor issues a coherent block read request that requests exclusivity. The usual situation here is that exclusivity was given up and now must be regained. Coherency is maintained by explicitly transferring ownership among the processors.

Write-through vs. Write-back

At first it would seem that the simplest way to maintain coherence is to use a write-through policy so that every cache can snoop every write. However, the number of extra writes can easily saturate a bus. The solution to this problem is to use a write-back policy, but that leads to additional problems because there can be multiple writes that do not go to the bus, leading to incoherent data.

One approach is called write-once. In this scheme, the first write is a write-through to signal invalidation to other caches. After that, further writes can occur in write-back mode as long as there is no invalidation. Essentially, the first write takes ownership of the data, and another write from another processor must first deal with the invalidation and may then take ownership. Thus, a cache line has four states:

Invalid

Valid unwritten (valid)

Valid written once (reserved)

Valid written multiple (dirty)

The last two states indicate ownership. The trouble with this scheme is that if a non-owner frequently accesses an owned shared value, its accesses can slow down to main memory speed or slower. They also generate excessive bus traffic, because every access must go to the owning cache, and the owning cache must perform a broadcast on its next write to signal that the line is again invalid.
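The four states and their transitions can be sketched as a small state machine. This is a simplified model, not a complete protocol: the function names are hypothetical, and a write miss is modeled here as a fetch followed by a first write (an assumption; real designs often use a single read-with-invalidate bus operation instead).

```python
INVALID, VALID, RESERVED, DIRTY = "invalid", "valid", "reserved", "dirty"


def processor_write(state):
    """Return (new_state, bus_action) for a write by the local processor."""
    if state in (RESERVED, DIRTY):
        return DIRTY, None                      # already owned: write privately
    if state == VALID:
        return RESERVED, "write-through"        # first write invalidates others
    return RESERVED, "fetch + write-through"    # assumed handling of a write miss


def snoop(state, bus_op):
    """Return (new_state, response) when another cache uses the bus."""
    if bus_op == "write":
        return INVALID, None                    # another cache took ownership
    if bus_op == "read":
        if state == DIRTY:
            return VALID, "supply data and write back"
        if state == RESERVED:
            return VALID, None                  # memory is already current
        return state, None
    raise ValueError("unknown bus operation: " + bus_op)


# Two caches, a and b, both start with a valid copy of the line.
a, b = VALID, VALID
a, action = processor_write(a)   # a: reserved, bus sees a write-through
b, _ = snoop(b, "write")         # b: invalid
a, action = processor_write(a)   # a: dirty, no bus traffic
a, response = snoop(a, "read")   # b re-reads: a supplies the data
print(a, b, response)
```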

One solution is to grant ownership to the first processor to write to the location and not allow reading directly from the cache. This eliminates the extra read cycles, but then the cache must write-through all cycles in order to update the copies.

We can change the scheme so that when a write is broadcast, if any other processor has a snoop hit, it signals this back to the owner. Then the owner knows it must write through again. However, if no other processor signals a snoop hit (that is, none has a copy), the owner can proceed to write privately. The owner's cache must then snoop for read accesses from other processors, respond to these with the current data, and mark the line as snooped. The line can return to private status once a write-through results in a no-snoop response.

One interesting side effect of ownership protocols is that they can sometimes result in a speedup greater than the number of processors because the data resides in faster memory. Thus, other processors gain some speed advantage on misses because instead of fetching from the slower main memory, they get data from another processor's fast cache. However, it takes a fairly unusual pattern of access for this to actually be observed in real system performance.

Directory Based Coherence

The previous schemes have all relied heavily on broadcast operations, which are easy to implement on a bus. However, buses are limited in their capacity and thus other structures are required to support sharing for more than a few processors. These structures may support broadcast, but even so, broadcast-based protocols are limited.

The problem is that broadcast is an inherently limited means of communication. It implies a resource that all processors have access to, which means that either they contend to transmit, or they saturate on reception, or they have a factor of N hardware for dealing with the N potential broadcasts.

In a system where memory is a shared resource and processors have only cache as local memory, each bank of main memory can keep a directory of all caches that have copied a particular line (block). Then, when a processor writes to a location in the block, individual messages are sent to any other caches that have copies. Thus, network traffic is limited to only essential updates.

The memory must keep a bit-vector for each line that has one bit per processor, plus a bit to indicate ownership (in which case there is only one bit set in the processor vector).

This full-map protocol is extremely expensive in terms of memory for more than a few processors. It thus defeats the purpose of leaving a bus-based architecture.
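A minimal sketch of a full-map directory entry and of its storage cost. The class, its method names, and the 512-bit line size are assumptions for illustration; the protocol is reduced to returning the list of caches that must be sent a message.

```python
class FullMapDirectory:
    """One presence bit per processor plus an ownership bit, per line."""

    def __init__(self, num_processors):
        self.n = num_processors
        self.entries = {}  # line -> (presence bit-vector, owned flag)

    def sharers(self, line):
        bits, _ = self.entries.get(line, (0, False))
        return [p for p in range(self.n) if (bits >> p) & 1]

    def read(self, line, proc):
        """Record a read; return the owner that must write back, if any."""
        bits, owned = self.entries.get(line, (0, False))
        if owned and bits == 1 << proc:
            return []                            # the owner re-reads its own line
        recall = self.sharers(line) if owned else []
        self.entries[line] = (bits | (1 << proc), False)
        return recall

    def write(self, line, proc):
        """Record a write; return the caches that must be invalidated."""
        invalidate = [p for p in self.sharers(line) if p != proc]
        self.entries[line] = (1 << proc, True)   # exactly one bit set when owned
        return invalidate


def directory_overhead(num_processors, line_bits=512):
    """Directory bits per line divided by data bits per line."""
    return (num_processors + 1) / line_bits


d = FullMapDirectory(4)
d.read(0, 1)
d.read(0, 3)
print(d.write(0, 1))             # [3]: only processor 3 is sent an invalidate
print(d.read(0, 2))              # [1]: the owner is asked to write back
print(directory_overhead(16), directory_overhead(1024))
```

With these hypothetical sizes, the directory for 1024 processors is larger than the data it describes, which is the cost problem noted above.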

A limited-map protocol stores a small number of processor ID tags with each line in main memory. The assumption here is that only a few processors share data at one time. If there is a need for more processors to share the data than there are slots provided in the directory, then broadcast is used instead.
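The overflow behavior can be sketched as follows. The class and method names are hypothetical, and the sketch tracks only who must be notified on a write, not the data itself.

```python
class LimitedDirectory:
    """A fixed number of processor-ID slots per line, with broadcast on overflow."""

    def __init__(self, slots):
        self.slots = slots
        self.entries = {}  # line -> (set of processor IDs, overflow flag)

    def read(self, line, proc):
        ids, overflow = self.entries.get(line, (set(), False))
        if proc not in ids:
            if len(ids) < self.slots:
                ids = ids | {proc}
            else:
                overflow = True       # more sharers than slots to record them
        self.entries[line] = (ids, overflow)

    def write(self, line, proc):
        """Return the caches to invalidate, or "broadcast" after an overflow."""
        ids, overflow = self.entries.get(line, (set(), False))
        targets = "broadcast" if overflow else sorted(ids - {proc})
        self.entries[line] = ({proc}, False)
        return targets


d = LimitedDirectory(slots=2)
d.read(0, 1)
d.read(0, 2)
print(d.write(0, 1))   # [2]: a single point-to-point invalidate
d.read(0, 2)
d.read(0, 3)           # third sharer does not fit in two slots
print(d.write(0, 1))   # broadcast
```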

Chained directories have the main memory store a pointer to a linked list that is itself stored in the caches. Thus, an access that invalidates other copies goes to memory and then traces a chain of pointers from cache to cache, invalidating along the chain. The actual write operation stalls until the chain has been traversed. Obviously this is a slow process (and a prime candidate for use of microthreading to hide the latency).
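The chain traversal can be sketched as a linked list whose head pointer lives in memory and whose links live in the caches. The class and method names are hypothetical, and inserting each new sharer at the head of the chain is an assumption of this sketch.

```python
class ChainedDirectory:
    """Memory holds the head pointer; each cache holds the next pointer."""

    def __init__(self):
        self.head = {}   # line -> processor ID at the head of the chain
        self.next = {}   # (processor, line) -> next processor, or None

    def read(self, line, proc):
        if (proc, line) in self.next:
            return                                # already on the chain
        self.next[(proc, line)] = self.head.get(line)
        self.head[line] = proc                    # new sharer becomes the head

    def write(self, line, proc):
        """Walk the chain, invalidating copies; return (invalidated, hops)."""
        invalidated, hops = [], 0
        current = self.head.get(line)
        while current is not None:                # the write stalls during this walk
            following = self.next.pop((current, line))
            if current != proc:
                invalidated.append(current)
            hops += 1
            current = following
        self.head[line] = proc
        self.next[(proc, line)] = None
        return invalidated, hops


d = ChainedDirectory()
for p in (1, 2, 3):
    d.read(0, p)            # chain is now 3 -> 2 -> 1
print(d.write(0, 2))        # ([3, 1], 3): three sequential hops
```

The hop count grows with the number of sharers, which is why the traversal latency is the main cost of this scheme.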