Virtual Memory

Virtual memory is a mapping from a name space (or virtual address space) to a physical address space. Its impact on the design of the instruction set architecture is felt in two areas, mostly in supervisor state. First, the tables that hold the virtual memory mappings must be accessible. Second, the virtual mapping may affect the size of address fields in the instruction set, especially if page-relative addressing is one of the supported modes.

There are three common approaches to the mapping: paging, segmentation, and paged segmentation. A fourth approach has recently become popular in some designs such as the Alpha and UltraSPARC: multiple page sizes.

Paging

In a paged system, there are fixed-size blocks of memory called pages that are allocated to a process. A memory reference consists of a page number and an offset within the page. The memory management unit (MMU) uses the virtual page number to look up the physical page number. Because the virtual address space can be quite large, the look-up usually involves hashing, and possibly additional search. However, only one piece of an address is needed as the look-up key.
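The split of a reference into page number and offset can be sketched in a few lines. This is a minimal illustration, not any particular MMU: the 4K page size, the dictionary standing in for the hashed look-up, and the names translate and PageFault are all assumptions made for the example.

```python
PAGE_SIZE = 4096          # assumed page size in bytes (a power of two)
OFFSET_BITS = 12          # log2(PAGE_SIZE)

class PageFault(Exception):
    """Raised when the virtual page has no mapping in the table."""

def translate(page_table, vaddr):
    """Map a virtual address to a physical address using only the page number as key."""
    vpn = vaddr >> OFFSET_BITS            # virtual page number: the look-up key
    offset = vaddr & (PAGE_SIZE - 1)      # the offset passes through unchanged
    if vpn not in page_table:             # a dict stands in for the hashed look-up
        raise PageFault(f"no mapping for virtual page {vpn:#x}")
    return (page_table[vpn] << OFFSET_BITS) | offset

# Hypothetical mapping: virtual page 0x2 -> physical page 0x7
page_table = {0x2: 0x7}
print(hex(translate(page_table, 0x2ABC)))   # 0x7abc
```

Only the upper bits of the address take part in the look-up; the low 12 bits are copied straight into the physical address.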

With a paged system, there is one address space and every reference must be checked to ensure that it refers to a page that has been allocated to the current process. The operating system allocates pages from this one address space as needed. The operating system can swap out a subset of the pages that are allocated to a process and bring them back into memory when they are needed again. The downside is that a process can fall into an access pattern in which it page-faults so frequently that it runs at disk access speeds, a condition known as thrashing.

Because the pages are fixed in size, they may be only partially used. This internal fragmentation was a significant problem when machine memories were tiny, but is now negligible. However, it led to the notion of segmentation as an alternative that avoids internal fragmentation.

Segmentation

In a segmented system, the blocks of memory are variable in size. Each process may have one or more segments. The segments may be visible to the process, that is, it may be aware that there are code, data, stack, and heap segments (whereas pages are usually transparent). A segment table is used that specifies the base of the physical addresses associated with a segment and the valid range. The combination of this base and the offset specified by the address means that two words must be used by the MMU to identify the physical address.

One advantage to the operating system is that once a segment has been allocated, there won't be any access faults from the process except those that actually try to access beyond the segment boundaries. Thus, once a process has started, it can run at memory rates until it returns control to the OS. Of course, this leads to the problem that if a segment is large, the time for a context switch can be excessive.

An access outside of the segment is detected as exceeding the valid range, and is trapped. Segmented virtual addressing is otherwise similar to paging. One problem with pure segmentation is external fragmentation (holes in the address map following a series of allocations and deallocations) that can lead to low memory utilization. External fragmentation is also called checkerboarding, and its correction requires a phase of memory compaction that reduces processing efficiency.
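The base-and-range check described in this section might look like the following sketch. The segment names, the Segment record, and the SegmentFault exception are hypothetical; a real segment table would also carry protection bits that are left out here.

```python
from collections import namedtuple

# One segment table entry: physical base address and valid range (length in bytes).
Segment = namedtuple("Segment", ["base", "limit"])

class SegmentFault(Exception):
    """Raised (trapped) when an offset falls outside the segment's valid range."""

def translate_segmented(segment_table, seg, offset):
    """Combine the segment base with the offset after checking the valid range."""
    entry = segment_table[seg]
    if not 0 <= offset < entry.limit:
        raise SegmentFault(f"offset {offset:#x} exceeds segment '{seg}'")
    return entry.base + offset

# Hypothetical layout: variable-size segments at arbitrary physical bases.
segment_table = {
    "code":  Segment(base=0x10000, limit=0x4000),
    "stack": Segment(base=0x80000, limit=0x1000),
}
print(hex(translate_segmented(segment_table, "code", 0x123)))   # 0x10123
```

Note that two values, the segment identifier and the offset, are needed to form the physical address, in contrast to the single page number used as the key in paging.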

Paged Segmentation

Paged segmentation combines the two techniques by segmenting the address space and having fixed size pages within a segment. Thus, a segment can be physically split into pages that are logically contiguous. External fragmentation is reduced because the variability in size is in terms of pages that can be freely rearranged. Swapping at a context switch does not require that an entire segment be stored or loaded. The swap can occur incrementally as pages are replaced or demanded.

The segmentation aspect of the scheme contributes the ability to assign multiple address spaces to a process, which it can then manage explicitly. As virtual address spaces grow large, this has the advantage that we can often avoid having to specify a full virtual address (which can be 64 bits long) in a program and thus we save memory. It also makes it easier to distinguish between access violations and simple page faults, because the segment tables carry the access range information.

The drawback of paged segmentation is that it requires a two-stage translation process in which a segment table is first accessed to retrieve a pointer to a page table for the segment (in the Intel Pentium architecture, a third "Page Directory Table" must also be accessed). Thus, the translation time is increased significantly.
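A rough sketch of the two-stage translation follows. It assumes 4K pages and a per-segment page table held in a dictionary; the field names and exception classes are invented for the illustration, and the Pentium's extra page directory level is not modeled.

```python
PAGE_SIZE = 4096   # assumed page size in bytes

class AccessViolation(Exception):
    """The offset lies outside the segment: a genuine error."""

class PageFault(Exception):
    """The offset is legal but the page is not resident: the OS can load it."""

def translate_paged_segment(segment_table, seg, seg_offset):
    # Stage 1: the segment table gives the valid range and a pointer to a page table.
    entry = segment_table[seg]
    if not 0 <= seg_offset < entry["length"]:
        raise AccessViolation(f"offset {seg_offset:#x} outside segment {seg}")
    # Stage 2: the segment's own page table maps page number to physical page.
    vpn, offset = divmod(seg_offset, PAGE_SIZE)
    frame = entry["page_table"].get(vpn)
    if frame is None:
        raise PageFault(f"segment {seg}, page {vpn} not resident")
    return frame * PAGE_SIZE + offset

# Hypothetical segment 0: three pages long, with page 1 currently swapped out.
segment_table = {0: {"length": 3 * PAGE_SIZE, "page_table": {0: 5, 2: 9}}}
print(hex(translate_paged_segment(segment_table, 0, 0x2010)))   # 0x9010
```

The two exception types show why the range information in the segment table makes access violations easy to tell apart from ordinary page faults.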

Variable Size Pages

Pure paging is attractive in that it requires just one lookup with a single word of address. It is thus the fastest scheme. However, as virtual address spaces have grown large, it has become difficult to deal with so many small chunks of memory. One problem is that the page tables become huge. If we limit the size of the page tables, then we may not be able to access all of the physical memory that we want to have.

In addition, programs tend to have patchy locality in virtual memory. In some cases, small pages are needed to avoid internal fragmentation and to allow flexibility in the use of memory. On the other hand, if we have a large data structure that is being used heavily, it hardly makes sense to break it into many small pages and allow parts of it to be swapped. A more sensible approach would be to assign a larger block of memory and only have to use a single page table entry to translate for it.

Variable size pages are, of course, more difficult to manage than are fixed size pages. They can also reintroduce the problem of external fragmentation, although in a more limited manner than for a segmented system.

Modern 64-bit architectures have addressed this problem by allowing a small number of choices for the page sizes. These "superpages" are typically a power-of-2 multiple of the smallest page size. So we might see 4K, 16K, 64K, and 1 MB pages supported in a system. If we can use a 1 MB page, then a single table entry replaces the 256 entries that would be necessary in a system that supports only 4K-byte pages, a saving of 255 entries.
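The saving in table entries is simple arithmetic, worked here for a 1 MB region under each of the page sizes mentioned. The helper name entries_needed is made up for the example.

```python
KB = 1024
MB = 1024 * KB

def entries_needed(region_bytes, page_bytes):
    """Number of page table entries needed to map a region (ceiling division)."""
    return -(-region_bytes // page_bytes)

region = 1 * MB
for page in (4 * KB, 16 * KB, 64 * KB, 1 * MB):
    print(f"{page // KB:>5}K pages: {entries_needed(region, page):>3} entries")
```

The counts come out as 256, 64, 16, and 1 entries respectively.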

Translation Lookaside Buffer

As noted previously, virtual address translation can require multiple memory references and calculations, resulting in a large time penalty. One way to save time is to cache the most recently used page translations in a small, fully associative cache so that the translation can be retrieved in a cycle. This is called a translation lookaside buffer (TLB). In case the TLB misses, the virtual address will already have started through the translation process, which eventually succeeds and updates the TLB. If the TLB hits, then the translation is cancelled.
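The behavior of a small, fully associative TLB can be modeled as below. This is only a sketch under stated assumptions: the replacement choice (least recently used), the capacity, and the class name are not taken from any specific design, and the slow translation is run after the miss rather than in parallel with the TLB probe as the hardware would do.

```python
from collections import OrderedDict

class TLB:
    """Toy fully associative TLB: any entry can hold any translation."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page -> physical page, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:                 # hit: translation available at once
            self.hits += 1
            self.entries.move_to_end(vpn)       # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        frame = page_table[vpn]                 # miss: full (slow) translation
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        self.entries[vpn] = frame               # update the TLB
        return frame

page_table = {n: n + 100 for n in range(8)}     # hypothetical mappings
tlb = TLB(capacity=2)
for vpn in [0, 1, 0, 2, 1]:
    tlb.lookup(vpn, page_table)
print(tlb.hits, tlb.misses)                     # 1 4
```

With only two entries, the second reference to page 1 misses because page 2 displaced it, which illustrates the tension between TLB size and miss rate.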

It would seem that making the TLB large would be advantageous, so that fewer TLB misses would occur. However, keep in mind that the TLB also has to interact with the cache. At whatever level the virtual address translation takes place, the time to access the TLB must be inserted. Before first level caches became as fast as they are now, it was possible to have a physically mapped primary cache. Even a tiny TLB, however, may now slow access too much. Thus, it is more common for the first level cache to be virtually mapped, with the TLB sitting between the first and second level caches. Even then, it must be very fast, and hence small. Full associativity is used to make up for the small size by allowing full utilization of the small number of entries.

Synonym Problem

With a combination of physical and virtual caches, a problem arises when tasks are sharing information in memory. The shared values have the same physical address but a different virtual address in each task. At the level of the virtual-mapped cache, it is thus possible for a single value to occupy two different lines. If one task updates its copy of the value, then the other copy is invalid.

There are different ways to deal with the synonym problem. The simplest is to make such lines uncacheable at the virtual level and require that accesses go to the physical cache (typically level 2), where the values are unified. Alternatively, the lines can be marked as shared, so that a write is forced out to the physical level and the other copy in the virtual cache is invalidated.
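The problem, and the invalidate-on-write remedy, can be demonstrated with a toy model. Everything here is an assumption made for illustration: the cache is a dictionary keyed by virtual address, the caller supplies the physical address instead of an MMU, and the addresses are arbitrary.

```python
class VirtualCache:
    """Toy write-through cache indexed by virtual address."""

    def __init__(self, memory, invalidate_synonyms):
        self.memory = memory                    # physical address -> value
        self.lines = {}                         # virtual address -> (paddr, value)
        self.invalidate_synonyms = invalidate_synonyms

    def read(self, vaddr, paddr):
        if vaddr not in self.lines:             # miss: fill from the physical level
            self.lines[vaddr] = (paddr, self.memory[paddr])
        return self.lines[vaddr][1]

    def write(self, vaddr, paddr, value):
        self.memory[paddr] = value              # force the write out to the physical level
        self.lines[vaddr] = (paddr, value)
        if self.invalidate_synonyms:
            # Drop every other virtual line that maps to the same physical address.
            stale = [v for v, (p, _) in self.lines.items() if p == paddr and v != vaddr]
            for v in stale:
                del self.lines[v]

def demo(invalidate_synonyms):
    memory = {0x9000: 1}
    cache = VirtualCache(memory, invalidate_synonyms)
    cache.read(0x1000, 0x9000)          # task A caches the shared value
    cache.read(0x5000, 0x9000)          # task B caches it under another virtual address
    cache.write(0x1000, 0x9000, 2)      # task A updates its copy
    return cache.read(0x5000, 0x9000)   # the value task B now sees

print(demo(invalidate_synonyms=False))  # 1 (stale copy)
print(demo(invalidate_synonyms=True))   # 2 (refetched after invalidation)
```

Without invalidation, task B keeps reading its stale line even though the physical level already holds the new value.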

We should note that the synonym problem is rare, but it affects correctness. Thus, it must be dealt with, but it is not critical to make it fast.

Inverted Tables

When the virtual address space becomes large, the size of the page tables can become unwieldy. It actually becomes more economical to store a table of the physical segments and pages in use and their corresponding virtual pages, and search for the matching virtual page, rather than hashing to it and finding the physical page. The search can be aided by an inverted lookup table or an associative memory. With the inverted lookup table, an associative TLB can also be employed.
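A minimal sketch of the inverted organization keeps one entry per physical page and searches it for the virtual page. The process identifier in each entry and the linear search are assumptions for the example; a real design would use the lookup aids mentioned above instead of scanning.

```python
def lookup_inverted(inverted_table, pid, vpn):
    """Search the per-physical-page table for the entry owning this virtual page.

    The index of the matching entry is the physical page number.
    """
    for frame, owner in enumerate(inverted_table):
        if owner == (pid, vpn):
            return frame
    return None   # not resident: the OS must handle a page fault

# Hypothetical machine with four physical pages; None marks a free page.
inverted_table = [None, (1, 0x40), (2, 0x40), (1, 0x7)]
print(lookup_inverted(inverted_table, 2, 0x40))   # 2
print(lookup_inverted(inverted_table, 1, 0x99))   # None
```

The table grows with the size of physical memory, not with the size of the virtual address space, which is the source of the economy.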

Replacement Policies

Given that not all of the allocated virtual pages can fit in physical memory at once, a replacement policy is needed to determine which page is swapped out when space is needed. Note that these are supported in OS software rather than hardware. The reason is that the cost of swapping, due to the slow disk access speeds, is so high that the overhead of a software mechanism is minor. In addition, because the swapping cost is so great, it is desirable to use more complex algorithms to make the selection, which are also more easily implemented in software. There are six policies that are frequently mentioned in the literature:

Least recently used (LRU) -- the most popular

Optimal -- the most desirable: the page that won't be needed for the longest time is replaced (theoretical; it is computed in hindsight to evaluate practical algorithms)

First in first out (FIFO) -- the page that has been in the longest

Last in first out (LIFO) -- the page most recently loaded

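Circular FIFO -- approximates LRU by sweeping a circular queue and skipping pages that have been referenced since the last sweep (often called the clock algorithm); it degenerates to FIFO when every page has been referenced.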

Random replacement -- a page is selected at random (has interesting properties in terms of avoiding degenerate cases and ensuring a consistent response time, but in practice it reduces overall performance)
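Three of the policies above can be compared by counting page faults on a reference string. The sketch below is a simulation, not OS code; the reference string and the three-page memory are arbitrary choices, and the function names are invented for the example.

```python
from collections import OrderedDict, deque

def fifo_faults(refs, frames):
    """Replace the page that has been resident the longest."""
    resident, queue, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.discard(queue.popleft())
        resident.add(page)
        queue.append(page)
    return faults

def lru_faults(refs, frames):
    """Replace the page whose last use is furthest in the past."""
    resident, faults = OrderedDict(), 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)        # record the use
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)      # least recently used is at the front
        resident[page] = True
    return faults

def optimal_faults(refs, frames):
    """Replace the page that won't be needed for the longest time (needs hindsight)."""
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            future = refs[i + 1:]
            def next_use(p):
                return future.index(p) if p in future else len(future)
            resident.discard(max(resident, key=next_use))
        resident.add(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3), lru_faults(refs, 3), optimal_faults(refs, 3))   # 9 10 7
```

On this particular string FIFO happens to beat LRU, a reminder that no practical policy wins on every access pattern; the optimal count serves as the lower bound against which the others are judged.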

Note that all of these assume that the compiler is blissfully unaware of the memory system. However, as we have noted, if the compiler has a large enough context to work with, it may be able to do a better job of managing the memory. For example, it may be able to indicate that, because a phase of execution has just completed, a segment of memory will no longer be needed, even though it would not be replaced by LRU until much later. The compiler may also be able to initiate prefetch of a page or segment far enough in advance to avoid much of the paging penalty.

This all serves to show that architects, compiler writers, and language and OS designers should work more closely together.