There are many different technologies for storing data. In the old days,
cathode ray tubes, acoustical delay lines, magnetic drums, magnetic cores,
punched cards, and open-reel magnetic tape were popular. Today, most of
these technologies have been replaced by semiconductor memory, magnetic
disk, optical disk, and helical-scan magnetic tape cartridges.
Each of the technologies has a place in the market because it covers a range
of access times with inversely proportional cost per unit, which in turn
is proportional to density. CD-ROM filled a gap in access time between
disk and tape that had emerged as disks got faster and tapes did not. The
price of CD-ROM is basically determined by a market analysis that places
it between disk and tape -- it is actually much cheaper to produce than
the retail price would indicate.

There appears to be a wide gap between the main and secondary memory, but
we are used to the fact that this gap exists, and both systems and languages
are structured to work around it. That is, it is part of the programming
model.
The memory hierarchy is viable because programs tend to exhibit a property
known as locality. There are three basic forms of locality:
Temporal locality: Recently accessed items tend to be accessed again in
the near future.
Spatial locality: Accesses are clustered in the address space.
Sequential locality: Instructions tend to be accessed in sequential memory
locations.
Temporal locality tends to hold under all conditions for code (even parallel
or recursive codes) although it may break down in a rule-based system. For
data, it is most effective in array-based applications.
The presence of spatial locality implies some temporal locality (i.e. if
all accesses are within a small block of memory, then it is probable that
many of them are being accessed repeatedly). However, temporal locality
does not imply spatial locality (e.g. a LISP program might repeatedly run
through a short list whose elements are scattered).
Sequential locality holds for instructions and certain array access patterns.
Interestingly, sequential locality may actually increase for parallel programs
because some loops are eliminated.
The locality property provides the opportunity for us to use a small amount
of very fast memory to effectively accelerate the majority of memory accesses.
Because we know that, statistically, only a small amount of the entire memory
space is being accessed at any given time, and values in that subset are
being accessed repeatedly, we can copy those values from slower memory to
the small fast memory. We thus end up with a memory system that can hold
a large amount of information (in a large, low-cost memory) yet provide
nearly the same access speed as would be obtained from having all of the
memory be very fast and expensive.
The small and fast memory is, of course, the cache. Most machines today
use some form of cache, although a few do not. For example, certain supercomputers
eschew caches and provide large, fast, expensive memory. Why would this
be done?
Because locality is entirely program-dependent. Not all programs exhibit
locality. For example, LISP programs have logical locality that does not
correspond to physical address locality. Even programs in traditional imperative
languages that usually exhibit locality may have sections that have only
weak locality. When this happens, performance is reduced to a level corresponding
to the time to access the larger and slower memory. Supercomputers, on the
other hand, are built for ultimate speed and so their designers avoid caches.
One of the implications of the differences in locality that we may encounter
between data and instructions is that there is benefit in having separate
instruction and data caches. The main benefit of separate caches, however,
is that instructions and operands can be fetched simultaneously -- a design
known as a Harvard architecture, after the Harvard Mark series of machines.
When a memory reference finds a value that it is seeking in the cache, the
event is called a hit. If a reference fails to find a sought-after value
in the cache, the event is called a miss. When a miss occurs, most memory
systems automatically go to successively lower levels of the hierarchy to
try to find the desired value.
The hit ratio h is an important measure of the performance of a memory level:
it is the probability that a reference is to a value already in a given
level of the hierarchy. The miss ratio is 1 - h and indicates the fraction
of references to a level of the hierarchy that fail to find the desired
data at that level (and must therefore look at a lower level).
The access frequency is the product of the hit ratio for the given level
with the miss ratios of all higher levels.
The effective access time of the memory system is the sum of the access
frequencies times their corresponding access times. This formula is overly
simplistic, as it assumes there is some probability that you have to go
to off-line storage (such as tape) to get data, for example.
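The two definitions above can be sketched in a few lines of Python. The three-level hierarchy and all of its hit ratios and access times are hypothetical values chosen for illustration, not measurements; the last level is given a hit ratio of 1.0 on the assumption that it holds everything.

```python
def access_frequencies(hit_ratios):
    """f_i = h_i times the miss ratios (1 - h_j) of all higher levels j < i."""
    freqs = []
    miss_so_far = 1.0  # probability that every higher level missed
    for h in hit_ratios:
        freqs.append(miss_so_far * h)
        miss_so_far *= (1.0 - h)
    return freqs


def effective_access_time(hit_ratios, access_times):
    """Sum over all levels of access frequency times access time."""
    freqs = access_frequencies(hit_ratios)
    return sum(f * t for f, t in zip(freqs, access_times))


# Hypothetical hierarchy: cache, main memory, disk (assumed numbers).
hit_ratios = [0.95, 0.99, 1.0]
access_times_ns = [2.0, 60.0, 10_000_000.0]

freqs = access_frequencies(hit_ratios)
t_eff = effective_access_time(hit_ratios, access_times_ns)
print(f"access frequencies: {freqs}")
print(f"effective access time: {t_eff:.2f} ns")
```

With these assumed numbers the result is about 5004.87 ns, almost all of it from the 0.05% of references that reach the disk, which illustrates how strongly the slowest level can dominate.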
Some systems do not automatically seek values at lower levels of the hierarchy.
Instead, the program is responsible for explicitly loading and storing data
in the cache, just as for registers. This sort of cache is said to be explicitly
managed. When would such an organization be useful?
When it is possible to program the use of the cache and it is undesirable
to trust the automatic system to cache the proper data. An example is in
digital signal processing, where all the actions of the processor are known
in advance and variations in access time due to misses might cause the processor
to fail to keep up with the incoming signal. The programmer of a DSP architecture
can explicitly move data and instructions to and from the cache and thereby
account for every instruction cycle. In a system with an automatic cache,
the timing is not as controllable and hence not as predictable.
A cache consists of lines of data, usually containing values from two
or more consecutive addresses of main memory. Each line has associated with
it a tag that stores the high order bits of the address for the data in
the line.

When a memory reference occurs, the address from the processor is presented
to the comparator in the cache. The high order bits are checked against
the tags stored in the cache. If there is a match, then the line of data
is read from the cache and the low order bits of the address select the
appropriate word from the line to be passed to the CPU.
If there is no match among the tags, then a miss occurs and the address
is sent to the main memory to fetch the line containing the desired word.
The line is loaded into the cache and the desired value is simultaneously
passed to the CPU. Hopefully, the next reference will exhibit locality and
be to another word in the line just fetched, so that it is a hit.
At first glance, it might seem that the smart thing to do would be to fetch
a longer line of data to increase the number of subsequent hits. However,
it must be kept in mind that the longer the cache line, the longer it takes
to read from main memory and thus the greater the time penalty for a miss.
If a program is working in a scattered locality, or in a very small locality,
many of those fetches may be wasted and so performance decreases. The best
length for the line is usually determined by software simulations of caches
with different line lengths, running a wide range of benchmark programs.
The manner in which the comparison is implemented has a major effect on
the cache. There are three basic implementation schemes consisting of two
extremes and any design that falls between them. The extremes are fully
associative and direct-mapped (or non-associative). Between these fall various
set-associative implementations.
In a fully associative cache, there is a comparator for each tag, and any
line can contain any block of memory. This allows a value from main memory
to be placed anywhere in the cache. If there is an empty space anywhere
in the cache, it can be filled by a miss.

A fully associative cache has the advantage that it allows complete utilization
of the cache memory. Its disadvantage is that having one comparator per
line is very expensive. In addition, the circuit that identifies the matching
tag among all of the comparisons has a high fan-in degree and thus it is
difficult to make it fast.
In a direct-mapped cache, a portion of the reference address specifies
a block number in addition to the tag field of the address. The block number
specifies a particular line in the cache, and the tag field of the address
is compared with the tag field attached to that line. If they are equal,
then there is a hit. Thus, there is a fixed mapping between addresses in
memory and lines in the cache -- each location in memory maps to just one
location in the cache. Because the cache is much smaller, there are many
memory locations that map to each cache line. Even if the entire cache is
empty except for one line, if the memory reference requests a location that
maps to that same line, the value in the cache has to be displaced to main
memory to make room for the requested value.
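The following is a minimal sketch of the direct-mapped lookup just described. The line size, the number of lines, and the names `split_address` and `DirectMappedCache` are assumptions made for illustration; only tags are modeled, not the data.

```python
LINE_SIZE = 16      # bytes per line (assumed)
NUM_LINES = 64      # lines in the cache (assumed)


def split_address(addr):
    """Split a byte address into (tag, block number, offset within line)."""
    offset = addr % LINE_SIZE
    block = (addr // LINE_SIZE) % NUM_LINES
    tag = addr // (LINE_SIZE * NUM_LINES)
    return tag, block, offset


class DirectMappedCache:
    def __init__(self):
        # One tag per line; None means the line is empty.
        self.tags = [None] * NUM_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, replace the line's contents."""
        tag, block, _ = split_address(addr)
        if self.tags[block] == tag:
            return True
        self.tags[block] = tag   # displace whatever was in this line
        return False


cache = DirectMappedCache()
a = 0x0040                        # maps to line 4 with tag 0
b = a + LINE_SIZE * NUM_LINES     # maps to the same line with tag 1
conflict_results = [cache.access(x) for x in (a, b, a, b)]
print(conflict_results)           # every access misses

cache.access(a)
same_line_hit = cache.access(a + 4)   # another word in the line just fetched
print(same_line_hit)
```

The two addresses `a` and `b` displace each other on every access even though 63 of the 64 lines stay empty, which is the situation the paragraph above describes.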

One advantage of direct mapping is that it requires just a single comparator,
so it is very inexpensive to build. Direct mapping is also very fast, but
it suffers from the problem that two main memory lines that map to the same
line in cache collide and can thrash in and out, even when there
is plenty of space left in the cache to hold them both. With increased cache
size, the probability of this occurring is reduced and thus some modern
machines use direct-mapped caches.

In a K-way set associative memory, the cache is divided into sets of K lines.
There are S = L/K sets, where L is the number of lines in the cache. The
address contains a set field (a truncated version of the block number) that
selects a set. The K tags for the set are read out and compared to the tag
for the address. If there is a match, then the corresponding line is selected
and the data is fetched from it.

Up to K collisions between memory lines with the same set number can occur
before a value in the cache must be returned to memory. It has been shown
that for reasonably small K (e.g. 2 or 4), the vast majority of conflicts
can be handled without thrashing (repeated cache line replacements).

A set associative memory requires only K comparators, rather than the L
comparators of the fully associative memory. When S = 1, then the cache
is fully associative (K = L). When K = 1, the cache is direct mapped. Thus,
set associativity can provide an economical midpoint between these two extremes.
In experimental analyses of the level of associativity necessary in a cache,
it has been found that there is only about a 10% difference in utilization
and performance between a 4-way set associative cache and a fully associative
cache. Thus, the majority of the benefit is obtained with just four comparators,
and the returns diminish quickly if more comparators are added. There is,
however, a significant jump between 2-way and 4-way associativity, and
this can be partly attributed to the need to support the two operands and
the result of a dyadic array operation. With just two ways, it is possible
that the three arrays involved do not fit into the cache because of line
conflicts, whereas with more than two ways the arrays can all be mapped
to the cache at once.
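A small simulation can illustrate the dyadic-array argument. This is a sketch under stated assumptions: the cache geometry, the array placement (chosen deliberately so that all three arrays map to the same sets), the LRU replacement within a set, and the names `SetAssociativeCache` and `dyadic_hit_ratio` are all hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """K-way set-associative cache of L lines; LRU replacement is assumed."""

    def __init__(self, num_lines, ways, line_size=16):
        assert num_lines % ways == 0
        self.ways = ways
        self.line_size = line_size
        self.num_sets = num_lines // ways          # S = L / K
        # Each set maps tag -> None, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (filling or replacing a line)."""
        line_addr = addr // self.line_size
        index = line_addr % self.num_sets          # set field of the address
        tag = line_addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)                 # now most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)              # evict least recently used
        lines[tag] = None
        return False


def dyadic_hit_ratio(ways, n=1024, num_lines=64, line_size=16, word=4):
    """Simulate c[i] = a[i] + b[i] for arrays whose lines share set numbers."""
    cache = SetAssociativeCache(num_lines, ways, line_size)
    stride = num_lines * line_size * 4   # spacing that makes a, b, c collide
    bases = (0, stride, 2 * stride)      # base addresses of a, b and c
    hits = total = 0
    for i in range(n):
        for base in bases:
            hits += cache.access(base + i * word)
            total += 1
    return hits / total


for k in (1, 2, 4):
    print(f"{k}-way hit ratio: {dyadic_hit_ratio(k):.2f}")
```

Under these assumptions the 1-way and 2-way caches never hit, because each access evicts a line that is needed again two accesses later, while the 4-way cache holds all three current lines and hits on three of every four element accesses. Setting `ways` equal to `num_lines` gives the fully associative case, and `ways=1` the direct-mapped case.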
In addition to the tag and data, each cache line contains a valid bit that
indicates whether it contains valid data or not. Initially all of the valid
bits indicate that the cache is empty. Eventually the cache is filled and
the valid bits become true. However, there is a condition that can cause
a valid bit to once again become false. Can anybody think of what that might
be?
When a DMA I/O operation changes the value of the corresponding location
in main memory, the value in the cache is no longer valid.
Thus, the cache includes logic that watches every memory write that originates
from someplace other than the CPU, and if an address is seen that is in
the cache, the cache line is marked as invalid. A subsequent read from that
location will miss in the cache and go to main memory to retrieve the recently
input value.
In a direct-mapped cache, if a new value is being read that conflicts
with a value already in the cache, then there is only one possible action:
the conflicting value must be returned to main memory to make room for the
new value. This process is called replacement.
When a cache is associative, there are K possible cache locations where
the new value can be stored. If all of them are full, then one must be replaced.
The problem is, how do we determine which one should be replaced? The algorithm
for determining replacement is called the replacement policy.
What do you think would be a good policy?
Given that we would like to keep values that we will need again soon, we
want to get rid of the one that won't be needed for the longest time. Since
we can't look into the future, we have to make a best guess.
There are several popular replacement policies that try to approximate this
optimal algorithm. For example, we can consider temporal locality and guess
that any value that has not been used recently is unlikely to be needed
again soon. Thus, we can keep track of the last time each value was accessed,
and whichever was least recently used (LRU) is the one we will replace.
Unfortunately, LRU requires us to keep a history of access for every cache
line, which takes up a lot of space and also slows down the operation of
the cache.
Another approach is to approximate LRU with a first-in, first-out (FIFO) policy. We can hypothesize that the value
fetched first is now the oldest reference and thus that, if it hasn't been
accessed, it is the LRU value. Obviously, if spatial locality is good, this
is a poor approximation.
Yet another tack is to take a line at random. The problem with LRU and FIFO
is that there are degenerate referencing situations in which they can be
made to thrash. Some researchers have argued that random replacement, while
it sometimes throws out data that will be needed soon, never thrashes. Unfortunately,
it is difficult to have truly random replacement, and it does decrease average
performance in order to improve rare worst cases.
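The degenerate case mentioned above can be reproduced with a short sketch. It models a single set of `ways` lines and a reference pattern that cycles over one more line than the set can hold; the function name `count_hits`, the trace, and the use of a seeded pseudo-random generator in place of true randomness are assumptions for illustration.

```python
import random


def count_hits(policy, ways, trace, seed=0):
    """Count hits in one set of `ways` lines under "lru", "fifo" or "random"."""
    rng = random.Random(seed)
    lines = []   # resident lines; index 0 is the next victim for LRU and FIFO
    hits = 0
    for line in trace:
        if line in lines:
            hits += 1
            if policy == "lru":
                lines.remove(line)
                lines.append(line)   # move to the most recently used position
            continue
        if len(lines) == ways:
            victim = rng.randrange(ways) if policy == "random" else 0
            lines.pop(victim)
        lines.append(line)
    return hits


# Cycle repeatedly over five lines in a set that holds only four.
trace = [0, 1, 2, 3, 4] * 200
for policy in ("lru", "fifo", "random"):
    print(policy, count_hits(policy, 4, trace))
```

On this trace LRU and FIFO always evict exactly the line that is needed next, so they never hit, whereas random replacement keeps some useful lines by chance. On a trace that fits in the set, all three policies behave identically.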
Thus far we have mostly discussed what happens when a value is read.
When a value is written to the cache by the CPU, there are two things we
can do: we can write it to the cache and simultaneously write it in through
to the main memory so that the master copy is kept up to date. Or, we can
write it to the cache and not write it back to the main memory until it
is replaced. The former approach is called write through while the latter
is called write back.
Write through has the disadvantage that it can slow the cache down to the
speed of the main memory for a string of writes. Write back only slows the
cache for a replacement, which is already a main-memory speed operation.
However, write back requires keeping a bit to remember that a value has
been written (called the dirty bit), and it also presents a problem for
DMA I/O where an I/O device gets output data directly from memory. Because
the memory contains stale information, the I/O device could output garbage.
A write back cache must also detect when I/O is taking place (by watching
the address lines into the main memory for matches to its tags) and then
save the data before the I/O device reads it, or alternatively it can simply
respond in place of memory to the I/O device's read request.
Write back is thus more complex and costly to implement, and the constant
checking of DMA addresses can slow the cache down by tying up the comparators.
Many modern systems thus provide both write back and write through modes
of operation. It is also noteworthy that instruction caches do not have
to take this into consideration because they cannot be written to.
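The difference in main-memory traffic between the two policies can be sketched by counting writes. The model below is a simplification with several assumptions: the cache is direct-mapped on line number, a write miss allocates the line without modeling the fill, and any remaining dirty lines are flushed at the end so that both policies leave memory up to date. The name `memory_writes` is hypothetical.

```python
def memory_writes(policy, writes, num_lines=4):
    """Count main-memory writes for a sequence of written line numbers.

    policy is "through" or "back".
    """
    tags = [None] * num_lines
    dirty = [False] * num_lines
    mem_writes = 0
    for line in writes:
        index = line % num_lines
        if tags[index] != line:
            # Replacement: a dirty line must be written back before it is lost.
            if policy == "back" and dirty[index]:
                mem_writes += 1
            tags[index] = line
            dirty[index] = False
        if policy == "through":
            mem_writes += 1          # every write also goes to main memory
        else:
            dirty[index] = True      # the cached copy is now newer than memory
    mem_writes += sum(dirty)         # final flush of dirty lines
    return mem_writes


# A string of 100 writes to line 0, then 100 writes to line 4 (same cache line).
writes = [0] * 100 + [4] * 100
print("write through:", memory_writes("through", writes))
print("write back:   ", memory_writes("back", writes))
```

For this string of writes, write through performs one memory write per store (200 in total), while write back touches memory only at the replacement and at the final flush (2 in total).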
We also need to consider what happens in the case where the CPU writes to
an invalid location in the cache. This can either be because the CPU is
generating a write address for a location that wasn't previously read, or
because it is writing to a location that was invalidated by DMA input.
When the CPU writes to a new location, the effect is similar to reading
from a new place. The data is written to a line of the cache and the line
is marked valid. However, there is a problem here -- what is it?
The CPU write hasn't filled the entire line.
So if we try to read from a location in that line other than the one just
written we may get garbage. There are two solutions to this problem. One
is to declare a write miss, and read the rest of the line from main memory,
filling it with valid data. The other is to gamble that the processor will
write more values into this line before reading it, and instead mark the
individual words in the line as valid or invalid. If a subsequent read attempts
to access an invalid word in the line, we have a miss and read in the invalid
part(s) of the line. However, if the processor eventually fills the line
with writes, we avoid taking the miss.
We also have to handle conflicts for writes. Consider that the CPU may write
to a direct-mapped cache line that is full, and must therefore be replaced.
The same situation can arise in a set-associative cache if all of the ways
for that set are full. The mechanism that handles replacement on reads also
handles replacement on writes -- essentially it watches for anything being
stored into a valid line in the cache and reacts accordingly regardless
of the source of the store.
It can be observed in many programs that writes occur in clusters rather
than being evenly distributed in time. This presents a problem for the cache
when it is operating in write-through mode or if the writes cause a series
of replacement operations. In either case, the access involves main memory
(or next level of cache), which may be a factor of 10 slower. Thus, the
processor will have to wait while the data is being written before it can
proceed. Many processors address this problem with a small set of registers
called a write buffer. The data and its destination address are written
to the write buffer where they wait for memory to become available so the
data can be actually stored. In the meantime, the processor can continue
on just as if the data had actually been written. What is the one remaining
problem that needs to be addressed in order for this to work?
What happens if the processor reads from that location before the data is
actually written to it?
The write buffer must include logic that checks whether a read address matches
any of the addresses waiting in the write buffer. If there is a match, then
the data is read back from the write buffer instead of from the cache or
from memory. The write buffer also helps to address the problem of the cache
being busy (such as completing a line-fill operation) when a write is issued.
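A write buffer with the read check described above might be sketched as follows. The class name `WriteBuffer`, its capacity, and the use of a dictionary to stand in for the cache and memory are assumptions for illustration.

```python
from collections import deque


class WriteBuffer:
    """FIFO of pending (address, value) writes with read forwarding."""

    def __init__(self, memory, capacity=4):
        self.memory = memory          # dict standing in for cache/main memory
        self.capacity = capacity
        self.pending = deque()

    def write(self, addr, value):
        if len(self.pending) == self.capacity:
            self.drain_one()          # buffer full: the processor must wait
        self.pending.append((addr, value))

    def drain_one(self):
        """Retire the oldest pending write to memory."""
        addr, value = self.pending.popleft()
        self.memory[addr] = value

    def read(self, addr):
        # Check pending writes newest-first so the latest value wins.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)


memory = {0x100: 1}
buf = WriteBuffer(memory)
buf.write(0x100, 42)
stale_in_memory = memory[0x100]     # memory has not been updated yet
forwarded = buf.read(0x100)         # the read is satisfied from the buffer
buf.drain_one()
print(stale_in_memory, forwarded, memory[0x100])
```

Searching the pending writes newest-first matters when the same address has been written more than once before the buffer drains.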
We have already noted that instruction fetches and data fetches have
different patterns of access and locality. In particular, instructions are
not written and so they do not require write buffers or other write-related
logic. These differences encourage the use of caches that are split between
instructions and data. In addition, having two separate caches provides
the potential to fetch from both at once (or fetch from one while writing
to the other), thereby doubling the rate at which values are transferred
between memory and the CPU.
On the other hand, when caches are split, they are usually each half of
the size of a single unified cache. Different programs may have larger or
smaller working sets in data versus instructions. That is, one program might
process a very small amount of data with a complex algorithm while another
might process a large amount of data with a simple algorithm. In a unified
cache there is added flexibility to accommodate these variations. Thus, the
hit ratio of a unified cache is generally larger than that of an equivalent
split cache.
Furthermore, if a cache is external to the processor chip, there may not
be enough pins to carry the values from two caches at once into the CPU
chip. Thus, it may not be possible to take advantage of the potential bandwidth.
The trend in technology, however, has been toward fitting larger and larger
caches onto the processor chip. As the sizes of the split caches have grown,
the difference in hit ratio between split and unified designs has decreased
to around 1%. Because an on-chip cache is not restricted by the number of
pins at the edge of the chip, the doubled bandwidth can be fully exploited.
Hence a period in microprocessor design in which unified caches dominated
has given way to nearly universal use of split cache designs for on-chip
cache.
However, for many processors there is still another level of cache external
to the chip, and even though space is unlimited at this level, unified designs
are chosen because of the pin limitations on bandwidth.
As technology enables us to place more transistors on a chip, it would
seem logical to continue to increase the size of the cache. A larger cache
has a higher hit ratio, and so we should be able to operate at cache access
speeds for a greater percentage of the execution time. However, as we saw
earlier in the course, as a circuit grows larger it often grows slower.
If we continue to increase the size of the cache, the time required to read
data from it increases and eventually we have to start slowing the clock
cycle of the processor to compensate (or else the processor waits frequently).
The solution is to employ a hierarchy of caches on the chip. A small, fast
cache is used that can keep up with the clock rate of the processor, and
a larger secondary cache is made as large as we can fit on the chip. The
result is that we get a lower hit ratio for the first level cache (especially
since it is probably split) but we also get relatively fast access to the
next level (perhaps just two cycles instead of the usual ten or more). And
at that second level we see a hit ratio that is very close to 100% because
of its size. Accesses to an external cache are then very infrequent and
those to main memory are primarily due to unusual events such as context
switches or I/O.
Some researchers hypothesize that eventually processor chips will be almost
entirely memory, and that external RAM will be treated much like we currently
treat secondary storage devices -- it will be used for file storage and
will be explicitly read and written in large blocks with programmed I/O.
A more radical view says that all memory will eventually contain processors
and that operations on data stored in external memory will send code to
accomplish the processing remotely rather than fetching the data. However,
this view neglects processing in which large amounts of separate data must
be combined, essentially forcing the data to be moved to a common processor
where it can be acted upon.
| | PPC 601 | PPC 603 | PPC 604 | PPC 620 | SPARC | R10000 | R4400 | Pentium | P-Pro |
| Blocking | 1 | 0 | 4 | ||||||
| Split? | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Data | 32K | 8K | 16K | 32K | 16K | 32K | 16K | 8K | 8K |
| Instruct. | Unified | 8K | 16K | 32K | 16K | 32K | 16K | 8K | 8K |
| Associativity | 8 | 2 | 4 | D: 1/I: 2 | 2 | 1 | 2 | D: 4/ I: 2 | |
| Line size | 64 B | 32 B | 32 B | 16 or 32 B | 32 B | ||||
| Data Write | Back (through) | Back (through) | Back (through) | Back (through) | Back (through) | Back (through) | |||
| Replacement | LRU | ||||||||
| Notes: | I-cache predecodes instr. | I-cache predecodes instr. | 256KB 4-way L2 cache |
Many people think of virtual memory as a way of making a disk appear
to be an extension of main memory. While this is a major benefit of virtual
memory, it is not a definition. Virtual memory can be defined as
A mapping from a virtual address space to a physical address space.
OR
A mapping from a name space to an address space.
In either case, the key term is mapping. The processor generates an address
that is sent through a mapping function to produce the actual bit pattern
presented to the address lines of the memory. The insertion of the mapping
function between the processor and the memory adds flexibility, because
it allows us to store data in different places in the memory yet have it
appear in a different logical arrangement from the perspective of the program.
One of the advantages of this flexibility is that programs can be compiled
for a standard address space, and can then be loaded into the available
memory and run without modification. For example, in a multitasking environment
one program can be arranged to fit into memory along with an arbitrary combination
of other programs.
Another advantage is that the mapping function can signal (through an exception)
when locations are mapped to locations outside of main memory, such as on
disk storage. If such an exception occurs, commonly called a page fault,
then the operating system can load the data in from disk and adjust the
mapping function so that accesses to the data refer to wherever it wound
up in main memory.
In this latter sense, virtual memory provides a way to make main memory
appear larger than it is. It saves the programmer from having to explicitly
move portions of the program or data in and out from disk. It is thus most
useful for running large programs on small machines.
Unfortunately, the performance of virtual memory is best when a small application
runs on a large machine. The reason is that a large program on a small machine has
a high miss rate at the main memory level of the hierarchy, which forces heavy reliance
on the disk drive (which is 5 orders of magnitude slower than main memory). It was noted earlier that
programming models deal with this wide performance gap by making access
to the secondary memory level explicit. However, virtual memory is a mechanism
that attempts to hide the difference between these layers through an abstraction.
The inexperienced programmer can easily be led by this abstraction to
write programs that give extremely poor performance. Experienced programmers
try to arrange their code and data to avoid excessive reliance on virtual
memory.
Fortunately, in a multitasking situation, the time required to access the
disk can often be hidden by switching to another task. Because main memories
are now quite large, it is less frequent for a single program to need virtual
memory (large data sets being an exception). However, with a multitasking
OS, multiple programs can use up the memory and also keep the processor
busy.
There are basically three approaches to implementing virtual memory: Paging, segmentation, and a combination of the two called paged segmentation. We'll look at each of these approaches in turn.
Memory is divided into fixed-size blocks called pages. Main memory contains
some number of pages (memory size / page size) which is smaller than the
number of pages in the virtual address space (virtual address range / page
size).
For example, if the page size is 4K and the physical memory is 16M (4K pages)
and the virtual memory is 4G (1 M pages) then there is a factor of 256 to
1 mapping.
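The arithmetic in this example can be checked directly. The variable names are illustrative only, and sizes are counted in addressable units.

```python
KB, MB, GB = 2**10, 2**20, 2**30

page_size = 4 * KB
physical_memory = 16 * MB
virtual_space = 4 * GB

physical_pages = physical_memory // page_size   # 4K pages
virtual_pages = virtual_space // page_size      # 1M pages
ratio = virtual_pages // physical_pages         # virtual pages per physical page
offset_bits = page_size.bit_length() - 1        # bits needed to address a page

print(physical_pages, virtual_pages, ratio, offset_bits)  # 4096 1048576 256 12
```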
In order to perform the mapping function, a table lookup is employed. A
page table is kept with one entry per virtual page. (In reality, a 1M entry
table would not be used -- parts of it would also be mapped in virtual memory,
and in many cases those pages would never be stored anywhere because they
would never be accessed.)
A page table entry might contain a presence bit, a physical page address, and a secondary storage address.

The Presence Bit indicates whether the physical page is in main memory or
must be fetched from secondary storage (a page fault). The secondary storage
address is used to locate the data on disk. Physical page address is substituted
for the virtual page address as a result of the lookup.
A virtual address for a page of 4K words consists of a virtual page number
in the high order bits and a page offset in the low order 12 bits. The page
offset is the location of the desired word within a page.
The virtual address translation hardware operates as follows:

The virtual address from the CPU is split into the offset and page number.
The page number becomes the index into the table (requiring a shift and
an add of the base address of the table to get a proper address). The table
entry is fetched, starting with the presence bit and the physical page.
When the presence bit indicates that the page is in main memory, it simply
substitutes the physical page from the table for the virtual page portion
of the address. When the presence bit indicates that the page is not in
main memory, it triggers a page fault exception, and the OS must initiate
the disk transfer that brings the page into memory.
While a disk transfer is in progress, the OS may cause another task to execute
or it may simply stall the current task and sit idle. In either case, once
the transfer is complete, the OS stores the new physical page number into
the page table and jumps back to the instruction causing the page fault
so that the access can be reissued.
Thus, any virtual page may be stored at any physical page location and the
addressing hardware performs the translation automatically.
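The translation and page fault sequence described above can be sketched in software. This is an illustration rather than a description of any particular hardware: the names `PageTableEntry`, `translate`, and `access` are hypothetical, the page table is a dictionary indexed by virtual page number, and the disk transfer is replaced by simply taking a frame from a list of free physical pages (replacement is not modeled).

```python
PAGE_BITS = 12                      # 4K-word pages, as in the text
PAGE_SIZE = 1 << PAGE_BITS


class PageFault(Exception):
    """Raised when the presence bit says the page is not in main memory."""


class PageTableEntry:
    def __init__(self, present=False, physical_page=None, disk_address=None):
        self.present = present
        self.physical_page = physical_page
        self.disk_address = disk_address


def translate(page_table, vaddr):
    """Map a virtual address to a physical address via the page table."""
    vpage = vaddr >> PAGE_BITS               # high order bits: page number
    offset = vaddr & (PAGE_SIZE - 1)         # low order 12 bits: page offset
    entry = page_table[vpage]
    if not entry.present:
        raise PageFault(vpage)
    return (entry.physical_page << PAGE_BITS) | offset


def access(page_table, vaddr, free_frames):
    """Translate, servicing a page fault on demand and reissuing the access."""
    try:
        return translate(page_table, vaddr)
    except PageFault as fault:
        entry = page_table[fault.args[0]]
        entry.physical_page = free_frames.pop()   # stand-in for the disk transfer
        entry.present = True
        return translate(page_table, vaddr)


page_table = {
    5: PageTableEntry(present=True, physical_page=2),
    7: PageTableEntry(disk_address=1234),    # on disk, not yet in memory
}
print(hex(translate(page_table, (5 << PAGE_BITS) | 0x0AB)))   # 0x20ab
print(hex(access(page_table, (7 << PAGE_BITS) | 0x010, [9]))) # 0x9010
```

The second access faults, is given physical page 9, and is then reissued, mirroring the jump back to the faulting instruction.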
A page table can also contain privilege information consisting of an owner
process ID and level of access. On any access, the current process ID in
the program status word is checked against the owner process ID portion
of the page table entry and if there is a match and the type of access (read
vs. write) is allowed, then the access proceeds. Otherwise an access privilege
violation exception is signalled.
Pages are usually loaded in this manner, which is called "on demand".
However, schemes have also been devised in which the historical behavior
of a task is recorded, and when the task is suspended, its working set of
pages is reloaded before it restarts.
Once main memory pages are all allocated to virtual pages, then any accesses
to additional virtual pages force pages to be replaced (or "swapped").
It should be obvious that, just as with a cache, replacement necessitates
a replacement policy. Typical policies for page replacement are the same
as for a cache: LRU, FIFO and random. However, because page replacement
is a costly operation, it is more often the case that the more costly LRU
policy is employed.
Just as in the cache, the page table entry can contain a dirty bit that
indicates whether the contents of a page have been changed. If no change
has occurred, then we can skip the writing back of the page before we replace
it. In fact, because of the savings of not writing a page back, the LRU
policy is sometimes altered to select a page that is unchanged and not quite
the least recently used over an altered page that is least recently used.
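One way to express this altered policy is sketched below. The function name `choose_victim` and the `window` parameter (how far from the true LRU page the search for a clean page may go) are hypothetical; real systems differ in how they make this trade-off.

```python
def choose_victim(pages, window=2):
    """Pick a page to replace.

    `pages` is ordered from least to most recently used; each item is a
    (page_number, dirty) pair. Among the `window` least recently used pages,
    prefer the first clean one; otherwise fall back to the true LRU page.
    """
    for page, dirty in pages[:window]:
        if not dirty:
            return page
    return pages[0][0]


# Page 3 is least recently used but dirty; page 8 is nearly as old and clean.
print(choose_victim([(3, True), (8, False), (1, False)]))   # 8
# No clean page within the window, so the true LRU page is chosen.
print(choose_victim([(3, True), (8, True), (1, False)]))    # 3
```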
As we noted before, LRU has a degenerate pattern of accesses that can cause
pages to be thrown out just before they are needed again. When pages are
repeatedly replaced due to such a pattern, or simply because the program's
working set is so large that it forces continual replacement with few accesses
to each page, then it is said to be thrashing. Given the difference in access
speed between disk and RAM, a thrashing program can exhibit a factor of
10,000 loss in performance. To put this in perspective, a program that takes
1 minute to execute in RAM would take almost a week to execute if it thrashes
in virtual memory.
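The "almost a week" figure follows directly from the factor quoted above, as this short check shows (the variable names are illustrative):

```python
slowdown = 10_000                 # loss factor quoted in the text
runtime_minutes = 1               # time to execute entirely in RAM

thrashing_minutes = runtime_minutes * slowdown
thrashing_days = thrashing_minutes / (60 * 24)
print(f"{thrashing_days:.2f} days")   # 6.94 days
```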
One inefficiency of paging is that the fixed size of a page means that some
space is wasted in partially filled pages. Although not as much of a concern
today, this "internal fragmentation" was considered to be a serious
problem when machines had smaller main memories. Another problem with paged
virtual memory is that it is a "flat" virtual address (or name)
space. There is just one virtual space shared by all processes, and the
processes have to be statically relocated when loaded.
Segmentation addresses this problem. With segmentation, the operating
system can create multiple virtual address spaces, each
starting at an arbitrary location and with arbitrary length. The start of
a segment is virtual address 0. Each process can be assigned a different
segment, and so it is completely unaware that other processes are sharing
the memory with it. Relocation is effectively done dynamically at run time
by the virtual memory mapping mechanism.
Segmentation is implemented in a manner much like paging, through a lookup
table. The difference is that each segment descriptor in the table contains
the base address of the segment and a length.
Usually, segments are limited to starting on some power-of-two boundary
that makes the mapping simpler to implement. A system may impose a limit
on the number of active segments, and their minimum size may be limited
to (virtual address range/maximum number of segments). The maximum number
of segments is typically small compared to the virtual address range (for
example 256), in order to keep the size of the segment tables small. Even
so, under many circumstances, individual processes could be allocated multiple
segments (e.g., for code, data, stack, heap, I/O).
A segment descriptor also typically carries some protection information,
such as the read/write permission of the segment and the process ID. It
will also have some housekeeping information such as a presence bit and
dirty bit.
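As a rough sketch of how such a descriptor might be used, the following
code translates a (segment, offset) pair into a physical address. The
field names, the exception types, and the example table are assumptions
chosen for the illustration; real hardware raises traps rather than
exceptions, and the process ID and dirty bit are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class SegmentDescriptor:
    base: int             # physical address where the segment starts
    length: int           # size of the segment in bytes
    writable: bool        # protection: may the segment be written?
    present: bool = True  # housekeeping: is the segment in main memory?

def translate(segment_table, segment, offset, write=False):
    """Map a non-negative segment offset to a physical address."""
    desc = segment_table[segment]
    if not desc.present:
        raise LookupError("segment fault: segment not in main memory")
    if offset >= desc.length:
        raise IndexError("offset beyond end of segment")
    if write and not desc.writable:
        raise PermissionError("write to read-only segment")
    # Offset 0 is the start of the segment, so relocation is one addition.
    return desc.base + offset

table = {
    0: SegmentDescriptor(base=0x4000, length=0x1000, writable=False),  # code
    1: SegmentDescriptor(base=0x8000, length=0x0200, writable=True),   # data
}
print(hex(translate(table, 1, 0x10, write=True)))   # 0x8010
```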
A virtual address in the segmentation scheme consists of a segment number
and a segment offset (if a process has multiple segments, it has access
to some of the bits of the segment number -- otherwise the segment number
is just an ID assigned by the OS). In a pure segmentation scheme the segment
offset field grows and shrinks (logically) depending on the length of the
segment. In a typical implementation, this simply means that some of the
high order bits of the offset are ignored.
Segmentation also suffers from internal fragmentation (some of a segment
may be allocated but goes unused). In addition the variable size of the
segments can lead to external fragmentation in which the allocation and
deallocation of space for processes with segments of different sizes leaves
a collection of holes that are too small for a new process to be fit into,
yet there may be more than enough empty memory if all of the holes were
collected into one space (or if the process could be split up). This situation
is also known as checkerboarding.
External fragmentation is addressed through a process known as compaction
in which all active processes are temporarily suspended and then relocated
in order to gather together all of the memory holes. Compaction is a costly
process, especially in machines that have large amounts of physical memory.
Because the main memory is being completely rearranged, many memory access
operations are required, and few of them can take any advantage of the cache.
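The effect of checkerboarding and compaction can be illustrated with a toy
memory map. The sketch assumes a memory of 100 units holding three
segments; the function names and the layout are invented for the example,
and the copying cost that makes real compaction expensive is not modeled.

```python
def largest_hole(memory_size, segments):
    """segments maps a name to (base, length).
    Returns (largest single hole, total free space)."""
    free_total, largest, cursor = 0, 0, 0
    for base, length in sorted(segments.values()):
        hole = base - cursor            # gap before this segment
        free_total += hole
        largest = max(largest, hole)
        cursor = base + length
    tail = memory_size - cursor         # gap after the last segment
    return max(largest, tail), free_total + tail

def compact(segments):
    """Slide every segment toward address 0, preserving their order."""
    cursor, moved = 0, {}
    for name, (base, length) in sorted(segments.items(),
                                       key=lambda item: item[1][0]):
        moved[name] = (cursor, length)
        cursor += length
    return moved

segments = {"A": (0, 20), "B": (30, 20), "C": (60, 30)}
print(largest_hole(100, segments))            # (10, 30)
print(largest_hole(100, compact(segments)))   # (30, 30)
```

Before compaction, 30 units are free but no hole exceeds 10 units, so a
25-unit segment cannot be placed. Afterward the same 30 units form one hole.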
Segmentation also suffers from the large unbroken nature of a segment. It
is very costly to swap processes because an entire segment must be written
out (except when the entering segment is smaller, in which case an amount
equivalent to the entering segment must be saved to disk).
The solution to the problems of external fragmentation in segmentation
and the flat name space of paging is to combine the two into a paged segmentation
scheme where segments contain pages and can be split up on page boundaries.
Thus, paging occurs within the distinct name space of each segment, and
segments can be distributed like pages on fixed boundaries that avoid checkerboarding.
They can also be loaded and swapped on a page-by-page basis, making context
switches less costly.
A virtual address in a paged segmented scheme is divided into three parts:
a segment number, a page number, and an offset. The segment number points
to an entry in a segment table. The segment descriptor then points to the
page table for the segment and the page number is used to index into the
page table. The page table contains the physical page number that is concatenated
with the offset to generate the physical address. Thus, instead of having
just a single page table, there is a separate page table for each segment.
Because the segments can be limited in size, their page tables are sometimes
small enough to be kept in memory.
Note that, as with segmentation, the logical size of the page number portion
of the address grows and shrinks according to the size of the segment.
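The three-part translation can be sketched in a few lines. The field
widths below (8-bit segment number, 12-bit page number, 12-bit offset) are
assumptions chosen for the example, as are the function names; nested
dictionaries stand in for the segment table and the per-segment page
tables.

```python
SEG_BITS, PAGE_BITS, OFFSET_BITS = 8, 12, 12   # hypothetical field widths

def split(vaddr):
    """Break a virtual address into (segment, page, offset)."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = (vaddr >> (OFFSET_BITS + PAGE_BITS)) & ((1 << SEG_BITS) - 1)
    return segment, page, offset

def translate_paged_segmented(segment_table, vaddr):
    segment, page, offset = split(vaddr)
    page_table = segment_table[segment]   # segment descriptor -> page table
    frame = page_table[page]              # page table -> physical page number
    # The physical page number is concatenated with the unchanged offset.
    return (frame << OFFSET_BITS) | offset

# Segment 2 has one mapped page: virtual page 5 -> physical page 0x1A3.
segment_table = {2: {5: 0x1A3}}
vaddr = (2 << 24) | (5 << 12) | 0x0BC
print(hex(translate_paged_segmented(segment_table, vaddr)))   # 0x1a30bc
```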

Acceleration of Virtual Address Translation
Each access in a paged-segmented virtual memory system generates two additional
memory accesses (one to the segment table, and the other to the page table).
Even if these tables are cached, the required accesses triple the time to
transfer data between the memory and the CPU. If the tables are not cached,
then the access can be far slower.
In order to avoid these delays, a specialized cache called a translation
lookaside buffer (TLB) is used. Whenever a translation takes place, the
segment and page number are stored together with the upper portion of the
resulting physical address. The next time an access to the same virtual
page is made, the cache fetches the translation and inserts it directly
into the physical address, avoiding the extra memory references. Of course,
when an access misses in the TLB, the miss must be detected, and the normal
table lookup then proceeds to generate the address, which is stored as a new
TLB entry.
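The hit and miss paths can be modeled with a small class. This is a
software sketch only: a real TLB compares all entries in parallel in
hardware, and the LRU replacement used here is an assumption, as are the
class name and the walk_tables callback that stands in for the normal
table lookup.

```python
from collections import OrderedDict

class TLB:
    def __init__(self, capacity, walk_tables):
        self.capacity = capacity
        self.walk_tables = walk_tables   # (segment, page) -> physical page
        self.entries = OrderedDict()     # ordered least to most recently used
        self.hits = self.misses = 0

    def lookup(self, segment, page):
        key = (segment, page)
        if key in self.entries:
            self.hits += 1               # hit: no table accesses needed
            self.entries.move_to_end(key)
            return self.entries[key]
        self.misses += 1                 # miss: fall back to the table walk
        frame = self.walk_tables(segment, page)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[key] = frame        # remember the translation
        return frame

tables = {(2, 5): 0x1A3, (2, 6): 0x0F0}
tlb = TLB(capacity=64, walk_tables=lambda seg, page: tables[(seg, page)])
for _ in range(1000):
    tlb.lookup(2, 5)
print(tlb.hits, tlb.misses)              # 999 1
```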
A TLB is typically fully associative and fairly small in size (tens to a
few thousand entries) because each entry serves for a large number of locations
(e.g., 4K) and so normal exploitation of memory locality ensures that entries
in the TLB change much less frequently than cache entries. Typical hit rates
in a TLB are in the 99% to 99.9% range. Still, when a miss occurs it will
typically cost 10 to 30 clock cycles. So if 1 in 100 accesses misses in the
TLB and the penalty is 30 cycles, each access costs an extra 0.3 cycles on
average. If an access would otherwise take a single cycle, the processor
slows down by nearly a third.
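The estimate can be checked with a one-line model of the average cost per
access. The sketch assumes that an access which hits in the TLB takes one
cycle; the function name is made up for the illustration.

```python
def average_cycles(base_cycles, miss_rate, miss_penalty):
    """Average cycles per access when a fraction `miss_rate` of accesses
    pays an additional `miss_penalty` cycles."""
    return base_cycles + miss_rate * miss_penalty

# 99% hit rate, 30-cycle penalty: about 1.3 cycles per access.
print(round(average_cycles(1, 0.01, 30), 3))
# 99.9% hit rate, same penalty: about 1.03 cycles per access.
print(round(average_cycles(1, 0.001, 30), 3))
```

At the upper end of the typical hit-rate range, the same penalty costs only
about 3%, which shows how sensitive performance is to the TLB hit rate.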

Virtual Memory and Caching
One question that a designer must face is whether to place the cache in
front of the virtual memory translation or after it. The latter situation
means that the cache is accessed with a physical address. The former means
that the cache sees a virtual address.
A physically addressed cache has advantages of simplicity. It is possible
for two processes to use the same virtual addresses, but the mapping function
ensures that the physical addresses are distinct. Thus, there is no chance
that aliasing could occur in the cache. In addition, recall that when DMA
I/O occurs, the addresses must be compared to the contents of the cache
so that cache lines are either written back or invalidated. This comparison
is much easier when the cache is physically addressed. In a virtually addressed
cache, a reverse translation has to take place (from physical to virtual
address) in order to carry out the comparison.
On the other hand, a physically addressed cache only sees its address after
it has gone through the mapping process. Even with a TLB, this typically
introduces a one cycle delay. Memory accesses can be pipelined so that the
rate of data transfer is maintained, but when loads or stores occur in isolation
and there are no independent operations to fill the resulting delay, then
the CPU sees a memory wait state and is effectively slowed down by the mapping.
Some systems use a virtual level 1 cache followed by a physical level 2
cache. The level 1 cache sees no delay due to mapping, while the level 2
cache's physical addressing makes it easy to compare its contents against
the addresses of I/O operations. If the level 2 cache keeps track of which of its lines are currently
in the level 1 cache, then if it sees an I/O operation accessing one of
those lines it can signal the level 1 cache to take action. Otherwise it
can simply handle the I/O without affecting the level 1 cache.
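The bookkeeping described here can be sketched as follows. The model is
deliberately minimal and is an assumption rather than a description of any
real cache controller: it handles only an incoming DMA write, treats the
response as invalidation, and uses invented names throughout.

```python
class L2Cache:
    def __init__(self):
        self.lines = set()    # physical line addresses held in level 2
        self.in_l1 = set()    # the subset that is also held in level 1

    def fill(self, line, also_in_l1=False):
        """Record that `line` was brought into level 2 (and maybe level 1)."""
        self.lines.add(line)
        if also_in_l1:
            self.in_l1.add(line)

    def io_write(self, line):
        """A DMA write to physical `line` makes any cached copy stale.
        Returns True only if the level 1 cache must be signalled."""
        signal_l1 = line in self.in_l1
        self.lines.discard(line)     # invalidate the level 2 copy
        self.in_l1.discard(line)
        return signal_l1

l2 = L2Cache()
l2.fill(0x100, also_in_l1=True)
l2.fill(0x200)
print(l2.io_write(0x100))   # True: level 1 must also invalidate its copy
print(l2.io_write(0x200))   # False: handled entirely within level 2
```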