CmpSci 535 Discussion
Reading 3
The Canonical Generations
of Technology
In our review of modern
computation, we saw a list of generations of technology:
0
Mechanical/Electromechanical
1 Vacuum tube
2 Transistor
3 Integrated circuit
4 Very Large Scale Integration/Microprocessor
5 Parallel
These generations, however,
are not particularly indicative of anything other than changes in construction
techniques. The first two generations represent the typical initial period in
which existing technology is applied to a novel invention. It is not unlike the
early days of automobiles in which some people tried to use steam engines to
power carriages that were modifications of designs that were meant to be pulled
by horses. Just as the complicated and dangerous steam mechanisms were
inappropriate for a personal transportation device, the high-maintenance and
high-power-consumption mechanical and vacuum tube technologies were
inappropriate for a device that needed to employ greater and greater numbers of
components to achieve its goals.
Semiconductor electronics
were developed for other applications in which low power and high reliability
were required, and were a natural match for computing. It should not be
inferred that this was somehow a coincidence -- developments in science and
engineering together with social pressures resulted in a general progression to
more complex mechanisms in many domains. Computing happens to be one of the
most visible and novel outgrowths of having crossed this particular threshold
of complexity.
Once semiconductor technology
was introduced to computing, the next three generations are really the same.
The only difference between them is the density with which the electronics are
packaged. The second generation is the point where each sealed package contains
just one device. The third generation is the transitional period when multiple
devices are in each package, but not enough to form a complete computer. The
fourth generation begins at the point that a whole computer fits into one package.
In terms of mainstream computing, we are still in the fourth generation.
Why?
Because the goal of placing
an entire computer in one package has such significant benefits that it was
done prematurely -- at a point where the technology could barely support a very
minimal computer. The technology has only recently reached the point that a
single package can hold enough components to equal the capabilities of the most
powerful second and third generation machines of the 1970's and 1980's.
Except for comparatively
minor advancements, very little has changed in mainstream computer architecture
since second- and early third-generation machines reached a peak in the
mid-60's. In the meantime, there has been one major digression in architectural
approaches (complex instruction sets), but we have returned to the practices of
that earlier period (reduced instruction set designs, memory hierarchy,
multiple arithmetic units, pipelined processing) in the late 1980's. Most of
the advancements have involved either the development of clever engineering
tricks (to take advantage of features of the changing technology) or responses
to empirical data taken from studies of application software.
Outside of the mainstream
there has been a steady stream of research into novel architectural approaches,
few of which have succeeded because of the cost of creating a viable base of
software. There have been database machines, dataflow processors, parallel
processors of many varieties, logic machines, neural network machines, fuzzy logic
machines, and so on. Some of these are in use in niche application markets. The
more successful architectures are digital signal processors (used in modems,
VCRs, phones, CD/DVD players, talking toys, etc.), fault tolerant processors
(used in banks, aircraft, medical equipment, etc.), and image processors (used
in video games, TVs, manufacturing processes, and medical equipment).
The Future
Many people believe we are on
the verge of entering (or have recently entered) a new generation in which the
mainstream architecture changes significantly. This generation will employ
parallel processing. Currently, parallelism is used quite conservatively --
mainstream processors now typically have multiple arithmetic elements that can
operate simultaneously. For example the Pentium can perform two arithmetic
operations at once in many cases.
Parallel processing will
expand to include specialized independent computers in a single package. Intel,
for example, has a corporate vision of a microprocessor with several units for
executing normal instruction streams, specialized units for generating
graphical displays and processing input signals (such as speech and video).
This will be a heterogeneous parallel processor, in contrast to the homogeneous
parallel processors that have been built of 10s to 1000s of identical
processors.
The viability of alternative
architectures will probably start to grow once the technology advances to the
point that they cost little to add to mainstream processors. Until then, they
will continue in niche markets. A prime example is the current generation of
homogeneous massively parallel processors which were developed for science and
engineering. These were large, expensive machines that attempted to push
performance two or three orders of magnitude beyond uniprocessors. The problem
is that they have always been one generation of technology behind (because of
the time required to design a new implementation for the latest
microprocessor), and they have inherent inefficiencies. The result is that they
deliver only one or two magnitudes of increased performance while costing
proportionally more than a contemporary uniprocessor.
Given the added cost of
building software for parallel machines, most potential buyers have chosen to
wait a few years until uniprocessors deliver similar performance (with no
software development cost), or to employ only limited parallelism (manually
spreading computation over several networked machines, or using a small-scale
multiprocessor), or to employ specialized accelerators (vector, array, image,
and signal processors). These are precisely the direction that microprocessors
are going, so it is likely that we will see specialty manufacturers gradually
fall aside and greater consolidation growing around a few standard
architectures. In many ways, this development of the computer industry is a
parallel to the automotive industry which saw a large number of manufacturers
dwindle to just a few as the manufacturing and development infrastructure costs
grew to support the complexity of the product.
Performance
Measures
The most commonly quoted measure of performance is peak
performance, which is the maximum
rate at which some operation can be executed. For example, the number of no-ops
that can be executed in a single cycle loop, entirely in cache.
MIPS
-- million instructions per second
BIPS,
BOPS, GIPS, GOPS -- billions (giga -- 109)
of instructions/operations per second
TOPS
-- Trillions or tera operation per second. Often teraops.
FLOPS
-- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)
LIPS
-- logical inferences per second
CPS
or COPS -- Connections per second
Beyond
tera (1012) are peta (1015), exa (1018), yota (1021) and
zeta(1024).
Peak performance is based on the clock rate of the
processor, and on the minimum number of cycles per instruction (CPI)
attainable. The clock rate depends on the technology used to implement the
processor. The minimum CPI depends on the instruction set (usually just one or
two instructions) and the speed of the innermost cache, which is usually on the
same chip as the CPU in modern processors. Note that CPI may be less than one
if pipelining and multiple function units are employed.
![]()
Thus, peak performance ignores the fact that there is
a mix of CPI values that depend on the instruction set, the cache behavior, and
the proportions in which instructions are executed.
Note, however, that as 1/t increases,
peak performance increases linearly. This why clock rate is the second most
often quoted performance figure. It is only a bit more meaningful than peak
performance.
Can anybody think of a situation where a clock rate
quotation may be useful?
When comparing two machines with the same
architecture, but different clock rates, the ratio of the clock rates can
indicate the difference in performance.
When is this comparison invalid? (when memory, I/O,
communication, do not increase in speed by the same proportion)
MIPS is misleading because the work done by an
instruction varies. A no-op does no work but gives a high mips rating. A
floating point add does a lot of work, but may have a lower MIPS rating.
FLOPS, LIPS, COPS, TPS are more meaningful because
they specify a particular operation, but they are ofte calculated in differetn
ways for different machines. For example, a machine with a floating point
divide instruction may get credit for executing only one FLOP, while a machine
that has to do several operations to perform the divide may count this as
several FLOPs.
What people really want to know is how fast a machine
runs a given program. But that is a much more complex performance measurement.
What factors determine this performance?
Algorithm and data set (size, length of execution,
pattern of access -- affects memory sys)
Compiler (efficiency of code generated, ability to
optimize access patterns)
Instruction set (how many instructions does it take to
encode the program?)
Available operations (floating point support?)
Operating system (timeout period and cost, other
processes, etc.)
CPI
Clock rate
Memory system performance (cache hit rate, miss
penalty)
The only real way to accurately measure the
performance of a system is to code a program of interest and execute it.
Benchmarks have been created to serve as standard codes that can be tried on
different machines for comparison purposes. Some popular ones include SPECmark,
Perfect Club, Linpack, Livermore loops. There are also specialized benchmarks,
such as those that test performance of PCs on popular packages.
The table on p. 17 of Kai Hwang's book, Advanced
Computer Architecture, provides an interesting comparison:
|
Machine |
Clock
Rate |
Peak
Performance |
CPU
Time |
|
VAX
11/780 |
5
MHz |
1
MIPS |
12
seconds |
|
IBM
RS/6000 |
25
MHz |
18
MIPS |
1
second |
Why is the the VAX a 5:1 CPI, while the RS/6000 is
only 1.4:1?
If the VAX is accelerated to 25 MHz, it will execute
at 5 MIPS and have a CPU time of 2.4 seconds. How can it be that a machine
(RS/6000) that has a 3.6X advantage in peak performance (over the accelerated
VAX) gains only a factor of 2.4 improvement in execution time?
RS/6000 takes more instructions to execute the same
program.
VAX may be able to hide extra memory references inside
its longer instructions.
Benchmarks
The Systems Performance Evaluation Cooperative (SPEC) is a group of vendors who decided to set up a more
meaningful benchmark that the simple synthetic ones that they could quote in
their market literature. The original SPEC89 benchmark has been replaced by a
1992 version, so one must take care in identifying the version used when
comparing machines of this period with later machines. One way to tell the
difference is that SPEC89 reports a single figure called a SPECmark, while the
newer benchmark reports separate integer and floating point figures (SPECin92
and SPECfp92).
The SPEC benchmark computes these figures by applying
the geometric mean (see next section) to the measured execution times of the 6
integer and 14 floating point routines in the suite. Most of the individual
codes are written in FORTRAN although 8 of the 20 are in C. Of the 14 floating
point routines, 9 are double precision (64 bit) while the others are single
precision. The floating point routines account for about 65,000 lines of code
all together, while the integer codes are abour 124,000 lines long. The SPEC
benchmark is thus a test based on real programs, kernels of real programs, or
programs taken from research labs.
Each of the 20 routines is referred to by a reference
number and a name. For example, 052.alvinn is a neural network training program
written at CMU. It takes a series of low-resolution black-and-white images of a
road from a moving vehicle and corresponding steering commands and trains a
neural network to mimic the steering operations when shown new imagery.
026.compress compresses and decompresses a 1 MB file
20 times using the Unix compress utility. The compression routine is based on a
dynamically constructed hash-table, and thus tests cache performance on unpredictable
memory access patterns.
The 015.doduc routine is a FORTRAN monte-carlo
simulation using double-precision floating point.
056.ear is a C program that simulates the acoustical
properties of the ear.
008.espresso is a Boolean expression minimizer that is
often found in digital logic CAD systems. Its input and output are in the form
of truth tables and it emphasizes table look-up operations and Boolean
operations.
023.eqntott is a companion to espresso that translates
a Boolean expression into a truth table. This is an integer-based C routine
that also uses the standard C library routine qsort quite extensively.
0.94fppp is a scalar floating-point intensive FORTRAN
program that solves a quantum chemistry problem.
085.gcc as the name implies is a Gnu C compilation, in
this case with Sun-3 (Motorola 68020 + 68882) assembler output.
090.hydro uses the Navier-Stokes equations to simulate
the flow of material in galaxies with active jets shooting out perpendicular to
their disks. It is a rather small code that is generally cache resident.
022.li is an execution of a 9-queens problem written
recursively for a small Lisp interpreter that is itself written in C. The
reason for 9 rather than 8 queens is simply to make the program work harder.
Obviously, the recursive nature results in a test of deeper procedure calling
than the other codes.
034.mdljdp2 solves the multi-body equations of motion
for 500 atoms in a gas using double-precision floating point. It is a small
memory routine that is often fully cachable. A second version, 077.mdljsp is
simply a single-precision implementation that is provided for comparison
purposes.
093.nasa7 is a nested suite of 7 benchmark kernels,
all easily vectorizable and double precision. There are two fluid-dynamics
programs, a matrix multiply, a matrix inversion, a matrix solution (block
tridiagonal on one dimension of a 4D array), a 2D complex FFT, and a parallel
Cholesky decomposition.
048.ora is an optical ray-tracing program that
emphasizes double-precision floating point, especially square roots.
072.sc is a run of a simple spreadsheet program doing
some recalculations.
013.spice2g6 is an execution of a popular analog
circuit simulation program.
089.su2cor uses quantum chemistry techniques to solve
for elementary particle masses.
078.swm256 is a small weather-prediction program that
uses finite difference methods.
047tomcatv is a 2D mesh benchmark that is specifically
designed to be easily vectorized.
039.wave5 is a large-memory program that simulates a
plasma by solving Maxwell's equations.
SPEC was updated again in 2000, see http://www.spec.org
for details.
Other popular benchmarks include the Perfect Club
suite of scientific programs that were selected as being a challenge for
supercomputer-class machines, the Transaction Processing Council's trio of
benchmarks (TPC-A, TPC-B, and TPC-C) that test perfomance on applications in
which transactions are remotely entered that must be posted to a database, and
a series of benchmarks for image processing and computer vision (the Abingdon
Cross, CMU IP Suite, DARPA IU Benchmark, etc.).
Mean Performance
If we run some benchmarks on a machine, and get
execution time performance figures for each, we would like to use these to
compute an average time for the machine. At first, we might be tempted to
simply average the individual execution times.
![]()
But this is not proportional to the execution time for
a typical mix of programs, and is thus misleading. Suppose the fastest program
in the suite of benchmarks does some trivial operation that we never expect our
machine to execute in practice. It distorts the average, leading us to believe
that the machine will provide better performance than it really does.
One approach to addressing this problem is to
determine a weight for each of the benchmark codes -- perhaps this reflects an
estimate of how much of the machine's expected load will mimic that of each
particular code. By multiplying the time of each benchmark by the weight before
summing, we get an overall performance that better represents what we can
expect on our anticipated load.
This is adequate for a single machine, but what if we
want to compare two machines and we are given their execution time on several
codes. If we use the arithmetic mean, we can get confusing results (adapted
from Fig 2.7 in the text):
|
|
A |
B |
A/A |
B/A |
P1+P2/2 |
P1+P2/2 |
A/B |
B/B |
|
Program 1 |
1 |
10 |
1 |
10 |
|
|
0.1 |
1 |
|
Program 2 |
1000 |
100 |
1 |
0.1 |
|
|
10 |
1 |
|
Arithmetic |
500.5 |
55 |
1 |
0.109 |
5.05 |
5.05 |
9.1 |
1 |
From this we see that the arithmetic mean of the
execution times on each machine is different by a factor of about 9, yet the
average of the relative times for the machines is the same. The problem here is
the same as the algebraic property that a product of sums does not equal a sum
of products in general. Because we want to determine their relative performance
(a (inverse) multiplicative operation), we should use an average that is based
on multiplicative operations.
The geometric mean is such an average:
![]()
All the geometric mean does is determine the value
such that if m of them are multiplied together, they equal the product of the
data values (think of this as occupying a volume in space, and computing the
length of a line that spans the volume diagonally).
|
|
A |
B |
A/A |
B/A |
P1*P2^0.5 |
P1*P2^0.5 |
A/B |
B/B |
|
Program 1 |
1 |
10 |
1 |
10 |
|
|
0.1 |
1 |
|
Program 2 |
1000 |
100 |
1 |
0.1 |
|
|
10 |
1 |
|
Geometric |
31.6 |
31.6 |
1 |
1 |
1 |
1 |
1 |
1 |
Here we see that the ratio of the average of the times
is the same as the average of the ratios. The two machines are actually
comparable in performance. This was the reasoning that went into the use of the
geometric mean by the SPEC benchmark, However, the flaw in this is that the
numbers being averaged in this case are ratios. The practical result is that
outlying values have a greater impact than they should. Thus, we should use an
average that inverts the values. The mean that does this is called the Harmonic
mean.
![]()
With the harmonic mean, we get an average of the ratios
that does not give extra weight to benchmarks that have a few high ratios. Why
hasn't SPEC switched to using this? They have a legacy to preserve (changing
the mean would make it difficult to compare to older reports). But there is
also a competitive advantage to using the geometric mean, as it allows a
marketing department to focus its energy toward boosting performance on one or
two of the benchmarks to improve overall performance, rather than having to
work hard on every one of the benchmarks.
Non-speed Measures of
Performance
During the period when
microprocessors were trying to catch up with mainframes and supercomputers,
speed was the dominating factor in reporting performance. Even after core
processor performance reached supercomputer levels (as measured by CPU
benchmarks), microprocessors were unable to directly substitute for their
larger counterparts. The reasons boil down to the limited number of connections
that can be brought out of the small area of a single IC. A high-end
microprocessor may have a few hundred connections, but many of these merely
supply power to the circuitry. A mainframe, on the other hand, may support
several thousand external connections. It naturally can move data between
memory and the processor in larger chunks and thus at higher rates. Similarly,
I/O in a mainframe or supercomputer dedicates many more connections to
supporting the transfer of massive amounts of data into and out of memory.
CPU speed is thus just one
aspect of performance. Memory and I/O transfer rates impact performance on
larger tasks. Separate benchmarks may be used to characterize these aspects of
performance. The overall impact of improving this aspect of performance is
better throughput, as might be measured in a whole-application context by the
TPC benchmarks or the SPECweb benchmarks.
Beyond throughput, there are
many other aspects of an architecture that can be measured and which can be
considered a form of performance. These include: power, space, fault tolerance,
and real-time behavior.
Power
As computers have been
reduced in size to enter new application domains, a major impediment to further
reductions has been power consumption. On a personal, practical level, it is
frustrating to have a laptop computer run out of battery power after an hour or
two of work. This could be avoided by carrying a ten-pound battery pack, but
the inconvenience would outweigh (literally!) the added benefit.
For other applications, the
power consumption of a computer can determine whether it can be used at all.
For an exotic example, consider a planetary space probe that has a fixed power
budget for the life of its mission. A certain fraction of the budget is
allocated to computation, and no more power is available. Thus, engineers on
the project look first at power consumption and view speed as a secondary
consideration. In a less exotic, the processor in a talking greeting card must
run on an extremely small amount of power. The rate of computation may be
fixed, and rather small, so speed is almost irrelevant. But the size and cost
of the battery that can fit in the card is a key factor in making it possible
to manufacture and sell for a profit.
Power may be measured in
several ways. The metric is watts (or milliwatts, microwatts, or nanowatts),
however, the measurement methodology must be taken into consideration. Is the
power measurement peak, average, or minimum? Does it account for various low
power modes that the processor may support? How does it relate to clock rate
(power consumption drops dramatically at lower clock rates)? Is it measured for
a real application, or for a synthetic workload?
The latter question is often
a surprise to software engineers -- different programs consume power at
different rates. This was demonstrated quite dramatically in the author's lab,
where an older electrical system was being taxed near its limit. Circuit
breakers would blow periodically, and the problem was eventually traced to the
running of a floating-point intensive application on a larger machine. The
floating point logic in this system would normally draw very little power, but
one program would push it to full utilization and cause an upsurge in power
consumption of several watts. There is actually research effort going into
compiler and operating system optimizations that spread out work to keep power
consumption steady in some applications.
Special low-power
architectures have been developed for use in portable and handheld devices.
These use special techniques, such as turning off banks of memory and shutting down
I/O connections when they are not needed. They often also vary clock speeds to
reduce power consumption during idle time, and support various levels of power
saving, such as "sleep" mode.
At the other end of the
scale, supercomputers are heavily impacted by power consumption. The most
powerful systems are typically built from thousands of high-performance
microprocessors. While the power consumption of a few extra watts in a PC may
not seem significant, multiplied by a factor of ten thousand, it becomes a
major consideration. As an illustrative example, the first machine to reach
Teraflop performance, ASCI Red, consisted of over 9000 Intel Pentium Pro
processors. In planning for the facility, it was realized that the machine
would need its own 1.6 megawatt transformer. This was so far beyond prior
experience that the procurement people at Sandia National Laboratory believed
it had to be a mistake and instead budgeted for a 600 kilowatt transformer.
When the mistake was discovered, it was too late to increase the budget, and so
it was necessary to postpone the resurfacing of a parking lot to make up for
the deficit.
Space
Another aspect of an
architecture is the physical space that it occupies. Smaller space enables use
in a wider range of applications. Some researchers talk about "smart
sand" in which they envision sand-grain sized computers that operate on
static electricity from their surrounding environment. Such computers could be
used in packaging for consumer goods, medical instrumentation, environmental
monitoring, and so on.
High performance processors
typically cannot operate in isolation. They require separate memory, bus
interfaces, I/O interfaces, network interfaces, graphics accelerators, and
other components to become a usable system. But as microelectronics have grown
smaller, it has been possible to move more of these components onto the same
IC. For certain applications, System On a Chip (SOC) architectures are highly
advantageous. Even for desktop and laptop systems, reducing the number of
components helps to reduce size, cost, and power consumption.
Fault Tolerance
Sometimes we simply cannot
afford to have a computer fail. Life critical examples include aircraft control
systems, medical life support and radiation therapy equipment, nuclear
reactors, and so on. Even in cases where a failure isn't life-threatening, the
cost may exceed the price of the machine. For example, if a bank loses access
to its accounts for even a few minutes on a busy stock market trading day, it
can lose millions of dollars.
It's a common misperception
that computers are only as infallible as their software. As transistors have
become smaller and packed more densely, they have become less reliable.
Radiation from a cosmic ray can easily disrupt a transistor in a modern
processor, and microelectronics are becoming more susceptible to
electromagnetic interference. When a fault is transient in nature it is called
a soft fault, and when it is permanent, it is called a hard fault. Because
processors are now so complex, it is not possible to economically test them
exhaustively. Variations in manufacturing process lead some chips to have
faults while others do not. The ones that are most frustrating are those that
are intermittent, perhaps with temperature or other environmental factors, as
these may past testing and only exhibit occasional problems in use.
One approach to fault
tolerance is to build more robust circuits. The use of different materials (at
greater expense) can make integrated circuits tolerant of radiation. Circuits
can be more thoroughly tested over a longer period, with varying temperatures
and vibration or physical shocking to eliminate those that are unreliable.
Operation over a wider temperature range or in the presence of radiation or
under strong vibrations are another dimension of performance.
Circuits can be replicated so
that the same operation can be performed in parallel, and the results compared.
Disagreement indicates a fault and voting selects the correct result. Usually
circuits are replicated in threes, and thus this approach is called triple
modular redundancy (TMR).
Redundancy can be applied at
the system level, with processors operating in parallel and voting. When a
processor disagrees with the majority, the others go on and the disagreeable
processor is shut down for diagnostics and possible replacement.
Real Time
In some cases a computer must
meet a deadline or catastrophic results ensue. If a nuclear reactor starts to
lose coolant, the control rods must be dropped into place before the heat warps
their tubes. While it might seem that speed is the most important factor in
meeting such a deadline, consistency in performance is really more critical.
Because the computer cannot be late, system engineers must always assume worst
case performance: that the caches will always miss, the pipelines will operate
at minimum efficiency, that disks will be as slow as possible, and so on. If
they were to rely on better performance, then there is the possibility that all
of the worst cases could coincide and cause a failure to meet the hard
deadline.
Some processors are thus more
effective for real-time processing than others that may have significantly
better average speed. Some features that help improve predictability of timing
for real time include being able to explicitly load certain data into cache
memory and lock it there so that it is known which variables can always be
accessed at maximum speed. Being able to explicitly schedule instructions for
execution by multiple functional units, with known execution times (versus the
now common practice of allowing the hardware to dynamically reschedule
instructions into different orders).
In Summary
Performance metrics are as
diverse as the application domains in which computers are used. Which ones are
most important depend on how the system will be used. Metrics based purely on
system characteristics, such as clock rate, are poor predictors of actual
performance. The best approach is to actually measure the performance of a
system on the target application mix. However, this may not be practical. In
that case, standard benchmarks may be used as a guide, but only if their
behavior is well understood and can be related to the target application.
© Copyright 1995, 1996,
2001 Charles C. Weems Jr. All rights reserved.
Back to Chip Weems' home page.