The most commonly quoted measure of performance is peak performance,
which is the maximum rate at which some operation can be executed. An example
is the rate at which no-ops can be executed in a single-cycle loop that runs
entirely in cache.
MIPS -- million instructions per second
BIPS, BOPS, GIPS, GOPS -- billions (giga -- 10^9) of instructions/operations
per second
TOPS -- trillions (tera) of operations per second. Often written teraops.
FLOPS -- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)
LIPS -- logical inferences per second
CPS or COPS -- Connections per second
Beyond tera (10^12) are peta (10^15), exa (10^18), zetta (10^21) and yotta (10^24).
Peak performance is based on the clock rate of the processor, and on the
minimum number of cycles per instruction (CPI) attainable. The clock rate
depends on the technology used to implement the processor. The minimum CPI
depends on the instruction set (usually just one or two instructions) and
the speed of the innermost cache, which is usually on the same chip as the
CPU in modern processors. Note that CPI may be less than one if pipelining
and multiple function units are employed.
Peak performance = 1/(t × CPImin), where t is the clock period (so 1/t is the clock rate)
Thus, peak performance ignores the fact that there is a mix of CPI values
that depend on the instruction set, the cache behavior, and the proportions
in which instructions are executed.
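The gap between peak and sustained rates can be illustrated with a small Python sketch. The clock period and the instruction mix below are made-up values chosen only for illustration, and the function names are hypothetical; the sketch simply takes peak rate as the clock rate divided by the minimum CPI and compares it with the rate obtained from a weighted-average CPI.

```python
def instruction_rate(clock_period_s, cpi):
    """Instructions per second for a given clock period and CPI."""
    return 1.0 / (clock_period_s * cpi)


def effective_cpi(mix):
    """Weighted-average CPI for a mix given as (fraction, cpi) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


# Hypothetical machine: 10 ns clock period (100 MHz), best-case CPI of 1.
t = 10e-9
peak_mips = instruction_rate(t, 1.0) / 1e6

# Hypothetical mix: (fraction of instructions executed, CPI of that class).
mix = [(0.5, 1.0), (0.3, 2.0), (0.2, 5.0)]
cpi = effective_cpi(mix)
sustained_mips = instruction_rate(t, cpi) / 1e6

print(f"peak: {peak_mips:.1f} MIPS")
print(f"effective CPI: {cpi:.2f}, sustained: {sustained_mips:.1f} MIPS")
```

With this assumed mix the sustained rate is less than half of the peak figure, even though nothing about the clock has changed.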
Note, however, that as 1/t increases, peak performance increases linearly.
This is why clock rate is the second most often quoted performance figure.
It is only a bit more meaningful than peak performance.
Can anybody think of a situation where a clock rate quotation may be useful?
When comparing two machines with the same architecture, but different clock
rates, the ratio of the clock rates can indicate the difference in performance.
When is this comparison invalid? (when memory, I/O, communication, do not
increase in speed by the same proportion)
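A rough Python sketch of the invalid case follows. It assumes a very simple model in which run time is CPU time plus a fixed amount of time stalled on memory or I/O that does not shrink when the CPU clock is raised; the cycle count, stall time, and clock rates are hypothetical.

```python
def run_time(cpu_cycles, clock_hz, stall_seconds):
    """Total time = CPU cycles / clock rate + time stalled on memory or I/O."""
    return cpu_cycles / clock_hz + stall_seconds


cycles = 2_000_000_000  # hypothetical CPU work, in cycles
stall = 4.0             # hypothetical stall time in seconds, independent of CPU clock

slow = run_time(cycles, 100e6, stall)  # 100 MHz version
fast = run_time(cycles, 200e6, stall)  # 200 MHz version, same memory system

clock_ratio = 200e6 / 100e6
actual_speedup = slow / fast

print(f"clock ratio: {clock_ratio:.2f}, actual speedup: {actual_speedup:.2f}")
```

Under these assumptions, doubling the clock rate gives noticeably less than a twofold speedup, because the stall time is unchanged.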
MIPS is misleading because the work done by an instruction varies. A no-op
does no work but gives a high MIPS rating. A floating point add does a lot
of work, but may have a lower MIPS rating.
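As a hedged illustration, the Python sketch below compares two hypothetical builds of the same program: one uses hardware floating point, the other emulates it with many simple integer instructions. The instruction counts and run times are invented for the example.

```python
def mips(instruction_count, seconds):
    """Millions of instructions executed per second."""
    return instruction_count / seconds / 1e6


# Hypothetical build X: hardware floating point, fewer but slower instructions.
x_instructions, x_seconds = 50e6, 1.0
# Hypothetical build Y: floating point emulated with simple integer instructions.
y_instructions, y_seconds = 400e6, 4.0

x_mips = mips(x_instructions, x_seconds)
y_mips = mips(y_instructions, y_seconds)

print(f"X: {x_mips:.0f} MIPS in {x_seconds:.0f} s")
print(f"Y: {y_mips:.0f} MIPS in {y_seconds:.0f} s")
```

Build Y reports the higher MIPS rating yet takes four times as long to finish the same job.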
FLOPS, LIPS, COPS, TPS are more meaningful because they specify a particular
operation, but they are often calculated in different ways for different
machines. For example, a machine with a floating point divide instruction
may get credit for executing only one FLOP, while a machine that has to
do several operations to perform the divide may count this as several FLOPs.
What people really want to know is how fast a machine runs a given program.
But that is a much more complex performance measurement. What factors determine
this performance?
Algorithm and data set (size, length of execution, pattern of access --
affects memory sys)
Compiler (efficiency of code generated, ability to optimize access patterns)
Instruction set (how many instructions does it take to encode the program?)
Available operations (floating point support?)
Operating system (timeout period and cost, other processes, etc.)
CPI
Clock rate
Memory system performance (cache hit rate, miss penalty)
The only real way to accurately measure the performance of a system is to
code a program of interest and execute it. Benchmarks have been created
to serve as standard codes that can be tried on different machines for comparison
purposes. Some popular ones include SPECmark, Perfect Club, Linpack, Livermore
loops. There are also specialized benchmarks, such as those that test performance
of PCs on popular packages.
The table on p. 17 of Kai Hwang's book, Advanced Computer Architecture,
provides an interesting comparison:
The Standard Performance Evaluation
Corporation (SPEC) is a company founded by a group of vendors who decided
to set up a benchmark, more meaningful than the simple synthetic ones, that
they could quote in their marketing literature. The original SPEC89 benchmark
was replaced by a 1992 version, and this has recently been replaced by a
1995 version, so one must take care in identifying the version used when
comparing machines of this period with later machines. One way to tell the
difference is that SPEC89 reports a single figure called a SPECmark, while
the 1992 benchmark reports separate integer and floating point figures (SPECint92
and SPECfp92), and the 1995 benchmark distinguishes figures for a base system
configuration and peak performance (SPECint_base95, SPECint95, SPECfp_base95,
SPECfp95).
The SPEC benchmark computes these figures by applying the geometric mean
(see next section) to the measured execution times of the 6 integer and
14 floating point routines in the suite. Most of the individual codes are
written in FORTRAN although 8 of the 20 are in C. Of the 14 floating point
routines, 9 are double precision (64 bit) while the others are single precision.
The floating point routines account for about 65,000 lines of code altogether,
while the integer codes are about 124,000 lines long. The SPEC benchmark
is thus a test based on real programs, kernels of real programs, or programs
taken from research labs.
Each of the 20 routines is referred to by a reference number and a name.
For example, 052.alvinn is a neural network training program written at
CMU. It takes a series of low-resolution black-and-white images of a road
from a moving vehicle and corresponding steering commands and trains a neural
network to mimic the steering operations when shown new imagery.
026.compress compresses and decompresses a 1 MB file 20 times using the
Unix compress utility. The compression routine is based on a dynamically
constructed hash table, and thus tests cache performance on unpredictable
memory access patterns.
The 015.doduc routine is a FORTRAN Monte Carlo simulation using double-precision
floating point.
056.ear is a C program that simulates the acoustical properties of the ear.
008.espresso is a Boolean expression minimizer that is often found in digital
logic CAD systems. Its input and output are in the form of truth tables
and it emphasizes table look-up operations and Boolean operations.
023.eqntott is a companion to espresso that translates a Boolean expression
into a truth table. This is an integer-based C routine that also uses the
standard C library routine qsort quite extensively.
094.fpppp is a scalar floating-point intensive FORTRAN program that solves
a quantum chemistry problem.
085.gcc, as the name implies, is a GNU C compilation, in this case with Sun-3
(Motorola 68020 + 68882) assembler output.
090.hydro2d uses the Navier-Stokes equations to simulate the flow of material
in galaxies with active jets shooting out perpendicular to their disks.
It is a rather small code that is generally cache resident.
022.li is an execution of a 9-queens problem written recursively for a small
Lisp interpreter that is itself written in C. The reason for 9 rather than
8 queens is simply to make the program work harder. Obviously, the recursive
nature results in a test of deeper procedure calling than the other codes.
034.mdljdp2 solves the multi-body equations of motion for 500 atoms in a
gas using double-precision floating point. It is a small memory routine
that is often fully cachable. A second version, 077.mdljsp2, is simply a single-precision
implementation that is provided for comparison purposes.
093.nasa7 is a nested suite of 7 benchmark kernels, all easily vectorizable
and double precision. There are two fluid-dynamics programs, a matrix multiply,
a matrix inversion, a matrix solution (block tridiagonal on one dimension
of a 4D array), a 2D complex FFT, and a parallel Cholesky decomposition.
048.ora is an optical ray-tracing program that emphasizes double-precision
floating point, especially square roots.
072.sc is a run of a simple spreadsheet program doing some recalculations.
013.spice2g6 is an execution of a popular analog circuit simulation program.
089.su2cor uses quantum chemistry techniques to solve for elementary particle
masses.
078.swm256 is a small weather-prediction program that uses finite difference
methods.
047.tomcatv is a 2D mesh benchmark that is specifically designed to be easily
vectorized.
039.wave5 is a large-memory program that simulates a plasma by solving Maxwell's
equations.
Other popular benchmarks include the Perfect Club suite of scientific programs
that were selected as being a challenge for supercomputer-class machines,
the Transaction Processing Performance Council's
trio of benchmarks (TPC-A, TPC-B, and TPC-C) that test performance on applications
in which transactions are remotely entered that must be posted to a database,
and a series of benchmarks for image processing and computer vision (the
Abingdon Cross, CMU IP Suite, DARPA IU Benchmark, etc.).
If we run some benchmarks on a machine, and get execution time performance
figures for each, we would like to use these to compute an average time
for the machine. At first, we might be tempted to simply average the individual
execution times.
(sum of n execution times)/n
But this is not proportional to the execution
time for a typical mix of programs, and is thus misleading. Suppose the
fastest program in the suite of benchmarks does some trivial operation that
we never expect our machine to execute in practice. It distorts the average,
leading us to believe that the machine will provide better performance than
it really does.
One approach to addressing this problem is to determine a weight for each
of the benchmark codes -- perhaps this reflects an estimate of how much
of the machine's expected load will mimic that of each particular code.
By multiplying the time of each benchmark by the weight before summing,
we get an overall performance that better represents what we can expect
on our anticipated load.
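The weighting idea can be sketched in a few lines of Python. The benchmark times and workload weights here are hypothetical, and the function name is only illustrative; the weights are normalized so that they need not sum to exactly one.

```python
def weighted_mean_time(times, weights):
    """Weighted arithmetic mean of execution times.

    Each weight estimates the share of the expected workload that
    resembles the corresponding benchmark.
    """
    total_weight = sum(weights)
    return sum(t * w for t, w in zip(times, weights)) / total_weight


times = [1.0, 1000.0]  # hypothetical benchmark times in seconds
weights = [0.9, 0.1]   # hypothetical shares of the anticipated load

plain = sum(times) / len(times)
weighted = weighted_mean_time(times, weights)

print(f"unweighted mean: {plain:.1f} s, weighted mean: {weighted:.1f} s")
```

With these assumed weights the long-running benchmark contributes far less to the summary figure than it does in the unweighted average.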
This is adequate for a single machine, but what if we want to compare two
machines and we are given their execution times on several codes? If we use
the arithmetic mean, we can get confusing results (adapted from Fig 2.7
in the text):
| | Time on A | Time on B | A/A | B/A | Mean of B/A, (P1+P2)/2 | Mean of A/B, (P1+P2)/2 | A/B | B/B |
| Program 1 | 1 | 10 | 1 | 10 | | | 0.1 | 1 |
| Program 2 | 1000 | 100 | 1 | 0.1 | | | 10 | 1 |
| Arithmetic | 500.5 | 55 | 1 | 0.110 | 5.05 (B=5.05*A) | 5.05 (A=5.05*B) | 9.1 (A=9.1*B) | 1 |
From this we see that the arithmetic means of the execution times on the two
machines differ by a factor of about 9, yet the average of the relative
times is the same (5.05) whichever machine is used as the reference, so each
machine appears to be about five times slower than the other. The underlying
problem is the algebraic fact that, in general, a ratio of sums does not equal
the sum of the corresponding ratios. Because relative performance is a ratio
(a multiplicative quantity), we need to use an average that is based only
on multiplicative operations.
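The inconsistency can be checked numerically. The following Python sketch uses the two-program times from the table; the names `times` and `normalized` are illustrative only.

```python
from statistics import mean

# Execution times for Program 1 and Program 2 on each machine.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def normalized(machine, reference):
    """Per-program times on `machine` divided by the times on `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


# Ratio of the arithmetic means of the raw times.
ratio_of_means = mean(times["A"]) / mean(times["B"])

# Arithmetic means of the relative times, in both directions.
b_relative_to_a = mean(normalized("B", "A"))
a_relative_to_b = mean(normalized("A", "B"))

print(f"mean time A / mean time B: {ratio_of_means:.2f}")
print(f"mean of B/A: {b_relative_to_a:.2f}")
print(f"mean of A/B: {a_relative_to_b:.2f}")
```

Both means of relative times come out above one, so the arithmetic mean says that each machine is slower than the other, depending only on which one is chosen as the reference.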
The geometric mean is such an average:
n-th root of (product of n execution times)
All the geometric mean does is determine the value such that if n copies of it
are multiplied together, they equal the product of the data values (think
of the data values as the side lengths of a box, and the geometric mean as
the edge length of a cube that has the same volume).
| | Time on A | Time on B | A/A | B/A | Mean of B/A | Mean of A/B | A/B | B/B |
| Program 1 | 1 | 10 | 1 | 10 | | | 0.1 | 1 |
| Program 2 | 1000 | 100 | 1 | 0.1 | | | 10 | 1 |
| Geometric | 31.6 | 31.6 | 1 | 1 | 1 | 1 | 1 | 1 |
Here we see that the ratio of the average of the times is the same as the average of the ratios. The two machines are actually comparable in performance.
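The same check can be repeated with the geometric mean. This is a minimal Python sketch using the times from the table; `geometric_mean` is written out by hand here rather than taken from a library.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))


# Execution times for Program 1 and Program 2 on each machine.
a = [1.0, 1000.0]
b = [10.0, 100.0]

gm_a = geometric_mean(a)
gm_b = geometric_mean(b)

# Geometric means of the relative times, in both directions.
gm_a_over_b = geometric_mean([x / y for x, y in zip(a, b)])
gm_b_over_a = geometric_mean([y / x for x, y in zip(a, b)])

print(f"geometric mean A: {gm_a:.1f}, B: {gm_b:.1f}")
print(f"ratio of means: {gm_a / gm_b:.2f}")
print(f"mean of A/B: {gm_a_over_b:.2f}, mean of B/A: {gm_b_over_a:.2f}")
```

Whichever machine is used as the reference, the comparison gives the same answer, which is the property the arithmetic mean lacks.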
Noting that performance is dominated by slower operations leads to the
basis for Amdahl's Law for the speedup of a processor. (Gene Amdahl was one
of the architects of the IBM 360, and went on to form a successful competitor
that built IBM-compatible mainframes. Even then, there were IBM clones!)
Speedup = n/(1 + (n-1)a)
where n is the factor by which one operation is sped up by the improvement,
and a is the fraction of the original execution time spent in operations
other than that one.
What this says is that for an infinite speed (i.e., reducing the time for
the operation to zero), the maximum speedup is 1/a. Thus, if half of the
operations are unaccelerated (a = 0.5), the maximum speedup is a factor
of two.
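The limit is easy to see numerically. The Python sketch below evaluates the formula above for a = 0.5 and a few values of n chosen only for illustration.

```python
def amdahl_speedup(n, a):
    """Overall speedup when one operation becomes n times faster and a
    fraction `a` of the original time is spent in everything else."""
    return n / (1.0 + (n - 1.0) * a)


a = 0.5  # half of the time is unaccelerated
for n in (1, 2, 10, 100, 1000):
    print(f"n = {n:5d}: speedup = {amdahl_speedup(n, a):.3f}")

print(f"limit as n grows: 1/a = {1.0 / a:.1f}")
```

However large n becomes, the printed speedup stays below 1/a.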