Performance Measures
One commonly quoted measure of performance is peak
performance, which is the maximum
rate at which some operation can be executed: for example, the rate at which
single-cycle operations, perfectly scheduled to keep all pipelines operating at
100% utilization, can be executed in a tight loop that runs entirely in cache.
MIPS -- millions of instructions per second
BIPS, BOPS, GIPS, GOPS -- billions (giga -- 10^9) of instructions/operations per second
TOPS -- trillions (tera) of operations per second, often written teraops
FLOPS -- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)
LIPS -- logical inferences per second
CPS or COPS -- connections per second
Beyond tera (10^12) are peta (10^15), exa (10^18), zetta (10^21), and
yotta (10^24).
Absolute peak performance (instructions per second) is
based on the clock rate of the processor ($R_{clock}$),
and on the minimum number of clock cycles per instruction (CPI) attainable. The
clock rate depends on the technology used to implement the processor. The
minimum CPI depends on the instruction set (usually just a subset of
instructions have the minimum CPI) and the speed of the innermost cache, which
is usually on the same chip as the CPU in modern processors. Note that CPI may
be less than one if multiple function units are employed in parallel.
$$\mathit{peak}_{abs} = \frac{R_{clock}}{CPI_{min}}$$
Thus, absolute peak performance ignores the fact that
there is a mix of CPI values that depend on the instruction set, the cache
behavior, and the proportions in which instructions are executed.
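The gap between absolute peak and the rate seen with a realistic instruction mix can be sketched in a few lines. This is only an illustration: the clock rate, the CPI values, and the instruction mix below are invented for the example and do not describe any real processor.

```python
def peak_mips(clock_mhz, cpi_min):
    """Absolute peak rate: clock rate divided by the minimum CPI."""
    return clock_mhz / cpi_min


def effective_mips(clock_mhz, mix):
    """Rate for an instruction mix given as (fraction, CPI) pairs."""
    average_cpi = sum(fraction * cpi for fraction, cpi in mix)
    return clock_mhz / average_cpi


# Hypothetical 100 MHz processor; every number below is an assumption.
CLOCK_MHZ = 100.0
MIX = [
    (0.50, 1.0),   # simple ALU operations
    (0.30, 2.0),   # loads and stores that hit in cache
    (0.15, 3.0),   # branches, including misprediction penalty
    (0.05, 20.0),  # memory references that miss in cache
]

peak = peak_mips(CLOCK_MHZ, cpi_min=1.0)
effective = effective_mips(CLOCK_MHZ, MIX)
print(f"peak = {peak:.1f} MIPS, effective = {effective:.1f} MIPS")
```

With this made-up mix the average CPI is 2.55, so the machine delivers well under half of its quoted peak even though only 5% of its instructions miss in cache.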
Note, however, that as $R_{clock}$
increases, peak performance increases linearly. This is why clock rate is the
most often quoted performance figure. It is only a bit more meaningful than
peak performance -- can you think of a situation where a clock rate quotation
may be useful?
When comparing two machines with the same
architecture, but different clock rates, the ratio of the clock rates can
indicate the difference in performance. When is this comparison invalid? (When
memory, I/O, and communication do not increase in speed by the same proportion.)
What people really want to know is how fast a machine
runs a given program. But that is a much more complex performance measurement.
What factors determine this performance?
Algorithm and data set (size, length of execution,
pattern of access -- affects memory system)
Compiler (efficiency of code generated, ability to
optimize access patterns)
Instruction set (how many instructions does it take to
encode the program?)
Available operations (floating point support?)
Number and length of pipelines
Schedule of instructions, pipeline efficiency
Branch prediction accuracy, branch frequency, branch
penalty
Operating system (timeout period and cost, other
processes, etc.)
CPI
Clock rate
Memory system performance (cache hit rate, miss
penalty)
I/O system performance and overhead
The only real way to accurately measure the
performance of a system is to code a program of interest and execute it.
Benchmarks have been created to serve as standard codes that can be tried on
different machines for comparison purposes. Some popular ones include SPECmark,
Perfect Club, Linpack, Livermore loops. There are also specialized benchmarks,
such as those that test performance of PCs on popular packages, graphics
benchmarks, transaction processing benchmarks, network benchmarks, etc.
The table on p. 17 of Kai Hwang’s Advanced
Computer Architecture text provides an interesting comparison:
| Machine | Clock Rate | Peak Performance | CPU Time |
|---|---|---|---|
| VAX 11/780 | 5 MHz | 1 MIPS | 12 seconds |
| IBM RS/6000 | 25 MHz | 18 MIPS | 1 second |
Why is the VAX a 5:1 CPI, while the RS/6000 is only
1.4:1?
The VAX is a CISC architecture -- it takes more cycles
for each instruction
If the VAX is accelerated to 25 MHz, it will execute
at 5 MIPS and have a CPU time of 2.4 seconds. How can it be that a machine (RS/6000)
that has a 3.6X advantage in peak performance (over the accelerated VAX) gains
only a factor of 2.4 improvement in execution time?
RS/6000 takes more instructions to execute the same
program.
VAX may be able to hide extra memory references inside
its longer instructions.
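The arithmetic behind these questions can be checked with a short sketch. It uses only the figures from the table, plus one simplifying assumption that is not in the source: each machine sustains its quoted MIPS rate for the whole run, so the instruction count is simply rate times CPU time.

```python
def cpi(clock_mhz, mips):
    """Cycles per instruction implied by a clock rate and a MIPS rating."""
    return clock_mhz / mips


def instructions_millions(cpu_seconds, mips):
    """Instruction count, assuming the quoted MIPS rate is sustained."""
    return cpu_seconds * mips


vax_cpi = cpi(5, 1)        # 5 cycles per instruction
rs6000_cpi = cpi(25, 18)   # about 1.4 cycles per instruction

# Accelerate the VAX to 25 MHz while keeping its CPI fixed.
vax_fast_mips = 25 / vax_cpi
vax_instr = instructions_millions(12, 1)
vax_fast_time = vax_instr / vax_fast_mips

rs6000_instr = instructions_millions(1, 18)

peak_ratio = 18 / vax_fast_mips          # RS/6000 peak advantage
time_ratio = vax_fast_time / 1           # RS/6000 execution-time advantage
instr_ratio = rs6000_instr / vax_instr   # extra instructions on the RS/6000

print(f"VAX CPI = {vax_cpi:.1f}, RS/6000 CPI = {rs6000_cpi:.2f}")
print(f"accelerated VAX: {vax_fast_mips:.0f} MIPS, {vax_fast_time:.1f} s")
print(f"peak ratio = {peak_ratio:.1f}, time ratio = {time_ratio:.1f}, "
      f"instruction ratio = {instr_ratio:.1f}")
```

Under that assumption the RS/6000 executes 1.5 times as many instructions as the VAX for the same program, which accounts exactly for the difference between the 3.6X peak advantage and the 2.4X time advantage (3.6 / 1.5 = 2.4).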
Throughput
is the number of transactions per second for a program, where a transaction is
some meaningful unit of work for the application. For example, queries per
second for a database, or gates per second for a VLSI layout tool.
Benchmarks
The Whetstone benchmark is a collection of codes originally written in Algol 60 that
test floating point math library performance. These routines are designed to be
either scalar or unvectorizable array operations, although modern compilers are
now able to optimize some of them. The benchmark reports Whetstones/second
– an intentionally meaningless value that is only useful in comparison to
other executions of the benchmark. Unfortunately, Whetstone has been recoded in
other languages over the years, in both single and double precision, and
"improved" from time to time (sometimes in ways that favored
particular machines) so that it is almost impossible to compare one Whetstone
rating to another. However, if it indicates anything at all about a machine, it
is probably most related to scalar floating point performance.
Another popular benchmark is Dhrystone, a synthetic benchmark that involves
integer and string processing. (The name is a play on Whetstone and the
fact that Dhrystone doesn't employ floating point -- i.e. if it doesn't float,
it must be dry.) The benchmark has been implemented in both Ada and C. It reports
Dhrystones/second -- another intentionally meaningless number. However, it has
been reported that 1757 Dhrystones/second equals one VAX-11/780 MIPS.
Unfortunately, when DEC reported that the VAX-11/780 is a 1 MIPS machine, they
exaggerated by about a factor of 2 (the VAX-11/780 was actually a 0.47 MIPS
machine). DEC later tried to correct this error by renaming VAX MIPS as VAX
Units of Performance (VUPS).
The Dhrystone benchmark does not execute any real task
(which is why it is termed synthetic). Nor is its balance of processing
particularly representative of many applications. It heavily favors string
processing, has only a shallow subroutine call tree, and its subroutines are
smaller than would be found in programs doing real work. Dhrystone is also
small enough that it can be completely cache-resident during execution on many
machines.
One of the more famous floating-point benchmarks is Linpack, created by Jack Dongarra, which gets its name from a
linear algebra package that it uses to solve a dense system of linear equations
with Gaussian elimination. The benchmark keeps track of execution time and then
divides this into the number of floating point operations that it performs to
get a MegaFLOPS rating. The version that is most often reported is based on a
100 x 100 matrix using double precision floating point. Valid results must be
reported from a standard FORTRAN compiled execution -- hand tuning is forbidden
if the result is to be considered official. Unfortunately, this version of the
benchmark is usually cache resident in any machine with more than about 300 KB
of cache.
The core of the 100 x 100 benchmark is a routine
called daxpy that is often referred to in journal articles about compiler
optimization. This is because the importance of this benchmark for marketing
purposes has become so great that everyone seeks to have their compiler give
the absolute best performance on it. Thus, compilers today often have special
code just to recognize daxpy and other parts of Linpack -- technically, this is
not hand optimization because it is part of the compiler, but it skews the
meaning of the reported performance nonetheless. Some compilers even have
special optimization switches that are only used for compiling Linpack.
The daxpy routine is based on a loop of the following
form:
      do 30 i = 1,n
         dy(i) = dy(i) + da*dx(i)
   30 continue
It can be seen that this loop provides an excellent
opportunity to exploit a multiply-add operation as found in most
vector-processing machines and even some microprocessors and digital signal
processors. Thus, one may occasionally hear a reference to MACs --
Multiply-ACcumulate operations per second, which are often derived from a
daxpy-like loop benchmark.
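As a rough illustration of how a daxpy-like loop turns into a MFLOPS or MACs figure, here is a minimal sketch in Python. It is not the Linpack code: the function names, vector length, and repeat count are assumptions for the example, and in an interpreted language the number mostly measures interpreter overhead rather than the floating point hardware.

```python
import time


def daxpy(da, dx, dy):
    """In place: dy[i] = dy[i] + da * dx[i], as in the Fortran loop."""
    for i in range(len(dx)):
        dy[i] = dy[i] + da * dx[i]


def daxpy_mflops(n, repeats):
    """Time repeated daxpy calls and convert to millions of FLOPs/second."""
    dx = [1.0] * n
    dy = [2.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        daxpy(0.5, dx, dy)
    elapsed = time.perf_counter() - start
    flops = 2 * n * repeats  # one multiply and one add per element
    return flops / elapsed / 1e6


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
daxpy(2.0, x, y)
print(y)  # [12.0, 24.0, 36.0]
print(f"{daxpy_mflops(100, 1000):.1f} MFLOPS (interpreter-bound)")
```

Each element costs one multiply and one add, so the loop counts as 2n floating point operations, or n multiply-accumulates on hardware that fuses the two.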
Another version of Linpack that is often reported
involves a 1000 x 1000 matrix. While this may seem to be more realistic than
the 100 x 100 version, the caveat is that implementations of the larger version
are allowed to be hand-optimized. Thus, the performance for this
non-cache-resident benchmark can be surprisingly better than for its smaller
cousin.
One final note about Linpack is that it is an
officially administered benchmark. In order to have permission to quote it, a
vendor must report their results to Oak Ridge National Laboratory, where it is
administered. The results are available on-line through an auto-mailer. Simply
send the message "send performance from benchmark" to netlib@ornl.gov
and the latest report will be sent as a reply.
The Standard Performance Evaluation Corporation (SPEC) is a nonprofit corporation set up by a group of
vendors who decided they needed a more meaningful benchmark than the simple
synthetic ones in use at the time, one that they could quote in their marketing
literature. The original SPEC89 benchmark was replaced by a 1992 version, and
that was in turn replaced by a 1995 version, so one must take care in
identifying the version used when comparing machines of different periods. One
way to tell the difference is that SPEC89 reports a single figure called a
SPECmark, while the newer benchmarks report separate integer and floating point
figures (SPECint92, SPECfp92, SPECint95, SPECfp95). The SPEC benchmark computes
these figures by comparing the time for each routine against a reference time
and applying the geometric mean (see next section) to the resulting ratios
in each part of the suite.
In the SPEC92 version, most of the individual codes
are written in FORTRAN although 8 of the 20 are in C. Of the 14 floating point
routines, 9 are double precision (64 bit) while the others are single
precision. The floating point routines account for about 58,000 lines of code
all together, while the integer codes are about 124,000 lines long. The SPEC
rating is obtained by dividing a reference time (the time for a specified
reference machine -- a VAX-11/780 for SPEC92, a SPARCstation 10/40 for SPEC95)
by the wall clock time for a run of a routine, so that a faster machine earns a
larger ratio, and then combining the ratios
(separately for the integer and floating point suites) with a geometric mean.
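A minimal sketch of this calculation follows. It assumes each ratio is taken as reference time over measured time, so that larger is better; the reference and measured times are invented for the example and are not real SPEC data.

```python
import math


def spec_ratios(reference_times, measured_times):
    """Per-routine ratio: reference machine time over measured time."""
    return [ref / run for ref, run in zip(reference_times, measured_times)]


def geometric_mean(values):
    """m-th root of the product, computed with logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical times in seconds for a three-routine suite.
reference = [1000.0, 2000.0, 4000.0]
measured = [100.0, 100.0, 100.0]

ratios = spec_ratios(reference, measured)
rating = geometric_mean(ratios)
print(ratios)            # [10.0, 20.0, 40.0]
print(f"{rating:.1f}")   # 20.0
```

The integer and floating point suites would each be combined separately in this way.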
The rules for running the SPEC92 benchmarks are very
flexible. The only restriction is that a compiler flag cannot contain the name
of one of the benchmark routines. As a result, vendors have taken every
opportunity to stretch the benchmark to its limits. The SPEC ratings are
sometimes called a "guaranteed never to exceed" performance
indicator. Some of the tricks used include disabling the network and display,
using only an external terminal to interface to the machine, running in single
user mode, using different compile flag settings and even different compilers
for each routine, using third-party optimizers, modifying the OS kernel, and
formatting the disk in a nonstandard manner.
One study (Computer Architecture News, 2/96) found
benchmark specific optimizations in one case (eqntott) such that changing a
constant from 1 to 2 in the code caused the object code to grow by a factor of
10 in size and run 6 times slower. Another case was found where an optimization
produced invalid object code if the data types of an expression were changed.
A variation on the SPEC benchmark, called SPECBase, was introduced in
1994 to address some of these problems. Vendors must
report the SPECBase figure to SPEC but can continue to quote the existing
benchmark, which was renamed SPECPeak. The SPECBase restricts the run to use only 4
compiler flags and they must be set the same for all of the routines. The same
compiler must be used for all routines. No assertion flags are allowed that
give the compiler information regarding the properties of the program that
cannot be identified from the source code. For SPECBase92, run-time profile
feedback was not allowed but SPECBase95 relaxes this restriction.
Each of the SPEC routines is referred to by a reference
number and a name.
The SPEC92int routines are all written in C:
008.espresso is a Boolean expression
minimizer that is often found in digital logic CAD systems. Its input and
output are in the form of truth tables and it emphasizes table look-up operations
and Boolean operations. (Gross size = 14800 lines, net size excluding comments
and blank lines = 11000)
022.li is an execution of a 9-queens
problem written recursively for a small Lisp interpreter that is itself written
in C. The reason for 9 rather than 8 queens is simply to make the program work
harder. Obviously, the recursive nature results in a test of deeper procedure
calling than the other codes. (G= 7700, N = 5000)
023.eqntott is a companion to espresso
that translates a Boolean expression into a truth table. This is an
integer-based C routine that also uses the standard C library routine qsort
quite extensively. (G = 3600, N = 2600)
026.compress compresses and decompresses
a 1 MB file 20 times using the Unix compress utility. The compression routine
is based on a dynamically constructed hash-table (Lempel-Ziv coding), and thus
tests cache performance on unpredictable memory access patterns. (G = 1500, N =
1000)
072.sc is a run of a simple Unix
spreadsheet program (curses) doing some recalculations (budgets, SPEC metrics,
amortization schedules). (G = 8500, N = 7100)
085.gcc, as the name implies, is a GNU C
compilation, in this case with Sun-3 (Motorola 68020 + 68882) assembler output.
(G = 87800, N = 58800)
The SPEC92fp routines are all written in
Fortran, unless otherwise noted:
013.spice2g6 is an execution of a
popular analog circuit simulation program (double precision). (G = 18900, N =
15000)
015.doduc is a FORTRAN
Monte Carlo simulation of the time evolution of a thermo-hydraulic model for a
component of a nuclear reactor (double-precision). (G = 5300, N = 5300)
034.mdljdp2 solves the multi-body
equations of motion for 500 atoms in a gas using a model based on the idealized
Lennard-Jones potential. It is a small memory routine that is often fully
cachable (double precision). (G = 4500, N = 3600)
039.wave5 is a large-memory program that
simulates a plasma by solving Maxwell's equations on a Cartesian mesh (single
precision). (G = 7600, N = 6400)
047.tomcatv is a 2D mesh benchmark that generates
a 2D boundary-fitted coordinate system around a geometric region. It is
specifically designed to be easily vectorized (double precision). (G = 200, N =
100)
048.ora is an optical ray-tracing
program that emphasizes double-precision floating point, especially square
roots. (G = 500, N = 300)
052.alvinn is a neural network training
program written in C. It takes a series of low-resolution black-and-white
images of a road from a moving vehicle and corresponding steering commands and
trains a neural network to mimic the steering operations when shown new imagery
(single precision). (G = 300, N = 200)
056.ear is a C program that simulates
the acoustical properties of the ear. It translates a sound file into a
cochleogram using FFT and other math library routines (single precision). (G =
5200, N = 3300)
077.mdljsp2 is simply a single-precision
implementation of 034.mdljdp2 that is provided for comparison purposes (single
precision). (G = 3900, N = 3100)
078.swm256 is a shallow water modelling
program that uses finite difference methods over a 256 X 256 grid (single
precision). (G = 500, N = 300)
089.su2cor uses quantum chemistry
techniques to solve for elementary particle masses in Quark Gluon theory
(vectorizable, double precision). (G = 2500, N = 1700)
090.hydro2d uses the hydrodynamic
Navier-Stokes equations to simulate the flow of material in galaxies with
active jets shooting out perpendicular to their disks. It is a rather small
code that is generally cache resident (vectorizable, double precision). (G = 4500,
N = 1700)
093.nasa7 is a nested suite of 7
benchmark kernels. There are two fluid-dynamics programs, a matrix multiply, a
matrix inversion, a matrix solution (block tridiagonal on one dimension of a 4D
array), a 2D complex FFT, and a parallel Cholesky decomposition (all
vectorizable, double precision). (G = 1300, N = 800)
094.fpppp is a scalar floating-point
intensive FORTRAN program that solves a quantum chemistry problem involving
electron interactions (double precision). (G = 2700, N = 2100)
The SPEC95 benchmark reuses many of the existing
codes, eliminates a few and adds a few new codes. For all of the routines new
data sets were developed that force longer runs. In one case, a routine was
modified slightly to use larger arrays. All of the SPECint routines are in C
and all of the floating point routines are in Fortran 77. The SPECint95
routines are:
099.go is an artificial intelligence
program playing the game of Go.
124.m88ksim is a simulation of a
Motorola 88K RISC microprocessor running a test program.
126.gcc is the same as 085.gcc except
that a newer version of gcc is used, and outputs SPARC object code.
129.compress is the same as 026.compress
130.li is the same as 022.li
132.ijpeg does graphics compression and
decompression using the JPEG standard
134.perl does string processing,
generating anagrams and prime numbers, using the PERL language
147.vortex runs a database program
The SPECfp95 routines are:
101.tomcatv is the same as 047.tomcatv
102.swim is the same as 078.swm256
except that it uses a 1024 X 1024 grid
103.su2cor is the same as 089.su2cor
104.hydro2d is the same as 090.hydro2d
107.mgrid is a multigrid equation solver
applied to a 3D potential field
110.applu solves parabolic and elliptic
partial differential equations
125.turb3d simulates isotropic
homogeneous turbulence in a cube
141.apsi is an air pollution simulation
that includes temperature, wind, velocity, and pollutant distribution
145.fpppp is the same as 094.fpppp
146.wave5 is the same as 039.wave5
The SPEC CPU 2000 benchmarks retired some of the
SPEC95 benchmarks and added new ones. These are meant to be more challenging.
However, one analysis has shown that even though the benchmarks increase memory
usage, they are less challenging in terms of branch prediction. In particular,
none of the SPEC 2000 benchmarks contains as high a density of difficult-to-predict
branches as the 099.go benchmark from SPEC 95. Good branch prediction is
critical to modern processors with multiple pipelines, and thus SPEC 2000 does
not provide applications that are as effective in helping to guide architects
in advancing the state of the art in branch prediction. Branching is, however,
becoming more of an issue with increased use of object-oriented programming.
The SPEC 2000 CPU integer benchmarks are still mainly
written in C, although one is now in C++.
| Benchmark | Language | Category |
|---|---|---|
| 164.gzip | C | Compression |
| 175.vpr | C | FPGA Circuit Placement and Routing |
| 176.gcc | C | C Programming Language Compiler |
| 181.mcf | C | Combinatorial Optimization |
| 186.crafty | C | Game Playing: Chess |
| 197.parser | C | Word Processing |
| 252.eon | C++ | Computer Visualization |
| 253.perlbmk | C | PERL Programming Language |
| 254.gap | C | Group Theory, Interpreter |
| 255.vortex | C | Object-oriented Database |
| 256.bzip2 | C | Compression |
| 300.twolf | C | Place and Route Simulator |
The SPEC CPU 2000 floating point benchmarks are now
written in C, Fortran 77, and Fortran 90. It should be noted that, from a
research perspective, this diversity of languages makes it more difficult to
evaluate compiler optimizations on the full SPEC suite, as any research
compiler must now support four languages. From the perspective of evaluating
system performance, it is even more interesting to examine the detailed SPEC
reports, as they can reveal differences in optimization capabilities for
specific languages in addition to hardware performance.
| Benchmark | Language | Category |
|---|---|---|
| 168.wupwise | Fortran 77 | Physics / Quantum Chromodynamics |
| 171.swim | Fortran 77 | Shallow Water Modeling |
| 172.mgrid | Fortran 77 | Multi-grid Solver: 3D Potential Field |
| 173.applu | Fortran 77 | Parabolic / Elliptic Partial Differential Equations |
| 177.mesa | C | 3-D Graphics Library |
| 178.galgel | Fortran 90 | Computational Fluid Dynamics |
| 179.art | C | Image Recognition / Neural Networks |
| 183.equake | C | Seismic Wave Propagation Simulation |
| 187.facerec | Fortran 90 | Image Processing: Face Recognition |
| 188.ammp | C | Computational Chemistry |
| 189.lucas | Fortran 90 | Number Theory / Primality Testing |
| 191.fma3d | Fortran 90 | Finite-element Crash Simulation |
| 200.sixtrack | Fortran 77 | High Energy Nuclear Physics Accelerator Design |
| 301.apsi | Fortran 77 | Meteorology: Pollutant Distribution |
Other popular benchmarks include the Perfect Club
suite of scientific programs that were selected as being a challenge for
supercomputer-class machines, the Transaction Processing Performance Council's trio of
benchmarks (TPC-A, TPC-B, and TPC-C, which has replaced the first two, and
simulates five different types of transactions rather than one) that test
performance on applications in which transactions are remotely entered that
must be posted to a database, and a series of benchmarks for image processing
and computer vision (the Abingdon Cross, CMU IP Suite, DARPA IU Benchmark,
etc.).
Mean Performance
If we run some benchmarks on a machine, and get
execution rate (e.g. SPECratio, MIPS or TPS) performance figures for each, we
would like to use these to compute an average execution rate for the machine.
At first, we might be tempted to simply average the individual execution rates.
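$$R_a = \frac{1}{m}\sum_{i=1}^{m} R_i$$
where $R_i$ is the execution rate for benchmark $i$ and $m$ is the number of benchmarks.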
But this is not proportional to the execution time for
the whole suite -- it is just the average of the individual rates. If all of
the benchmarks processed the same type and number of values, we could just
compute an overall execution rate from the total time for the suite. But
because benchmark suites generally contain a wide variety of programs, we need
another approach. The geometric mean, which is used in the SPEC benchmarks is:
$$R_g = \left(\prod_{i=1}^{m} R_i\right)^{1/m}$$
But this is just as inappropriate. The arithmetic mean
essentially determines the value such that if m of those values are placed in a
line, their total length equals the sum of the lengths of the data values. All
the geometric mean does is determine the value such that if m of them are
multiplied together, they equal the product of the data values (think of this
as occupying the same volume in space). It still isn't proportional to the
total execution time of the benchmark suite. For example, in the SPEC
benchmarks it has been noted that exceptionally good performance on just one of
the routines can greatly skew the final SPEC rating in a positive direction.
Yet we intuitively sense that a more representative figure would not be so
skewed, and would be proportional to running some mix of the routines.
The problem is that the execution rate is
inversely proportional to the execution time. Thus, we want a mean that inverts
the rates and divides into m instead of by m. This is precisely the harmonic
mean:
$$R_h = \frac{m}{\sum_{i=1}^{m} \frac{1}{R_i}}$$
For example, if we have four benchmark programs that
execute at 37, 800, 14, and 12270 MIPS, the arithmetic mean is 3280 MIPS, the
geometric mean is 267 MIPS, and the harmonic mean is 40.1 MIPS. What this shows
is that, using the harmonic mean, the slower benchmarks dominate the execution
time and hence the overall performance, which is what we would expect to happen
in the real world.
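The three figures in this example can be reproduced with Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The second half of the sketch adds one assumption that is not in the text: each benchmark executes the same number of instructions (one million), which is the case in which the harmonic mean equals total work divided by total time.

```python
import statistics

rates = [37, 800, 14, 12270]  # MIPS for the four benchmarks

arithmetic = statistics.mean(rates)
geometric = statistics.geometric_mean(rates)
harmonic = statistics.harmonic_mean(rates)

print(f"arithmetic = {arithmetic:.0f} MIPS")
print(f"geometric  = {geometric:.0f} MIPS")
print(f"harmonic   = {harmonic:.1f} MIPS")

# Assumption: each benchmark executes 1 million instructions.
work = 1.0                                      # millions of instructions
total_time = sum(work / rate for rate in rates)  # seconds
overall_rate = len(rates) * work / total_time    # MIPS for the whole suite
print(f"total work / total time = {overall_rate:.1f} MIPS")
```

The last line matches the harmonic mean, while the arithmetic and geometric means overstate the rate of the suite as a whole.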
Note that if a benchmark suite reports time rather
than rate of execution, then the arithmetic mean is appropriate. Also, if one
can determine the number of operations in an expected workload corresponding to
routines in a benchmark suite, then a weighted average can be used to estimate
the expected performance of a benchmarked machine on the expected workload. For
example, if the workload is 90% database and 10% JPEG compression, then the
appropriate ratios from the SPEC benchmark can be weighted by these factors to
determine a workload-specific performance figure. Of course, it is rare that
the codes in a benchmark are exactly those that will be used in a real
environment, so such an estimate is still of limited analytical value.
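One way to sketch such a weighted estimate is with a weighted harmonic mean, in keeping with the argument above that rates should be combined through their reciprocals. This is an assumption about how the weighting would be done, and the two rates below are invented for the example rather than taken from any SPEC report.

```python
def weighted_harmonic_mean(rates, weights):
    """Combine rates when weights are each routine's share of the work."""
    total_weight = sum(weights)
    return total_weight / sum(w / r for w, r in zip(weights, rates))


# Hypothetical ratios for the two workload components.
database_rate = 20.0
jpeg_rate = 80.0

estimate = weighted_harmonic_mean(
    rates=[database_rate, jpeg_rate],
    weights=[0.9, 0.1],  # 90% database, 10% JPEG compression
)
print(f"workload-specific figure = {estimate:.1f}")
```

Because the database component makes up 90% of the work, the estimate stays close to the database figure even though the JPEG figure is four times higher.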
Amdahl's Law
Noting that performance is dominated by slower
operations leads to the basis for Amdahl's Law. (Gene Amdahl was one of the
architects of the IBM 360, and went on to form a successful competitor that
built IBM-compatible mainframes. Even then, there were IBM clones!) For a
parallel processor, the law gives the speedup as
$$S = \frac{n}{1 + (n - 1)a}$$
where n is the number of processors, and a is the fraction of time spent in purely sequential execution.
What this says is that for an infinite number of
processors (i.e. infinite computing resources), the maximum speedup is 1/a. Thus, if half of the operations are sequential, the maximum speedup
is a factor of two.
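A short sketch makes the limit concrete. The function name is a label chosen for this example; it uses the form n / (1 + (n - 1)a), with n the number of processors and a the sequential fraction, which approaches 1/a as n grows.

```python
def amdahl_speedup(n, a):
    """Speedup on n processors when fraction a of the time is sequential."""
    return n / (1 + (n - 1) * a)


a = 0.5  # half of the execution is purely sequential
for n in (2, 8, 64, 1024):
    print(f"n = {n:5d}: speedup = {amdahl_speedup(n, a):.3f}")
print(f"limit as n grows: {1 / a:.1f}")
```

Even with 1024 processors the speedup remains just under the limit of two.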
This result has been combined with the earlier results
showing a small maximum degree of available parallelism to provide an argument
against parallelism. However, that assumes a sequential algorithm is being run
on a parallel machine. When an explicitly parallel algorithm is used, many
applications show a value for a that approaches 1/n.
Furthermore, if speedup is relative to a sequential
algorithm, we sometimes see a superlinear speedup because the parallel
algorithm eliminates some overhead (e.g. it doesn't have to calculate array
element addresses from index values).
Amdahl’s law generalizes to any situation in
which a performance enhancement is added to a processor. If we think of
parallelism as merely a specific mechanism for improving performance, we can
see that any mechanism that improves the performance of just a portion of the
computer's operations can be analyzed with Amdahl’s law. In this view, n
is the speedup factor of the enhancement, and a is the
fraction of the system’s operation that is not affected by the
enhancement.
For example, suppose there is a graphics operation
that accounts for 10% of execution time (90% of execution time is not affected
by any improvement in this operation) in an application, and by adding special
hardware we can speed this up by a factor of 18. Then we would have
18/(1+(18-1)*0.9) = 18/16.3 = 1.1 as the overall speedup that we can expect
from our extra hardware. The most that we could expect would be 1/0.9 = 1.11.
That is, reducing execution time by 10% (making the operation take no time at
all) gives us a 1.11X speedup.
We can use Amdahl’s law to quickly decide
whether it is worth exploring an architectural change. For example, suppose
that we could use twice as much hardware as in the preceding example, and make
the graphics operation run 36 times faster. We already know, however, that
we’re within about 1% of the optimal level of performance, so this is
likely going to be a waste of resources. A common mistake is to identify an
operation that can be sped up, and then try to maximize the speedup. Instead,
we must look at the overall speedup that will result, and try to find a
solution that balances the investment of resources against the expected gain.
Depending on the cost of the additional graphics hardware, it may not even make
sense to try to accelerate a graphics operation that accounts for just 10% of
execution time in a specific application. The architect must also consider the
importance of the application, and importance can have many definitions.
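The graphics example can be worked through numerically to show how little the second doubling of hardware buys. This is a sketch using the figures from the text (an operation taking 10% of execution time, sped up by 18 or by 36); the function name is a label chosen for the example.

```python
def overall_speedup(enhancement, unaffected):
    """Amdahl's law with an enhancement factor and an unaffected fraction."""
    return enhancement / (1 + (enhancement - 1) * unaffected)


UNAFFECTED = 0.9  # 90% of execution time is not graphics

s18 = overall_speedup(18, UNAFFECTED)
s36 = overall_speedup(36, UNAFFECTED)
limit = 1 / UNAFFECTED

print(f"18x hardware: {s18:.3f}")
print(f"36x hardware: {s36:.3f}")
print(f"upper bound:  {limit:.3f}")
print(f"gain from doubling the hardware: {100 * (s36 / s18 - 1):.2f}%")
```

Doubling the graphics hardware improves the overall result by roughly a third of one percent, which is why the overall speedup, not the local one, should drive the decision.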