CmpSci 635 Notes from Lecture 3
Performance Measures
The most commonly quoted measure of performance is peak performance,
which is the maximum rate at which some operation can be executed. An example
is the rate at which no-ops execute in a single-cycle loop that runs entirely
in cache.
- MIPS -- million instructions per second
- BIPS, BOPS, GIPS, GOPS -- billions (giga -- 10^9) of instructions/operations
per second
- TOPS -- trillions of (tera) operations per second, often written teraops.
- FLOPS -- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)
- LIPS -- logical inferences per second
- CPS or COPS -- Connections per second
- Beyond tera (10^12) are peta (10^15), exa (10^18),
zetta (10^21) and yotta (10^24).
Absolute peak performance (instructions per second) is based on the clock
rate of the processor (Rclock), and on the minimum number of
clock Cycles Per Instruction (CPI) attainable. The clock rate depends on
the technology used to implement the processor. The minimum CPI depends
on the instruction set (usually just one or two instructions have the minimum
CPI) and the speed of the innermost cache, which is usually on the same
chip as the CPU in modern processors. Note that CPI may be less than one
if multiple function units are employed in parallel.
peakabs = Rclock/CPImin
Thus, absolute peak performance ignores the fact that there is a mix of
CPI values that depend on the instruction set, the cache behavior, and the
proportions in which instructions are executed.
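The peak formula can be sketched in a few lines of Python. The 100 MHz clock and the CPI values below are made-up illustrations, not figures from any real machine; the second call shows how a CPI below one (two function units working in parallel) doubles the quoted peak without changing the clock.

```python
def peak_mips(clock_hz, cpi_min):
    """Absolute peak rate, in millions of instructions per second."""
    return clock_hz / cpi_min / 1e6

# Hypothetical 100 MHz processor whose fastest instruction takes one cycle.
scalar_peak = peak_mips(100e6, 1.0)

# Same clock, but two function units in parallel give a minimum CPI of 0.5.
superscalar_peak = peak_mips(100e6, 0.5)

print(scalar_peak, superscalar_peak)
```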
Note, however, that as Rclock increases, peak performance increases
linearly. This is why clock rate is the second most often quoted performance
figure. It is only a bit more meaningful than peak performance -- can anybody
think of a situation where a clock rate quotation may be useful?
When comparing two machines with the same architecture, but different clock
rates, the ratio of the clock rates can indicate the difference in performance.
When is this comparison invalid? (When memory, I/O, and communication do not
increase in speed by the same proportion.)
What people really want to know is how fast a machine runs a given program.
But that is a much more complex performance measurement. What factors determine
this performance?
- Algorithm and data set (size, length of execution, pattern of access
-- affects memory system)
- Compiler (efficiency of code generated, ability to optimize access
patterns)
- Instruction set (how many instructions does it take to encode the
program?)
- Available operations (floating point support?)
- Operating system (timeout period and cost, other processes, etc.)
- CPI
- Clock rate
- Memory system performance (cache hit rate, miss penalty)
- I/O system performance and overhead
The only real way to accurately measure the performance of a system is to
code a program of interest and execute it. Benchmarks have been created
to serve as standard codes that can be tried on different machines for comparison
purposes. Some popular ones include SPECmark, Perfect Club, Linpack, and the Livermore
Loops. There are also specialized benchmarks, such as those that test performance
of PCs on popular packages.
The table on p. 17 of Hwang provides an interesting comparison:
Machine        Clock Rate   Peak Performance   CPU Time
VAX 11/780     5 MHz        1 MIPS             12 seconds
IBM RS/6000    25 MHz       18 MIPS            1 second
Why is the CPI of the VAX 5:1, while that of the RS/6000 is only 1.4:1?
The VAX is a CISC architecture -- it takes more cycles for each instruction.
If the VAX is accelerated to 25 MHz, it will execute at 5 MIPS and have
a CPU time of 2.4 seconds. How can it be that a machine (RS/6000) that has
a 3.6X advantage in peak performance (over the accelerated VAX) gains only
a factor of 2.4 improvement in execution time?
RS/6000 takes more instructions to execute the same program.
VAX may be able to hide extra memory references inside its longer instructions.
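The arithmetic behind this comparison can be checked with a short Python sketch. It assumes that the accelerated VAX keeps the same CPI and that each machine sustains its quoted MIPS rate for the whole run, which is a simplification; the variable names are illustrative only.

```python
def cpi(clock_mhz, mips):
    """Average clock cycles per instruction implied by a clock rate and a MIPS rate."""
    return clock_mhz / mips

vax_cpi = cpi(5, 1)
rs6000_cpi = cpi(25, 18)

# Speed the VAX clock up to 25 MHz while keeping its CPI fixed (assumption).
fast_vax_mips = 25 / vax_cpi
fast_vax_time = 12 * 1 / fast_vax_mips   # seconds; the work itself is unchanged

peak_advantage = 18 / fast_vax_mips      # RS/6000 peak rate over accelerated VAX
time_advantage = fast_vax_time / 1       # accelerated VAX time over RS/6000 time

# Instructions executed = time * rate, if each machine sustains its quoted rate.
vax_minstr = 12 * 1        # millions of instructions on the VAX
rs6000_minstr = 1 * 18     # millions of instructions on the RS/6000

print(vax_cpi, round(rs6000_cpi, 2), fast_vax_mips, fast_vax_time)
print(peak_advantage, time_advantage, rs6000_minstr / vax_minstr)
```

Under these assumptions the gap between the 3.6X peak advantage and the 2.4X time advantage corresponds to the RS/6000 executing about 1.5 times as many instructions for the same program.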
Throughput is the number of transactions per second for a program,
where a transaction is some meaningful unit of work for the application.
For example, queries per second for a database, or gates per second for
a VLSI layout tool.
Benchmarks
The Whetstone benchmark is a collection of codes originally written in Algol
60 that test floating point math library performance. These routines are
designed to be either scalar or unvectorizable array operations, although
modern compilers are now able to optimize some of them. The benchmark reports
Whetstones/second -- a purposely meaningless value that is only useful in
comparison to other executions of the benchmark. Unfortunately, Whetstone
has been recoded in other languages over the years, in both single and double
precision, and "improved" from time to time (sometimes in ways
that favored particular machines) so that it is almost impossible to compare
one Whetstone rating to another. However, if it indicates anything at all
about a machine, it is probably most related to scalar floating point performance.
Another popular benchmark is the Dhrystone (the name is taken from
a play on Whetstone and the fact that Dhrystone doesn't employ floating
point -- i.e. if it doesn't float, it must be dry) synthetic benchmark,
which involves integer and string processing. The benchmark has been implemented
in both Ada and C. It reports Dhrystones/second -- another purposely meaningless
number. However, it has been reported that 1757 Dhrystones/second equals
one VAX-11/780 MIPS. Unfortunately, when DEC reported that the VAX-11/780
is a 1 MIPS machine, they exaggerated by about a factor of 2 (the VAX-11/780
was actually a 0.47 MIPS machine). DEC later tried to correct this error
by renaming VAX MIPS as VAX Units of Performance (VUPS).
The Dhrystone benchmark does not execute any real task (which is why it
is termed synthetic). Nor is its balance of processing particularly representative
of many applications. It heavily favors string processing, has only a shallow
subroutine call tree, and its subroutines are smaller than would be found
in programs doing real work. Dhrystone is also small enough that it can
be completely cache-resident during execution on many machines.
One of the more famous floating point benchmarks of recent years is Linpack,
created by Jack Dongarra, which gets its name from a linear algebra package
that it uses to solve a dense system of linear equations with Gaussian elimination.
The benchmark keeps track of execution time and then divides this into the
number of floating point operations that it performs to get a MegaFLOPS
rating. The version that is most often reported is based on a 100 x 100
matrix using double precision floating point. Valid results must be reported
from a standard FORTRAN compiled execution -- hand tuning is forbidden if
the result is to be considered official. Unfortunately, this version of
the benchmark is usually cache resident in any machine with more than about
300 KB of cache.
The core of the 100 x 100 benchmark is a routine called daxpy that is often
referred to in journal articles about compiler optimization. This is because
the importance of this benchmark for marketing purposes has become so great
that everyone seeks to have their compiler give the absolute best performance
on it. Thus, compilers today often have special code just to recognize daxpy
and other parts of Linpack -- technically, this is not hand optimization
because it is part of the compiler, but it skews the meaning of the reported
performance nonetheless. Some compilers even have special optimization switches
that are only used for compiling Linpack.
The daxpy routine is based on a loop of the following form:
      do 30 i = 1,n
         dy(i) = dy(i) + da*dx(i)
   30 continue
It can be seen that this loop provides an excellent opportunity to exploit
a multiply-add operation as found in most vector-processing machines and
even some microprocessors and digital signal processors. Thus, one may occasionally
hear a reference to MACs -- Multiply-ACcumulate operations per second, which
are often derived from a daxpy-like loop benchmark.
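As a rough illustration of how a rate figure is derived from a daxpy-like loop, here is a minimal Python sketch. It is not the Linpack code: the vector length, the input values, and the convention of counting two floating point operations per element (one multiply, one add) are assumptions made for the example, and an interpreted loop will report a rate far below what compiled Fortran would achieve.

```python
import time

def daxpy(n, da, dx, dy):
    """Compute dy[i] = dy[i] + da*dx[i] in place and return the flop count."""
    for i in range(n):
        dy[i] = dy[i] + da * dx[i]
    return 2 * n   # one multiply and one add per element

n = 100_000                 # hypothetical vector length
dx = [1.0] * n
dy = [2.0] * n

start = time.perf_counter()
flops = daxpy(n, 3.0, dx, dy)
elapsed = time.perf_counter() - start

# Divide the operation count by the elapsed time to get a MFLOPS figure.
mflops = flops / elapsed / 1e6
print(f"{mflops:.1f} MFLOPS")
```

A multiply-accumulate rate would be half this figure, since each loop iteration performs one multiply-add pair.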
Another version of Linpack that is often reported involves a 1000 x 1000
matrix. While this may seem to be more realistic than the 100 x 100 version,
the caveat is that implementations of the larger version are allowed to
be hand-optimized. Thus, the performance for this non-cache-resident benchmark
can be surprisingly better than for its smaller cousin.
One final note about Linpack is that it is an officially administered benchmark.
In order to have permission to quote it, a vendor must report their results
to Oak Ridge National Laboratory, where it is administered. The results
are available on-line through an auto-mailer. Simply send the message "send
performance from benchmark" to netlib@ornl.gov and the latest
report will be sent as a reply.
The Standard Performance Evaluation Corporation (SPEC) is a nonprofit
corporation set up by a group of vendors who decided they needed a more meaningful
benchmark than the simple synthetic ones in use at the time, one that they could
quote in their marketing literature. The original SPEC89 benchmark was replaced
by a 1992 version, and that was in turn replaced by a 1995 version, so one
must take care in identifying the version used when comparing machines of
different periods. One way to tell the difference is that SPEC89 reports
a single figure called a SPECmark, while the newer benchmarks report separate
integer and floating point figures (SPECint92, SPECfp92, SPECint95, SPECfp95).
The SPEC benchmark computes these figures by taking, for each routine, the
ratio between its run time and a reference time, and then applying the
geometric mean (see next section) to the ratios in each part of the suite.
In the SPEC92 version, most of the individual codes are written in FORTRAN
although 8 of the 20 are in C. Of the 14 floating point routines, 9 are
double precision (64 bit) while the others are single precision. The floating
point routines account for about 58,000 lines of code all together, while
the integer codes are about 124,000 lines long. The SPEC rating is obtained
by dividing a reference time (the time for a specified reference machine --
a VAX-11/780 for SPEC92, a SPARCStation 10/40 for SPEC95) by the wall clock
time for a run of a routine, and then combining the ratios (separately
for the integer and floating point suites) with a geometric mean.
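A SPEC-style rating can be sketched in Python as follows. The sketch assumes each ratio is the reference time divided by the measured time, so that a larger ratio means a faster machine; the times are invented for illustration and the function name is hypothetical.

```python
import math

def spec_rating(reference_times, measured_times):
    """Geometric mean of the per-routine ratios reference_time / measured_time."""
    ratios = [ref / run for ref, run in zip(reference_times, measured_times)]
    # Average the logarithms, then exponentiate, to get the geometric mean.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Made-up wall clock times in seconds for a three-routine suite.
reference = [100.0, 400.0, 900.0]   # times on the reference machine
measured = [10.0, 20.0, 100.0]      # times on the machine under test

rating = spec_rating(reference, measured)
print(round(rating, 2))
```

The integer and floating point suites would each be passed through such a function separately.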
The rules for running the SPEC92 benchmarks are very flexible. The only
restriction is that a compiler flag cannot contain the name of one of the
benchmark routines. As a result, vendors have taken every opportunity to
stretch the benchmark to its limits. The SPEC ratings are sometimes called
a "guaranteed never to exceed" performance indicator. Some of
the tricks used include disabling the network and display, using only an
external terminal to interface to the machine, running in single user mode,
using different compile flag settings and even different compilers for each
routine, using third-party optimizers, modifying the OS kernel, and formatting
the disk in a nonstandard manner.
One study (Computer Architecture News, 2/96) found benchmark specific optimizations
in one case (eqntott) such that changing a constant from 1 to 2 in the code
caused the object code to grow by a factor of 10 in size and run 6 times
slower. Another case was found where an optimization produced invalid object
code if the data types of an expression were changed.
A variation on the SPEC benchmark, called SPECBase, was introduced in 1994 to
address some of these problems. Vendors must report the SPECBase
figure to SPEC but can continue to quote the existing benchmark, which
was renamed SPECPeak. SPECBase restricts the run to use only 4 compiler flags
and they must be set the same for all of the routines. The same compiler
must be used for all routines. No assertion flags are allowed that give
the compiler information regarding the properties of the program that cannot
be identified from the source code. For SPECBase92, run-time profile feedback
was not allowed but SPECBase95 relaxes this restriction.
Each of the SPEC routines is referred to by a reference number and a name.
The SPECint92 routines are all written in C:
008.espresso is a Boolean expression minimizer that is often found
in digital logic CAD systems. Its input and output are in the form of truth
tables and it emphasizes table look-up operations and Boolean operations.
(Gross size = 14800 lines, net size excluding comments and blank lines =
11000)
022.li is an execution of a 9-queens problem written recursively
for a small Lisp interpreter that is itself written in C. The reason for
9 rather than 8 queens is simply to make the program work harder. Obviously,
the recursive nature results in a test of deeper procedure calling than
the other codes. (G = 7700, N = 5000)
023.eqntott is a companion to espresso that translates a Boolean
expression into a truth table. This is an integer-based C routine that also
uses the standard C library routine qsort quite extensively. (G = 3600,
N = 2600)
026.compress compresses and decompresses a 1 MB file 20 times using
the Unix compress utility. The compression routine is based on a dynamically
constructed hash-table (Lempel-Ziv coding), and thus tests cache performance
on unpredictable memory access patterns. (G = 1500, N = 1000)
072.sc is a run of a simple Unix spreadsheet program (curses) doing
some recalculations (budgets, SPEC metrics, amortization schedules). (G
= 8500, N = 7100)
085.gcc, as the name implies, is a GNU C compilation, in this case
with Sun-3 (Motorola 68020 + 68882) assembler output. (G = 87800, N = 58800)
The SPECfp92 routines are all written in Fortran, unless otherwise noted:
013.spice2g6 is an execution of a popular analog circuit simulation
program (double precision). (G = 18900, N = 15000)
015.doduc is a Fortran Monte Carlo simulation of the time
evolution of a thermo-hydraulic model for a component of a nuclear reactor
(double precision). (G = 5300, N = 5300)
034.mdljdp2 solves the multi-body equations of motion for 500 atoms
in a gas using a model based on the idealized Lennard-Jones potential. It
is a small memory routine that is often fully cachable (double precision).
(G = 4500, N = 3600)
039.wave5 is a large-memory program that simulates a plasma by solving
Maxwell's equations on a Cartesian mesh (single precision). (G = 7600, N
= 6400)
047.tomcatv is a 2D mesh benchmark that generates a 2D boundary-fitted
coordinate system around a geometric region. It is specifically designed
to be easily vectorized (double precision). (G = 200, N = 100)
048.ora is an optical ray-tracing program that emphasizes double-precision
floating point, especially square roots. (G = 500, N = 300)
052.alvinn is a neural network training program written in C. It
takes a series of low-resolution black-and-white images of a road from a
moving vehicle and corresponding steering commands and trains a neural network
to mimic the steering operations when shown new imagery (single precision).
(G = 300, N = 200)
056.ear is a C program that simulates the acoustical properties of
the ear. It translates a sound file into a cochleogram using FFT and other
math library routines (single precision). (G = 5200, N = 3300)
077.mdljsp2 is simply a single-precision implementation of 034.mdljdp2
that is provided for comparison purposes (single precision). (G = 3900,
N = 3100)
078.swm256 is a shallow water modelling program that uses finite
difference methods over a 256 X 256 grid (single precision). (G = 500, N
= 300)
089.su2cor uses quantum chemistry techniques to solve for elementary
particle masses in Quark Gluon theory (vectorizable, double precision).
(G = 2500, N = 1700)
090.hydro2d uses the hydrodynamic Navier-Stokes equations to simulate
the flow of material in galaxies with active jets shooting out perpendicular
to their disks. It is a rather small code that is generally cache resident
(vectorizable, double precision). (G = 4500, N = 1700)
093.nasa7 is a nested suite of 7 benchmark kernels. There are two
fluid-dynamics programs, a matrix multiply, a matrix inversion, a matrix
solution (block tridiagonal on one dimension of a 4D array), a 2D complex
FFT, and a parallel Cholesky decomposition (all vectorizable, double precision).
(G = 1300, N = 800)
094.fpppp is a scalar floating-point intensive FORTRAN program that
solves a quantum chemistry problem involving electron interactions (double
precision). (G = 2700, N = 2100)
The SPEC95 benchmark reuses many of the existing codes, eliminates a few
and adds a few new codes. For all of the routines new data sets were developed
that force longer runs. In one case, a routine was modified slightly to
use larger arrays. All of the SPECint routines are in C and all of the floating
point routines are in Fortran 77. The SPECint95 routines are:
099.go is an artificial intelligence program playing the game of Go.
124.m88ksim is a simulation of a Motorola 88K RISC microprocessor
running a test program.
126.gcc is the same as 085.gcc except that a newer version of gcc
is used, and outputs SPARC object code.
129.compress is the same as 026.compress
130.li is the same as 022.li
132.ijpeg does graphics compression and decompression using the
JPEG standard
134.perl does string processing, generating anagrams and prime numbers,
using the Perl language
147.vortex runs a database program
The SPECfp95 routines are:
101.tomcatv is the same as 047.tomcatv
102.swim is the same as 078.swm256 except that it uses a 1024 X 1024
grid
103.su2cor is the same as 089.su2cor
104.hydro2d is the same as 090.hydro2d
107.mgrid is a multigrid equation solver applied to a 3D potential
field
110.applu solves parabolic and elliptic partial differential equations
125.turb3d simulates isotropic homogeneous turbulence in a cube
141.apsi is an air pollution simulation that includes temperature,
wind velocity, and pollutant distribution
145.fpppp is the same as 094.fpppp
146.wave5 is the same as 039.wave5
Other popular benchmarks include the Perfect Club suite of scientific programs
that were selected as being a challenge for supercomputer-class machines,
the Transaction Processing Performance Council's trio of benchmarks (TPC-A, TPC-B, and
TPC-C, which has replaced the first two, and simulates five different types
of transactions rather than one) that test performance on applications in
which transactions are remotely entered that must be posted to a database,
and a series of benchmarks for image processing and computer vision (the
Abingdon Cross, CMU IP Suite, DARPA IU Benchmark, etc.).
Mean Performance
If we run some benchmarks on a machine, and get execution rate (e.g. SPECratio,
MIPS or TPS) performance figures for each, we would like to use these to
compute an average execution rate for the machine. At first, we might be
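tempted to simply average the individual execution rates:
$$ R_a = \frac{1}{m} \sum_{i=1}^{m} R_i $$
where R_i is the execution rate of the i-th of m benchmarks.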
But this is not proportional to the execution time for the whole suite --
it is just the average of the individual rates. If all of the benchmarks
processed the same type and number of values, we could just compute an overall
execution rate from the total time for the suite. But because benchmark
suites generally contain a wide variety of programs, we need another approach.
The geometric mean of the m rates R_i, which is used in the SPEC benchmarks, is:
$$ R_g = \left( \prod_{i=1}^{m} R_i \right)^{1/m} $$
But this is just as inappropriate. The arithmetic mean essentially determines
the value such that if m of those values are placed in a line, their total
length equals the sum of the lengths of the data values. All the geometric
mean does is determine the value such that if m of them are multiplied together,
they equal the product of the data values (think of this as occupying the
same volume in space). It still isn't proportional to the total execution
time of the benchmark suite. For example, in the SPEC benchmarks it has
been noted that exceptionally good performance on just one of the routines
can greatly skew the final SPEC rating in a positive direction. Yet we intuitively
sense that a more representative figure would not be so skewed, and would
be proportional to running some mix of the routines.
The problem is that the execution rate is inversely proportional to
the execution time. Thus, we want a mean that inverts the rates and divides
into m instead of by m. This is precisely the harmonic mean of the m rates R_i:
$$ R_h = \frac{m}{\sum_{i=1}^{m} 1/R_i} $$
For example, if we have four benchmark programs that execute at 37, 800,
14, and 12270 MIPS, the arithmetic mean is 3280 MIPS, the geometric mean
is 267 MIPS, and the harmonic mean is 40.1 MIPS. What this shows is that,
using the harmonic mean, the slower benchmarks dominate the execution time
and hence the overall performance, which is what we would expect to happen
in the real world.
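The three means can be compared directly on the four example rates. The following Python sketch uses the standard textbook definitions; the function names are chosen for this example only.

```python
import math

def arithmetic_mean(rates):
    """Sum of the rates divided by their count."""
    return sum(rates) / len(rates)

def geometric_mean(rates):
    """m-th root of the product of the m rates, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

def harmonic_mean(rates):
    """Count of the rates divided by the sum of their reciprocals."""
    return len(rates) / sum(1.0 / r for r in rates)

rates = [37, 800, 14, 12270]   # MIPS figures from the example above

print(round(arithmetic_mean(rates)))    # 3280
print(round(geometric_mean(rates)))     # 267
print(round(harmonic_mean(rates), 1))   # 40.1
```

Dropping the 12270 MIPS outlier changes the arithmetic mean drastically but barely moves the harmonic mean, which is the behavior one wants from a figure meant to track total execution time.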
Note that if a benchmark suite reports time rather than rate of execution,
then the arithmetic mean is appropriate. Also, if one can determine the
number of operations in an expected workload corresponding to routines in
a benchmark suite, then a weighted average can be used to estimate the expected
performance of a benchmarked machine on the expected workload. For example,
if the workload is 90% database and 10% JPEG compression, then the appropriate
ratios from the SPEC benchmark can be weighted by these factors to determine
a workload-specific performance figure. Of course, it is rare that the codes
in a benchmark suite are exactly those that will be used in a real environment,
so such an estimate is still of limited analytical value.
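One way to carry out such a weighting, sketched here in Python, is a weighted harmonic mean. This is one reading of the weighting idea rather than a prescribed method: it assumes the weights are the fractions of the workload's operations that fall in each category, and the two rates below are invented, not SPEC results.

```python
def weighted_harmonic_mean(rates, weights):
    """Overall rate when weights[i] is the fraction of operations run at rates[i]."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Time per unit of work is the weighted sum of the reciprocal rates.
    return 1.0 / sum(w / r for w, r in zip(weights, rates))

# Hypothetical per-category rates: database work at 50, JPEG compression at 200.
rates = [50.0, 200.0]
weights = [0.9, 0.1]     # 90% database, 10% JPEG compression

workload_rate = weighted_harmonic_mean(rates, weights)
print(round(workload_rate, 1))
```

If the weights instead describe fractions of execution time, a weighted arithmetic mean of the rates would be the matching choice.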
Amdahl's Law
Noting that performance is dominated by slower operations leads to the basis
for Amdahl's Law for the speedup of a parallel processor. (Gene Amdahl was one
of the architects of the IBM 360, and went on to form a successful competitor
that built IBM-compatible mainframes. Even then, there were IBM clones!)
$$ S_n = \frac{n}{1 + (n-1)\alpha} $$
where n is the number of processors, and alpha is the fraction of time spent
in purely sequential execution.
What this says is that for an infinite number of processors (i.e. infinite
computing resources), the maximum speedup is 1/alpha. Thus, if half of the
operations are sequential, the maximum speedup is a factor of two.
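The limiting behavior is easy to see numerically. This Python sketch uses the common form of Amdahl's Law, speedup = 1 / (alpha + (1 - alpha)/n); the function name and the processor counts are chosen for illustration.

```python
def amdahl_speedup(n, alpha):
    """Speedup on n processors when a fraction alpha of the time is sequential."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# With half of the time sequential, the speedup creeps toward 2 but never reaches it.
for n in (1, 2, 4, 16, 1024):
    print(n, round(amdahl_speedup(n, 0.5), 3))
```

Even with 1024 processors the speedup for alpha = 0.5 stays just under the bound of 1/alpha = 2.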
This result has been combined with the earlier results showing a small maximum
degree of available parallelism to provide an argument against parallelism.
However, that assumes a sequential algorithm is being run on a parallel
machine. When an explicitly parallel algorithm is used, many applications
show a value for alpha that approaches 1/n.
Furthermore, if speedup is relative to a sequential algorithm, we sometimes
see a superlinear speedup because the parallel algorithm eliminates some
overhead (e.g. it doesn't have to calculate array element addresses from
index values).
The AT^2 model
In designing computer architectures, one can make the superficial observation
that the more hardware that is employed, the greater the power of the system.
Theoreticians would like to have a model that relates hardware complexity
to algorithmic complexity. One simple measure of complexity in a system
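is the area of silicon. Thompson proposed that this area times latency time
squared would be a proportional upper bound on algorithmic complexity:
$$ A \times T^2 \ge O(f(s)) $$
where A is the chip area, T is the latency, and f(s) is the complexity of the
algorithm for a problem of size s.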
That is, given an algorithmic complexity of a particular order, changing
silicon area has a linear effect on time, while changing time has a squared
effect on the area required to achieve the same overall complexity.
One useful derived measure that comes out of this model is that, given a
square chip with area A, an orthogonal cut across the chip produces an edge
with length $\sqrt{A}$. In a given time period, T, the amount of data that can
flow across this edge is proportional to
$$ \sqrt{A} \, T $$
This is called the bisection bandwidth of the chip.
© Copyright 1995, 1996 Charles C. Weems Jr. All rights reserved.