CmpSci 535 Notes from Lecture 3

Performance Measures

The most commonly quoted measure of performance is peak performance, which is the maximum rate at which some operation can be executed. For example, it may be the rate at which no-ops execute in a single-cycle loop, entirely in cache.

MIPS -- million instructions per second

BIPS, BOPS, GIPS, GOPS -- billions (giga -- 10^9) of instructions/operations per second

TOPS -- trillions (tera -- 10^12) of operations per second. Often written teraops.

FLOPS -- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)

LIPS -- logical inferences per second

CPS or COPS -- Connections per second

Beyond tera (10^12) are peta (10^15), exa (10^18), zetta (10^21) and yotta (10^24).

Peak performance is based on the clock rate of the processor, and on the minimum number of cycles per instruction (CPI) attainable. The clock rate depends on the technology used to implement the processor. The minimum CPI depends on the instruction set (usually just one or two instructions) and the speed of the innermost cache, which is usually on the same chip as the CPU in modern processors. Note that CPI may be less than one if pipelining and multiple function units are employed.

Peak performance = 1/(t × CPImin), where t is the clock cycle time (so 1/t is the clock rate)

Thus, peak performance ignores the fact that there is a mix of CPI values that depend on the instruction set, the cache behavior, and the proportions in which instructions are executed.
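As a rough illustration of the gap between peak and delivered performance, the following Python sketch computes a peak rate from the clock rate and the minimum CPI, and an effective rate from a weighted mix of CPI values. The clock rate, the CPI values, and the mix fractions are made-up numbers chosen only for illustration; they do not describe any real machine.

```python
def mips(clock_hz, cpi):
    """Instruction rate in MIPS for a given clock rate and cycles per instruction."""
    return clock_hz / cpi / 1e6


def average_cpi(mix):
    """Weighted-average CPI for a list of (fraction_of_instructions, cpi) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


# Hypothetical machine: 50 MHz clock, best-case CPI of 1.
CLOCK_HZ = 50e6
CPI_MIN = 1.0

# Hypothetical instruction mix: (fraction of instructions executed, CPI).
MIX = [
    (0.50, 1.0),   # simple ALU operations
    (0.30, 2.0),   # loads and stores that hit in cache
    (0.15, 3.0),   # branches
    (0.05, 20.0),  # memory references that miss in cache
]

peak = mips(CLOCK_HZ, CPI_MIN)
effective = mips(CLOCK_HZ, average_cpi(MIX))
print(f"peak: {peak:.1f} MIPS, effective: {effective:.1f} MIPS")
```

With this particular mix, the few slow instructions raise the average CPI well above the minimum, so the delivered rate falls far short of the peak figure.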

Note, however, that as 1/t increases, peak performance increases linearly. This is why clock rate is the second most often quoted performance figure. It is only a bit more meaningful than peak performance.

Can anybody think of a situation where a clock rate quotation may be useful?

When comparing two machines with the same architecture, but different clock rates, the ratio of the clock rates can indicate the difference in performance.

When is this comparison invalid? (when memory, I/O, communication, do not increase in speed by the same proportion)
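A small Python sketch of why the comparison breaks down, assuming a simple model in which run time is CPU time plus a stall time that is set by the memory system and does not change with the CPU clock. The cycle count, stall time, and clock rates are hypothetical values picked for illustration.

```python
def run_time(cpu_cycles, clock_hz, stall_seconds):
    """Total time = time spent computing + time stalled on memory, I/O, etc."""
    return cpu_cycles / clock_hz + stall_seconds


# Hypothetical program: 100 million CPU cycles plus 1 second of stalls
# whose duration does not depend on the CPU clock.
CPU_CYCLES = 100e6
STALL_SECONDS = 1.0

slow = run_time(CPU_CYCLES, 50e6, STALL_SECONDS)    # 50 MHz version
fast = run_time(CPU_CYCLES, 100e6, STALL_SECONDS)   # 100 MHz version

clock_ratio = 100e6 / 50e6
speedup = slow / fast
print(f"clock ratio: {clock_ratio:.2f}, actual speedup: {speedup:.2f}")
```

In this model the clock rate doubles but the program runs only 1.5 times faster, because the stall time is unchanged.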

MIPS is misleading because the work done by an instruction varies. A no-op does no work but gives a high MIPS rating. A floating point add does a lot of work, but may have a lower MIPS rating.

FLOPS, LIPS, COPS, TPS are more meaningful because they specify a particular operation, but they are often calculated in different ways for different machines. For example, a machine with a floating point divide instruction may get credit for executing only one FLOP, while a machine that has to do several operations to perform the divide may count this as several FLOPs.

What people really want to know is how fast a machine runs a given program. But that is a much more complex performance measurement. What factors determine this performance?

Algorithm and data set (size, length of execution, pattern of access -- affects memory sys)

Compiler (efficiency of code generated, ability to optimize access patterns)

Instruction set (how many instructions does it take to encode the program?)

Available operations (floating point support?)

Operating system (timeout period and cost, other processes, etc.)

CPI

Clock rate

Memory system performance (cache hit rate, miss penalty)

The only real way to accurately measure the performance of a system is to code a program of interest and execute it. Benchmarks have been created to serve as standard codes that can be tried on different machines for comparison purposes. Some popular ones include SPECmark, Perfect Club, Linpack, Livermore loops. There are also specialized benchmarks, such as those that test performance of PCs on popular packages.

The table on p. 17 of Kai Hwang's book, Advanced Computer Architecture, provides an interesting comparison:

 Machine        Clock Rate   Peak Performance   CPU Time
 VAX 11/780     5 MHz        1 MIPS             12 seconds
 IBM RS/6000    25 MHz       18 MIPS            1 second

Why is the VAX a 5:1 CPI, while the RS/6000 is only 1.4:1?

If the VAX is accelerated to 25 MHz, it will execute at 5 MIPS and have a CPU time of 2.4 seconds. How can it be that a machine (RS/6000) that has a 3.6X advantage in peak performance (over the accelerated VAX) gains only a factor of 2.4 improvement in execution time?

RS/6000 takes more instructions to execute the same program.

VAX may be able to hide extra memory references inside its longer instructions.
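The first answer can be checked with a little arithmetic. The Python sketch below derives the CPI and an implied instruction count for each machine from the figures quoted in the table above. Treating the quoted MIPS figure as the rate sustained over the whole run is an assumption made here for illustration, not something the table states.

```python
# Figures quoted in the notes: clock rate (MHz), peak rate (MIPS), CPU time (s).
MACHINES = {
    "VAX 11/780":  {"clock_mhz": 5.0,  "mips": 1.0,  "cpu_seconds": 12.0},
    "IBM RS/6000": {"clock_mhz": 25.0, "mips": 18.0, "cpu_seconds": 1.0},
}


def cycles_per_instruction(machine):
    """CPI implied by the quoted figures: cycles per second / instructions per second."""
    return machine["clock_mhz"] / machine["mips"]


def instructions_millions(machine):
    """Instruction count (in millions) implied by the quoted figures.

    Assumption: the machine sustains its quoted MIPS rate for the whole run.
    """
    return machine["mips"] * machine["cpu_seconds"]


vax = MACHINES["VAX 11/780"]
rs6000 = MACHINES["IBM RS/6000"]

for name, machine in MACHINES.items():
    print(f"{name}: CPI = {cycles_per_instruction(machine):.2f}, "
          f"instructions = {instructions_millions(machine):.0f} million")

# A VAX clocked at 25 MHz keeps its CPI of 5, so it runs at 25/5 = 5 MIPS.
fast_vax_mips = 25.0 / cycles_per_instruction(vax)
fast_vax_seconds = instructions_millions(vax) / fast_vax_mips

peak_ratio = rs6000["mips"] / fast_vax_mips
time_ratio = fast_vax_seconds / rs6000["cpu_seconds"]
instruction_ratio = instructions_millions(rs6000) / instructions_millions(vax)
print(f"peak ratio {peak_ratio:.1f}, time ratio {time_ratio:.1f}, "
      f"instruction ratio {instruction_ratio:.1f}")
```

Under that assumption the RS/6000 executes about 1.5 times as many instructions as the VAX for the same program, which accounts for the gap: 3.6/1.5 = 2.4.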

Benchmarks

The Standard Performance Evaluation Corporation (SPEC) is a company founded by a group of vendors who decided to set up a more meaningful benchmark than the simple synthetic ones that they could quote in their marketing literature. The original SPEC89 benchmark was replaced by a 1992 version, and this has recently been replaced by a 1995 version, so one must take care in identifying the version used when comparing machines of this period with later machines. One way to tell the difference is that SPEC89 reports a single figure called a SPECmark, while the 1992 benchmark reports separate integer and floating point figures (SPECint92 and SPECfp92), and the 1995 benchmark distinguishes figures for a base system configuration and peak performance (SPECint_base95, SPECint95, SPECfp_base95, SPECfp95).

The SPEC benchmark computes these figures by applying the geometric mean (see next section) to the measured execution times of the 6 integer and 14 floating point routines in the suite. Most of the individual codes are written in FORTRAN although 8 of the 20 are in C. Of the 14 floating point routines, 9 are double precision (64 bit) while the others are single precision. The floating point routines account for about 65,000 lines of code altogether, while the integer codes are about 124,000 lines long. The SPEC benchmark is thus a test based on real programs, kernels of real programs, or programs taken from research labs.

Each of the 20 routines is referred to by a reference number and a name. For example, 052.alvinn is a neural network training program written at CMU. It takes a series of low-resolution black-and-white images of a road from a moving vehicle and corresponding steering commands and trains a neural network to mimic the steering operations when shown new imagery.

026.compress compresses and decompresses a 1 MB file 20 times using the Unix compress utility. The compression routine is based on a dynamically constructed hash table, and thus tests cache performance on unpredictable memory access patterns.

The 015.doduc routine is a FORTRAN Monte Carlo simulation using double-precision floating point.

056.ear is a C program that simulates the acoustical properties of the ear.

008.espresso is a Boolean expression minimizer that is often found in digital logic CAD systems. Its input and output are in the form of truth tables and it emphasizes table look-up operations and Boolean operations.

023.eqntott is a companion to espresso that translates a Boolean expression into a truth table. This is an integer-based C routine that also uses the standard C library routine qsort quite extensively.

094.fpppp is a scalar floating-point intensive FORTRAN program that solves a quantum chemistry problem.

085.gcc, as the name implies, is a GNU C compilation, in this case with Sun-3 (Motorola 68020 + 68882) assembler output.

090.hydro uses the Navier-Stokes equations to simulate the flow of material in galaxies with active jets shooting out perpendicular to their disks. It is a rather small code that is generally cache resident.

022.li is an execution of a 9-queens problem written recursively for a small Lisp interpreter that is itself written in C. The reason for 9 rather than 8 queens is simply to make the program work harder. Obviously, the recursive nature results in a test of deeper procedure calling than the other codes.

034.mdljdp2 solves the multi-body equations of motion for 500 atoms in a gas using double-precision floating point. It is a small memory routine that is often fully cacheable. A second version, 077.mdljsp, is simply a single-precision implementation that is provided for comparison purposes.

093.nasa7 is a nested suite of 7 benchmark kernels, all easily vectorizable and double precision. There are two fluid-dynamics programs, a matrix multiply, a matrix inversion, a matrix solution (block tridiagonal on one dimension of a 4D array), a 2D complex FFT, and a parallel Cholesky decomposition.

048.ora is an optical ray-tracing program that emphasizes double-precision floating point, especially square roots.

072.sc is a run of a simple spreadsheet program doing some recalculations.

013.spice2g6 is an execution of a popular analog circuit simulation program.

089.su2cor uses quantum chemistry techniques to solve for elementary particle masses.

078.swm256 is a small weather-prediction program that uses finite difference methods.

047.tomcatv is a 2D mesh benchmark that is specifically designed to be easily vectorized.

039.wave5 is a large-memory program that simulates a plasma by solving Maxwell's equations.

Other popular benchmarks include the Perfect Club suite of scientific programs that were selected as being a challenge for supercomputer-class machines, the Transaction Processing Performance Council's trio of benchmarks (TPC-A, TPC-B, and TPC-C) that test performance on applications in which remotely entered transactions must be posted to a database, and a series of benchmarks for image processing and computer vision (the Abingdon Cross, CMU IP Suite, DARPA IU Benchmark, etc.).

Mean Performance

If we run some benchmarks on a machine, and get execution time performance figures for each, we would like to use these to compute an average time for the machine. At first, we might be tempted to simply average the individual execution times.

(sum of n execution times)/n

But this is not proportional to the execution time for a typical mix of programs, and is thus misleading. Suppose the fastest program in the suite of benchmarks does some trivial operation that we never expect our machine to execute in practice. It distorts the average, leading us to believe that the machine will provide better performance than it really does.

One approach to addressing this problem is to determine a weight for each of the benchmark codes -- perhaps this reflects an estimate of how much of the machine's expected load will mimic that of each particular code. By multiplying the time of each benchmark by the weight before summing, we get an overall performance that better represents what we can expect on our anticipated load.
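The weighting step can be sketched in a few lines of Python. The benchmark times and weights below are hypothetical values chosen for illustration, and dividing by the sum of the weights is an added convenience so that the weights need not add up to exactly one.

```python
def weighted_mean_time(times, weights):
    """Weighted arithmetic mean of execution times.

    Each weight estimates the share of the expected load that resembles
    the corresponding benchmark.
    """
    if len(times) != len(weights):
        raise ValueError("need exactly one weight per benchmark time")
    total_weight = sum(weights)
    return sum(t * w for t, w in zip(times, weights)) / total_weight


# Hypothetical execution times (seconds) for three benchmark codes, and
# hypothetical estimates of how much of the real load resembles each one.
times = [2.0, 10.0, 40.0]
weights = [0.7, 0.2, 0.1]

plain = sum(times) / len(times)
weighted = weighted_mean_time(times, weights)
print(f"unweighted mean: {plain:.2f} s, weighted mean: {weighted:.2f} s")
```

Here the slow third code dominates the unweighted mean even though it is assumed to represent only a tenth of the load; the weighted figure reflects the anticipated mix instead.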

This is adequate for a single machine, but what if we want to compare two machines and we are given their execution times on several codes? If we use the arithmetic mean, we can get confusing results (adapted from Fig 2.7 in the text):

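                   A       B       A/A   B/A    A/B    B/B
 Program 1         1       10      1     10     0.1    1
 Program 2         1000    100     1     0.1    10     1
 Arithmetic mean   500.5   55      1     5.05   5.05   1

 (A and B are execution times; the other columns are times normalized to machine A or to machine B, and their bottom row is the arithmetic mean (P1+P2)/2 of the normalized times.)
 Ratio of the arithmetic means of the times: A = 9.1*B (equivalently, B = 0.11*A)
 Arithmetic mean of the relative times: A = 5.05*B, and also B = 5.05*A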

From this we see that the arithmetic means of the execution times on the two machines differ by a factor of about 9, yet the arithmetic mean of the relative times is the same (5.05) whichever machine is taken as the reference, so each machine appears to be about five times slower than the other. The problem is the algebraic property that, in general, a ratio of sums does not equal a sum of ratios. Because we want to determine relative performance, which is a multiplicative (ratio) operation, we need to use an average that is based only on multiplicative operations.

The geometric mean is such an average:

n-th root of (product of n execution times)

All the geometric mean does is determine the value such that if n copies of it are multiplied together, they equal the product of the n data values (think of the data values as the side lengths of a box in n-dimensional space; the geometric mean is the side length of the cube that has the same volume).

                   A       B       A/A   B/A    A/B    B/B
 Program 1         1       10      1     10     0.1    1
 Program 2         1000    100     1     0.1    10     1
 Geometric mean    31.6    31.6    1     1      1      1

 (The bottom row is (P1*P2)^0.5 for each column.)

Here we see that the ratio of the average of the times is the same as the average of the ratios. The two machines are actually comparable in performance.
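The two tables can be reproduced with a short Python sketch. It uses the execution times from the tables (Program 1 and Program 2 on machines A and B); the function and variable names are illustrative only.

```python
import math

# Execution times from the tables: [Program 1, Program 2] for each machine.
TIMES = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of the n values."""
    return math.prod(values) ** (1.0 / len(values))


def normalized(values, reference):
    """Per-program times relative to a reference machine."""
    return [v / r for v, r in zip(values, reference)]


results = {}
for label, mean in (("arithmetic", arithmetic_mean), ("geometric", geometric_mean)):
    ratio_of_means = mean(TIMES["A"]) / mean(TIMES["B"])
    mean_of_ratios = mean(normalized(TIMES["A"], TIMES["B"]))
    results[label] = (ratio_of_means, mean_of_ratios)
    print(f"{label}: ratio of means = {ratio_of_means:.2f}, "
          f"mean of ratios = {mean_of_ratios:.2f}")
```

For the arithmetic mean the two figures disagree (about 9.1 versus 5.05), while for the geometric mean both come out to 1, matching the tables.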

Amdahl's Law

Noting that performance is dominated by the slower operations leads to the basis for Amdahl's Law for the speedup of a processor. (Gene Amdahl was one of the architects of the IBM 360, and went on to form a successful competitor that built IBM-compatible mainframes. Even then, there were IBM clones!) The law states that

Speedup = n/(1 + (n-1)a)

where n is the factor by which an operation is sped up by the improvement, and a is the fraction of the original execution time spent in operations other than that one.

What this says is that even for an infinite speedup of the operation (i.e., reducing the time for the operation to zero), the maximum overall speedup is 1/a. Thus, if half of the execution time is spent in unaccelerated operations (a = 0.5), the maximum speedup is a factor of two.
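A brief Python sketch of the formula, showing how the overall speedup approaches 1/a as n grows. The function name and the sample values of n are chosen here for illustration.

```python
def amdahl_speedup(n, a):
    """Overall speedup when one operation is made n times faster and a
    fraction a of the original time is spent in other operations."""
    return n / (1 + (n - 1) * a)


A = 0.5  # half of the time is spent in unaccelerated operations
for n in (2, 10, 100, 1000):
    print(f"n = {n:5d}: speedup = {amdahl_speedup(n, A):.3f}")
print(f"limit as n grows: {1 / A:.3f}")
```

The same value follows from writing the new execution time as a + (1 - a)/n and taking its reciprocal, which is algebraically identical to the formula above.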



© Copyright 1995, 1996 Charles C. Weems Jr. All rights reserved.