Performance Measures

One commonly quoted measure of performance is peak performance, which is the maximum rate at which some operation can be executed. An example is the number of single-cycle operations that can be executed per second in a tight loop that runs entirely in cache and is perfectly scheduled to keep all pipelines operating at 100% utilization.

MIPS -- million instructions per second

BIPS, BOPS, GIPS, GOPS -- billions (giga -- 10^9) of instructions/operations per second

TOPS -- trillions of (tera) operations per second. Often written teraops.

FLOPS -- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)

LIPS -- logical inferences per second

CPS or COPS -- Connections per second

Beyond tera (10^12) are peta (10^15), exa (10^18), zetta (10^21) and yotta (10^24).

Absolute peak performance (instructions per second) is based on the clock rate of the processor (Rclock), and on the minimum number of clock Cycles Per Instruction (CPI) attainable. The clock rate depends on the technology used to implement the processor. The minimum CPI depends on the instruction set (usually just a subset of instructions have the minimum CPI) and the speed of the innermost cache, which is usually on the same chip as the CPU in modern processors. Note that CPI may be less than one if multiple function units are employed in parallel.

peakabs = Rclock/CPImin

Thus, absolute peak performance ignores the fact that there is a mix of CPI values that depend on the instruction set, the cache behavior, and the proportions in which instructions are executed.
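As a rough illustration of the gap between absolute peak and a mix-dependent rate, the sketch below evaluates Rclock/CPImin alongside an effective rate for one instruction mix. The clock rate, CPI values, and mix fractions are invented for illustration and do not describe any real processor.

```python
def peak_ips(clock_hz, cpi_min):
    """Absolute peak instructions per second: Rclock / CPImin."""
    return clock_hz / cpi_min


def effective_ips(clock_hz, mix):
    """Rate for a mix given as (fraction of instructions, CPI) pairs."""
    avg_cpi = sum(fraction * cpi for fraction, cpi in mix)
    return clock_hz / avg_cpi


# Hypothetical 100 MHz processor; every number below is made up.
clock_hz = 100e6
mix = [
    (0.5, 1.0),  # simple ALU operations
    (0.3, 2.0),  # loads and stores that hit in cache
    (0.2, 5.0),  # mispredicted branches and cache misses
]

peak_mips = peak_ips(clock_hz, cpi_min=1.0) / 1e6
effective_mips = effective_ips(clock_hz, mix) / 1e6
print(f"peak: {peak_mips:.1f} MIPS, effective: {effective_mips:.1f} MIPS")
```

With this particular mix the average CPI is a little over 2, so the effective rate is less than half of the quoted peak.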

Note, however, that as Rclock increases, peak performance increases linearly. This is why clock rate is the most often quoted performance figure. It is only a bit more meaningful than peak performance -- can you think of a situation where a clock rate quotation may be useful?

When comparing two machines with the same architecture, but different clock rates, the ratio of the clock rates can indicate the difference in performance. When is this comparison invalid? (when memory, I/O, communication, do not increase in speed by the same proportion)

What people really want to know is how fast a machine runs a given program. But that is a much more complex performance measurement. What factors determine this performance?

Algorithm and data set (size, length of execution, pattern of access -- affects memory system)

Compiler (efficiency of code generated, ability to optimize access patterns)

Instruction set (how many instructions does it take to encode the program?)

Available operations (floating point support?)

Number and length of pipelines

Schedule of instructions, pipeline efficiency

Branch prediction accuracy, branch frequency, branch penalty

Operating system (timeout period and cost, other processes, etc.)

CPI

Clock rate

Memory system performance (cache hit rate, miss penalty)

I/O system performance and overhead

The only real way to accurately measure the performance of a system is to code a program of interest and execute it. Benchmarks have been created to serve as standard codes that can be tried on different machines for comparison purposes. Some popular ones include SPECmark, Perfect Club, Linpack, and Livermore loops. There are also specialized benchmarks, such as those that test performance of PCs on popular packages, graphics benchmarks, transaction processing benchmarks, network benchmarks, etc.

The table on p. 17 of Kai Hwang’s Advanced Computer Architecture text provides an interesting comparison:

 

Machine        Clock Rate    Peak Performance    CPU Time
VAX 11/780     5 MHz         1 MIPS              12 seconds
IBM RS/6000    25 MHz        18 MIPS             1 second

Why is the VAX a 5:1 CPI, while the RS/6000 is only 1.4:1?

The VAX is a CISC architecture -- it takes more cycles for each instruction

If the VAX is accelerated to 25 MHz, it will execute at 5 MIPS and have a CPU time of 2.4 seconds. How can it be that a machine (RS/6000) that has a 3.6X advantage in peak performance (over the accelerated VAX) gains only a factor of 2.4 improvement in execution time?

RS/6000 takes more instructions to execute the same program.

VAX may be able to hide extra memory references inside its longer instructions.
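The instruction-count explanation can be checked with a little arithmetic. The sketch below inverts CPU time = instructions x CPI / clock rate using the table's figures. It assumes each machine actually sustains its quoted peak rate while running the program, which is an idealization, so the result is only indicative.

```python
def instruction_count(cpu_time_s, clock_hz, cpi):
    """Invert CPU time = instructions * CPI / clock rate."""
    return cpu_time_s * clock_hz / cpi


# CPI is inferred as clock rate / peak rate (assumes peak is sustained).
vax_cpi = 5e6 / 1e6        # 5 MHz / 1 MIPS  = 5
rs6000_cpi = 25e6 / 18e6   # 25 MHz / 18 MIPS, about 1.4

vax_instructions = instruction_count(12.0, 5e6, vax_cpi)
rs6000_instructions = instruction_count(1.0, 25e6, rs6000_cpi)
ratio = rs6000_instructions / vax_instructions

print(f"VAX:     {vax_instructions / 1e6:.0f} million instructions")
print(f"RS/6000: {rs6000_instructions / 1e6:.0f} million instructions")
print(f"RS/6000 executes {ratio:.2f}x as many instructions")
```

Under that assumption the ratio equals 3.6/2.4, the peak advantage divided by the observed improvement in execution time.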

Throughput is the number of transactions per second for a program, where a transaction is some meaningful unit of work for the application. For example, queries per second for a database, or gates per second for a VLSI layout tool.

Benchmarks

The Whetstone benchmark is a collection of codes originally written in Algol 60 that test floating point math library performance. These routines are designed to be either scalar or unvectorizable array operations, although modern compilers are now able to optimize some of them. The benchmark reports Whetstones/second – an intentionally meaningless value that is only useful in comparison to other executions of the benchmark. Unfortunately, Whetstone has been recoded in other languages over the years, in both single and double precision, and "improved" from time to time (sometimes in ways that favored particular machines) so that it is almost impossible to compare one Whetstone rating to another. However, if it indicates anything at all about a machine, it is probably most related to scalar floating point performance.

Another popular benchmark is the Dhrystone (the name is taken from a play on Whetstone and the fact that Dhrystone doesn't employ floating point -- i.e. if it doesn't float, it must be dry) synthetic benchmark, which involves integer and string processing. The benchmark has been implemented in both Ada and C. It reports Dhrystones/second -- another intentionally meaningless number. However, it has been reported that 1757 Dhrystones/second equals one VAX-11/780 MIPS. Unfortunately, when DEC reported that the VAX-11/780 is a 1 MIPS machine, they exaggerated by about a factor of 2 (the VAX-11/780 was actually a 0.47 MIPS machine). DEC later tried to correct this error by renaming VAX MIPS as VAX Units of Performance (VUPS).

The Dhrystone benchmark does not execute any real task (which is why it is termed synthetic). Nor is its balance of processing particularly representative of many applications. It heavily favors string processing, has only a shallow subroutine call tree, and its subroutines are smaller than would be found in programs doing real work. Dhrystone is also small enough that it can be completely cache-resident during execution on many machines.

One of the more famous floating-point benchmarks is Linpack, created by Jack Dongarra, which gets its name from a linear algebra package that it uses to solve a dense system of linear equations with Gaussian elimination. The benchmark keeps track of execution time and then divides this into the number of floating point operations that it performs to get a MegaFLOPS rating. The version that is most often reported is based on a 100 x 100 matrix using double precision floating point. Valid results must be reported from a standard FORTRAN compiled execution -- hand tuning is forbidden if the result is to be considered official. Unfortunately, this version of the benchmark is usually cache resident in any machine with more than about 300 KB of cache.

The core of the 100 x 100 benchmark is a routine called daxpy that is often referred to in journal articles about compiler optimization. This is because the importance of this benchmark for marketing purposes has become so great that everyone seeks to have their compiler give the absolute best performance on it. Thus, compilers today often have special code just to recognize daxpy and other parts of Linpack -- technically, this is not hand optimization because it is part of the compiler, but it skews the meaning of the reported performance nonetheless. Some compilers even have special optimization switches that are only used for compiling Linpack.

The daxpy routine is based on a loop of the following form:

      do 30 i = 1,n
         dy(i) = dy(i) + da*dx(i)
   30 continue

It can be seen that this loop provides an excellent opportunity to exploit a multiply-add operation as found in most vector-processing machines and even some microprocessors and digital signal processors. Thus, one may occasionally hear a reference to MACs -- Multiply-ACcumulate operations per second, which are often derived from a daxpy-like loop benchmark.
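To make the rate calculation concrete, here is a minimal Python sketch of a daxpy-style loop together with the flop counting used to turn a run time into a MFLOPS figure. The function names, vector length, and repeat count are chosen for illustration. Interpreted Python is far slower than compiled Fortran, so the number it prints says nothing about Linpack itself.

```python
import time


def daxpy(n, da, dx, dy):
    """dy(i) = dy(i) + da*dx(i) for the first n elements, in place."""
    for i in range(n):
        dy[i] = dy[i] + da * dx[i]
    return dy


def daxpy_mflops(n, repeats=100):
    """Time repeated daxpy calls and convert the result to MFLOPS.

    Each loop iteration does one multiply and one add, so one call
    performs 2*n floating point operations.
    """
    dx = [1.0] * n
    dy = [0.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        daxpy(n, 0.5, dx, dy)
    elapsed = time.perf_counter() - start
    flops = 2 * n * repeats
    return flops / elapsed / 1e6


print(f"{daxpy_mflops(1000):.1f} MFLOPS in interpreted Python")
```

Counting a multiply-add pair as two operations is the usual convention; a machine with a fused multiply-add unit can retire both in one instruction, which is why this loop flatters such hardware.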

Another version of Linpack that is often reported involves a 1000 x 1000 matrix. While this may seem to be more realistic than the 100 x 100 version, the caveat is that implementations of the larger version are allowed to be hand-optimized. Thus, the performance for this non-cache-resident benchmark can be surprisingly better than for its smaller cousin.

One final note about Linpack is that it is an officially administered benchmark. In order to have permission to quote it, a vendor must report their results to Oak Ridge National Laboratory, where it is administered. The results are available on-line through an auto-mailer. Simply send the message "send performance from benchmark" to netlib@ornl.gov and the latest report will be sent as a reply.

The Standard Performance Evaluation Corporation (SPEC) is a nonprofit corporation set up by a group of vendors who decided they needed a more meaningful benchmark than the simple synthetic ones in use at the time, one that they could quote in their marketing literature. The original SPEC89 benchmark was replaced by a 1992 version, and that was in turn replaced by a 1995 version, so one must take care in identifying the version used when comparing machines of different periods. One way to tell the difference is that SPEC89 reports a single figure called a SPECmark, while the newer benchmarks report separate integer and floating point figures (SPECint92, SPECfp92, SPECint95, SPECfp95). The SPEC benchmark computes these figures by normalizing the run time of each routine against a reference time and then applying the geometric mean (see next section) to the routines in each part of the suite.

In the SPEC92 version, most of the individual codes are written in FORTRAN although 8 of the 20 are in C. Of the 14 floating point routines, 9 are double precision (64 bit) while the others are single precision. The floating point routines account for about 58,000 lines of code all together, while the integer codes are about 124,000 lines long. The SPEC rating is obtained by dividing a reference time (the time for a specified reference machine -- a VAX-11/780 for SPEC92, a SPARCStation 10/40 for SPEC95) by the wall clock time for a run of a routine, so that a faster machine earns a higher ratio, and then combining the ratios (separately for the integer and floating point suites) with a geometric mean.
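A small sketch of how such a rating could be assembled is shown below. It assumes each ratio is formed as reference time over measured time, so that faster runs score higher. The function name and all of the times are hypothetical and are not taken from any published SPEC report.

```python
import math


def spec_style_rating(reference_times, measured_times):
    """Geometric mean of per-routine ratios (reference / measured)."""
    ratios = [ref / run for ref, run in zip(reference_times, measured_times)]
    # Summing logarithms avoids overflow when many ratios are multiplied.
    log_sum = sum(math.log(r) for r in ratios)
    return math.exp(log_sum / len(ratios))


# Hypothetical wall clock times in seconds for three routines.
reference = [100.0, 400.0, 900.0]  # times on the reference machine
measured = [10.0, 100.0, 100.0]    # times on the machine under test

rating = spec_style_rating(reference, measured)
print(f"rating: {rating:.2f}")
```

The individual ratios here are 10, 4, and 9, and the rating is the cube root of their product.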

The rules for running the SPEC92 benchmarks are very flexible. The only restriction is that a compiler flag cannot contain the name of one of the benchmark routines. As a result, vendors have taken every opportunity to stretch the benchmark to its limits. The SPEC ratings are sometimes called a "guaranteed never to exceed" performance indicator. Some of the tricks used include disabling the network and display, using only an external terminal to interface to the machine, running in single user mode, using different compile flag settings and even different compilers for each routine, using third-party optimizers, modifying the OS kernel, and formatting the disk in a nonstandard manner.

One study (Computer Architecture News, 2/96) found benchmark-specific optimizations in one case (eqntott) such that changing a constant from 1 to 2 in the code caused the object code to grow by a factor of 10 in size and run 6 times slower. Another case was found where an optimization produced invalid object code if the data types of an expression were changed.

A variation on the SPEC benchmark was introduced in 1994 to address some of these problems and is called the SPECBase. Vendors must report the SPECBase figure to SPEC but can continue to quote SPECPeak as the existing benchmark was renamed. The SPECBase restricts the run to use only 4 compiler flags and they must be set the same for all of the routines. The same compiler must be used for all routines. No assertion flags are allowed that give the compiler information regarding the properties of the program that cannot be identified from the source code. For SPECBase92, run-time profile feedback was not allowed but SPECBase95 relaxes this restriction.

Each of the SPEC routines is referred to by a reference number and a name.

The SPEC92int routines are all written in C:

008.espresso is a Boolean expression minimizer that is often found in digital logic CAD systems. Its input and output are in the form of truth tables and it emphasizes table look-up operations and Boolean operations. (Gross size = 14800 lines, net size excluding comments and blank lines = 11000)

022.li is an execution of a 9-queens problem written recursively for a small Lisp interpreter that is itself written in C. The reason for 9 rather than 8 queens is simply to make the program work harder. Obviously, the recursive nature results in a test of deeper procedure calling than the other codes. (G = 7700, N = 5000)

023.eqntott is a companion to espresso that translates a Boolean expression into a truth table. This is an integer-based C routine that also uses the standard C library routine qsort quite extensively. (G = 3600, N = 2600)

026.compress compresses and decompresses a 1 MB file 20 times using the Unix compress utility. The compression routine is based on a dynamically constructed hash-table (Lempel-Ziv coding), and thus tests cache performance on unpredictable memory access patterns. (G = 1500, N = 1000)

072.sc is a run of a simple Unix spreadsheet program (curses) doing some recalculations (budgets, SPEC metrics, amortization schedules). (G = 8500, N = 7100)

085.gcc as the name implies is a Gnu C compilation, in this case with Sun-3 (Motorola 68020 + 68882) assembler output. (G = 87800, N = 58800)

The SPEC92fp routines are all written in Fortran, unless otherwise noted:

013.spice2g6 is an execution of a popular analog circuit simulation program (double precision). (G = 18900, N = 15000)

015.doduc is a Monte Carlo simulation of the time evolution of a thermo-hydraulic model for a component of a nuclear reactor (double precision). (G = 5300, N = 5300)

034.mdljdp2 solves the multi-body equations of motion for 500 atoms in a gas using a model based on the idealized Lennard-Jones potential. It is a small memory routine that is often fully cachable (double precision). (G = 4500, N = 3600)

039.wave5 is a large-memory program that simulates a plasma by solving Maxwell's equations on a Cartesian mesh (single precision). (G = 7600, N = 6400)

047.tomcatv is a 2D mesh benchmark that generates a 2D boundary-fitted coordinate system around a geometric region. It is specifically designed to be easily vectorized (double precision). (G = 200, N = 100)

048.ora is an optical ray-tracing program that emphasizes double-precision floating point, especially square roots. (G = 500, N = 300)

052.alvinn is a neural network training program written in C. It takes a series of low-resolution black-and-white images of a road from a moving vehicle and corresponding steering commands and trains a neural network to mimic the steering operations when shown new imagery (single precision). (G = 300, N = 200)

056.ear is a C program that simulates the acoustical properties of the ear. It translates a sound file into a cochleogram using FFT and other math library routines (single precision). (G = 5200, N = 3300)

077.mdljsp2 is simply a single-precision implementation of 034.mdljdp2 that is provided for comparison purposes (single precision). (G = 3900, N = 3100)

078.swm256 is a shallow water modelling program that uses finite difference methods over a 256 X 256 grid (single precision). (G = 500, N = 300)

089.su2cor uses quantum chemistry techniques to solve for elementary particle masses in Quark Gluon theory (vectorizable, double precision). (G = 2500, N = 1700)

090.hydro2d uses the hydrodynamic Navier-Stokes equations to simulate the flow of material in galaxies with active jets shooting out perpendicular to their disks. It is a rather small code that is generally cache resident (vectorizable, double precision). (G = 4500, N = 1700)

093.nasa7 is a nested suite of 7 benchmark kernels. There are two fluid-dynamics programs, a matrix multiply, a matrix inversion, a matrix solution (block tridiagonal on one dimension of a 4D array), a 2D complex FFT, and a parallel Cholesky decomposition (all vectorizable, double precision). (G = 1300, N = 800)

094.fpppp is a scalar floating-point intensive FORTRAN program that solves a quantum chemistry problem involving electron interactions (double precision). (G = 2700, N = 2100)

The SPEC95 benchmark reuses many of the existing codes, eliminates a few and adds a few new codes. For all of the routines new data sets were developed that force longer runs. In one case, a routine was modified slightly to use larger arrays. All of the SPECint routines are in C and all of the floating point routines are in Fortran 77. The SPECint95 routines are:

099.go is an artificial intelligence program playing the game of Go.

124.m88ksim is a simulation of a Motorola 88K RISC microprocessor running a test program.

126.gcc is the same as 085.gcc except that a newer version of gcc is used, and outputs SPARC object code.

129.compress is the same as 026.compress

130.li is the same as 022.li

132.ijpeg does graphics compression and decompression using the JPEG standard

134.perl does string processing, generating anagrams and prime numbers, using the PERL language

147.vortex runs a database program

The SPECfp95 routines are:

101.tomcatv is the same as 047.tomcatv

102.swim is the same as 078.swm256 except that it uses a 1024 X 1024 grid

103.su2cor is the same as 089.su2cor

104.hydro2d is the same as 090.hydro2d

107.mgrid is a multigrid equation solver applied to a 3D potential field

110.applu solves parabolic and elliptic partial differential equations

125.turb3d simulates isotropic homogeneous turbulence in a cube

141.apsi is an air pollution simulation that includes temperature, wind, velocity, and pollutant distribution

145.fpppp is the same as 094.fpppp

146.wave5 is the same as 039.wave5

The SPEC CPU 2000 benchmarks retired some of the SPEC95 benchmarks and added new ones. These are meant to be more challenging. However, one analysis has shown that even though the benchmarks increase memory usage, they are less challenging in terms of branch prediction. In particular, none of the SPEC 2000 benchmarks contains as high a density of difficult-to-predict branches as the 099.go benchmark from SPEC 95. Good branch prediction is critical to modern processors with multiple pipelines, and thus SPEC 2000 does not provide applications that are as effective in helping to guide architects in advancing the state of the art in branch prediction. Branching is, however, becoming more of an issue with increased use of object-oriented programming.

The SPEC 2000 CPU integer benchmarks are still mainly written in C, although one is now in C++.

164.gzip  C  Compression 

175.vpr  C  FPGA Circuit Placement and Routing

176.gcc  C  C Programming Language Compiler

181.mcf  C  Combinatorial Optimization

186.crafty  C  Game Playing: Chess

197.parser  C  Word Processing

252.eon  C++  Computer Visualization

253.perlbmk  C  PERL Programming Language

254.gap  C  Group Theory, Interpreter

255.vortex  C  Object-oriented Database

256.bzip2  C  Compression

300.twolf  C  Place and Route Simulator

The SPEC CPU 2000 floating point benchmarks are now written in C, Fortran 77, and Fortran 90. It should be noted that, from a research perspective, this diversity of languages makes it more difficult to evaluate compiler optimizations on the full SPEC suite, as any research compiler must now support four languages. From the perspective of evaluating system performance, it is even more interesting to examine the detailed SPEC reports, as they can reveal differences in optimization capabilities for specific languages in addition to hardware performance.

168.wupwise  Fortran 77  Physics / Quantum Chromodynamics

171.swim  Fortran 77  Shallow Water Modeling

172.mgrid  Fortran 77  Multi-grid Solver: 3D Potential Field

173.applu  Fortran 77  Parabolic / Elliptic Partial Differential Equations

177.mesa  C  3-D Graphics Library

178.galgel  Fortran 90  Computational Fluid Dynamics

179.art  C  Image Recognition / Neural Networks

183.equake  C  Seismic Wave Propagation Simulation

187.facerec  Fortran 90  Image Processing: Face Recognition

188.ammp  C  Computational Chemistry

189.lucas  Fortran 90  Number Theory / Primality Testing

191.fma3d  Fortran 90  Finite-element Crash Simulation

200.sixtrack  Fortran 77  High Energy Nuclear Physics Accelerator Design

301.apsi  Fortran 77  Meteorology: Pollutant Distribution

Other popular benchmarks include the Perfect Club suite of scientific programs, which were selected as being a challenge for supercomputer-class machines. The Transaction Processing Performance Council's trio of benchmarks (TPC-A, TPC-B, and TPC-C) test performance on applications in which remotely entered transactions must be posted to a database; TPC-C has replaced the first two and simulates five different types of transactions rather than one. There is also a series of benchmarks for image processing and computer vision (the Abingdon Cross, CMU IP Suite, DARPA IU Benchmark, etc.).

Mean Performance

If we run some benchmarks on a machine, and get execution rate (e.g. SPECratio, MIPS or TPS) performance figures for each, we would like to use these to compute an average execution rate for the machine. At first, we might be tempted to simply average the individual execution rates R_i of the m benchmarks (the arithmetic mean):

$$R_a = \frac{1}{m} \sum_{i=1}^{m} R_i$$

But this is not proportional to the execution time for the whole suite -- it is just the average of the individual rates. If all of the benchmarks processed the same type and number of values, we could just compute an overall execution rate from the total time for the suite. But because benchmark suites generally contain a wide variety of programs, we need another approach. The geometric mean of the m rates R_i, which is used in the SPEC benchmarks, is:

$$R_g = \left( \prod_{i=1}^{m} R_i \right)^{1/m}$$

But this is just as inappropriate. The arithmetic mean essentially determines the value such that if m of those values are placed in a line, their total length equals the sum of the lengths of the data values. All the geometric mean does is determine the value such that if m of them are multiplied together, they equal the product of the data values (think of this as occupying the same volume in space). It still isn't proportional to the total execution time of the benchmark suite. For example, in the SPEC benchmarks it has been noted that exceptionally good performance on just one of the routines can greatly skew the final SPEC rating in a positive direction. Yet we intuitively sense that a more representative figure would not be so skewed, and would be proportional to running some mix of the routines.

The problem is that the execution rate is inversely proportional to the execution time. Thus, we want a mean that inverts the rates and divides into m instead of by m. This is precisely the harmonic mean of the m rates R_i:

$$R_h = \frac{m}{\sum_{i=1}^{m} 1/R_i}$$

For example, if we have four benchmark programs that execute at 37, 800, 14, and 12270 MIPS, the arithmetic mean is 3280 MIPS, the geometric mean is 267 MIPS, and the harmonic mean is 40.1 MIPS. What this shows is that, using the harmonic mean, the slower benchmarks dominate the execution time and hence the overall performance, which is what we would expect to happen in the real world.
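The three means for this example can be reproduced with a few lines of Python. This is only a sketch using the standard definitions of the arithmetic, geometric, and harmonic means; the function names are chosen here for illustration.

```python
import math


def arithmetic_mean(rates):
    """Sum of the rates divided by their count."""
    return sum(rates) / len(rates)


def geometric_mean(rates):
    """m-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


def harmonic_mean(rates):
    """Count divided by the sum of the reciprocal rates."""
    return len(rates) / sum(1.0 / r for r in rates)


rates = [37, 800, 14, 12270]  # MIPS figures from the example

print(f"arithmetic: {arithmetic_mean(rates):.0f} MIPS")
print(f"geometric:  {geometric_mean(rates):.0f} MIPS")
print(f"harmonic:   {harmonic_mean(rates):.1f} MIPS")
```

Dropping the 12270 MIPS entry from the list changes the arithmetic mean enormously while the harmonic mean barely moves, which shows how little a single very fast benchmark contributes to the harmonic figure.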

Note that if a benchmark suite reports time rather than rate of execution, then the arithmetic mean is appropriate. Also, if one can determine the number of operations in an expected workload corresponding to routines in a benchmark suite, then a weighted average can be used to estimate the expected performance of a benchmarked machine on the expected workload. For example, if the workload is 90% database and 10% JPEG compression, then the appropriate ratios from the SPEC benchmark can be weighted by these factors to determine a workload-specific performance figure. Of course, it is rare that the codes in a benchmark are exactly those that will be used in a real environment, so such an estimate is still of limited analytical value.
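One way to carry out such a weighting is sketched below. It assumes the weights are the fractions of the workload's operations that fall in each category and that each benchmark figure is an execution rate, in which case the matching combination is a weighted harmonic mean (total work divided by total time). The two rates are hypothetical placeholders, not measured SPEC results.

```python
def workload_rate(weights, rates):
    """Weighted harmonic mean of rates.

    weights[i] is the fraction of the workload's operations executed
    at rates[i]; the result is total work divided by total time.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 1.0 / sum(w / r for w, r in zip(weights, rates))


weights = [0.9, 0.1]  # 90% database, 10% JPEG compression
rates = [20.0, 80.0]  # hypothetical rates for the two benchmarks

estimate = workload_rate(weights, rates)
print(f"estimated workload rate: {estimate:.1f}")
```

Because the database portion dominates the workload, the estimate stays close to the database rate even though the JPEG rate is four times higher.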

Amdahl's Law

Noting that performance is dominated by slower operations leads to the basis for Amdahl's Law for speedup of a parallel processor. (Gene Amdahl was one of the architects of the IBM 360, and went on to form a successful competitor that built IBM-compatible mainframes. Even then, there were IBM clones!) The speedup S is

$$S = \frac{n}{1 + (n - 1)a}$$

where n is the number of processors, and a is the fraction of time spent in purely sequential execution.

What this says is that for an infinite number of processors (i.e. infinite computing resources), the maximum speedup is 1/a. Thus, if half of the operations are sequential, the maximum speedup is a factor of two.
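The limiting behavior is easy to see numerically. The sketch below uses the form n/(1 + (n - 1)a), which matches the arithmetic in the graphics example later in this section. The function name and the processor counts are chosen for illustration.

```python
def amdahl_speedup(n, a):
    """Speedup for n processors with sequential fraction a."""
    return n / (1 + (n - 1) * a)


# Half of the execution time is sequential, so the limit is 1/a = 2.
for n in (2, 16, 1024, 10**6):
    print(f"n = {n:>7}: speedup = {amdahl_speedup(n, 0.5):.4f}")
```

Even a million processors leave the speedup just under 2, while a program with no sequential part (a = 0) would scale linearly with n.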

This result has been combined with the earlier results showing a small maximum degree of available parallelism to provide an argument against parallelism. However, that assumes a sequential algorithm is being run on a parallel machine. When an explicitly parallel algorithm is used, many applications show a value for a that approaches 1/n.

Furthermore, if speedup is relative to a sequential algorithm, we sometimes see a superlinear speedup because the parallel algorithm eliminates some overhead (e.g. it doesn't have to calculate array element addresses from index values).

Amdahl’s law generalizes to any situation in which a performance enhancement is added to a processor. If we think of parallelism as merely a specific mechanism for improving performance, we can see that any mechanism that improves the performance of just a portion of the computer’s operations can be analyzed with Amdahl’s law. In this view, n is the speedup factor of the enhancement, and a is the fraction of the system’s operation that is not affected by the enhancement.

For example, suppose there is a graphics operation that accounts for 10% of execution time (90% of execution time is not affected by any improvement in this operation) in an application, and by adding special hardware we can speed this up by a factor of 18. Then we would have 18/(1+(18-1)*0.9) = 18/16.3 = 1.1 as the overall speedup that we can expect from our extra hardware. The most that we could expect would be 1/0.9 = 1.11. That is, reducing execution time by 10% (making the operation take no time at all) gives us a 1.11X speedup.

We can use Amdahl’s law to quickly decide whether it is worth exploring an architectural change. For example, suppose that we could use twice as much hardware as in the preceding example, and make the graphics operation run 36 times faster. We already know, however, that we’re within about 1% of the optimal level of performance, so this is likely going to be a waste of resources. A common mistake is to identify an operation that can be sped up, and then try to maximize the speedup. Instead, we must look at the overall speedup that will result, and try to find a solution that balances the investment of resources against the expected gain. Depending on the cost of the additional graphics hardware, it may not even make sense to try to accelerate a graphics operation that accounts for just 10% of execution time in a specific application. The architect must also consider the importance of the application, and importance can have many definitions.
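The comparison between the 18X and 36X graphics hardware can be worked through with the same formula. This is a quick sketch of that calculation, with the function name chosen here for illustration.

```python
def overall_speedup(n, a):
    """Overall speedup when an enhancement of factor n leaves fraction a unaffected."""
    return n / (1 + (n - 1) * a)


unaffected = 0.9        # 90% of execution time is not graphics
limit = 1 / unaffected  # speedup if the graphics operation took no time

for factor in (18, 36):
    speedup = overall_speedup(factor, unaffected)
    shortfall = 100 * (limit - speedup) / limit
    print(f"{factor}X hardware: overall {speedup:.4f}X, "
          f"{shortfall:.2f}% below the {limit:.4f}X limit")
```

Doubling the hardware moves the overall speedup by well under one percent, which is the quantitative basis for calling the extra investment a likely waste.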