CmpSci 535 Discussion Reading 3

 

The Canonical Generations of Technology

 

In our review of modern computation, we saw a list of generations of technology:

 

0 Mechanical/Electromechanical

1 Vacuum tube

2 Transistor

3 Integrated circuit

4 Very Large Scale Integration/Microprocessor

5 Parallel

 

These generations, however, are not particularly indicative of anything other than changes in construction techniques. The first two generations represent the typical initial period in which existing technology is applied to a novel invention. It is not unlike the early days of automobiles in which some people tried to use steam engines to power carriages that were modifications of designs that were meant to be pulled by horses. Just as the complicated and dangerous steam mechanisms were inappropriate for a personal transportation device, the high-maintenance and high-power-consumption mechanical and vacuum tube technologies were inappropriate for a device that needed to employ greater and greater numbers of components to achieve its goals.

 

Semiconductor electronics were developed for other applications in which low power and high reliability were required, and were a natural match for computing. It should not be inferred that this was somehow a coincidence -- developments in science and engineering together with social pressures resulted in a general progression to more complex mechanisms in many domains. Computing happens to be one of the most visible and novel outgrowths of having crossed this particular threshold of complexity.

 

Once semiconductor technology was introduced to computing, the next three generations are really the same. The only difference between them is the density with which the electronics are packaged. The second generation is the point where each sealed package contains just one device. The third generation is the transitional period when multiple devices are in each package, but not enough to form a complete computer. The fourth generation begins at the point that a whole computer fits into one package. In terms of mainstream computing, we are still in the fourth generation.

 

Why?

 

Because the goal of placing an entire computer in one package has such significant benefits that it was done prematurely -- at a point where the technology could barely support a very minimal computer. The technology has only recently reached the point that a single package can hold enough components to equal the capabilities of the most powerful second and third generation machines of the 1970's and 1980's.

 

Except for comparatively minor advancements, very little has changed in mainstream computer architecture since second- and early third-generation machines reached a peak in the mid-60's. In the meantime, there has been one major digression in architectural approaches (complex instruction sets), but we have returned to the practices of that earlier period (reduced instruction set designs, memory hierarchy, multiple arithmetic units, pipelined processing) in the late 1980's. Most of the advancements have involved either the development of clever engineering tricks (to take advantage of features of the changing technology) or responses to empirical data taken from studies of application software.

 

Outside of the mainstream there has been a steady stream of research into novel architectural approaches, few of which have succeeded because of the cost of creating a viable base of software. There have been database machines, dataflow processors, parallel processors of many varieties, logic machines, neural network machines, fuzzy logic machines, and so on. Some of these are in use in niche application markets. The more successful architectures are digital signal processors (used in modems, VCRs, phones, CD/DVD players, talking toys, etc.), fault tolerant processors (used in banks, aircraft, medical equipment, etc.), and image processors (used in video games, TVs, manufacturing processes, and medical equipment).

 

The Future

 

Many people believe we are on the verge of entering (or have recently entered) a new generation in which the mainstream architecture changes significantly. This generation will employ parallel processing. Currently, parallelism is used quite conservatively -- mainstream processors now typically have multiple arithmetic elements that can operate simultaneously. For example the Pentium can perform two arithmetic operations at once in many cases.

 

Parallel processing will expand to include specialized independent computers in a single package. Intel, for example, has a corporate vision of a microprocessor with several units for executing normal instruction streams, specialized units for generating graphical displays and processing input signals (such as speech and video). This will be a heterogeneous parallel processor, in contrast to the homogeneous parallel processors that have been built of 10s to 1000s of identical processors.

 

The viability of alternative architectures will probably start to grow once the technology advances to the point that they cost little to add to mainstream processors. Until then, they will continue in niche markets. A prime example is the current generation of homogeneous massively parallel processors which were developed for science and engineering. These were large, expensive machines that attempted to push performance two or three orders of magnitude beyond uniprocessors. The problem is that they have always been one generation of technology behind (because of the time required to design a new implementation for the latest microprocessor), and they have inherent inefficiencies. The result is that they deliver only one or two magnitudes of increased performance while costing proportionally more than a contemporary uniprocessor.

 

Given the added cost of building software for parallel machines, most potential buyers have chosen to wait a few years until uniprocessors deliver similar performance (with no software development cost), or to employ only limited parallelism (manually spreading computation over several networked machines, or using a small-scale multiprocessor), or to employ specialized accelerators (vector, array, image, and signal processors). These are precisely the direction that microprocessors are going, so it is likely that we will see specialty manufacturers gradually fall aside and greater consolidation growing around a few standard architectures. In many ways, this development of the computer industry is a parallel to the automotive industry which saw a large number of manufacturers dwindle to just a few as the manufacturing and development infrastructure costs grew to support the complexity of the product.

 

Performance Measures

The most commonly quoted measure of performance is peak performance, which is the maximum rate at which some operation can be executed. For example, the number of no-ops that can be executed in a single cycle loop, entirely in cache.

MIPS -- million instructions per second

BIPS, BOPS, GIPS, GOPS -- billions (giga -- 109) of instructions/operations per second

TOPS -- Trillions or tera operation per second. Often teraops.

FLOPS -- floating point operations per second (MFLOPS, GFLOPS, TFLOPS)

LIPS -- logical inferences per second

CPS or COPS -- Connections per second

Beyond tera (1012) are peta (1015), exa (1018), yota (1021) and zeta(1024).

Peak performance is based on the clock rate of the processor, and on the minimum number of cycles per instruction (CPI) attainable. The clock rate depends on the technology used to implement the processor. The minimum CPI depends on the instruction set (usually just one or two instructions) and the speed of the innermost cache, which is usually on the same chip as the CPU in modern processors. Note that CPI may be less than one if pipelining and multiple function units are employed.

Thus, peak performance ignores the fact that there is a mix of CPI values that depend on the instruction set, the cache behavior, and the proportions in which instructions are executed.

Note, however, that as 1/t increases, peak performance increases linearly. This why clock rate is the second most often quoted performance figure. It is only a bit more meaningful than peak performance.

Can anybody think of a situation where a clock rate quotation may be useful?

When comparing two machines with the same architecture, but different clock rates, the ratio of the clock rates can indicate the difference in performance.

When is this comparison invalid? (when memory, I/O, communication, do not increase in speed by the same proportion)

MIPS is misleading because the work done by an instruction varies. A no-op does no work but gives a high mips rating. A floating point add does a lot of work, but may have a lower MIPS rating.

FLOPS, LIPS, COPS, TPS are more meaningful because they specify a particular operation, but they are ofte calculated in differetn ways for different machines. For example, a machine with a floating point divide instruction may get credit for executing only one FLOP, while a machine that has to do several operations to perform the divide may count this as several FLOPs.

What people really want to know is how fast a machine runs a given program. But that is a much more complex performance measurement. What factors determine this performance?

Algorithm and data set (size, length of execution, pattern of access -- affects memory sys)

Compiler (efficiency of code generated, ability to optimize access patterns)

Instruction set (how many instructions does it take to encode the program?)

Available operations (floating point support?)

Operating system (timeout period and cost, other processes, etc.)

CPI

Clock rate

Memory system performance (cache hit rate, miss penalty)

The only real way to accurately measure the performance of a system is to code a program of interest and execute it. Benchmarks have been created to serve as standard codes that can be tried on different machines for comparison purposes. Some popular ones include SPECmark, Perfect Club, Linpack, Livermore loops. There are also specialized benchmarks, such as those that test performance of PCs on popular packages.

The table on p. 17 of Kai Hwang's book, Advanced Computer Architecture, provides an interesting comparison:

 

Machine

Clock Rate

Peak Performance

CPU Time

VAX 11/780

5 MHz

1 MIPS

12 seconds

IBM RS/6000

25 MHz

18 MIPS

1 second

Why is the the VAX a 5:1 CPI, while the RS/6000 is only 1.4:1?

If the VAX is accelerated to 25 MHz, it will execute at 5 MIPS and have a CPU time of 2.4 seconds. How can it be that a machine (RS/6000) that has a 3.6X advantage in peak performance (over the accelerated VAX) gains only a factor of 2.4 improvement in execution time?

RS/6000 takes more instructions to execute the same program.

VAX may be able to hide extra memory references inside its longer instructions.

Benchmarks

The Systems Performance Evaluation Cooperative (SPEC) is a group of vendors who decided to set up a more meaningful benchmark that the simple synthetic ones that they could quote in their market literature. The original SPEC89 benchmark has been replaced by a 1992 version, so one must take care in identifying the version used when comparing machines of this period with later machines. One way to tell the difference is that SPEC89 reports a single figure called a SPECmark, while the newer benchmark reports separate integer and floating point figures (SPECin92 and SPECfp92).

The SPEC benchmark computes these figures by applying the geometric mean (see next section) to the measured execution times of the 6 integer and 14 floating point routines in the suite. Most of the individual codes are written in FORTRAN although 8 of the 20 are in C. Of the 14 floating point routines, 9 are double precision (64 bit) while the others are single precision. The floating point routines account for about 65,000 lines of code all together, while the integer codes are abour 124,000 lines long. The SPEC benchmark is thus a test based on real programs, kernels of real programs, or programs taken from research labs.

Each of the 20 routines is referred to by a reference number and a name. For example, 052.alvinn is a neural network training program written at CMU. It takes a series of low-resolution black-and-white images of a road from a moving vehicle and corresponding steering commands and trains a neural network to mimic the steering operations when shown new imagery.

026.compress compresses and decompresses a 1 MB file 20 times using the Unix compress utility. The compression routine is based on a dynamically constructed hash-table, and thus tests cache performance on unpredictable memory access patterns.

The 015.doduc routine is a FORTRAN monte-carlo simulation using double-precision floating point.

056.ear is a C program that simulates the acoustical properties of the ear.

008.espresso is a Boolean expression minimizer that is often found in digital logic CAD systems. Its input and output are in the form of truth tables and it emphasizes table look-up operations and Boolean operations.

023.eqntott is a companion to espresso that translates a Boolean expression into a truth table. This is an integer-based C routine that also uses the standard C library routine qsort quite extensively.

0.94fppp is a scalar floating-point intensive FORTRAN program that solves a quantum chemistry problem.

085.gcc as the name implies is a Gnu C compilation, in this case with Sun-3 (Motorola 68020 + 68882) assembler output.

090.hydro uses the Navier-Stokes equations to simulate the flow of material in galaxies with active jets shooting out perpendicular to their disks. It is a rather small code that is generally cache resident.

022.li is an execution of a 9-queens problem written recursively for a small Lisp interpreter that is itself written in C. The reason for 9 rather than 8 queens is simply to make the program work harder. Obviously, the recursive nature results in a test of deeper procedure calling than the other codes.

034.mdljdp2 solves the multi-body equations of motion for 500 atoms in a gas using double-precision floating point. It is a small memory routine that is often fully cachable. A second version, 077.mdljsp is simply a single-precision implementation that is provided for comparison purposes.

093.nasa7 is a nested suite of 7 benchmark kernels, all easily vectorizable and double precision. There are two fluid-dynamics programs, a matrix multiply, a matrix inversion, a matrix solution (block tridiagonal on one dimension of a 4D array), a 2D complex FFT, and a parallel Cholesky decomposition.

048.ora is an optical ray-tracing program that emphasizes double-precision floating point, especially square roots.

072.sc is a run of a simple spreadsheet program doing some recalculations.

013.spice2g6 is an execution of a popular analog circuit simulation program.

089.su2cor uses quantum chemistry techniques to solve for elementary particle masses.

078.swm256 is a small weather-prediction program that uses finite difference methods.

047tomcatv is a 2D mesh benchmark that is specifically designed to be easily vectorized.

039.wave5 is a large-memory program that simulates a plasma by solving Maxwell's equations.

SPEC was updated again in 2000, see http://www.spec.org for details.

Other popular benchmarks include the Perfect Club suite of scientific programs that were selected as being a challenge for supercomputer-class machines, the Transaction Processing Council's trio of benchmarks (TPC-A, TPC-B, and TPC-C) that test perfomance on applications in which transactions are remotely entered that must be posted to a database, and a series of benchmarks for image processing and computer vision (the Abingdon Cross, CMU IP Suite, DARPA IU Benchmark, etc.).

Mean Performance

If we run some benchmarks on a machine, and get execution time performance figures for each, we would like to use these to compute an average time for the machine. At first, we might be tempted to simply average the individual execution times.

But this is not proportional to the execution time for a typical mix of programs, and is thus misleading. Suppose the fastest program in the suite of benchmarks does some trivial operation that we never expect our machine to execute in practice. It distorts the average, leading us to believe that the machine will provide better performance than it really does.

One approach to addressing this problem is to determine a weight for each of the benchmark codes -- perhaps this reflects an estimate of how much of the machine's expected load will mimic that of each particular code. By multiplying the time of each benchmark by the weight before summing, we get an overall performance that better represents what we can expect on our anticipated load.

This is adequate for a single machine, but what if we want to compare two machines and we are given their execution time on several codes. If we use the arithmetic mean, we can get confusing results (adapted from Fig 2.7 in the text):

 

 

A

B

A/A

B/A

P1+P2/2

P1+P2/2

A/B

B/B

Program 1

1

10

1

10

 

 

0.1

1

Program 2

1000

100

1

0.1

 

 

10

1

Arithmetic

500.5

55

1

0.109

5.05

5.05

9.1

1

 

From this we see that the arithmetic mean of the execution times on each machine is different by a factor of about 9, yet the average of the relative times for the machines is the same. The problem here is the same as the algebraic property that a product of sums does not equal a sum of products in general. Because we want to determine their relative performance (a (inverse) multiplicative operation), we should use an average that is based on multiplicative operations.

The geometric mean is such an average:

All the geometric mean does is determine the value such that if m of them are multiplied together, they equal the product of the data values (think of this as occupying a volume in space, and computing the length of a line that spans the volume diagonally).

 

 

A

B

A/A

B/A

P1*P2^0.5

P1*P2^0.5

A/B

B/B

Program 1

1

10

1

10

 

 

0.1

1

Program 2

1000

100

1

0.1

 

 

10

1

Geometric

31.6

31.6

1

1

1

1

1

1

Here we see that the ratio of the average of the times is the same as the average of the ratios. The two machines are actually comparable in performance. This was the reasoning that went into the use of the geometric mean by the SPEC benchmark, However, the flaw in this is that the numbers being averaged in this case are ratios. The practical result is that outlying values have a greater impact than they should. Thus, we should use an average that inverts the values. The mean that does this is called the Harmonic mean.

With the harmonic mean, we get an average of the ratios that does not give extra weight to benchmarks that have a few high ratios. Why hasn't SPEC switched to using this? They have a legacy to preserve (changing the mean would make it difficult to compare to older reports). But there is also a competitive advantage to using the geometric mean, as it allows a marketing department to focus its energy toward boosting performance on one or two of the benchmarks to improve overall performance, rather than having to work hard on every one of the benchmarks.

 

Non-speed Measures of Performance

 

During the period when microprocessors were trying to catch up with mainframes and supercomputers, speed was the dominating factor in reporting performance. Even after core processor performance reached supercomputer levels (as measured by CPU benchmarks), microprocessors were unable to directly substitute for their larger counterparts. The reasons boil down to the limited number of connections that can be brought out of the small area of a single IC. A high-end microprocessor may have a few hundred connections, but many of these merely supply power to the circuitry. A mainframe, on the other hand, may support several thousand external connections. It naturally can move data between memory and the processor in larger chunks and thus at higher rates. Similarly, I/O in a mainframe or supercomputer dedicates many more connections to supporting the transfer of massive amounts of data into and out of memory.

 

CPU speed is thus just one aspect of performance. Memory and I/O transfer rates impact performance on larger tasks. Separate benchmarks may be used to characterize these aspects of performance. The overall impact of improving this aspect of performance is better throughput, as might be measured in a whole-application context by the TPC benchmarks or the SPECweb benchmarks.

 

Beyond throughput, there are many other aspects of an architecture that can be measured and which can be considered a form of performance. These include: power, space, fault tolerance, and real-time behavior.

 

Power

 

As computers have been reduced in size to enter new application domains, a major impediment to further reductions has been power consumption. On a personal, practical level, it is frustrating to have a laptop computer run out of battery power after an hour or two of work. This could be avoided by carrying a ten-pound battery pack, but the inconvenience would outweigh (literally!) the added benefit.

 

For other applications, the power consumption of a computer can determine whether it can be used at all. For an exotic example, consider a planetary space probe that has a fixed power budget for the life of its mission. A certain fraction of the budget is allocated to computation, and no more power is available. Thus, engineers on the project look first at power consumption and view speed as a secondary consideration. In a less exotic, the processor in a talking greeting card must run on an extremely small amount of power. The rate of computation may be fixed, and rather small, so speed is almost irrelevant. But the size and cost of the battery that can fit in the card is a key factor in making it possible to manufacture and sell for a profit.

Power may be measured in several ways. The metric is watts (or milliwatts, microwatts, or nanowatts), however, the measurement methodology must be taken into consideration. Is the power measurement peak, average, or minimum? Does it account for various low power modes that the processor may support? How does it relate to clock rate (power consumption drops dramatically at lower clock rates)? Is it measured for a real application, or for a synthetic workload?

 

The latter question is often a surprise to software engineers -- different programs consume power at different rates. This was demonstrated quite dramatically in the author's lab, where an older electrical system was being taxed near its limit. Circuit breakers would blow periodically, and the problem was eventually traced to the running of a floating-point intensive application on a larger machine. The floating point logic in this system would normally draw very little power, but one program would push it to full utilization and cause an upsurge in power consumption of several watts. There is actually research effort going into compiler and operating system optimizations that spread out work to keep power consumption steady in some applications.

 

Special low-power architectures have been developed for use in portable and handheld devices. These use special techniques, such as turning off banks of memory and shutting down I/O connections when they are not needed. They often also vary clock speeds to reduce power consumption during idle time, and support various levels of power saving, such as "sleep" mode.

 

At the other end of the scale, supercomputers are heavily impacted by power consumption. The most powerful systems are typically built from thousands of high-performance microprocessors. While the power consumption of a few extra watts in a PC may not seem significant, multiplied by a factor of ten thousand, it becomes a major consideration. As an illustrative example, the first machine to reach Teraflop performance, ASCI Red, consisted of over 9000 Intel Pentium Pro processors. In planning for the facility, it was realized that the machine would need its own 1.6 megawatt transformer. This was so far beyond prior experience that the procurement people at Sandia National Laboratory believed it had to be a mistake and instead budgeted for a 600 kilowatt transformer. When the mistake was discovered, it was too late to increase the budget, and so it was necessary to postpone the resurfacing of a parking lot to make up for the deficit.

 

Space

 

Another aspect of an architecture is the physical space that it occupies. Smaller space enables use in a wider range of applications. Some researchers talk about "smart sand" in which they envision sand-grain sized computers that operate on static electricity from their surrounding environment. Such computers could be used in packaging for consumer goods, medical instrumentation, environmental monitoring, and so on.

 

High performance processors typically cannot operate in isolation. They require separate memory, bus interfaces, I/O interfaces, network interfaces, graphics accelerators, and other components to become a usable system. But as microelectronics have grown smaller, it has been possible to move more of these components onto the same IC. For certain applications, System On a Chip (SOC) architectures are highly advantageous. Even for desktop and laptop systems, reducing the number of components helps to reduce size, cost, and power consumption.

 

Fault Tolerance

 

Sometimes we simply cannot afford to have a computer fail. Life critical examples include aircraft control systems, medical life support and radiation therapy equipment, nuclear reactors, and so on. Even in cases where a failure isn't life-threatening, the cost may exceed the price of the machine. For example, if a bank loses access to its accounts for even a few minutes on a busy stock market trading day, it can lose millions of dollars.

It's a common misperception that computers are only as infallible as their software. As transistors have become smaller and packed more densely, they have become less reliable. Radiation from a cosmic ray can easily disrupt a transistor in a modern processor, and microelectronics are becoming more susceptible to electromagnetic interference. When a fault is transient in nature it is called a soft fault, and when it is permanent, it is called a hard fault. Because processors are now so complex, it is not possible to economically test them exhaustively. Variations in manufacturing process lead some chips to have faults while others do not. The ones that are most frustrating are those that are intermittent, perhaps with temperature or other environmental factors, as these may past testing and only exhibit occasional problems in use.

 

One approach to fault tolerance is to build more robust circuits. The use of different materials (at greater expense) can make integrated circuits tolerant of radiation. Circuits can be more thoroughly tested over a longer period, with varying temperatures and vibration or physical shocking to eliminate those that are unreliable. Operation over a wider temperature range or in the presence of radiation or under strong vibrations are another dimension of performance.

 

Circuits can be replicated so that the same operation can be performed in parallel, and the results compared. Disagreement indicates a fault and voting selects the correct result. Usually circuits are replicated in threes, and thus this approach is called triple modular redundancy (TMR).

 

Redundancy can be applied at the system level, with processors operating in parallel and voting. When a processor disagrees with the majority, the others go on and the disagreeable processor is shut down for diagnostics and possible replacement.

 

Real Time

 

In some cases a computer must meet a deadline or catastrophic results ensue. If a nuclear reactor starts to lose coolant, the control rods must be dropped into place before the heat warps their tubes. While it might seem that speed is the most important factor in meeting such a deadline, consistency in performance is really more critical. Because the computer cannot be late, system engineers must always assume worst case performance: that the caches will always miss, the pipelines will operate at minimum efficiency, that disks will be as slow as possible, and so on. If they were to rely on better performance, then there is the possibility that all of the worst cases could coincide and cause a failure to meet the hard deadline.

 

Some processors are thus more effective for real-time processing than others that may have significantly better average speed. Some features that help improve predictability of timing for real time include being able to explicitly load certain data into cache memory and lock it there so that it is known which variables can always be accessed at maximum speed. Being able to explicitly schedule instructions for execution by multiple functional units, with known execution times (versus the now common practice of allowing the hardware to dynamically reschedule instructions into different orders).

 

In Summary

 

Performance metrics are as diverse as the application domains in which computers are used. Which ones are most important depend on how the system will be used. Metrics based purely on system characteristics, such as clock rate, are poor predictors of actual performance. The best approach is to actually measure the performance of a system on the target application mix. However, this may not be practical. In that case, standard benchmarks may be used as a guide, but only if their behavior is well understood and can be related to the target application.

 

© Copyright 1995, 1996, 2001 Charles C. Weems Jr. All rights reserved.


Back to Chip Weems' home page.


Back to courses index page.


Back to Computer Science Department home page.