Lecture 1
Introduction
Course subject, goals, and prerequisites
Syllabus and schedule
Office hours, location
Course organization
Grading
Project -- team composition and size, choosing members
Computer Generations
Technological View of History
If you examine computer technology with the 20-20 hindsight of a historian, it is possible to identify six distinct generations of hardware. The first three generations are distinguished by fundamental changes in technology:
0 - Electro-mechanical (fingers, stones, beads, gears, motors, relays), Prehistory to 1950's
1 - Vacuum tube (electronic), 1940's to 1960's
2 - Transistor (discrete, solid state), 1960's to early 1970's
The successive generations simply pack more second-generation devices (transistors) into a smaller space.
3 - Integrated circuit (TTL bipolar -- small scale, medium scale, large scale integration) Late 1960's to late 1970's.
4 - VLSI (CMOS -- sufficient density to place an entire CPU on a chip), mid 1970's to present.
5 - Parallel systems (CMOS -- sufficient density to place multiple processors or functional units on a chip), early 1980's to present
Notice that these generations had significant overlap. New technologies rarely replace old ones overnight. The pattern that can be regularly observed in the transition between generations is that initially the architectures of the previous generation are faithfully translated into the new technology. Then, once the designers become comfortable with the new technology, they begin to explore the additional capabilities that it offers and new architectures are created that push the limits of the technology. The desire to exceed the current capabilities results in demand for technological advance and the cycle repeats.
Functional View of History
An alternative way to view the progress of technology is by its functionality. The very first aids to computing (fingers, stones, beads) were memory aids. All of the computation was done by the human, but using memory devices reduced errors and enabled working with more digits of precision. In some cases, this also improved speed of calculation (watch an expert abacus operator some time), but the primary goal was to improve accuracy. Greater accuracy enables more complex computations to be performed.
The next step in functionality is to have the device do some of the computation. A device like Pascal's box performs carries automatically. Mechanical calculators grow in complexity to automate more functions (subtraction, multiplication, etc.). These further remove sources of error and enable more complex processing, where longer chains of operations can take place with less opportunity for error.
Of course, error still creeps in with long sequences of manual operations, so the next level of functionality is automating the steps. Babbage's difference engine was created to print mathematical tables for use in calculations, and one of its major features is that it stamps the results directly into copper printing plates to avoid the possibility of transcription errors. Machines with fixed instruction sequences are the first step, and then comes general programmability, first with externally supplied programs (Jacquard loom, Analytical Engine, ENIAC, Harvard Mark series, etc.). Eventually, von Neumann proposes the stored program, and we have the modern computer architecture.
Even with large electronic programmable calculators such as ENIAC, the goal was still to enhance accuracy (ENIAC computed army artillery ballistics tables, where errors could have quite serious consequences). As before, enhanced accuracy means more complex calculations can be done, so eventually, speed becomes a significant enabler. But another top priority is reliability (long computations can't be completed on a machine that breaks down every few days). The first generation (tubes) suffers from regular failure of the devices (filaments that burn out, and vacuum that loses purity over time).
Here we see the motivation for moving from less reliable relays and vacuum tubes to solid state transistors that can operate for years without having to be replaced. At first, transistor machines are not much faster than their predecessors, but the more compact devices enable more sophisticated operations to be built in hardware.
Once reliability is reasonably assured, then speed takes over as a primary motivation for enabling more complex computation. However, as the speed of the topmost systems increases, this opens up space in the market for machines with lower performance, and we have a significant new influence in computer design: market expansion.
To summarize the functional history of computing: memory aids, automated calculation, and programmability are all directed toward greater accuracy. Then speed and reliability become a factor, with reliability giving way to speed (although reliability is becoming a factor once again). As speeds increase, a new dimension opens up for computing devices that are not the fastest, but that supply enough computational capability for marketable applications. Market expansion is multifaceted, with factors of cost, size, power, environmental tolerance, backward compatibility, etc. all coming into play.
Overlap and the Pace of Progress
There has not been a radical shift in technology since the transistor was introduced, but rather a steady progression in capability. However, in another theme that we shall see recurring throughout the history of architecture, a steady progression occasionally crosses a threshold that leads to a radical change in how we perceive the technology. When such a threshold is crossed in device density, for example, a new generation results.


In the first three generations (0 through 2), there was complete flexibility of design because every circuit was developed out of elementary electronic components for every computer. If you wanted a particular kind of flip-flop, you designed and built the circuit yourself, using analog components. Computer designers had many options for such circuit designs and even memory technology. For example, one could design a computer to use negative and positive voltages to represent logic states instead of circuits that are either on or off.
In the third generation, however, component manufacturers started supplying larger building blocks. It became too costly to develop unique circuits, so designers built computers out of commonly available building blocks such as gates, flip-flops, registers, multiplexers, adders, and even complete arithmetic-logic units. There was still a reasonable amount of flexibility in design -- many unique designs could be created from these building blocks. The advantage was that anyone who understood digital logic and a modest amount of analog electronics could design and build a computer. The cost of development was dramatically reduced (especially as solid state memory began to replace magnetic core) and so designs proliferated. Any company that had some background in digital electronics could enter the market quickly with a new design.
One factor that enabled this explosion in architectural diversity was the lack of a large base of existing software. It was this generation that saw the first large-scale operating systems developed, and many companies could enter the lower cost segment of the market with fairly minimal operating systems. Also at this time, customers were still accustomed to programming in assembly language and using fairly simple compilers. Thus, the cost of creating a new architecture, in terms of both hardware and software, was at an all-time low.

The Fourth Generation
With the fourth generation, there was in one sense a return to the earlier generations -- for the architect, developing a computer meant working at the transistor level again. For the chip designers, it meant that they were no longer just building bits and pieces -- they were designing a whole computer, with many larger implications that they were not accustomed to handling. Thus, the cost to develop an architecture increased. But that increase was in non-recurring engineering costs, rather than recurring component costs. In past generations, the cost of producing the hardware was much closer to the cost of designing it. With the fourth generation, the cost of production fell dramatically below the cost of design, with a corresponding decrease in profit margin. Where it had previously been profitable to design and build (often by hand) a few copies of a large machine, it became necessary to amortize the design cost over thousands to millions of units.
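The amortization argument can be made concrete with a little arithmetic. The Python sketch below is only an illustration: the function name cost_per_unit and all of the dollar figures are hypothetical, chosen to show the shape of the effect rather than to describe any real product.

```python
def cost_per_unit(design_cost, production_cost_per_unit, units_sold):
    """Spread a one-time (non-recurring) design cost over every unit sold."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return design_cost / units_sold + production_cost_per_unit


# Hypothetical figures: a $50 million design effort and $20 to produce each chip.
for volume in (10, 1_000, 1_000_000):
    print(volume, cost_per_unit(50_000_000, 20, volume))
```

With these made-up numbers, the design cost dominates the price of each unit until the volume reaches the hundreds of thousands, which is the sense in which a fourth-generation design has to be sold in very large quantities.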
Creating a market for so many processors also implies a growth in the software base -- such a large market requires many different applications, each of which has its own software. Creating such a software base involves a larger investment in software development tools by the manufacturer (and third parties), which increases development costs. And once such a software base is created, issues of compatibility begin to dominate issues of increased performance.
For the manufacturers of systems, the fourth generation suddenly resulted in a choice between developing an architecture and buying an off-the-shelf design. Economically this meant that one could invest a great deal in a new design (with the possible benefit of having a distinguished product that would take a large share of the market) or one could invest very little and join the ranks of clone manufacturers where commodity market forces result in small profit margins and high volumes.
Both paths have their risks. A novel architecture that fails to win a large share of the market is a high-loss venture. On the other hand, a small inefficiency in the design and manufacture of a commodity system, such as a PC, can result in uncompetitive pricing and a corresponding loss.
The result of the fourth generation was that processor design shifted from the early computer manufacturers (Univac, Sperry, Remington, Control Data, Honeywell, RCA, Singer) to companies that built chips. A few of the early makers (IBM, DEC, etc.) made this transition to some extent, but the shift in technology provided existing chip fabrication companies like Intel, Motorola, Texas Instruments, etc. the opportunity to enter the computer business on an even level with the long-established manufacturers.
If you examine the late third-generation architectures and compare them to the early fourth-generation architectures, it is clear that there was a significant step backward in architectural sophistication, and that only recently have the fourth-generation designs progressed beyond those of the third generation. Look, for example, at the CDC and Cray vector processors of the 1960s and 1970s, which had instruction level parallel processing comparable to microprocessors that did not appear until thirty years later. This is partly because the chip manufacturers had less experience in designing computers, and partly because, in the rush to embrace the new and still immature technology, there was a significant sacrifice in capability. For example, mainframe manufacturers had no limits on the number of gates they could use in a design, but early VLSI was highly constrained and so sophistication had to be sacrificed merely to get a barely functional processor on one chip.
Now, however, there is enough capacity on a single chip (over 200 million transistors) to equal the best mainframe designs. Thus, mainframes are now being surpassed in raw computational speed. The remaining advantage of the mainframe is essentially that it is designed for very large data processing applications that involve online transaction processing and access to immense databases. Mainframes can still outperform microprocessors in terms of I/O capacity and memory bandwidth, but this is more due to the implementation of their I/O and memory subsystems than to their CPU architecture.
In the fourth generation, there was a great falling out and standardization of architectures. Because the use of an off-the-shelf microprocessor made it possible for a computer company to cheaply bring a small computer to market, there was a disincentive to developing a new architecture. That would entail developing a new chip and a complete software infrastructure, which is very costly. Except for a few large companies from the earlier generation who could afford to invest in such an effort, there was no way for a computer manufacturer to get into the business with a competitive product that did not use one of the standard microprocessors.
As we all know, the standards for mainstream microprocessors effectively shrank to the Intel and Motorola designs, with the former being in the IBM PC and clones and the latter being in the Apple products. In the late 1980s, this expanded to include the RISC processors such as SPARC, MIPS, PowerPC, HP Precision Architecture, and Alpha. The Motorola 68K architecture was replaced by the PowerPC in the Apple products, and shifted to being used primarily as an embedded processor.
When Intel announced that they would produce a new 64-bit architecture (Itanium), HP, which collaborated on the design, announced that it would discontinue the PA-RISC once the Intel design reached performance levels comparable to their own. Shortly after that, MIPS made a similar announcement regarding their 64-bit architecture (although they continue to produce the 32-bit MIPS for embedded use and computer game consoles). In 2001, Compaq (now part of HP), which had bought the DEC Alpha, announced that it would discontinue the architecture. In spite of these announcements, these architectures were still in production for several years because the Intel 64-bit design had yet to deliver equivalent performance.
AMD, which had successfully cloned the Intel 32-bit architecture, observed an opportunity and developed a 64-bit extension of that architecture which became more successful than the Itanium. Intel was thus forced to adopt a compatible extension and start selling a 64-bit Pentium, which was in direct competition with their other 64-bit machine. Intel was also clouding the water around the Itanium by continuing to accelerate the 32-bit architecture to stay ahead of AMD, with the result that the combination of raw performance and more mature software tools made the IA-32 faster than the Itanium on many applications.
As of 2007, the remaining mainstream microprocessor architectures are IA-32, 64-bit AMD/Intel, Itanium, SPARC, and Power/PowerPC.
The reduction of processor architectures to a small number of standards had the effect of turning the manufacture of computers into a commodity market. Anybody could start up a computer company, and their success depended almost entirely on whether they could bring a system to market at a competitive price. Volume parts discounts, reduced number of components, and robotic assembly or exploitation of unskilled assembly labor have more to do with product success than does architecture in a mass market such as personal computers. With few exceptions, personal computer companies are no longer in the technology business, but are in the manufacturing, distribution, and marketing business.
Apple is one obvious exception, although their technology efforts are mainly in accessories that connect to the computer (iPod, AppleTV, iPhone, wireless hubs) and in package design (iMac, Mac Mini, X-Serve). They have effectively adopted a luxury-car strategy for their consumer products, going after the small segment of the market that will pay a premium for better design and engineering, instead of seeking the lowest price. With their adoption in 2006 of the Intel architecture over the PowerPC, we can expect IBM to redirect its focus for the Power architecture. Indeed, the embedding of a PowerPC core in the Cell BE processor is an indication of one direction of this shift.
In the 1980s and 1990s, workstation manufacturers found a niche in providing machines that were more powerful than personal computers at a slightly higher price. Essentially they took over the minicomputer market of the 1970s by using the same processors as found in PCs, but enhancing their performance with external accelerants such as cache memory, faster buses, larger disks, greater memory capacity, and support for graphical interfaces. Eventually, these manufacturers realized that the higher price point of their market could support the development of proprietary architectures that would outperform PC processors and give them a clear advantage -- hence the development of the RISC architectures for workstations.
However, as component densities grew, there were fewer features that could be added on that were not already in commodity microprocessors. Nor did workstations ever penetrate the larger market. Instead, they were chased up the ladder toward using their costly, higher performance processors to build small servers. The PCs are also advancing on this market, and squeezing these manufacturers up against the large servers. Presently, none of these designs has the necessary structure or software support to make the leap to the large server level. In that arena, the mainframe systems are well established, with designs that emphasize high throughput, high reliability, and scalability. Implementing these features involves different approaches to I/O, to memory, to fault tolerance, and to operating system design. As an analogy, a company can build a delivery service around cars, vans, and airplanes. These would be able to carry a large number of small packages quickly. However, for moving vast amounts of goods, this model is inefficient and one needs to shift to ships, trucks, and railroads. Microprocessors have been pursuing the express delivery model, while mainframes have focused on the cargo transport model.
Of course, given very inexpensive microprocessors, it is possible to build mainframe-equivalent performance in a cluster for some applications. At that point, cost of service and downtime can grow, due to the large numbers of relatively unreliable processors in such a cluster. Some users have found, however, that it is more economical to simply discard a cheap machine that dies than to try to fix it. When the cost of replacement is a few hundred dollars, the cost of support personnel can exceed the replacement cost in just a few hours. Thus, keeping spares in reserve, automatically bringing them on line, and having a lower-level technician periodically go through the data center replacing the dead machines with new ones is an economical alternative. One other factor that must be included, however, is the energy cost. And clusters do tend to be considerably less energy efficient than mainframes.
Inevitably, there is only so much performance that one can wring from a single processor on a single chip. The next step in architecture to obtain greater performance is to use multiple processors.
The Fifth Generation
We have been in the fifth generation for some time now. Arguably, it has always been with us. Charles Babbage recognized the potential for parallel processing, as did John von Neumann. Parallel processors were under construction as early as 1963, and commercial ones were being delivered by the end of that decade. Whenever there has been a market for more processing power than can be delivered by a single CPU, there have been parallel machines to sell into that market. However, they only began to be mass produced in the late 1990s, in the form of multiprocessor servers. As of 2002, there was one production mainstream microprocessor (the IBM Power4) that placed two processors on a single chip. By 2006, dual cores became commonplace and quad-cores appeared in 2007. Intel forecasts the ability to eventually place 80 cores on a chip.
For a while, in the early 1990s, some people persisted in trying to distinguish ULSI or wafer-scale fabrication from VLSI chips. Pretty much everyone admits now that we are really just looking at more of the same in terms of the technology. We can see that the fourth generation was not a change in technology but a crossing of a capability threshold. Similarly, the fifth generation is a crossing of a capability threshold wherein we can build machines that have multiple processors on a chip. Even with single-processor chips, we at last have sufficient power (and are sufficiently close to the limits of power) that multiprocessor systems are becoming cost competitive as small servers and low-end mainframes.
There is still some debate as to whether we are seeing a new architectural generation, or whether we should wait for something more revolutionary before declaring a new generation. The research parallel processors of the 1990s had this sort of radical nature, but the only designs from that period that have survived are the ones that were more evolutionary. There is some indication, however, that with current shifts in technology, some of those approaches may be revived.
If we examine the preceding generations, we see a pattern that is likely to persist through this new generation. That pattern is that the first architectures of a new generation initially try to mimic the architectures of the preceding generations, and then the architects discover new possibilities that lead to the emergence of new architectures. For example, the first vacuum tube machines were essentially electronic differential analyzers, but people quickly discovered the stored program concept and architecture advanced. Transistor machines were initially just like vacuum tube machines (the IBM 7090 was a 709 made of transistors -- in fact the name was originally conceived as 709T). Then notions such as multiprocessing developed.
The first IC processors were just smaller transistorized designs, but then such concepts as caching, virtual memory, and microprogramming developed. VLSI machines were initially patterned after minicomputers -- a step backward to cacheless, single user designs for embedded applications, but they quickly advanced to match IC machines and then went past them with reduced instruction sets, pipelines, and multiple functional units (architectural features that had been limited to supercomputers and high-end mainframes previously) and added branch prediction and speculation.
Now we see supercomputers moving into massive parallelism, and more mainstream architectures are going to multiple processors. As in previous generations, there is an attempt to pattern the new generation after the old. Thus we see processors with multiple functional units that try to hide their presence under a traditional instruction set. We see multiprocessors that try hard to maintain a single memory space and sequential semantics for process interactions. We see massively parallel machines that either have just a single control thread (SIMD) or try to hide their multiple threads behind a single thread model (SPMD). The degree of parallelism in most machines is limited to two or four processors, and they are used to support multitasking in operating systems. Chip multiprocessors are simply packing these designs onto a single IC.
As devices on the chip shrink, and clock rates rise, the length of wires on the chip becomes a major limiter of performance. Speeds can only increase if the data can traverse the wires in time to reach its destination as it is needed. This is leading to some rethinking of architecture to consider the impact of having vast numbers of gates on a chip, but where only a fraction of them are reachable in a clock cycle. Beyond chip multiprocessors, we may see designs that distribute processing among more locations on a chip, with limited resources at each location and a need to coordinate the flow of data and instructions through those loci.
The fifth generation is one focus of this course. Can we predict how the fifth generation will evolve?
Software Support
Popular languages today support the fiction of a simple programming model based on a cacheless von Neumann architecture. In this model we access values one at a time. The time to access any value is the same as to access any other. If we want to perform an operation on an array of data, we must iterate through the index values of the array.
As experienced programmers know, however, modern machines have caches, multiple pipelines, and virtual memory. It is important in terms of performance to try to arrange computations so that cache locality is preserved, and so that a sequence of memory accesses follows the line orientation of the cache. Instructions should be scheduled to more efficiently utilize pipeline slots. We also try to arrange our program code so that we do not cause excessive paging by the memory system, because accessing disk costs a factor of up to 1,000,000 in performance.
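As a rough illustration of what "following the line orientation of the cache" means, the Python sketch below counts how often a traversal of a row-major array moves to a different cache line. The function name count_line_fetches, the figure of eight elements per line, and the idea of a cache that remembers only the most recently touched line are all assumptions made for the illustration; real caches hold many lines and behave less starkly.

```python
def count_line_fetches(rows, cols, order, elems_per_line=8):
    """Count line changes while visiting every element of a row-major array.

    Hypothetical model: the cache remembers only the last line touched,
    so every change of line counts as one fetch from memory.
    """
    if order == "row":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    elif order == "col":
        indices = ((r, c) for c in range(cols) for r in range(rows))
    else:
        raise ValueError("order must be 'row' or 'col'")

    fetches = 0
    current_line = None
    for r, c in indices:
        offset = r * cols + c            # row-major element offset
        line = offset // elems_per_line  # which cache line holds it
        if line != current_line:
            fetches += 1
            current_line = line
    return fetches


print(count_line_fetches(64, 64, "row"))  # inner loop walks along a row
print(count_line_fetches(64, 64, "col"))  # inner loop walks down a column
```

Under these assumptions, the 64 x 64 array needs 512 line fetches when the inner loop runs along a row and 4096 when it runs down a column: the same computation, but eight times the memory traffic.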
To some extent, compilers can help with this by analyzing our code and rearranging loops and index expressions to improve locality and operation schedules. However, such automatic aids are easily thwarted by clever programming, especially when a programmer has previously hand-optimized a piece of code for a specific machine.
Furthermore, as architectures become more complex, they require more sophistication from the compiler to optimally schedule functional units and avoid bubbles in pipelines. Architects are placing more of the burden of performance on the compiler writers to try to automatically map the simple linguistic programming model to complex underlying architectures.
There is one camp that argues that programmers should not have to be concerned about the underlying architecture -- that there is an architecture independent programming model that will be sufficient to express all algorithms for all machines. Some take this argument to mean that there should be a standard parallel model. The result is that we see a number of proposed architecture-independent models such as Linda, PRAM, and BSP that work on multiple machines, but with only modest efficiency.
Others take architectural independence to mean that we should be able to stay with the sequential model and that compilers should take care of translating to the different parallel architectures. This is the basis for parallelizing and vectorizing compiler research. Unfortunately, most of these compilers are examples of "Pavlov's programmers," in that they promise much, then require the programmer to insert compiler directives as comments to reach the promised goals. In the end, the programmer has written a parallel program in the comments.
Another camp argues that programmers want to have the option to work at a level of abstraction that reveals the machine architecture, so that they can optimize their code. This is, for example, the philosophy of C and to some extent Fortran. Thus, for many parallel machines, we see machine-specific language extensions or libraries that are not portable to other architectures. If the programmer is willing to sacrifice some efficiency, then higher-level abstractions such as the PVM and MPI standard libraries are available.
Attempts have been made to develop parallel languages, but these have been only marginally successful as they are often biased toward a particular model of hardware parallelism. Newer languages such as Java and C# include support for threading, which offers some potential to exploit parallelism. But the jury is still out on whether these rather simplistic models will be a boon or a curse. Experience has shown that threaded programs are very difficult to debug and to make reliable. Transactional memory may be a way out of this trap.
We thus have the dilemma of trying to achieve portability but with a lack of performance, or achieving performance at the cost of portability. How might we approach this problem?
Taxonomies
The most popular taxonomy in parallel processing is one proposed by Michael Flynn in 1972. It defines an orthogonal taxonomy of instruction streams and data streams.
SISD -- uniprocessors
SIMD -- an instruction stream directs the same operations to be performed on multiple data values simultaneously (array processors)
MIMD -- independent instruction and data streams that can interact (multiprocessors)
MISD -- multiple instruction streams operating on the same data stream
Flynn's taxonomy was augmented in the 1990s with SPMD (single program multiple data), in which the same program is executed on different data streams simultaneously, with synchronization for global control points.
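The SPMD idea can be sketched in a few lines of Python. This is a toy model, not a description of any particular machine: the function name spmd_normalize and the choice of computation are assumptions, and Python threads stand in for processors. Every thread runs the same program on its own part of the data, and a barrier plays the role of the global control point.

```python
import threading


def spmd_normalize(data_parts):
    """Run one program on every data part; return each part's share of the total."""
    n = len(data_parts)
    partial = [0] * n      # one slot per copy of the program
    shares = [0.0] * n
    barrier = threading.Barrier(n)

    def program(rank):
        # Phase 1: same code, different data.
        partial[rank] = sum(data_parts[rank])
        # Global control point: nobody proceeds until every copy has arrived.
        barrier.wait()
        # Phase 2: all partial sums are now safe to read.
        shares[rank] = partial[rank] / sum(partial)

    threads = [threading.Thread(target=program, args=(rank,)) for rank in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shares


print(spmd_normalize([[1, 2], [3], [4]]))
```

Without the barrier, a fast copy could read partial sums that slower copies have not yet written, which is exactly the kind of interaction the synchronization points are there to rule out.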
Can you think of examples of each of these modes of processing in everyday life? What is MISD useful for?
One popular division of the MIMD model is shared memory versus message passing. In the former, all processors have access to a global memory and they communicate via shared variables and exclusion mechanisms. In the latter, each processor owns some part of memory, and communication is via messages sent between processors.
Shared memory and message passing are actually programming models. They can be implemented through software on architectures that are purportedly of the other type. Architectures can provide specialized mechanisms to accelerate either model. At the lowest level, however, a memory fetch is just sending a message to a memory unit and getting a return message, while sending a message to request a value from another processor is just a form of memory reference.
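To see the two programming models side by side, here is a minimal Python sketch that sums the same data both ways. The function names are hypothetical. Note that Python threads always share memory underneath, so the message-passing version is a programming discipline layered on shared hardware, which is itself an instance of the point that one model can be implemented on top of the other.

```python
import queue
import threading


def sum_shared_memory(chunks):
    """Workers update one shared variable, guarded by an exclusion mechanism."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:             # only one worker may update total at a time
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(chunks):
    """Workers own their data and communicate only by sending messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # send the partial result as a message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


data = [[1, 2, 3], [4, 5], [6]]
print(sum_shared_memory(data), sum_message_passing(data))
```

Both functions compute the same result; what differs is where the coordination lives, in the lock around a shared variable or in the queue that carries the messages.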
Larry Snyder defined a set of "Type Architectures" that attempt to capture groups of potential architectures within which a single programming model can be applied. If one crosses from one type to another, it may be necessary to switch programming models in order to obtain efficiency (or vice versa). Snyder also considers the notion that computations may take place in type "phases" where a machine executes code written for one model and then for another and another and so on. The phase abstraction does not deal well, however, with the notion that there may be multiple models interacting at one time.
Where Flynn's and many other taxonomies fall short is that they assume parallelism is homogeneous. However, that need not be the case. Consider a machine that has two separate floating-point adders, a multiplier, and an integer unit. It can execute four operations simultaneously, yet where does it fall in Flynn's classification?
We can take an analogy from elementary data structures, where we have arrays to represent homogeneous collections, and records (structs) to represent heterogeneous collections. In the case of parallel architectures, the collections are processors. Let's explore that analogy a bit more.
What kinds of components can arrays have? Can records have?
This is called orthogonal parallelism.
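One way to read the array/record analogy is as a pair of data types that can be nested inside each other. The Python sketch below is only a model of that reading: the class and function names are hypothetical, and the record describes the four-unit machine from the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FunctionalUnit:
    kind: str  # for example "fp_add", "fp_mul", or "int"


@dataclass
class HeterogeneousCore:
    """Record-like: a fixed set of named components of different kinds."""
    fp_adder_a: FunctionalUnit
    fp_adder_b: FunctionalUnit
    multiplier: FunctionalUnit
    integer_unit: FunctionalUnit

    def units(self) -> List[FunctionalUnit]:
        return [self.fp_adder_a, self.fp_adder_b, self.multiplier, self.integer_unit]


def make_core() -> HeterogeneousCore:
    return HeterogeneousCore(
        FunctionalUnit("fp_add"),
        FunctionalUnit("fp_add"),
        FunctionalUnit("fp_mul"),
        FunctionalUnit("int"),
    )


def make_multiprocessor(n: int) -> List[HeterogeneousCore]:
    """Array-like: n identical components, each of which is itself a record."""
    return [make_core() for _ in range(n)]


def peak_ops_per_cycle(machine: List[HeterogeneousCore]) -> int:
    # Upper bound only: assumes every unit in every core can be kept busy.
    return sum(len(core.units()) for core in machine)


print(peak_ops_per_cycle(make_multiprocessor(4)))
```

In this model a machine is an array of records; the same types could just as easily be arranged as a record whose fields are arrays, for instance a core with an array of identical adders beside a single multiplier.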
Can we see possibilities here for advancing architecture in the fifth generation?
By the way -- don't try to stretch the analogy to include pointers and classes. There are reconfigurable parallel processors, but the technology isn't yet up to supporting metamorphic architectures, and we have a lot to do before our software is up to the task as well. Processors that can dynamically instantiate themselves, modify their own structures, and group themselves into collections dynamically are mostly at the stage of science fiction, unless one considers some distributed web applications to represent a primitive case of this class of system.