Lecture 2

 

Technology and Economics of High Performance Systems

The basic technology of computer architecture today is the VLSI chip. To understand the capacity and the rate of progress in VLSI, it helps to have an understanding of the process by which a chip is built. The first step is to create the substrate for the chip, which is the platform upon which all of the circuitry is constructed. Most chips today are built on substrates formed of very pure crystalline silicon.

Silicon begins as sand (SiO2) that is purified into silicon of 99.9999999% purity. In addition, very old sand is employed so that fewer radioactive isotopes of silicon are present in the purified material. If a radioactive atom of silicon decays after it has been built into a chip, the ejected decay products have sufficient mass and energy to damage the circuitry of the chip.

The pure silicon is then turned into a crystal ingot that is grown through a very carefully controlled deposition process. A typical ingot used today is 12 inches in diameter (30 cm), and 12 to 18 inches long. Ingot diameter has increased steadily from about 2 inches (5 cm) in the early 1970s, to 4 inches (10 cm) in the late 1970s, 6 inches (15 cm) in the mid 1980s, 8 inches (20 cm) in the early 1990s, to the current size in the late 1990s. The ingot is machined into a uniform cylinder and then a flat is ground into one side to facilitate the manufacturing process.

The cylindrical ingot is then sliced by diamond saw into thin wafers that are polished and coated with a protective layer of SiO2 insulation (about 0.05 µm thick). A material called photoresist is then applied; it resists acid as long as it is not exposed to ultraviolet light. The photoresist is then exposed to an ultraviolet light source through a lens system and a 10X size mask made of quartz that has lines drawn in chromium on its surface. Quartz is used because it has an extremely low coefficient of thermal expansion, thereby reducing variations in the positioning of the drawn lines. The lines themselves are drawn directly from computer aided design files using very high precision positioning tables and steered electron beams.

After the photoresistive layer is exposed to ultraviolet light, the exposed areas are washed off with a solvent and then the exposed SiO2 is etched away. A very thin layer (<100 Å) of SiO2 is then applied, called the gate oxide. A conductive layer of polysilicon (polycrystalline silicon) is then deposited. Another layer of photoresist is added and a new pattern mask is used to expose it. The exposed resist is washed away and the exposed polysilicon and gate oxide are etched off. The chip is then doped with impurities that turn the exposed silicon into a semiconductor, a process called diffusion. A typical transistor now has the following geometry:

The photoresist is then removed and a new layer of SiO2 is added. Another photoresist/etch cycle cuts holes through the SiO2 insulator and then aluminum is deposited to form wires that contact the transistor through these holes. (Copper is used in additional layers in some processes. The most modern processes are actually built on a thin layer of insulator.)

In modern technology, there are actually many more layers and steps than this. Early MOS processes used one type of doping to create either positive or negative channel transistors. Complementary MOS (CMOS) devices today use both positive and negative channel devices in complementary configurations. While this requires more manufacturing steps and more transistors, the advantage is that less power is required and so the devices are smaller and faster. Let's consider the reason for this.

In an N-channel device such as the one illustrated, the source and drain have been doped so that electrons carry the flow of electricity, while the substrate carries current by the movement of "holes" -- positive ionizations of its atoms. The difference between the charge carriers prevents current from flowing. However, applying a positive charge to the gate creates a field that drives holes away from the surface of the substrate, creating a thin channel in which electrons are carriers. Thus, a current of electrons can flow between the source and drain in either direction. The device gets the name "field effect transistor" from this behavior. The transistor thus acts as a switch that is controlled by applying a voltage to the gate.

A P-channel device works similarly; however, it requires that an N-well be formed in the P-substrate to contain the two P-diffusion areas. Here, a negative charge is applied to the gate to drive the electron carriers away from the surface of the well, and create a thin channel of holes.

The P-channel device is not as efficient as an N-channel device because electrons have two to three times the mobility of holes. However, we can compensate for this by increasing the size of the transistor to decrease its resistance. Early MOS devices were P-channel because it was easier to control the doping of the small wells than of an entire substrate. Once it became possible to use the substrate as the channel, NMOS devices were built that were more efficient. However, to create an inverter in NMOS, one had to build a circuit in which a switch connects Ground to the output and Vdd is connected to the output via a resistor:

Thus, when the transistor is on, the output is grounded and the resistor prevents a direct short between power and ground. When the transistor is off, then Vdd charges up the output via the resistor. Thus a 1 input produces a 0 output and vice versa. More complex gates are based on this same sort of circuit. The trouble is that the resistor dissipates a considerable amount of power when the transistor is turned on, and when the transistor is turned off, it introduces a considerable delay into the positive charging of the output (the charge time is proportional to resistance times capacitance, so a high resistance lengthens the charge time). Thus a negative-to-positive signal transition is slower than the reverse.
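The dependence of charge time on resistance times capacitance can be made concrete with a short sketch. It assumes the standard first-order RC response, V(t) = Vdd * (1 - exp(-t / RC)); the function name and the resistor and load values are hypothetical, chosen only for illustration.

```python
import math

def charge_time(resistance_ohms, capacitance_farads, fraction=0.63):
    """Time for an RC node to charge to `fraction` of Vdd.

    Assumes the first-order response V(t) = Vdd * (1 - exp(-t / RC)).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -resistance_ohms * capacitance_farads * math.log(1.0 - fraction)

# Hypothetical values: two pull-up resistors driving the same 50 fF load.
load_farads = 50e-15
slow = charge_time(10_000, load_farads)
fast = charge_time(1_000, load_farads)

print(f"10 kOhm pull-up: {slow * 1e12:.1f} ps")
print(f" 1 kOhm pull-up: {fast * 1e12:.1f} ps")
print(f"ratio: {slow / fast:.1f}")
```

The charge time scales linearly with the resistance, which is why a pull-up resistor large enough to limit power also slows the rising edge.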

By combining the two types of transistor, however, we get a circuit that has lower power dissipation and equal transition times in both directions.

In this circuit, when the input is 1 the N-channel device turns on and the P-channel device turns off. Thus, current flows between ground and the output with little resistance and there is little power dissipated through the very high resistance of the P-channel device. When the input is 0 then the configuration reverses and there is again very low power dissipation.

Because of the reduced power dissipation, it is possible to pack many more transistors onto a chip without burning it out. Thus, chips can grow significantly in both density and size.

A technology that gained some popularity in the mid 1990s is called BiCMOS because it combines faster bipolar transistors (that nest N, P, and N wells within each other, creating a larger channel) with CMOS transistors on one substrate. Of course, it involves more manufacturing steps. BiCMOS also dissipates more power as heat (because it takes more power to charge the P well and create a carrier path than to create the field of a MOSFET). Subsequent developments in CMOS, such as Silicon on Insulator (SoI) and copper interconnect, have made the use of BiCMOS less attractive, except in certain circumstances, as similar speeds have been obtained with less power dissipation. In 2001, Motorola developed a process for combining CMOS with Gallium Arsenide (GaAs -- another semiconductor). Their goal in doing this was to enable a single chip to support signal processing and also contain the high frequency radio circuitry used in a cellular phone. While this is a niche application, the other feature of GaAs is that it is optically active, and can be used for semiconductor lasers. Thus, this advance in technology may enable optical interconnect between chips or even across chips.

Our example has followed the creation and operation of a single transistor, but of course the process takes place in parallel for all of the transistors on a chip and in fact for all of the chips on a wafer (although in some cases the photo-exposure process is done chip-by-chip with a precision stepping table, a technique called direct-step-on-wafer (DSW)). Thus, there is an economic incentive to increase the size of the wafer so that more chips can be created at once. This is balanced by the increased difficulty of manufacturing a larger defect-free wafer and of maintaining consistency of processing across a larger surface area.

A wafer fabrication process line is a carefully tuned combination of machinery and chemistry that must be continuously monitored and maintained in order to consistently yield good chips. If a line is shut down for any length of time (say due to a natural disaster cutting off power), then the line is said to "go sour" and must be fully cleaned and restarted. It can take weeks for a line to reach a production level of thermal, chemical, and cleanliness stability.

Even the slightest chemical contamination can ruin the quality of product emerging from a fabrication line (or fab). The clean rooms in which fabrication is done achieve levels of as few as 10-6 particles of micron size contaminants per cubic foot. Workers are not allowed to wear makeup, must wear dust-free bunny suits and face masks, and must go through an elaborate cleansing process before entering the clean room. Inside the clean room, filtered air is pumped in from above and sucked out through a gridwork in the floor so that any particles are immediately drawn out of the room. In addition, the fabrication facility is isolated from any external vibration that could disturb the delicate machinery. A modern fab line can cost as much as $2B to construct. Such an investment clearly requires a large profit margin for its products, as the typical life of a line is about 6 years with only 3 of those years being prime. Of course, continual upgrading of the line can hold off obsolescence, but eventually the facility must be extensively refitted in order to keep up with advances in cleaning technology. The cost to build a new fab line roughly doubles with each generation, which has some staggering economic implications if Moore’s law (doubling performance every eighteen months) is to continue for much longer.

Once chips have been created on wafers, they go through a testing process. Test structures on the wafer are probed to determine the general quality of the circuitry on the wafer, and it is possible that an entire wafer could be rejected. Accepted wafers are then probed chip-by-chip, with very fine wire probes being used to contact special test ports on the chips. The good chips are identified and then the wafer is "diced" either with a diamond saw or by scribing and snapping. The bad chips are discarded and the good chips are packaged.

There are a variety of packaging techniques. The approach that was used for most of the 20th century involved gluing the chip to an open carrier and then stringing fine gold wires between pads on the chip and on the carrier. The wires are bonded through pressure welding -- essentially the end of the wire is slammed onto the chip (after being heated) so that the energy of the impact causes the metals to fuse together. Toward the end of the 1990s, a technique gained popularity in which solder is deposited onto the chip’s connection pads, and the chip is then mounted face-down onto a carrier with a matching pattern of pads and traces. This solder-ball technique has the advantage that connections can be made virtually anywhere on the chip, rather than only around its edges. Needless to say, there are additional losses in the packaging process. Once the package has been sealed, the chip is again tested and if it passes it is burned in (run at full power for a short time to identify early failures due to thermal stress or other causes).

Next the part is graded for speed and marked appropriately. Even on the same wafer, variations in the processing can result in chips with significantly different maximum speeds. The grading process selects the best chips for higher pricing and slower chips are sold at standard or discount prices.

It should be noted also that memory chips undergo different processing than do processor chips. Because of the desire to obtain the maximum density of memory devices, their extremely regular geometry, and the fact that the primary purpose of a memory cell is to store charge, different materials and deeper structures are used. These processes are incompatible with the processes that are used to build chips containing mostly active devices. Thus, the memory that one sees on a processor chip is less space efficient than the memory on a RAM chip, and this is the reason that processor chips still have relatively small on-chip memories. Even though a chip has a megabyte cache, for example, it is serving a main memory that is potentially gigabytes in size.

Progress of Technology

Transistor counts on processor chips have grown at a slightly declining rate, from about 37% per year in the 1970s to about 28% per year in the 1990s, doubling in 2 to 2.5 years. Where does this increase come from? Two places -- transistors get smaller and chips get larger.

Transistors shrink in size because the ability to control photolithography and process technology improves. We learn to etch finer lines in the silicon. In 1982, typical line widths were 3 µm. By 1997, 0.3 µm was common. Thus, the width of a line decreased by a factor of 10 in 15 years. Because we are considering area, the effective increase in "real estate" is a factor of 100, which corresponds quite closely to an increase in transistor count averaging 36% per year over that period. However transistor counts have actually not kept up with this average, in spite of the fact that chips also grew in size by a factor of about 11 over that time. Given that we should see an increase of effective transistor counts of roughly 1100, why did they actually only go up by a factor of about 110?
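The arithmetic behind these figures can be checked with a small sketch. The function name is hypothetical; the line widths, the 15-year span, and the chip growth factor of 11 are the values quoted above.

```python
def annual_growth_rate(total_factor, years):
    """Compound annual rate that yields `total_factor` after `years` years."""
    return total_factor ** (1.0 / years) - 1.0

# Figures quoted in the text.
line_width_1982_um = 3.0
line_width_1997_um = 0.3
years = 1997 - 1982
chip_growth = 11

linear_shrink = line_width_1982_um / line_width_1997_um
area_gain = linear_shrink ** 2          # area scales with the square of line width
expected_gain = area_gain * chip_growth  # finer lines combined with larger chips

print(f"area gain from finer lines: {area_gain:.0f}x")
print(f"equivalent annual growth:   {annual_growth_rate(area_gain, years):.1%}")
print(f"expected gain with larger chips: {expected_gain:.0f}x")
```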

Part of this discrepancy can be traced to the fact that a chip is not entirely made up of transistors -- it also contains wires. The wires are often wider than the transistors because aluminum is harder to deposit and etch as cleanly as silicon, and because the wires are on top of the other layers where they must follow an uneven surface (extra width giving them a greater ability to do so without breaking). Also, aluminum is subject to a phenomenon known as metal migration in which an electric current actually draws metal atoms along with the electron flow and the metal can separate at weak points. As chips grow larger and device counts increase, the length of wires and their number also increase. Modern chips are in fact dominated by wire area. This has led to an increase in the number of layers of wires on a chip to try to increase density. In 1982, chips had one layer of metal. In the mid 1990s, three layers were common and by the late 1990s four-metal processes were in production. Some of those layers also switched to copper in the late 1990s, as copper has lower resistance than aluminum. This may seem minor, but at high frequencies it is a significant factor. Copper was not used originally because it is harder to work with than aluminum. It tends to contaminate other parts of the process, and being softer, it is also harder to control its deposition precisely. IBM was the first to work out this process, and others followed soon after.

Adding layers of metal sounds simple, but it is in fact quite difficult. The metal, being the thickest layer, adds a great deal of vertical relief to the surface of the chip. It is necessary to fill in the valleys between these hills with an insulator that is then polished flat. Then the insulator must be etched away to varying depths to create contact points for the next metal layer. And the metal must be applied in such a way as to fill these deep wells and make contact when its natural tendency is to bridge the holes.

As the sizes of transistors shrink, they also switch more quickly, because the distance between the source and drain is reduced and the capacitance of the gate (along with everything else) decreases. Thus, a channel can be opened or closed more quickly. In addition, local runs of wires are shorter, so that the time for a signal to propagate locally is reduced. Processors have accelerated at nearly the same rate that transistor counts have gone up.

However, long runs of wire actually have considerable resistance and capacitance. The time to charge a wire is proportional to the product of these, and it is the time for a wire to charge, rather than the propagation time of electrons in a conductor, that determines signal propagation time on a chip. Because resistance and capacitance increase as device sizes shrink, the delay on long wires increases with a square factor. In addition, because relative distance increases by a square factor as devices shrink, the delay can be considered to have a 4th power factor in relationship to feature size. Thus, some people predict that either die sizes will have to shrink to accommodate delay as devices continue to shrink, or else chips will have to be divided into distinct sections that communicate with each other asynchronously. This is a concern even with the lower-resistance copper interconnect, as clock rates have grown exponentially.

Another potential stumbling block in shrinking transistors is something called the short-channel effect. When the gate region of a transistor gets too small, electrons can tunnel through it whether it is on or off. This requires adjustment of the doping to increase the off-resistance. Eventually, however, the doping reaches a level at which the field that can be generated by charging the gate can no longer create a conductive channel. To put it another way, in order to build a tiny transistor that can be turned off, we have to dope it to the point that it can’t be turned on. The solution to this problem is to place gates on both sides of the channel, and charge them together. The combined fields create a conductive channel, and when they are removed, the transistor turns off. However, it greatly complicates the processing to try to place a gate under the channel as well as on top of it. IBM is once again the first to succeed in this, clearing the path for at least a few more generations of shrinkage.

Memory technology does not suffer from the same problems as processor chips. Because of its special processing and extreme regularity, it increases in size at a rate of 60% per year, quadrupling in three years. However, its speed has not increased at a comparable rate -- only about 50% over 15 years. The reason is that DRAM uses a charge-bucket storage cell. Each cell is effectively a tiny capacitor that can hold a very small charge. The smaller the cell, the smaller the charge (although deepening the cell keeps the decrease in charge capacity below expected amounts). In order to read out this tiny charge, a wire is precharged and a switch is opened to the bucket. If there is no charge in the bucket a tiny momentary drain is detected on the wire. If the bucket contains a charge, then a tiny momentary surge is seen on the line. A very sensitive differential amplifier and detector is used to identify these momentary deviations in the charge on the wire. In order to recognize the change, the detector must observe a steady state level on the wire for a period of time so that it can calibrate itself. Memory design must essentially trade off this time against minimizing the size of the charge bucket. Market forces tend to demand greater density over greater speed, as we shall see when we discuss the memory hierarchy. Thus, the size/speed tradeoff has favored only a modest increase in speed.

Disk technology increases in density approximately 25% per year, doubling about every three years. Like memory, the access speed only increased by about 50% in the 15 years from about 20 ms in 1982 to about 10 ms in 1997. However, costs have decreased dramatically. It is obvious that if memory and disk technology continue to increase in density at these rates, memory will someday overtake disk in terms of density. However, the cost differential between them will make disk a viable element of the memory hierarchy for the foreseeable future. To see why this is so, consider that making a wafer of DRAM takes many processing steps, while the creation of a disk platter has far fewer steps. To work, the DRAM must have a precisely patterned set of cells, while a disk surface is just a patternless bulk material. In fact, the difficult aspect of making a disk platter is to avoid giving it a “pattern,” that is, to make it perfectly free from unwanted variations.
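How long memory would take to overtake disk in density depends on how far ahead disk starts, which the text does not state. The sketch below uses the 60% and 25% annual growth rates quoted above; the function name and the starting lead of 100x are assumptions made purely for illustration.

```python
import math

def years_until_crossover(initial_lead, fast_rate, slow_rate):
    """Years for the faster-growing density to erase a lead of `initial_lead`.

    Solves initial_lead * (1 + slow_rate)**t == (1 + fast_rate)**t for t.
    """
    if initial_lead <= 0 or fast_rate <= slow_rate:
        raise ValueError("need a positive lead and fast_rate > slow_rate")
    return math.log(initial_lead) / math.log((1 + fast_rate) / (1 + slow_rate))

memory_rate = 0.60   # from the text
disk_rate = 0.25     # from the text
assumed_lead = 100   # hypothetical: disk starts 100x denser than memory

print(f"memory growth over three years: {(1 + memory_rate) ** 3:.2f}x")
print(f"disk growth over three years:   {(1 + disk_rate) ** 3:.2f}x")
print(f"crossover after about {years_until_crossover(assumed_lead, memory_rate, disk_rate):.1f} years")
```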

Economic Model of Chip Manufacturing

This is extended from Hennessy and Patterson.

Creation of a chip involves a combination of recurring and nonrecurring costs. The recurring costs are the costs of materials, processing, testing, packaging, marketing, etc. The nonrecurring costs are the design of the chip and creation of the mask set that is used for its production.

Depending on the size of the chip and the complexity of the process (i.e. how many steps are involved), the creation of a mask set can cost from $10,000 to $250,000 (or more) today. For sample runs of chips it is typical practice to place different chip designs on a single wafer, and so a portion of the mask creation cost is split between the owners of the designs. This nonrecurring cost must be amortized over the number of chips that are eventually sold, so the cost of mask creation per chip is

mask cost / chips sold

For very long runs, the cost may be amortized over some fixed initial number of chips, and then the remaining chips are no longer burdened with this cost (or other nonrecurring costs). For smaller production runs, it may be necessary to amortize the costs over all of the chips produced. For simplicity, we use the floor function of the nonrecurring costs as we can assume that once the cost falls below a penny per chip, it is no longer applied.
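A minimal sketch of this amortization follows, assuming that "floor" means rounding the per-chip share down to whole cents so that anything under a penny drops to zero. The function name is hypothetical, and the mask cost is the upper figure quoted earlier.

```python
import math

def mask_cost_per_chip(mask_cost_dollars, chips_sold):
    """Per-chip share of the mask cost, rounded down to whole cents.

    Assumption: a share below one cent is treated as no longer applied.
    """
    if chips_sold <= 0:
        raise ValueError("chips_sold must be positive")
    cents = math.floor(mask_cost_dollars * 100 / chips_sold)
    return cents / 100

mask_cost = 250_000  # dollars, upper end of the range given in the text
for chips in (10_000, 1_000_000, 50_000_000):
    print(f"{chips:>11,} chips: ${mask_cost_per_chip(mask_cost, chips):.2f} per chip")
```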

For a new processor chip, the engineering cost typically dominates the mask creation cost in the total of nonrecurring costs. The high cost of mask creation is a disincentive to producing test-runs to verify a design. An even greater disincentive to such a trial-and-error approach is the cost of revalidating a design after a change has been made to correct an error. It may be necessary to simulate large sections of the chip with great precision (sometimes at the electron-by-electron level) to ensure that, for example, a new run of wire does not induce stray signals into other circuits. Thus, the design process is a huge effort, with each step, proceeding from a logical design through the chip layout, being simulated extensively and validated against a design model. After the chip layout is complete, other tools are used to scan the patterns, and extract the circuits that they create. This independent extraction results in a new logical circuit that is also validated. The goal of this process is to produce an error-free chip that operates at its target speed on the first run of chips. Consider what it would be like if you could not compile and run a program until it had been so thoroughly verified that you were certain it would execute correctly, and you get a sense of the amount of engineering effort that goes into a chip design. And, of course, bugs still manage to slip through.

Nonrecurring engineering cost varies dramatically, depending on many factors. A team of 20 engineers in a startup may work long hours at low pay using a few dozen computers to bring out a novel design, while a major manufacturer could use a team of hundreds of well-paid engineers and a farm of a thousand computers for simulation to bring out a new generation of an established processor. The startup may bring out a processor for just a $20M investment, while the next generation of a commodity microprocessor may cost $400M. It’s easy to see why there are just a few processor architectures remaining in production. To recover these costs, many more units must be sold, which requires development of new markets.

The recurring cost can be summarized as

Cost of IC = (Cost of chip + Cost of test + Cost of package) / Final Yield

The cost of a chip is the cost of a wafer divided by the number of good dies that are found in the initial testing.

Cost of Chip = Cost of wafer / (Chips per wafer * Yield_die)

where Yield_die is the fraction of good dies (chips). A 6-inch wafer (15 cm) costs about $550 (2-metal CMOS, ca. 1990), an 8-inch wafer (20 cm) costs about $3500 (4-metal CMOS, ca. 1996), and a 12-inch wafer costs $5000 to $6000 (4 to 6 metal CMOS, ca. 2000). The exact cost of the wafer depends on its size and on the complexity of the process (i.e. number of steps).
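The two recurring-cost formulas translate directly into code. In the sketch below the function names are hypothetical, and every input other than the $3500 wafer cost (chips per wafer, yields, test and package costs) is an illustrative assumption rather than a figure from the text.

```python
def cost_of_chip(wafer_cost, chips_per_wafer, die_yield):
    """Cost of Chip = Cost of wafer / (Chips per wafer * die yield)."""
    return wafer_cost / (chips_per_wafer * die_yield)

def cost_of_ic(chip_cost, test_cost, package_cost, final_yield):
    """Cost of IC = (Cost of chip + Cost of test + Cost of package) / Final Yield."""
    return (chip_cost + test_cost + package_cost) / final_yield

# $3500 is the 8-inch wafer cost from the text; the rest is assumed.
chip = cost_of_chip(wafer_cost=3500, chips_per_wafer=200, die_yield=0.5)
ic = cost_of_ic(chip, test_cost=1.0, package_cost=2.0, final_yield=0.95)

print(f"cost per good die:   ${chip:.2f}")
print(f"cost per finished IC: ${ic:.2f}")
```

Note that halving the die yield doubles the cost of each good die, since the whole wafer is paid for regardless of how many chips work.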

The number of chips per wafer is the area of the wafer divided by the area of the chip less the number of chip sites that are on the edge of the wafer. If the wafer has separate test blocks, then we have to remove these from the count as well.

Where D is the wafer diameter, A is the area of a chip, and T is the number of test sites.
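One way to compute this count is sketched below. It assumes the Hennessy and Patterson form of the estimate, in which the edge loss is approximated as the wafer circumference divided by the diagonal of a square chip, with the test sites T subtracted afterward. That form is an assumption here, and the function name and sample values are illustrative only.

```python
import math

def chips_per_wafer(wafer_diameter_cm, chip_area_cm2, test_sites=0):
    """Estimate whole chips per wafer (assumed Hennessy and Patterson form).

    chips = pi * (D / 2)**2 / A  -  pi * D / sqrt(2 * A)  -  T
    """
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    # Edge loss: circumference divided by the diagonal of a square chip.
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * chip_area_cm2)
    return math.floor(wafer_area / chip_area_cm2 - edge_loss - test_sites)

# Hypothetical example: 1 sq. cm chips on a 20 cm wafer with 4 test sites.
print(chips_per_wafer(20, 1.0))
print(chips_per_wafer(20, 1.0, test_sites=4))
```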

The Yield per wafer is based on the defects per unit area (D_unit), the die area (A) and a measure of the complexity of the process (P). Our cost model must also take into account that a certain fraction of wafers Y_wafer are entirely bad (due, for example, to having a mistake occur in performing a step).

A typical value of P for a 2-metal CMOS process (ca. 1992) is 2, for a 4-metal CMOS process (ca. 1996) it is 3, for a 6-metal CMOS process (ca. 2000) it is 4. More complex processes such as BiCMOS and GaAs are higher. D_unit was about 0.6 to 1.2 in 1996, down from approximately 2 in 1990. In 2000 it was down further to 0.4 to 0.8. Processor sizes in 2000 typically range from 0.1 sq. cm for an embedded design to 2 sq. cm for some aggressive high-performance designs.

The cost of testing is based on the time to test each die

Time on a tester costs from $50/hour to $1000/hour, depending on the speed of the chip and the number of pins (imposing requirements on the capabilities of the tester, which affect its cost) and a typical chip takes from a few seconds to several minutes to test, depending on the complexity of the chip and how much test-support circuitry is built into it. Test circuitry enables the tester to put the chip into a mode in which it can directly access portions of the chip that might be hidden from the normal program model (e.g. cache write buffer, shadow registers, etc.) as well as providing for a simpler test procedure. Designers balance the cost of the additional test circuitry (which isn’t useful for the operation of the processor) against the cost of test. In many cases, there is simply a test-cost target and the designers incorporate just enough test circuitry to reach this goal.
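As a rough illustration of how tester time translates into cost per die, the sketch below multiplies the hourly rate by the test duration. The function name is hypothetical and the two cases simply pair the extremes of the ranges quoted above; spreading the cost of testing bad dies over the good ones is left out.

```python
def die_test_cost(tester_rate_per_hour, test_seconds):
    """Tester cost for one die: hourly rate times the fraction of an hour used."""
    return tester_rate_per_hour * test_seconds / 3600.0

# Extremes of the ranges in the text: cheap tester and short test,
# versus expensive tester and a test lasting several minutes.
low = die_test_cost(50, 5)
high = die_test_cost(1000, 180)

print(f"low end:  ${low:.2f} per die")
print(f"high end: ${high:.2f} per die")
```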

An example of a test technique is to provide a mode in which some of the pins become a test port. The port connects to a special set of data paths in the chip that allow a test pattern to be shifted into the chip and loaded into its registers. The chip is then clocked at least once and the new pattern of the registers is shifted out through the port and compared with an expected result. The chip may include registers that are used only for testing so that inputs and outputs of particular functional units that are not normally buffered can be controlled and examined.
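The shift-in, clock, shift-out sequence can be modeled with a toy simulation. Everything in this sketch is hypothetical: the class, the four-register chain, and the inverter standing in for a functional unit are illustrations of the idea, not a description of any real chip's test port.

```python
class ScanChain:
    """Toy model of chip registers linked into one shift register for test."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift(self, bit_in):
        """Shift one bit in at the head and return the bit leaving the tail."""
        bit_out = self.bits[-1]
        self.bits = [bit_in] + self.bits[:-1]
        return bit_out

    def load(self, pattern):
        """Shift in a whole pattern so that bits[i] == pattern[i] afterward."""
        for bit in reversed(pattern):
            self.shift(bit)

    def clock(self, logic):
        """One functional clock: the registers capture the logic's outputs."""
        self.bits = list(logic(self.bits))

    def unload(self):
        """Shift all registers out through the port, returned in register order."""
        shifted = [self.shift(0) for _ in range(len(self.bits))]
        return shifted[::-1]


def invert(bits):
    """Stand-in for a functional unit: inverts every register bit."""
    return [1 - b for b in bits]


def stuck_at_one(bits):
    """The same unit with a simulated fault: output bit 2 is stuck at 1."""
    out = invert(bits)
    out[2] = 1
    return out


def run_scan_test(logic, pattern, expected):
    """Load a pattern, clock once, and compare the result with `expected`."""
    chain = ScanChain(len(pattern))
    chain.load(pattern)
    chain.clock(logic)
    return chain.unload() == expected


pattern = [1, 0, 1, 1]
expected = [0, 1, 0, 0]
print("good chip passes:  ", run_scan_test(invert, pattern, expected))
print("faulty chip passes:", run_scan_test(stuck_at_one, pattern, expected))
```

A real test would apply many patterns, since a single pattern only exposes faults that happen to change its expected result.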

Another form of test circuitry provides the ability to test a chip once it has been mounted on a circuit board without having to directly probe its leads (in fact, some packages, such as those using solder balls, are impossible to probe once mounted). The in-circuit testing standard is called JTAG (Joint Test Action Group) boundary scan architecture. It essentially connects the pads of the chip together with a shift register. In test mode, the pad drivers can be switched to take signals either from these shift register elements or from the actual wires connected to the bonding area. A single 4-pin port is then used to shift signals into the pads and shift them out again. The boundary scan architecture permits easy verification that the chip has good connections to the outside world. It also provides for limited testing of components that are connected directly to the chip.

For a low-power chip with only a few I/O pins a package may cost only a few cents. Packages with enough pins for a typical processor, but a low power level (< 1W), such as a plastic quad flat pack cost a few dollars (e.g., $2). Higher power packages such as ceramic pin grid arrays cost tens of dollars (e.g., $20 to $60) but can handle several watts. For really high power chips, special heat-sinks, fan-sinks, or chilled-fluid cooling may be employed, with costs of over $100 per chip. Mounting a chip in a typical package costs about $2. Then the packaged part is burned-in, which costs only about 25 cents. Finally, a fraction of the chips fail during burn in, and their cost must be amortized over the remaining good die.

A typical wafer yield might be close to 100%, a typical defect rate might be 0.6 per square cm, and a typical chip today is about 1 square cm. Thus, we could expect to see fewer than one in three chips actually work.

Doubling the size of a chip can more than double its cost -- a 1-cm square chip might cost $25 while two 0.5-cm square chips would cost about $14.

On top of these base costs, as mentioned above, the nonrecurring costs must be amortized. In addition, the final price reflects profit. And more often than not, the price also reflects knowledge of what the market will bear, and the price point of the competition.

Cost of Assembled Systems

A processor chip in 2000 accounts for about 23% of the system cost of a PC. In 1996 it was only about 6% (down from 10% to 15% in 1990). This shift in relative cost is mainly due to a sharp drop in memory cost. In 1995 the memory in a PC accounted for roughly 1/3 of the cost (and nearly half in 1990), but it is now less than 5%. The cost of the processor, which is indirectly related to its design, can have a broader overall impact on system cost. A faster processor requires faster (more expensive) memory. A faster processor and memory require a higher quality circuit board, more power, and greater cooling capacity (although these are about 3% of the total cost, unless chilled cooling becomes necessary).

Costs unrelated to the expense of the CPU include the I/O circuitry (video, network, disk, secondary cache, etc.), which totals about 25%. Within that, video is about 5% (down from 14% in 1995, due mainly to lower memory cost) and disk is about 9%. The monitor, keyboard, and mouse are about 26% of the cost in 2000. That is about the same as in 1995, after an increase from 12% in 1990 that was due mostly to the additional cost of a color monitor (early PCs typically had monochrome monitors). The chassis, cables, etc. are about 4% of the cost.

The point to note in this discussion is that the part of the system that we focus on is only about 23% of the total cost. Thus, our design choices may seem to have only a modest effect on cost. In reality, poor choices can have a further impact on cost by requiring additional I/O circuitry or more expensive RAM. In a commodity market, there is very little margin between a product that is profitable and one that is not. Thus, even modest differences in cost can mean success or failure of a product or a whole company.

Given that a competitive system must have certain features, and must meet certain market-driven cost goals, we can see that the computer architect has little room to choose with regard to the cost of the system. Why then worry about architecture? Because the customer is not concerned with price alone, but with the price performance ratio. And it is in the area of providing maximum performance within given cost limitations that the architect's knowledge comes into play. Major manufacturers have comparable technology for building processors. IBM may pull ahead briefly with copper, SoI, or double-gated technology, but then Intel and Motorola soon catch up. To gain an additional performance edge, architecture comes into play.