Number
Representations
As
we saw in the discussion of Boolean logic, we can perform arithmetic on binary
numbers using combinations of logical operations. Now we'll take a somewhat
closer look at the specifics of arithmetic operations.
Recall
from assembly language that binary numbers have a power of two associated with
each of their digits. If we use this to number the digits as we write them
down, we get a number scheme in which a 32-bit number has bit 31 on the left
and bit 0 on the right.

In
decimal notation we represent a negative number by writing a minus in front of
a positive number. With binary numbers we do not have the convenience of such a
notation, although we could declare one bit (say bit 31) to be a sign bit and
the rest of the number to be the quantity. Such a notation is called
signed-magnitude representation. Suppose we were using a four-bit word. Then 6
would be represented as
0110
and
-6 as
1110
This
is easy to read, but it has the disadvantage that if we add the numbers in this
form, we do not get 0.
0110
1110
------
0100 with a 1 overflow.
so
the result of 6 + (-6) appears to be 4 with overflow.
We
don't particularly care about readability of binary numbers in the computer
because we rarely look at them anyway (we write software to translate them into
decimal notation). So we should really be more concerned with what
representation is most useful for the computer. A system that allows us to use
an adder to sum positive and negative numbers will simplify the design of the
ALU. There are two such systems of representation and they both depend on the
complement of a number.
The
complement of a number in a given base can be defined as the difference between
each digit of the number and the maximum digit value for the base. Thus, in
base 10 the complement of 26 would be 73, because the maximum digit value is 9.
In base two, the complement is easy to compute -- it is just the bitwise
inverse of a number. Thus the complement of 11010 would be 00101.
This
scheme is known as ones-complement and was used on machines designed by Seymour
Cray at Control Data Corporation. They used ones complement because it made
negating a number as fast and simple as passing it through a word-wide bank of
inverters. It is also a symmetric number system in that, for any given word
size, there are an equal number of positive and negative numbers. The problem
is that one of those numbers is -0. So, a ones complement machine has to
recognize the presence of all 1's in a word (with a giant AND gate) and invert
the bits again whenever -0 appears. In ones complement, the above addition
would be represented by:
0110
1001
-----
1111 (-0) => 0000
Similarly 26 + 73 = 99 (-0)
=> 0
We
can avoid the problem of -0 by using two's complement, in which we add 1 to the
complement of a number to negate it.
0110
1010
-----
10000 => 0000 with
overflow
The
10's complement of 26 would be 74.
So 26 + 74 = 100 => 0 with
overflow
Thus
we never have negative 0 to worry about, but we do have a minor asymmetry --
there is one more negative number than positive number in our system. The
unique 0 is essentially taken from the positive side of the number line in a
two's complement system.
As
we saw before, we can perform the +1 part of a two's complement by using the
carry-in to the low-order bit of an adder, so there is almost no cost
associated with this operation.
In
two's complement, a high-order bit of 1 always signifies a negative number (in
ones complement, we have to check for -0 when the high order bit is 1). Thus,
the high order bit is usually called the sign bit.
If
we need to convert a two's complement number to a larger word size, we simply
extend the sign bit into the high order digits to produce a number of the
correct form. That is, if the number is positive, all of the higher order bits
in the wider word will be 0. If it is negative, they are all 1.
Shifting
Shifting
a word right or left (which is equivalent to multiplying or dividing by a power
of 2) is used in multiplication and division and also to align data on byte or
word boundaries. It can be implemented in several ways. The least expensive is
with a shift register.
Basically
a chain of D flip-flops is arranged with gates that connect their inputs to an
external source or to each other.

We
can make this bidirectional by adding a third input that takes its data from
the flip-flop on the other side.

The
drawback of this approach is that it requires each shift on n places to take n
clock cycles. We can instead use a logic circuit that selects any of the input
32 bits to go to a given output bit, with all of the other output bits
receiving a parallel selection.
For
example, a shift of three to the right would establish the following pattern of
connections:

Here
we see one bit of a 32-bit shifter of this type:

Each
bit would have this same arrangement of 32 AND gates feeding a32-bit OR gate.
What is not as obvious from the logic diagram is that this results in 32 x 32
=1024 wires leading from the
inputs to the AND gates. An actual implementation would use several stages of
gates to avoid the 32-input OR gate and the requirement for the inputs to drive
32 AND gates at once. This arrangement is clearly more expensive but it can
compute any shift in just one cycle (or even less, since it is not a clocked
circuit). It is called a barrel shifter. An alternative to supporting every
shift distance, which is used for shifters that do data alignment (for example,
loading a specific byte into the low-order byte of a register) is to provide
just a few of the possible shift distances as required by the particular
application.
Carry
Lookahead Addition
The
book does a nice job of showing how an adder/subtractor/AND/OR/less than can be
built so it won't be repeated here. Of particular interest is to note how zero
detection and overflow are detected in Figure 4.17.
We'll
just briefly go over the operation of the carry lookahead since we've discussed
the concept before. The basic idea behind its implementation is that some
clever algebra has been used to simplify the logic functions by factoring out
some common terms in each of the carry expressions.
These
terms depend only on the bits being input at a given position -- not on any of
the other positions. They are called
generate
= A AND B (because if A and B are both 1, then a carry is generated regardless
of the carry in to that position.
propagate
= A OR B (because if either is a 1, then a carry in will be propagated to the
next stage (and if they are both 1, the carry in doesn't matter -- it is an
inclusive propagate)
So
we have
c1
= g0 + (p0 c0) that is, we get a
carry if we have either a generate, or a propagate with a carry in
c2
= g1 + (p1 c1) = g1 + (p1 g0) + (p1 p0 c0)
c3
= g2 + (p2 c2) = g2 + (p2 g1) + p2 p1 g0) + (p2 p1 p0 c0)
c4
= g3 + (p3 c3) = g3 + (p3 g2) + (p3 p2 g1) + (p3 p2 p1 g0) + (p3 p2 p1 p0 c0)
As
you can see, we can build each of the carry out computations from a circuit with
just two stages of gates, but the n-th bit contains a circuit with an n+1 input
OR gate and n AND gates who's input counts range from 2 to n+1.
For
an 8-bit adder this works OK, but for 32 bits the fan-in is too high. So a
hierarchical scheme is used instead -- it will usually consist of 4 or 8-bit
carry lookahead adders and these feed another level of lookahead circuitry that
produces the carry in to each of the blocks at the same time.
Multiplication
The
basic idea behind multiplication is that we shift (leftwards) and add the
multiplicand to a result register repeatedly. At each shift position we look at
a corresponding digit of the multiplier and if it is 0 we skip the addition and
if it is 1 we carry out the addition. The multiplier is thus placed in a
register that shifts right, with the least significant bit being used to
control the adder. Thus the multiplicand and the result are stored in
double-length registers (the multiplicand to enable the shifting, and the
result to accommodate the fact that it naturally grows to double the length as
a result of multiplication).

More
clever schemes recognize that as the data shifts out of the multiplier, space
is left in its register that we can take advantage of to avoid having to use a
double-length register for the
result. So some of the bits recirculate to fill the space being vacated by the
departing bits of the multiplier as they shift across the register.
Booth's
algorithm is another approach that tries to speed up multiplication by taking
advantage of the patterns of bits in the operands. The algorithm notes that if
the multiplier contains a string of ones (for example)
0000111111110000
that
the partial sum that results from this string is equal to the multiplicand
times a 1 in the next higher bit minus itself times the 1 in the lowest order
bit of the string.
That
is 0000111111110000 X 1 = 0001000000000000 – 0000000000010000.
So
if a number contains long strings of 1 bits, we can avoid all of the
intermediate shifts and adds and replace them by one add and one subtract. The
problem is that this algorithm is actually slower for patterns like
0101010101010101
where
an add and subtract would be done for each 1 in the multiplier.
We
can build a really fast multiplier by creating a chain of adders that each take
the multiplicand as one input (controlled by the appropriate bit of the
multiplier) and take the output of the preceding stage as the other input.

This
scheme produces a result in the time that it takes the data to propagate
through all of the adder stages. It has as many adders as there are bits in the
multiplier, and the adders are as wide as the number of bits in the
multiplicand. Thus, for N-bit words, N-squared gates are required for a ripple
carry adder based circuit, and a CLA based circuit is even more expensive.
One
point to note about this approach is that if we put a register between each
pair of adders, and clock the data through the register, then we can start a
new multiply operation into this circuit with each clock cycle. After N clock
ticks the first result emerges and subsequent multiplies emerge on each
additional tick. Such an approach is called a pipelined multiplier and is
useful in machines that are doing vector operations.
Wallace
proposed an alternative scheme that uses fewer gates. He recognized that we
don't have to complete all of the carry operations at each stage before passing
a value on to the next stage. As long as the carries get added in before the
final result is produced, we will have a valid sum of all of the partial
products.
Thus,
in Wallace's scheme, all of the partial products are generated in parallel by
gating N copies of the N-bit the multiplicand by the bits of the multiplier
into the inputs of a tree of "pseudo-adders". (Much as in the diagram
above, except that the pipelined stages are replaced by the adder tree).
Each
pseudo-adder is a bank of full adder circuits. But instead of propagating the
carry output to an adjacent bit within the pseudo-adder, the carry is simply
another output number:

A
pseudo-adder block can be treated as a 3-input 2-output combinational circuit
that produces two sums from three inputs. If we add the two sums together, then
we get the actual binary sum of the three inputs.
Wallace
showed how to construct a tree of these adders. The following diagram shows how
20 partial products could be summed together with pseudo-adders. At the final
stage, a carry lookahead adder is used to sum the last carry and sum output
pair.

Wallace
tree adders can be constructed from larger adder blocks that take more inputs
and produce more outputs -- all that is required of the outputs of a pseudo
adder is that if you sum them up you get the sum of the inputs. Thus, for
example, one might find a convenient Boolean function that takes 15 inputs and
generates 4 outputs for which the sum is the sum of the 15 inputs.
Division
The
same basic approach can be used for division. Start with the dividend and
divisor aligned. Shift the dividend left and subtract the divisor. If the result
is positive, then the quotient gets a 1 and the remainder replaces the
dividend. If the result is negative, then the quotient gets a 0 and the
dividend is restored except that it is still shifted by one.
Given
a fast multiplier, however, there are faster methods that can be applied based
on a Newton-Raphson iterative solver. Flynn showed that a first-order
Newton-Raphson formula can start with a 4-bit approximation and in just 3
iterations achieve 32-bit precision. Each of the iterations takes 2 multiplies
and some shifts and complements.
The
initial value can be obtained with a table lookup that uses the high order bits
of the divisor as the address and produces the 4-bit approximation to the
quotient. Thus, division of this form takes only about 6 times as long as
multiplication. If the multiplier is like the Wallace Tree, it may be several
times faster than the shift-and-subtract approach.
Elementary
Functions
Computing
sines, cosines, etc. in hardware, is actually included in many floating point
units. One popular method that involves only simple hardware (and is used, for
example, in most calculators) is the CORDIC (COordinate Rotation DIgital
Computer) approach developed by Volder.
He
observed that if one rotates a vector by a decreasing (in magnitude) series of
positive or negative angles that the vector eventually converges on a desired
angle. Further, if the angles are selected appropriately and at each stage
either a positive or negative version of the rotation is applied (whichever
gives the closer result) that the length of the resultant vector is a constant.
Thus, if the final X and Y values are multiplied by an appropriate pair of
constants to change the length of the vector into a unit, then as we know from
trigonometry, the X and Y values of the unit vector at that angle are the
values of the sine and cosine.
The
CORDIC method uses a table of constants. For each iteration it compares the
current angle to the desired angle, and either adds or subtracts the constants
for that iteration to the angle, X, and Y values. The precision of the result
increases one digit per iteration. At the end of this series of add/subtract
operations, one multiply is required to produce the desired function.
Floating
Point
As
you probably recall from assembly language programming, the basic concept of
floating point representation is that it provides a portion of a word that
represents a mantissa, part that represents an exponent, and one bit for a
sign. The sign of the exponent is usually subsumed by a bias value that must be
subtracted from the stored value to obtain the correct exponent value. The
advantage of the bias is that it makes the exponents all positive which saves
an explicit bit for the sign and also means that the arithmetic (such as
comparisons) does not have to treat positive and negative exponents
differently. However, after adding two exponents (as in multiplication), the
extra bias amount must be subtracted. Biased exponent representations also have
the advantage that zero is represented by a word of all zeros.
The
mantissa for a floating point number is usually normalized -- that is, it is
shifted left until the first one bit is in the high order position. The
exponent is adjusted by a corresponding amount to compensate for the shifting.
Because a normalized representation always has a one in the leading position,
we need not represent it explicitly. Thus, one more bit can be used to extend
the precision of the mantissa.
Floating
Point Operations
Addition
involves aligning the mantissas of the two numbers (determining the difference
between their exponents and then shifting the smaller of the two numbers right
by that many positions, with the implicit one made explicit), then adding the
mantissas and if there is an overflow renormalizing by shifting the result one
to the right.
Subtraction
is the same as addition except that one mantissa is subtracted from the other
after alignment, and the renormalization may involve an arbitrary left shift of
the result.
Multiplication
does not require the mantissas to be aligned. They are multiplied, the
exponents are added, and the result is renormalized if necessary (a shift of
one position). Of course, the result is double the size of the original values,
so it must be shortened either through truncation or rounding. (The other
operations require this as well whenever a value is renormalized and bits of
precision are lost.)
Division
similarly involves no prealignment of the mantissas. When the division is
performed, one exponent is subtracted from the other. Unlike multiplication,
the mantissa result of the division can require arbitrary shifting to be
renormalized.
IEEE
Floating Point Format
The
majority of computer systems today use the IEEE 754 standard that was first
adopted in 1985. The adoption of this standard was a rare show of apparent
cooperation among computer manufacturers. However, it was partly a matter of
pragmatics (handling floating point properly is complicated, and adopting the
standard saves both design cost and potentially embarrassing errors), partly
competition (it became a positive marketing point to be able to claim IEEE
compatibility), partly the effect that the microprocessor generation of
manufacturers had a clean slate to start from (most of them did not have
existing mainframe-based standards -- those that did were the ones like DEC and
IBM who entered the microprocessor business late), and partly that the
manufacturers uniformly underestimated the difficulty of implementing the IEEE
standard (so they publicly jumped onto the bandwagon before they knew how much
it would really cost).
Because
the standard is well designed, its widespread adoption has been a very positive
experience for the software community. However, even 15 years after it was
introduced, there was very little language support to take advantage of the
more sophisticated features of the standard, and some of the features are
either unimplemented or very costly to employ on some machines.
The
standard defines four formats for floating point numbers, of which two are
commonly used (single and double). In single-precision numbers there is a
one-bit sign, 23 bits in the mantissa (an implicit 24th bit is the leading 1 in
all mantissas, which isn't stored), and the 8-bit exponent ranges from -126 to
127 with a bias of 127. In double-precision numbers there is the one-bit sign,
53 bits of precision in the mantissa and the exponent ranges from -1022 to 1023
with a bias of 1023.
The
bias is a number that must be subtracted from the stored value of the exponent
to get its value for purposes of computation. Thus, the minimum value that can
appear in an exponent field of a proper single-precision floating point number
is 1 and the maximum value is 254. This leaves 0 and 255 for representing
special cases. In double-precision, the exponent can be 1 through 2046, with 0
and 2047 reserved for special cases.
The
other two formats are "extended" versions of the two common
representations. They assume that a machine will devote an entire word (32 or
64 bits) to the mantissa and at least 10 (single) or 14 (double) bits to the
exponent field. The extended formats do not have defined biases.
Special
Case Numbers
When
the exponent of an IEEE FP number is all ones (255 or 1023) and the mantissa is
all zeroes the number is explicitly equal to infinity (+ or -, depending on the
sign bit). The standard requires arithmetic such as division by 0 to produce
infinity as a result, and inversely, division by infinity to produce 0.
Overflow can also produce an infinite result.
If
the exponent is all ones and the mantissa is nonzero, then the value is called
"Not a Number" or NaN. The NaN values can be used to represent
results such as the square root of a negative. Arithmetic on NaN values should
also produce NaN values as output so that if an improper floating point value
enters a string of calculations, it cannot result in a valid floating point
result.
When
the exponent is zero, then the mantissa represents an unnormalized or
"subnormal" number. These are numbers that are smaller than the
number 1 x 2^-126. That is, the implicit leading 1 disappears and the number of
significant digits in the mantissa can decrease to fewer than 24 bits. These
numbers are useful when we consider that any two numbers with exponents of -126
will differ by an amount that is smaller than we can otherwise represent. So if
we try to compare the numbers by subtracting them, we get an unnormalized
number. In some systems, unnormalized numbers are not allowed and the
difference would simply be zero. Thus, it would be possible to have two very
small but different numbers that appear to be equal when we compare them.
While
everyone agrees that having subnormal numbers enhances the accuracy of
calculations and reduces programming effort, hardware designers find them to be
a very difficult part of the standard to implement. Subnormals are a special
case that occur relatively rarely, but they also require that all computations
make the checks for the special combination of a zero exponent and a nonzero
mantissa and then turn off the usual prealignment and renormalization hardware
appropriately. Thus, implementing subnormals can both increase the cost of a
floating point unit and result in slower computation overall.
In
some processors, for example the PowerPC, the programmer can disable the use of
subnormal numbers. In the PowerPC, when subnormals are disabled, a subnormal
result generates a normalized value and an exception is raised so that the
software can handle the underflow. Of course, the exception can be ignored if
speed is more desirable than accuracy. When subnormals are enabled, no
underflow exception is raised and the FPU takes care of all of the special
handling.
Rounding
The
IEEE standard has four different rounding modes. The usual mode is to round to
the nearest value, with a number that falls midway between two others being
rounded to the nearest value with an even (zero) low-order digit. The other
modes are round toward zero, round toward plus infinity, and round toward minus
infinity.
Rounding
is based on two additional bits of precision that are kept in intermediate
stages of a floating point calculation. These are the guard and round bits. To
see why two bits are needed, consider that adding two numbers can require an
extra digit in order to know how to round them (as when one value is shifted by
one position with respect to the other) and that the result of such an addition
can also produce a carry out of the high order place. Thus, we need to be able
to store a result that is two bits longer than normal.
One
problem with the "round to even" mode is that it complicates the
rounding process. For example, the result of a multiply might end with a long
string of zeros and then a one. Without the one, the result would be rounded
down to the nearest even value. But with the one, the result does not fall half
way between the two rounded possibilities -- it actually falls closer to the
odd number just above. Thus, the FPU has to keep track of whether there are any
one bits beyond the guard and round -- it does this with a bit called the "sticky
bit" because it is often implemented by watching the low order bits as
they are shifted out to the right during normalization, and sticking to a one
if any one bits are seen.
© Copyright 1995, 1996, 2001 Charles C. Weems Jr.
All rights reserved.
Back to Chip Weems'
home page.
Back to Computer Science
Department home page.