Lecture 14: Floating Point (by Trek Palmer)
===========================================

Up until now we have been focused on integers. Integers are useful. You can
do a lot with them, but not everything. Anything that requires fractional
values will need new techniques and representations.

Fractional values
-----------------

The way we represent fractional values is to separate the integral and
fractional parts with '.', so 1234.4321 means
1234 + 4*10^-1 + 3*10^-2 + 2*10^-3 + 1*10^-4. We can similarly (textually)
represent binary numbers by using the radix point ('.'). So
101010.0101 = 101010 + 0*2^-1 + 1*2^-2 + 0*2^-3 + 1*2^-4.

This is simple and (or at least should be) obvious, but the non-obvious
question is how do you convert from one to the other? To do this we recall
that integer conversion was essentially a form of division, in which case
fractional conversion will be a form of multiplication.

The algorithm is simple. Convert the integer part separately. Write out the
fractional part. Multiply it by two. The integral part of the result (0 or 1)
becomes the next binary digit of the expansion. Replace the integral part
with 0 and repeat until the fractional part is 0.

Ex 1: 1.25 = 1 + 25/100 = 1 + 1/4. We know that 1/4 = 1/2^2, so, a priori,
we can deduce that the binary form should be 1.01. Now we compute it with
the algorithm:

  0.25
*    2
  ----
  0.50 => .0
*    2
  ----
  1.00 => .01
       => 0.00
*    2
  ----
  0       Halt

Ex 2: 14.3125

  0.3125
*      2
  ------
  0.6250 => .0
*      2
  ------
  1.2500 => .01
         => 0.2500
*      2
  ------
  0.5000 => .010
*      2
  ------
  1.0000 => .0101
         => 0.0000
*      2
  ------
  0.0000    Halt

The integer part converts separately (14 = 1110), so 14.3125 = 1110.0101.

Now, there is an issue with fractional values that doesn't occur with
integers: non-terminating values. As I'm certain you recall from elementary
school, 1/3 cannot be represented with a finite string using a radix point
system (for decimal). Binary isn't special; it too has non-terminating
values. The catch is that non-terminating values in binary aren't
necessarily non-terminating in decimal.
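The multiply-by-two procedure can be sketched in Java. This is only a minimal
sketch, not a definitive implementation: it assumes the fractional part is
given as an exact ratio num/den (so that decimal fractions are not rounded
before the conversion starts), and the names FractionToBinary, fractionBits
and maxDigits are hypothetical, introduced here for illustration.

```java
final class FractionToBinary {
    /**
     * Expands the fraction num/den (with 0 <= num < den) into binary digits
     * by repeated doubling. Stops when the remainder reaches 0 or when
     * maxDigits digits have been produced, whichever comes first.
     * Assumes num and den are small enough that num * 2 fits in a long.
     */
    static String fractionBits(long num, long den, int maxDigits) {
        if (den <= 0 || num < 0 || num >= den) {
            throw new IllegalArgumentException("need 0 <= num < den");
        }
        StringBuilder bits = new StringBuilder();
        while (num != 0 && bits.length() < maxDigits) {
            num *= 2;                 // multiply the fractional part by two
            if (num >= den) {         // the result has an integral part of 1
                bits.append('1');
                num -= den;           // replace the integral part with 0
            } else {                  // the integral part is 0
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println("." + fractionBits(25, 100, 16));     // Ex 1
        System.out.println("." + fractionBits(3125, 10000, 16)); // Ex 2
        System.out.println("." + fractionBits(1, 3, 16));        // cut off at 16 digits
    }
}
```

The maxDigits cap matters: some fractions never reach a zero remainder, so
without it the loop would never halt.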
For instance:

Ex 3. 1/10 = 0.10

  0.10
*    2
  ----
  0.20 => .0        <---------------+
*    2                              |
  ----                              |
  0.40 => .00                       |
*    2                              |
  ----                              |
  0.80 => .000                      |
*    2                              |
  ----                              |
  1.60 => .0001                     |
       => 0.60                      |
*    2                              |
  ----                              |
  1.20 => .00011                    |
       => 0.20 ---------------------+
          Which is the same as this guy!

Welcome to infinite-loop land, population you. This number (1/10) was in
fact responsible for a Patriot missile system failing in the first Gulf War:
its clock counted time in tenths of a second, and the error from storing
1/10 in a finite number of bits accumulated the longer the system ran.
People died because somebody forgot how to do binary math. So this stuff is
actually important.

Representing this stuff in memory
----------------------------------

Now we have a way of textually representing fractional values and of
converting between different bases in this textual form, but how do we turn
text into a collection of bits that hardware can understand?

Fixed-Point
------------

The first thing that comes to mind, and indeed the first thing tried in the
real world, was to just chop up a word into an integer part and a fraction
part. So 1111.0001 could be represented in a fixed-precision 8-bit word as
11110001, as long as the radix point was assumed to be between bits 4 and 3.
This was, in fact, the representation assumed by one of the earliest
high-level languages (COBOL), but in this example you can already see some
of the limitations of fixed-point representation.

First, it limits the size of both the integer and fractional part. What if
the number was 10000.1? Strictly speaking, it should only require 6 bits in
total, but because the integer part is larger than 4 bits, it cannot fit
into our fixed-point representation. Similarly, if the number was 1.00001,
we couldn't represent it because the fractional part requires too many bits.

The problem is essentially that with a fixed-point representation, the place
values of the bit positions are preassigned and not optimized for the number
in question. Consider the following number: 1*2^-35.
Even with a 32-bit word, even if the entire word is assigned to represent
the fractional part, this value cannot be represented. The galling thing is
that it really only needs a single bit! So, in summary, fixed point is bad.

Floating point
---------------

The superior solution (and the one most systems use) is known as floating
point. In the floating point scheme, the number is actually represented as
an encoding of a simple data structure. First, we need to come up with a new
way of thinking about numbers (this, coincidentally, is one of the ways real
numbers are represented in real and complex analysis). In this new scheme, a
number is composed of the following things:

1) a sign (a bit)
2) a sequence of digits (possibly infinite in length)
3) an exponent (a signed integer)

If we assume that the radix point is at the beginning of the digit sequence,
then the exponent becomes an offset that tells us where the radix point
should go in the actual value that's encoded. Because the exponent can take
on different values, the radix point can be in different locations in the
sequence. Hence the name floating point.

So now, we need to encode this data structure into a fixed-size bit string.
This is where things get somewhat complicated. Early on, many systems had
their own, incompatible representations. Fortunately, the IEEE has generated
several floating point representation standards. Most systems use these now.
Here's the layout of the IEEE 32-bit (single-precision) floating-point
format:

 31   30      23 22                                                     0
+----+----------+-------------------------------------------------------+
|sign| exponent |                       mantissa                        |
+----+----------+-------------------------------------------------------+

So, in this system, as long as the non-zero portions of the number aren't
wider than 23 bits and the exponent can fit in 8 bits, you can represent it
perfectly in 32 bits. Here, too, we see that floating point has its
limitations.
If the mantissa is too large (for instance, if it's a non-terminating
number) or the exponent is out of range, then merely encoding the value as a
floating-point number will introduce error.

In an effort to squeeze as much out of the representation as possible, the
IEEE encoding has a couple of extra tricks up its sleeve. The first is that
IEEE values are normalized. What this means is that the exponent is chosen
so that the leading one of the number sits immediately to the left of the
radix point. Since that digit is always a one, you don't have to actually
encode it. You can drop the one and shift the value over to get one more bit
of precision. The second is that the exponent is not stored as a 2's
complement value; rather, it is encoded in what's known as excess-127. What
this means is that the value in the exponent field is actually 127 + the
real exponent.

Also, the two exponent values 0 and 255 have special semantics. If the
exponent is 0 and the mantissa is 0, the value is EXACTLY 0. If the exponent
is 255 and the mantissa is 0, then the number is encoding infinity (for
example, from dividing a non-zero number by zero). If the exponent is 0 and
the mantissa is non-zero, then the value is encoding a denormal number. In
this case, there isn't a one assumed to the left of the radix point. If the
exponent is 255 and the mantissa is not zero, then this value represents NaN
(not a number), which is a result you get from things like 0/0 or the square
root of -1.

Convenient table 1:

exponent | mantissa | meaning
---------+----------+---------------------------------
    0    |    0     | number represented is EXACTLY 0
---------+----------+---------------------------------
    0    |   !0     | value is a denormal number
---------+----------+---------------------------------
   255   |    0     | infinity
---------+----------+---------------------------------
   255   |   !0     | NaN (Not a Number)
---------+----------+---------------------------------

When you use the type 'float' in Java, you're actually using one of these
single-precision numbers.

The IEEE also has a standard for 64-bit floating point values
(double-precision). These are what you're using if you have 'double' types
in Java.

 63   62      52 51                                                     0
+----+----------+-------------------------------------------------------+
|sign| exponent |                       mantissa                        |
+----+----------+-------------------------------------------------------+

Here, the exponent is 11 bits wide and is stored in excess-1023, and the
mantissa is 52 bits wide, so the special exponent values are 0 and 2047.
Apart from that, it's much the same as the single-precision form.
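
The field layouts can be inspected directly from Java. The following is a
minimal sketch, assuming the standard library calls Float.floatToRawIntBits
and Double.doubleToRawLongBits are used to get at the raw bits; the class
name FloatFields and its method names are hypothetical, introduced here for
illustration. It pulls out the three single-precision fields and classifies
a value according to convenient table 1.

```java
final class FloatFields {
    /** Sign bit (bit 31): 0 for positive, 1 for negative. */
    static int sign(float f) {
        return Float.floatToRawIntBits(f) >>> 31;
    }

    /** Stored exponent (bits 30..23), i.e. 127 + the real exponent. */
    static int storedExponent(float f) {
        return (Float.floatToRawIntBits(f) >>> 23) & 0xFF;
    }

    /** Stored mantissa (bits 22..0), without the assumed leading one. */
    static int mantissa(float f) {
        return Float.floatToRawIntBits(f) & 0x7FFFFF;
    }

    /** Stored exponent of a double (bits 62..52), i.e. 1023 + the real exponent. */
    static int storedExponentDouble(double d) {
        return (int) ((Double.doubleToRawLongBits(d) >>> 52) & 0x7FF);
    }

    /** Applies the special cases for exponent values 0 and 255. */
    static String classify(float f) {
        int e = storedExponent(f);
        int m = mantissa(f);
        if (e == 0) {
            return m == 0 ? "zero" : "denormal";
        }
        if (e == 255) {
            return m == 0 ? "infinity" : "NaN";
        }
        return "normal";
    }

    public static void main(String[] args) {
        float f = -14.3125f; // -1110.0101 in binary, i.e. -1.1100101 * 2^3
        System.out.println("sign            = " + sign(f));
        System.out.println("stored exponent = " + storedExponent(f));
        System.out.println("real exponent   = " + (storedExponent(f) - 127));
        System.out.println("mantissa bits   = " + Integer.toBinaryString(mantissa(f)));
        System.out.println("class           = " + classify(f));
    }
}
```

For -14.3125 the stored exponent is 130 (127 + 3), and the mantissa field
holds 1100101 followed by zeros: the leading one of 1.1100101 is assumed
rather than stored.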