Lecture 15: More Lovely Floating Point (by Trek Palmer)
=======================================================

Now that we know how to encode fractional values in a floating point
representation, we can learn how to perform operations on them. All floating
point operations are composed of the same three basic steps:

  1) Perform some operation on the exponents
  2) Perform some operation on the mantissas
  3) Perform rounding/truncation on the resulting mantissa, if necessary

Addition
--------
For addition, the important thing is to get the exponents to match. Otherwise
adding the mantissas means nothing. So, to add two floating point numbers:

  1) Shift the mantissa of the number with the smaller exponent right by the
     difference between the two exponents
  2) Set the result's exponent equal to the larger number's exponent
  3) Perform add/sub on the mantissas and determine the result's sign
  4) Normalize the result (if necessary) and round (if necessary)

Multiplication
--------------
Multiplication is actually easier in floating point. This derives from the
arithmetic fact that 2^n * 2^m = 2^(n+m). Therefore, to multiply:

  1) Add the exponents and subtract 127 (or 1023 if using doubles). This
     adjustment is necessary because the exponents are stored in excess-127
     form. Therefore E1' + E2' = (E1 + 127) + (E2 + 127) = (E1 + E2) + 127 + 127.
     What we need is (E1 + E2) + 127, so we just subtract off the extra 127.
  2) Multiply the mantissas and determine the sign
  3) Normalize the result (if necessary) and round (if necessary)

Division
--------
Division is similarly easy:

  1) Subtract the exponents and add 127. Here the two stored biases cancel,
     (E1 + 127) - (E2 + 127) = E1 - E2, so the bias has to be added back in.
  2) Divide the mantissas and determine the sign
  3) Normalize the result (if necessary) and round (if necessary)

Rounding
--------
Rounding (or truncation, as it is sometimes called) is where much of the
trickiness of floating point comes in.
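To make the multiplication recipe concrete, and to show where the excess bits
come from, here is a minimal Python sketch for single precision. The helper
names (float_to_fields, fields_to_float, fp_multiply) are made up for this
illustration. The sketch assumes normalized, nonzero inputs, ignores overflow,
underflow, infinities and NaNs, and simply discards the extra product bits
instead of rounding them properly.

```python
import struct

BIAS = 127  # single-precision exponents are stored in excess-127 form


def float_to_fields(x):
    """Return (sign, stored exponent, 23-bit fraction) of x as a single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fields_to_float(sign, exponent, fraction):
    """Reassemble the three fields into a Python float."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp_multiply(a, b):
    """Multiply two normalized singles by hand (illustrative only)."""
    s1, e1, f1 = float_to_fields(a)
    s2, e2, f2 = float_to_fields(b)

    # Step 1: add the stored exponents and subtract the extra bias.
    exponent = e1 + e2 - BIAS

    # Step 2: multiply the 24-bit mantissas (implicit leading 1 restored)
    # and determine the sign. The product can be up to 48 bits wide.
    product = ((1 << 23) | f1) * ((1 << 23) | f2)
    sign = s1 ^ s2

    # Step 3: normalize. The product lies in [2^46, 2^48). If bit 47 is set
    # the value is >= 2.0, so shift one extra place and bump the exponent.
    if product >> 47:
        mantissa = product >> 24
        exponent += 1
    else:
        mantissa = product >> 23

    # The shifts above throw the low product bits away (no real rounding).
    return fields_to_float(sign, exponent, mantissa & 0x7FFFFF)


print(fp_multiply(1.5, 2.5))  # 3.75
```

Whenever the low product bits are not all zero, the answer from this sketch
can be one LSB below the true product, which is why a rounding step is needed.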
The trouble is that a result's mantissa can be twice as wide (in bits) as the
representation can hold. For example, multiplying two n-bit mantissas can give
(2^n - 1) * (2^n - 1) = 2^(2n) - 2^(n+1) + 1, which needs 2n bits. So, we need
to get rid of the excess bits before we can squeeze the result into our
friendly IEEE representations. In the examples below, "|" separates the 8 bits
that are kept from the bits that are dropped, and the error is the rounded
value minus the exact value.

Method 1: Chopping
------------------
Chopping is the simplest method. It simply discards the excess bits. This, it
turns out, is bad. Chopping, you see, is a biased form of rounding. By biased,
I mean that the error always goes in the same direction. In practical terms,
you are always rounding toward zero, which means that your error is not evenly
distributed about zero. This matters because if the error is evenly
distributed around zero, then positive errors are as likely as negative
errors, and after a stream of floating point operations many of the positive
errors will end up canceling the negative ones (statistically speaking, that
is). So chopping is bad because it basically ensures that your truncation
error will grow with the number of operations you perform.

  Ex1 (8-bit system): 00110100|11111100 -> 00110100, error ~ -2^-8
  Ex2:                00110101|10000000 -> 00110101, error = -2^-9

Method 2: Von Neumann Rounding
------------------------------
This is a better method than chopping. In this scheme, if the truncated bits
are all zero, you do nothing. Just drop them on the floor. If, however, any of
the truncated bits is 1, you set the LSB of the retained mantissa to 1. This is
unbiased, actually, and fairly simple to implement in hardware. Its error
ranges from -1 to +1 in the LSB.

  Ex1: 00110100|11111100 -> 00110101, error = +2^-14
  Ex2: 00110101|10000000 -> 00110101, error = -2^-9

Method 3: Actual Rounding
-------------------------
This method has the smallest error, and is also unbiased. It is, however, more
complicated. In this scheme, a 1 is added to the LSB position of the result if
the MSB of the truncated bits is 1.
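As a rough sketch, the three truncation schemes of this section can be written
as small Python functions on unsigned integers. The function names and the
16-bit-to-8-bit framing are assumptions taken from the examples. The third
function also applies the round-half-to-even tie rule that IEEE 754 uses by
default.

```python
def split(value, total_bits=16, keep_bits=8):
    """Return (kept high bits, dropped low bits, number of dropped bits)."""
    drop_bits = total_bits - keep_bits
    return value >> drop_bits, value & ((1 << drop_bits) - 1), drop_bits


def chop(value, total_bits=16, keep_bits=8):
    """Method 1: discard the excess bits."""
    kept, _, _ = split(value, total_bits, keep_bits)
    return kept


def von_neumann(value, total_bits=16, keep_bits=8):
    """Method 2: force the LSB to 1 if any dropped bit is 1."""
    kept, dropped, _ = split(value, total_bits, keep_bits)
    return (kept | 1) if dropped else kept


def round_nearest_even(value, total_bits=16, keep_bits=8):
    """Method 3: round to nearest, breaking exact ties toward an even LSB."""
    kept, dropped, drop_bits = split(value, total_bits, keep_bits)
    half = 1 << (drop_bits - 1)  # dropped pattern 100...0
    if dropped > half or (dropped == half and kept & 1):
        kept += 1  # a carry out of keep_bits would need renormalizing
    return kept


def error_units(value, rounded, total_bits=16, keep_bits=8):
    """Rounded minus exact, in units of the lowest dropped bit."""
    return (rounded << (total_bits - keep_bits)) - value


ex1 = 0b0011010011111100
ex2 = 0b0011010110000000
for method in (chop, von_neumann, round_nearest_even):
    for x in (ex1, ex2):
        r = method(x)
        print(method.__name__, format(r, "08b"), error_units(x, r))
```

With 16-bit inputs read as pure fractions, one unit reported by error_units is
2^-16, so an error of 4 units is 2^-14 and an error of -128 units is -2^-9.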
The error range is then (-1/2, +1/2) in the LSB. But there is a special case.
If the truncated bits are 100...0 (exactly 1/2 of the LSB value), then the
exact result sits halfway between the two nearest representable values. This
is analogous to the case of 5 when doing decimal rounding. Here, the technique
is to round to whichever neighbor is even (has an LSB of 0). This is the
default rounding procedure in the IEEE standard. To implement it, the hardware
only needs to preserve 3 of the truncated bits. The first two are the two most
significant truncated bits (often called the guard and round bits), and the
third is the logical OR of all the rest of the truncated bits (this guy is
also known as the sticky bit). So, in this case, the annoying in-between value
would be represented as 100.

  Ex1: 00110100|11111100 -> 00110101, error = +2^-14
  Ex2: 00110101|10000000 -> 00110110, error = +2^-9

So, now that we know everything about floating point, we can understand why
floating point operations are so much more painful than integer operations.
The hardware is also much more complicated, and consequently slower. So the
real take-home lesson is: only use floating point when absolutely necessary.
The issues with speed and error are so difficult to reason about that you
shouldn't bother if you don't have to.

Dynamic Memory and Process Layout
---------------------------------
Throughout the course, you have been introduced piecemeal to the separate
structures that make up the state of an executing program. Those pieces are:

  1) Text: This is the actual code, stored in memory and pointed at by the PC
  2) PC: The program counter, which points at the currently executing
     instruction
  3) Registers: The system registers, which contain the values currently being
     computed on and the temporaries necessary for future operations
  4) Static Data: These are the constant-sized data elements that the program
     reads and writes. All values in the .data section are of a fixed size,
     but may be modified by the program.
     Anything whose size can be ascertained at compilation/assembly time can
     be placed in the static data section.
  5) The Stack: The stack holds the state for function calls, and is used to
     preserve function-local data and the return address (link register)
     across potentially destructive function calls.

Note that the PC and the registers are the only parts of this program state
that are actually on the CPU. Everything else is in main memory.

Looks like we've got everything we need here, right? Wrong. There is one
crucial missing part: the heap.