Floating Point
Need numbers with fractions (type float in C)
3.1415
3.15576_ten * 10^9
Scientific notation:
Normalized: no leading 0
Base 10:    +/- D.FFFF * 10^exp
Base K:     +/- D.FFFF * K^exp
Base 2:     +/- D.FFFF * 2^exp
Binary example
1.0_two * 2^-1 = 0.5_ten
   (the "." in 1.0_two is the binary point)
General form: canonical binary scientific notation
1.ffff_two * 2^eeee
ffff: fraction (significand)
eeee: exponent (expressed in decimal for simplicity)
Note that both have signs
Standardized notation in normal form
simplifies exchange of data
simplifies arithmetic algorithms
increases accuracy of numbers that can be stored (no leading 0's)
Representation tradeoff
Precision (size of significand)
vs Range (size of exponent)
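One way to see the precision side of this tradeoff is to check which integers survive a trip through a float. The sketch below is only an illustration: the function name is hypothetical, and it assumes float is a 32-bit IEEE 754 single-precision value with 24 significant bits (23 stored fraction bits plus the hidden 1).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if n is exactly representable as a float,
   0 if converting to float and back changes its value. */
int survives_float_round_trip(int32_t n) {
    /* Precondition chosen for this sketch: keep the float-to-int cast in range. */
    assert(n > -2000000000 && n < 2000000000);
    volatile float f = (float)n;   /* volatile forces a real 32-bit store */
    return (int32_t)f == n;
}
```

With these assumptions, survives_float_round_trip(16777216) is 1 because 2^24 fits in the significand, while survives_float_round_trip(16777217) is 0 because 2^24 + 1 needs 25 significant bits.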
Design principle
"Good design demands good compromises."
IEEE 754 standard (1985), single precision:
 b31 | b30 ... b23 | b22 ... b0
  s  |      e      |     f
s (1 bit): sign
e (8 bits): exponent
excess 127 (NOT 128!): exp = e - 127
-126 <= exp <= 127 for normalized numbers (e = 0 and e = 255 are reserved)
f (23 bits): fraction
called significand or mantissa
assume leading 1 ("hidden one")
value = (-1)^s * (1.ff…f)_two * 2^(e - 127)
normalized magnitude range: about 1.2 * 10^-38 to 3.4 * 10^+38 (base ten)
overflow: exponent too large to fit
underflow: exponent too small to fit
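As a minimal sketch of the field layout above, the following C helpers pull s, e, and f out of a float and rebuild a normalized value from them. All function names are hypothetical, and the code assumes that float is a 32-bit IEEE 754 single-precision value on the target machine.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the 32 bits of a float into an integer. */
static uint32_t float_bits(float x) {
    uint32_t bits;
    assert(sizeof x == sizeof bits);   /* assumes a 32-bit float */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* s: bit 31, e: bits 30..23, f: bits 22..0 */
unsigned sign_field(float x)     { return float_bits(x) >> 31; }
unsigned exponent_field(float x) { return (float_bits(x) >> 23) & 0xFFu; }
uint32_t fraction_field(float x) { return float_bits(x) & 0x7FFFFFu; }

/* Rebuild a normalized value from its fields:
   (-1)^s * (1 + f / 2^23) * 2^(e - 127).
   Only meaningful when e is neither 0 nor 255. */
double normalized_value(unsigned s, unsigned e, uint32_t f) {
    double significand = 1.0 + (double)f / 8388608.0;   /* 8388608 = 2^23 */
    double magnitude = ldexp(significand, (int)e - 127);
    return s ? -magnitude : magnitude;
}
```

For example, exponent_field(0.5f) is 126, since 0.5 = 1.0_two * 2^-1 and -1 + 127 = 126.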
categories of floating point numbers
    * normalized numbers: standard floating point numbers.
Most bitstring patterns in IEEE 754
    * denormalized numbers: fewer bits of precision, and smaller (in magnitude)
than normalized numbers.
    * zero: a positive and negative representation of 0 (sign bit 0 or 1)
    * infinity: also a positive and negative infinity
For example, 1.0/0.0 produces infinity.
    * NaN: "not a number"; an undefined value such as sqrt(-4)
Category              Sign Bit  Exponent                        Fraction
--------------------  --------  ------------------------------  ----------
Zero                  Anything  00000000                        23 0's
Infinity              Anything  11111111                        23 0's
NaN                   Anything  11111111                        Not 23 0's
Denormalized numbers  Anything  00000000                        Not 23 0's
Normalized numbers    Anything  Neither 00000000 nor 11111111   Anything
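The category table translates almost line for line into code. The sketch below is one possible way to write it; the enum and function names are hypothetical, and classify_float assumes a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the five categories in the table. */
enum category { CAT_ZERO, CAT_DENORMALIZED, CAT_NORMALIZED, CAT_INFINITY, CAT_NAN };

/* Classify a 32-bit pattern by its exponent and fraction fields.
   The sign bit plays no role in the category. */
enum category classify_bits(uint32_t bits) {
    uint32_t e = (bits >> 23) & 0xFFu;   /* 8 exponent bits  */
    uint32_t f = bits & 0x7FFFFFu;       /* 23 fraction bits */
    if (e == 0x00u) return f == 0 ? CAT_ZERO : CAT_DENORMALIZED;
    if (e == 0xFFu) return f == 0 ? CAT_INFINITY : CAT_NAN;
    return CAT_NORMALIZED;
}

/* Convenience wrapper: classify a float by looking at its bits. */
enum category classify_float(float x) {
    uint32_t bits;
    assert(sizeof x == sizeof bits);     /* assumes a 32-bit float */
    memcpy(&bits, &x, sizeof bits);
    return classify_bits(bits);
}
```

Under these assumptions, classify_bits(0x7F800000u) is CAT_INFINITY and classify_bits(0x00000001u) is CAT_DENORMALIZED.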
S Exp       Fraction                      Value               Category
- --------- ----------------------------  ------------------  ------------
0 1111 1111 1111 1111 1111 1111 1111 111  NaN                 NaN
0 1111 1111 0000 0000 0000 0000 0000 001  NaN                 NaN
0 1111 1111 0000 0000 0000 0000 0000 000  Infinity            Infinity
0 1111 1110 1111 1111 1111 1111 1111 111  1.11...1 x 2^127    Normalized
0 1111 1110 0000 0000 0000 0000 0000 000  2^127               Normalized
0 1000 0000 0000 0000 0000 0000 0000 000  2^1                 Normalized
0 0111 1111 0000 0000 0000 0000 0000 000  1                   Normalized
0 0111 1110 0000 0000 0000 0000 0000 000  2^-1                Normalized
0 0000 0001 0000 0000 0000 0000 0000 000  2^-126              Normalized
0 0000 0000 1111 1111 1111 1111 1111 111  0.11...1 x 2^-126   Denormalized
0 0000 0000 1000 0000 0000 0000 0000 000  2^-127              Denormalized
0 0000 0000 0000 0000 0000 0000 0000 001  2^-149              Denormalized
0 0000 0000 0000 0000 0000 0000 0000 000  0                   Zero
Denormalized numbers
Exponent 00000000:
No hidden 1.
b22-0: bits after radix point.
Fix the exponent to -126. (Why?)
Largest positive denormalized
S    Exp             Fraction
- --------- ----------------------------
0 0000 0000 1111 1111 1111 1111 1111 111
0.11 . . . 1 x 2^-126
23 bits of precision, since there are 23 1's after the radix point
Smallest positive NORMALIZED
S    Exp            Fraction
- --------- ----------------------------
0 0000 0001 0000 0000 0000 0000 0000 000
1.0 x 2^-126   (Why 1.0?)
Choices for the denormalized exponent:
0.11 . . . 1 x 2^-127  (exponent is 0 - 127)
0.11 . . . 1 x 2^-126  (exponent is 1 - 127)
Both choices are smaller than 1.0 x 2^-126, the smallest normalized number.
By picking -126 instead of -127, the gap between the largest denormalized number
and the smallest normalized number is smaller.
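The gap argument can be checked numerically. The following sketch computes the distance from the largest denormalized number to the smallest normalized number for either choice of fixed exponent. The function names are hypothetical, and the arithmetic is done in double so that every intermediate value is exact.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Value of a denormalized pattern: no hidden 1, so the significand is
   0.f = f / 2^23, scaled by the fixed exponent under discussion. */
double denormalized_value(uint32_t f, int fixed_exponent) {
    assert(f <= 0x7FFFFFu);                      /* f is a 23-bit field */
    return ldexp((double)f / 8388608.0, fixed_exponent);   /* 8388608 = 2^23 */
}

/* Distance between the smallest normalized number (1.0 x 2^-126) and the
   largest denormalized number (fraction of all 1's) for a given choice. */
double gap_to_smallest_normalized(int fixed_exponent) {
    double smallest_normalized = ldexp(1.0, -126);
    double largest_denormalized = denormalized_value(0x7FFFFFu, fixed_exponent);
    return smallest_normalized - largest_denormalized;
}
```

With the exponent fixed at -126 the gap is 2^-149, the same spacing as between neighboring denormalized numbers. With -127 the gap would be a little more than 2^-127, which is millions of times larger.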
Converting from base 10 to normalized IEEE 754
Problem: convert 10.25 from base 10
    (1) Convert whole number (to the left of the radix point) to base 2
10_ten is 1010_two
    (2) Convert fraction (number to the right of the radix point) to base 2
.25_ten is .01_two (0 * 0.5 + 1 * 0.25)
    (3) Add:
1010 + 0.01 is 1010.01
    (4) Binary scientific notation:
1010.01 * 2^0, which is 1.01001 * 2^3
    (5) IEEE 754 single precision:
adjust precision to 23 bits: pad the fraction 01001 with 0's on the right
    (6) Convert 3 to the correct bias:
Add 127 to 3 to get 130 and convert to binary.
Result: 1000 0010 (128 + 2)
    (7) Combine results:
S    Exp            Fraction
- --------- ----------------------------
0 1000 0010 0100 1000 0000 0000 0000 000
Notice hidden "1" is not represented in the fraction.
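The steps above can be replayed in C by assembling the three fields by hand and comparing against what the compiler produces for 10.25f. This is a sketch with hypothetical function names, and it assumes float is a 32-bit IEEE 754 single-precision value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: place s, e, f into bits 31, 30..23, 22..0. */
uint32_t pack_fields(unsigned s, unsigned e, uint32_t f) {
    return ((uint32_t)s << 31) | ((uint32_t)e << 23) | f;
}

/* Hypothetical helper: reinterpret a 32-bit pattern as a float. */
float bits_to_float(uint32_t bits) {
    float x;
    assert(sizeof x == sizeof bits);   /* assumes a 32-bit float */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* 10.25 = 1.01001_two * 2^3 */
uint32_t encode_10_25(void) {
    unsigned s = 0;                    /* positive                          */
    unsigned e = 3 + 127;              /* step (6): biased exponent, 130    */
    uint32_t f = (uint32_t)0x9 << 18;  /* step (5): 01001 padded to 23 bits */
    return pack_fields(s, e, f);       /* hidden 1 is not stored            */
}
```

Here encode_10_25() yields 0x41240000, which is the bit pattern 0 1000 0010 0100 1000 0000 0000 0000 000 from step (7).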
Converting from base 10 to IEEE 754 denormalized
Example: 1.1 x 2^-128 to IEEE 754 single precision
Since -128 < -126, denormalized number
  - Shift radix point so exponent is -126
0.011 x 2^-126
  - Exponent is 8 0's. Why?
  - Bits after the radix point: fraction.
  - Sign bit: 0
S    Exp            Fraction
- --------- ----------------------------
0 0000 0000 0110 0000 0000 0000 0000 000
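Shifting the radix point corresponds to a right shift of the significand once the hidden 1 is written out explicitly. The sketch below shows that step in isolation. The function name is hypothetical, and it simply drops any bits shifted off the right end rather than rounding them.

```c
#include <assert.h>
#include <stdint.h>

/* Given a number 1.f * 2^exp with exp < -126, return the 23-bit fraction
   field of its denormalized encoding (exponent field all 0's).
   frac23 holds the 23 bits after the radix point of the normalized form. */
uint32_t denormalized_fraction(uint32_t frac23, int exp) {
    uint32_t significand = 0x800000u | frac23;   /* make the hidden 1 explicit */
    int shift = -126 - exp;                      /* places the radix point moves */
    assert(frac23 <= 0x7FFFFFu);
    assert(shift >= 1 && shift <= 24);
    return significand >> shift;                 /* truncates, does not round */
}
```

For the example, 1.1 x 2^-128 has frac23 = 0x400000 and exp = -128, so the shift is 2 and the result is 0x300000, the fraction 0110 0000 0000 0000 0000 000 shown in the table.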
No unsigned float: always have a sign bit
Why Sign Bit, Exponent, then Fraction?
Comparisons (<, >)
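A short sketch of why this field order helps with comparisons: for two non-negative floats, comparing the raw bit patterns as unsigned integers gives the same answer as comparing the floats, because the exponent sits in the more significant bits. The function names are hypothetical, the code assumes a 32-bit IEEE 754 float, and it deliberately ignores negative numbers and NaN, which need extra handling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the 32 bits of a float into an integer. */
static uint32_t float_bits(float x) {
    uint32_t bits;
    assert(sizeof x == sizeof bits);   /* assumes a 32-bit float */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Integer comparison of the bit patterns.
   Sketch only: valid for a >= 0 and b >= 0, neither of them NaN. */
int less_than_via_bits(float a, float b) {
    return float_bits(a) < float_bits(b);
}
```

For instance, less_than_via_bits(0.5f, 1.0f) is 1 because the exponent fields 0111 1110 and 0111 1111 already decide the comparison.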
Double precision:
sign: 1 bit
exponent: 11 bits (Why not 16?)
fraction: 52 bits
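The same field extraction works for double precision with wider masks. This is a sketch with hypothetical function names. It assumes double is a 64-bit IEEE 754 value, and the bias of 1023 used in the comments comes from the IEEE 754 standard rather than from these notes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the 64 bits of a double into an integer. */
static uint64_t double_bits(double x) {
    uint64_t bits;
    assert(sizeof x == sizeof bits);   /* assumes a 64-bit double */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* sign: bit 63, exponent: bits 62..52 (11 bits), fraction: bits 51..0 (52 bits) */
unsigned double_sign_field(double x)     { return (unsigned)(double_bits(x) >> 63); }
unsigned double_exponent_field(double x) { return (unsigned)((double_bits(x) >> 52) & 0x7FFu); }
uint64_t double_fraction_field(double x) { return double_bits(x) & 0xFFFFFFFFFFFFFull; }
```

For 10.25 = 1.01001_two * 2^3, the exponent field is 3 + 1023 = 1026 and the fraction field starts with 01001, just as in the single-precision example.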
You should be able to do the following (after review and practice):
    * Give the names of each of the five categories of floating point numbers
in IEEE 754 single precision.
    * Given a 32-bit string, determine which category the bitstring falls in.
    * Given a normalized or denormalized number, write the number in
canonical binary scientific notation (you can leave the exponent in base 10).
    * Given a number in base 10 or canonical binary scientific notation,
convert it to an IEEE 754 single precision floating point number.
    * Know what bias is used for normalized numbers.
    * Know what exponent is used for denormalized numbers.
    * Know what the hidden 1 is.