Floating Point

Need numbers with fractions (type float in C)

    3.1415
    3.15576_ten x 10^9

Scientific notation:

    Normalized: no leading 0

    Base 10:  +/- D.FFFF x 10^exp
    Base K:   +/- D.FFFF x K^exp
    Base 2:   +/- D.FFFF x 2^exp

Binary example

    1.0_two x 2^-1  =  0.5_ten

    The "." in 1.0_two is the binary point.

General form: canonical binary scientific notation

    1.ffff_two x 2^eeee

    ffff: fraction (significand)
    eeee: exponent (expressed in decimal for simplicity)

    Note that both the number itself and its exponent carry a sign.

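A minimal sketch of putting a value into the 1.ffff x 2^eeee form, assuming Python's standard math.frexp is an acceptable way to get the exponent. The helper name binary_scientific is made up for illustration.

```python
import math

def binary_scientific(x):
    """Return (sign, significand, exponent) with 1 <= significand < 2."""
    if x == 0:
        raise ValueError("zero has no normalized form")
    sign = "-" if x < 0 else "+"
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    return sign, m * 2, e - 1   # rescale so the leading binary digit is 1

print(binary_scientific(0.5))    # ('+', 1.0, -1)
print(binary_scientific(10.25))  # ('+', 1.28125, 3)
```

The significand is printed in decimal here: 1.28125_ten is 1.01001_two.
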
Standardized notation in normal form

    * simplifies exchange of data
    * simplifies arithmetic algorithms
    * increases accuracy of numbers that can be stored (no leading 0's)

Representation tradeoff

    Precision (size of significand) vs. range (size of exponent)

Design principle

    "Good design demands good compromises."

IEEE 754 standard (1985)

    b31   b30 ... b23   b22 ... b0
     s         e             f

    s (1 bit): sign
    e (8 bits): exponent
        excess 127 (NOT 128!)
        -127 <= e - 127 <= 128 over all stored values of e;
        normalized numbers use -126 <= exp <= 127
        (stored 00000000 and 11111111 are reserved)
    f (23 bits): fraction
        called significand or mantissa
        assume leading 1 ("hidden one")

    value = (-1)^s x (1.ff...f)_two x 2^(e - 127)

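A small sketch of pulling the s, e, and f fields out of a single-precision value, assuming Python's standard struct module is used to get at the bits. The helper name fields is hypothetical.

```python
import struct

def fields(x):
    """Split x (rounded to single precision) into the (s, e, f) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # b31
    e = (bits >> 23) & 0xFF   # b30..b23, stored in excess 127
    f = bits & 0x7FFFFF       # b22..b0
    return s, e, f

s, e, f = fields(-0.75)       # -0.75 is -1.1_two x 2^-1
print(s, e, f)                # 1 126 4194304

# Rebuild the value from the fields (normalized case only).
value = (-1) ** s * (1 + f / 2**23) * 2.0 ** (e - 127)
print(value)                  # -0.75
```
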
    normalized magnitude range (approximate):
        1.2 x 10^-38 to 3.4 x 10^38

    overflow: exponent too large to fit
    underflow: exponent too small to fit

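One rough way to see overflow and underflow is to compare a value's binary exponent with the normalized single-precision range of -126 to 127. This sketch ignores rounding of the significand, and the helper name exponent_status is invented for illustration.

```python
import math

def exponent_status(x):
    """Say whether nonzero x has an exponent that fits a normalized single."""
    _, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    exp = e - 1                 # exponent of the 1.fff... form
    if exp > 127:
        return "overflow"
    if exp < -126:
        return "underflow"
    return "fits"

print(exponent_status(1.0))     # fits
print(exponent_status(1e39))    # overflow
print(exponent_status(1e-40))   # underflow
```
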
Categories of floating point numbers

    * normalized numbers: standard floating point numbers.
      Most bitstring patterns in IEEE 754.
    * denormalized numbers: fewer bits of precision, and smaller (in magnitude)
      than normalized numbers.
    * zero: a positive and negative representation of 0 (sign bit 0 or 1)
    * infinity: also a positive and negative infinity.
      For example, 1.0/0.0 produces infinity.
    * NaN: "not a number"; undefined value like sqrt(-4)

Category               Sign Bit   Exponent               Fraction
--------------------   --------   --------------------   ----------
Zero                   Anything   00000000               23 0's
Infinity               Anything   11111111               23 0's
NaN                    Anything   11111111               Not 23 0's
Denormalized numbers   Anything   00000000               Not 23 0's
Normalized numbers     Anything   Neither of the above   Anything

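The category rules can be sketched as a short function over a 32-bit pattern. The name classify and the use of a plain Python integer for the bitstring are assumptions made for this example.

```python
def classify(bits):
    """Return the IEEE 754 single-precision category of a 32-bit pattern."""
    e = (bits >> 23) & 0xFF   # 8 exponent bits
    f = bits & 0x7FFFFF       # 23 fraction bits
    if e == 0x00:
        return "zero" if f == 0 else "denormalized"
    if e == 0xFF:
        return "infinity" if f == 0 else "NaN"
    return "normalized"       # the sign bit never affects the category

print(classify(0x00000000))   # zero
print(classify(0x7F800000))   # infinity
print(classify(0x7F800001))   # NaN
print(classify(0x00000001))   # denormalized
print(classify(0x3F800000))   # normalized
```
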
S Exp       Fraction                      Value               Category
- --------- ----------------------------  ------------------  ------------
0 1111 1111 1111 1111 1111 1111 1111 111  NaN                 NaN
0 1111 1111 0000 0000 0000 0000 0000 001  NaN                 NaN
0 1111 1111 0000 0000 0000 0000 0000 000  Infinity            Infinity
0 1111 1110 1111 1111 1111 1111 1111 111  1.11...1 x 2^127    Normalized
0 1111 1110 0000 0000 0000 0000 0000 000  2^127               Normalized
0 1000 0000 0000 0000 0000 0000 0000 000  2^1                 Normalized
0 0111 1111 0000 0000 0000 0000 0000 000  1                   Normalized
0 0111 1110 0000 0000 0000 0000 0000 000  2^-1                Normalized
0 0000 0001 0000 0000 0000 0000 0000 000  2^-126              Normalized
0 0000 0000 1111 1111 1111 1111 1111 111  0.11...1 x 2^-126   Denormalized
0 0000 0000 1000 0000 0000 0000 0000 000  2^-127              Denormalized
0 0000 0000 0000 0000 0000 0000 0000 001  2^-149              Denormalized
0 0000 0000 0000 0000 0000 0000 0000 000  0                   Zero

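The rows of the table can be checked with a small decoder. This is a sketch only: decode is a made-up name, and it relies on Python floats (double precision) being able to hold every single-precision value exactly.

```python
def decode(bits):
    """Turn a 32-bit IEEE 754 single-precision pattern into a Python float."""
    sign = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0xFF:                                   # infinity or NaN
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0x00:                                   # zero or denormalized
        return sign * (f / 2**23) * 2.0 ** -126     # no hidden 1, exponent -126
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)

print(decode(0x7F000000) == 2.0 ** 127)    # True
print(decode(0x3F800000))                  # 1.0
print(decode(0x00400000) == 2.0 ** -127)   # True
print(decode(0x00000001) == 2.0 ** -149)   # True
```
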
Denormalized numbers

    Exponent 00000000:
        No hidden 1.
        b22-0: bits after radix point.
        Fix the exponent to -126. (Why?)

Largest positive denormalized

    S Exp       Fraction
    - --------- ----------------------------
    0 0000 0000 1111 1111 1111 1111 1111 111

    0.11...1 x 2^-126

    23 bits of precision, since there are 23 1's after the radix point

Smallest positive NORMALIZED

    S Exp       Fraction
    - --------- ----------------------------
    0 0000 0001 0000 0000 0000 0000 0000 000

    1.0 x 2^-126        Why 1.0?

Choices for denormalized:

    0.11...1 x 2^-127   (exponent is 0 - 127)
    0.11...1 x 2^-126   (exponent is 1 - 127)

    Both choices are smaller than 1.0 x 2^-126, the smallest normalized number.

    By picking -126 instead of -127, the gap between the largest denormalized
    number and the smallest normalized number is smaller.

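The size of the gap under each choice can be worked out exactly. The sketch below uses Python's fractions module for exact arithmetic; the variable names are illustrative only.

```python
from fractions import Fraction

two = Fraction(2)
smallest_normal = two ** -126        # 1.0 x 2^-126
all_ones = 1 - two ** -23            # 0.11...1 with 23 ones after the point
unit = two ** -149                   # spacing between adjacent denormalized numbers

gaps = {}
for exp in (-127, -126):
    largest_denormal = all_ones * two ** exp
    gaps[exp] = smallest_normal - largest_denormal
    print(exp, float(gaps[exp] / unit))
# -127 4194304.5
# -126 1.0
```

With -126 the gap is exactly one denormalized step (2^-149), so the denormalized numbers run evenly up to the smallest normalized number.
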
Converting from base 10 to normalized IEEE 754

    Problem: convert 10.25 from base 10

    (1) Convert whole number (to the left of the radix point) to base 2
            10_ten is 1010_two
    (2) Convert fraction (number to the right of the radix point) to base 2
            0.25_ten is 0.01_two   (0 x 0.5 + 1 x 0.25)
    (3) Add:
            1010 + 0.01 is 1010.01
    (4) Binary scientific notation:
            1010.01 x 2^0, which is 1.01001 x 2^3
    (5) IEEE 754 single precision:
            adjust the fraction to 23 bits by padding with 0's:
            0100 1000 0000 0000 0000 000
    (6) Convert 3 to the correct bias:
            Add 127 to 3 to get 130 and convert to binary.
            Result: 1000 0010   (128 + 2)
    (7) Combine results:

            S Exp       Fraction
            - --------- ----------------------------
            0 1000 0010 0100 1000 0000 0000 0000 000

    Notice that the hidden "1" is not represented in the fraction.

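The seven steps can be followed literally in code. This is a sketch under stated assumptions: the input is positive, at least 1, and needs no more than 23 fraction bits (extra bits are simply truncated, not rounded). The name encode_normalized is hypothetical.

```python
def encode_normalized(x):
    """Encode x >= 1 as a 32-character IEEE 754 single-precision bit string."""
    whole = int(x)
    frac = x - whole
    whole_bits = bin(whole)[2:]              # (1) whole part in base 2
    frac_bits = ""
    while frac and len(frac_bits) < 40:      # (2) fraction by repeated doubling
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit
    digits = whole_bits + frac_bits          # (3) whole_bits . frac_bits
    exp = len(whole_bits) - 1                # (4) move the point after the first 1
    fraction = digits[1:]                    #     drop the hidden 1
    fraction = fraction.ljust(23, "0")[:23]  # (5) pad or truncate to 23 bits
    exponent = format(exp + 127, "08b")      # (6) add the bias of 127
    return "0" + exponent + fraction         # (7) sign, exponent, fraction

print(encode_normalized(10.25))
# 01000001001001000000000000000000
```
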
Converting from binary scientific notation to IEEE 754 denormalized

    Example: 1.1_two x 2^-128 to IEEE 754 single precision

    Since -128 < -126, denormalized number

    - Shift radix point so exponent is -126
          0.011 x 2^-126
    - Exponent is 8 0's. Why?
    - Bits after the radix point: fraction.
    - Sign bit: 0

    S Exp       Fraction
    - --------- ----------------------------
    0 0000 0000 0110 0000 0000 0000 0000 000

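The same example can be checked with exact arithmetic. A minimal sketch, assuming a positive input below 2^-126 that fits in 23 fraction bits. The name encode_denormalized is invented for this example.

```python
from fractions import Fraction

def encode_denormalized(value):
    """Encode 0 < value < 2^-126 as a 32-character single-precision bit string."""
    scaled = value / Fraction(2) ** -126   # the 0.fff...f part once the exponent is -126
    field = scaled * 2**23                 # the 23 fraction bits read as an integer
    if not 0 < scaled < 1 or field.denominator != 1:
        raise ValueError("not an exactly representable denormalized value")
    return "0" + "0" * 8 + format(int(field), "023b")

value = Fraction(3, 2) * Fraction(2) ** -128   # 1.1_two x 2^-128
print(encode_denormalized(value))
# 00000000001100000000000000000000
```
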
No unsigned float: always have a sign bit

Why Sign Bit, Exponent, then Fraction?

    Comparisons (<, >): with the exponent in higher bits than the fraction,
    a larger non-negative value always has a larger bit pattern.

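A quick check of the comparison property for non-negative values, using Python's struct module to read the bit patterns as unsigned integers. The helper name bits is hypothetical. Negative values use sign and magnitude and would need extra handling.

```python
import struct

def bits(x):
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (pattern,) = struct.unpack(">I", struct.pack(">f", x))
    return pattern

# Values listed in increasing order: zero, denormalized, then normalized.
values = [0.0, 2.0 ** -149, 2.0 ** -126, 0.5, 1.0, 10.25, 2.0 ** 127]
patterns = [bits(v) for v in values]
print(patterns == sorted(patterns))   # True
```
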
Double precision:

    sign: 1 bit
    exponent: 11 bits    Why not 16?
    fraction: 52 bits

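Python floats are IEEE 754 double precision, so the 1 / 11 / 52 split can be inspected directly. This is a sketch; double_fields is a made-up name, and the bias of 1023 comes from the IEEE 754 double format rather than from the text above.

```python
import struct

def double_fields(x):
    """Split a double into (sign, 11-bit exponent, 52-bit fraction)."""
    (pattern,) = struct.unpack(">Q", struct.pack(">d", x))
    s = pattern >> 63
    e = (pattern >> 52) & 0x7FF        # stored in excess 1023
    f = pattern & ((1 << 52) - 1)
    return s, e, f

s, e, f = double_fields(10.25)         # 1.01001_two x 2^3
print(s, e - 1023, format(f, "052b")[:8])   # 0 3 01001000
```
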
You should be able to do the following (after review and practice):

    * Give the names of each of the five categories of floating point numbers
      in IEEE 754 single precision.
    * Given a 32-bit string, determine which category the bitstring falls in.
    * Given a normalized or denormalized number, write the number in
      canonical binary scientific notation (you can leave the exponent in base 10).
    * Given a number in base 10 or canonical binary scientific notation,
      convert it to an IEEE 754 single precision floating point number.
    * Know what bias is used for normalized numbers.
    * Know what exponent is used for denormalized numbers.
    * Know what the hidden 1 is.