Floating Point

Need numbers with fractions (type float in C)

3.1415

3.15576_ten * 10⁹

Scientific notation:

Normalized: no leading 0

Base 10:

+/- D.FFFF * 10^exp

Base K:

+/- D.FFFF * K^exp

Base 2:

+/- D.FFFF X 2^exp

Binary example

1.0 * 2^-1

0.5_ten

binary point

General form: canonical binary scientific notation

1.ffff_two * 2^eeee

ffff

fraction (significand)

eeee

exponent (expressed in decimal for simplicity)

Note that both have signs
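As a rough illustration of the canonical form, the sketch below halves or doubles a value until exactly one 1 sits to the left of the binary point, counting the shifts as the exponent. The names `SciBinary` and `to_sci_binary` are hypothetical, chosen only for this example, and the input is assumed to be nonzero and finite.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical container for +/- 1.ffff_two * 2^eeee. */
typedef struct {
    int negative;        /* 1 if the value is negative */
    double significand;  /* the 1.ffff part, always in [1, 2) */
    int exponent;        /* the eeee part, kept in decimal */
} SciBinary;

/* Assumes x is nonzero and finite. */
SciBinary to_sci_binary(double x) {
    SciBinary r;
    r.negative = x < 0.0;
    r.significand = r.negative ? -x : x;
    r.exponent = 0;
    assert(r.significand > 0.0 && r.significand <= DBL_MAX);
    /* Too large: move the binary point left, raising the exponent. */
    while (r.significand >= 2.0) { r.significand /= 2.0; r.exponent++; }
    /* Leading 0: move the binary point right, lowering the exponent. */
    while (r.significand < 1.0)  { r.significand *= 2.0; r.exponent--; }
    return r;
}
```

For instance, `to_sci_binary(0.5)` gives significand 1.0 and exponent -1, matching the binary example 1.0 * 2^-1.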

Standardized notation in normal form

simplifies exchange of data

simplifies arithmetic algorithms

increases accuracy of numbers that can be stored (no leading 0's)

Representation tradeoff

Precision (size of significand)

Range (size of exponent)

Design principle

"Good design demands good compromises."

IEEE 754 standard (1985)

Bit layout: b₃₁ (s) | b₃₀ … b₂₃ (e) | b₂₂ … b₀ (f)

s (1 bit): sign

e (8 bits): exponent

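excess 127 (NOT 128!): stored value = exp + 127

-126 <= exp <= 127 for normalized numbers (stored values 1 to 254; 0 and 255 are reserved for zero/denormalized and infinity/NaN)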

f (23 bits): fraction

called significand or mantissa

assume leading 1 ("hidden one")

value = (-1)^s * (1.ff…f)_two * 2^(e - 127)   (normalized numbers; e is the stored 8-bit exponent)
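The three fields can be pulled out of a C `float` with shifts and masks. This is a minimal sketch that assumes `float` is the 32-bit IEEE 754 single-precision type (true on most current platforms, but not guaranteed by C); `Fields` and `split_float` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the s, e, and f fields. */
typedef struct {
    unsigned sign;      /* b31 */
    unsigned exponent;  /* b30..b23, still in excess-127 form */
    uint32_t fraction;  /* b22..b0, without the hidden one */
} Fields;

Fields split_float(float x) {
    uint32_t bits;
    Fields f;
    assert(sizeof x == sizeof bits);   /* assumption: 32-bit float */
    memcpy(&bits, &x, sizeof bits);    /* copy the raw bit pattern */
    f.sign = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFFu);
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For 1.0f the stored exponent comes out as 127 (that is, 0 + 127) and the fraction as 0, since the leading 1 is hidden.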

normalized magnitude range (approximate):

1.2 * 10^-38_ten to 3.4 * 10^+38_ten

overflow: exponent too large to fit

underflow: exponent too small to fit
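Both conditions are easy to provoke. The sketch below assumes IEEE 754 single precision with the default rounding mode; the two function names are hypothetical. `FLT_MAX`, `FLT_MIN`, and `FLT_EPSILON` come from the standard header `<float.h>`.

```c
#include <assert.h>
#include <float.h>

/* 1 if doubling x leaves the finite range (the result is an infinity). */
int overflows_when_doubled(float x) {
    volatile float r = x * 2.0f;   /* volatile: force rounding to float */
    return r > FLT_MAX || r < -FLT_MAX;
}

/* 1 if halving a nonzero x rounds all the way down to zero. */
int underflows_when_halved(float x) {
    volatile float r = x * 0.5f;
    return x != 0.0f && r == 0.0f;
}
```

Doubling `FLT_MAX` overflows. Halving `FLT_MIN` (2^-126) does not yet underflow to zero, because the result is still representable as a denormalized number, but halving `FLT_MIN * FLT_EPSILON` (2^-149, the smallest positive float) does.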

categories of floating point numbers

* normalized numbers: standard floating point numbers. Most bit patterns in IEEE 754 fall in this category.

* denormalized numbers: fewer bits of precision, and smaller (in magnitude) than normalized numbers.

* zero: a positive and negative representation of 0 (sign bit 0 or 1)

* infinity: also a positive and negative infinity

For example, 1.0/0.0 produces infinity.

* NaN: "not a number"; the result of an undefined operation such as sqrt(-4)
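Standard C can report these categories through the `fpclassify` macro in `<math.h>`, which calls denormalized numbers "subnormal". The sketch below assumes the platform follows IEEE 754 for division by zero; `divide` and `category_name` are hypothetical helpers, and NaN is produced here with 0.0/0.0 rather than sqrt(-4) only to keep the example free of library calls.

```c
#include <assert.h>
#include <math.h>

/* volatile operands keep the compiler from folding the division away. */
float divide(float a, float b) {
    volatile float x = a, y = b;
    return x / y;
}

/* Map the C classification onto the five category names used here. */
const char *category_name(float x) {
    switch (fpclassify(x)) {
    case FP_NAN:       return "NaN";
    case FP_INFINITE:  return "infinity";
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "denormalized";
    default:           return "normalized";   /* FP_NORMAL */
    }
}
```

With these helpers, `divide(1.0f, 0.0f)` is classified as infinity, `divide(0.0f, 0.0f)` as NaN, and `divide(0.0f, -1.0f)` as a zero whose sign bit is set.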

Category              Sign Bit  Exponent              Fraction
--------------------  --------  --------------------  ----------
Zero                  Anything  00000000              23 0's
Infinity              Anything  11111111              23 0's
NaN                   Anything  11111111              Not 23 0's
Denormalized numbers  Anything  00000000              Not 23 0's
Normalized numbers    Anything  Neither of the above  Anything
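The category rules translate almost directly into code: test the exponent field first, then the fraction. A minimal sketch, with the hypothetical names `Category` and `classify_bits`, operating on a raw 32-bit pattern:

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    CAT_ZERO, CAT_INFINITY, CAT_NAN, CAT_DENORMALIZED, CAT_NORMALIZED
} Category;

Category classify_bits(uint32_t bits) {
    uint32_t e = (bits >> 23) & 0xFFu;   /* 8 exponent bits */
    uint32_t f = bits & 0x7FFFFFu;       /* 23 fraction bits */
    /* The sign bit is never consulted: it can be anything. */
    if (e == 0x00u) return f == 0 ? CAT_ZERO : CAT_DENORMALIZED;
    if (e == 0xFFu) return f == 0 ? CAT_INFINITY : CAT_NAN;
    return CAT_NORMALIZED;               /* neither of the above */
}
```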

S Exp       Fraction                      Value                  Category
- --------- ----------------------------  ---------------------  ------------
0 1111 1111 1111 1111 1111 1111 1111 111  NaN                    NaN
0 1111 1111 0000 0000 0000 0000 0000 001  NaN                    NaN
0 1111 1111 0000 0000 0000 0000 0000 000  Infinity               Infinity
0 1111 1110 1111 1111 1111 1111 1111 111  1.11 . . . 1 x 2^127   Normalized
0 1111 1110 0000 0000 0000 0000 0000 000  2^127                  Normalized
0 1000 0000 0000 0000 0000 0000 0000 000  2^1                    Normalized
0 0111 1111 0000 0000 0000 0000 0000 000  2^0                    Normalized
0 0111 1110 0000 0000 0000 0000 0000 000  2^-1                   Normalized
0 0000 0001 0000 0000 0000 0000 0000 000  2^-126                 Normalized
0 0000 0000 1111 1111 1111 1111 1111 111  0.11 . . . 1 x 2^-126  Denormalized
0 0000 0000 1000 0000 0000 0000 0000 000  2^-127                 Denormalized
0 0000 0000 0000 0000 0000 0000 0000 001  2^-149                 Denormalized
0 0000 0000 0000 0000 0000 0000 0000 000  Zero                   Zero
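The values in this table can be recomputed from the fields. The sketch below is one way to do it, assuming a 32-bit IEEE 754 `float`; `two_to_the`, `decode`, and `float_from_bits` are hypothetical helpers. The arithmetic is done in `double`, which holds every single-precision value exactly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 2^n by repeated doubling or halving (exact for the range used here). */
double two_to_the(int n) {
    double r = 1.0;
    for (; n > 0; n--) r *= 2.0;
    for (; n < 0; n++) r *= 0.5;
    return r;
}

/* Value of a finite single-precision bit pattern, from its fields. */
double decode(uint32_t bits) {
    int negative = (bits >> 31) != 0;
    int e = (int)((bits >> 23) & 0xFFu);
    double frac = (double)(bits & 0x7FFFFFu) / 8388608.0;  /* 0.f = f / 2^23 */
    double mag;
    assert(e != 0xFF);   /* infinity and NaN have no finite value */
    if (e == 0)
        mag = frac * two_to_the(-126);            /* denormalized or zero */
    else
        mag = (1.0 + frac) * two_to_the(e - 127); /* hidden one, bias 127 */
    return negative ? -mag : mag;
}

/* Reinterpret the same pattern as a float, for comparison. */
float float_from_bits(uint32_t bits) {
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, the pattern 0 0000 0000 0000 0000 0000 0000 0000 001 is 0x00000001, and `decode(0x00000001u)` equals 2^-149.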

Denormalized numbers

Exponent 00000000:

No hidden 1.

b₂₂ … b₀: bits after the radix point.

Fix the exponent to -126.

(Why?)

Largest positive denormalized

S Exp Fraction

- --------- ----------------------------

0 0000 0000 1111 1111 1111 1111 1111 111

0.11 . . . 1 x 2^-126

23 bits of precision, since there are 23 1's after the radix point

Smallest positive NORMALIZED

S Exp Fraction

- --------- ----------------------------

0 0000 0001 0000 0000 0000 0000 0000 000

1.0 x 2^-126

Why 1.0?

choices for Denormalized:

0.11 . . . 1 x 2^-127 (exponent is 0 - 127)

0.11 . . . 1 x 2^-126 (exponent is 1 - 127)

Both choices are smaller than 1.0 x 2^-126, the smallest normalized

By picking -126 instead of -127, the gap between the largest denormalized number

and the smallest normalized number is smaller
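The size of the gap under each choice can be worked out numerically. This is only an illustrative calculation in `double` arithmetic (exact for these values); the helper names are hypothetical.

```c
#include <assert.h>

/* 2^n by repeated doubling or halving. */
double two_to_the(int n) {
    double r = 1.0;
    for (; n > 0; n--) r *= 2.0;
    for (; n < 0; n++) r *= 0.5;
    return r;
}

/* Largest denormalized value, 0.11...1_two (23 ones) * 2^exponent. */
double largest_denormalized(int exponent) {
    return (1.0 - two_to_the(-23)) * two_to_the(exponent);
}

/* Distance from that value up to the smallest normalized, 1.0 * 2^-126. */
double gap_to_smallest_normalized(int exponent) {
    return two_to_the(-126) - largest_denormalized(exponent);
}
```

With exponent -126 the gap is 2^-149, the same spacing as between neighboring denormalized numbers. With -127 it would be 2^-127 + 2^-150, roughly 2^22 (about four million) times larger.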

Converting from base 10 to normalized IEEE 754

Problem: convert 10.25 from base 10

(1) Convert whole number (to the left of the radix point) to base 2

10₁₀ is 1010₂

(2) Convert fraction (number to the right of the radix point) to base 2

.25₁₀ is .01₂

(0 * 0.5 + 1 * 0.25)

(3) Add:

1010 + 0.01 is 1010.01

(4) Binary scientific notation:

1010.01 * 2⁰, which is 1.01001 X 2³

(5) IEEE 754 single precision:

extend the fraction 01001 to 23 bits by appending 0's: 0100 1000 0000 0000 0000 000

(6) Convert the exponent 3 to excess-127 form:

Add 127 to 3 to get 130 and convert to binary.

Result: 1000 0010 (128 + 2)

(7) Combine results:

S Exp Fraction

- --------- ----------------------------

0 1000 0010 0100 1000 0000 0000 0000 000

Notice hidden "1" is not represented in the fraction.
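The steps above can be sketched in C and checked against the compiler's own encoding of 10.25f. The function `encode_normalized` is a hypothetical helper: it assumes the input falls in the normalized range and simply truncates any fraction bits beyond 23, whereas real hardware rounds. `float_bits` assumes a 32-bit IEEE 754 `float`.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

uint32_t encode_normalized(double x) {
    uint32_t sign = x < 0.0;
    double sig = sign ? -x : x;
    int exp = 0, biased;
    uint32_t fraction;
    assert(sig > 0.0 && sig <= FLT_MAX);
    /* Step 4: binary scientific notation, 1.ffff * 2^exp. */
    while (sig >= 2.0) { sig /= 2.0; exp++; }
    while (sig < 1.0)  { sig *= 2.0; exp--; }
    /* Step 6: excess-127 exponent; 0 and 255 are reserved. */
    biased = exp + 127;
    assert(biased >= 1 && biased <= 254);
    /* Step 5: drop the hidden 1, keep 23 fraction bits (truncating). */
    fraction = (uint32_t)((sig - 1.0) * 8388608.0);   /* * 2^23 */
    /* Step 7: combine sign, exponent, and fraction. */
    return (sign << 31) | ((uint32_t)biased << 23) | fraction;
}

/* Raw bit pattern of a float, for comparison. */
uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

`encode_normalized(10.25)` returns 0x41240000, which is the pattern 0 1000 0010 0100 1000 0000 0000 0000 000 written in hexadecimal, and agrees with `float_bits(10.25f)`.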

Converting from base 10 to IEEE 754 denormalized

Example: 1.1_two x 2^-128 to IEEE 754 single precision

Since -128 < -126, the number is too small to be normalized: denormalized number

- Shift the radix point two places left so the exponent is -126

0.011_two x 2^-126

- Exponent is 8 0's.

Why?

- Bits after the radix point: fraction.

- Sign bit: 0

S Exp Fraction

- --------- ----------------------------

0 0000 0000 0110 0000 0000 0000 0000 000
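The same procedure as a sketch: shift the radix point until the exponent is -126, then take the 23 bits after the radix point. `encode_denormalized` is a hypothetical helper for positive values only; it takes the significand (in [1, 2)) and the exponent separately and truncates bits that do not fit. `float_bits` assumes a 32-bit IEEE 754 `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode significand * 2^exponent, where exponent < -126. */
uint32_t encode_denormalized(double significand, int exponent) {
    int shift, i;
    double scaled = significand;
    assert(significand >= 1.0 && significand < 2.0);
    assert(exponent < -126 && exponent >= -149);
    shift = -126 - exponent;   /* places to move the radix point left */
    for (i = 0; i < shift; i++) scaled *= 0.5;   /* now 0.0...01fff */
    /* Sign bit 0, exponent field 0000 0000; the 23 bits after the
       radix point become the fraction (extra bits are truncated). */
    return (uint32_t)(scaled * 8388608.0);       /* * 2^23 */
}

/* Raw bit pattern of a float, for comparison. */
uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

For the example, 1.1_two is 1.5, so `encode_denormalized(1.5, -128)` returns 0x00300000, which is the pattern 0 0000 0000 0110 0000 0000 0000 0000 000.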

No unsigned float: always have a sign bit

Why Sign Bit, Exponent, then Fraction?

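Comparisons (<, >): with the sign first and the exponent ahead of the fraction, a larger exponent always gives a larger bit pattern, so nonnegative floats can be compared like integers.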
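The field order can be checked experimentally. For nonnegative floats the raw bit patterns, read as unsigned integers, are already in numeric order. The `order_key` function below is a well-known bit trick, not something taken from these notes, that extends the idea to negative values; it assumes a 32-bit IEEE 754 `float`, ignores NaN, and places -0 just below +0 even though the two compare equal as floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float. */
uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Map a float to an unsigned key with the same ordering.
   Negative: flip every bit, so larger magnitudes sort lower.
   Nonnegative: set the top bit, so they sort above all negatives. */
uint32_t order_key(float x) {
    uint32_t bits = float_bits(x);
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```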

Double precision:

sign: 1 bit

exponent: 11 bits

Why not 16?

fraction: 52 bits
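The double-precision fields can be extracted the same way as for single precision. This sketch assumes `double` is the 64-bit IEEE 754 format, whose exponent is stored in excess-1023 form (a fact from the standard, not stated in these notes); `DoubleFields` and `split_double` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, excess 1023 */
    uint64_t fraction;  /* 52 bits, hidden one not stored */
} DoubleFields;

DoubleFields split_double(double x) {
    uint64_t bits;
    DoubleFields d;
    assert(sizeof x == sizeof bits);   /* assumption: 64-bit double */
    memcpy(&bits, &x, sizeof bits);
    d.sign = (unsigned)(bits >> 63);
    d.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    d.fraction = bits & (((uint64_t)1 << 52) - 1);
    return d;
}
```

For 10.25 the stored exponent is 1026 (3 + 1023) and the fraction begins 01001, just as in the single-precision example.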

You should be able to do the following (after review and practice):

* Give the names of each of the five categories of floating point numbers in IEEE 754 single precision.

* Given a 32-bit string, determine which category the bitstring falls in.

* Given a normalized or denormalized number, write the number in canonical binary scientific notation (you can leave the exponent in base 10).

* Given a number in base 10 or canonical binary scientific notation, convert it to an IEEE 754 single precision floating point number.

* Know what bias is used for normalized numbers.

* Know what exponent is used for denormalized numbers.

* Know what the hidden 1 is.