Lecture 7 Flashcards
Floating point numbers + precision
x = ± b0.b1b2…bn x 2^m, L≤m≤U, bi ∈ {0,1}
Precision p=n+1
Normalized floating point representation
x = ± 1.b1b2…bn x 2^m = ± 1.f x 2^m, L≤m≤U, bi ∈ {0,1}
Hidden bit representation: b0 is always 1, so we don't store it, which gains 1 bit of precision.
Normalized floating point representation of 47.125
(101111.001)_2 = (1.01111001)_2 x 2^5
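The normalization above can be checked with a minimal Python sketch. The helper names `normalize` and `fraction_bits` are made up for illustration; `math.frexp` is the standard-library call that splits a float into a fraction in [0.5, 1) and a power of 2.

```python
import math

def normalize(x):
    """Return (significand, m) with x = significand * 2**m and 1 <= |significand| < 2."""
    if x == 0:
        raise ValueError("zero has no normalized representation")
    frac, exp = math.frexp(x)      # x = frac * 2**exp with 0.5 <= |frac| < 1
    return frac * 2, exp - 1       # shift one binary place to get the 1.f form

def fraction_bits(significand, n):
    """Return the first n binary digits of f, where significand = 1.f."""
    f = abs(significand) - 1
    bits = []
    for _ in range(n):
        f *= 2
        bit = int(f)               # next binary digit of f
        bits.append(str(bit))
        f -= bit
    return "".join(bits)

significand, m = normalize(47.125)
bits = fraction_bits(significand, 8)
print(f"1.{bits} x 2^{m}")
```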
Smallest positive normalized FP number
1.000…0 x 2^L = 2^L (UFL)
Largest positive normalized FP number
S = 1.111…1 x 2^U = 2^U+2^{U-1}+…+2^{U-n}
2S = 2^{U+1}+2^{U}+…+2^{U-n+1}
2S - S = S = 2^{U+1}-2^{U-n} = 2^{U+1}(1-2^{-p}) (OFL)
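As a sanity check of the derivation, here is a small sketch that assumes IEEE double precision (n = 52, U = 1023), which the card itself does not specify. It uses exact integer and `Fraction` arithmetic so the check does not itself overflow.

```python
import sys
from fractions import Fraction

# Assumption: Python floats are IEEE doubles, so n = 52 and U = 1023.
n = sys.float_info.mant_dig - 1    # digits of the fractional part f
U = sys.float_info.max_exp - 1     # largest exponent
p = n + 1                          # precision

geometric_sum = sum(2**(U - k) for k in range(n + 1))    # 2^U + 2^(U-1) + ... + 2^(U-n)
closed_form = 2**(U + 1) - 2**(U - n)                    # 2^(U+1) - 2^(U-n)
factored = Fraction(2)**(U + 1) * (1 - Fraction(1, 2**p))  # 2^(U+1) (1 - 2^(-p))
ofl = int(sys.float_info.max)      # largest finite double, converted exactly

print(geometric_sum == closed_form == factored == ofl)
```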
Overflow
To -∞ or +∞ if number < -2^{U+1}(1-2^{-p}) or > 2^{U+1}(1-2^{-p})
Underflow
To zero if -2^L < x < 2^L, i.e. |x| < UFL (without subnormal numbers)
Machine epsilon
Distance/gap between 1 and the next larger floating point number; depends only on n (# digits of the fractional part f). ϵm = 0.00…01 x 2^0 = 2^{-n}
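A common way to find machine epsilon experimentally is to halve a candidate until adding half of it to 1 no longer changes the result. The sketch below assumes Python floats are IEEE doubles with round-to-nearest, so the loop should stop at n = 52.

```python
import sys

eps = 1.0
n = 0
# Keep halving while 1 + eps/2 is still distinguishable from 1.
while 1.0 + eps / 2 > 1.0:
    eps /= 2
    n += 1

print(n, eps, eps == sys.float_info.epsilon)
```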
Subnormal/denormalized FP representation
We set b0=0 and m=L. This gives a more gradual underflow, at the cost of reduced precision and slower computation.
Subnormal/denormalized FP representation additional numbers
2(2^n - 1), where n is the # digits of f: there are 2^n - 1 nonzero values of f, x2 for positive + negative
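The count can be confirmed by brute-force enumeration on a toy system. The parameters n = 3 and L = -2 and the helper `subnormals` are hypothetical, chosen only to keep the list short.

```python
from fractions import Fraction

def subnormals(n, L):
    """Enumerate all values ±0.f x 2^L with an n-digit, nonzero fraction f."""
    positives = [Fraction(k, 2**n) * Fraction(2)**L for k in range(1, 2**n)]
    return sorted([-x for x in positives] + positives)

n, L = 3, -2                       # toy parameters, not from the lecture
values = subnormals(n, L)
print(len(values), 2 * (2**n - 1))
```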
Smallest positive subnormal number
0.00…01 x 2^L = 2^{-n}2^L = 2^{L-n}
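For a concrete instance, the sketch below again assumes IEEE doubles (n = 52, L = -1022), which is an assumption rather than something stated on the card. It builds 2^L and 2^{L-n} with `math.ldexp` and shows that halving the smallest subnormal underflows to zero.

```python
import math
import sys

# Assumption: Python floats are IEEE doubles, so n = 52 and L = -1022.
n = sys.float_info.mant_dig - 1
L = sys.float_info.min_exp - 1

ufl = math.ldexp(1.0, L)                      # smallest positive normalized number, 2^L
smallest_subnormal = math.ldexp(1.0, L - n)   # 2^(L-n)

print(ufl == sys.float_info.min)              # 2^L matches the library's value
print(smallest_subnormal > 0.0)               # still representable
print(smallest_subnormal / 2 == 0.0)          # anything smaller underflows to zero
```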