Lecture 8 Flashcards
IEEE-754 Single Precision
32 bits - sign s (1) / exponent c (8) / significand f (23)
x = (-1)^s × 1.f × 2^m
c = m + 127, 1 ≤ c ≤ 254, m \in [L= -126, U=127] (c=255 / c=0 reserved)
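To make the field layout concrete, here is a minimal sketch that splits a number into s, c and f and rebuilds it from the formula above. The helper names decode_float32 and value_from_fields are made up for this card, and the rebuild assumes a normalized number (1 ≤ c ≤ 254).

```python
import struct

def decode_float32(x):
    """Round x to single precision and return its (s, c, f) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    c = (bits >> 23) & 0xFF   # 8 biased exponent bits
    f = bits & 0x7FFFFF       # 23 fraction bits
    return s, c, f

def value_from_fields(s, c, f):
    """Rebuild (-1)^s * 1.f * 2^m with m = c - 127 (normalized numbers only)."""
    m = c - 127
    return (-1) ** s * (1 + f / 2**23) * 2.0**m

s, c, f = decode_float32(-6.5)       # -6.5 = -1.101 (binary) x 2^2
print(s, c, f"{f:023b}")
print(value_from_fields(s, c, f))
```

For -6.5 the fields should come out as s = 1, c = 129 (so m = 2) and f = 101 followed by twenty zeros.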
IEEE-754 Single Precision Zero
c = (00000000) f = (000000...00)
IEEE-754 Single Precision Subnormal numbers
c = (00000000) but f≠0
Set m = L = -126 (NOT -127!) and the leading digit to 0.
x = ± 0.f × 2^{-126}
IEEE-754 Single Precision Infinity
c=(11111111), f=(00…0)
IEEE-754 Single Precision NaN
c=(11111111), f≠0
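The reserved patterns (zero, subnormal, infinity, NaN) can be told apart from c and f alone. The sketch below is one possible way to do it; classify_float32 and bits_to_float are illustrative names, not part of the lecture.

```python
import struct

def classify_float32(bits):
    """Classify a 32-bit pattern from its exponent field c and fraction field f."""
    c = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if c == 0:
        return "zero" if f == 0 else "subnormal"
    if c == 0xFF:
        return "infinity" if f == 0 else "NaN"
    return "normal"

def bits_to_float(bits):
    """Reinterpret a 32-bit integer as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for bits in (0x00000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{bits:08x}", classify_float32(bits), bits_to_float(bits))

# Smallest subnormal: 0.00...01 (binary) x 2^-126 = 2^-23 x 2^-126 = 2^-149
print(bits_to_float(0x00000001) == 2.0**-149)
```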
IEEE-754 Double Precision
64 bits, sign s (1), exponent c (11), significand f (52)
c = m + 1023, 1 ≤ c ≤ 2046, m \in [-1022, 1023] (c=2047 / c=0 reserved)
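A quick check of the exponent range, assuming Python floats are IEEE-754 doubles (true on common platforms): the smallest and largest normalized doubles should sit at c = 1 and c = 2046. decode_float64 is an illustrative helper name.

```python
import struct
import sys

def decode_float64(x):
    """Return the (s, c, f) bit fields of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                # 1 sign bit
    c = (bits >> 52) & 0x7FF      # 11 biased exponent bits
    f = bits & ((1 << 52) - 1)    # 52 fraction bits
    return s, c, f

for x in (1.0, sys.float_info.min, sys.float_info.max):
    s, c, f = decode_float64(x)
    print(x, "c =", c, "m =", c - 1023)
```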
IEEE-754 Rounding
Round toward zero (truncate): x_- = ± 1.b1b2…bn × 2^m
Round away from zero, toward ±∞ (add ϵm × 2^m to the magnitude): x_+ = ± (1.b1b2…bn × 2^m + 0.00…01 × 2^m)
IEEE-754 Round up/down
Round up (ceil) toward ∞: x_+ if x positive, x_- if x negative
Round down (floor) toward -∞: x_- if x positive, x_+ if x negative
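The two candidates x_- and x_+ and the up/down rules can be sketched for a toy format with n fraction bits. The function names are invented for this card, x is assumed nonzero and finite, and a value that already fits in n bits is returned unchanged (a case the card does not spell out).

```python
import math

def rounding_candidates(x, n):
    """Return (x_minus, x_plus) for x kept to n fraction bits.

    x_minus truncates toward zero; x_plus is 2**(m - n) further from zero.
    """
    _, e = math.frexp(x)              # x = mant * 2**e with 0.5 <= |mant| < 1
    m = e - 1                         # so |x| = 1.b1b2... * 2**m
    sign = math.copysign(1.0, x)
    kept = math.floor(math.ldexp(abs(x), n - m))   # integer with bits 1b1b2...bn
    return sign * math.ldexp(kept, m - n), sign * math.ldexp(kept + 1, m - n)

def round_up(x, n):
    """Round toward +infinity."""
    x_minus, x_plus = rounding_candidates(x, n)
    if x_minus == x:                  # already representable, nothing to round
        return x
    return x_plus if x > 0 else x_minus

def round_down(x, n):
    """Round toward -infinity."""
    x_minus, x_plus = rounding_candidates(x, n)
    if x_minus == x:
        return x
    return x_minus if x > 0 else x_plus

print(rounding_candidates(0.1, 4))            # 0.1 = 1.1001100... (binary) x 2^-4
print(round_up(0.1, 4), round_down(0.1, 4))
print(round_up(-0.1, 4), round_down(-0.1, 4))
```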
IEEE-754 Rounding Absolute/relative error
err_abs = |~x - x| ≤ |x_+ - x_-| = ϵm × 2^m
err_rel = |~x - x| / |x| ≤ ϵm
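As a sanity check of both bounds, the sketch below rounds a double to single precision and measures the errors. It treats the double value of 0.1 as the "exact" x, which is an assumption for illustration; to_float32 is a made-up helper name.

```python
import struct

EPS_SINGLE = 2.0**-23

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

x = 0.1                      # 0.1 = 1.6 x 2^-4, so m = -4
x_tilde = to_float32(x)
err_abs = abs(x_tilde - x)
err_rel = err_abs / abs(x)

print(err_abs, "<=", EPS_SINGLE * 2.0**-4)
print(err_rel, "<=", EPS_SINGLE)
```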
IEEE Single precision, find the smallest ⍺ s.t. 2^8 + ⍺ ≠ 2^8.
x_+ = 2^8 + 2^8 × ϵm, so ⍺ = 2^8 × 2^{-23} = 2^{-15}
The gap between x and the next representable number can be estimated as |x| × ϵm
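The gap at 2^8 can be read off directly by stepping to the next single-precision bit pattern. This is a sketch that assumes x is positive and finite; next_float32 is an illustrative name. For a power of two the estimate x × ϵm is exact; for other x it is an upper bound within a factor of 2.

```python
import struct

def next_float32(x):
    """Next single-precision number above a positive, finite float32 value x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

x = 2.0**8
gap = next_float32(x) - x
print(gap == 2.0**-15)         # True
print(gap == x * 2.0**-23)     # True
```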
IEEE Single/double precision Machine epsilon
single: ϵm = 2^{-23} ≈ 1.2 × 10^{-7}
double: ϵm = 2^{-52} ≈ 2.2 × 10^{-16}
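A short check of the double-precision value, assuming Python floats are IEEE-754 doubles with the default round-to-nearest mode:

```python
import sys

eps_double = sys.float_info.epsilon
print(eps_double == 2.0**-52)        # True
print(2.0**-23, 2.0**-52)            # decimal values of both epsilons

# eps is the gap between 1.0 and the next double; half of it is rounded away.
print(1.0 + eps_double > 1.0)        # True
print(1.0 + eps_double / 2 == 1.0)   # True
```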
a, b = 1e5, 1.0
while a + b > a:
    b = b / 2
For which b will it stop?
It stops when a + b = a in floating point, i.e. when b ≈ a × ϵm ≈ 10^5 × 10^{-16} = 10^{-11} (double precision)
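Running the loop confirms the order of magnitude. The sketch assumes IEEE-754 doubles with round-to-nearest, ties-to-even; under that assumption the loop ends at b = 2^{-37} ≈ 7.3 × 10^{-12}, half the gap at a, which agrees with the 10^{-11} estimate up to a small constant factor. The function name is illustrative.

```python
def halve_until_absorbed(a, b=1.0):
    """Halve b until adding it to a no longer changes a."""
    while a + b > a:
        b = b / 2
    return b

b_stop = halve_until_absorbed(1e5)
print(b_stop)
print(b_stop == 2.0**-37)    # True: half of the gap 2^-36 at a = 1e5
```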
Catastrophic Cancellation
c = a - b when a≃b
a = 1.1011 ×2^1
b = 1.1010 ×2^1
c = a - b = 0.0001 × 2^1
Normalization: c = 1.???? × 2^{-3} (the digits after the leading 1 are unknown)
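The same subtraction in code, followed by a made-up pair of "exact" inputs to show how the relative error grows. The values 3.40 and 3.30 are hypothetical; they were picked only because they truncate to the a and b on this card.

```python
import math

a = 0b11011 / 2**4 * 2**1    # 1.1011 (binary) x 2^1 = 3.375
b = 0b11010 / 2**4 * 2**1    # 1.1010 (binary) x 2^1 = 3.25
c = a - b                    # 0.0001 (binary) x 2^1 = 0.125

mant, e = math.frexp(c)      # c = mant * 2**e with 0.5 <= mant < 1
print(c, mant * 2, e - 1)    # 0.125 1.0 -3, i.e. c = 1.0 x 2^-3

# Hypothetical exact inputs that truncate to a and b (illustration only).
a_exact, b_exact = 3.40, 3.30
rel_in = max(abs(a - a_exact) / a_exact, abs(b - b_exact) / b_exact)
rel_out = abs(c - (a_exact - b_exact)) / (a_exact - b_exact)
print(rel_in, rel_out)       # input error under 2%, output error about 25%
```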
Cancellation
c = a + b with |a| ≪ |b| or |b| ≪ |a|: the low-order digits of the smaller operand are lost
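A small double-precision illustration, assuming round-to-nearest: near 10^16 the gap between doubles is 2, so adding 1.0 changes nothing, while adding the small terms together first preserves them.

```python
big, small = 1e16, 1.0

print(big + small == big)            # True: small is lost entirely

total = big
for _ in range(10):
    total += small                   # each 1.0 is lost one at a time
print(total == big)                  # True

small_first = sum([small] * 10) + big
print(small_first - big)             # 10.0: summing the small terms first keeps them
```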
x = 0.3721448693 and y = 0.3720214371, compute (x-y) using 5 decimal digits of accuracy. Relative error due to rounding vs. relative error due to subtraction?
Rounding: 1.3 × 10^{-5}
Subtraction: 3 × 10^{-2}
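Both figures can be reproduced with decimal arithmetic. The sketch reads "5 decimal digits" as rounding to 5 decimal places, which for these inputs is the same as 5 significant digits; keep5 is an illustrative helper name.

```python
from decimal import Decimal, ROUND_HALF_EVEN

x = Decimal("0.3721448693")
y = Decimal("0.3720214371")

def keep5(v):
    """Round v to 5 decimal places."""
    return v.quantize(Decimal("0.00001"), rounding=ROUND_HALF_EVEN)

x5, y5 = keep5(x), keep5(y)          # 0.37214 and 0.37202
rel_round = abs(x - x5) / x          # relative error from rounding x

exact_diff = x - y                   # 0.0001234322
approx_diff = x5 - y5                # 0.00012
rel_sub = abs(exact_diff - approx_diff) / exact_diff

print(f"{float(rel_round):.1e}", f"{float(rel_sub):.1e}")
```

The subtraction error comes out near 2.8 × 10^{-2}, which the card rounds to 3 × 10^{-2}: about three orders of magnitude larger than the rounding error of the inputs.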