Shannon's information measures Flashcards
Shannon entropy
Entropy is the uncertainty in a random variable X
H(X) = -\sum_{i=1}^N p(x_i) log(p(x_i))
It is bounded by
0 ≤ H(X) ≤ log(N).
It is non-increasing under functions:
H(X) ≥ H(g(X)),
with equality iff g is invertible on the support of X. Furthermore, conditioning cannot increase entropy.
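A minimal Python sketch of the definition above; the function name `shannon_entropy` and the example pmf are my own, not from the source.

```python
import numpy as np

def shannon_entropy(p, base=2):
    """H(X) = -sum_i p(x_i) log p(x_i); zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

p = [0.5, 0.25, 0.125, 0.125]         # example pmf over N = 4 symbols
print(shannon_entropy(p))             # 1.75 bits
print(np.log2(4))                     # upper bound log N = 2 bits
```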
Rényi entropy
A generalization of Shannon entropy.
Let β ∈ [0, ∞]; the Rényi entropy of order β is
H_β(X) = 1/(1-β) log [ \sum_{i=1}^N p(x_i)^β ].
In the limit β → 1, Shannon entropy is recovered.
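A small sketch of the formula, assuming the β = 1 case is handled by falling back to the Shannon limit; the pmf is a made-up example.

```python
import numpy as np

def renyi_entropy(p, beta, base=2):
    """H_beta(X) = 1/(1-beta) * log sum_i p(x_i)^beta (Shannon entropy in the limit beta -> 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(beta, 1.0):         # beta -> 1 recovers Shannon entropy
        return -np.sum(p * np.log(p)) / np.log(base)
    return np.log(np.sum(p ** beta)) / ((1.0 - beta) * np.log(base))

p = [0.5, 0.25, 0.125, 0.125]
for beta in [0.5, 0.999, 1.0, 2.0]:
    print(beta, renyi_entropy(p, beta))   # beta near 1 approaches the Shannon value 1.75 bits
```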
Joint entropy
H(X, Y) = -\sum_{x,y} p(x,y) log( p(x,y) ) = -E[ log( p(x,y) ) ]
Useful relations:
H(X,Y) = H(X) + H(Y|X)
H(X,Y) = H(Y) + H(X|Y)
Chain rule:
H(X_1, X_2, …, X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, …, X_1)
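A numeric check of H(X,Y) = H(X) + H(Y|X) on a made-up 2×2 joint pmf; the helper `H` and the table values are illustrative only.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# made-up joint pmf p(x, y); rows index x, columns index y
pxy = np.array([[0.25, 0.25],
                [0.40, 0.10]])
px = pxy.sum(axis=1)
# H(Y|X) computed directly as -sum_{x,y} p(x,y) log p(y|x)
H_Y_given_X = -np.sum(pxy * np.log2(pxy / px[:, None]))
print(H(pxy), H(px) + H_Y_given_X)    # both sides of H(X,Y) = H(X) + H(Y|X), ≈ 1.861 bits
```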
Conditional entropy
H(X | Y) = -\sum_{x,y} p(x,y) log( p(x | y) ) = -E[ log( p(x | y) ) ]
Conditioning can never increase the entropy:
H(Y | X) ≤ H(Y),
with equality iff X and Y are independent.
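A quick spot-check of H(Y|X) ≤ H(Y) on random joint pmfs; the random tables are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# rows index x, columns index y; H(Y|X) = H(X,Y) - H(X)
for _ in range(5):
    pxy = rng.random((3, 4))
    pxy /= pxy.sum()
    H_Y = H(pxy.sum(axis=0))
    H_Y_given_X = H(pxy) - H(pxy.sum(axis=1))
    print(round(H_Y_given_X, 4), "<=", round(H_Y, 4))
```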
Differential entropy
It is the entropy of a continuous random variable X with density f(x) and support set S:
h(X) = - \int_S f(x) log( f(x) ) dx
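A sketch that approximates h(X) for a standard Gaussian by a Riemann sum and compares it with the closed form 0.5 log(2πe σ²); the grid limits and step size are arbitrary choices.

```python
import numpy as np

# numeric Riemann sum for h(X) = -∫ f log f dx, standard Gaussian density
sigma = 1.0
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h_numeric = -np.sum(f * np.log(f)) * dx          # nats
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)                       # both ≈ 1.4189 nats
```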
Joint and conditional differential entropy
The differential entropy of a set X_1, X_2, …, X_n of random variables with joint density f(x_1, x_2, …, x_n) is defined as
h(X_1, X_2, …, X_n) = −\int f(x^n) log f(x^n) dx^n.
If X, Y have a joint density function f(x, y), we can define the conditional differential entropy h(X|Y) as
h(X|Y) = −\int f(x, y) log f(x|y) dx dy = h(X, Y) − h(Y).
Mutual information and relative entropy for densities
The mutual information I(X;Y) between two random variables with joint density f(x, y) is defined as
I(X;Y) = \int f(x, y) log[ f(x, y) / (f(x) f(y)) ] dx dy.
From the definition it is clear that
I(X;Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(X, Y).
The relative entropy (or Kullback–Leibler distance) D(f||g) between two densities f and g is defined by
D(f||g) = \int f log(f/g).
Note that D(f||g) is finite only if the support set of f is contained in the support set of g.
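A Monte Carlo sketch of I(X;Y) = ∫ f(x,y) log[ f(x,y)/(f(x) f(y)) ] dx dy for a standard bivariate Gaussian with correlation ρ, where the closed form is −0.5 log(1 − ρ²) nats; the sample size and seed are arbitrary.

```python
import numpy as np

rho = 0.8
rng = np.random.default_rng(1)
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
x, y = xy[:, 0], xy[:, 1]

# log densities of the standard bivariate Gaussian and its standard normal marginals
log_f_xy = (-np.log(2 * np.pi * np.sqrt(1 - rho**2))
            - (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2)))
log_f_x = -0.5 * np.log(2 * np.pi) - x**2 / 2
log_f_y = -0.5 * np.log(2 * np.pi) - y**2 / 2

print(np.mean(log_f_xy - log_f_x - log_f_y))     # Monte Carlo estimate of I(X;Y)
print(-0.5 * np.log(1 - rho**2))                 # closed form ≈ 0.5108 nats
```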
Relation between entropy and differential entropy
If the density f(x) of the random variable X is Riemann integrable, then
H(X^Δ) + log Δ → h(f) = h(X), as Δ → 0.
Thus, the entropy of an n-bit quantization of a continuous random variable X is approximately h(X) + n bits.
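A sketch of the quantization limit for a standard Gaussian, using the Riemann-sum approximation p_i ≈ f(x_i) Δ for the bin probabilities; the bin range and the list of Δ values are arbitrary.

```python
import numpy as np

# H(X^Δ) + log Δ → h(X) for a standard Gaussian; h(X) = 0.5 log(2πe) ≈ 1.4189 nats
h_closed = 0.5 * np.log(2 * np.pi * np.e)

for delta in [1.0, 0.5, 0.1, 0.01]:
    centers = np.arange(-12, 12, delta) + delta / 2
    p = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi) * delta   # p_i ≈ f(x_i) Δ
    p = p / p.sum()                                            # renormalize the tiny tail error
    H_quant = -np.sum(p * np.log(p))
    print(delta, H_quant + np.log(delta), h_closed)
```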
Mutual information
It is the information about X provided by Y:
I(X;Y) = I(Y;X) = \sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ] = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) − H(Y|X)
It quantifies the reduction in uncertainty about X due to knowledge of Y.
Chain rule:
I(X_1, X_2, …, X_n ; Y) = \sum_{i=1}^n I(X_i ; Y | X_{i-1}, …, X_1)
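A numeric check that the sum form of I(X;Y) matches H(X) + H(Y) − H(X,Y) on a made-up joint pmf; the helper `H` and the table are illustrative only.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# toy joint pmf; rows index x, columns index y
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I_direct = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(I_direct, H(px) + H(py) - H(pxy))          # the two expressions agree
```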
Conditional mutual information
I(X;Y|Z) = \sum_{x,y,z} p(x,y,z) log[ p(x,y|z) / (p(x|z) * p(y|z) ) ]
I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z).
Kullback-Leibler divergence
Also called relative entropy. It is not a distance measure, since it is not symmetric. It is non negative.
It measures the discrepancy between p and q, i.e., the inefficiency of assuming the distribution is q when the true distribution is p:
D(p||q) = \sum_x p(x) log[ p(x)/q(x) ].
For continuity, 0 log(0) = 0.
For q(x) = 0, p(x) > 0, p(x) log(p(x)/0) = ∞.
We often require p and q to have the same support, since D(p||q) is infinite whenever q(x) = 0 for some x with p(x) > 0.
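A minimal sketch of D(p||q) with the conventions above; `kl_divergence` and the example distributions are my own, chosen so that D(p||q) is finite while D(q||p) is infinite.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log p(x)/q(x) in bits; infinite if q(x)=0 where p(x)>0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0                                  # convention: 0 log 0 = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))                        # 1.0 bit
print(kl_divergence(q, p))                        # inf: not symmetric, and q puts mass where p does not
```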
Directed information
I(X^k → Y^k) = \sum_{i=1}^k I(X^i ; Y_i | Y^{i-1})
Markov chains
Let X ↔ Y ↔ Z form a Markov chain, then
I(X;Z) ≤ I(X;Y).
The intuition is that processing can only decrease mutual information, so variables farther apart in the chain share less information.
However, conditioning can increase mutual information. Let Y be an encryption of X with a secret key Z (e.g., a one-time pad). Then I(X;Y) = 0 and I(X;Z) = 0, yet I(X;Y|Z) > 0, since given Z the plaintext X can be recovered from Y. There is no Markov chain here, so this does not contradict the data-processing inequality.
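A numeric version of the one-time-pad example above, assuming a single fair bit X and an independent fair key bit Z with Y = X ⊕ Z; it shows I(X;Y) = 0 while I(X;Y|Z) = 1 bit.

```python
import numpy as np

# joint pmf p(x, y, z) for Y = X XOR Z, with X and Z independent fair bits
p = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p[x, x ^ z, z] = 0.25

def H(pmf):
    pmf = pmf.ravel()
    pmf = pmf[pmf > 0]
    return -np.sum(pmf * np.log2(pmf))

p_xy = p.sum(axis=2)
I_XY = H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy)
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_XY_given_Z = H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p) - H(p.sum(axis=(0, 1)))
print(I_XY)            # 0.0: the ciphertext alone reveals nothing about X
print(I_XY_given_Z)    # 1.0: given the key, Y determines X completely
```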
What is information?
Information is a stimulus that has meaning.
Which distributions maximize the entropy?
For discrete sources, the uniform distribution maximizes entropy.
For continuous sources:
* Given the constraint of finite support: the uniform distribution maximizes differential entropy.
* Given the constraint of fixed finite variance: the Gaussian distribution maximizes differential entropy (the fixed-variance case is illustrated in the sketch below).
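A small check of the fixed-variance case: a Gaussian and a uniform distribution with the same variance, compared via their closed-form differential entropies (in nats).

```python
import numpy as np

# same variance sigma^2 = 1: the Gaussian has the larger differential entropy
sigma = 1.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # ≈ 1.4189 nats
a = np.sqrt(12) * sigma                               # Uniform(0, a) has variance a^2/12 = sigma^2
h_unif = np.log(a)                                    # ≈ 1.2425 nats
print(h_gauss, h_unif, h_gauss >= h_unif)
```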