Information Theory Flashcards
log(A) + log(B)
log(AB)
Entropy for N equiprobable events
Entropy = E[-log p(x)] = -sum over classes p(x_c) log p(x_c)
For N equiprobable events, p(x_c) = 1/N, so
entropy = log(N)
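A minimal sketch (assuming NumPy; the helper `entropy` and the example size C = 8 are illustrative, not part of the flashcards) checking that the entropy of C equiprobable events is log(C):

```python
import numpy as np

def entropy(p, base=2):
    """H(p) = -sum_c p_c * log(p_c); zero-probability classes are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

C = 8
uniform = np.full(C, 1.0 / C)        # equiprobable: p(x_c) = 1/C
print(entropy(uniform))              # 3.0 bits
print(np.log2(C))                    # log2(8) = 3.0, matches log(C)
```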
Is entropy ever negative?
No. Since 0 <= p(x) <= 1 for a discrete distribution, each term -p(x) log p(x) >= 0, so the sum is never negative.
When is entropy additive?
For independent events
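A small check of additivity (assumed NumPy sketch; the distributions px and py are made up for illustration): for independent X and Y, H(X, Y) = H(X) + H(Y).

```python
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

px = np.array([0.5, 0.5])
py = np.array([0.9, 0.1])
joint = np.outer(px, py)             # independence: p(x, y) = p(x) * p(y)

print(entropy(joint))                # H(X, Y)
print(entropy(px) + entropy(py))     # H(X) + H(Y), same value
```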
Interpretation of -log p_i
Number of bits required to represent the i-th symbol efficiently (e.g., a symbol with p_i = 1/8 needs -log2(1/8) = 3 bits)
Shannon bit
0 or 1, but not both simultaneously (as opposed to qubit)
1 bit = amount of info gained from observing one of two equiprobable outcomes
If we have a priori info about which event is more probable, the amount of info gained is < 1 bit
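A quick illustration (assumed NumPy sketch; the example probabilities are made up): a fair coin yields exactly 1 bit per flip, while a biased coin, whose likelier outcome we already suspect, yields less.

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))      # 1.0 bit (two equiprobable events)
print(entropy_bits([0.9, 0.1]))      # ~0.469 bits, i.e. < 1 bit
```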
Shannon Coding
Minimize the number of bits required to represent a message
Done by assigning shorter codes to symbols with higher probabilities
Ideal mean length = -sum_x p_x log_k(p_x), where k = arity of the code (e.g., 2 for binary) and p_x is the probability of symbol x
The code must be efficient but also decodable without ambiguity (unique, prefix-free codewords)
Here, -log_k(p_x) is the optimal code length (in base-k digits) for symbol x and p_x is the probability of x occurring
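To make this concrete, here is a sketch using a Huffman construction (standard library only; Huffman coding is used as a stand-in for "shorter codes for more probable symbols", and the symbol probabilities are made up). The mean code length is compared against the base-2 ideal -sum_x p_x log2(p_x).

```python
import heapq
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Build a binary Huffman code: repeatedly merge the two least probable nodes,
# prefixing '0'/'1' to the partial codes accumulated so far.
heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, codes1 = heapq.heappop(heap)
    p2, _, codes2 = heapq.heappop(heap)
    merged = {s: "0" + c for s, c in codes1.items()}
    merged.update({s: "1" + c for s, c in codes2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
codes = heap[0][2]

mean_len = sum(probs[s] * len(code) for s, code in codes.items())
ideal = -sum(p * math.log2(p) for p in probs.values())
print(codes)                 # more probable symbols get shorter codewords
print(mean_len, ideal)       # 1.75 == 1.75 for these dyadic probabilities
```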
Convert base of log
log_c(a) = log_b(a)/log_b(c)
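A one-off numeric check of the conversion rule (plain math module; the value a = 10 is arbitrary):

```python
import math

a = 10.0
print(math.log2(a))                  # log_2(10) directly
print(math.log(a) / math.log(2))     # log_e(10) / log_e(2), same value
```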
Interpretation of cross-entropy
- sum_x p_x log(q_x) where p is the true distribution and q is the predicted distribution
Avg bits required to transmit symbols from p when a code optimized for q is used instead
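A sketch of the cross-entropy computation (assumed NumPy; p and q below are illustrative distributions). Note H(p, q) >= H(p), with equality only when q = p.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p_x log2(q_x), skipping x with p_x = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.5, 0.25, 0.25])      # true distribution
q = np.array([0.8, 0.1, 0.1])        # predicted distribution
print(cross_entropy(p, p))           # 1.5 bits = H(p), the minimum
print(cross_entropy(p, q))           # ~1.822 bits, the cost of coding with q
```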
Interpretation of KL divergence
KLD(P||Q) = H(P, Q) - H(P)
i.e., cross-entropy - entropy (not symmetric)
i.e., the additional bits required to transmit p because we use the predicted distribution q instead of the true distribution p
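The same example distributions (assumed NumPy sketch) show KLD as the gap between cross-entropy and entropy:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kld(p, q):
    return cross_entropy(p, q) - entropy(p)     # H(P, Q) - H(P)

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])
print(kld(p, q))                     # ~0.322 extra bits; 0 only when q == p
```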
Is KLD a distance measure? Why/Why not?
No, since it is not symmetric (KLD(P||Q) != KLD(Q||P) in general) and does not satisfy the triangle inequality
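A quick asymmetry check (assumed NumPy sketch, same illustrative distributions as above):

```python
import numpy as np

def kld(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])
print(kld(p, q))                     # ~0.322
print(kld(q, p))                     # ~0.278, so KLD(P||Q) != KLD(Q||P)
```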
Mutual Information
I(X, Y) = H(X) + H(Y) - H(X, Y)
I(X, Y) = H(Y) - H(Y|X)
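Both identities can be checked on a small joint distribution (assumed NumPy sketch; the joint table is made up for illustration):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

joint = np.array([[0.3, 0.1],        # p(x, y): rows index x, columns index y
                  [0.2, 0.4]])
px = joint.sum(axis=1)
py = joint.sum(axis=0)

# I(X, Y) = H(X) + H(Y) - H(X, Y)
mi_1 = entropy(px) + entropy(py) - entropy(joint)

# I(X, Y) = H(Y) - H(Y|X), with H(Y|X) = sum_x p(x) * H(Y | X=x)
h_y_given_x = sum(px[i] * entropy(joint[i] / px[i]) for i in range(len(px)))
mi_2 = entropy(py) - h_y_given_x

print(mi_1, mi_2)                    # equal, ~0.125 bits for this joint
```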