Information theory and its uses: flashcards
To add:
- more examples of usage and practice problems
- usage in research and practice
- different ways to think about things
- K-L divergence
- continuous random variables
- how to calculate these things: MATLAB or other algorithms, complexity, approximations
- history of this kind of thing
- a distance metric based on KL divergence?
- Hamming distance, Dice's coefficient, Tversky index, Sørensen similarity index
Entropy of a discrete random variable X
The entropy H of a discrete random variable X is a measure of the uncertainty associated with the value of X: H(X) = -Σ_x p(x) log p(x).
I(x) = -log p(x) is the self-information, which is the entropy contribution of each individual message; H(X) is its expected value.
An important property of entropy is that it is maximized when all the messages in the message space are equiprobable, p(x) = 1/n (i.e., most unpredictable), in which case H(X) = log(n).
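As a quick check (a minimal sketch; the `entropy` helper and the example distributions are mine, not from the source), the following computes H(X) in bits and confirms that the uniform distribution attains the maximum log2(n):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of the discrete distribution p (a sequence of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # by convention 0 * log(0) = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.25, 0.25]))         # 1.5 bits
print(entropy([0.25] * 4))                # 2.0 bits = log2(4), the maximum for n = 4
```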
Binary entropy function
The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to logarithmic base 2: H_b(p) = -p log2(p) - (1-p) log2(1-p).
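A matching sketch of the binary entropy function (the function name and sample values are illustrative):

```python
import numpy as np

def binary_entropy(p):
    """H_b(p) = -p*log2(p) - (1-p)*log2(1-p), with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(binary_entropy(0.5))    # 1.0 bit: a fair coin is maximally uncertain
print(binary_entropy(0.1))    # ~0.47 bits: a heavily biased coin carries less uncertainty
```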
Joint entropy of two discrete variables X, Y
The joint entropy H(X,Y) = -Σ_{x,y} p(x,y) log p(x,y) is the entropy of the pair (X,Y); if X and Y are independent, it equals H(X) + H(Y).
Conditional entropy of X given Y
The conditional entropy (also known as the conditional uncertainty) of X given Y is the average, over Y, of the entropy of X given Y = y.
A basic property is that H(X|Y) = H(X,Y) - H(Y).
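A minimal sketch of this identity (the joint table below is made up for illustration; only NumPy is assumed):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of the probabilities in p (zeros ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y): rows index X, columns index Y.
pxy = np.array([[0.25, 0.25],
                [0.40, 0.10]])

H_XY = entropy(pxy)                  # joint entropy H(X, Y)
H_Y = entropy(pxy.sum(axis=0))       # marginal entropy H(Y)
H_X_given_Y = H_XY - H_Y             # conditional entropy via H(X|Y) = H(X,Y) - H(Y)
print(H_XY, H_Y, H_X_given_Y)
```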
Mutual information
Mutual information I(X;Y), also known as transinformation, is the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals.
Basic property: I(X;Y) = H(X) - H(X|Y)
That means that if we know Y, we save an average of I(X;Y) bits in encoding X compared to not knowing Y.
Mutual information can be expressed as the average KL divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X.
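Building on the same made-up joint table as above, a sketch that computes I(X;Y) both from the entropy identity and as the average KL divergence between the posterior p(x|y) and the prior p(x); names and numbers are illustrative:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.25, 0.25],        # hypothetical joint distribution p(x, y)
                [0.40, 0.10]])       # (no zero entries, so the logs below are safe)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# I(X;Y) from the entropy identity: H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
I_identity = entropy(px) + entropy(py) - entropy(pxy)

# I(X;Y) as the p(y)-weighted KL divergence between the posterior p(x|y) and the prior p(x)
I_kl = sum(py[j] * np.sum((pxy[:, j] / py[j]) * np.log2((pxy[:, j] / py[j]) / px))
           for j in range(len(py)))

print(I_identity, I_kl)              # the two computations agree
```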
Kullback-Leibler divergence
The KL divergence, also called information gain, information divergence, or relative entropy, is a way of comparing two distributions: a "true" probability distribution p(X) and an arbitrary probability distribution q(X). It is defined as D_KL(p || q) = Σ_x p(x) log(p(x)/q(x)). If we compress data in a way that assumes q(X) is the distribution underlying some data when, in reality, p(X) is the correct distribution, the KL divergence is the average number of additional bits per datum necessary for compression.
Although it is sometimes used as a "distance metric", KL divergence is not a true metric, because it is not symmetric and does not obey the triangle inequality.
Another interpretation of KL divergence is this: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
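A small sketch of D_KL(p || q) in bits (the `kl_divergence` helper and the two distributions are invented for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                         # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]                    # "true" distribution
q = [1/3, 1/3, 1/3]                      # distribution assumed when designing the code
print(kl_divergence(p, q))               # extra bits per datum paid for coding with q
print(kl_divergence(q, p))               # different value: KL divergence is not symmetric
```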
Information theory pie diagram showing the relationship between entropy, mutual information, conditional entropy, etc.
Variation of information
Many applications require a metric, that is, a distance measure between pairs of points. The quantity
d(X,Y) = H(X,Y) - I(X;Y) = H(X|Y) + H(Y|X)
satisfies the properties of a metric (triangle inequality, non-negativity, indiscernibility and symmetry). This distance metric is also known as the variation of information.
Normalized distance metric based on mutual information and entropy
D(X,Y) = d(X,Y)/H(X,Y) <= 1
The metric D is a universal metric, in that if any other distance measure places X and Y close by, then D will also judge them close.
Note also that D(X,Y) = 1 - I(X;Y)/H(X,Y), which makes it the information-theoretic analogue of the Jaccard distance between X and Y (mutual information playing the role of the intersection and joint entropy that of the union).
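Continuing the earlier made-up joint table, a sketch of the variation of information d and its normalized form D:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.25, 0.25],            # hypothetical joint distribution p(x, y)
                [0.40, 0.10]])
H_XY = entropy(pxy)
I_XY = entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - H_XY

d = H_XY - I_XY                          # variation of information
D = d / H_XY                             # normalized metric, always <= 1
print(d, D, 1 - I_XY / H_XY)             # the last two expressions coincide
```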
Jaccard index
The Jaccard similarity coefficient is a statistic for comparing the similarity and diversity of sample sets.
The Jaccard coefficient measures similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A,B) = |A ∩ B| / |A ∪ B|
Jaccard distance
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union: d_J(A,B) = 1 - J(A,B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|. This distance is a proper metric.
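A short sketch of both quantities on plain Python sets (the example sets are invented):

```python
def jaccard_index(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                       # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance = 1 - Jaccard similarity."""
    return 1.0 - jaccard_index(a, b)

A = {"cat", "dog", "fish"}
B = {"dog", "fish", "bird", "lizard"}
print(jaccard_index(A, B))               # 2 / 5 = 0.4
print(jaccard_distance(A, B))            # 0.6
```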
Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
Ex. Hamming distance between 1011101 and 1001001 is 2.
For a fixed length n, the Hamming distance is a metric on the vector space of the words of length n, as it obviously fulfills the conditions of non-negativity, identity of indiscernibles and symmetry, and it can be shown easily by complete induction that it satisfies the triangle inequality as well. The Hamming distance between two words a and b can also be seen as the Hamming weight of a−b for an appropriate choice of the − operator.
For binary strings a and b the Hamming distance is equal to the number of ones (population count) in a XOR b. The metric space of length-n binary strings, with the Hamming distance, is known as the Hamming cube; it is equivalent as a metric space to the set of distances between vertices in a hypercube graph. One can also view a binary string of length n as a vector in ℝ^n by treating each symbol in the string as a real coordinate; with this embedding, the strings form the vertices of an n-dimensional hypercube, and the Hamming distance between the strings is equivalent to the Manhattan distance between the vertices.
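A sketch of the Hamming distance on strings, plus the XOR population-count view for binary strings (helper name is illustrative):

```python
def hamming_distance(s, t):
    """Number of positions at which equal-length strings s and t differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("1011101", "1001001"))   # 2, as in the example above

# For binary strings viewed as integers, the distance is the popcount of the XOR.
a, b = 0b1011101, 0b1001001
print(bin(a ^ b).count("1"))                    # also 2
```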
Applications of mutual information
Source: http://en.wikipedia.org/wiki/Mutual_information
In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:
- In telecommunications, the channel capacity is equal to the mutual information, maximized over all input distributions.
- Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
- RNA secondary structure prediction from a multiple sequence alignment.
- Phylogenetic profiling prediction from the pairwise presence and disappearance of functionally linked genes.
- Mutual information has been used as a criterion for feature selection and feature transformations in machine learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection.
- Mutual information is used in determining the similarity of two different clusterings of a dataset (a small sketch follows this list). As such, it provides some advantages over the traditional Rand index.
- Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
- Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same coordinate system as the reference image, this image is deformed until the mutual information between it and the reference image is maximized.
- Detection of phase synchronization in time series analysis
- The infomax method for neural-network and other machine learning, including the infomax-based independent component analysis (ICA) algorithm.
- Average mutual information in delay embedding theorem is used for determining the embedding delay parameter.
- Mutual information between genes in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks.
- In statistical mechanics, Loschmidt’s paradox may be expressed in terms of mutual information.[4][5] Loschmidt noted that it must be impossible to determine a physical law which lacks time reversal symmetry (e.g. the second law of thermodynamics) only from physical laws which have this symmetry. He pointed out that the H-theorem of Boltzmann made the assumption that the velocities of particles in a gas were permanently uncorrelated, which removed the time symmetry inherent in the H-theorem. It can be shown that if a system is described by a probability density in phase space, then Liouville’s theorem implies that the joint information (negative of the joint entropy) of the distribution remains constant in time. The joint information is equal to the mutual information plus the sum of all the marginal information (negative of the marginal entropies) for each particle coordinate. Boltzmann’s assumption amounts to ignoring the mutual information in the calculation of entropy, which yields the thermodynamic entropy (divided by Boltzmann’s constant).
- The mutual information is used to learn the structure of Bayesian networks/dynamic Bayesian networks, which explain the causal relationship between random variables, as exemplified by the GlobalMIT toolkit [1]: learning the globally optimal dynamic Bayesian network with the Mutual Information Test criterion.
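As an illustration of the clustering-comparison item above (a minimal hand-rolled sketch with made-up labels, rather than any particular library's implementation):

```python
import numpy as np
from collections import Counter

def mutual_information(labels_u, labels_v):
    """Mutual information (bits) between two labelings of the same items."""
    n = len(labels_u)
    joint = Counter(zip(labels_u, labels_v))     # co-occurrence counts
    margin_u, margin_v = Counter(labels_u), Counter(labels_v)
    mi = 0.0
    for (u, v), c in joint.items():
        p_uv = c / n
        # compare the joint frequency with the product of the marginals
        mi += p_uv * np.log2(p_uv / ((margin_u[u] / n) * (margin_v[v] / n)))
    return mi

# Two hypothetical clusterings of the same six items
clustering_a = [0, 0, 0, 1, 1, 1]
clustering_b = [0, 0, 1, 1, 1, 1]
print(mutual_information(clustering_a, clustering_b))   # ~0.46 bits of agreement
```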