Chapter 24 Information Gain and Mutual Information Flashcards
Information gain is calculated by ____.
Mutual information calculates the ____ and is the name given to information gain when applied to variable selection.
P 242
comparing the entropy of the dataset before and after a transformation.
statistical dependence between two variables.
How can entropy be used as a calculation of the purity of a dataset?
P 244
According to the entropy formula for binary classes: Entropy = −(p(0) × log2(p(0)) + p(1) × log2(p(1)))
Entropy reflects how balanced the distribution of classes is. An entropy of 0 bits indicates a dataset containing a single class; the maximum entropy (1 bit for two classes, more for additional classes) occurs when the classes are perfectly balanced, with values in between indicating intermediate levels of purity.
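As a minimal sketch of the formula above (the class proportions are invented for illustration, and a zero-probability term is treated as contributing 0 bits):

from math import log2

def binary_entropy(p0, p1):
    # Entropy = -(p(0) * log2(p(0)) + p(1) * log2(p(1)))
    # A term with probability 0 contributes 0 bits by convention.
    return -sum(p * log2(p) for p in (p0, p1) if p > 0)

print(binary_entropy(0.5, 0.5))  # 1.0 bit:  balanced classes, maximum entropy for 2 classes
print(binary_entropy(1.0, 0.0))  # 0.0 bits: a pure dataset containing only one class
print(binary_entropy(0.9, 0.1))  # ~0.47 bits: mostly one class, low surprise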
A smaller entropy suggests ____(less/more) purity or ____(more/less) surprise.
P 244
More, less
Information gain provides a way to use entropy to calculate how a change to the dataset impacts the purity of the dataset, e.g. the distribution of classes. True/False
P 244
True
What’s information gain?
P 244
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute.
What’s the Information Gain formula for dataset S and variable a?
P 244
IG(S, a) = H(S) − H(S|a)
Where IG(S, a) is the information gain for the dataset S given the variable a, H(S) is the entropy of the dataset before any change, and H(S|a) is the conditional entropy of the dataset given the variable a.
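A minimal sketch of this formula in Python, where H(S|a) is computed as the weighted average entropy of the groups produced by splitting on a (the tiny dataset of (a, class) pairs is invented for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows):
    # rows: list of (a_value, class_label) pairs
    labels = [label for _, label in rows]
    h_s = entropy(labels)              # H(S): entropy before the split
    h_s_given_a = 0.0                  # H(S|a): weighted entropy after the split
    for value in set(a for a, _ in rows):
        group = [label for a, label in rows if a == value]
        h_s_given_a += (len(group) / len(rows)) * entropy(group)
    return h_s - h_s_given_a

data = [('x', 0), ('x', 0), ('x', 1), ('y', 1), ('y', 1), ('y', 1)]
print(information_gain(data))  # ~0.459 bits: splitting on a makes the groups purer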
Maximizing the entropy is equivalent to maximizing the information gain. True/False
P 247
False. Minimizing the conditional entropy H(S|a) is equivalent to maximizing the information gain.
How is information gain used in decision trees?
P 247
The information gain is calculated for each variable in the dataset, and the variable with the largest information gain is selected to split the dataset. Generally, a larger gain indicates a smaller conditional entropy, or less surprise, according to IG(S, a) = H(S) − H(S|a).
The process is then repeated on each created group, excluding the variable that was already chosen. This stops once a desired depth of the decision tree is reached or no more splits are possible.
What matters for splitting is the purity of each group resulting from the split: if the information gain is large, the expected (conditional) entropy for the variable is small, which means each resulting split is purer on average. That is what we want from a good classifier: to separate the classes well so that each split is more pure, as in the sketch below.
Worked Example P 246
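A minimal sketch of choosing a split this way, with two hypothetical feature columns and invented labels:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    h_s = entropy(labels)
    h_s_given_a = sum(
        (column.count(v) / len(column))
        * entropy([y for x, y in zip(column, labels) if x == v])
        for v in set(column)
    )
    return h_s - h_s_given_a

# Two hypothetical feature columns and a class label column.
outlook = ['sunny', 'sunny', 'rain', 'rain', 'overcast', 'overcast']
windy   = [True,    False,   True,   False,  True,       False]
labels  = ['no',    'no',    'yes',  'yes',  'yes',      'yes']

gains = {'outlook': information_gain(outlook, labels),
         'windy':   information_gain(windy, labels)}
best = max(gains, key=gains.get)
print(gains, '-> split on', best)  # 'outlook' has the larger gain, so it is chosen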
What’s mutual information?
P 248
Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
A quantity called mutual information measures the amount of information one can obtain from one random variable given another.
It measures the average reduction in uncertainty about x that results from learning the value of y; or vice versa, the average amount of information that x conveys about y.
How is mutual information calculated?
P 248
I(X; Y) = H(X) − H(X|Y)
Where I(X; Y) is the mutual information for X and Y, H(X) is the entropy for X, and H(X|Y) is the conditional entropy for X given Y. The result has the units of bits.
The entropy of a variable is a measure of its expected surprise, or uncertainty. Subtracting the conditional entropy from it quantifies how much of that uncertainty (surprise, entropy) is explained by the other variable; hence the definition: mutual information measures the reduction in uncertainty for one variable given a known value of the other variable.
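A minimal sketch of this calculation from a small joint distribution (the 2x2 table p(x, y) and its marginals are invented for illustration):

from math import log2

p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.1, ('x1', 'y1'): 0.4}
p_x = {'x0': 0.5, 'x1': 0.5}   # marginal p(x)
p_y = {'y0': 0.5, 'y1': 0.5}   # marginal p(y)

h_x = -sum(p * log2(p) for p in p_x.values())   # H(X)

# H(X|Y) = -sum over (x, y) of p(x, y) * log2( p(x, y) / p(y) )
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items())

print(h_x - h_x_given_y)  # ~0.278 bits of uncertainty about X removed by knowing Y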
Why is mutual information symmetrical?
P 248
Mutual information is a measure of dependence, or mutual dependence, between two random variables. As such, the measure is symmetrical, meaning that I(X; Y) = I(Y; X).
The mutual information can be calculated using KL-Divergence. True/False
P 248
True.
The mutual information can also be calculated as the KL divergence between the joint probability distribution and the product of the marginal probabilities for each variable.
This can be stated formally as follows: I(X; Y) = KL(p(X, Y) || p(X) × p(Y))
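A minimal sketch of the same quantity via the KL divergence, using the same invented 2x2 joint distribution as in the previous sketch; it yields the identical value:

from math import log2

p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.1, ('x1', 'y1'): 0.4}
p_x = {'x0': 0.5, 'x1': 0.5}
p_y = {'y0': 0.5, 'y1': 0.5}

# KL(p || q) = sum of p_i * log2(p_i / q_i), with q the product of the marginals
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(mi)  # ~0.278 bits, matching H(X) - H(X|Y)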
Mutual information is always greater than or equal to ____; the larger the value, the stronger the relationship between the two variables. If the calculated result is zero, then the variables are ____.
P 248
Zero, Independent
Mutual Information and Information Gain are the same thing. True/False
P 248
True.
Mutual Information and Information Gain are the same thing, although the context or usage of the measure often gives rise to the different names.
Mutual information is sometimes used as a synonym for information gain.