Chapter 24 Information Gain and Mutual Information Flashcards

1
Q

Information gain is calculated by ____.
Mutual information calculates the ____ and is the name given to information gain when applied to variable selection.

P 242

A

comparing the entropy of the dataset before and after a transformation.

statistical dependence between two variables.

2
Q

How can entropy be used as a calculation of the purity of a dataset?

P 244

A

According to the entropy formula for binary classes:
Entropy = −(p(0) × log2(p(0)) + p(1) × log2(p(1)))
Entropy measures how balanced the distribution of classes is. An entropy of 0 bits indicates a dataset containing a single class; the maximum entropy (1 bit for two balanced classes, and more for more classes) indicates a perfectly balanced distribution, with values in between indicating intermediate levels of purity.
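A minimal Python sketch of this calculation (assuming base-2 logarithms so the result is in bits; the class proportions are made up purely for illustration):

from math import log2

def binary_entropy(p0, p1):
    # Entropy of a two-class distribution in bits; treat 0 * log2(0) as 0
    # so a pure dataset does not raise a math error.
    return -sum(p * log2(p) for p in (p0, p1) if p > 0)

print(binary_entropy(1.0, 0.0))  # pure dataset -> 0.0 bits
print(binary_entropy(0.5, 0.5))  # balanced dataset -> 1.0 bit (the maximum for two classes)
print(binary_entropy(0.8, 0.2))  # skewed dataset -> about 0.72 bits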

3
Q

A smaller entropy suggests ____(less/more) purity or ____(more/less) surprise.

P 244

A

More, less

4
Q

Information gain provides a way to use entropy to calculate how a change to the dataset impacts the purity of the dataset, i.e. the distribution of classes. True/False

P 244

A

True

5
Q

What’s information gain?

P 244

A

Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute.

6
Q

What’s the Information Gain formula for dataset S and variable a?

P 244

A

IG(S, a) = H(S) − H(S|a)
Where IG(S, a) is the information gain for the dataset S split by the variable a, H(S) is the entropy of the dataset before any change, and H(S|a) is the conditional entropy of the dataset given the variable a.
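A brief Python sketch of this formula for a small, made-up dataset of class labels split by one categorical variable (base-2 logs, so the gain is in bits):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # IG(S, a) = H(S) - H(S|a): dataset entropy minus the weighted average
    # entropy of the groups created by splitting on the feature.
    n = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    h_s_given_a = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - h_s_given_a

labels  = [0, 0, 0, 0, 1, 1, 1, 1]
feature = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']
print(information_gain(labels, feature))  # about 0.55 bits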

7
Q

Maximizing the entropy is equivalent to maximizing the information gain. True/False

P 247

A

False. Minimizing the entropy is equivalent to maximizing the information gain.

8
Q

How is information gain used in decision trees?

P 247

A

The information gain is calculated for each variable in the dataset, and the variable with the largest information gain is selected to split the dataset. Generally, a larger gain indicates a smaller conditional entropy, or less surprise, after the split (following IG(S, a) = H(S) − H(S|a)).
The process is then repeated on each created group, excluding the variable that was already chosen. It stops once a desired depth of the decision tree is reached or no more splits are possible.

What matters for splitting is the purity of each group that results from the split. If the information gain is large, the expected entropy given the variable is small, which means each resulting group is purer on average. That is exactly what we want from a good classifier: to separate the classes well, so that each split is as pure as possible. Worked Example P 246
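As a hedged illustration of the same idea in practice, scikit-learn's DecisionTreeClassifier can be configured with criterion='entropy' so that splits are chosen by information gain; the tiny dataset below is invented purely for demonstration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Two binary features; the class follows feature_0, so the tree should split on it first.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=['feature_0', 'feature_1']))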

9
Q

What’s mutual information?

P 248

A

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

A quantity called mutual information measures the amount of information one can obtain from one random variable given another.

It measures the average reduction in uncertainty about x that results from learning the value of y; or vice versa, the average amount of information that x conveys about y.

10
Q

How is mutual information calculated?

P 248

A

I(X; Y) = H(X) − H(X|Y)
Where I(X; Y) is the mutual information for X and Y, H(X) is the entropy for X, and H(X|Y) is the conditional entropy for X given Y. The result has the units of bits.

The entropy of a variable measures the expected surprise, or uncertainty, in that variable. Subtracting the conditional entropy from it quantifies how much of that uncertainty (surprise, entropy) is explained by the other variable, hence the definition: mutual information measures the reduction in uncertainty for one variable given a known value of the other variable.
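A minimal Python sketch of I(X; Y) = H(X) − H(X|Y) for two discrete variables (base-2 logs, so the result is in bits; the paired samples are made up for illustration):

from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    # I(X; Y) = H(X) - H(X|Y): entropy of X minus the weighted average
    # entropy of X within each group that shares a value of Y.
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    h_x_given_y = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(x) - h_x_given_y

x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]  # mostly tracks x, so some but not all uncertainty is removed
print(mutual_information(x, y))  # about 0.19 bits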

11
Q

Why is mutual information symmetrical?

P 248

A

Mutual information is a measure of dependence or mutual dependence between two random variables. As such, the measure is symmetrical, meaning that I(X; Y) = I(Y; X).
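A quick, hedged check of the symmetry using scikit-learn's mutual_info_score (which reports mutual information in nats rather than bits; the labels are illustrative):

from sklearn.metrics import mutual_info_score

x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]

print(mutual_info_score(x, y))  # I(X; Y)
print(mutual_info_score(y, x))  # I(Y; X) -- identical, since the measure is symmetric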

12
Q

The mutual information can be calculated using the KL divergence. True/False

P 248

A

True.
The mutual information can also be calculated as the KL divergence between the joint probability distribution and the product of the marginal probabilities for each variable.
This can be stated formally as follows:
I(X; Y) = KL(p(X, Y) || p(X) × p(Y))
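A hedged Python sketch of this KL-divergence formulation, estimating the joint and marginal distributions from made-up paired samples; it agrees with the H(X) − H(X|Y) form on the same data:

from collections import Counter
from math import log2

x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]
n = len(x)

joint = Counter(zip(x, y))   # empirical joint distribution p(x, y)
p_x = Counter(x)             # marginal counts for x
p_y = Counter(y)             # marginal counts for y

# KL divergence between the joint distribution and the product of the marginals.
mi = sum((c / n) * log2((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
         for (a, b), c in joint.items())
print(mi)  # about 0.19 bits, matching H(X) - H(X|Y) on the same data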

13
Q

Mutual information is always larger than or equal to ____, where the larger the value, the greater the relationship between the two variables. If the calculated result is zero, then the variables are ____.

P 248

A

Zero, Independent

14
Q

Mutual Information and Information Gain are the same thing. True/False

P 248

A

True.
Mutual Information and Information Gain are the same thing, although the context or usage of the measure often gives rise to the different names.

Mutual information is sometimes used as a synonym for information gain.
