Chapter 23 Cross-Entropy for Machine Learning Flashcards

1
Q

What’s cross-entropy? How is it used in ML?

P 225

A

Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating
the difference between two probability distributions. Cross-entropy is commonly used in machine learning as a loss function.

Specifically, it builds upon the idea of entropy from information theory and calculates the average number of bits required to represent or transmit an event from one distribution compared to the other distribution.

2
Q

Cross-entropy is closely related to, but different from, KL divergence: KL divergence calculates the ____ between two probability distributions,
whereas cross-entropy can be thought of as calculating the ____ between the distributions.

P 225

A

relative entropy, total entropy

3
Q

What’s the intuition behind cross-entropy?

P 225

A

The intuition for this definition: if we consider a target (underlying) probability distribution P and an approximation of that target distribution Q, then the cross-entropy of Q from P is the average number of total bits needed to represent an event using Q instead of P. The cross-entropy between two probability distributions, such as Q from P, can be stated formally as:

H(P, Q) = −Σ P(x) × log(Q(x)) (the sum running over all events x ∈ X)
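
A minimal Python sketch of this formula (not from the book; P and Q below are made-up example distributions), using log base 2 so the result is in bits:

from math import log2

def cross_entropy(p, q):
    # Average total bits needed to encode events drawn from p
    # using a code that was built for q.
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

# Hypothetical discrete distributions over three events.
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))  # cross-entropy of Q from P, in bits
print(cross_entropy(P, P))  # equals the entropy H(P) when the two distributions match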

4
Q

The result of cross-entropy will be a ____ (positive/negative) number measured in bits, and will be equal to the entropy of the distribution if ____.

P 226

A

positive, the two probability distributions are identical (refer to the formula)

5
Q

Define:

- Cross-Entropy H(P, Q):
- Relative Entropy (KL Divergence) KL(P||Q):

P 227

A

- Cross-Entropy H(P, Q): average (expected) number of total bits to represent an event from Q instead of P.
- Relative Entropy (KL Divergence) KL(P||Q): average (expected) number of extra bits to represent an event from Q instead of P.

The “average” comes from the formula, which has the same shape as an expected value (a sum of p(x) × something); the “something” is what differs between cross-entropy and KL divergence.

Me: KL divergence is the expected value of the additional information (bits) when using Q instead of P.
Cross-entropy is the expected value of the total information (bits) needed to represent the events using Q instead of P; this is why, if Q and P are identical, cross-entropy = entropy.
Me: information is the number of bits needed to represent an event, in other words −log2(probability of the event).
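
A minimal Python sketch (not from the book; P and Q are made-up example distributions) that ties the three quantities together in bits, i.e. H(P, Q) = H(P) + KL(P||Q):

from math import log2

def entropy(p):
    # Average bits to encode events from p with a code built for p.
    return -sum(p_i * log2(p_i) for p_i in p)

def cross_entropy(p, q):
    # Average total bits to encode events from p with a code built for q.
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

def kl_divergence(p, q):
    # Average extra bits paid for using q's code instead of p's.
    return sum(p_i * log2(p_i / q_i) for p_i, q_i in zip(p, q))

# Hypothetical distributions over three events.
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))               # total bits using Q instead of P
print(entropy(P) + kl_divergence(P, Q))  # same number: H(P) plus the extra bits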

6
Q

Cross-entropy is not symmetrical, meaning that H(P, Q) != H(Q, P). True/False

P 227

A

True
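
A quick Python check (not from the book; P and Q are made-up example distributions) showing the asymmetry:

from math import log2

def cross_entropy(p, q):
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

# Hypothetical distributions over three events.
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))  # bits to encode events from P using Q's code
print(cross_entropy(Q, P))  # bits to encode events from Q using P's code: a different number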

7
Q

What’s the benefit of using cross-entropy instead of sum of squares for classification problems?

P 232

A

It leads to faster training as well as improved generalization.

8
Q

When calculating cross-entropy for classification tasks, the base-e or natural logarithm is used. True/False

P 233

A

True

This means that the units are in nats, not bits.
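
A minimal Python sketch (not from the book; the target and prediction below are made-up values) showing the same cross-entropy in nats versus bits; the two differ only by the constant factor ln(2):

from math import log, log2

def cross_entropy(p, q, log_fn):
    return -sum(p_i * log_fn(q_i) for p_i, q_i in zip(p, q))

P = [1.0, 0.0]   # one-hot target: the instance belongs to class 0
Q = [0.8, 0.2]   # predicted class probabilities

nats = cross_entropy(P, Q, log)    # natural log: what classification losses usually report
bits = cross_entropy(P, Q, log2)   # log base 2
print(nats, bits, nats / log(2))   # bits == nats / ln(2)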

9
Q

How is cross-entropy used for classification problems as a loss function?

Worked example P 235

A
  1. In classification problems we want to predict the class of each data instance (each row, in tabular data). Therefore, P is the probability distribution of each instance’s true class and Q is the probability distribution of the predicted class.
  2. To create the probability distribution for each instance, we use one-hot encoding. So in a binary classification, if the true class of an instance is 0, then the probability distribution for the instance’s class is [1, 0], because we are certain it belongs to class 0.
  3. According to the expanded formula for cross-entropy (a binary example):
    H(P, Q) = −(P(class0) × log(Q(class0)) + P(class1) × log(Q(class1)))
    This formula is for ONE instance in the dataset; to calculate the loss, we average the cross-entropy over all the instances.
  4. Let’s say P(class1) = y; since we have a binary example here, P(class0) = 1 − y. And let’s call Q(class1) = pred, so Q(class0) = 1 − pred.
  5. Rewriting the cross-entropy equation we have:
    H(P, Q) = −((1 − y) × log(1 − pred) + y × log(pred))

And averaged over all N instances (a code sketch follows after the notes below):

−1/N × Σ((1 − y) × log(1 − pred) + y × log(pred))

NOTE: for binary classification, the cross-entropy formula is the same as the log loss formula.

Cross-entropy is not log loss, but they calculate the same quantity when used as loss functions for binary classification problems. P 237
In fact, the negative log-likelihood for Multinoulli distributions (multiclass classification) also matches the calculation for cross-entropy.
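
A minimal Python sketch of the averaged binary cross-entropy described above (not from the book; y_true and y_pred are made-up example values), using the natural log so the result is in nats:

from math import log

def binary_cross_entropy(y_true, y_pred):
    # Average cross-entropy (log loss) over all instances.
    per_instance = [
        -((1 - y) * log(1 - p) + y * log(p))
        for y, p in zip(y_true, y_pred)
    ]
    return sum(per_instance) / len(per_instance)

y_true = [1, 0, 1, 1, 0]             # true class labels
y_pred = [0.9, 0.2, 0.7, 0.6, 0.1]   # predicted probability of class 1

print(binary_cross_entropy(y_true, y_pred))

This is exactly the log loss quantity mentioned in the note above.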

10
Q

Why can we minimize KL divergence instead of cross-entropy as the loss function and get the same result?

P 235

A

We could just minimize the KL divergence as a loss function instead of the cross-entropy. Recall that the KL divergence is the extra bits required to transmit one variable compared to another; it is the cross-entropy without the entropy of the class label, which we know would be zero anyway (because there is only one certain true class for each instance). As such, minimizing the KL divergence and the cross-entropy for a classification task are identical.

KL(P||Q) = H(P, Q) − H(P)
Reminder:
H(P, Q) = −Σ P(x) × log(Q(x)) (sum over x ∈ X)
KL(P||Q) = Σ P(x) × log(P(x) / Q(x)) (sum over x ∈ X)
H(P) = −Σ P(x) × log(P(x)) (sum over x ∈ X)
Proof: −Σ P(x) × log(Q(x)) = −Σ P(x) × log(P(x)) + Σ P(x) × log(P(x) / Q(x)), i.e. H(P, Q) = H(P) + KL(P||Q).

The specific values would be different, but the effect would be the same as the two values are proportional to each other P 237
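
A minimal Python sketch (not from the book; the one-hot target P and prediction Q below are made-up values): for a one-hot class label, H(P) = 0, so cross-entropy and KL divergence come out as the same number.

from math import log

def cross_entropy(p, q):
    # Skip zero-probability terms: 0 × log(q) contributes nothing.
    return -sum(p_i * log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

P = [0.0, 1.0, 0.0]   # one-hot target: the true class is class 1
Q = [0.1, 0.7, 0.2]   # hypothetical predicted probabilities

print(cross_entropy(P, Q))   # about 0.357 nats
print(kl_divergence(P, Q))   # identical, because H(P) = 0 for a one-hot target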

11
Q

More generally, the terms cross-entropy and negative log-likelihood are used interchangeably in the context of loss functions for classification models. True/False

P 238

A

True
Calculating log loss will give the same quantity as calculating the cross-entropy.

12
Q

A linear regression optimized under the maximum likelihood estimation framework assumes a Gaussian continuous probability distribution for the target variable and involves minimizing the mean squared error function. This is equivalent to the cross-entropy for a random variable with a Gaussian probability distribution. True/False

P 238

A

True
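
A minimal Python sketch (not from the book; y_true, y_pred and the unit variance are made-up assumptions): with a fixed-variance Gaussian likelihood, the average negative log-likelihood is a constant plus a scaled mean squared error, so minimizing one minimizes the other.

from math import log, pi

def gaussian_nll(y_true, y_pred, sigma=1.0):
    # Average negative log-likelihood of y_true under N(y_pred, sigma^2).
    n = len(y_true)
    return sum(
        0.5 * log(2 * pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
        for y, mu in zip(y_true, y_pred)
    ) / n

def mse(y_true, y_pred):
    return sum((y - mu) ** 2 for y, mu in zip(y_true, y_pred)) / len(y_true)

y_true = [1.2, 0.7, 3.4, 2.1]   # hypothetical continuous targets
y_pred = [1.0, 1.0, 3.0, 2.5]   # hypothetical regression predictions

# With sigma = 1: NLL = 0.5 * log(2 * pi) + 0.5 * MSE, so the minimizers coincide.
print(gaussian_nll(y_true, y_pred))
print(0.5 * log(2 * pi) + 0.5 * mse(y_true, y_pred))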
