Chapter 23 Cross-Entropy for Machine Learning Flashcards

1
Q

What’s cross-entropy? How is it used in ML?

P 225

A

Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating
the difference between two probability distributions. Cross-entropy is commonly used in machine learning as a loss function.

Specifically, it builds upon the idea of entropy from information theory and calculates the average number of bits required to represent or transmit an event from one distribution compared to the other distribution.

2
Q

Cross-entropy is closely related to, but different from, KL divergence: KL divergence calculates the ____ between two probability distributions,
whereas cross-entropy can be thought of as calculating the ____ between the distributions.

P 225

A

relative entropy, total entropy

3
Q

What’s the intuition behind cross-entropy?

P 225

A

The intuition for this definition: if we consider a target (underlying) probability distribution P and an approximation of that target distribution Q, then the cross-entropy of Q from P is the average number of total bits needed to represent an event using Q instead of P. The cross-entropy between two probability distributions, such as Q from P, can be stated formally as:

H(P, Q) = −Σ P(x) × log(Q(x)) (the sum running over all events x ∈ X)
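
A minimal Python sketch of this formula (not from the book; P and Q below are made-up example distributions), using log base 2 so the result is in bits:

from math import log2

def cross_entropy(p, q):
    # Average total bits needed to encode events drawn from p
    # using a code that was built for q.
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

# Hypothetical discrete distributions over three events.
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))  # cross-entropy of Q from P, in bits
print(cross_entropy(P, P))  # equals the entropy H(P) when the two distributions match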

4
Q

The result of cross-entropy will be a ____ (positive/negative) number measured in bits, and will be equal to the entropy of the distribution if ____.

P 226

A

positive, the two probability distributions are identical (refer to the formula)

5
Q

Define:

- Cross-Entropy H(P, Q):
- Relative Entropy (KL Divergence) KL(P||Q):

P 227

A

- Cross-Entropy H(P, Q): average (expected) number of total bits to represent an event from Q instead of P.
- Relative Entropy (KL Divergence) KL(P||Q): average (expected) number of extra bits to represent an event from Q instead of P.

The “average” comes from the formula, which has the same shape as an expected value (a sum of p(x) × something); the “something” is what differs between cross-entropy and KL divergence.

Me: KL divergence is the expected value of the additional information (bits) when using Q instead of P.
Cross-entropy is the expected value of the total information (bits) needed to represent the events using Q instead of P; this is why, if Q and P are identical, cross-entropy = entropy.
Me: information is the number of bits needed to represent an event, in other words −log2(probability of the event).
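
A minimal Python sketch (not from the book; P and Q are made-up example distributions) that ties the three quantities together in bits, i.e. H(P, Q) = H(P) + KL(P||Q):

from math import log2

def entropy(p):
    # Average bits to encode events from p with a code built for p.
    return -sum(p_i * log2(p_i) for p_i in p)

def cross_entropy(p, q):
    # Average total bits to encode events from p with a code built for q.
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

def kl_divergence(p, q):
    # Average extra bits paid for using q's code instead of p's.
    return sum(p_i * log2(p_i / q_i) for p_i, q_i in zip(p, q))

# Hypothetical distributions over three events.
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))               # total bits using Q instead of P
print(entropy(P) + kl_divergence(P, Q))  # same number: H(P) plus the extra bits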

6
Q

Cross-entropy is not symmetrical, meaning that H(P, Q) != H(Q, P). True/False

P 227

A

True
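
A quick Python check (not from the book; P and Q are made-up example distributions) showing the asymmetry:

from math import log2

def cross_entropy(p, q):
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

# Hypothetical distributions over three events.
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))  # bits to encode events from P using Q's code
print(cross_entropy(Q, P))  # bits to encode events from Q using P's code: a different number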

7
Q

What’s the benefit of using cross-entropy instead of sum of squares for classification problems?

P 232

A

It leads to faster training as well as improved generalization.

8
Q

When calculating cross-entropy for classification tasks, the base-e or natural logarithm is used. True/False

P 233

A

True

This means that the units are in nats, not bits.
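
A minimal Python sketch (not from the book; the target and prediction below are made-up values) showing the same cross-entropy in nats versus bits; the two differ only by the constant factor ln(2):

from math import log, log2

def cross_entropy(p, q, log_fn):
    return -sum(p_i * log_fn(q_i) for p_i, q_i in zip(p, q))

P = [1.0, 0.0]   # one-hot target: the instance belongs to class 0
Q = [0.8, 0.2]   # predicted class probabilities

nats = cross_entropy(P, Q, log)    # natural log: what classification losses usually report
bits = cross_entropy(P, Q, log2)   # log base 2
print(nats, bits, nats / log(2))   # bits == nats / ln(2)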

9
Q

How is cross-entropy used for classification problems as a loss function?

Worked example P 235

A
  1. In classification problems we want to predict the class of each data instance (each row, in tabular data). Therefore, P is the probability distribution of each instance’s true class and Q is the probability distribution of the predicted class.
  2. To create the probability distribution for each instance, we use one-hot encoding. So in a binary classification, if the true class of an instance is 0, then the probability distribution for the instance’s class is [1, 0], because we are certain it belongs to class 0.
  3. According to the expanded formula for cross-entropy (a binary example):
    H(P, Q) = −(P(class0) × log(Q(class0)) + P(class1) × log(Q(class1)))
    This formula is for ONE instance in the dataset; to calculate the loss, we average the cross-entropy over all the instances.
  4. Let’s say P(class1) = y; since we have a binary example here, P(class0) = 1 − y. And let’s call Q(class1) = pred, so Q(class0) = 1 − pred.
  5. Rewriting the cross-entropy equation we have:
    H(P, Q) = −((1 − y) × log(1 − pred) + y × log(pred))

And averaged over all N instances (a code sketch follows after the notes below):

−1/N × Σ((1 − y) × log(1 − pred) + y × log(pred))

NOTE: for binary classification, the cross-entropy formula is the same as the log loss formula.

Cross-entropy is not log loss, but they calculate the same quantity when used as loss functions for binary classification problems. P 237
In fact, the negative log-likelihood for Multinoulli distributions (multiclass classification) also matches the calculation for cross-entropy.
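
A minimal Python sketch of the averaged binary cross-entropy described above (not from the book; y_true and y_pred are made-up example values), using the natural log so the result is in nats:

from math import log

def binary_cross_entropy(y_true, y_pred):
    # Average cross-entropy (log loss) over all instances.
    per_instance = [
        -((1 - y) * log(1 - p) + y * log(p))
        for y, p in zip(y_true, y_pred)
    ]
    return sum(per_instance) / len(per_instance)

y_true = [1, 0, 1, 1, 0]             # true class labels
y_pred = [0.9, 0.2, 0.7, 0.6, 0.1]   # predicted probability of class 1

print(binary_cross_entropy(y_true, y_pred))

This is exactly the log loss quantity mentioned in the note above.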

10
Q

Why can we minimize KL divergence instead of cross-entropy as the loss function and get the same result?

P 235

A

We could just minimize the KL divergence as a loss function instead of the cross-entropy. Recall that the KL divergence is the extra bits required to transmit one variable compared to another; it is the cross-entropy without the entropy of the class label, which we know would be zero anyway (because there is only one certain true class for each instance). As such, minimizing the KL divergence and the cross-entropy for a classification task are identical.

KL(P||Q) = H(P, Q) − H(P)
Reminder:
H(P, Q) = −Σ P(x) × log(Q(x)) (sum over x ∈ X)
KL(P||Q) = Σ P(x) × log(P(x) / Q(x)) (sum over x ∈ X)
H(P) = −Σ P(x) × log(P(x)) (sum over x ∈ X)
Proof: −Σ P(x) × log(Q(x)) = −Σ P(x) × log(P(x)) + Σ P(x) × log(P(x) / Q(x)), i.e. H(P, Q) = H(P) + KL(P||Q).

The specific values would be different, but the effect would be the same as the two values are proportional to each other P 237
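
A minimal Python sketch (not from the book; the one-hot target P and prediction Q below are made-up values): for a one-hot class label, H(P) = 0, so cross-entropy and KL divergence come out as the same number.

from math import log

def cross_entropy(p, q):
    # Skip zero-probability terms: 0 × log(q) contributes nothing.
    return -sum(p_i * log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

P = [0.0, 1.0, 0.0]   # one-hot target: the true class is class 1
Q = [0.1, 0.7, 0.2]   # hypothetical predicted probabilities

print(cross_entropy(P, Q))   # about 0.357 nats
print(kl_divergence(P, Q))   # identical, because H(P) = 0 for a one-hot target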

11
Q

More generally, the terms cross-entropy and negative log-likelihood are used interchangeably in the context of loss functions for classification models. True/False

P 238

A

True
Calculating log loss will give the same quantity as calculating the cross-entropy.

12
Q

A linear regression optimized under the maximum likelihood estimation framework assumes a Gaussian continuous probability distribution for the target variable and involves minimizing the mean squared error function. This is equivalent to the cross-entropy for a random variable with a Gaussian probability distribution. True/False

P 238

A

True
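
A minimal Python sketch (not from the book; y_true, y_pred and the unit variance are made-up assumptions): with a fixed-variance Gaussian likelihood, the average negative log-likelihood is a constant plus a scaled mean squared error, so minimizing one minimizes the other.

from math import log, pi

def gaussian_nll(y_true, y_pred, sigma=1.0):
    # Average negative log-likelihood of y_true under N(y_pred, sigma^2).
    n = len(y_true)
    return sum(
        0.5 * log(2 * pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
        for y, mu in zip(y_true, y_pred)
    ) / n

def mse(y_true, y_pred):
    return sum((y - mu) ** 2 for y, mu in zip(y_true, y_pred)) / len(y_true)

y_true = [1.2, 0.7, 3.4, 2.1]   # hypothetical continuous targets
y_pred = [1.0, 1.0, 3.0, 2.5]   # hypothetical regression predictions

# With sigma = 1: NLL = 0.5 * log(2 * pi) + 0.5 * MSE, so the minimizers coincide.
print(gaussian_nll(y_true, y_pred))
print(0.5 * log(2 * pi) + 0.5 * mse(y_true, y_pred))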
