Chapter 22 Divergence Between Probability Distributions Flashcards

1
Q

What is statistical distance and give an example of its use case?

P 224

A

Statistical distance is the general idea of calculating the difference between statistical objects, such as two different probability distributions for a random variable. A common use case is quantifying how far an approximating distribution is from the true distribution of the data.

2
Q

What’s the divergence between two probability distributions? Give an example of a use case.

P 216

A

Divergence is a scoring of how one distribution differs from another. For example, we may have a single random variable and two different probability distributions for the variable, such as a true distribution and an approximation of that distribution.

3
Q

What does this mean: “Divergence is not a symmetric score”?

P 216

A

It means that the divergence calculated from distribution P to distribution Q will, in general, give a different score than the divergence calculated from Q to P.

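A minimal sketch showing the asymmetry numerically (the distributions P and Q below are hypothetical values chosen only for illustration):

# KL(P || Q) and KL(Q || P) give different scores for the same pair
# of discrete distributions.
from math import log2

p = [0.10, 0.40, 0.50]  # hypothetical distribution P
q = [0.80, 0.15, 0.05]  # hypothetical distribution Q

def kl_divergence(a, b):
    # KL(A || B) = sum over events of A(x) * log2(A(x) / B(x))
    return sum(a_i * log2(a_i / b_i) for a_i, b_i in zip(a, b))

print('KL(P || Q) =', kl_divergence(p, q))
print('KL(Q || P) =', kl_divergence(q, p))  # a different score
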
4
Q

Divergence scores are an important foundation for many different calculations in information theory and, more generally, in machine learning. Give two examples of their use in ML.

P 216

A

They provide shortcuts for calculating other scores, for example: (1) mutual information (also known as information gain), and (2) cross-entropy, which is used as a loss function for classification models.

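A minimal sketch of the cross-entropy shortcut (the distributions P and Q are hypothetical values chosen only for illustration): cross-entropy can be computed directly, or as the entropy of P plus the KL divergence from P to Q, and the two calculations agree.

# Cross-entropy H(P, Q) computed two ways: directly, and as
# H(P) + KL(P || Q). The distributions are hypothetical examples.
from math import log2

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

entropy_p = -sum(p_i * log2(p_i) for p_i in p)                    # H(P)
kl_pq = sum(p_i * log2(p_i / q_i) for p_i, q_i in zip(p, q))      # KL(P || Q)
cross_entropy = -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))   # H(P, Q)

print(cross_entropy)        # direct calculation
print(entropy_p + kl_pq)    # same value via the KL shortcut
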
5
Q

What’s the intuition of KL-divergence?

P 217

A

The intuition for the KL-divergence score is that when the probability of an event under P is large but the probability of the same event under Q is small (or vice versa), there is a large divergence. The KL-divergence score is the sum (or integral) of these per-event contributions over all events of the variable.

The KL divergence summarizes the number of additional bits (i.e., calculated with the base-2 logarithm) required to represent an event from the random variable. The better our approximation, the less additional information is required.

In other words, the KL divergence is the average number of extra bits needed to encode the data because we used distribution Q to encode the data instead of the true distribution P.

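The underlying definition for discrete distributions, using the base-2 logarithm so the result is in bits:

KL(P || Q) = sum over x of P(x) * log2( P(x) / Q(x) )

For continuous random variables, the sum becomes an integral over the probability density functions.
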
6
Q

KL-Divergence can be used to measure the divergence between discrete or continuous probability distributions. True/False

P 217

A

True

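A minimal sketch of the continuous case, using the well-known closed form for the KL divergence between two univariate Gaussian distributions (the parameter values here are hypothetical, chosen only for illustration):

# Closed-form KL divergence (in nats) between two univariate Gaussians:
# KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )
from math import log

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    return log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

print(kl_gaussians(0.0, 1.0, 1.0, 2.0))  # hypothetical parameters
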
7
Q

What’s another name for KL-Divergence?

P 217

A

Relative entropy

8
Q

When the KL-Divergence score is 0, it suggests that the distributions are ____; otherwise the score is ____ (Negative/Positive). Importantly, the KL divergence score is not symmetrical, which means: ____

P 217

A

Identical, Positive
KL(P||Q) != KL(Q||P)

9
Q

The Jensen-Shannon divergence, or JS-Divergence for short, is another way to quantify the difference (or similarity) between two probability distributions. It uses the KL divergence to calculate a ____ score that is ____.

P 220

A

normalized, symmetrical (JS(P||Q) ≡ JS(Q||P))

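A minimal sketch of the calculation (hypothetical P and Q, base-2 logarithm so the score falls between 0 and 1): the JS divergence averages the KL divergence of each distribution against their mixture M = (P + Q) / 2.

# JS divergence built from two KL calculations against the mixture M.
# Symmetric: js_divergence(p, q) == js_divergence(q, p).
from math import log2

def kl_divergence(a, b):
    return sum(a_i * log2(a_i / b_i) for a_i, b_i in zip(a, b))

def js_divergence(p, q):
    m = [0.5 * (p_i + q_i) for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.10, 0.40, 0.50]  # hypothetical distributions
q = [0.80, 0.15, 0.05]
print(js_divergence(p, q))
print(js_divergence(q, p))  # same value
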
10
Q

Why is JS-Divergence more useful than KL-Divergence?

P 221

A

JS-Divergence is more useful as a measure because it provides a smoothed and normalized version of KL-Divergence, with scores between 0 (identical) and 1 (maximally different) when using the base-2 logarithm.

11
Q

What’s the square root of JS-Divergence called?

P 221

A

The square root of the score gives a quantity referred to as the Jensen-Shannon distance, or JS distance for short.

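One way to compute it, assuming SciPy is available: scipy.spatial.distance.jensenshannon returns the JS distance (not the divergence), and squaring it recovers the JS divergence. The distributions below are hypothetical values chosen only for illustration.

# SciPy returns the JS *distance*; with base=2 it lies between 0 and 1.
from scipy.spatial.distance import jensenshannon

p = [0.10, 0.40, 0.50]  # hypothetical distributions
q = [0.80, 0.15, 0.05]
d = jensenshannon(p, q, base=2)
print(d)        # JS distance
print(d ** 2)   # JS divergence
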