Smoothing Flashcards
What does smoothing achieve?
It reduces the effects of over-fitting and avoids zero probabilities by reassigning some probability mass from the most likely (frequently observed) events to less likely or unseen events.
What does add-one smoothing do?
For all possible n-grams: start from the maximum-likelihood estimate, add one to each count in the numerator and add the vocabulary size V to the denominator. For example, the ML estimate of P(w|v) is P(w|v) = count(v,w)/count(v); with add-one smoothing it becomes P(w|v) = (count(v,w) + 1)/(count(v) + V), where V is the vocabulary size.
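A minimal sketch of add-one smoothing for a bigram model (the toy corpus and the function/variable names below are illustrative assumptions, not part of the flashcards):

    from collections import Counter

    def add_one_bigram_prob(tokens, vocab):
        # Add-one (Laplace) smoothed estimate of P(w | v).
        bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(v, w)
        context_counts = Counter(tokens[:-1])             # count(v) as a context
        V = len(vocab)                                    # vocabulary size
        def prob(w, v):
            return (bigram_counts[(v, w)] + 1) / (context_counts[v] + V)
        return prob

    tokens = ["the", "cat", "sat", "on", "the", "mat"]    # assumed toy corpus
    vocab = set(tokens)
    P = add_one_bigram_prob(tokens, vocab)
    print(P("cat", "the"))                                # smoothed P(cat | the)
    print(sum(P(w, "the") for w in vocab))                # sums to 1 (see next card)

Counting contexts over tokens[:-1] keeps the denominator equal to the number of bigrams that start with v, so the smoothed probabilities sum to exactly 1.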
1) Why do we add v to the denominator?
2) Show why adding v works for the first question.
1) To make sure the smoothed n-gram probabilities still form a valid distribution, i.e. sum to 1 over the vocabulary.
2) Summing over all words w in the vocabulary: Σw (count(v,w) + 1)/(count(v) + V) = (Σw count(v,w) + V)/(count(v) + V) = (count(v) + V)/(count(v) + V) = 1.

The problems with add-one smoothing.
- It takes too much probability mass away from the frequent, high-probability events.
- The ML estimates for frequently observed events are already quite accurate, so we do not want to change them too much.
What is add-α smoothing?
Add a constant α < 1 to each count in the numerator and add α times the vocabulary size V to the denominator:
P(w|v) = (C(v,w) + α)/(C(v) + αV)
What methodology can be used to optimise α for α-smoothing?
Held-out validation (tuning on a validation/development set).
Split the dataset into a training set, a validation set and a test set. Train models with various values of the parameter α and evaluate each on the validation set. The α whose model scores best on the validation set is selected, and that model is evaluated on the test set to report the overall performance.
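A minimal sketch of this tuning loop, using the perplexity of an add-α unigram model as the validation score (the toy data and the candidate α values are illustrative assumptions):

    import math
    from collections import Counter

    def perplexity(data, counts, total, V, alpha):
        # Perplexity of an add-alpha smoothed unigram model on held-out data.
        logprob = sum(math.log((counts[w] + alpha) / (total + alpha * V)) for w in data)
        return math.exp(-logprob / len(data))

    train = ["a", "b", "a", "c", "a", "b"]    # assumed toy corpora
    valid = ["a", "c", "b", "d"]              # "d" is unseen in training
    vocab = set(train) | set(valid)

    counts, total, V = Counter(train), len(train), len(vocab)
    best_alpha = min([0.01, 0.1, 0.5, 1.0],
                     key=lambda a: perplexity(valid, counts, total, V, a))
    print(best_alpha)    # the alpha with the lowest validation perplexity

The chosen best_alpha is then fixed, and only the corresponding model is evaluated on the test set.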
Advantage and disadvantage of using n-grams with large n
High-order n-grams are sensitive to more context, but have sparse counts.
Advantage and disadvantage of using n-grams with low n
Low-order n-grams consider only very limited context, but have robust counts for maximum-likelihood estimation.
What is interpolation?
Estimating a higher-order n-gram probability as a weighted combination (mixture model) of lower-order n-gram estimates. For example:
PI(w3|w1,w2) = λ1P(w3) + λ2P(w3|w2) + λ3P(w3|w1,w2)
where Σiλi = 1 and each λi ≥ 0.
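A minimal sketch of this interpolation with fixed λ weights (the toy corpus and the particular λ values are illustrative assumptions; in practice the λs are tuned on held-out data):

    from collections import Counter

    def interpolated_trigram_prob(tokens, lambdas=(0.1, 0.3, 0.6)):
        # P_I(w3 | w1, w2) = l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2)
        l1, l2, l3 = lambdas                  # must sum to 1
        uni = Counter(tokens)
        bi = Counter(zip(tokens, tokens[1:]))
        tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        N = len(tokens)
        def prob(w3, w1, w2):
            p1 = uni[w3] / N
            p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
            p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
            return l1 * p1 + l2 * p2 + l3 * p3
        return prob

    tokens = ["the", "cat", "sat", "on", "the", "mat"]    # assumed toy corpus
    P = interpolated_trigram_prob(tokens)
    print(P("sat", "the", "cat"))    # mixes unigram, bigram and trigram estimates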
What is a “mixture model”?
Any weighted combination of distributions
What is any weighted combination of distributions?
A mixture model
What is Kneser-Ney smoothing? How is it done?
A smoothing technique for n-grams.
For the lower-order (unigram) distribution, replace the raw count of each word with the number of distinct histories it has been seen to follow.
This is given by N1+(•wi) = |{wi-1 : c(wi-1,wi) > 0}|
So instead of
PML(wi) = c(wi)/Σw' c(w')
we have
PKN(wi) = N1+(•wi)/Σw' N1+(•w')
(In the full Kneser-Ney model this continuation probability serves as the lower-order distribution and is combined with absolute discounting of the higher-order counts.)
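A minimal sketch of the distinct-history (continuation) counts computed from bigrams (the toy corpus is an illustrative assumption, and the discounting/interpolation of the full Kneser-Ney model is omitted):

    from collections import defaultdict

    def continuation_probs(tokens):
        # P_KN(w) is proportional to the number of distinct words that precede w.
        histories = defaultdict(set)
        for v, w in zip(tokens, tokens[1:]):
            histories[w].add(v)                               # distinct histories of w
        n1plus = {w: len(hs) for w, hs in histories.items()}  # N1+(. w)
        total = sum(n1plus.values())
        return {w: n / total for w, n in n1plus.items()}

    tokens = ["san", "francisco", "is", "in", "san", "francisco", "near", "san", "jose"]
    print(continuation_probs(tokens))
    # "francisco" is frequent but only ever follows "san", so N1+(. francisco) = 1
    # and its continuation probability is lower than its raw relative frequency.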
What are smoothing techniques equivalent to?
Bayesian estimation.
Using alpha-smoothing is the same as using a Dirichlet prior
Using Kneser-Ney smoothing is the same as using a Pitman-Yor prior
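As a brief sketch of the Dirichlet correspondence (a standard Bayesian result, stated here for context rather than taken from the cards): placing a symmetric Dirichlet(α) prior on the next-word distribution for context v and averaging over the posterior gives the posterior predictive
P(w|v, data) = (c(v,w) + α)/(c(v) + αV),
which is exactly the add-α estimate; α = 1 (a uniform prior over distributions) recovers add-one smoothing.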
What would we use if we wanted to construct a model that captures semantic similarities between words/sentences?
Word embeddings (also called distributed word representations)
An advantage of n-gram LMs over distributed LMs
They are much quicker to compute
What problems do interpolation and back-off solve?
The smoothing methods above assign the same probability to every unseen event.
Intuition: we do have information about unseen higher-order sequences, because their lower-order n-grams may have been observed.
Pros and cons of lower- vs higher-order n-grams
- Higher-order n-grams are sensitive to more context, however they are less robust and have sparser counts.
- Lower-order n-grams are more robust and have larger counts but are less sensitive to context.
Difference between backoff and interpolation
There are two ways to use this N-gram “hierarchy”, backoff and interpolation.
In backoff, if we have non-zero trigram counts, we rely solely on the trigram counts. We only “back off” to a lower order N-gram if we have zero evidence for a higher-order N-gram.
By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, i.e., we do a weighted interpolation of trigram, bigram, and unigram counts.
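A minimal sketch contrasting the two strategies for a trigram model (the toy corpus and λ values are illustrative assumptions; the simple backoff below omits the discounting and renormalisation that e.g. Katz backoff uses to keep a valid distribution):

    from collections import Counter

    tokens = ["the", "cat", "sat", "on", "the", "mat"]    # assumed toy corpus
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p_uni(w3):
        return uni[w3] / N

    def p_bi(w3, w2):
        return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

    def p_tri(w3, w1, w2):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    def backoff(w3, w1, w2):
        # Use the trigram estimate only if the trigram was seen; otherwise back off.
        if tri[(w1, w2, w3)] > 0:
            return p_tri(w3, w1, w2)
        if bi[(w2, w3)] > 0:
            return p_bi(w3, w2)
        return p_uni(w3)

    def interpolate(w3, w1, w2, l=(0.1, 0.3, 0.6)):
        # Always mix all three estimators, whatever the counts are.
        return l[0] * p_uni(w3) + l[1] * p_bi(w3, w2) + l[2] * p_tri(w3, w1, w2)

    print(backoff("mat", "sat", "on"), interpolate("mat", "sat", "on"))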