Smoothing Flashcards

1
Q

What does smoothing achieve?

A

Smoothing reduces the effects of over-fitting and avoids zero probabilities. It assigns more probability mass to less likely or unseen events, taking some from the most likely events.

2
Q

What does add-one smoothing do?

A

For every possible n-gram, add one to its count and add the vocabulary size V to the denominator of the maximum-likelihood estimate. For example, the ML estimate of P(w|v) is:

P(w|v) = count(v,w) / count(v)

With add-one smoothing:

P(w|v) = (count(v,w) + 1) / (count(v) + V)
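A minimal Python sketch of add-one smoothing over bigram counts (the toy corpus and the add_one_prob helper are illustrative, not part of the course material):

from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
tokens = "the cat sat on the mat and the cat ate the mat".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

def add_one_prob(v, w):
    # P(w | v) = (count(v, w) + 1) / (count(v) + V)
    return (bigram_counts[(v, w)] + 1) / (unigram_counts[v] + V)

print(add_one_prob("the", "cat"))  # seen bigram: relatively high probability
print(add_one_prob("cat", "mat"))  # unseen bigram: small but non-zero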

3
Q

1) Why do we add V to the denominator?
2) Show why adding V works.

A

To make sure the smoothed distribution is a valid probability distribution, i.e. it sums to 1 over the vocabulary.

Summing the add-one estimate over all words w:

Σw (count(v,w) + 1) / (count(v) + V) = (Σw count(v,w) + V) / (count(v) + V) = (count(v) + V) / (count(v) + V) = 1

4
Q

The problems with add-one smoothing.

A
  1. It takes too much probability mass away from frequent, well-observed events.
  2. The ML estimates for frequent n-grams are already quite accurate, so we do not want to change them too much.
5
Q

What is add-α smoothing?

A

Add a constant α < 1 to each count in the numerator and multiply the vocabulary size V in the denominator by α:

P(w|v) = (C(v,w) + α) / (C(v) + αV)
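A short Python sketch of the same idea with a tunable α (the toy counts and the add_alpha_prob name are illustrative):

from collections import Counter

tokens = "the cat sat on the mat and the cat ate the mat".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
V = len(uni)

def add_alpha_prob(v, w, alpha=0.1):
    # P(w | v) = (C(v, w) + alpha) / (C(v) + alpha * V)
    return (bi[(v, w)] + alpha) / (uni[v] + alpha * V)

print(add_alpha_prob("cat", "sat"))  # alpha < 1 moves less mass to unseen events than add-one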

6
Q

What methodology can be used to optimise α for α-smoothing?

A

Validation tuning on held-out data.

Split the dataset into a training set, a validation set and a test set. Train models with various values of the parameter α and evaluate each on the validation set. The value of α with the best validation score is then used to report the model's overall performance on the test set.
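A hedged sketch of such a grid search, using perplexity as the validation score; the splits, candidate α values, and helper names are all illustrative:

import math
from collections import Counter

def train_counts(tokens):
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def add_alpha(uni, bi, v, w, alpha):
    V = len(uni)
    return (bi[(v, w)] + alpha) / (uni[v] + alpha * V)

def perplexity(tokens, uni, bi, alpha):
    # exp of the negative average log-probability of the bigrams in `tokens`
    logp = sum(math.log(add_alpha(uni, bi, v, w, alpha)) for v, w in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))

# Illustrative splits; real experiments use three disjoint corpora.
train = "the cat sat on the mat and the cat ate the mat".split()
valid = "the cat sat on the mat".split()
test = "the cat ate the mat".split()

uni, bi = train_counts(train)
best_alpha = min([0.01, 0.1, 0.5, 1.0], key=lambda a: perplexity(valid, uni, bi, a))
print(best_alpha, perplexity(test, uni, bi, best_alpha))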

7
Q

Advantage and disadvantage of using n-grams with large n

A

High-order N-grams are sensitive to more context, but have sparse counts.

8
Q

Advantage and disadvantage of using n-grams with low n

A

Low-order N-grams consider only very limited context, but have robust counts for maximum-likelihood estimation.

9
Q

What is interpolation?

A

Estimating a higher-order n-gram probability as a weighted combination of lower- and higher-order estimates in a mixture model. For example, for a trigram:

PI(w3|w1,w2) = λ1 P1(w3) + λ2 P2(w3|w2) + λ3 P3(w3|w1,w2)

where P1, P2, P3 are the unigram, bigram and trigram estimates, and Σi λi = 1.
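A small Python sketch of this mixture, assuming maximum-likelihood unigram/bigram/trigram estimates from a toy corpus and hand-picked λ weights:

from collections import Counter

tokens = "the cat sat on the mat and the cat ate the mat".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p1(w3):
    return uni[w3] / len(tokens)

def p2(w3, w2):
    return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

def p3(w3, w1, w2):
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

def p_interp(w3, w1, w2, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas  # must sum to 1
    return l1 * p1(w3) + l2 * p2(w3, w2) + l3 * p3(w3, w1, w2)

print(p_interp("sat", "the", "cat"))

In practice the λ weights are themselves tuned on held-out data, e.g. with the validation procedure from card 6.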

10
Q

What is a “mixture model”?

A

Any weighted combination of probability distributions, with non-negative weights that sum to 1.

11
Q

What is any weighted combination of distributions?

A

A mixture model

12
Q

What is Kneser-Ney smoothing? How is it done?

A

A smoothing technique for n-grams.

For the lower-order (e.g. unigram) distribution, replace the raw count of each word with the number of distinct contexts (previous words) it appears after, i.e. its continuation count:

N1+(•wi) = |{wi-1 : c(wi-1, wi) > 0}|

So instead of the maximum-likelihood estimate

PML(wi) = count(wi) / Σw' count(w')

we have

PKN(wi) = N1+(•wi) / Σw' N1+(•w')
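A Python sketch of the continuation-count idea for the unigram distribution only; note that full Kneser-Ney also applies absolute discounting at the higher orders, which is not shown here:

from collections import Counter, defaultdict

tokens = "san francisco is in california and san francisco is foggy".split()

counts = Counter(tokens)  # raw counts, used by the ML estimate

# Continuation counts: for each word, how many distinct words precede it.
histories = defaultdict(set)
for prev, w in zip(tokens, tokens[1:]):
    histories[w].add(prev)
continuation = {w: len(h) for w, h in histories.items()}

# "francisco" occurs twice but only ever after "san", so its continuation
# count is 1 and its Kneser-Ney unigram probability is lower than its ML probability.
print(counts["francisco"] / sum(counts.values()))
print(continuation["francisco"] / sum(continuation.values()))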

13
Q

What are smoothing techniques equivalent to?

A

Bayesian estimation.

Using add-α smoothing corresponds to using a Dirichlet prior.

Using Kneser-Ney smoothing corresponds to using a Pitman-Yor prior.

14
Q

What would we use if we wanted to construct a model that captures semantic similarities between words/sentences?

A

Word embeddings (also called distributed word representations)

15
Q

An advantage of n-gram LMs over distributed LMs

A

They are much quicker to compute

16
Q

What problems do interpolation and back-off solve?

A

Simpler smoothing methods (e.g. add-one) assign equal probability to all unseen events.

Intuition: we do have information about unseen sequences, because their lower-order n-grams may have been observed.

17
Q

Pros and cons of lower- vs higher-order n-grams

A
  1. Higher-order n-grams are sensitive to more context, but they are less robust and have sparser counts.
  2. Lower-order n-grams are more robust and have larger counts, but are less sensitive to context.
18
Q

Difference between backoff and interpolation

A

There are two ways to use this N-gram “hierarchy”, backoff and interpolation.

In backoff, if we have non-zero trigram counts, we rely solely on the trigram counts. We only “back off” to a lower order N-gram if we have zero evidence for a higher-order N-gram.

By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, i.e., we do a weighted interpolation of trigram, bigram, and unigram counts.
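A toy Python sketch contrasting the two lookup strategies (the fallback order and the λ weights are illustrative; a full backoff model such as Katz backoff also discounts and renormalizes, which is omitted here):

from collections import Counter

tokens = "the cat sat on the mat and the cat ate the mat".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_uni(w3):
    return uni[w3] / len(tokens)

def p_bi(w2, w3):
    return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

def p_tri(w1, w2, w3):
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

def backoff(w1, w2, w3):
    # Rely solely on the trigram if there is evidence for it,
    # otherwise back off to the bigram, then the unigram.
    if tri[(w1, w2, w3)] > 0:
        return p_tri(w1, w2, w3)
    if bi[(w2, w3)] > 0:
        return p_bi(w2, w3)
    return p_uni(w3)

def interpolate(w1, w2, w3, lambdas=(0.2, 0.3, 0.5)):
    # Always mix all three estimators, whatever the counts are.
    l1, l2, l3 = lambdas
    return l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)

print(backoff("the", "cat", "sat"), interpolate("the", "cat", "sat"))
print(backoff("cat", "the", "mat"), interpolate("cat", "the", "mat"))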