Smoothing Flashcards
What does smoothing achieve?
It reduces the effects of over-fitting and avoids zero probabilities by reassigning some probability mass from the most likely (frequently observed) events to less likely or unseen events.
What does add-one smoothing do?
For all possible n-grams: start from the maximum-likelihood estimate, add one to each count in the numerator and add the vocabulary size V to the denominator. For example, the ML estimate of P(w|v) is P(w|v) = count(v,w)/count(v); with add-one smoothing it becomes P(w|v) = (count(v,w) + 1)/(count(v) + V), where V is the vocabulary size.
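A minimal sketch of add-one smoothing for a bigram model (the toy corpus and the function/variable names below are illustrative assumptions, not part of the flashcards):

    from collections import Counter

    def add_one_bigram_prob(tokens, vocab):
        # Add-one (Laplace) smoothed estimate of P(w | v).
        bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(v, w)
        context_counts = Counter(tokens[:-1])             # count(v) as a context
        V = len(vocab)                                    # vocabulary size
        def prob(w, v):
            return (bigram_counts[(v, w)] + 1) / (context_counts[v] + V)
        return prob

    tokens = ["the", "cat", "sat", "on", "the", "mat"]    # assumed toy corpus
    vocab = set(tokens)
    P = add_one_bigram_prob(tokens, vocab)
    print(P("cat", "the"))                                # smoothed P(cat | the)
    print(sum(P(w, "the") for w in vocab))                # sums to 1 (see next card)

Counting contexts over tokens[:-1] keeps the denominator equal to the number of bigrams that start with v, so the smoothed probabilities sum to exactly 1.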
1) Why do we add v to the denominator?
2) Show why adding v works for the first question.
1) To make sure the smoothed n-gram probabilities still form a valid distribution, i.e. sum to 1 over the vocabulary.
2) Summing over all words w in the vocabulary: Σw (count(v,w) + 1)/(count(v) + V) = (Σw count(v,w) + V)/(count(v) + V) = (count(v) + V)/(count(v) + V) = 1.

The problems with add-one smoothing.
- It takes too much probability mass away from the frequent, high-probability events.
- The ML estimates for frequently observed events are already quite accurate, so we do not want to change them too much.
What is add-α smoothing?
Add a constant α < 1 to each count in the numerator and add α times the vocabulary size V to the denominator:
P(w|v) = (C(v,w) + α)/(C(v) + αV)
What methodology can be used to optimise α for α-smoothing?
Held-out validation (tuning on a validation/development set).
Split the dataset into a training set, a validation set and a test set. Train models with various values of the parameter α and evaluate each on the validation set. The α whose model scores best on the validation set is selected, and that model is evaluated on the test set to report the overall performance.
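A minimal sketch of this tuning loop, using the perplexity of an add-α unigram model as the validation score (the toy data and the candidate α values are illustrative assumptions):

    import math
    from collections import Counter

    def perplexity(data, counts, total, V, alpha):
        # Perplexity of an add-alpha smoothed unigram model on held-out data.
        logprob = sum(math.log((counts[w] + alpha) / (total + alpha * V)) for w in data)
        return math.exp(-logprob / len(data))

    train = ["a", "b", "a", "c", "a", "b"]    # assumed toy corpora
    valid = ["a", "c", "b", "d"]              # "d" is unseen in training
    vocab = set(train) | set(valid)

    counts, total, V = Counter(train), len(train), len(vocab)
    best_alpha = min([0.01, 0.1, 0.5, 1.0],
                     key=lambda a: perplexity(valid, counts, total, V, a))
    print(best_alpha)    # the alpha with the lowest validation perplexity

The chosen best_alpha is then fixed, and only the corresponding model is evaluated on the test set.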
Advantage and disadvantage of using n-grams with large n
High-order n-grams are sensitive to more context, but have sparse counts.
Advantage and disadvantage of using n-grams with low n
Low-order n-grams consider only very limited context, but have robust counts for maximum-likelihood estimation.
What is interpolation?
Estimating a higher-order n-gram probability as a weighted combination (mixture model) of lower-order n-gram estimates. For example:
PI(w3|w1,w2) = λ1P(w3) + λ2P(w3|w2) + λ3P(w3|w1,w2)
where Σiλi = 1 and each λi ≥ 0.
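A minimal sketch of this interpolation with fixed λ weights (the toy corpus and the particular λ values are illustrative assumptions; in practice the λs are tuned on held-out data):

    from collections import Counter

    def interpolated_trigram_prob(tokens, lambdas=(0.1, 0.3, 0.6)):
        # P_I(w3 | w1, w2) = l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2)
        l1, l2, l3 = lambdas                  # must sum to 1
        uni = Counter(tokens)
        bi = Counter(zip(tokens, tokens[1:]))
        tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        N = len(tokens)
        def prob(w3, w1, w2):
            p1 = uni[w3] / N
            p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
            p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
            return l1 * p1 + l2 * p2 + l3 * p3
        return prob

    tokens = ["the", "cat", "sat", "on", "the", "mat"]    # assumed toy corpus
    P = interpolated_trigram_prob(tokens)
    print(P("sat", "the", "cat"))    # mixes unigram, bigram and trigram estimates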
What is a “mixture model”?
Any weighted combination of distributions
What is any weighted combination of distributions?
A mixture model
What is Kneser-Ney smoothing? How is it done?
A smoothing technique for n-grams.
For the lower-order (unigram) distribution, replace the raw count of each word with the number of distinct histories it has been seen to follow.
This is given by N1+(•wi) = |{wi-1 : c(wi-1,wi) > 0}|
So instead of
PML(wi) = c(wi)/Σw' c(w')
we have
PKN(wi) = N1+(•wi)/Σw' N1+(•w')
(In the full Kneser-Ney model this continuation probability serves as the lower-order distribution and is combined with absolute discounting of the higher-order counts.)
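A minimal sketch of the distinct-history (continuation) counts computed from bigrams (the toy corpus is an illustrative assumption, and the discounting/interpolation of the full Kneser-Ney model is omitted):

    from collections import defaultdict

    def continuation_probs(tokens):
        # P_KN(w) is proportional to the number of distinct words that precede w.
        histories = defaultdict(set)
        for v, w in zip(tokens, tokens[1:]):
            histories[w].add(v)                               # distinct histories of w
        n1plus = {w: len(hs) for w, hs in histories.items()}  # N1+(. w)
        total = sum(n1plus.values())
        return {w: n / total for w, n in n1plus.items()}

    tokens = ["san", "francisco", "is", "in", "san", "francisco", "near", "san", "jose"]
    print(continuation_probs(tokens))
    # "francisco" is frequent but only ever follows "san", so N1+(. francisco) = 1
    # and its continuation probability is lower than its raw relative frequency.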
What are smoothing techniques equivalent to?
Bayesian estimation.
Using alpha-smoothing is the same as using a Dirichlet prior
Using Kneser-Ney smoothing is the same as using a Pitman-Yor prior
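As a brief sketch of the Dirichlet correspondence (a standard Bayesian result, stated here for context rather than taken from the cards): placing a symmetric Dirichlet(α) prior on the next-word distribution for context v and averaging over the posterior gives the posterior predictive
P(w|v, data) = (c(v,w) + α)/(c(v) + αV),
which is exactly the add-α estimate; α = 1 (a uniform prior over distributions) recovers add-one smoothing.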
What would we use if we wanted to construct a model that captures semantic similarities between words/sentences?
Word embeddings (also called distributed word representations)
An advantage of n-gram LMs over distributed LMs
They are much quicker to compute
What problems do interpolation and back-off solve?
The smoothing methods above assign the same probability to every unseen event.
Intuition: we do have information about unseen higher-order sequences, because their lower-order n-grams may have been observed.
Pros and cons of lower- vs higher-order n-grams
- Higher-order n-grams are sensitive to more context, however they are less robust and have sparser counts.
- Lower-order n-grams are more robust and have larger counts but are less sensitive to context.
Difference between backoff and interpolation
There are two ways to use this N-gram “hierarchy”, backoff and interpolation.
In backoff, if we have non-zero trigram counts, we rely solely on the trigram counts. We only “back off” to a lower order N-gram if we have zero evidence for a higher-order N-gram.
By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, i.e., we do a weighted interpolation of trigram, bigram, and unigram counts.
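A minimal sketch contrasting the two strategies for a trigram model (the toy corpus and λ values are illustrative assumptions; the simple backoff below omits the discounting and renormalisation that e.g. Katz backoff uses to keep a valid distribution):

    from collections import Counter

    tokens = ["the", "cat", "sat", "on", "the", "mat"]    # assumed toy corpus
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p_uni(w3):
        return uni[w3] / N

    def p_bi(w3, w2):
        return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

    def p_tri(w3, w1, w2):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    def backoff(w3, w1, w2):
        # Use the trigram estimate only if the trigram was seen; otherwise back off.
        if tri[(w1, w2, w3)] > 0:
            return p_tri(w3, w1, w2)
        if bi[(w2, w3)] > 0:
            return p_bi(w3, w2)
        return p_uni(w3)

    def interpolate(w3, w1, w2, l=(0.1, 0.3, 0.6)):
        # Always mix all three estimators, whatever the counts are.
        return l[0] * p_uni(w3) + l[1] * p_bi(w3, w2) + l[2] * p_tri(w3, w1, w2)

    print(backoff("mat", "sat", "on"), interpolate("mat", "sat", "on"))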