N-gram Language Models Flashcards
Language Models
Models that assign probabilities to sequences of words.
N-gram
A sequence of n words.
Markov models
A class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
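A sketch of the assumption for a bigram (first-order Markov) model, in the usual notation:
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})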
Extrinsic Evaluation
An end-to-end evaluation.
E.g. embedding the model in an application and measuring how much the application improves.
Intrinsic Evaluation Metric
A metric that measures the quality of a model independent of any application.
Perplexity
The perplexity (PP) of a language model on a test set is the inverse probability of the test set, normalised by the number of words.
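One common way to write this, assuming a test set W = w_1 w_2 \ldots w_N:
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}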
OOV rate
The percentage of out-of-vocabulary words that appear in the test set.
Open vocabulary system
A system in which we model potential unknown words in the test set by adding a pseudo-word called <UNK>.
Smoothing
A method of dealing with words that are in our vocabulary but appear in the test set in an unseen context (e.g. after a word they never followed in training).
To keep the model from assigning zero probability to these events, we shave off a bit of probability mass from some more frequent events and give it to the events never seen.
Laplace Smoothing
The simplest smoothing technique.
Add one to all the n-gram counts, before normalising them into probabilities.
All the counts that used to be zero will now have a count of 1, counts of 1 will be 2, etc.
a.k.a. add-one smoothing
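A sketch of the resulting estimate for bigrams, assuming a vocabulary of size V:
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}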
add-k smoothing
Like Laplace smoothing, but instead of adding 1 to each count, we add a fractional count k (e.g. 0.5, 0.05, 0.01).
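The bigram estimate then becomes (again assuming a vocabulary of size V):
P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}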
backoff smoothing
We use the trigram if the evidence is sufficient, otherwise use the bigram, otherwise the unigram.
I.e. we only “back off” to a lower-order n-gram if we have zero evidence for a higher order n-gram.
interpolation smoothing
We mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram and unigram counts.
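A sketch of simple linear interpolation for trigrams, with weights \lambda_i that sum to 1:
\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1})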
Discount for a backoff model
In order for a backoff model to give a correct probability distribution, we have to discount the higher-order n-grams to save some probability mass for the lower order n-grams.
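A sketch of the bigram case in Katz-style notation, where P^{*} is the discounted probability and \alpha(w_{n-1}) redistributes the saved probability mass:
P_{\text{BO}}(w_n \mid w_{n-1}) = P^{*}(w_n \mid w_{n-1}) \text{ if } C(w_{n-1} w_n) > 0, \text{ otherwise } \alpha(w_{n-1})\, P(w_n)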
P_CONTINUATION
Instead of P(w), it tries to answer the question, “How likely is w to appear as a novel continuation?”
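A sketch of the usual count-of-contexts formulation (as used in Kneser-Ney smoothing), based on the number of distinct bigram types that w completes, normalised by the total number of bigram types:
P_{\text{CONTINUATION}}(w) = \frac{|\{ v : C(v\,w) > 0 \}|}{|\{ (u', w') : C(u'\,w') > 0 \}|}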