Language Modelling Flashcards
What is a language model?
It is a model that assigns probabilities to sequences of words
What is the most basic of language models?
The n-gram model, which assigns probabilities to sentences and sequences of words
What can n-gram models be used for?
Estimate the probability of the last word of an n-gram given the previous words, and assign probabilities to entire sequences
Where can language models be used?
Speech Recognition
Spelling correction or grammatical error correction
Machine translation
Augmentative and alternative communication systems
What does a unigram assume?
That words appear independently of each other
P(w1, w2, w3, w4) = P(w1) * P(w2) * P(w3) * P(w4)
It assumes that the previous words have no influence on the next word
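A minimal Python sketch of this independence assumption, using made-up unigram probabilities rather than values from a real corpus:

```python
# Made-up illustrative unigram probabilities (not from a real corpus).
unigram_prob = {"the": 0.05, "cat": 0.001, "sat": 0.0005, "down": 0.002}

def unigram_sentence_prob(words, probs):
    p = 1.0
    for w in words:
        p *= probs[w]   # independence: no conditioning on previous words
    return p

print(unigram_sentence_prob(["the", "cat", "sat", "down"], unigram_prob))
```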
What do n-gram models tell us?
The probability of the next word in the text is dependent on the previous n-1 words in the text.
If we have a word w and some history h, we want to find out, of the times that h occurred, how many times it was followed by w.
In general, how do n-gram probabilities work?
P(w1) = P(w1)
P(w1,w2) = P(w2 | w1) * P(w1)
P(w1, …, wn) = P(wn | w1, …, wn-1) * P(w1, …, wn-1)
How does the bigram model approximate the probability of the next word?
We can use the probability of the next word given the previous word:
P(wn | wn-1)
What assumption is made with the n-gram model?
The Markov Assumption - the probability of a word depends only on the previous word (for a bigram model); this can be generalised to the previous two words for trigrams, and so on
What does MLE stand for?
Maximum Likelihood Estimation
What does MLE do?
It is the process of choosing the set of bigram parameters that makes our model correctly predict (maximise the likelihood of) the nth word in the text
How do we obtain the MLE for the parameters of an n-gram model?
We observe the n-gram counts in a representative corpus, then normalise them (dividing by a total count) so that they lie between 0 and 1
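A toy Python sketch of MLE bigram estimation; the corpus and counts are illustrative only:

```python
from collections import Counter

# Count bigrams in a tiny made-up corpus and normalise by the count of the
# preceding word to get MLE bigram probabilities.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram_prob(prev, word):
    # P(word | prev) = C(prev, word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram_prob("the", "cat"))   # 2/3 in this toy corpus
```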
What format is used when computing language model probabilities?
We use the log format so we can add the probabilities instead of multiplying them
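A short Python sketch of why log space helps: the raw product of many small probabilities underflows to zero, while the sum of their logs stays finite:

```python
import math

probs = [1e-5] * 100            # made-up per-word probabilities

product = 1.0
for p in probs:
    product *= p                # underflows to 0.0

log_sum = sum(math.log(p) for p in probs)   # 100 * log(1e-5), stays finite

print(product)    # 0.0
print(log_sum)    # approximately -1151.29
```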
What are some n-gram problems?
Even if we have a large corpus, only a tiny minority of possible n-grams exist in any corpus. The probability of a word appearing given the previous word is 0 for most sequences due to the sparsity of the matrix.
What do we use to keep a language model from assigning a zero probability to unseen n-grams?
A number of smoothing/discounting methods.
What does Laplace smoothing do?
It adds 1 to all the bigram counts before we normalise them into probabilities, meaning that all counts that were 0 are now 1, 1s become 2s, and so on.
What is another name for Laplace smoothing?
Add-one smoothing
How do we normalise the count when we do Laplace smoothing?
We add one to the observed bigram count, and divide by the count of the previous word plus V, where V is the number of words in the vocabulary: P(wn | wn-1) = (C(wn-1, wn) + 1) / (C(wn-1) + V)
What can we do instead of changing both the numerator and denominator in Laplace smoothing?
We can work with adjusted counts: c* = (c + 1) * N / (N + V), which can then be normalised by N as usual
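A Python sketch of add-one smoothing for bigrams on a toy, made-up corpus, where V is the number of distinct word types:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)   # vocabulary size (distinct word types)

def laplace_bigram_prob(prev, word):
    # P(word | prev) = (C(prev, word) + 1) / (C(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("the", "cat"))   # seen bigram
print(laplace_bigram_prob("cat", "the"))   # unseen bigram, no longer zero
```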
What are some problems with Laplace smoothing?
The extra V observations added to the denominator are problematic, as too much probability mass is moved to the zero-count cases
What is k-smoothing?
It is a way of moving a bit less probability mass from the seen to unseen events.
What is the equation for k-smoothing?
P(wn | wn-1) = (C(wn-1, wn) + k) / (C(wn-1) + kV), where k is a small fractional count (e.g. 0.5) and V is the vocabulary size
Does k-smoothing solve the problem of imbalanced probability counts?
No - it can still be problematic
What are some alternatives to smoothing?
Backoff and interpolation
What is backoff?
It is where if we do not have data about higher order n-grams, we can fall back to information about lower order n-grams
For example, if we don’t have trigrams we use bigrams, and if we don’t have bigrams we use unigrams
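A deliberately simplified Python sketch of the backoff idea; it omits the discounting and backoff weights that a proper method such as Katz backoff uses, so the values are not a true probability distribution:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def backoff_estimate(w1, w2, w3):
    # Use the highest-order n-gram for which we actually have evidence.
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / len(corpus)

print(backoff_estimate("the", "cat", "sat"))   # trigram evidence exists
print(backoff_estimate("mat", "the", "ate"))   # falls back to the unigram
```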
What does backoff allow?
It allows for generalisation for contexts that the model has not been trained on
Explain how backoff is different to interpolation
Backoff only uses a lower order n-gram if we have zero evidence for the higher order n-gram
Interpolation always mixes the probability from all the n-gram estimators, weighing and combining the counts
How does simple linear interpolation work?
We combine different order n-grams by linearly interpolating the models: we weight the trigram, bigram and unigram probabilities with weights (lambdas) that sum to one, and add them together
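A Python sketch of simple linear interpolation with made-up lambda weights; in practice the weights are tuned on held-out data:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)
l3, l2, l1 = 0.6, 0.3, 0.1      # lambda weights, sum to one (illustrative values)

def interpolated_prob(w1, w2, w3):
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(interpolated_prob("the", "cat", "sat"))
```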
What is interpolated Kneser-Ney smoothing?
It combines two ideas: absolute discounting (typically with a discount of 0.75) and a continuation probability - how likely a unigram is to be a novel continuation, i.e. how many different words it follows in the bigrams of the corpus.
For example, a word like Britain may be frequent overall but have a low continuation probability if it only ever appears in the single context of Great Britain
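A Python sketch of the continuation-probability idea only, not the full interpolated Kneser-Ney formula, on a made-up toy corpus:

```python
from collections import Counter

# "britain" occurs twice but only ever after "great", so its continuation
# probability is low; "is" follows several different words, so it is higher.
corpus = "great britain is big great britain is old the dog is here the cat is small".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def continuation_prob(word):
    contexts = sum(1 for (prev, w) in bigrams if w == word)  # distinct left contexts
    return contexts / len(bigrams)                           # distinct bigram types

print(continuation_prob("britain"))   # 1/13 in this toy corpus
print(continuation_prob("is"))        # 3/13 in this toy corpus
```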
What is an even better way than interpolated Kneser-Ney and Laplace smoothing to solve the discounting problem?
Stupid backoff - although it only works for very large models
What is stupid backoff?
Stupid backoff is where, when the higher-order n-gram count is 0, you back off to the lower-order estimate multiplied by a constant factor (typically 0.4); the result is a score rather than a true probability
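A Python sketch of stupid backoff for bigrams backing off to unigrams with the constant factor 0.4; the values are scores, not probabilities:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def stupid_backoff_score(prev, word, alpha=0.4):
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]   # relative frequency
    return alpha * unigrams[word] / N                   # back off, scaled by 0.4

print(stupid_backoff_score("the", "cat"))   # seen bigram
print(stupid_backoff_score("cat", "mat"))   # unseen: 0.4 * unigram frequency
```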
What is a class-based n-gram model?
It is where the probabilities of words are based on the classes of the previous words. E.g. you cluster Shanghai with other city names and then estimate based on the likelihood of matching phrases across the entire cluster
What are skip-grams?
Skip-grams are n-grams whose words need not be adjacent: given a value k, up to k words may be skipped between them.
In the text “the rain in spain falls mainly on the plain”, the 1-skip-bigrams are all the ordinary bigrams plus the non-adjacent pairs, such as the in, rain spain, in falls, spain mainly…
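A Python sketch that generates the 1-skip-bigrams described above (the same function handles any k):

```python
def skip_bigrams(tokens, k):
    # Pairs of words, in order, with at most k words skipped between them.
    # This includes the ordinary (adjacent) bigrams.
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

text = "the rain in spain falls mainly on the plain".split()
print(skip_bigrams(text, 1))   # all bigrams plus pairs skipping one word
```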
What are skip-grams good for?
They capture word contexts with less sparsity than strict n-grams - useful in Word2Vec