Language models Flashcards
Language modeling and problem formulation
Language modeling is the task of predicting which word comes next in a sequence of words.
More formally, given a sequence of words w1w2 … wt we want to know the probability of the next word wt+1:
P(wt+1|w1w2 … wt) = P(w1:t+1)/P(w1:t)
Language models as generative models
Generative models are a type of machine learning model trained to generate new data instances similar to those in the training data.
Rather than only as predictive models, language models can also be viewed as generative models that assign a probability to a piece of text:
P(w1 … wt) = P(w1:t)
Probability of a sequence of words as a product of conditional probabilities
P(w1:n) = ∏ t=1..n P(wt|w1:t-1) (the chain rule of probability)
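A minimal sketch of the chain rule in code, assuming toy, hand-picked conditional probabilities (the numbers and the `cond_probs` table are purely illustrative):

```python
from math import prod

# Chain rule: P(w1:n) = product over t of P(wt | w1:t-1).
# Toy, hand-picked conditional probabilities (illustrative numbers only).
cond_probs = {
    ("its",): 0.2,                      # P(its)
    ("its", "water"): 0.5,              # P(water | its)
    ("its", "water", "is"): 0.6,        # P(is | its water)
    ("its", "water", "is", "so"): 0.3,  # P(so | its water is)
}

def sequence_probability(sentence):
    """Multiply P(wt | w1:t-1) over all positions t of the sentence."""
    return prod(cond_probs[tuple(sentence[:t + 1])] for t in range(len(sentence)))

print(sequence_probability(["its", "water", "is", "so"]))  # 0.2 * 0.5 * 0.6 * 0.3 = 0.018
```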
Types vs tokens in a corpus
- Types are the elements of the vocabulary V associated with the corpus, that is, the distinct words of the corpus.
- Tokens are the running words (occurrences). The length of a corpus is the number of tokens.
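A minimal sketch of the distinction, assuming a toy corpus:

```python
# Count types (distinct words) vs tokens (running words) in a toy corpus.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

num_tokens = len(corpus)    # running words: 13 (the length of the corpus)
vocabulary = set(corpus)    # types: the distinct words of the corpus
num_types = len(vocabulary)  # 8

print(f"tokens: {num_tokens}, types: {num_types}")
```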
Relative frequency estimator in the next word prediction task
P(wt|w1:t-1) = C(w1:t)/C(w1:t-1)
This estimator is very data-hungry and suffers from high variance: depending on what data happens to be in the corpus, we could get very different probability estimates.
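A minimal sketch of the estimator on an assumed toy corpus (the corpus and the `rel_freq` helper are illustrative), showing how quickly full histories stop being observed at all:

```python
from collections import Counter

# Relative frequency estimator over full histories:
# P(wt | w1:t-1) = C(w1:t) / C(w1:t-1), counting whole sentence prefixes.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["the", "dog", "sat"],
]

prefix_counts = Counter()
for sentence in corpus:
    for t in range(1, len(sentence) + 1):
        prefix_counts[tuple(sentence[:t])] += 1

def rel_freq(word, history):
    """C(history + word) / C(history); undefined if the history was never observed."""
    denom = prefix_counts[tuple(history)]
    if denom == 0:
        return None  # zero denominator: the estimator is undefined
    return prefix_counts[tuple(history) + (word,)] / denom

print(rel_freq("sat", ["the", "cat"]))             # 1 / 2 = 0.5
print(rel_freq("sat", ["the", "hungry", "grey"]))  # None: history never seen -> data hunger
```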
N-gram model, complexity, N-gram probabilities and bias-variance tradeoff
A string w(t-N+1):t of N words is called an N-gram.
The N-gram model approximates the probability of a word given the entire sentence history by conditioning only on the past N-1 words.
General equation for the N-gram model:
P(wt|w1:t-1) = P(wt|w(t-N+1):t-1)
Give the relative frequency estimator:
P(wt|w(t-N+1):t-1) = C(w(t-N+1):t)/C(w(t-N+1):t-1)
The number of parameters of an N-gram model is exponential in N: there are V^N possible N-grams over a vocabulary of size V.
N is a hyperparameter. When setting its value, we face the bias-variance tradeoff:
- When N is too small, the model has high bias: it fails to capture long-distance word relations
- When N is too large, the model has high variance: counts become sparse and estimates unreliable
Example of computing next word probability using an N-gram model (Slide 16 pdf 5…)
…
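A minimal bigram (N = 2) sketch of the computation on a toy corpus, using <s> and </s> as sentence-boundary markers (the corpus and the numbers are illustrative, not taken from the slide):

```python
from collections import Counter

# Bigram model: P(wt | wt-1) = C(wt-1 wt) / C(wt-1).
sentences = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

tokens = [s.split() for s in sentences]
unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1)
)

def bigram_prob(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "<s>"))   # 2 / 3 ≈ 0.67
print(bigram_prob("sam", "am"))  # 1 / 2 = 0.5
```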
Evaluation of LMs: Intrinsic evaluation and perplexity measure
Intrinsic evaluation of language models is based on the inverse probability of the test set, normalized by the number of words.
For a test set W = w1w2 … wn we define perplexity as:
PP(W) = P(w1:n)^(-1/n)
The multiplicative inverse probability 1/P(wj|w1:j-1) can be seen as a measure of how surprising the next word is.
The degree of the root averages over all words of the test set, providing average surprise per word.
The lower the perplexity, the better the model.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in the performance of a language processing task like speech recognition or machine translation.
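A minimal sketch of the perplexity computation (the per-token probabilities are toy, illustrative values); summing log probabilities instead of multiplying raw probabilities is the standard way to avoid numerical underflow:

```python
import math

# Perplexity: PP(W) = P(w1:n)^(-1/n)
#           = exp( -(1/n) * sum_t log P(wt | w1:t-1) )   (log form avoids underflow)

def perplexity(token_log_probs):
    """token_log_probs: list of log P(wt | history) for each of the n test tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy per-token probabilities for a 4-word test sequence (illustrative values).
probs = [0.2, 0.5, 0.6, 0.3]
print(perplexity([math.log(p) for p in probs]))  # 0.018^(-1/4) ≈ 2.73
```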
Sparse data: zero or undefined probabilities for N-gram models
Using the relative frequency estimator in LMs: if there isn’t enough data in the training set, counts will be zero for some grammatical sequences.
There are three scenarios we need to consider:
- zero numerator: smoothing, discounting
- zero denominator: backoff, interpolation
- out-of-vocabulary words in test set: estimation of unknown words
Smoothing techniques
Given the Relative frequency estimator:
P(wt|w1:t-1) = C(w1:t)/C(w1:t-1)
Smoothing techniques (also called discounting) deal with words that are in our vocabulary V but were never seen before in the given context (zero numerator).
Smoothing prevents the LM from assigning zero probability to these events.
Laplace smoothing
IDEA:
Pretend that everything occurs once more than the actual count.
Consider 1-gram model:
PL(wt) = (C(wt)+1)/(n+V), where n is the number of tokens and V is the vocabulary size.
Consider the adjusted count C*(wt) = (C(wt)+1)·n/(n+V) and the relative discount d(wt):
d(wt) = C*(wt)/C(wt)
For high-frequency words d(wt) < 1: their counts are discounted, and the freed probability mass is shifted to rare and unseen words (worked out in the sketch below).
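A minimal sketch of Laplace smoothing for a unigram model, assuming a toy corpus (function names are illustrative):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
n = len(corpus)   # number of tokens (9)
V = len(counts)   # vocabulary size, i.e. number of types (6)

def laplace_unigram(w):
    # PL(w) = (C(w) + 1) / (n + V): every word gets a non-zero probability.
    return (counts[w] + 1) / (n + V)

def adjusted_count(w):
    # C*(w) = (C(w) + 1) * n / (n + V): the count the smoothed estimate "behaves like".
    return (counts[w] + 1) * n / (n + V)

for w in ["the", "cat", "slept"]:        # counts 3, 2, 1
    d = adjusted_count(w) / counts[w]    # relative discount d(w) = C*(w) / C(w)
    print(w, counts[w], round(laplace_unigram(w), 3), round(d, 2))
# the   3 0.267 0.8   <- high-frequency words are discounted (d < 1)
# cat   2 0.2   0.9
# slept 1 0.133 1.2   <- low-frequency words may even gain mass in this tiny corpus
```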
Add-k smoothing
Add-k smoothing is a generalization of add-1 smoothing (that is the Laplace smoothing).
For some fractional k with 0 < k < 1 (considering the 2-gram model):
PAdd-k(wt|wt-1) = (C(wt-1..wt)+k)/(C(wt-1)+kV)
Jeffreys-Perks law corresponds to the case k = 0.5, which works well in practice and benefits from some theoretical justification.
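A minimal sketch of add-k smoothing for a bigram model, assuming a toy corpus (the boundary pair (</s>, <s>) is counted too, which is fine for an illustration):

```python
from collections import Counter

# Add-k smoothed bigram: PAdd-k(wt | wt-1) = (C(wt-1 wt) + k) / (C(wt-1) + k*V)
corpus = "<s> i am sam </s> <s> sam i am </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # 5 types

def add_k_bigram(word, prev, k=0.5):  # k = 0.5 is the Jeffreys-Perks choice
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(add_k_bigram("am", "i"))   # seen bigram:   (2 + 0.5) / (2 + 2.5) ≈ 0.556
print(add_k_bigram("i", "am"))   # unseen bigram: (0 + 0.5) / (2 + 2.5) ≈ 0.111, no longer zero
```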
Backoff and interpolation techniques
These techniques deal with words that are in our vocabulary but appear in the test set in contexts (histories) never seen during training, so the denominator of the relative frequency estimator is zero.
They prevent the LM from producing undefined (zero/zero) probabilities for these events.
IDEA:
- if you have trigrams, use trigrams
- if you don’t have trigrams, use bigrams
- if you don’t have bigrams either, fall back to unigrams
Stupid backoff
With very large text collections (web-scale), a rough approximation of Katz backoff, called stupid backoff, is often sufficient.
Give the recursive score (note that Ps is a score, not a normalized probability):
Ps(wt|w(t-N+1):t-1) = C(w(t-N+1):t)/C(w(t-N+1):t-1) if C(w(t-N+1):t) > 0
Ps(wt|w(t-N+1):t-1) = λ · Ps(wt|w(t-N+2):t-1) otherwise
The recursion terminates at the unigram Ps(wt) = C(wt)/n. Brants et al. (2007) report that λ = 0.4 works well.
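A minimal sketch of the recursion for a trigram model, assuming a toy corpus (the single `counts` table keyed by tuples is an implementation convenience):

```python
from collections import Counter

corpus = ("<s> i am sam </s> <s> sam i am </s> "
          "<s> i do not like green eggs and ham </s>").split()
n_tokens = len(corpus)

# Count all 1-, 2- and 3-grams in one table keyed by the tuple itself.
counts = Counter()
for order in (1, 2, 3):
    counts.update(zip(*(corpus[i:] for i in range(order))))

LAMBDA = 0.4  # back-off weight reported to work well by Brants et al. (2007)

def stupid_backoff(word, context):
    """Recursive score Ps(word | context); not a normalized probability."""
    context = tuple(context)
    if not context:                       # recursion bottoms out at the unigram
        return counts[(word,)] / n_tokens
    if counts[context + (word,)] > 0:     # n-gram observed: plain relative frequency
        return counts[context + (word,)] / counts[context]
    return LAMBDA * stupid_backoff(word, context[1:])  # otherwise shorten the context

print(stupid_backoff("sam", ("i", "am")))  # trigram seen: C(i am sam)/C(i am) = 1/2 = 0.5
print(stupid_backoff("ham", ("i", "am")))  # backs off twice: 0.4 * 0.4 * C(ham)/n = 0.008
```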
Linear interpolation
In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models.
PL(wt|wt-2 wt-1) = λ1·P(wt|wt-2 wt-1) + λ2·P(wt|wt-1) + λ3·P(wt), where λ1 + λ2 + λ3 = 1 (in practice the λs are tuned on a held-out corpus).
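A minimal sketch of simple linear interpolation for a trigram model, assuming a toy corpus and fixed, illustrative weights (treating unseen histories as contributing 0 is a simplification):

```python
from collections import Counter

corpus = ("<s> i am sam </s> <s> sam i am </s> "
          "<s> i do not like green eggs and ham </s>").split()
n_tokens = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

# Interpolation weights; they must sum to 1. In practice they are tuned on held-out data.
L1, L2, L3 = 0.6, 0.3, 0.1

def interp_prob(w, u, v):
    """PL(w | u v) = L1*P(w|u v) + L2*P(w|v) + L3*P(w)."""
    p_tri = trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    p_bi = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
    p_uni = unigrams[w] / n_tokens
    return L1 * p_tri + L2 * p_bi + L3 * p_uni

print(interp_prob("sam", "i", "am"))  # 0.6*0.5 + 0.3*0.5 + 0.1*0.1 = 0.46
```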