4. Language Model Flashcards
What is POS Tagging?
A Part-Of-Speech Tagger reads text in some language and assigns a part of speech, such as noun, verb, adjective, etc., to each word (and other tokens).
e.g. "I want a ticket": I -> pronoun, want -> verb, a -> determiner, ticket -> noun
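A minimal sketch of off-the-shelf tagging with NLTK (assumes the library is installed and its tokenizer/tagger resources have been downloaded; tag names may differ between taggers):

```python
import nltk

# Assumes nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') have been run once.
tokens = nltk.word_tokenize("I want a ticket")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('want', 'VBP'), ('a', 'DT'), ('ticket', 'NN')]
```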
What is a Language Model?
A Language Model is a model that assigns a probability to each possible next word, given a history of previous words.
- Can be generalised for entire sentences.
- Can also operate at character level (or any symbol)
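A toy sketch (made-up numbers) of what a language model provides: a probability distribution over the next word given a history.

```python
# Toy example with made-up probabilities: a language model maps a
# history of words to a distribution over the next word.
next_word_probs = {
    ("the", "cat"): {"sat": 0.40, "is": 0.25, "ran": 0.10},  # remaining mass on other words
}

history = ("the", "cat")
print(next_word_probs[history]["sat"])  # P(sat | the, cat) = 0.40
```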
What is the difference between a Language Model and learning Word Embeddings?
When learning word embeddings we don’t actually care about the model’s output; we only want to learn the embeddings of our words. Also, with word embeddings we train on pairs of words, not on a history as in a language model. In a language model we don’t want to learn embeddings; we only care about the predictions we obtain.
What are some applications for Language Models?
- Word completion/prediction
- Speech Recognition
- Machine Translation
- Spelling Correction
What is the n-gram model?
An n-gram is a sequence of n words. In the n-gram model we try to estimate P(w | h), where w is the next word and h is the history of preceding words, e.g.:
P(sat | the, cat) = 0.0013
P(sat | the, mat) = 0.0007
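A minimal sketch of how such probabilities can be estimated from counts (toy corpus, maximum-likelihood bigram estimates, no smoothing):

```python
from collections import Counter

# Toy corpus; real models are trained on far larger corpora.
corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood estimate: P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("sat", "cat"))  # count(cat, sat) = 1, count(cat) = 2 -> 0.5
```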
What is a problem with n-gram model and how is it addressed?
The problem is that with a long history (e.g. of length 8) it is almost impossible to collect counts for all word sequences of that length in order to carry out the calculation.
To address this, we decompose the sentence probability using the chain rule and apply the Markov assumption to approximate the history with only the last few words: in the bigram model we condition on 1 previous word, in the trigram model on 2.
Since this involves multiplying many probabilities (< 1), we transform the problem to log-space to get more numerically stable results.
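A minimal sketch of the bigram (Markov) approximation in log-space, assuming a conditional-probability function like the `p_bigram` above that returns non-zero (e.g. smoothed) probabilities:

```python
import math

def sentence_logprob(words, p_bigram):
    # log P(w_1 ... w_N) ~= sum of log P(w_i | w_{i-1})  (bigram Markov assumption)
    # Summing logs avoids underflow from multiplying many probabilities < 1.
    return sum(math.log(p_bigram(w, prev))
               for prev, w in zip(words, words[1:]))
```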
How do we decide n in the n-gram model?
- The larger, the better
- As long as we can get counts from the corpus, that n is fine.
Google uses 5-20 grams with more than 1 billion words.
What is perplexity in Language Models? Give its formula
Perplexity is the inverse probability of the test set, normalised by its number of words. It is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.
PP(W) = ( Product_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^(1/N)
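A sketch of computing this for a bigram model (same assumptions as the earlier `p_bigram` sketch); it is equivalent to the exponential of the negative average log-probability per predicted word:

```python
import math

def perplexity(words, p_bigram):
    # PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N)
    #       = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))
    log_sum = sum(math.log(p_bigram(w, prev))
                  for prev, w in zip(words, words[1:]))
    n = len(words) - 1  # number of predicted words
    return math.exp(-log_sum / n)
```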
What is a common issue in practice with Language Models, concerning the training and test set? How can we address these issues?
In practice the test set often has a different distribution from the training set, e.g. it contains new words. As a result the probability we calculate will be 0, since we have never seen that word in the training set.
We can address this issue using:
- Smoothing
- Back-off
- Interpolation
What is smoothing? What are some issues with add-one smoothing for Language Models?
Smoothing is a way to assign probabilities to words (or n-grams) in the test set that were not encountered during training. One way (add-one / Laplace smoothing) is to add one to each count in the numerator and add size(vocab) to the denominator.
Issue:
It shifts a lot of probability mass away from frequent words when there are many zeros in our count table, which can be a problem.
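A sketch of an add-one (Laplace) smoothed bigram estimate, reusing `unigrams`/`bigrams` counters like the ones built in the earlier sketch:

```python
def p_bigram_addone(w, prev, unigrams, bigrams):
    # Add 1 to every bigram count and the vocabulary size to the denominator,
    # so unseen bigrams get a small non-zero probability.
    vocab_size = len(unigrams)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
```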
What is back-off? What are some issues?
In back-off we use the larger n-gram if it can be computed, and if not we back off to smaller n sizes to use less context, e.g. we start with a trigram, if that doesn’t work a bigram, and if that doesn’t work a unigram.
Issues:
The issue is that it can be misleading, due to the way it changes the n size. e.g. for the sentence
“I play cats”: the trigram P(cats | I, play) can’t be computed, so we back off to the bigram P(cats | play) and then to the unigram P(cats), which may be fairly frequent on its own, but that doesn’t mean our original sentence is probable.
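A simplified sketch of the back-off idea (in the spirit of “stupid back-off”; proper Katz back-off also discounts and redistributes probability mass), assuming Counter-style n-gram counts:

```python
def p_backoff(w, prev2, prev1, trigrams, bigrams, unigrams, total_words):
    # Use the trigram estimate if we have seen the trigram, otherwise
    # back off to the bigram, and finally to the unigram.
    if trigrams[(prev2, prev1, w)] > 0:
        return trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)]
    if bigrams[(prev1, w)] > 0:
        return bigrams[(prev1, w)] / unigrams[prev1]
    return unigrams[w] / total_words
```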
What is interpolation?
Interpolation mixes unigram, bigram and trigram estimates by taking a weighted linear combination of the three, with weights λ that sum to 1 (typically tuned on held-out data):
P_interp(w_n | w_n-2, w_n-1) = λ_1 * P(w_n | w_n-2, w_n-1) + λ_2 * P(w_n | w_n-1) + λ_3 * P(w_n)
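A sketch with fixed example weights, assuming unigram/bigram/trigram estimators are supplied as functions:

```python
def p_interp(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    # Weighted linear combination of unigram, bigram and trigram estimates;
    # the example lambdas here sum to 1 and would normally be tuned on held-out data.
    l_uni, l_bi, l_tri = lambdas
    return (l_tri * p_tri(w, prev2, prev1)
            + l_bi * p_bi(w, prev1)
            + l_uni * p_uni(w))
```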
What is a problem for n-gram which led us to the use of RNNs?
With an n-gram model, even with a larger n (say 4-5), we fail to model long-distance relationships between words, e.g.:
Sentence:
“The GPU machines which I had just bought from a reputable supplier and put in the server room in the other building crashed.”
With such a small n it is impossible to capture the relationship between ‘The GPU machines’ and ‘crashed’.