Lecture 2 - Language Models with N-grams Flashcards
What are n-grams?
An n-gram is a sequence of N words.
What is the goal of probabilistic language modeling?
Compute the probability of a sentence or sequence of words
e.g., in Machine Translation → P(high winds tonite) > P(large winds tonite)
e.g., in Spell Correction → P(about fifteen minutes from) > P(about fifteen minuets from)
e.g., in Speech Recognition → P(I saw a van) >> P(eyes awe of an)
What rule can be used to calculate the probability of a sentence?
You can use the chain rule to calculate the probability of a sentence
How does the chain rule work? I.e., what is the formula?
P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)
Example:
P(“the water is so cold”) =
P(the) P(water | the) P(is | the water) P(so | the water is) P(cold | the water is so)
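A minimal sketch of the chain rule in Python (cond_prob(word, history) is a hypothetical estimator, not something defined in the lecture):

def sentence_probability(words, cond_prob):
    # Multiply P(w_i | w_1 ... w_{i-1}) over every word in the sentence
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])  # words[:i] is the full history so far
    return prob

# e.g. sentence_probability("the water is so cold".split(), cond_prob)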
What is the Markov Assumption?
A simplifying assumption:
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
Essentially, we estimate the probability by taking the last few words, rather than the whole sentence
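In general, an n-gram model keeps only the last n-1 words of the history:
P(wi | w1, …, wi-1) ≈ P(wi | wi-n+1, …, wi-1)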
What is the simplest case of a Markov model?
The Unigram Model
Uses only individual word probabilities; no previous words are taken into account
What’s a slightly more advanced version of the Unigram Model?
The Bigram Model
Conditions each word on the single preceding word
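A minimal sketch of bigram estimation by counting (maximum likelihood, no smoothing), using an assumed toy corpus:

from collections import Counter

corpus = ["the water is so cold", "the water is warm"]  # assumed toy data
sents = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigram_counts = Counter(w for sent in sents for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in sents for i in range(len(sent) - 1))

def bigram_prob(word, prev):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# e.g. bigram_prob("water", "the") == 1.0 on this toy corpus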
We can extend these to trigrams, 4-grams, 5-grams etc. Is this a sufficient model of language?
No, because language has long-distance dependencies.
BUT often we can get away with n-gram models.
Why do we do everything in log space?
To avoid underflow
(Also, adding is faster than multiplying)
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
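A minimal sketch of the underflow problem (the per-word probabilities are assumed values):

import math

probs = [1e-5] * 100                         # 100 words, each with probability 1e-5
product = math.prod(probs)                   # underflows to 0.0
log_score = sum(math.log(p) for p in probs)  # about -1151.3, a perfectly usable number
# Compare models by their log scores instead of the raw (underflowed) products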
When we train the parameters of our model, we do it using which split of a dataset?
The training set
When we test the model’s performance, we use which split of a dataset?
The test set
What is the best way of evaluation for comparing different models?
Put each model in a task (In our case sentiment analysis)
Run the task, and get an accuracy for model A and model B
For example, by using a confusion matrix
This is called “Extrinsic evaluation” -> We’re using something external to the n-gram model itself
What is the downside of extrinsic evaluation?
It’s time consuming -> Can take days or weeks
What can you use instead of extrinsic evaluation?
Intrinsic evaluation: Perplexity
Is perplexity a good approximation?
No. Perplexity is a bad approximation unless the test data looks JUST like the training data.
So generally, it’s only useful in pilot experiments
-> But it is helpful to think about
What is the Shannon Game?
A game about how well we can predict the next word
Examples:
I always order pizza with cheese and ___
The 33rd President of the US was ___
I saw a ___
Unigrams are terrible at the Shannon Game. Can you try to explain why?
Because the unigram model does not take the preceding words into account; it predicts each word from its overall frequency alone
Perplexity is the…
Inverse probability of the test set, normalized by the number of words
Formula:
PP(W) = P(w1w2…wN)^(-1/N)
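A minimal sketch, reusing the hypothetical cond_prob(word, history) estimator from above and working in log space for stability:

import math

def perplexity(words, cond_prob):
    # PP(W) = P(w1...wN)^(-1/N), computed via logs to avoid underflow
    log_prob = sum(math.log(cond_prob(w, words[:i])) for i, w in enumerate(words))
    return math.exp(-log_prob / len(words))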
What is the perplexity (i.e. how hard is it) of recognizing the digits ‘0,1,2,3,4,5,6,7,8,9’?
Perplexity = 10
If all digits are equally likely
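Worked out: each digit has probability 1/10, so P(w1…wN) = (1/10)^N and PP(W) = ((1/10)^N)^(-1/N) = 10.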
What is the perplexity (i.e. how hard is it) of recognizing the (30,000) names at Microsoft?
Perplexity = 30,000
If all names are equally likely
Is lower or higher perplexity better for a model?
Lower perplexity means that it’s a better model