Lecture 2 - Language Models with N-grams Flashcards
What are n-grams?
An n-gram is a sequence of N words.
What is the goal of probabilistic language modeling?
Compute the probability of a sentence or sequence of words
e.g., in Machine Translation → P(high winds tonite) > P(large winds tonite)
e.g., in Spell Correction → P(about fifteen minutes from) > P(about fifteen minuets from)
e.g., in Speech Recognition → P(I saw a van) >> P(eyes awe of an)
What rule can be used to calculate the probability of a sentence?
You can use the chain rule to calculate the probability of a sentence
How does the chain rule work? I.e., what is the formula?
P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)
Example:
P(“the water is so cold”) =
P(the) P(water | the) P(is | the water) P(so | the water is) P(cold | the water is so)
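A minimal Python sketch of this decomposition, assuming a hypothetical function cond_prob(word, history) that returns P(word | history):

```python
# Sketch of the chain rule: P(w1..wn) = product of P(wi | w1..wi-1).
# cond_prob is a hypothetical stand-in for a real conditional model.
def sentence_prob(words, cond_prob):
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])  # P(w_i | w_1 ... w_{i-1})
    return prob
```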
What is the Markov Assumption?
A simplifying assumption:
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
Essentially, we approximate the probability by conditioning on only the last few words, rather than on the whole preceding history
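Under a k-th order Markov assumption, the chain-rule sketch above changes in one place: the history is truncated to the last k words (again using the hypothetical cond_prob):

```python
# Sketch of the Markov assumption: condition on only the
# last k words of the history instead of the whole prefix.
def markov_sentence_prob(words, cond_prob, k=1):
    prob = 1.0
    for i, word in enumerate(words):
        history = words[max(0, i - k):i]  # keep only the last k words
        prob *= cond_prob(word, history)
    return prob
```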
What is the simplest case of a Markov model?
The Unigram Model
Estimates each word's probability independently, with no context
What’s a slightly more advanced version of the Unigram Model?
The Bigram Model
Conditions each word on the one word before it
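A toy sketch of how bigram probabilities can be estimated from counts (maximum likelihood; the corpus here is made up, and a real model would also need smoothing for unseen bigrams):

```python
from collections import Counter

# Toy maximum-likelihood bigram estimation from a tiny corpus.
corpus = "the water is so cold the water is so clear".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, prev):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("water", "the"))  # 1.0 in this toy corpus
```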
We can extend these to trigrams, 4-grams, 5-grams, etc. Is this a sufficient model of language?
No, because language has long-distance dependencies.
BUT often we can get away with n-gram models.
Why do we do everything in log space?
To avoid underflow
(Also, adding is faster than multiplying)
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
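A quick sketch showing the underflow problem and the log-space fix:

```python
import math

# Multiplying many small probabilities underflows float64 to 0.0;
# summing their logs stays well within range.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflow (true value is 1e-500)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -1151.3, perfectly representable
```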
When we train the parameters of our model, we do it using which split of a dataset?
The training set
When we test the model’s performance, we use which split of a dataset?
The test set
What is the best way of evaluation for comparing different models?
Put each model in a task (in our case, sentiment analysis)
Run the task, and get an accuracy for model A and model B
(e.g., via a confusion matrix)
This is called “Extrinsic evaluation” -> We’re using something external to the n-gram model itself
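A minimal sketch of that comparison, with toy sentiment labels and predictions standing in for real task output:

```python
# Extrinsic evaluation sketch: score each model on the same
# labelled task data and compare accuracies.
def accuracy(predictions, gold):
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Placeholder predictions standing in for real model output.
labels       = ["pos", "neg", "pos", "pos"]
model_a_pred = ["pos", "neg", "neg", "pos"]
model_b_pred = ["pos", "pos", "neg", "pos"]

print(accuracy(model_a_pred, labels))  # 0.75
print(accuracy(model_b_pred, labels))  # 0.5
```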
What is the downside of extrinsic evaluation?
It’s time consuming -> Can take days or weeks
What can you use instead of extrinsic evaluation?
Intrinsic evaluation: Perplexity
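A sketch of perplexity using the standard definition PP(W) = P(w1…wN)^(-1/N), computed in log space (cond_prob is again the hypothetical conditional model from the earlier sketches):

```python
import math

# Perplexity: the inverse probability of the test set, normalised
# by length -- lower is better. Log space avoids underflow.
def perplexity(test_words, cond_prob):
    log_prob = 0.0
    for i, word in enumerate(test_words):
        log_prob += math.log(cond_prob(word, test_words[:i]))
    return math.exp(-log_prob / len(test_words))
```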
Is perplexity a good approximation?
No. Perplexity is a bad approximation unless the test data looks JUST like the training data.
So generally, it’s only useful in pilot experiments
-> But it is helpful to think about