Lecture 2 - Language Models with N-grams Flashcards

1
Q

What are n-grams?

A

An n-gram is a sequence of N words.

2
Q

What is the goal of probabilistic language modeling?

A

Compute the probability of a sentence or sequence of words

e.g., in Machine Translation → P(high winds tonite) > P(large winds tonite)
e.g., in Spell Correction → P(about fifteen minutes from) > P(about fifteen minuets from)
e.g., in Speech Recognition → P(I saw a van) >> P(eyes awe of an)

3
Q

What rule can be used to calculate the probability of a sentence?

A

You can use the chain rule to calculate the probability of a sentence

4
Q

How does the chain rule work? I.e., what is the formula?

A

P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)

Example:
P(“the water is so cold”) =
P(the) P(water | the) P(is | the water) P(so | the water is) P(cold | the water is so)
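
A minimal sketch of this decomposition in code (the helper name cond_prob and the toy probabilities are my own illustration, not from the lecture):

```python
# Sketch: decompose P(sentence) with the chain rule.
# cond_prob(word, history) is assumed to return P(word | history);
# a real model would estimate these probabilities from a corpus.
def sentence_probability(words, cond_prob):
    prob = 1.0
    history = []
    for word in words:
        prob *= cond_prob(word, tuple(history))  # P(w_i | w_1 ... w_{i-1})
        history.append(word)
    return prob

# Made-up conditional probabilities for the example sentence:
toy = {
    ("the", ()): 0.2,
    ("water", ("the",)): 0.1,
    ("is", ("the", "water")): 0.3,
    ("so", ("the", "water", "is")): 0.05,
    ("cold", ("the", "water", "is", "so")): 0.4,
}
p = sentence_probability(["the", "water", "is", "so", "cold"],
                         lambda w, h: toy[(w, h)])
print(p)  # 0.2 * 0.1 * 0.3 * 0.05 * 0.4 ≈ 0.00012
```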

5
Q

What is the Markov Assumption?

A

A simplifying assumption:

P(the | its water is so transparent that) ≈ P(the | that)

Or maybe

P(the | its water is so transparent that) ≈ P(the | transparent that)

Essentially, we estimate the probability by conditioning on only the last few words, rather than on the whole preceding context. In general: P(wi | w1 … wi-1) ≈ P(wi | wi-k … wi-1) for some small k
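
A tiny sketch of what the assumption means in code (the helper name is mine, purely illustrative):

```python
# Markov assumption: keep only the last k words of the history.
def truncate_history(history, k):
    """Return the last k words (k=1 gives a bigram model, k=2 a trigram model)."""
    return tuple(history[-k:]) if k > 0 else ()

history = ["its", "water", "is", "so", "transparent", "that"]
print(truncate_history(history, 1))  # ('that',)
print(truncate_history(history, 2))  # ('transparent', 'that')
```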

6
Q

What is the simplest case of a Markov model?

A

The Unigram Model

Each n-gram is a single word, so no context is used: P(w1 w2 … wn) ≈ P(w1) P(w2) … P(wn)

7
Q

What’s a slightly more advanced version of the Unigram Model?

A

The Bigram Model

Each n-gram is two words, so we condition on the single previous word: P(wi | w1 … wi-1) ≈ P(wi | wi-1)
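
A minimal sketch of how bigram probabilities could be estimated from counts (the toy corpus and function names are my own; maximum-likelihood estimation is assumed):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "the", "water", "is", "cold", "</s>"],
    ["<s>", "the", "water", "is", "so", "cold", "</s>"],
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        context_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

def bigram_prob(w2, w1):
    """MLE estimate: P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(bigram_prob("water", "the"))  # 2/2 = 1.0
print(bigram_prob("so", "is"))      # 1/2 = 0.5
```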

8
Q

We can extend these to trigrams, 4-grams, 5-grams etc. Is this a sufficient model of language?

A

No, because language has long-distance dependencies.

BUT often we can get away with n-gram models.

9
Q

Why do we do everything in log space?

A

To avoid underflow
(Also, adding is faster than multiplying)

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
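
A small illustration of the underflow problem and the log-space fix (the probabilities are invented):

```python
import math

# Multiplying many tiny probabilities underflows to 0.0 in floating point...
probs = [1e-7] * 50
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value, 1e-350, is too small to represent

# ...but summing log probabilities stays perfectly representable.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -805.9, i.e. log(1e-350)
```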

10
Q

When we train the parameters of our model, we do it using which split of a dataset?

A

The training set

11
Q

When we test the model’s performance, we use which split of a dataset?

A

The test set

12
Q

What is the best way to evaluate and compare different models?

A

Put each model in a task (in our case, sentiment analysis)

Run the task and get an accuracy for model A and model B (measured with a confusion matrix, for example)

This is called “Extrinsic evaluation” -> we’re using something external to the n-gram model itself

13
Q

What is the downside of extrinsic evaluation?

A

It’s time-consuming -> it can take days or weeks

14
Q

What can you use instead of extrinsic evaluation?

A

Intrinsic evaluation: Perplexity

15
Q

Is perplexity a good approximation?

A

No. Perplexity is a bad approximation unless the test data looks JUST like the training data.

So generally, it’s only useful in pilot experiments

-> But it is helpful to think about

16
Q

What is the Shannon Game?

A

A game about how well we can predict the next word

Examples:
I always order pizza with cheese and ___
The 33rd President of the US was ___
I saw a ___

17
Q

Unigrams are terrible at the Shannon Game. Can you try to explain why?

A

Because a unigram model does not take the previous words into consideration; it predicts each word with no context at all.

18
Q

Perplexity is the…

A

Inverse probability of the test set, normalized by the number of words

Formula:
PP(W) = P(w1 w2 … wN)^(-1/N)
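
A minimal sketch of the computation, done in log space for numerical stability (the per-word probabilities are invented, not from any real model):

```python
import math

def perplexity(word_probs):
    """PP(W) = P(w1..wN) ** (-1/N), with the product computed in log space."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Hypothetical probabilities a model assigns to each word of a test sequence:
print(perplexity([0.2, 0.1, 0.3, 0.05, 0.4]))  # ≈ 6.08
```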

19
Q

What is the perplexity (i.e., how hard is it) of recognizing the digits ‘0,1,2,3,4,5,6,7,8,9’?

A

Perplexity = 10

If all digits are equally likely
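
A quick sanity check of this number (my own arithmetic, assuming a uniform model over the 10 digits):

```python
import math

n = 8                    # any sequence length gives the same answer
probs = [1 / 10] * n     # each digit has probability 1/10
log_prob = sum(math.log(p) for p in probs)
print(math.exp(-log_prob / n))  # ≈ 10.0, since ((1/10)**n) ** (-1/n) = 10
```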

20
Q

What is the perplexity (i.e., how hard is it) of recognizing 30,000 names at Microsoft?

A

Perplexity = 30,000

If all names are equally likely

21
Q

Is lower or higher perplexity better for a model?

A

Lower perplexity is better: it means the model assigns a higher probability to the test set