Lecture 2 - Language Models with N-grams Flashcards

1
Q

What are n-grams?

A

An n-gram is a sequence of N words.

2
Q

What is the goal of probabilistic language modeling?

A

Compute the probability of a sentence or sequence of words

e.g., in Machine Translation → P(high winds tonite) > P(large winds tonite)
e.g., in Spell Correction → P(about fifteen minutes from) > P(about fifteen minuets from)
e.g., in Speech Recognition → P(I saw a van) >> P(eyes awe of an)

3
Q

What rule can be used to calculate the probability of a sentence?

A

You can use the chain rule to calculate the probability of a sentence

4
Q

How does the chain rule work? I.e., what is the formula?

A

P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)

Example:
P(“the water is so cold”) =
P(the) P(water | the) P(is | the water) P(so | the water is) P(cold | the water is so)
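
A minimal sketch of this decomposition in code (the helper name cond_prob and the toy probabilities are my own illustration, not from the lecture):

```python
# Sketch: decompose P(sentence) with the chain rule.
# cond_prob(word, history) is assumed to return P(word | history);
# a real model would estimate these probabilities from a corpus.
def sentence_probability(words, cond_prob):
    prob = 1.0
    history = []
    for word in words:
        prob *= cond_prob(word, tuple(history))  # P(w_i | w_1 ... w_{i-1})
        history.append(word)
    return prob

# Made-up conditional probabilities for the example sentence:
toy = {
    ("the", ()): 0.2,
    ("water", ("the",)): 0.1,
    ("is", ("the", "water")): 0.3,
    ("so", ("the", "water", "is")): 0.05,
    ("cold", ("the", "water", "is", "so")): 0.4,
}
p = sentence_probability(["the", "water", "is", "so", "cold"],
                         lambda w, h: toy[(w, h)])
print(p)  # 0.2 * 0.1 * 0.3 * 0.05 * 0.4 ≈ 0.00012
```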

5
Q

What is the Markov Assumption?

A

A simplifying assumption:

P(the | its water is so transparent that) ≈ P(the | that)

Or maybe

P(the | its water is so transparent that) ≈ P(the | transparent that)

Essentially, we estimate the probability by conditioning on only the last few words, rather than on the whole preceding context. In general: P(wi | w1 … wi-1) ≈ P(wi | wi-k … wi-1) for some small k
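
A tiny sketch of what the assumption means in code (the helper name is mine, purely illustrative):

```python
# Markov assumption: keep only the last k words of the history.
def truncate_history(history, k):
    """Return the last k words (k=1 gives a bigram model, k=2 a trigram model)."""
    return tuple(history[-k:]) if k > 0 else ()

history = ["its", "water", "is", "so", "transparent", "that"]
print(truncate_history(history, 1))  # ('that',)
print(truncate_history(history, 2))  # ('transparent', 'that')
```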

6
Q

What is the simplest case of a Markov model?

A

The Unigram Model

Each n-gram is a single word, so no context is used: P(w1 w2 … wn) ≈ P(w1) P(w2) … P(wn)

7
Q

What’s a slightly more advanced version of the Unigram Model?

A

The Bigram Model

Each n-gram is two words, so we condition on the single previous word: P(wi | w1 … wi-1) ≈ P(wi | wi-1)
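
A minimal sketch of how bigram probabilities could be estimated from counts (the toy corpus and function names are my own; maximum-likelihood estimation is assumed):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "the", "water", "is", "cold", "</s>"],
    ["<s>", "the", "water", "is", "so", "cold", "</s>"],
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        context_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

def bigram_prob(w2, w1):
    """MLE estimate: P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(bigram_prob("water", "the"))  # 2/2 = 1.0
print(bigram_prob("so", "is"))      # 1/2 = 0.5
```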

8
Q

We can extend these to trigrams, 4-grams, 5-grams etc. Is this a sufficient model of language?

A

No, because language has long-distance dependencies.

BUT often we can get away with n-gram models.

9
Q

Why do we do everything in log space?

A

To avoid underflow
(Also, adding is faster than multiplying)

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
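
A small illustration of the underflow problem and the log-space fix (the probabilities are invented):

```python
import math

# Multiplying many tiny probabilities underflows to 0.0 in floating point...
probs = [1e-7] * 50
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value, 1e-350, is too small to represent

# ...but summing log probabilities stays perfectly representable.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -805.9, i.e. log(1e-350)
```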

10
Q

When we train the parameters of our model, we do it using which split of a dataset?

A

The training set

11
Q

When we test the model’s performance, we use which split of a dataset?

A

The test set

12
Q

What is the best way to evaluate and compare different models?

A

Put each model in a task (in our case, sentiment analysis)

Run the task and get an accuracy for model A and model B (measured with a confusion matrix, for example)

This is called “Extrinsic evaluation” -> we’re using something external to the n-gram model itself

13
Q

What is the downside of extrinsic evaluation?

A

It’s time-consuming -> it can take days or weeks

14
Q

What can you use instead of extrinsic evaluation?

A

Intrinsic evaluation: Perplexity

15
Q

Is perplexity a good approximation?

A

No. Perplexity is a bad approximation unless the test data looks JUST like the training data.

So generally, it’s only useful in pilot experiments

-> But it is helpful to think about

16
Q

What is the Shannon Game?

A

A game about how well we can predict the next word

Examples:
I always order pizza with cheese and ___
The 33rd President of the US was ___
I saw a ___

17
Q

Unigrams are terrible at the Shannon Game. Can you try to explain why?

A

Because a unigram model does not take the previous words into consideration; it predicts each word with no context at all.

18
Q

Perplexity is the…

A

Inverse probability of the test set, normalized by the number of words

Formula:
PP(W) = P(w1 w2 … wN)^(-1/N)
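
A minimal sketch of the computation, done in log space for numerical stability (the per-word probabilities are invented, not from any real model):

```python
import math

def perplexity(word_probs):
    """PP(W) = P(w1..wN) ** (-1/N), with the product computed in log space."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Hypothetical probabilities a model assigns to each word of a test sequence:
print(perplexity([0.2, 0.1, 0.3, 0.05, 0.4]))  # ≈ 6.08
```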

19
Q

What is the perplexity (i.e., how hard is it) of recognizing the digits ‘0,1,2,3,4,5,6,7,8,9’?

A

Perplexity = 10

If all digits are equally likely
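
A quick sanity check of this number (my own arithmetic, assuming a uniform model over the 10 digits):

```python
import math

n = 8                    # any sequence length gives the same answer
probs = [1 / 10] * n     # each digit has probability 1/10
log_prob = sum(math.log(p) for p in probs)
print(math.exp(-log_prob / n))  # ≈ 10.0, since ((1/10)**n) ** (-1/n) = 10
```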

20
Q

What is the perplexity (i.e., how hard is it) of recognizing 30,000 names at Microsoft?

A

Perplexity = 30,000

If all names are equally likely

21
Q

Is lower or higher perplexity better for a model?

A

Lower perplexity is better: it means the model assigns a higher probability to the test set