N-gram Language Models Flashcards
Language Models
Models that assign probabilities to sequences of words.
N-gram
A sequence of n words.
Markov models
A class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
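A sketch of the assumption for a bigram (first-order Markov) model, in the usual notation:
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})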
Extrinsic Evaluation
An end-to-end evaluation.
E.g. embedding the model in an application and measuring how much the application improves.
Intrinsic Evaluation Metric
A metric that measures the quality of a model independent of any application.
Perplexity
The perplexity (PP) of a language model on a test set is the inverse probability of the test set, normalised by the number of words.
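One common way to write this, assuming a test set W = w_1 w_2 \ldots w_N:
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}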
OOV rate
The percentage of out-of-vocabulary words that appear in the test set.
Open vocabulary system
A system in which we model potential unknown words in the test set by adding a pseudo-word called <UNK>.
Smoothing
A method of dealing with words that are in our vocabulary but appear in the test set in an unseen context (e.g. after a word they never followed in training).
To keep the model from assigning zero probability to these events, we shave off a bit of probability mass from some more frequent events and give it to the events never seen.
Laplace Smoothing
The simplest smoothing technique.
Add one to all the n-gram counts, before normalising them into probabilities.
All the counts that used to be zero will now have a count of 1, counts of 1 will be 2, etc.
a.k.a. add-one smoothing
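A sketch of the resulting estimate for bigrams, assuming a vocabulary of size V:
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}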
add-k smoothing
Like Laplace smoothing, but instead of adding 1 to each count, we add a fractional count k (e.g. 0.5, 0.05, 0.01).
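The bigram estimate then becomes (again assuming a vocabulary of size V):
P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}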
backoff smoothing
We use the trigram if the evidence is sufficient, otherwise use the bigram, otherwise the unigram.
I.e. we only “back off” to a lower-order n-gram if we have zero evidence for a higher order n-gram.
interpolation smoothing
We mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram and unigram counts.
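A sketch of simple linear interpolation for trigrams, with weights \lambda_i that sum to 1:
\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1})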
Discount for a backoff model
In order for a backoff model to give a correct probability distribution, we have to discount the higher-order n-grams to save some probability mass for the lower order n-grams.
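A sketch of the bigram case in Katz-style notation, where P^{*} is the discounted probability and \alpha(w_{n-1}) redistributes the saved probability mass:
P_{\text{BO}}(w_n \mid w_{n-1}) = P^{*}(w_n \mid w_{n-1}) \text{ if } C(w_{n-1} w_n) > 0, \text{ otherwise } \alpha(w_{n-1})\, P(w_n)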
P_CONTINUATION
Instead of P(w), it tries to answer the question, “How likely is w to appear as a novel continuation?”
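A sketch of the usual count-of-contexts formulation (as used in Kneser-Ney smoothing), based on the number of distinct bigram types that w completes, normalised by the total number of bigram types:
P_{\text{CONTINUATION}}(w) = \frac{|\{ v : C(v\,w) > 0 \}|}{|\{ (u', w') : C(u'\,w') > 0 \}|}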