Language Modelling Flashcards

1
Q

What is a language model?

A

It is a model that assigns probabilities to sequences of words

2
Q

What is the most basic of language models?

A

The n-gram model, which assigns probabilities to sentences and sequences of words

3
Q

What can n-gram models be used for?

A

Estimate the probability of the last word of an n-gram given the previous words, and assign probabilities to entire sequences

4
Q

Where can language models be used?

A

Speech Recognition
Spelling correction or grammatical error correction
Machine translation
Augmentative and alternative communication systems

5
Q

What does a unigram assume?

A

That the appearances of words are independent of each other
P(w1, w2, w3, w4) = P(w1) * P(w2) * P(w3) * P(w4)
It assumes that the previous words have no influence on the next word
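
A minimal sketch of the unigram assumption in Python (the corpus counts below are made-up illustrative numbers, not from the source):

from math import prod

# Hypothetical unigram counts and total token count from a toy corpus
counts = {"the": 120, "cat": 5, "sat": 3}
total = 1000

def unigram_prob(sentence):
    # P(w1, ..., wn) = P(w1) * P(w2) * ... * P(wn): each word is treated independently
    return prod(counts[w] / total for w in sentence.split())

print(unigram_prob("the cat sat"))  # 0.12 * 0.005 * 0.003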

6
Q

What do n-gram models tell us?

A

The probability of the next word in the text is dependent on the previous n-1 words in the text.
If we have a word w, and some history h, we want to find, out of the times that h occurred, how many times it was followed by w.
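
In symbols, this relative frequency estimate is P(w | h) = C(h w) / C(h): the count of the history followed by w, divided by the count of the history on its own.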

7
Q

In general, how do n-gram probabilities work?

A

P(w1) = P(w1)
P(w1,w2) = P(w2 | w1) * P(w1)
P(w1, …, wn) = P(wn | w1, …, wn-1) * P(w1, …, wn-1)
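
For example, applying this chain rule to a three-word sequence: P(w1, w2, w3) = P(w3 | w1, w2) * P(w2 | w1) * P(w1)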

8
Q

How does the bigram model approximate the probability of the next word?

A

We can use the probability of the next word given the previous word:
P(wn | wn-1)
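
For a whole sentence this gives the approximation P(w1, …, wn) ≈ P(w1) * P(w2 | w1) * … * P(wn | wn-1)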

9
Q

What assumption is made with the n-gram model?

A

The Markov Assumption - the probability of a word depends only on the previous word. This can be generalised to trigrams (the previous two words) and so on

10
Q

What does MLE stand for?

A

Maximum Likelihood Estimation

11
Q

What does MLE do?

A

It is the process of choosing the right set of bigram parameters to make our model correctly predict (maximise the likelihood of) the nth word in the text

12
Q

How do we obtain the MLE for the parameters of an n-gram model?

A

We observe the n-gram counts from a representative corpus
Normalise them (dividing by a total count) to lie between 0 and 1
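
A minimal sketch of estimating bigram MLE probabilities by counting and normalising (the toy corpus and names are illustrative):

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

unigrams = Counter(corpus)                  # context counts
bigrams = Counter(zip(corpus, corpus[1:]))  # observed bigram counts

def mle_bigram(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(mle_bigram("the", "cat"))  # 2 of the 3 occurrences of "the" are followed by "cat"
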
13
Q

What format is used when computing language model probabilities?

A

We use log probabilities so we can add them instead of multiplying them, which avoids numerical underflow
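
A tiny illustration of working in log space, assuming some already-computed per-word probabilities (the numbers are made up):

import math

probs = [0.1, 0.05, 0.2, 0.01]  # illustrative per-word probabilities

# Multiplying many small probabilities risks numerical underflow;
# summing their logs is stable and preserves the ranking of sentences.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)            # log P(sentence)
print(math.exp(log_prob))  # convert back only if the raw probability is needed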

14
Q

What are some n-gram problems?

A

Even if we have a large corpus, only a tiny minority of the possible n-grams will occur in it. The estimated probability of a word appearing given the previous word is therefore 0 for most sequences, because the matrix of counts is sparse.

15
Q

What do we use to keep a language model from assigning a zero probability to unseen n-grams?

A

A number of smoothing/discounting methods.

16
Q

What does Laplace smoothing do?

A

It adds 1 to all the bigram counts before we normalise them into probabilities, meaning that all counts that were 0 are now 1, 1s become 2s, and so on.

17
Q

What is another name for Laplace smoothing?

A

Add-one smoothing

18
Q

How do we normalise the count when we do Laplace smoothing?

A

We add one to the observed bigram count, and divide by the count of the previous word plus V, where V is the number of word types in the vocabulary
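
A minimal sketch of the add-one (Laplace) smoothed bigram estimate, reusing the toy-corpus idea from the earlier sketch (names and corpus are illustrative):

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size = number of distinct word types

def laplace_bigram(prev, word):
    # P(word | prev) = (C(prev word) + 1) / (C(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(laplace_bigram("the", "cat"))    # seen bigram, its estimate is discounted a little
print(laplace_bigram("the", "slept"))  # unseen bigram now gets a small non-zero probability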

19
Q

What can we do instead of changing both the numerator and denominator in Laplace smoothing?

A

We can define an adjusted count instead: add one to the count, multiply by N, and divide by N + V. Dividing this adjusted count by N then gives the smoothed probability.

20
Q

What are some problems with Laplace smoothing?

A

The extra V observations added to the denominator are problematic, as too much probability mass is moved to the zero-count events

21
Q

What is add-k smoothing?

A

It is a way of moving a bit less probability mass from the seen to unseen events.

22
Q

What is the equation for add-k smoothing?

A

P(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV), where C is the corpus count and V is the vocabulary size

23
Q

Does add-k smoothing solve the problem of imbalanced probability mass?

A

No - it can still be problematic

24
Q

What are some alternatives to smoothing?

A

Backoff and interpolation

25
Q

What is backoff?

A

It is where, if we do not have data about a higher-order n-gram, we can fall back to information about lower-order n-grams
For example, if we don't have trigrams, we use bigrams; if we don't have bigrams, we use unigrams

26
Q

What does backoff allow?

A

It allows for generalisation for contexts that the model has not been trained on

27
Q

Explain how backoff is different to interpolation

A

Backoff only uses lower-order n-grams if we have zero evidence for the higher-order n-gram
Interpolation always mixes the probability estimates from all the n-gram orders, weighting and combining them

28
Q

How does simple linear interpolation work?

A

We combine different-order n-grams by linearly interpolating the models: we weight the trigram, bigram and unigram probabilities, with the weights summing to one, and add them together to get the combined probability
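
A minimal sketch of simple linear interpolation, with hand-set probabilities and weights purely for illustration (in practice the lambda weights are tuned on held-out data):

# Illustrative hand-set probabilities for one prediction
p_trigram = {("the", "cat", "sat"): 0.4}  # P(sat | the, cat)
p_bigram = {("cat", "sat"): 0.3}          # P(sat | cat)
p_unigram = {"sat": 0.05}                 # P(sat)

lambdas = (0.5, 0.3, 0.2)  # weights must sum to 1

def interpolated(w1, w2, w3):
    # P_hat(w3 | w1, w2) = l1*P(w3 | w1, w2) + l2*P(w3 | w2) + l3*P(w3)
    l1, l2, l3 = lambdas
    return (l1 * p_trigram.get((w1, w2, w3), 0.0)
            + l2 * p_bigram.get((w2, w3), 0.0)
            + l3 * p_unigram.get(w3, 0.0))

print(interpolated("the", "cat", "sat"))  # 0.5*0.4 + 0.3*0.3 + 0.2*0.05 = 0.3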

29
Q

What is interpolated Kneser-Ney smoothing?

A

It makes use of two approaches - an absolute discount (typically 0.75) subtracted from each count, and a continuation probability that measures how likely a unigram is to appear as a novel continuation of some other word in a bigram.
For example, a frequent word such as Britain will have a low continuation probability if it only appears in the single context Great Britain
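
A rough sketch of interpolated Kneser-Ney for bigrams, following the standard formulation with the 0.75 discount mentioned above (the toy corpus is illustrative, and a full implementation would also handle unseen histories and higher orders):

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
d = 0.75  # absolute discount

# Continuation counts: in how many distinct bigram types does w appear as the second word?
continuations = Counter(w for (_, w) in bigrams)
num_bigram_types = len(bigrams)

def p_continuation(word):
    return continuations[word] / num_bigram_types

def p_kneser_ney(prev, word):
    # Discounted bigram estimate, plus interpolation weight times continuation probability
    discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
    num_followers = sum(1 for (p, _) in bigrams if p == prev)  # distinct words seen after prev
    lam = (d / unigrams[prev]) * num_followers
    return discounted + lam * p_continuation(word)

print(p_kneser_ney("the", "cat"))    # seen bigram: mostly the discounted count
print(p_kneser_ney("the", "slept"))  # unseen bigram: the continuation term takes over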

30
Q

What is an even better way than interpolated Kneser-Ney and Laplace smoothing to solve the discounting problem?

A

Stupid backoff - although it only works for very large models

31
Q

What is stupid backoff?

A

Stupid backoff is where, if the count of the higher-order n-gram is 0, you back off and multiply the lower-order estimate by a constant factor (0.4); if the count is non-zero, you simply use the relative frequency
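
A minimal sketch of stupid backoff for bigrams, assuming the same kind of toy counts as before (note that the result is a score, not a true probability):

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())
ALPHA = 0.4  # fixed backoff weight

def stupid_backoff(prev, word):
    # Use the relative frequency if the bigram was seen;
    # otherwise back off to the unigram score scaled by the constant.
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return ALPHA * (unigrams[word] / total)

print(stupid_backoff("the", "cat"))    # seen bigram: plain relative frequency
print(stupid_backoff("the", "slept"))  # unseen bigram: 0.4 * unigram frequency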

32
Q

What is a class-based n-gram model?

A

It is where probabilities of words are based on the classes of the previous words. E.g. you cluster Shanghai with other city names and then estimate based on the likelihood of matching phrases from the entire cluster

33
Q

What are skip-grams?

A

Skip-grams are where we have a value k, and we take pairs of words that are not necessarily adjacent, but may have up to k words between them.
In the text: “the rain in spain falls mainly on the plain”, the 1-skip-2-grams are: all the bigrams, plus the non-adjacent pairs such as the in, rain spain, in falls, spain mainly…
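
A minimal sketch of generating k-skip-bigrams (the function name and parameterisation are illustrative; here k is the maximum number of words that may be skipped):

def skip_bigrams(tokens, k=1):
    # All ordered pairs with at most k words between them (k=0 gives ordinary bigrams)
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

tokens = "the rain in spain falls mainly on the plain".split()
print(skip_bigrams(tokens, k=1))  # bigrams plus pairs like ('the', 'in') and ('rain', 'spain')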

34
Q

What are skip-grams good for?

A

They capture word contexts with less sparsity than strict n-grams - useful in Word2Vec