Language Modelling Flashcards
What is a language model?
It is a model that assigns probabilities to sequences of words
What is the most basic of language models?
The n-gram model, which assigns probabilities to sentences and sequences of words
What can n-gram models be used for?
Estimate the probability of the last word of an n-gram given the previous words, and assign probabilities to entire sequences
Where can language models be used?
Speech Recognition
Spelling correction or grammatical error correction
Machine translation
Augmentative and alternative communication systems
What does a unigram assume?
That words appear independently of each other
P(w1, w2, w3, w4) = P(w1) * P(w2) * P(w3) * P(w4)
It assumes that the previous words have no influence on the next word
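A minimal Python sketch of this independence assumption, using made-up unigram probabilities rather than values from a real corpus:

```python
# Made-up illustrative unigram probabilities (not from a real corpus).
unigram_prob = {"the": 0.05, "cat": 0.001, "sat": 0.0005, "down": 0.002}

def unigram_sentence_prob(words, probs):
    p = 1.0
    for w in words:
        p *= probs[w]   # independence: no conditioning on previous words
    return p

print(unigram_sentence_prob(["the", "cat", "sat", "down"], unigram_prob))
```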
What do n-gram models tell us?
The probability of the next word in the text is dependent on the previous n-1 words in the text.
If we have a word w and some history h, we want to find out, of the times that h occurred, how many times it was followed by w.
In general, how do n-gram probabilities work?
P(w1) = P(w1)
P(w1,w2) = P(w2 | w1) * P(w1)
P(w1, …, wn) = P(wn | w1, …, wn-1) * P(w1, …, wn-1)
How does the bigram model approximate the probability of the next word?
We can use the probability of the next word given the previous word:
P(wn | wn-1)
What assumption is made with the n-gram model?
The Markov Assumption - the probability of a word depends only on the previous word (for a bigram model); this can be generalised to the previous two words for trigrams, and so on
What does MLE stand for?
Maximum Likelihood Estimation
What does MLE do?
It is the process of choosing the set of bigram parameters that makes our model correctly predict (maximise the likelihood of) the nth word in the text
How do we obtain the MLE for the parameters of an n-gram model?
We observe the n-gram counts in a representative corpus, then normalise them (dividing by a total count) so that they lie between 0 and 1
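A toy Python sketch of MLE bigram estimation; the corpus and counts are illustrative only:

```python
from collections import Counter

# Count bigrams in a tiny made-up corpus and normalise by the count of the
# preceding word to get MLE bigram probabilities.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram_prob(prev, word):
    # P(word | prev) = C(prev, word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram_prob("the", "cat"))   # 2/3 in this toy corpus
```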
What format is used when computing language model probabilities?
We use the log format so we can add the probabilities instead of multiplying them
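A short Python sketch of why log space helps: the raw product of many small probabilities underflows to zero, while the sum of their logs stays finite:

```python
import math

probs = [1e-5] * 100            # made-up per-word probabilities

product = 1.0
for p in probs:
    product *= p                # underflows to 0.0

log_sum = sum(math.log(p) for p in probs)   # 100 * log(1e-5), stays finite

print(product)    # 0.0
print(log_sum)    # approximately -1151.29
```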
What are some n-gram problems?
Even if we have a large corpus, only a tiny minority of possible n-grams exist in any corpus. The probability of a word appearing given the previous word is 0 for most sequences due to the sparsity of the matrix.
What do we use to keep a language model from assigning a zero probability to unseen n-grams?
A number of smoothing/discounting methods.
What does Laplace smoothing do?
It adds 1 to all the bigram counts before we normalise them into probabilities, meaning that all counts that were 0 are now 1, 1s become 2s, and so on.
What is another name for Laplace smoothing?
Add-one smoothing
How do we normalise the count when we do Laplace smoothing?
We add one to the observed bigram count, and divide by the count of the previous word plus V, where V is the number of words in the vocabulary: P(wn | wn-1) = (C(wn-1, wn) + 1) / (C(wn-1) + V)
What can we do instead of changing both the numerator and denominator in Laplace smoothing?
We can work with adjusted counts: c* = (c + 1) * N / (N + V), which can then be normalised by N as usual
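A Python sketch of add-one smoothing for bigrams on a toy, made-up corpus, where V is the number of distinct word types:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)   # vocabulary size (distinct word types)

def laplace_bigram_prob(prev, word):
    # P(word | prev) = (C(prev, word) + 1) / (C(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("the", "cat"))   # seen bigram
print(laplace_bigram_prob("cat", "the"))   # unseen bigram, no longer zero
```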
What are some problems with Laplace smoothing?
The extra V observations added to the denominator are problematic, as too much probability mass is moved to the zero-count cases
What is k-smoothing?
It is a way of moving a bit less probability mass from the seen to unseen events.
What is the equation for k-smoothing?
P(wn | wn-1) = (C(wn-1, wn) + k) / (C(wn-1) + kV), where k is a small fractional count (e.g. 0.5) and V is the vocabulary size
Does k-smoothing solve the problem of imbalanced probability counts?
No - it can still be problematic
What are some alternatives to smoothing?
Backoff and interpolation
What is backoff?
It is where if we do not have data about higher order n-grams, we can fall back to information about lower order n-grams
For example, if we don’t have trigrams we use bigrams, and if we don’t have bigrams we use unigrams
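A deliberately simplified Python sketch of the backoff idea; it omits the discounting and backoff weights that a proper method such as Katz backoff uses, so the values are not a true probability distribution:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def backoff_estimate(w1, w2, w3):
    # Use the highest-order n-gram for which we actually have evidence.
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / len(corpus)

print(backoff_estimate("the", "cat", "sat"))   # trigram evidence exists
print(backoff_estimate("mat", "the", "ate"))   # falls back to the unigram
```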
What does backoff allow?
It allows for generalisation for contexts that the model has not been trained on
Explain how backoff is different to interpolation
Backoff only uses a lower order n-gram if we have zero evidence for the higher order n-gram
Interpolation always mixes the probability from all the n-gram estimators, weighing and combining the counts
How does simple linear interpolation work?
We combine different order n-grams by linearly interpolating the models: we weight the trigram, bigram and unigram probabilities with weights (lambdas) that sum to one, and add them together
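A Python sketch of simple linear interpolation with made-up lambda weights; in practice the weights are tuned on held-out data:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)
l3, l2, l1 = 0.6, 0.3, 0.1      # lambda weights, sum to one (illustrative values)

def interpolated_prob(w1, w2, w3):
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(interpolated_prob("the", "cat", "sat"))
```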
What is interpolated Kneser-Ney smoothing?
It combines two ideas: absolute discounting (typically with a discount of 0.75) and a continuation probability - how likely a unigram is to be a novel continuation, i.e. how many different words it follows in the bigrams of the corpus.
For example, a word like Britain may be frequent overall but have a low continuation probability if it only ever appears in the single context of Great Britain
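A Python sketch of the continuation-probability idea only, not the full interpolated Kneser-Ney formula, on a made-up toy corpus:

```python
from collections import Counter

# "britain" occurs twice but only ever after "great", so its continuation
# probability is low; "is" follows several different words, so it is higher.
corpus = "great britain is big great britain is old the dog is here the cat is small".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def continuation_prob(word):
    contexts = sum(1 for (prev, w) in bigrams if w == word)  # distinct left contexts
    return contexts / len(bigrams)                           # distinct bigram types

print(continuation_prob("britain"))   # 1/13 in this toy corpus
print(continuation_prob("is"))        # 3/13 in this toy corpus
```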
What is an even better way than interpolated Kneser-Ney and Laplace smoothing to solve the discounting problem?
Stupid backoff - although it only works for very large models
What is stupid backoff?
Stupid backoff is where, when the higher-order n-gram count is 0, you back off to the lower-order estimate multiplied by a constant factor (typically 0.4); the result is a score rather than a true probability
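A Python sketch of stupid backoff for bigrams backing off to unigrams with the constant factor 0.4; the values are scores, not probabilities:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def stupid_backoff_score(prev, word, alpha=0.4):
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]   # relative frequency
    return alpha * unigrams[word] / N                   # back off, scaled by 0.4

print(stupid_backoff_score("the", "cat"))   # seen bigram
print(stupid_backoff_score("cat", "mat"))   # unseen: 0.4 * unigram frequency
```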
What is a class-based n-gram model?
It is where the probabilities of words are based on the classes of the previous words. E.g. you cluster Shanghai with other city names and then estimate based on the likelihood of matching phrases across the entire cluster
What are skip-grams?
Skip-grams are n-grams whose words need not be adjacent: given a value k, up to k words may be skipped between them.
In the text “the rain in spain falls mainly on the plain”, the 1-skip-bigrams are all the ordinary bigrams plus the non-adjacent pairs, such as the in, rain spain, in falls, spain mainly…
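A Python sketch that generates the 1-skip-bigrams described above (the same function handles any k):

```python
def skip_bigrams(tokens, k):
    # Pairs of words, in order, with at most k words skipped between them.
    # This includes the ordinary (adjacent) bigrams.
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

text = "the rain in spain falls mainly on the plain".split()
print(skip_bigrams(text, 1))   # all bigrams plus pairs skipping one word
```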
What are skip-grams good for?
They capture word contexts with less sparsity than strict n-grams - useful in Word2Vec