Language models Flashcards
Language modeling and problem formulation
Language modeling is the task of predicting which word comes next in a sequence of words.
More formally, given a sequence of words w1w2 … wt we want to know the probability of the next word wt+1:
P(wt+1|w1w2 … wt) = P(w1:t+1)/P(w1:t)
Language models as generative models
Generative models are a type of machine learning model trained to generate new data instances similar to those in the training data.
Rather than only as predictive models, language models can also be viewed as generative models that assign a probability to a piece of text:
P(w1 … wt) = P(w1:t)
Probability of a sequence of words as a product of conditional probabilities
P(w1:n) = ∏ t=1..n P(wt|w1:t-1) (the chain rule of probability)
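A minimal sketch of the chain rule in code, assuming toy, hand-picked conditional probabilities (the numbers and the `cond_probs` table are purely illustrative):

```python
from math import prod

# Chain rule: P(w1:n) = product over t of P(wt | w1:t-1).
# Toy, hand-picked conditional probabilities (illustrative numbers only).
cond_probs = {
    ("its",): 0.2,                      # P(its)
    ("its", "water"): 0.5,              # P(water | its)
    ("its", "water", "is"): 0.6,        # P(is | its water)
    ("its", "water", "is", "so"): 0.3,  # P(so | its water is)
}

def sequence_probability(sentence):
    """Multiply P(wt | w1:t-1) over all positions t of the sentence."""
    return prod(cond_probs[tuple(sentence[:t + 1])] for t in range(len(sentence)))

print(sequence_probability(["its", "water", "is", "so"]))  # 0.2 * 0.5 * 0.6 * 0.3 = 0.018
```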
Types vs tokens in a corpus
- Types are the elements of the vocabulary V associated with the corpus, that is, the distinct words of the corpus.
- Tokens are the running words (occurrences). The length of a corpus is the number of tokens.
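A minimal sketch of the distinction, assuming a toy corpus:

```python
# Count types (distinct words) vs tokens (running words) in a toy corpus.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

num_tokens = len(corpus)    # running words: 13 (the length of the corpus)
vocabulary = set(corpus)    # types: the distinct words of the corpus
num_types = len(vocabulary)  # 8

print(f"tokens: {num_tokens}, types: {num_types}")
```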
Relative frequency estimator in the next word prediction task
P(wt|w1:t-1) = C(w1:t)/C(w1:t-1)
This estimator is very data-hungry and suffers from high variance: depending on what data happens to be in the corpus, we could get very different probability estimates.
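A minimal sketch of the estimator on an assumed toy corpus (the corpus and the `rel_freq` helper are illustrative), showing how quickly full histories stop being observed at all:

```python
from collections import Counter

# Relative frequency estimator over full histories:
# P(wt | w1:t-1) = C(w1:t) / C(w1:t-1), counting whole sentence prefixes.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["the", "dog", "sat"],
]

prefix_counts = Counter()
for sentence in corpus:
    for t in range(1, len(sentence) + 1):
        prefix_counts[tuple(sentence[:t])] += 1

def rel_freq(word, history):
    """C(history + word) / C(history); undefined if the history was never observed."""
    denom = prefix_counts[tuple(history)]
    if denom == 0:
        return None  # zero denominator: the estimator is undefined
    return prefix_counts[tuple(history) + (word,)] / denom

print(rel_freq("sat", ["the", "cat"]))             # 1 / 2 = 0.5
print(rel_freq("sat", ["the", "hungry", "grey"]))  # None: history never seen -> data hunger
```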
N-gram model, complexity, N-gram probabilities and bias-variance tradeoff
A string w(t-N+1):t of N words is called an N-gram.
The N-gram model approximates the probability of a word given the entire sentence history by conditioning only on the past N-1 words.
General equation for the N-gram model:
P(wt|w1:t-1) = P(wt|w(t-N+1):t-1)
Give the relative frequency estimator:
P(wt|w(t-N+1):t-1) = C(w(t-N+1):t)/C(w(t-N+1):t-1)
The number of parameters of an N-gram model is exponential in N: there are V^N possible N-grams over a vocabulary of size V.
N is a hyperparameter. When setting its value, we face the bias-variance tradeoff:
- When N is too small, the model has high bias: it fails to capture long-distance word relations
- When N is too large, the model has high variance: counts become sparse and estimates unreliable
Example of computing next word probability using an N-gram model (Slide 16 pdf 5…)
…
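A minimal bigram (N = 2) sketch of the computation on a toy corpus, using <s> and </s> as sentence-boundary markers (the corpus and the numbers are illustrative, not taken from the slide):

```python
from collections import Counter

# Bigram model: P(wt | wt-1) = C(wt-1 wt) / C(wt-1).
sentences = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

tokens = [s.split() for s in sentences]
unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1)
)

def bigram_prob(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "<s>"))   # 2 / 3 ≈ 0.67
print(bigram_prob("sam", "am"))  # 1 / 2 = 0.5
```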
Evaluation of LMs: Intrinsic evaluation and perplexity measure
Intrinsic evaluation of language models is based on the inverse probability of the test set, normalized by the number of words.
For a test set W = w1w2 … wn we define perplexity as:
PP(W) = P(w1:n)^(-1/n)
The multiplicative inverse probability 1/P(wj|w1:j-1) can be seen as a measure of how surprising the next word is.
The degree of the root averages over all words of the test set, providing average surprise per word.
The lower the perplexity, the better the model.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in the performance of a language processing task like speech recognition or machine translation.
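A minimal sketch of the perplexity computation (the per-token probabilities are toy, illustrative values); summing log probabilities instead of multiplying raw probabilities is the standard way to avoid numerical underflow:

```python
import math

# Perplexity: PP(W) = P(w1:n)^(-1/n)
#           = exp( -(1/n) * sum_t log P(wt | w1:t-1) )   (log form avoids underflow)

def perplexity(token_log_probs):
    """token_log_probs: list of log P(wt | history) for each of the n test tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy per-token probabilities for a 4-word test sequence (illustrative values).
probs = [0.2, 0.5, 0.6, 0.3]
print(perplexity([math.log(p) for p in probs]))  # 0.018^(-1/4) ≈ 2.73
```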
Sparse data: zero or undefined probabilities for N-gram models
Using the relative frequency estimator in LMs: if there isn’t enough data in the training set, counts will be zero for some grammatical sequences.
There are three scenarios we need to consider:
- zero numerator: smoothing, discounting
- zero denominator: backoff, interpolation
- out-of-vocabulary words in test set: estimation of unknown words
Smoothing techniques
Given the Relative frequency estimator:
P(wt|w1:t-1) = C(w1:t)/C(w1:t-1)
Smoothing techniques (also called discounting) deal with words that are in our vocabulary V but were never seen before in the given context (zero numerator).
Smoothing prevents the LM from assigning zero probability to these events.
Laplace smoothing
IDEA:
Pretend that everything occurs once more than the actual count.
Consider 1-gram model:
PL(wt) = (C(wt)+1)/(n+V), where n is the number of tokens and V is the vocabulary size.
Consider the adjusted count C*(wt) = (C(wt)+1)·n/(n+V) and the relative discount d(wt):
d(wt) = C*(wt)/C(wt)
For high-frequency words d(wt) < 1: their counts are discounted, and the freed probability mass is shifted to rare and unseen words (worked out in the sketch below).
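A minimal sketch of Laplace smoothing for a unigram model, assuming a toy corpus (function names are illustrative):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
n = len(corpus)   # number of tokens (9)
V = len(counts)   # vocabulary size, i.e. number of types (6)

def laplace_unigram(w):
    # PL(w) = (C(w) + 1) / (n + V): every word gets a non-zero probability.
    return (counts[w] + 1) / (n + V)

def adjusted_count(w):
    # C*(w) = (C(w) + 1) * n / (n + V): the count the smoothed estimate "behaves like".
    return (counts[w] + 1) * n / (n + V)

for w in ["the", "cat", "slept"]:        # counts 3, 2, 1
    d = adjusted_count(w) / counts[w]    # relative discount d(w) = C*(w) / C(w)
    print(w, counts[w], round(laplace_unigram(w), 3), round(d, 2))
# the   3 0.267 0.8   <- high-frequency words are discounted (d < 1)
# cat   2 0.2   0.9
# slept 1 0.133 1.2   <- low-frequency words may even gain mass in this tiny corpus
```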
Add-k smoothing
Add-k smoothing is a generalization of add-1 smoothing (that is the Laplace smoothing).
For some fractional k with 0 < k < 1 (considering the 2-gram model):
PAdd-k(wt|wt-1) = (C(wt-1..wt)+k)/(C(wt-1)+kV)
Jeffreys-Perks law corresponds to the case k = 0.5, which works well in practice and benefits from some theoretical justification.
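A minimal sketch of add-k smoothing for a bigram model, assuming a toy corpus (the boundary pair (</s>, <s>) is counted too, which is fine for an illustration):

```python
from collections import Counter

# Add-k smoothed bigram: PAdd-k(wt | wt-1) = (C(wt-1 wt) + k) / (C(wt-1) + k*V)
corpus = "<s> i am sam </s> <s> sam i am </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # 5 types

def add_k_bigram(word, prev, k=0.5):  # k = 0.5 is the Jeffreys-Perks choice
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(add_k_bigram("am", "i"))   # seen bigram:   (2 + 0.5) / (2 + 2.5) ≈ 0.556
print(add_k_bigram("i", "am"))   # unseen bigram: (0 + 0.5) / (2 + 2.5) ≈ 0.111, no longer zero
```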
Backoff and interpolation techniques
These techniques deal with words that are in our vocabulary but appear in the test set in contexts (histories) never seen during training, so the denominator of the relative frequency estimator is zero.
They prevent the LM from producing undefined (zero/zero) probabilities for these events.
IDEA:
- if you have trigrams, use trigrams
- if you don’t have trigrams, use bigrams
- if you don’t have bigrams either, fall back to unigrams
Stupid backoff
With very large text collections (web-scale), a rough approximation of Katz backoff, called stupid backoff, is often sufficient.
Give the recursive score (note that Ps is a score, not a normalized probability):
Ps(wt|w(t-N+1):t-1) = C(w(t-N+1):t)/C(w(t-N+1):t-1) if C(w(t-N+1):t) > 0
Ps(wt|w(t-N+1):t-1) = λ · Ps(wt|w(t-N+2):t-1) otherwise
The recursion terminates at the unigram Ps(wt) = C(wt)/n. Brants et al. (2007) report that λ = 0.4 works well.
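A minimal sketch of the recursion for a trigram model, assuming a toy corpus (the single `counts` table keyed by tuples is an implementation convenience):

```python
from collections import Counter

corpus = ("<s> i am sam </s> <s> sam i am </s> "
          "<s> i do not like green eggs and ham </s>").split()
n_tokens = len(corpus)

# Count all 1-, 2- and 3-grams in one table keyed by the tuple itself.
counts = Counter()
for order in (1, 2, 3):
    counts.update(zip(*(corpus[i:] for i in range(order))))

LAMBDA = 0.4  # back-off weight reported to work well by Brants et al. (2007)

def stupid_backoff(word, context):
    """Recursive score Ps(word | context); not a normalized probability."""
    context = tuple(context)
    if not context:                       # recursion bottoms out at the unigram
        return counts[(word,)] / n_tokens
    if counts[context + (word,)] > 0:     # n-gram observed: plain relative frequency
        return counts[context + (word,)] / counts[context]
    return LAMBDA * stupid_backoff(word, context[1:])  # otherwise shorten the context

print(stupid_backoff("sam", ("i", "am")))  # trigram seen: C(i am sam)/C(i am) = 1/2 = 0.5
print(stupid_backoff("ham", ("i", "am")))  # backs off twice: 0.4 * 0.4 * C(ham)/n = 0.008
```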
Linear interpolation
In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models.
PL(wt|wt-2 wt-1) = λ1·P(wt|wt-2 wt-1) + λ2·P(wt|wt-1) + λ3·P(wt), where λ1 + λ2 + λ3 = 1 (in practice the λs are tuned on a held-out corpus).
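A minimal sketch of simple linear interpolation for a trigram model, assuming a toy corpus and fixed, illustrative weights (treating unseen histories as contributing 0 is a simplification):

```python
from collections import Counter

corpus = ("<s> i am sam </s> <s> sam i am </s> "
          "<s> i do not like green eggs and ham </s>").split()
n_tokens = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

# Interpolation weights; they must sum to 1. In practice they are tuned on held-out data.
L1, L2, L3 = 0.6, 0.3, 0.1

def interp_prob(w, u, v):
    """PL(w | u v) = L1*P(w|u v) + L2*P(w|v) + L3*P(w)."""
    p_tri = trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    p_bi = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
    p_uni = unigrams[w] / n_tokens
    return L1 * p_tri + L2 * p_bi + L3 * p_uni

print(interp_prob("sam", "i", "am"))  # 0.6*0.5 + 0.3*0.5 + 0.1*0.1 = 0.46
```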