LLM – Neural LLM Flashcards

Lecture 09

1
Q

What is the long-range dependency problem?

A

Language models based on counting (n-grams) cannot use context from tokens that lie far back in a long sentence. For example, in “The computer which I had just put into the machine room on the fifth floor crashed.”, the word ‘crashed’ depends on ‘the computer’, but that phrase falls outside the short n-gram window, so the model cannot use it.

2
Q

Describe Fixed-Window Neural LM

A

We have a sequence of text and want to predict the next word. Since the context is fixed, we take only the last N words/tokens (N is the context size), e.g. N = 4.

These N words are converted into embeddings, the embeddings are concatenated and fed into a feed-forward neural network, and the output is passed through a softmax layer, giving a probability distribution over the whole vocabulary.
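
A minimal PyTorch sketch of this architecture (the window size N = 4, layer sizes, and vocabulary size are illustrative assumptions, not values from the lecture):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Fixed-window neural LM: embed the last N tokens, concatenate the
    embeddings, pass them through a feed-forward net, softmax over vocab."""

    def __init__(self, vocab_size=10_000, window=4, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(window * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, context_ids):
        # context_ids: (batch, window) token ids of the last N tokens
        emb = self.embed(context_ids)         # (batch, window, emb_dim)
        concat = emb.flatten(start_dim=1)     # (batch, window * emb_dim)
        logits = self.ff(concat)              # (batch, vocab_size)
        return torch.softmax(logits, dim=-1)  # distribution over the vocab

# Usage: next-token distribution given a window of 4 token ids.
model = FixedWindowLM()
probs = model(torch.tensor([[12, 57, 301, 8]]))   # shape (1, 10000)
```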

3
Q

What are some differences between a fixed-window neural LM and an N-gram LM?

A

Long-range dependency problems are reduced.
For OOV words, we can replace the embedding with zeros or some default embedding and hope the rest of the context is enough.
It should capture the meanings of words, whereas an n-gram model gives words with the same meaning separate counts (only exact word matches are considered).

The context window is still small, and making it bigger requires a bigger neural network (the concatenated input grows with the window size).

4
Q

What Changed from N-Gram LMs to Neural LMs?

A

In n-grams, we treat all prefixes independently of each other (even those that are semantically similar). Neural LMs are able to share information across these semantically-similar prefixes and overcome the sparsity issue.

5
Q

How are RNNs used in language models to predict tokens?

A

We have an RNN that takes an arbitrarily long sequence of tokens and encodes it into a single vector (the hidden state). We then pass this vector through a feed-forward layer and a softmax, and we obtain a probability distribution over the next token (a minimal sketch follows the list below).

  • Long context in principle, but RNNs were shown to quickly forget information
  • Difficult to parallelize
  • Vanishing/exploding gradients
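
A minimal PyTorch sketch of such an RNN language model (all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """RNN LM: encode the prefix into a single hidden-state vector, then
    map that vector to a distribution over the vocabulary."""

    def __init__(self, vocab_size=10_000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), any sequence length
        emb = self.embed(token_ids)           # (batch, seq_len, emb_dim)
        outputs, h_n = self.rnn(emb)          # h_n: final hidden state
        logits = self.out(h_n[-1])            # (batch, vocab_size)
        return torch.softmax(logits, dim=-1)  # next-token distribution

# Usage: a prefix of arbitrary length -> next-token distribution.
model = RNNLM()
probs = model(torch.tensor([[5, 42, 7, 99, 3, 18]]))   # shape (1, 10000)
```
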
6
Q

What is the vanishing/exploding gradient problem?

A

In RNNs, the gradient is multiplied by the recurrent weight matrix at every time step during backpropagation through time. If that repeated multiplication shrinks the gradient, it approaches 0 (vanishes) and distant time steps receive almost no learning signal; conversely, if it grows, the gradient explodes towards infinity and training becomes unstable.
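
A small numeric illustration of the effect (the factors 0.9 and 1.1 are arbitrary stand-ins for a recurrent weight with norm below/above 1):

```python
# Repeatedly multiplying a gradient-like quantity by a factor < 1 shrinks it
# towards 0 (vanishing); a factor > 1 blows it up (exploding).
grad_vanish, grad_explode = 1.0, 1.0
for step in range(100):
    grad_vanish *= 0.9    # stands in for a recurrent weight with norm < 1
    grad_explode *= 1.1   # stands in for a recurrent weight with norm > 1

print(grad_vanish)   # ~2.7e-05 -> effectively no learning signal
print(grad_explode)  # ~1.4e+04 -> unstable updates
```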

7
Q

How is training of Transformer-based Language Models performed?

A

The goal is to predict the next token/word given the previous sequence of tokens. Training is done so that each position is a predictor for the next (right) token. (At generation time, we keep predicting until an EOS token is produced or we decide to stop.)

For each output position we compute the corresponding distribution over the whole vocabulary. Then the loss is calculated between this distribution and the gold output label. The position-wise losses are summed into a global loss, and using this loss we do backpropagation and update the transformer parameters.
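
A sketch of the position-wise loss, assuming the model returns logits of shape (batch, seq_len, vocab_size); the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, token_ids):
    """Next-token prediction loss for a decoder-only transformer.

    logits:    (batch, seq_len, vocab_size) - model outputs per position
    token_ids: (batch, seq_len)             - the gold token sequence
    Position t is trained to predict token t+1, so targets are shifted by one."""
    pred = logits[:, :-1, :]                 # predictions for positions 0..T-2
    gold = token_ids[:, 1:]                  # gold next tokens 1..T-1
    # Cross-entropy per position, summed into a single global loss.
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),     # (batch*(T-1), vocab)
        gold.reshape(-1),                    # (batch*(T-1),)
        reduction="sum",
    )

# loss.backward() then propagates through all transformer parameters.
```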

8
Q

What is the information leakage when training transformer-based language models?

A

During training, if the model could see future tokens, it would solve the task trivially by copying the next token to the output (data leakage). For that reason, an attention mask is used to hide the future tokens. It is a matrix with 0 where j ≤ i and -inf where j > i; it is added to the self-attention scores before the softmax, so the masked (future) positions receive a weight of 0 after the softmax.
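
A small PyTorch sketch of this additive causal mask (the sequence length and random scores are illustrative):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)          # raw self-attention scores

# Causal mask: 0 where j <= i (visible), -inf where j > i (future token).
mask = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)

weights = torch.softmax(scores + mask, dim=-1)  # masked entries become 0
print(weights[0])   # row 0 attends only to position 0; the rest are 0
```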
