NLP 2 Flashcards

1
Q

What is an RNN?

A

A neural network with a loop is called a Recurrent Neural Network (RNN): the hidden state computed at one time step is fed back in as an input at the next step.
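
A minimal sketch of that loop (all sizes and the rnn_step name are made up for illustration): the same weights are reused at every step, and the hidden state carries information forward.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
emb_size, hidden_size = 8, 16
W_ih = torch.randn(emb_size, hidden_size) * 0.1     # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden (the "loop")

def rnn_step(x_t, h):
    # The hidden state from the previous step is fed back in here.
    return torch.tanh(x_t @ W_ih + h @ W_hh)

h = torch.zeros(hidden_size)
for x_t in torch.randn(5, emb_size):  # 5 dummy time steps
    h = rnn_step(x_t, h)
```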

2
Q

3 layers of RNN

A

Input layer (an embedding from the vocabulary size to the hidden size), hidden layer (fully connected, applied at every time step), output layer (fully connected, predicts the target token over the vocabulary)
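
A sketch of these three layers as a toy next-token model, assuming PyTorch; sizes and the i_h / h_h / h_o names are placeholders, not the course's exact model.

```python
import torch
from torch import nn

class SimpleRNNLM(nn.Module):
    """Illustrative three-layer RNN language model (sizes are assumptions)."""
    def __init__(self, vocab_size=100, n_hidden=64):
        super().__init__()
        self.i_h = nn.Embedding(vocab_size, n_hidden)  # input: vocab -> hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)       # hidden: fully connected
        self.h_o = nn.Linear(n_hidden, vocab_size)     # output: predict next token

    def forward(self, x):                  # x: (batch, seq_len) of token ids
        h = torch.zeros(x.shape[0], self.h_h.out_features)
        for t in range(x.shape[1]):        # recurrence over the sequence
            h = torch.relu(self.h_h(h + self.i_h(x[:, t])))
        return self.h_o(h)                 # logits for the next token

logits = SimpleRNNLM()(torch.randint(0, 100, (2, 5)))  # shape (2, 100)
```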

3
Q

Why is BPTT needed?

A

RNNs process sequential data by maintaining a hidden state that is updated at each time step. However, training an RNN is challenging because the network’s loss depends not just on the current input but also on all the previous inputs due to the recurrence relationship. This sequential dependency means that the network’s weights must be updated based on errors accumulated over multiple time steps, not just a single layer.
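
A toy illustration of this dependency, with dummy data: a single backward() call sends the error back through every unrolled step, so the recurrent weights receive gradients accumulated over all time steps.

```python
import torch

hidden_size = 4
W_hh = torch.randn(hidden_size, hidden_size, requires_grad=True)

h = torch.zeros(hidden_size)
xs = torch.randn(3, hidden_size)    # 3 dummy time steps
for x_t in xs:                      # unroll the recurrence
    h = torch.tanh(x_t + h @ W_hh)

loss = h.sum()                      # dummy loss on the final state
loss.backward()                     # BPTT: gradient flows back through all 3 steps
print(W_hh.grad.shape)              # W_hh got gradients accumulated over time
```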

4
Q

What is BPTT?

A

Backpropagation Through Time: the recurrent loop is unrolled over the time steps of the sequence and ordinary backpropagation is applied to the unrolled graph, so the gradient of the loss flows back through every time step.

5
Q

Concept of multilayer RNN and why it’s used

A

The outputs (hidden states) of a first RNN are used as the input of a second RNN stacked on top of it. It is used because, although the unrolled model is very deep in principle, each predicted token only passes through one linear layer in a single-layer RNN, so stacking adds real depth per prediction.
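
In PyTorch this stacking can be expressed with the num_layers argument; a sketch with arbitrary sizes:

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(2, 5, 8)       # (batch, seq_len, features)
out, h_n = rnn(x)              # out: hidden states of the top layer, (2, 5, 16)
print(out.shape, h_n.shape)    # h_n: final hidden state per layer, (2, 2, 16)
```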

6
Q

Why is Long Short-Term Memory needed?

A

It helps to separately learn (1) the information required to predict the next token and (2) contextual information accumulated over the tokens seen so far (e.g. remembering the gender of the subject so it can be used for later words).

7
Q

How does an LSTM work?

A

It keeps a second state alongside the usual hidden state: the cell state, which stores longer-term contextual memory. Gates decide what is forgotten from the cell state, what is written to it, and what is read out of it to form the hidden state used for the prediction.

8
Q

Four main networks in LSTM

A

Forget gate (decides what to erase from the cell state), input gate (decides which cell-state entries to update, e.g. the subject's gender), cell gate (computes the new candidate values, e.g. "female"), output gate (decides which parts of the cell state go into the new hidden state)
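
A sketch of one cell step with the four gates, following the standard LSTM equations (weights, sizes, and the lstm_step name are illustrative and may differ from the lecture's exact formulation):

```python
import torch

def lstm_step(x, h, c, W, b):
    # W projects [x, h] to 4 * hidden_size; split into the four gate pre-activations.
    gates = torch.cat([x, h], dim=-1) @ W + b
    i, f, g, o = gates.chunk(4, dim=-1)
    f = torch.sigmoid(f)   # forget gate: what to erase from the cell state
    i = torch.sigmoid(i)   # input gate: which cell entries to update (e.g. "gender")
    g = torch.tanh(g)      # cell gate: candidate values to write (e.g. "female")
    o = torch.sigmoid(o)   # output gate: what to expose as the new hidden state
    c = f * c + i * g
    h = o * torch.tanh(c)
    return h, c

n_in, n_h = 8, 16
W = torch.randn(n_in + n_h, 4 * n_h) * 0.1
b = torch.zeros(4 * n_h)
h, c = torch.zeros(n_h), torch.zeros(n_h)
h, c = lstm_step(torch.randn(n_in), h, c, W, b)
```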

9
Q

What is dropout?

A

During each training iteration, neurons are randomly deactivated with a probability p. To compensate, the remaining activations are multiplied by 1/(1 – p) during training so that the expected activation stays the same.
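
A sketch of this rescaling (inverted dropout) under the description above; the helper name and sizes are made up:

```python
import torch

def dropout(x, p=0.5, training=True):
    if not training or p == 0:
        return x
    mask = (torch.rand_like(x) > p).float()  # deactivate neurons with probability p
    return x * mask / (1 - p)                # rescale so the expected activation is unchanged

x = torch.ones(4, 4)
print(dropout(x, p=0.5))  # surviving entries become 2.0, the rest 0.0
```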

10
Q

Drawbacks of dropout?

A

Since some neurons are randomly deactivated, less of the input is used to calculate the output during training.

11
Q

Other regularisation techniques

A

Weight decay, Activation regularisation (AR), Temporal activation regularisation (TAR)
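
A sketch of how the two activation penalties could be added to the task loss, roughly following the AWD-LSTM idea (the alpha/beta coefficients and shapes are assumptions; weight decay itself is usually handled by the optimizer):

```python
import torch

def ar_tar_penalty(h, alpha=2.0, beta=1.0):
    """h: RNN activations of shape (batch, seq_len, hidden)."""
    ar  = alpha * h.pow(2).mean()                      # AR: keep activations small
    tar = beta * (h[:, 1:] - h[:, :-1]).pow(2).mean()  # TAR: keep them smooth over time
    return ar + tar

h = torch.randn(2, 5, 16)
loss = torch.tensor(1.23)        # dummy task loss
loss = loss + ar_tar_penalty(h)  # penalties added on top of the task loss
```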

12
Q

Weight tying?

A

The mapping from input to hidden (the embedding matrix) and the mapping from hidden to output use the same weights. Used in AWD-LSTM.
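
A sketch of the common PyTorch idiom for tying these two mappings to one shared weight matrix (sizes are arbitrary):

```python
import torch
from torch import nn

vocab_size, n_hidden = 100, 64
i_h = nn.Embedding(vocab_size, n_hidden)            # input -> hidden
h_o = nn.Linear(n_hidden, vocab_size, bias=False)   # hidden -> output
h_o.weight = i_h.weight                             # weight tying: one shared (vocab, hidden) matrix

x = torch.randint(0, vocab_size, (2, 5))
logits = h_o(i_h(x))                                # (2, 5, vocab_size)
```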

13
Q

In an RNN, the embedding of a token does not depend on its position in the sequence.

A

True. The token embedding is determined by a lookup table or an embedding matrix, which assigns a fixed vector representation to each token based on the vocabulary.
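
A small check of this, with arbitrary token ids: the same token id yields the same embedding vector regardless of its position.

```python
import torch
from torch import nn

emb = nn.Embedding(num_embeddings=50, embedding_dim=8)
seq = torch.tensor([[7, 3, 7]])             # token 7 appears at positions 0 and 2
vecs = emb(seq)
print(torch.equal(vecs[0, 0], vecs[0, 2]))  # True: same token, same vector, any position
```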

14
Q

Which factors influence the number of parameters in RNN for next-token-prediction?

A

Size of the embedding, Number of tokens in the vocab
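
A tiny worked example with made-up numbers, assuming (as in a simple RNN language model) that the hidden size equals the embedding size:

```python
from torch import nn

vocab_size, n_hidden = 1000, 64
i_h = nn.Embedding(vocab_size, n_hidden)  # 1000 * 64        =  64,000 parameters
h_h = nn.Linear(n_hidden, n_hidden)       # 64 * 64 + 64     =   4,160 parameters
h_o = nn.Linear(n_hidden, vocab_size)     # 64 * 1000 + 1000 =  65,000 parameters
total = sum(p.numel() for p in [*i_h.parameters(), *h_h.parameters(), *h_o.parameters()])
print(total)  # 133160: dominated by the vocabulary size and the embedding size
```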

15
Q

Techniques used in AWD-LSTM?

A

Activation regularization (AR), dropout

16
Q

What is AWD-LSTM?

A

ASGD Weight-Dropped Long Short-Term Memory. It introduces specific modifications and techniques to standard LSTMs to make them more effective for training on sequential data.

17
Q

Explain on an intuitive level how an LSTM works.

A

The LSTM uses a long-term memory, the cell state, to store meta or contextual information; it is used together with the short-term memory, the hidden state, to make predictions.

18
Q

How many times does the cell state flow through a neural network?

A

The cell state never flows through any neural network; it is only updated element-wise by the gates.

19
Q

Is it certain that both the hidden and cell state contribute to the prediction?

A

Technically it is not certain: the hidden state flows through a sigmoid function before being merged for the prediction, which could in principle produce a tensor of all zeros. In practice, however, this is extremely unlikely.

20
Q

Is it possible to have a multilayer LSTM architecture?

A

Yes, it is possible to stack LSTMs on top of each other, just as for RNNs.

21
Q

Explain, in non-technical terms, the difference between the sigmoid and tanh functions.

A

The sigmoid function decides IF (and to what extent) a value is used; the tanh function rescales it to between -1 and 1.
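
A quick illustration of the two ranges (input values are arbitrary): sigmoid outputs lie in (0, 1) and act like a soft on/off switch, while tanh outputs lie in (-1, 1) and rescale the value while keeping its sign.

```python
import torch

x = torch.linspace(-5, 5, 5)
print(torch.sigmoid(x))  # values in (0, 1): soft "keep or drop" factors
print(torch.tanh(x))     # values in (-1, 1): rescaled magnitudes, sign preserved
```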

22
Q

How does the LSTM achieve regularization compared to an RNN?

A

After each of the networks inside the LSTM, either a sigmoid or a tanh function is applied, which squashes (normalises) the results into a bounded range.