NLP II: RNNs and LSTMs Flashcards

1
Q

Draw the representation of a recurrent neural network (RNN)

A

Check notes

2
Q

Define a multilayer RNN. What does it provide compared to a single RNN?

A

The outputs of a first RNN are used as the input for a second RNN (which then produces the actual output).
This gives the model more potential to learn.

3
Q

What is Long Short-Term Memory (LSTM), and how does it differ from a plain RNN?

A

It is like an RNN, but in addition to the hidden state (used to predict the next token), the LSTM keeps a second state that captures contextual information. This is called the “short-term memory” or “cell state”.
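A minimal PyTorch sketch (not from the notes) illustrating the difference: nn.RNN returns only a hidden state, while nn.LSTM additionally carries a cell state. The sizes below are made up.

import torch
import torch.nn as nn

x = torch.randn(4, 8, 64)        # made-up batch: 4 sequences, 8 steps, 64-dim inputs

rnn = nn.RNN(64, 64, batch_first=True)
out, h_n = rnn(x)                # a plain RNN returns only a hidden state

lstm = nn.LSTM(64, 64, batch_first=True)
out, (h_n, c_n) = lstm(x)        # the LSTM additionally returns the cell state c_n
print(h_n.shape, c_n.shape)      # both: torch.Size([1, 4, 64])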

4
Q

What does the “cell state” do?

A

It helps the network separately learn (1) the information required to predict the next token and (2) contextual information accumulated from the tokens seen so far (e.g., the gender of names)

5
Q

Draw the LSTM Architecture

A

See notes

6
Q

What are the 4 gates of the LSTM Architecture?

A

Forget gate:
– Which information in the cell state should be forgotten?

Input gate:
– Works together with the cell gate to update the cell state
– Which dimensions should be updated? (e.g., “gender”)

Cell gate:
– Works together with the input gate to update the cell state
– To what should the values be updated? (e.g., “female”)

Output gate:
– What should the next hidden state be?
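A hedged sketch (not the lecture’s code, names are my own) of a single LSTM time step with the four gates written out explicitly:

import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, n_in, n_hidden):
        super().__init__()
        # each gate sees the current input concatenated with the previous hidden state
        self.forget_gate = nn.Linear(n_in + n_hidden, n_hidden)
        self.input_gate  = nn.Linear(n_in + n_hidden, n_hidden)
        self.cell_gate   = nn.Linear(n_in + n_hidden, n_hidden)
        self.output_gate = nn.Linear(n_in + n_hidden, n_hidden)

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=1)
        f = torch.sigmoid(self.forget_gate(z))  # what to forget in the cell state
        i = torch.sigmoid(self.input_gate(z))   # which dimensions to update
        g = torch.tanh(self.cell_gate(z))       # to what the values should be updated
        o = torch.sigmoid(self.output_gate(z))  # what the next hidden state should be
        c = f * c + i * g                       # new cell state
        h = o * torch.tanh(c)                   # new hidden state
        return h, (h, c)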

7
Q

What is dropout?

A

During each training iteration, randomly deactivate neurons with a probability p

To compensate, the surviving activations are multiplied by 1/(1 – p) during training
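A small sketch of this “inverted dropout” scaling in plain PyTorch (in practice nn.Dropout(p) does the same thing during training):

import torch

def inverted_dropout(x, p=0.5):
    keep = (torch.rand_like(x) > p).float()  # 1 = keep the neuron, 0 = deactivate it
    return x * keep / (1 - p)                # scale survivors so the expected value is unchanged

x = torch.ones(2, 5)
print(inverted_dropout(x, p=0.5))            # surviving activations show up as 2.0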

8
Q

What are some other regularization techniques?

A

Weight decay:
– Add the sum of squared network weights to the loss (times wd = 0.1)

Activation regularization (AR):
– Add the sum of squared outputs to the loss (times α = 1)

Temporal activation regularization (TAR):
– Add the sum of squared differences between consecutive activations to the loss (times β = 2)
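A hedged sketch (not the fastai implementation, which averages rather than sums) of how these three penalties could be added to a loss; acts is assumed to be the RNN output of shape (batch, seq_len, hidden):

import torch

def add_regularization(loss, model, acts, wd=0.1, alpha=1.0, beta=2.0):
    # Weight decay: sum of squared network weights
    loss = loss + wd * sum((p ** 2).sum() for p in model.parameters())
    # Activation regularization (AR): sum of squared outputs
    loss = loss + alpha * (acts ** 2).sum()
    # Temporal activation regularization (TAR): squared differences between consecutive activations
    loss = loss + beta * ((acts[:, 1:] - acts[:, :-1]) ** 2).sum()
    return loss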

9
Q

What is the AWD-LSTM?

A
– Dropout in all parts of the LSTM architecture (with different probabilities)
– Weight decay, activation regularization, and temporal activation regularization
– Final trick, weight tying: the mappings from input to hidden and from hidden to output share the same weights!
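A minimal sketch of the weight-tying trick in PyTorch (sizes are illustrative): the embedding (input to hidden) and the output layer (hidden to output) can share one weight matrix, because an nn.Linear(n_hidden, vocab_sz) weight has the same (vocab_sz, n_hidden) shape as the embedding table.

import torch.nn as nn

vocab_sz, n_hidden = 37, 64
i_h = nn.Embedding(vocab_sz, n_hidden)  # input -> hidden
h_o = nn.Linear(n_hidden, vocab_sz)     # hidden -> output
h_o.weight = i_h.weight                 # weight tying: both layers now use the same parameters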
10
Q

What is the limit of the Human Numbers dataset?

A

10,000
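If you want to check this yourself, a short fastai snippet (assuming the dataset’s usual train.txt/valid.txt split) downloads the data and shows how far the counting goes:

from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)   # download the Human Numbers dataset
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
print(len(lines), lines[-1])            # how many numbers there are, and the last one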

11
Q

What is the dropout formula that compensates for the deactivated neurons?

A

1/(1 – p)

12
Q

Remember the human numbers dataset from the lecture. You are given a small extract from the tokenized dataset as input:

[‘two’, ‘hundred’, ‘seventy’, ‘eight’, ‘.’]

How does it continue, i.e., what is the next token?
1. The next token is ‘two’.
2. There are multiple tokens that could possibly come next. ‘two’ is the most likely one based on the full dataset.
3. There are multiple tokens that could possibly come next. They are equally likely based on the full dataset.
4. Not enough information is known about the dataset to make any statement.
5. There is no next token, as this is the end of the dataset.

A

2

13
Q

In an RNN, the embedding of a token differs based on its position in the sequence.

True.
False.

A

False

14
Q

Which factors influence the number of parameters in an RNN for next-token prediction? (Multiple Choice)
1. The activation function of the hidden-to-hidden layer.
2. Size of the embedding.
3. Number of tokens in the vocab.
4. Maximum length of a sequence in the training data.

A

2,3

15
Q

You are given the following partial implementation of a simple RNN for a next-token-prediction task in PyTorch:

class RNN(Module):
    def __init__(self):
        self.i_h = nn.Embedding(xxx, xxx)
        self.h_h = nn.Linear(xxx, xxx)
        self.h_o = nn.Linear(xxx, xxx)

    def forward(self, x):
        h = 0
        for i in range(xxx):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

You have 37 tokens in your vocab and want to use an embedding size of 64. For predicting the next token, you always use 8 previous tokens as context.

Fill in the gaps marked as xxx in the code above!

A

class RNN(Module):
    # assumes fastai-style imports as in the lecture notebooks (e.g., from fastai.text.all import *)
    def __init__(self):
        self.i_h = nn.Embedding(37, 64)   # 37 tokens in the vocab, embedding size 64
        self.h_h = nn.Linear(64, 64)
        self.h_o = nn.Linear(64, 37)      # project back onto the 37-token vocab

    def forward(self, x):
        h = 0
        for i in range(8):                # 8 previous tokens as context
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
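As a sanity check tying back to the parameter-count question above: only the vocab size (37) and embedding size (64) enter the parameter count; the context length of 8 does not, because the same three layers are reused at every step. A quick sketch, assuming the class above and its fastai-style imports:

model = RNN()
print(sum(p.numel() for p in model.parameters()))
# Embedding: 37*64, hidden-to-hidden: 64*64 + 64, hidden-to-output: 64*37 + 37  ->  8,933 in total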

16
Q

You apply dropout, randomly deactivating neurons with probability p = 0.51. By what factor do you multiply the activations during training?

A

1/(1 – p) = 1/(1 – 0.51) = 1/0.49 ≈ 2.040816

17
Q

Which of these regularization techniques are used by the AWD-LSTM?
1. Ensembling
2. Dropout
3. Activation regularization (AR)
4. Tying activation regularization (TAR)
5. Cell gate penalization
6. Data augmentation

A

2,3

18
Q

Which of the following is a primary advantage of Long Short-Term Memory (LSTM) networks over basic Recurrent Neural Networks (RNNs)?

A) Simpler architecture
B) Reduced computational cost
C) Ability to handle longer dependencies without vanishing gradient problems
D) Better performance on image data

A

Answer: C) Ability to handle longer dependencies without vanishing gradient problems

19
Q

What is the main purpose of the cell state in an LSTM network?

A) To store the output of the previous layer
B) To keep track of the current input
C) To maintain and carry forward contextual information across time steps
D) To apply dropout regularization

A

Answer: C) To maintain and carry forward contextual information across time steps

20
Q

Explain the concept of a multilayer RNN and its potential benefits.

A

Answer:
A multilayer RNN consists of multiple RNN layers stacked on top of each other, where the output of one layer becomes the input to the next. This structure allows the network to learn more complex patterns and representations, as the higher layers can capture higher-level features based on the outputs of the lower layers.
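A minimal PyTorch sketch of the same idea (sizes made up): num_layers=2 stacks two recurrent layers, with the first layer’s hidden states feeding the second.

import torch
import torch.nn as nn

stacked = nn.RNN(input_size=64, hidden_size=64, num_layers=2, batch_first=True)
x = torch.randn(4, 8, 64)       # 4 sequences, 8 steps, 64-dim inputs
out, h_n = stacked(x)
print(out.shape, h_n.shape)     # out: (4, 8, 64) from the top layer; h_n: (2, 4, 64), one state per layer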

21
Q

Describe the role of the forget gate in an LSTM network.

A

Answer:
The forget gate in an LSTM network determines how much of the previous cell state should be carried forward to the next time step. It takes the current input and the previous hidden state as inputs and produces a value between 0 and 1 for each element in the cell state, where 0 means “completely forget” and 1 means “completely remember.”
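A tiny numerical illustration (values made up) of how the forget gate’s output scales the previous cell state elementwise:

import torch

f = torch.tensor([0.0, 0.5, 1.0])       # forget-gate values: forget, halve, keep
c_prev = torch.tensor([4.0, 4.0, 4.0])  # previous cell state
print(f * c_prev)                       # tensor([0., 2., 4.])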

22
Q

An RNN model suffers from vanishing gradients during training. How might switching to an LSTM architecture help mitigate this issue?

A

Answer:
LSTM architectures are designed to mitigate the vanishing gradient problem by using mechanisms like the cell state and gating functions (forget, input, and output gates). These components help maintain gradients over long sequences, allowing the network to learn from long-range dependencies more effectively.

23
Q

A model using AWD-LSTM achieves 85% accuracy on a test set after regularization. What are the key regularization techniques used in AWD-LSTM, and how do they contribute to this performance?

A

Answer:
The key regularization techniques used in AWD-LSTM include dropout (applied to inputs, outputs, and recurrent connections), weight decay (penalizing large weights to prevent overfitting), and activation regularization (penalizing large activations to encourage more distributed representations). These techniques help improve the model’s generalization ability, reducing overfitting and leading to better performance on unseen data.
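For context, a hedged usage sketch with fastai (assuming a text DataLoaders called dls already exists; the numbers are illustrative, not the settings behind the 85% figure): drop_mult scales all of the AWD-LSTM’s internal dropout probabilities at once, and wd sets the weight decay.

from fastai.text.all import *

# dls: an assumed, already-built language-model DataLoaders over your corpus
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.5, wd=0.1, metrics=accuracy)
learn.fit_one_cycle(1, 2e-2)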