NLP II: RNNs and LSTMs Flashcards
Draw the representation of a recurrent neural network (RNN)
Check notes
What is a multilayer RNN? What does it provide compared to a single RNN?
The outputs of a first RNN are used as the input for a second RNN (which then produces the actual output)
This provides more potential for the model to learn
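As an illustration (not lecture code), PyTorch's built-in nn.RNN stacks layers via num_layers; the sizes 37 and 64 below are just the toy values used in the quiz questions further down:

import torch
import torch.nn as nn

vocab_sz, emb_sz = 37, 64                  # illustrative sizes only
emb = nn.Embedding(vocab_sz, emb_sz)

# num_layers=2 stacks two RNNs: the hidden states of layer 1 are the inputs of layer 2
rnn = nn.RNN(input_size=emb_sz, hidden_size=emb_sz, num_layers=2, batch_first=True)

x = torch.randint(0, vocab_sz, (1, 8))     # one sequence of 8 token ids
out, h = rnn(emb(x))                       # out: (1, 8, 64); h: (2, 1, 64), one state per layer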
What is Long Short-Term Memory (LSTM), and how does it differ from an RNN?
It’s like an RNN, but in addition to the hidden state (used to predict the next token), the LSTM maintains a second state that captures contextual information, called the “cell state”
What does the “cell state” do?
It helps to separately learn (1) the information required to predict the next token and (2) contextual information accumulated over the tokens seen so far (e.g., the gender of names)
Draw the LSTM Architecture
See notes
What are the 4 gates of the LSTM Architecture?
Forget gate:
– Which information in the cell state should be forgotten?
Input gate:
– Works together with the cell gate to update the cell state
– Which dimensions should be updated? (e.g., “gender”)
Cell gate:
– Works together with the input gate to update the cell state
– To what should the values be updated? (e.g., “female”)
Output gate:
– What should the next hidden state be?
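The four gates can be written out explicitly. Below is a minimal, unoptimized sketch of a single LSTM time step (class and variable names are illustrative, not the lecture’s code):

import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """One LSTM time step, written out gate by gate."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.forget_gate = nn.Linear(n_in + n_hidden, n_hidden)
        self.input_gate  = nn.Linear(n_in + n_hidden, n_hidden)
        self.cell_gate   = nn.Linear(n_in + n_hidden, n_hidden)
        self.output_gate = nn.Linear(n_in + n_hidden, n_hidden)

    def forward(self, x, state):
        h, c = state                             # hidden state, cell state
        hx = torch.cat([h, x], dim=1)
        f = torch.sigmoid(self.forget_gate(hx))  # which cell-state dimensions to forget
        i = torch.sigmoid(self.input_gate(hx))   # which dimensions to update
        g = torch.tanh(self.cell_gate(hx))       # to what values they should be updated
        o = torch.sigmoid(self.output_gate(hx))  # what the next hidden state should be
        c = f * c + i * g                        # new cell state
        h = o * torch.tanh(c)                    # new hidden state
        return h, (h, c)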
What is dropout?
During each training iteration, randomly deactivate neurons with a probability p
To compensate, the remaining activations are multiplied by 1/(1 – p) during training
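A minimal sketch of this “inverted dropout” scaling (in practice nn.Dropout does exactly this during training):

import torch

def inverted_dropout(x, p=0.5):
    """Deactivate each entry with probability p and scale the survivors by 1/(1 - p)."""
    keep = (torch.rand_like(x) > p).float()   # 1 with probability (1 - p), 0 with probability p
    return x * keep / (1 - p)                 # keeps the expected activation unchanged

acts = torch.randn(3, 4)
print(inverted_dropout(acts, p=0.51))         # surviving entries are scaled by 1/0.49 ≈ 2.04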
What are some other regularization techniques?
Weight Decay:
– Adds to the loss the sum of squared network weights (times wd = 0.1)
Activation regularization (AR):
– Add sum of squared outputs to the loss (times α = 1)
Temporal activation regularization (TAR):
– Add sum of square of differences between consecutive activations (times β = 2)
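A sketch of how these three penalties could be added to a loss, using the example coefficients quoted above (in practice weight decay is usually handled by the optimizer, and library implementations may use means rather than sums; this just mirrors the definitions on this card):

import torch

def regularized_loss(base_loss, model, raw_out, wd=0.1, alpha=1.0, beta=2.0):
    """raw_out: LSTM activations of shape (batch, seq_len, n_hidden)."""
    l2  = sum((p ** 2).sum() for p in model.parameters())    # weight decay term
    ar  = (raw_out ** 2).sum()                               # activation regularization
    tar = ((raw_out[:, 1:] - raw_out[:, :-1]) ** 2).sum()    # temporal activation regularization
    return base_loss + wd * l2 + alpha * ar + beta * tar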
What is the AWD-LSTM?
– Dropout in all parts of the LSTM architecture (with different probabilities)
– Weight decay, activation regularization and temporal activation regularization
– Final trick: Weight tying
– The mappings from input to hidden and from hidden to output share the same weights!
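A minimal sketch of the weight-tying trick (an illustrative toy model, not the AWD-LSTM implementation): the hidden-to-output layer simply reuses the embedding matrix.

import torch.nn as nn

class TiedLM(nn.Module):
    """Toy next-token model whose input embedding and output projection share one weight matrix."""
    def __init__(self, vocab_sz=37, n_hidden=64):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz, bias=False)
        self.h_o.weight = self.i_h.weight   # weight tying: both maps use the same (vocab_sz, n_hidden) matrix

    def forward(self, x):
        out, _ = self.rnn(self.i_h(x))
        return self.h_o(out)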
What’s the limit of the Human Numbers dataset?
10,000
What’s the dropout scaling factor that compensates for the deactivations?
1/(1 – p)
Remember the human numbers dataset from the lecture. You are given a small extract from the tokenized dataset as input:
['two', 'hundred', 'seventy', 'eight', '.']
How does it continue, i.e., what is the next token?
1. The next token is ‘two’.
2. There are multiple tokens that could possibly come next. ‘two’ is the most likely one based on the full dataset.
3. There are multiple tokens that could possibly come next. They are equally likely based on the full dataset.
4. Not enough information is known about the dataset to make any statement.
5. There is no next token, as this is the end of the dataset.
2
In an RNN, the embedding of a token differs based on the position in the sequence.
True.
False.
False
Which factors influence the number of parameters in an RNN for next-token prediction? (Multiple Choice)
1. The activation function of the hidden-to-hidden layer.
2. Size of the embedding.
3. Number of tokens in the vocab.
4. Maximum length of a sequence in the training data.
2,3
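A quick way to check this answer: count the parameters of the simple next-token RNN from the next question; the count is fixed entirely by the vocab size and the embedding size (a sketch using the toy sizes 37 and 64):

import torch.nn as nn

vocab_sz, emb_sz = 37, 64                      # illustrative sizes, as in the next question
layers = [nn.Embedding(vocab_sz, emb_sz),      # vocab size and embedding size fix every shape
          nn.Linear(emb_sz, emb_sz),
          nn.Linear(emb_sz, vocab_sz)]
n_params = sum(p.numel() for l in layers for p in l.parameters())
print(n_params)                                # 8933 - unchanged by sequence length or activation function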
You are given the following partial implementation of a simple RNN for a next-token-prediction task in PyTorch:
class RNN(Module):
    def __init__(self):
        self.i_h = nn.Embedding(xxx, xxx)
        self.h_h = nn.Linear(xxx, xxx)
        self.h_o = nn.Linear(xxx, xxx)

    def forward(self, x):
        h = 0
        for i in range(xxx):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
You have 37 tokens in your vocab and want to use an embedding size of 64. For predicting the next token, you always use 8 previous tokens as context.
Fill in the gaps marked as xxx in the code above!
# Assuming the lecture's fastai setup (e.g. from fastai.text.all import *),
# which provides Module, nn (torch.nn) and F (torch.nn.functional).
class RNN(Module):
    def __init__(self):
        self.i_h = nn.Embedding(37, 64)   # 37 tokens in the vocab, embedding size 64
        self.h_h = nn.Linear(64, 64)      # hidden-to-hidden
        self.h_o = nn.Linear(64, 37)      # hidden-to-output: one score per vocab token

    def forward(self, x):
        h = 0
        for i in range(8):                # 8 previous tokens as context
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
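A quick sanity check of the filled-in model (a sketch that assumes the fastai setup noted in the comments above):

import torch

model = RNN()
x = torch.randint(0, 37, (16, 8))   # a batch of 16 sequences with 8 context tokens each
print(model(x).shape)               # torch.Size([16, 37]) - one score per vocab token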
You apply dropout, randomly deactivating neurons with probability p = 0.51. By what factor do you multiply the activations during training?
1/(1 – p) = 1/(1 – 0.51) = 1/0.49 ≈ 2.040816
Which of these regularization techniques are used by the AWD-LSTM?
1. Ensembling
2. Dropout
3. Activation regularization (AR)
4. Tying activation regularization (TAR)
5. Cell gate penalization
6. Data augmentation
2,3
Which of the following is a primary advantage of Long Short-Term Memory (LSTM) networks over basic Recurrent Neural Networks (RNNs)?
A) Simpler architecture
B) Reduced computational cost
C) Ability to handle longer dependencies without vanishing gradient problems
D) Better performance on image data
Answer: C) Ability to handle longer dependencies without vanishing gradient problems
What is the main purpose of the cell state in an LSTM network?
A) To store the output of the previous layer
B) To keep track of the current input
C) To maintain and carry forward contextual information across time steps
D) To apply dropout regularization
Answer: C) To maintain and carry forward contextual information across time steps
Explain the concept of a multilayer RNN and its potential benefits.
Answer:
A multilayer RNN consists of multiple RNN layers stacked on top of each other, where the output of one layer becomes the input to the next. This structure allows the network to learn more complex patterns and representations, as the higher layers can capture higher-level features based on the outputs of the lower layers.
Describe the role of the forget gate in an LSTM network.
Answer:
The forget gate in an LSTM network determines how much of the previous cell state should be carried forward to the next time step. It takes the current input and the previous hidden state as inputs and produces a value between 0 and 1 for each element in the cell state, where 0 means “completely forget” and 1 means “completely remember.”
An RNN model suffers from vanishing gradients during training. How might switching to an LSTM architecture help mitigate this issue?
Answer:
LSTM architectures are designed to mitigate the vanishing gradient problem by using mechanisms like the cell state and gating functions (forget, input, and output gates). These components help maintain gradients over long sequences, allowing the network to learn from long-range dependencies more effectively.
A model using AWD-LSTM achieves 85% accuracy on a test set after regularization. What are the key regularization techniques used in AWD-LSTM, and how do they contribute to this performance?
Answer:
The key regularization techniques used in AWD-LSTM include dropout (applied to inputs, outputs, and recurrent connections), weight decay (penalizing large weights to prevent overfitting), and activation regularization (penalizing large activations to encourage more distributed representations). These techniques help improve the model’s generalization ability, reducing overfitting and leading to better performance on unseen data.