NLP II: RNNs and LSTMs Flashcards
Draw the representation of a recurrent neural network (RNN)
Check notes
What is a multilayer RNN? What does it provide compared to a single RNN?
The outputs of a first RNN are used as the inputs of a second RNN (which then produces the actual output).
This gives the model more potential to learn.
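As a rough illustration (not from the lecture), a two-layer RNN can be built with PyTorch's nn.RNN by setting num_layers=2; all sizes below are made up for the example:

    import torch
    import torch.nn as nn

    # Two stacked RNN layers: the hidden states of the first layer are fed as
    # inputs to the second layer, whose final hidden state produces the output.
    vocab_size, emb_size, hidden_size = 37, 64, 64   # illustrative sizes

    emb = nn.Embedding(vocab_size, emb_size)
    rnn = nn.RNN(emb_size, hidden_size, num_layers=2, batch_first=True)
    head = nn.Linear(hidden_size, vocab_size)

    x = torch.randint(0, vocab_size, (4, 8))   # 4 sequences of 8 token ids
    out, h = rnn(emb(x))                        # out holds the second layer's outputs
    logits = head(out[:, -1])                   # next-token logits, shape (4, vocab_size)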
What is the Long Short-Term Memory (LSTM), and how does it differ from an RNN?
It is like an RNN, but in addition to the hidden state (used to predict the next token), the LSTM keeps a second state that captures context information; it is called the “short-term memory” / “cell state”.
What does the “cell state” do?
It helps to separately learn (1) the information required to predict the next token and (2) contextual information accumulated over the tokens seen so far (e.g., the gender of names).
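A minimal sketch (sizes made up) showing that an LSTM layer returns both states, the hidden state h and the cell state c:

    import torch
    import torch.nn as nn

    # An LSTM carries two states per step: the hidden state h (used for the
    # prediction) and the cell state c (the extra "memory" described above).
    lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)

    x = torch.randn(4, 8, 64)       # 4 sequences of 8 already-embedded tokens
    out, (h, c) = lstm(x)           # h and c both have shape (1, 4, 64)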
Draw the LSTM Architecture
See notes
What are the 4 gates of the LSTM Architecture?
Forget gate:
– Which information in the cell state should be forgotten?
Input gate:
– Works together with the cell gate to update the cell state
– Which dimensions should be updated? (e.g., “gender”)
Cell gate:
– Works together with the input gate to update the cell state
– To what should the values be updated? (e.g., “female”)
Output gate:
– What should the next hidden state be?
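The following from-scratch sketch shows how the four gates interact inside a single LSTM cell (a simplified illustration; PyTorch's nn.LSTMCell is the built-in equivalent):

    import torch
    import torch.nn as nn

    class LSTMCellSketch(nn.Module):
        def __init__(self, ni, nh):
            super().__init__()
            self.forget_gate = nn.Linear(ni + nh, nh)  # what to erase from the cell state
            self.input_gate  = nn.Linear(ni + nh, nh)  # which dimensions to update
            self.cell_gate   = nn.Linear(ni + nh, nh)  # candidate values to write
            self.output_gate = nn.Linear(ni + nh, nh)  # what to expose as the hidden state

        def forward(self, x, state):
            h, c = state
            hx = torch.cat([h, x], dim=1)
            f = torch.sigmoid(self.forget_gate(hx))
            i = torch.sigmoid(self.input_gate(hx))
            g = torch.tanh(self.cell_gate(hx))
            o = torch.sigmoid(self.output_gate(hx))
            c = f * c + i * g              # forget old info, write new info
            h = o * torch.tanh(c)          # new hidden state used for prediction
            return h, (h, c)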
What is dropout?
During each training iteration, randomly deactivate neurons with a probability p
To compensate, all activations are multiplied by 1/(1 – p) during training
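A sketch of (inverted) dropout, assuming it is applied to an activation tensor x; nn.Dropout implements the same behaviour:

    import torch

    def dropout_sketch(x, p=0.5, training=True):
        # Zero each activation with probability p during training and rescale the
        # survivors by 1/(1 - p) so the expected activation stays unchanged.
        if not training or p == 0:
            return x
        mask = (torch.rand_like(x) > p).float()
        return x * mask / (1 - p)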
What are some other regularization techniques?
Weight Decay:
– Adds to the loss the sum of squared network weights (times wd = 0.1)
Activation regularization (AR):
– Add sum of squared outputs to the loss (times α = 1)
Temporal activation regularization (TAR):
– Add sum of square of differences between consecutive activations (times β = 2)
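As an illustration only (not the exact lecture code), the three penalties could be added to a loss roughly like this, assuming activations of shape (batch, sequence, hidden) and the coefficients mentioned above:

    import torch

    def regularized_loss(loss, model, activations, wd=0.1, alpha=1.0, beta=2.0):
        # Weight decay: sum of squared network weights
        loss = loss + wd * sum((p ** 2).sum() for p in model.parameters())
        # Activation regularization (AR): sum of squared activations
        loss = loss + alpha * (activations ** 2).sum()
        # Temporal activation regularization (TAR): squared differences between
        # activations at consecutive time steps
        loss = loss + beta * ((activations[:, 1:] - activations[:, :-1]) ** 2).sum()
        return loss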
What is the AWD-LSTM?
– Dropout in all parts of the LSTM architecture (with different probabilities)
– Weight decay, activation regularization, and temporal activation regularization
– Final trick: weight tying
– The mappings from input to hidden and from hidden to output share the same weights! (See the sketch below.)
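Weight tying in a minimal form (sizes made up; the output bias is dropped here for simplicity):

    import torch.nn as nn

    vocab_size, hidden_size = 37, 64
    i_h = nn.Embedding(vocab_size, hidden_size)            # input-to-hidden mapping
    h_o = nn.Linear(hidden_size, vocab_size, bias=False)   # hidden-to-output mapping
    h_o.weight = i_h.weight   # both mappings now share the same (vocab, hidden) matrix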
What is the upper limit of the Human Numbers dataset?
10,000
What is the scaling factor that compensates for the deactivated neurons in dropout?
1/(1 – p)
Remember the human numbers dataset from the lecture. You are given a small extract from the tokenized dataset as input:
[‘two’, ‘hundred’, ‘seventy’, ‘eight’, ‘.’]
How does it continue, i.e., what is the next token?
1. The next token is ‘two’.
2. There are multiple tokens that could possibly come next. ‘two’ is the most likely one based on the full dataset.
3. There are multiple tokens that could possibly come next. They are equally likely based on the full dataset.
4. Not enough information is known about the dataset to make any statement.
5. There is no next token, as this is the end of the dataset.
2
In an RNN, the embedding of a token differs based on its position in the sequence.
True.
False.
False
Which factors influence the number of parameters in an RNN for next-token prediction? (Multiple Choice)
1. The activation function of the hidden-to-hidden layer.
2. Size of the embedding.
3. Number of tokens in the vocab.
4. Maximum length of a sequence in the training data.
2,3
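As a check, the parameter count of the small RNN from the next card (37 tokens, embedding size 64) only involves the embedding size and the vocabulary size; the context length never appears because the same weights are reused at every position:

    emb_params = 37 * 64          # nn.Embedding(37, 64)            -> 2,368
    h_h_params = 64 * 64 + 64     # nn.Linear(64, 64), weights+bias -> 4,160
    h_o_params = 64 * 37 + 37     # nn.Linear(64, 37), weights+bias -> 2,405
    total = emb_params + h_h_params + h_o_params   # 8,933 parameters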
You are given the following partial implementation of a simple RNN for a next-token-prediction task in PyTorch:
    class RNN(Module):
        def __init__(self):
            self.i_h = nn.Embedding(xxx, xxx)
            self.h_h = nn.Linear(xxx, xxx)
            self.h_o = nn.Linear(xxx, xxx)

        def forward(self, x):
            h = 0
            for i in range(xxx):
                h = h + self.i_h(x[:, i])
                h = F.relu(self.h_h(h))
            return self.h_o(h)
You have 37 tokens in your vocab and want to use an embedding size of 64. For predicting the next token, you always use 8 previous tokens as context.
Fill in the gaps marked as xxx in the code above!
    from fastai.text.all import *   # provides Module, nn and F as in the lecture setup

    class RNN(Module):              # fastai's Module: no explicit super().__init__() needed
        def __init__(self):
            self.i_h = nn.Embedding(37, 64)   # 37 tokens in the vocab, embedding size 64
            self.h_h = nn.Linear(64, 64)      # hidden-to-hidden
            self.h_o = nn.Linear(64, 37)      # hidden-to-output: one logit per token

        def forward(self, x):
            h = 0
            for i in range(8):                # 8 previous tokens as context
                h = h + self.i_h(x[:, i])
                h = F.relu(self.h_h(h))
            return self.h_o(h)
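A quick (hypothetical) usage check for the model above: a batch of 16 sequences of 8 token indices yields one logit per vocabulary entry for the next token.

    import torch

    model = RNN()
    x = torch.randint(0, 37, (16, 8))   # (batch size, context length)
    logits = model(x)                   # shape: (16, 37)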