NLP 2 Flashcards
What is an RNN?
A neural network with a loop, i.e. one whose hidden state is fed back as input at the next step, is called a Recurrent Neural Network (RNN).
The 3 layers of an RNN
Input layer (an embedding that maps each token in the vocabulary to a hidden-size vector), hidden layer (fully connected, applied recurrently), output layer (projects the hidden state to the vocabulary to predict the target token)
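A minimal sketch of these three layers in PyTorch (the sizes vocab_size=1000 and hidden_size=64 are illustrative, not from the cards):

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.i2h = nn.Embedding(vocab_size, hidden_size)  # input layer: token id -> embedding
        self.h2h = nn.Linear(hidden_size, hidden_size)    # hidden layer: fully connected recurrence
        self.h2o = nn.Linear(hidden_size, vocab_size)     # output layer: predict the target token

    def forward(self, x):                     # x: (batch, seq_len) of token ids
        h = torch.zeros(x.size(0), self.h2h.in_features)
        for t in range(x.size(1)):            # the "loop" that makes it recurrent
            h = torch.tanh(self.i2h(x[:, t]) + self.h2h(h))
        return self.h2o(h)                    # logits over the vocabulary

logits = SimpleRNN()(torch.randint(0, 1000, (8, 16)))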
Why is BPTT needed?
RNNs process sequential data by maintaining a hidden state that is updated at each time step. However, training an RNN is challenging because the network’s loss depends not just on the current input but also on all the previous inputs due to the recurrence relationship. This sequential dependency means that the network’s weights must be updated based on errors accumulated over multiple time steps, not just a single layer.
What is BPTT?
Backpropagation Through Time: the recurrent network is unrolled across its time steps and gradients are propagated back through the whole unrolled graph, so errors from every step update the shared weights.
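A minimal sketch of BPTT, assuming PyTorch and illustrative sizes; the Python loop builds one computation graph across all time steps, and a single backward() call propagates gradients through every step:

import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(1000, 64)
rnn = nn.Linear(64, 64)
out = nn.Linear(64, 1000)

x = torch.randint(0, 1000, (8, 16))     # (batch, seq_len) of token ids
h = torch.zeros(8, 64)
loss = 0.0
for t in range(x.size(1) - 1):
    h = torch.tanh(emb(x[:, t]) + rnn(h))
    loss = loss + F.cross_entropy(out(h), x[:, t + 1])  # predict the next token
loss.backward()   # gradients flow back through all time steps: BPTT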
Concept of a multilayer RNN and why it's used
The outputs of a first RNN are used as the input for a second RNN (see the sketch below). Motivation: while the unrolled model is very deep in principle, in a single-layer RNN each predicted token depends on only one linear layer between input and output, so stacking RNNs adds real depth per prediction.
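A minimal sketch of stacking, assuming PyTorch's nn.RNN and illustrative sizes:

import torch
import torch.nn as nn

rnn1 = nn.RNN(input_size=64, hidden_size=64, batch_first=True)
rnn2 = nn.RNN(input_size=64, hidden_size=64, batch_first=True)

x = torch.randn(8, 16, 64)   # (batch, seq_len, features)
out1, _ = rnn1(x)            # outputs of the first RNN at every time step
out2, _ = rnn2(out1)         # fed as the input to the second RNN

# Equivalent shortcut: nn.RNN(64, 64, num_layers=2, batch_first=True)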
Why is Long Short-Term Memory (LSTM) needed?
It helps to separately learn (1) the information required to predict the next token and (2) contextual information carried across the tokens already seen (e.g. remembering the subject's gender for later words).
How does an LSTM work?
It maintains a second hidden state, the "cell state", which carries the longer-term contextual memory, while the usual hidden state focuses on predicting the next token.
The four main networks (gates) in an LSTM
Forget gate (what to erase from the cell state), input gate (which cell-state entries to update, e.g. the "gender" slot), cell gate (the candidate values to write, e.g. "female"), output gate (what to expose as the new hidden state)
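A minimal sketch of one LSTM step with the four gates spelled out, assuming PyTorch and illustrative sizes:

import torch
import torch.nn as nn

hidden_size = 64
forget_gate = nn.Linear(2 * hidden_size, hidden_size)
input_gate  = nn.Linear(2 * hidden_size, hidden_size)
cell_gate   = nn.Linear(2 * hidden_size, hidden_size)
output_gate = nn.Linear(2 * hidden_size, hidden_size)

def lstm_step(x, h, c):
    hx = torch.cat([h, x], dim=1)
    f = torch.sigmoid(forget_gate(hx))  # what to erase from the cell state
    i = torch.sigmoid(input_gate(hx))   # which cell entries to update (e.g. the "gender" slot)
    g = torch.tanh(cell_gate(hx))       # candidate values to write (e.g. "female")
    o = torch.sigmoid(output_gate(hx))  # what to expose as the new hidden state
    c = f * c + i * g                   # updated cell state
    h = o * torch.tanh(c)               # updated hidden state
    return h, c

x = torch.randn(8, hidden_size)
h = c = torch.zeros(8, hidden_size)
h, c = lstm_step(x, h, c)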
What is dropout?
During each training iteration, neurons are randomly deactivated with probability p. To compensate, the surviving activations are multiplied by 1/(1 - p) during training.
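A minimal sketch of (inverted) dropout matching this description, assuming PyTorch:

import torch

def dropout(activations, p=0.5, training=True):
    if not training or p == 0.0:
        return activations                      # at inference, dropout is a no-op
    mask = (torch.rand_like(activations) > p).float()   # drop with probability p
    return activations * mask / (1 - p)         # rescale so the expected value is unchanged

x = torch.ones(4, 4)
print(dropout(x, p=0.5))   # ~half the entries are 0, the survivors are 2.0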
Drawbacks of dropout?
Less input is used to calculate the output, since the deactivated neurons contribute nothing during that iteration.
Other regularisation techniques
Weight decay, Activation regularisation (AR), Temporal activation regularisation (TAR)
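A minimal sketch of the AR and TAR penalty terms, assuming PyTorch; acts stands in for the LSTM activations, and the coefficients alpha and beta are illustrative hyperparameters:

import torch

acts = torch.randn(8, 16, 64, requires_grad=True)   # (batch, seq_len, hidden)
alpha, beta = 2.0, 1.0

ar  = alpha * acts.pow(2).mean()                          # AR: penalise large activations
tar = beta * (acts[:, 1:] - acts[:, :-1]).pow(2).mean()   # TAR: penalise big jumps between time steps

loss = ar + tar   # added on top of the usual cross-entropy loss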
Weight tying?
The mapping from input to hidden (the embedding matrix) and the mapping from hidden to output (the decoder) share the same weight matrix. Used in AWD-LSTM.
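A minimal sketch of weight tying in PyTorch, with illustrative sizes:

import torch.nn as nn

vocab_size, hidden_size = 1000, 64
embedding = nn.Embedding(vocab_size, hidden_size)  # input -> hidden
decoder = nn.Linear(hidden_size, vocab_size)       # hidden -> output
decoder.weight = embedding.weight                  # same parameter tensor for both mappings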
In an RNN, the embedding of a token does not depend on its position in the sequence.
True. The token embedding is determined by a lookup table (the embedding matrix), which assigns a fixed vector representation to each token based on its vocabulary index, independent of position.
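A quick check of this in PyTorch (illustrative sizes):

import torch
import torch.nn as nn

emb = nn.Embedding(1000, 64)
seq = torch.tensor([[5, 7, 5]])              # token 5 appears at positions 0 and 2
vecs = emb(seq)
print(torch.equal(vecs[0, 0], vecs[0, 2]))   # True: same token, same vector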
Which factors influence the number of parameters in an RNN for next-token prediction?
Size of the embedding, Number of tokens in the vocab
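A back-of-the-envelope count under the SimpleRNN layout sketched earlier, with illustrative sizes:

vocab_size, hidden_size = 1000, 64

embedding_params = vocab_size * hidden_size                 # input -> hidden lookup table
hidden_params    = hidden_size * hidden_size + hidden_size  # recurrent linear layer (+ bias)
output_params    = hidden_size * vocab_size + vocab_size    # hidden -> vocab logits (+ bias)

print(embedding_params + hidden_params + output_params)     # 133160 here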
Techniques used in AWD-LSTM?
Activation regularisation (AR), temporal activation regularisation (TAR), several forms of dropout (including weight dropout on the recurrent weights), weight tying, and weight decay
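A minimal sketch of weight dropout (DropConnect applied to the hidden-to-hidden weights rather than to activations), assuming PyTorch; this is illustrative, not the exact AWD-LSTM implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

h2h = nn.Linear(64, 64)
p = 0.5

def forward_with_weight_dropout(h):
    w = F.dropout(h2h.weight, p=p, training=True)  # drop weights, not activations
    return h @ w.t() + h2h.bias

h = torch.randn(8, 64)
out = forward_with_weight_dropout(h)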