Quiz 4 Flashcards
RNN forward update rule
- At each time step, the input and previous hidden state are fed through a linear layer.
- Followed by a non-linearity (tanh)
- Passed through the output linear layer to get logits
- Softmax to get probabilities
- Compute the CE loss
Formally:
U, V, W = weights for input-to-hidden, hidden-to-output, hidden-to-hidden
a_t = Ux_t + Wh_{t-1} + b
h_t = tanh(a_t)
o_t = Vh_t + c
y_hat_t = softmax(o_t)
L_t = CE(y_hat_t, y_t)
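A minimal NumPy sketch of this forward pass and the summed CE loss (all sizes, weights, and targets below are arbitrary placeholders, not from the quiz):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 4, 6                # arbitrary sizes
U = rng.normal(size=(d_h, d_in))                # input-to-hidden
W = rng.normal(size=(d_h, d_h))                 # hidden-to-hidden
V = rng.normal(size=(d_out, d_h))               # hidden-to-output
b, c = np.zeros(d_h), np.zeros(d_out)

x = rng.normal(size=(T, d_in))                  # input sequence x_1..x_T
y = rng.integers(0, d_out, size=T)              # target class at each step
h = np.zeros(d_h)
total_loss = 0.0
for t in range(T):
    a = U @ x[t] + W @ h + b                    # a_t
    h = np.tanh(a)                              # h_t
    o = V @ h + c                               # o_t (logits)
    y_hat = np.exp(o - o.max())
    y_hat /= y_hat.sum()                        # softmax
    total_loss += -np.log(y_hat[y[t]])          # CE loss at step t
print(total_loss)                               # sequence loss = sum over steps
```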
RNN: How is loss calculated over entire sequence
Loss at each time step is summed
RNN: True or False. hidden weights are shared across the sequence.
True
RNN: Advantage of sharing parameters across sequence
Sharing allows generalization to sequence lengths that did not appear in the training set.
RNN: RNN architecture must always pass the hidden state to next sequence
False. The Goodfellow book shows examples where the output from t-1 is passed to the hidden layer at t instead.
RNN: RNN architecture that passes only the output to next time step is likely to be less powerful.
True. If the hidden state is not passed, the output alone is unlikely to carry all the important information from the past.
Vanishing gradient
Gradients diminish as they are backpropagated through time, leading to little or no learning of long-range dependencies.
Exploding gradient
Gradients grow as they are backpropagated through time, leading to unstable learning.
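A toy NumPy illustration of both effects (this only repeats the recurrent Jacobian, it is not a full BPTT derivation; the sizes and scales are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    W = scale * np.eye(4)            # recurrent weights with singular values = scale
    g = rng.normal(size=4)           # some upstream gradient
    for _ in range(50):              # flow back through 50 time steps
        g = W.T @ g                  # each step multiplies by (roughly) W^T
    print(label, np.linalg.norm(g))  # tiny norm vs. enormous norm
```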
RNN cons
- Vanishing gradients
- Exploding gradients
- Limited “memory” - can handle short-term dependencies but not long-term ones
- Lacks control over memory - unlike an LSTM, it has no mechanism to control what information should be kept.
RNN: Advantage over LSTM
Fewer parameters
LSTM update rule - components and flow
Sequence (FICO):
1. Forget gate
2. Input gate
3. Cell gate
4. Output gate
LSTM: What state is passed along the entire sequence of cells
Cell state
LSTM: What does forget gate do
Takes the input and hidden state and decides what information from the cell state should be thrown away or kept.
LSTM: What does input gate do
Decides what new information should be written to the cell state.
LSTM: What does cell gate do
Combines information from the forget gate, input gate, and candidate cell state to produce the new cell state at time step t.
LSTM: What does output gate do
Decides the next hidden state based on the updated cell state.
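A one-step NumPy sketch tying the four gates (FICO) together; the weight names W_f, W_i, W_g, W_o and sizes are illustrative only, and biases are omitted:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_f, W_i, W_g, W_o = [rng.normal(size=(d_h, d_in + d_h)) for _ in range(4)]

x_t = rng.normal(size=d_in)
h_prev, c_prev = np.zeros(d_h), np.zeros(d_h)
z = np.concatenate([x_t, h_prev])     # current input + previous hidden state

f = sigmoid(W_f @ z)                  # forget gate: what to keep from c_prev
i = sigmoid(W_i @ z)                  # input gate: how much new info to write
g = np.tanh(W_g @ z)                  # candidate cell values
c_t = f * c_prev + i * g              # cell update combines all of the above
o = sigmoid(W_o @ z)                  # output gate
h_t = o * np.tanh(c_t)                # next hidden state from updated cell state
print(h_t, c_t)
```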
LSTM: Pros over RNN
Controls the flow of gradients so that they neither vanish nor explode.
RNN: What is a recursive neural network (RecNNs, not RNN!), and what are its advantages
- Can handle hierarchical structure
- Reduces vanishing gradients by composing nested, shorter networks (the tree depth is much smaller than the sequence length).
GRU: Main difference to LSTM
A single update gate controls both forgetting and the cell-state update (and the GRU merges the cell and hidden states).
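A one-step NumPy GRU sketch for comparison (weight names and sizes are illustrative, biases omitted): the single update gate z both erases old state and admits the candidate.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_z, W_r, W_h = [rng.normal(size=(d_h, d_in + d_h)) for _ in range(3)]

x_t, h_prev = rng.normal(size=d_in), np.zeros(d_h)
xh = np.concatenate([x_t, h_prev])
z = sigmoid(W_z @ xh)                                   # update gate
r = sigmoid(W_r @ xh)                                   # reset gate
h_cand = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))
h_t = (1 - z) * h_prev + z * h_cand                     # one gate forgets and writes
print(h_t)
```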
Gradient clipping: main use
Control exploding gradients
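Typical PyTorch usage sketch (the stand-in model, data, and the max_norm=1.0 threshold are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))

opt.zero_grad()
F.cross_entropy(model(x), y).backward()
# Rescale all gradients so their combined norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```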
Why does MLP not work for modeling sequences
- Can’t support variable-sized inputs or outputs
- No inherent temporal structure
- Cannot maintain a “state” of the sequence
LM: Define Language Models
Models that estimate the probabilities of sequences, allowing us to compare them (e.g., which sentence is more likely).
LM: How are probabilities of an input sequence of words calculated?
Chain rule of probability.
Probability of sentence = Product of conditional probabilities over i, which indexes our words.
p(s) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_1, …, w_{n-1})
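A toy numeric example with made-up conditional probabilities for the sentence “the cat sat”:

```python
import math

cond_probs = [
    0.20,   # p(the)
    0.05,   # p(cat | the)
    0.10,   # p(sat | the, cat)
]
p_sentence = math.prod(cond_probs)              # 0.001
log_p = sum(math.log(p) for p in cond_probs)    # sum of logs avoids underflow
print(p_sentence, log_p)
```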
LM: 3 applications of language modeling
- predictive typing
- ASR
- grammar correction
LM: How is “conditional” language modeling different
Adds an extra context “c” to the chain rule of probability:
p(s | c) = product over i of p(w_i | c, w_1, …, w_{i-1})
Conditional LM: What is context and sequence for
Topic-aware language model
C = topic
S = text
Conditional LM: What is context and sequence for
Text summarization
C = long document
S = summary
Conditional LM: What is context and sequence for
Machine Translation
C = French text
S = English text
Conditional LM: What is context and sequence for
Image captioning
C = image
S = caption
Conditional LM: What is context and sequence for
OCR
C = image of a line
S = its content
Conditional LM: What is context and sequence for
Speech Recognition
C = recording
S = its transcription
Teacher forcing
During training, the model is fed the true target token from the previous time step, not its own generated output.
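A minimal PyTorch sketch of a decoder trained with teacher forcing (the vocabulary, sizes, and the assumption that index 0 is a <bos> token are all illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, hidden, batch, T = 10, 16, 2, 5
embed = nn.Embedding(vocab, hidden)
cell = nn.GRUCell(hidden, hidden)
out = nn.Linear(hidden, vocab)

targets = torch.randint(0, vocab, (batch, T))   # ground-truth sequence
h = torch.zeros(batch, hidden)
inp = torch.zeros(batch, dtype=torch.long)      # assume index 0 is <bos>

loss = 0.0
for t in range(T):
    h = cell(embed(inp), h)
    logits = out(h)
    loss = loss + F.cross_entropy(logits, targets[:, t])
    inp = targets[:, t]        # teacher forcing: feed the TRUE previous token
    # at inference we would instead feed: inp = logits.argmax(dim=-1)
print(loss / T)
```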
Knowledge distillation
A smaller model (the student) learns to mimic the predictions of a larger, more complex model (the teacher), transferring the teacher’s knowledge to the student.
T or F: Hard labels are passed from teacher model to student
False. Soft labels (the teacher’s full probability distribution) give more signal than hard labels.
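A minimal sketch of the soft-label distillation loss, assuming teacher and student logits for the same batch already exist; the temperature T = 2.0 and shapes are arbitrary:

```python
import torch
import torch.nn.functional as F

T = 2.0                                    # softening temperature
teacher_logits = torch.randn(8, 10)        # stand-in for the teacher's outputs
student_logits = torch.randn(8, 10, requires_grad=True)

soft_targets = F.softmax(teacher_logits / T, dim=-1)        # soft labels
log_student = F.log_softmax(student_logits / T, dim=-1)
distill_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
distill_loss.backward()    # in practice this is mixed with the hard-label CE loss
```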