Quiz 4 Flashcards
RNN forward update rule
- At each time step, the input and the previous hidden state are fed through a linear layer.
- Followed by a non-linearity (tanh)
- Passed through the output linear layer to get logits
- Softmax to get probabilities
- Compute CE loss
Formally:
U, V, W = weights for input-to-hidden, hidden-to-output, hidden-to-hidden
a_t = Ux_t + Wh_{t-1} + b
h_t = tanh(a_t)
o_t = Vh_t + c
y_hat_t = softmax(o_t)
L_t = CE(y_hat_t, y_t)
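A minimal NumPy sketch of one forward step following the update rule above (the dimensions, random weights, and target index are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes (assumptions): input dim 5, hidden dim 8, vocab size 10
rng = np.random.default_rng(0)
U = rng.normal(size=(8, 5))       # input-to-hidden
W = rng.normal(size=(8, 8))       # hidden-to-hidden
V = rng.normal(size=(10, 8))      # hidden-to-output
b, c = np.zeros(8), np.zeros(10)

def rnn_step(x_t, h_prev):
    a_t = U @ x_t + W @ h_prev + b    # linear combination of input and previous hidden
    h_t = np.tanh(a_t)                # non-linearity
    o_t = V @ h_t + c                 # output logits
    y_hat_t = softmax(o_t)            # probabilities
    return h_t, y_hat_t

h = np.zeros(8)
h, y_hat = rnn_step(rng.normal(size=5), h)
loss_t = -np.log(y_hat[3])            # CE loss if the true class index is 3
```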
RNN: How is loss calculated over entire sequence
The loss at each time step is summed over the sequence: L = sum_t L_t
RNN: True or False. hidden weights are shared across the sequence.
True
RNN: Advantage of sharing parameters across sequence
Sharing allows generalization to sequence lengths that did not appear in the training set.
RNN: RNN architecture must always pass the hidden state to next sequence
False. The Goodfellow book shows examples where the output from t-1 is passed to the hidden layer at t.
RNN: RNN architecture that passes only the output to next time step is likely to be less powerful.
True. If the hidden state is not passed, the output alone lacks important information about the past.
Vanishing gradient
Gradients diminish as they are backpropagated through time, leading to little or no learning.
Exploding gradient
Gradients grow as they are backpropagated through time, leading to unstable learning.
RNN cons
- Vanishing gradients
- Exploding gradients
- Limited “memory” - can handle short-term dependencies but not long-term ones
- Lack of control over memory - unlike LSTM, there is no mechanism to control what information should be kept.
RNN: Advantage over LSTM
Fewer parameters (smaller model size)
LSTM update rule - components and flow
Sequence (FICO):
1. Forget gate
2. Input gate
3. Cell gate
4. Output gate
LSTM: What state is passed throughout the layer
Cell state
LSTM: What does forget gate do
Takes the input and hidden state and decides what information from the cell state should be thrown away or kept.
LSTM: What does input gate do
Decides what new information should be added to the cell state.
LSTM: What does cell gate do
Combines information from forget gate, input gate, and candidate cell state to output new cell state at time step t
LSTM: What does output gate do
Decides the next hidden state based on the updated cell state.
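A minimal NumPy sketch of a single LSTM step tying the four gates above together (dimensions and random weights are illustrative assumptions; real implementations fuse the gate projections):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 5, 8                                  # assumed input and hidden sizes
Wf, Wi, Wg, Wo = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))
bf, bi, bg, bo = (np.zeros(d_h) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(Wf @ z + bf)        # forget gate: what to keep from the old cell state
    i_t = sigmoid(Wi @ z + bi)        # input gate: how much new information to write
    g_t = np.tanh(Wg @ z + bg)        # candidate cell state
    c_t = f_t * c_prev + i_t * g_t    # cell update: combine forget, input, candidate
    o_t = sigmoid(Wo @ z + bo)        # output gate: what to expose as the hidden state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)
```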
LSTM: Pros over RNN
Controls the flow of the gradient so that it neither vanishes nor explodes.
RNN: What is a recursive neural network (RecNNs, not RNN!), and what are its advantages
- Can handle hierarchical structure
- Reduce vanishing gradients because the tree structure nests shorter sub-networks (shorter paths between distant inputs).
GRU: Main difference to LSTM
A single update gate controls both forgetting and the state update (and there is no separate cell state).
Gradient clipping: main use
Control exploding gradients
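A small PyTorch sketch of norm-based gradient clipping between backward() and the optimizer step (the model, the dummy loss, and max_norm=1.0 are placeholder assumptions):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=5, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 7, 5)                 # (batch, seq_len, input_size)
out, h = model(x)
loss = out.pow(2).mean()                 # dummy loss just to produce gradients
loss.backward()
# Rescale gradients so their global norm does not exceed 1.0, then step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```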
Why does MLP not work for modeling sequences
- Can’t support variable-sized inputs or outputs
- No inherent temporal structure
- Cannot maintain a “state” of the sequence
LM: Define Language Models
Models that estimate the probability of a sequence, which lets us compare sequences (e.g., decide which is more likely).
LM: How are probabilities of an input sequence of words calculated?
Chain rule of probability.
Probability of sentence = Product of conditional probabilities over i, which indexes our words.
p(s) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_1, …, w_{n-1})
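A tiny sketch of the chain-rule decomposition with made-up conditional probabilities (the numbers are illustrative, not from a real LM); summing log-probabilities avoids numerical underflow:

```python
import math

# p(s) = p(w_1) * p(w_2 | w_1) * ... for a 4-word sentence
conditionals = [0.2, 0.5, 0.1, 0.4]                 # p(w_i | w_1, ..., w_{i-1})
p_s = math.prod(conditionals)                        # 0.004
log_p_s = sum(math.log(p) for p in conditionals)     # log-space version
print(p_s, math.exp(log_p_s))                        # same value either way
```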
LM: 3 applications of language modeling
- predictive typing
- ASR
- grammar correction
LM: How is “conditional” language modeling different
Adds an extra context “c” to the chain rule of probability:
p(s | c) = product over i of p(w_i | c, w_1, …, w_{i-1})
Conditional LM: What is context and sequence for
Topic-aware language model
C = topic
S = text
Conditional LM: What is context and sequence for
Text summarization
C = long document
S = summary
Conditional LM: What is context and sequence for
Machine Translation
C = French text
S = English text
Conditional LM: What is context and sequence for
Image captioning
C = image
S = caption
Conditional LM: What is context and sequence for
OCR
C = image of a line
S = its content
Conditional LM: What is context and sequence for
Speech Recognition
C = recording
S = its transcription
Teacher forcing
During training, the model is fed the true target tokens as inputs at each step, rather than its own generated output.
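A short PyTorch sketch of teacher forcing with an RNN language model: the inputs at each step are the true previous tokens, and the targets are the same sequence shifted by one (vocabulary size, dimensions, and the token sequence are assumptions):

```python
import torch
import torch.nn as nn

vocab, d_emb, d_h = 10, 8, 16
emb = nn.Embedding(vocab, d_emb)
rnn = nn.RNN(d_emb, d_h, batch_first=True)
head = nn.Linear(d_h, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.tensor([[1, 4, 2, 7, 3]])          # one gold sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next token at each position
h0 = torch.zeros(1, 1, d_h)
out, _ = rnn(emb(inputs), h0)                     # inputs are the *true* previous tokens
logits = head(out)                                # (1, seq_len-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
```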
Knowledge distillation
Smaller model (student) learns to mimic the predictions of a larger, more complex model (teacher) by transferring its knowledge
T or F: Hard labels are passed from teacher model to student
False. Soft labels give more signal than hard labels.
Knowledge distillation loss components: Teacher-student loss
Loss (e.g., CE) between the teacher's and the student's prediction scores;
the teacher's labels can be hard or soft.
Knowledge distillation loss components: Student loss
CE loss between student prediction and ground truth
Knowledge distillation loss components: Combined loss
weight_1 * teacher-student loss + weight_2 * student loss (a weighted sum of the two)
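A sketch of the combined loss. The weight alpha and the softmax temperature T are assumptions, and the teacher-student term is written here as a KL divergence on softened distributions, a common formulation of the soft-label loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    # Teacher-student loss on softened (soft-label) distributions
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Student loss against ground-truth hard labels
    ce = F.cross_entropy(student_logits, targets)
    # Weighted combination of the two terms
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, targets)
```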
Define cross-entropy loss
Expected number of bits required to represent an event from the reference distribution (P*) when using a coding scheme optimal for P.
Define per-word cross-entropy
Cross-entropy averaged over all the words in a sequence.
What is the reference distribution (P*) in per-word cross entropy?
Empirical distribution of the words in the sequence
Define perplexity
Geometric mean of the inverse probability of a sequence of words.
What is the perplexity of choosing 1 outcome of a fair 10-sided die?
10
(the perplexity of a discrete uniform distribution over k outcomes is k)
Define perplexity using law of logarithms
Perplexity is the exponential of the per-word cross-entropy (equivalently, the log of the perplexity equals the per-word cross-entropy).
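A tiny sketch showing that exponentiating the per-word cross-entropy gives the same number as the geometric mean of the inverse probabilities (the per-word probabilities are made up):

```python
import math

probs = [0.25, 0.1, 0.5, 0.2]                                  # model's per-word probabilities
per_word_ce = -sum(math.log(p) for p in probs) / len(probs)    # nats per word
perplexity = math.exp(per_word_ce)
# Geometric mean of the inverse probabilities gives the same value
geo_mean_inverse = math.prod(1 / p for p in probs) ** (1 / len(probs))
print(perplexity, geo_mean_inverse)
```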
Define pretraining task
A task that is not the final task itself, but one that helps us obtain better initial parameters for modeling the final task.
Masked language models: Key idea
Model learns to predict masked tokens of a sequence.
Embeddings: Distributional semantics
A word’s meaning is given by the words that frequently appear close-by
Embeddings: Key idea of “A Neural Probabilistic Language Model” Bengio, 2003
Map each word in the vocabulary to a feature vector and use those vectors to predict the next word.
Embeddings: Key idea of “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning” Collobert & Weston, 2008 & “Natural Language Processing (Almost from Scratch)” Collobert et al., 2011
Use a CNN to create word embeddings.
Nearest neighbors (KNN) of the vectors show similar syntax + semantics.
“Efficient Estimation of Word Representations in Vector Space” Mikolov et al., 2013
Word2Vec, continuous bag of words (CBOW), skip-gram
Collobert & Weston vectors. Key idea:
A word and its context form a positive training sample; a random word in that same context gives a negative training sample.
positive: “cat chills on a mat”
negative: “cat chills Ohio a mat”
Skip gram objective function
(average) negative log-likelihood
Skip gram parameters
word vectors
Skip gram probability P(w_{t+j} | w_t; theta) - how is it defined?
Softmax over the inner products of the center word vector and the context word vectors; it measures how likely the context word is to appear with the center word.
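A NumPy sketch of the skip-gram softmax, assuming separate center (input) and context (output) vector tables and illustrative sizes; the full-vocabulary normalization is the expensive part mentioned in the next card:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                        # vocab size and embedding dim (assumptions)
W_in = rng.normal(size=(V, d))         # center-word vectors
W_out = rng.normal(size=(V, d))        # context-word vectors

def p_context_given_center(center_id, context_id):
    scores = W_out @ W_in[center_id]               # one inner product per vocabulary word
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # normalizing over all V words is expensive
    return probs[context_id]

p = p_context_given_center(center_id=42, context_id=7)
```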
Skip gram probability - what makes computation expensive
The softmax normalization sums over the entire vocabulary.
Skip gram: Ways to reduce computation
- Hierarchical softmax
- Negative sampling
GloVe: Key idea
Embeddings are trained on global word co-occurrence statistics from a corpus.
fastText: Key idea
Handles OOV words (via subword information) and is multilingual.
Intrinsic evaluation of word embeddings
Evaluated on a sub-task
Extrinsic evaluation of word embeddings
Evaluated on a downstream task
Intrinsic evaluation of word embeddings - example
word analogy task (man:woman, king:?)
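A sketch of the analogy test by vector arithmetic; the embedding dictionary here is a random stand-in, so only with trained vectors would the nearest neighbor ideally be "queen":

```python
import numpy as np

# Placeholder embeddings keyed by word (random; real evaluations use trained vectors)
emb = {w: np.random.default_rng(i).normal(size=50)
       for i, w in enumerate(["man", "woman", "king", "queen", "apple"])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]       # man:woman :: king:?
candidates = {w: cosine(query, v) for w, v in emb.items()
              if w not in ("king", "man", "woman")}
answer = max(candidates, key=candidates.get)          # ideally "queen" with trained vectors
```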
Graph embedding definition
Learn features such that connected nodes are more similar than unconnected nodes
Hyperbolic embeddings - what is it good for?
Better at modeling hierarchical structures
t-SNE key idea
Measures pairwise similarities in the high-dimensional space and performs SGD to minimize the divergence between the high-dimensional and low-dimensional similarity distributions.
Encoders can handle language modeling, T/F
False. Encoders see bidirectional context, so they cannot be used for (left-to-right) language modeling.
Self-attention: Sizes of Query, Key, and Value matrices, given:
1. Hidden dimension: d_model
2. Q,K,V dimension: d_q, d_k, d_v
Query: d_model * d_k
Key: d_model * d_k
Value: d_model * d_v
Self-attention: Dimension of self-attention output (O), given:
1. Hidden dimension: d_model
2. Q,K,V dimension: d_q, d_k, d_v
3. Number of heads h
Output projection W^O: in_features = h * d_v, out_features = d_model
Define “cross-attention”
Combines two different input sequences:
1. Sequence returned by the encoder
2. Sequence processed by the decoder.
Attention: big O for each layer, given:
1. seq length “n”
2. representation dimension “d”
O(n^2 * d)
Attention: big O for sequential operations
O(1)
Attention: big O for maximum path length
O(1)
Why is self-attention layer’s sequential operation O(1) compared to RNN’s O(n)?
A self-attention layer connects all positions in parallel (a constant number of sequential operations), while an RNN must step through positions one at a time.
Self-attention layers are faster than recurrent layers when the _____ is smaller than the _________
sequence length,
representation dimensionality
Multi-head attention:
How to get size of d_k and d_v given:
1. dimension of hidden: d_model
2. number of heads (h)
d_model / h
Multi-head attention:
Given d_model (512) and number of heads (8), what is d_k?
64 (d_model / h)
Attention:
What is the formula for scaled dot product?
Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V (where d_k = d_q, the query/key dimension)
Scaled dot product attention:
As Query vector increases in dimension, magnitudes of dot product similarities increase. How do we mitigate this?
Scale by dividing the dot products by the square root of the query/key vector dimension (sqrt(d_k)).
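A NumPy sketch of single-head scaled dot-product attention with the sqrt(d_k) scaling (sequence length and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                  # sequence length, key/query dim, value dim
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

scores = Q @ K.T / np.sqrt(d_k)        # scaling keeps dot products from growing with d_k
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                   # (n, d_v): each position is a weighted sum of values
```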
Cross-attention: Where do K, V values come from
Encoder output
Cross-attention: Where does the Q value come from
Masked multi-head attention from decoder
Self Attention:
Q, K, V matrix shapes
(i.e., A * B, what are A and B)
K = D_X * D_K
Q = D_X * D_Q
V = D_X * D_V
Purpose of Key vectors
Compare inputs to Queries
Purpose of Value vectors
Carry the information that is returned (weighted by the Q-K comparison) to the next layer.
3 types of attention in Transformer
- Cross-attention - Encoder K and V plus Q from decoder
- Self-attention (Encoder) - K, V, Q from output of word + pos embedding
- Self-attention (Decoder) - K, V, Q from (masked) output embedding t-1
Difference of Encoder-Decoder Attention vs self-attention
In E/D attention, Queries come externally (e.g., from the decoder). In self-attention, Queries come from the inputs themselves.
What happens if you permute the order of the inputs in an self-attention layer?
Permutation equivariant: the outputs contain the same values, just permuted in the same order as the inputs.
Since self-attention is permutation equivariant, what must be added beforehand to propagate the order of the input sequence?
Add position embedding to word embedding.
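A small PyTorch sketch of adding a learned position embedding to the word embedding before the self-attention stack (sizes and the token sequence are assumptions):

```python
import torch
import torch.nn as nn

vocab, max_len, d_model = 100, 16, 32       # illustrative sizes
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)

tokens = torch.tensor([[5, 9, 2, 7]])                    # (batch=1, seq_len=4)
positions = torch.arange(tokens.size(1)).unsqueeze(0)    # [[0, 1, 2, 3]]
x = tok_emb(tokens) + pos_emb(positions)                 # input to the attention stack
```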