Quiz #5 Flashcards
What is ‘attention’?
Weighting or probability distribution over inputs that depends on computational state and inputs.
What does attention allow us to do?
It allows information to propagate between “distant” computational nodes while making minimal structural assumptions.
What is one of the most popular tools used for building attention mechanisms?
Softmax
Softmax is differentiable? (True/False)
True
What are the inputs to softmax attention?
A set of vectors {u1, u2, … u_n} and a query vector ‘q’. What we want to do is (softly) select the vector most similar to q via p = softmax(Uq), where the rows of U are the u_i.
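(A minimal NumPy sketch of this; the vectors here are random placeholders, purely for illustration:)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

n, d = 5, 4
U = np.random.randn(n, d)          # the set {u1, ..., u_n}, stacked as rows
q = np.random.randn(d)             # the query vector

p = softmax(U @ q)                 # p = softmax(Uq): weights over the inputs
summary = p @ U                    # soft "selection": weighted average of the u_i
```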
What is the difference between softmax applied to the final layer of an MLP and softmax attention?
- When Softmax is applied at the final layer of an MLP:
- q is the last hidden state, and {u1, u2, … u_n} are the embeddings of the class labels
- Samples from the distribution correspond to labelings (outputs)
- In Softmax attention:
- q is an internal hidden state, and {u1, u2, … u_n} are the embeddings of an “input” (e.g. the previous layer)
- Samples from the distribution correspond to a summary of {u1, u2, … u_n} (see the sketch below)
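(A side-by-side sketch of the two uses; the class-embedding matrix W and input matrix U below are made-up placeholders:)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, n_classes, n_inputs = 4, 3, 5

# (1) Softmax at the final layer of an MLP:
#     q is the last hidden state, rows of W act as class-label embeddings,
#     and a sample from p_classes is an output label.
q_mlp = np.random.randn(d)
W = np.random.randn(n_classes, d)
p_classes = softmax(W @ q_mlp)

# (2) Softmax attention:
#     q is an internal hidden state, rows of U are embeddings of the input
#     (e.g. the previous layer), and the distribution is used to summarize U.
q_att = np.random.randn(d)
U = np.random.randn(n_inputs, d)
summary = softmax(U @ q_att) @ U
```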
What was the biological inspiration for attention mechanisms?
Saccades (basically rapid, discontinuous eye movements between salient objects in the visual field).
At a conceptual level, what are visual attention mechanisms trying to do?
Given the current state/history of glimpses, decide where (and at what scale) to look next.
The representational power of a softmax attention layer (or more generally, any attention layer) decreases as the input size grows? (True/False)
False. It increases. This is because the size of the hidden state grows with the size of the input.
How is the similarity of query vector ‘q’ typically computed compared to the set of hidden state vectors {u1, u2, … u_n}?
Cosine similarity (i.e. the inner product of q with each of the u vectors).
What is the ‘controller state’ in a Softmax Attention layer?
Initially, it’s just the input query vector q0. We use q0 to compute the next controller state (i.e. hidden state) q1, then use q1 to compute q2, and so on.
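(A possible sketch of this iteration; the update rule here, where each new controller state is simply the attention summary produced by the previous one, is an assumption for illustration:)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(q, U):
    p = softmax(U @ q)             # weights over the inputs given the current state
    return p @ U                   # next controller state = summary of the inputs

n, d = 6, 4
U = np.random.randn(n, d)
q = np.random.randn(d)             # q0: the initial query / controller state
for _ in range(3):                 # q1, q2, q3, ...
    q = attention_step(q, U)
```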
What are transformer models?
Models that are made up of multiple attention layers.
What are three important architectural distinctions that result in the superior performance of transformer models compared to previous attention based architectures?
- Multi-query hidden state propagation (“self-attention”)
- Multi-head attention
- Residual connections, LayerNorm
What is self-attention?
It uses a separate controller state (i.e. query/hidden state) for every single input. So the size of the controller state grows with the size of the input, giving it even more representational power than traditional attention networks (remember that the representational power of an attention network grows with its input size).
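(A rough single-head self-attention sketch; the projection matrices Wq, Wk, Wv are random placeholders:)

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 6, 4
X = np.random.randn(n, d)                  # one embedding per input token
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q = X @ Wq                                 # one query (controller state) per input
K = X @ Wk
V = X @ Wv
A = softmax_rows(Q @ K.T / np.sqrt(d))     # n x n attention weights
H = A @ V                                  # n updated states: grows with the input
```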
What is multi-head attention?
Multi-head attention combines multiple attention ‘heads’ that are trained in the same way on the same data, but with different weight matrices, so they yield different values.
Each of the ‘L’ attention heads yields values for each token; these values are then multiplied by trained parameters and added (sketched below).
(To me this seems kind of similar to the idea in convnets of using multiple “filters” for each convolutional layer so we can learn different feature representations. )
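(A rough sketch of L heads combined by a trained output projection Wo; all weight matrices below are random placeholders, and the exact way head outputs are mixed varies across implementations:)

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d, L = 6, 8, 2                             # tokens, model dim, number of heads
dh = d // L                                   # per-head dimension
X = np.random.randn(n, d)

heads = []
for _ in range(L):                            # each head has its own Wq, Wk, Wv
    Wq, Wk, Wv = (np.random.randn(d, dh) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax_rows(Q @ K.T / np.sqrt(dh))
    heads.append(A @ V)                       # per-token values from this head

Wo = np.random.randn(d, d)                    # trained parameters that combine the heads
out = np.concatenate(heads, axis=-1) @ Wo     # combined multi-head output, shape (n, d)
```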