Quiz #5 Flashcards
Attention (ML definition)
A weighting or probability distribution over the inputs that depends on the computation state and the inputs.
Fill in the blank:
Attention allows information to [1] directly between [2] computation nodes while making [3] structural assumptions
1) propagate 2) “distant” 3) minimal
The most standard form of attention in current neural networks is implemented with which function?
Softmax
When you input a set of numbers into softmax, what does it return?
A probability distribution over that set of numbers.
Softmax is not permutation equivariant. (T/F)
False. Softmax is permutation equivariant. This means that a permutation of the inputs leads to the same permutation of the outputs.
What is the equation to compute the softmax over a set of numbers {x_1, x_2, …, x_n}?
Softmax({x_1, x_2, …, x_n}) = {e^{x_1}/Z,…,e^{x_n}/Z}
Where:
Z = Sum_{j=1}^n e^{x_j}
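A minimal numpy sketch of this formula (the shift by the max is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(x):
    """Softmax over a set of numbers {x_1, ..., x_n}: e^{x_i} / Z."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())   # subtracting the max keeps exp() from overflowing
    return e / e.sum()        # Z = sum_j e^{x_j}

print(softmax([1.0, 2.0, 3.0]))        # a probability distribution
print(softmax([1.0, 2.0, 3.0]).sum())  # always sums to 1
```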
The outputs of softmax always sum to [ ].
1
When you scale your inputs by some positive factor > 1, how do the outputs of softmax change?
Softmax becomes “more sure” of its prediction, putting more weight on the maximum input.
Softmax interpolates between two distributions. What are they? Hint: this is why Softmax should really be called “ArgSoftmax”
Softmax interpolates between:
- A distribution that selects an index uniformly at random
- A distribution that selects the argmax index with probability 1
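A small numpy sketch (illustrative values) showing both effects: scaling the inputs sharpens the distribution, moving softmax from near-uniform toward the argmax distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 4.0])
print(softmax(0.01 * x))  # tiny scale  -> close to uniform, roughly [0.33, 0.33, 0.34]
print(softmax(x))         # unscaled    -> roughly [0.04, 0.11, 0.84]
print(softmax(10.0 * x))  # large scale -> nearly one-hot on the argmax index
```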
Is softmax differentiable?
Yes!
Fill in the blanks:
Given a set of vectors U = {u_1, …, u_n} and a query vector q, the probability of selecting vector u_i, based on its score dot(u_i, q) relative to all the scores dot(U, q), is the [1] of the [2] function.
1. output
2. softmax
Fill in the blanks (1. and 2. are both two words):
When Softmax is applied at the final layer of a neural network, the query vector is the [1.] and {u_1, …, u_n} are the embeddings of the [2.].
1. hidden state
2. class labels
When Softmax is used for attention, the query vector is an [1.] [2.] and {u_1, …, u_n} are the [3.] of an [4.]. The resultant distribution corresponds to a weighted summary of {u_1, …, u_n}.
1. internal
2. hidden state
3. embeddings
4. “input” (e.g., the previous layer, or sentence encodings such as from lecture: u_1 = encoding of “Brian is a frog”, u_2 = encoding of “Lily is gray”, etc.)
Define “hard” attention
Samples are drawn from the distribution over the inputs.
Define “soft” attention
The distribution is used directly as the weights of a weighted average over the inputs.
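A minimal numpy sketch contrasting the two, assuming a query q and a set of input encodings U (the names and sizes are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 8))   # 5 input encodings u_1..u_5, dimension 8
q = rng.normal(size=8)        # query / controller state

a = softmax(U @ q)            # distribution over the 5 inputs

# Soft attention: use the distribution directly as weights for an average.
soft_out = a @ U              # sum_k a_k * u_k

# Hard attention: draw a sample from the distribution over the inputs.
k = rng.choice(len(U), p=a)
hard_out = U[k]
```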
What is the initial controller state (q) in a multi-layer soft attention model?
An encoding of the query, e.g., from lecture: “What color is Greg?”
How do we update the controller state when querying a multi-layer Softmax attention model?
The embedding that has the maximum probability after we perform Softmax on the inner products of the query and input vectors (i.e., the max-probability entry from softmax(dot(q, U))) becomes the new query vector.
How can we represent an “ordering” of the inputs in an attention model, where all inputs are consumed by the model at the same time?
For sequential data, or data with a temporal structure, we can use position encodings to indicate the “position” of each element in the input sequence.
How do you add position encodings to the inputs M?
For each input m_j, add a vector l(j) to it.
These vectors l(j) can be fixed during training, or learned.
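A minimal numpy sketch of both options, with made-up sizes; the sinusoidal form is one common choice of fixed encoding, not something specified on the card:

```python
import numpy as np

n, d = 6, 16                       # 6 inputs of dimension 16 (made-up sizes)
M = np.random.normal(size=(n, d))  # input encodings m_1..m_n

# Option 1: learned position encodings -- a trainable table, one row per position.
L_learned = np.random.normal(size=(n, d)) * 0.02

# Option 2: fixed (sinusoidal-style) position encodings.
pos = np.arange(n)[:, None]
freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
L_fixed = np.zeros((n, d))
L_fixed[:, 0::2] = np.sin(pos * freq)
L_fixed[:, 1::2] = np.cos(pos * freq)

M_with_pos = M + L_learned         # or M + L_fixed: add l(j) to each m_j
```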
What are three major improvements introduced with Transformers compared with earlier attention-based architectures?
- Multi-query hidden-state propagation (“self-attention”)
- Multi-head attention
- Residual connections and LayerNorm (these have improved performance across all NNs, not just Transformers)
What is the output of a single Transformer layer?
The weighted sum of the inputs, weighted by the attention weights, i.e., Sum_k a_k u_k
The weights are obtained with:
a = Softmax({dot(u_1, q), …, dot(u_n, q)})
It is fairly common in Transformers to split the input representation into keys and values. What are the keys and values used for when computing softmax attention?
- The keys are used to find the softmax weights
- The values are used to define the output
For example: a = Softmax({dot(u_1, q), …, dot(u_n, q)})
output = Sum_k a_k v_k
where U = keys, V = values
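A minimal numpy sketch of the key/value split; the projection matrices W_k and W_v are my own illustration of how keys and values are typically derived from the inputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))          # input representations
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

K = X @ W_k                          # keys   (U on the card)
V = X @ W_v                          # values
q = rng.normal(size=d)               # query

a = softmax(K @ q)                   # keys find the softmax weights
output = a @ V                       # values define the output: sum_k a_k v_k
```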
List three standard specializations of Transformers for text data.
- Position encodings depending on the location of the text.
- For language models, “causal attention”
- Training code outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
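A small numpy sketch of “causal attention”: each token may only attend to itself and earlier tokens. Masking future positions with a large negative score before the softmax is one common way to implement this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.normal(size=(n, n))            # scores[i, j] = dot(q_i, k_j)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i (the future)
scores = np.where(mask, -1e9, scores)             # block attention to future tokens
attn = softmax(scores, axis=-1)                   # each row sums to 1, lower-triangular
```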
A self-attention model has a controller for every input. (T/F).
True. This means that the size of the controller state grows with the size of the input.
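A minimal numpy sketch of self-attention with one query (controller) per input row, so the amount of controller state grows with n; the 1/sqrt(d) scaling is the usual scaled dot-product convention, not stated on the card:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))           # n inputs of dimension d
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # one query (controller) per input row
A = softmax(Q @ K.T / np.sqrt(d))     # n x n attention weights, one row per input
out = A @ V                           # one output per input; grows with n
```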