Quiz #5 Flashcards
Attention (ML definition)
A weighting or probability distribution over inputs that depends on the computation state and the inputs
Fill in the blank:
Attention allows information to [1] directly between [2] computation nodes while making [3] structural assumptions
1) propagate 2) “distant” 3) minimal
What function is the most standard form of attention in current neural networks implemented with?
Softmax
When you input a set of numbers into softmax, what does it return?
A probability distribution over that set of numbers.
True or False: Softmax is not permutation equivariant.
False. Softmax is permutation equivariant: a permutation of the inputs leads to the same permutation of the outputs.
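A quick numeric check of this property (a minimal sketch in plain Python; the helper name `softmax` is just for illustration):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; this does not change the output.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

a = softmax([1.0, 2.0, 3.0])
b = softmax([3.0, 1.0, 2.0])  # the same inputs, permuted

# Permuting the inputs permutes the outputs the same way:
# b == [a[2], a[0], a[1]] (up to floating-point rounding)
```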
What is the equation for the softmax over a set of numbers {x_1, x_2, …, x_n}?
Softmax({x_1, x_2, …, x_n}) = {e^{x_1}/Z,…,e^{x_n}/Z}
Where:
Z = Sum_{j=1}^n e^{x_j}
The outputs of softmax always sum to [ ].
1
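The formula above can be checked directly (a minimal sketch; the inputs are arbitrary, and Z is the normalizer from the card):

```python
import math

xs = [0.5, 1.5, -1.0]
Z = sum(math.exp(x) for x in xs)       # Z = sum_{j=1}^n e^{x_j}
probs = [math.exp(x) / Z for x in xs]  # i-th output = e^{x_i} / Z

total = sum(probs)  # always 1 (up to floating-point rounding)
```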
When you scale your inputs by some positive factor > 1, how do the outputs of softmax change?
Softmax becomes “more sure” of its prediction, putting more weight on the maximum input.
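Scaling the inputs by a positive factor greater than 1 sharpens the distribution (a sketch; the scale factor 3 is arbitrary):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

xs = [1.0, 2.0, 3.0]
p = softmax(xs)
p_scaled = softmax([3 * x for x in xs])  # scale every input by 3

# The maximum entry gets more of the probability mass after scaling:
# p_scaled[2] > p[2]
```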
Softmax interpolates between two distributions. What are they? Hint: this is why Softmax should really be called “ArgSoftmax”
Softmax interpolates between:
- A distribution that selects an index uniformly at random
- A distribution that selects the argmax index with probability 1
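The two endpoints can be seen by scaling the inputs by a temperature-like factor (a sketch; the factors 0.001 and 100 stand in for the two limits):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

xs = [1.0, 2.0, 3.0]

# Scale -> 0: outputs approach the uniform distribution over indices.
near_uniform = softmax([0.001 * x for x in xs])

# Scale -> infinity: outputs approach "argmax index with probability 1".
near_argmax = softmax([100.0 * x for x in xs])
```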
Is softmax differentiable?
Yes!
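Differentiability can be sanity-checked numerically: the partial derivative of output p_i with respect to input x_j is p_i (delta_ij − p_i p_j style), i.e. −p_0 · p_1 for i=0, j=1 (a sketch with a finite-difference comparison; the inputs are arbitrary):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

xs = [0.2, -0.5, 1.3]
p = softmax(xs)

# Analytic partial derivative d p_0 / d x_1 = -p_0 * p_1 (off-diagonal Jacobian entry)
analytic = -p[0] * p[1]

# Finite-difference approximation of the same partial derivative
eps = 1e-6
xs_pert = [xs[0], xs[1] + eps, xs[2]]
numeric = (softmax(xs_pert)[0] - p[0]) / eps
```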
Fill in the blanks:
Given a set of vectors U = {u_1, …, u_n} and a query vector q, the probability of selecting vector u_i, under the distribution induced by the dot products {dot(u_1, q), …, dot(u_n, q)}, is the [1] of the [2] function.
1. output
2. softmax
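A minimal sketch of this: score each u_i by its dot product with q, then apply softmax to the scores (the vectors and query here are made up for illustration):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

U = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # set of vectors u_1, u_2, u_3
q = [1.0, 0.2]                            # query vector

scores = [dot(u, q) for u in U]
probs = softmax(scores)  # probability of selecting each u_i
```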
Fill in the blanks (1. and 2. are both two words):
When Softmax is applied at the final layer of a neural network, the query vector is the [1.] and {u_1, …, u_n} are the embeddings of the [2.].
1. hidden state
2. class labels
When Softmax is used for attention, the query vector is an [1.] [2.] and {u_1, …, u_n} are the [3.] of an [4.]. The resulting distribution gives the weights for a weighted summary of {u_1, …, u_n}.
1. internal
2. hidden state
3. embeddings
4. “input” (e.g. the previous layer, or sentences from lecture: u_1 = encoding of “Brian is a frog”, u_2 = encoding of “Lily is gray”, etc.)
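The weighted summary mentioned above can be sketched as follows (the vectors and query are made up for illustration):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# u_1, u_2 stand in for input encodings such as "Brian is a frog", "Lily is gray"
U = [[0.9, 0.1], [0.2, 0.8]]
q = [1.0, 0.0]  # internal hidden state acting as the query

weights = softmax([dot(u, q) for u in U])

# Weighted summary of the inputs: sum_i weights[i] * u_i
summary = [sum(w * u[d] for w, u in zip(weights, U)) for d in range(len(q))]
```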
Define “hard” attention
Samples are drawn from the distribution over the inputs.
Define “soft” attention
The distribution is used directly as the weights in a weighted average of the inputs.
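The two variants side by side (a sketch; `random.choices` does the sampling for hard attention, and the scores are arbitrary):

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

U = [[0.9, 0.1], [0.2, 0.8]]
probs = softmax([2.0, 1.0])  # attention distribution over the two inputs

# Soft attention: use the distribution directly as weights for an average.
soft = [sum(p * u[d] for p, u in zip(probs, U)) for d in range(2)]

# Hard attention: draw a sample from the distribution and select one input.
random.seed(0)
idx = random.choices(range(len(U)), weights=probs)[0]
hard = U[idx]
```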