Quiz #5 Flashcards
Attention (ML definition)
A weighting or probability distribution over the inputs that depends on the computation state and the inputs.
Fill in the blank:
Attention allows information to [1] directly between [2] computation nodes while making [3] structural assumptions
1) propagate 2) “distant” 3) minimal
The most standard form of attention in current neural networks is implemented with which function?
Softmax
When you input a set of numbers into softmax, what does it return?
A probability distribution over that set of numbers.
Softmax is not permutation equivariant. (T/F)
False. Softmax is permutation equivariant. This means that a permutation of the inputs leads to the same permutation of the outputs.
What is the equation to compute the softmax over a set of numbers {x_1, x_2, …, x_n}?
Softmax({x_1, x_2, …, x_n}) = {e^{x_1}/Z,…,e^{x_n}/Z}
Where:
Z = Sum_{j=1}^n e^{x_j}
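A minimal numpy sketch of this formula (the shift by the max is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(x):
    """Softmax over a set of numbers {x_1, ..., x_n}: e^{x_i} / Z."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())   # subtracting the max keeps exp() from overflowing
    return e / e.sum()        # Z = sum_j e^{x_j}

print(softmax([1.0, 2.0, 3.0]))        # a probability distribution
print(softmax([1.0, 2.0, 3.0]).sum())  # always sums to 1
```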
The outputs of softmax always sum to [ ].
1
When you scale your inputs by some positive factor > 1, how do the outputs of softmax change?
Softmax becomes “more sure” of its prediction, putting more weight on the maximum input.
Softmax interpolates between two distributions. What are they? Hint: this is why Softmax should really be called “ArgSoftmax”
Softmax interpolates between:
- A distribution that selects an index uniformly at random
- A distribution that selects the argmax index with probability 1
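A small numpy sketch (illustrative values) showing both effects: scaling the inputs sharpens the distribution, moving softmax from near-uniform toward the argmax distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 4.0])
print(softmax(0.01 * x))  # tiny scale  -> close to uniform, roughly [0.33, 0.33, 0.34]
print(softmax(x))         # unscaled    -> roughly [0.04, 0.11, 0.84]
print(softmax(10.0 * x))  # large scale -> nearly one-hot on the argmax index
```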
Is softmax differentiable?
Yes!
Fill in the blanks:
Given a set of vectors U = {u_1, …, u_n} and a query vector q, the probability of selecting vector u_i, based on its score dot(u_i, q) relative to all the scores dot(U, q), is the [1] of the [2] function.
1. output
2. softmax
Fill in the blanks (1. and 2. are both two words):
When Softmax is applied at the final layer of a neural network, the query vector is the [1.] and {u_1, …, u_n} are the embeddings of the [2.].
1. hidden state
2. class labels
When Softmax is used for attention, the query vector is an [1.] [2.] and {u_1, …, u_n} are the [3.] of an [4.]. The resultant distribution corresponds to a weighted summary of {u_1, …, u_n}.
1. internal
2. hidden state
3. embeddings
4. “input” (e.g., the previous layer, or sentence encodings such as from lecture: u_1 = encoding of “Brian is a frog”, u_2 = encoding of “Lily is gray”, etc.)
Define “hard” attention
Samples are drawn from the distribution over the inputs.
Define “soft” attention
The distribution is used directly as the weights of a weighted average over the inputs.
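A minimal numpy sketch contrasting the two, assuming a query q and a set of input encodings U (the names and sizes are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 8))   # 5 input encodings u_1..u_5, dimension 8
q = rng.normal(size=8)        # query / controller state

a = softmax(U @ q)            # distribution over the 5 inputs

# Soft attention: use the distribution directly as weights for an average.
soft_out = a @ U              # sum_k a_k * u_k

# Hard attention: draw a sample from the distribution over the inputs.
k = rng.choice(len(U), p=a)
hard_out = U[k]
```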
What is the initial controller state (q) in a multi-layer soft attention model?
An encoding of the query, e.g., from lecture: “What color is Greg?”
How do we update the controller state when querying a multi-layer Softmax attention model?
The embedding that has the maximum probability after we perform Softmax on the inner products of the query and input vectors (i.e., the max-probability entry from softmax(dot(q, U))) becomes the new query vector.
How can we represent an “ordering” of the inputs in an attention model, where all inputs are consumed by the model at the same time?
For sequential data, or data with a temporal structure, we can use position encodings to indicate the “position” of each element in the input sequence.
How do you add position encodings to the inputs M?
For each input m_j, add a vector l(j) to it.
These vectors l(j) can be fixed during training, or learned.
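A minimal numpy sketch of both options, with made-up sizes; the sinusoidal form is one common choice of fixed encoding, not something specified on the card:

```python
import numpy as np

n, d = 6, 16                       # 6 inputs of dimension 16 (made-up sizes)
M = np.random.normal(size=(n, d))  # input encodings m_1..m_n

# Option 1: learned position encodings -- a trainable table, one row per position.
L_learned = np.random.normal(size=(n, d)) * 0.02

# Option 2: fixed (sinusoidal-style) position encodings.
pos = np.arange(n)[:, None]
freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
L_fixed = np.zeros((n, d))
L_fixed[:, 0::2] = np.sin(pos * freq)
L_fixed[:, 1::2] = np.cos(pos * freq)

M_with_pos = M + L_learned         # or M + L_fixed: add l(j) to each m_j
```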
What are three major improvements introduced with Transformers compared with earlier attention-based architectures?
- Multi-query hidden-state propagation (“self-attention”)
- Multi-head attention
- Residual connections and LayerNorm (these have improved performance across all NNs, not just Transformers)
What is the output of a single Transformer layer?
The weighted sum of the inputs, weighted by the attention weights, i.e., Sum_k a_k u_k
The weights are obtained with:
a = Softmax({dot(u_1, q), …, dot(u_n, q)})
It is fairly common in Transformers to split the input representation into keys and values. What are the keys and values used for when computing softmax attention?
- The keys are used to find the softmax weights
- The values are used to define the output
For example: a = Softmax({dot(u_1, q), …, dot(u_n, q)})
output = Sum_k a_k v_k
where U = keys, V = values
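A minimal numpy sketch of the key/value split; the projection matrices W_k and W_v are my own illustration of how keys and values are typically derived from the inputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))          # input representations
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

K = X @ W_k                          # keys   (U on the card)
V = X @ W_v                          # values
q = rng.normal(size=d)               # query

a = softmax(K @ q)                   # keys find the softmax weights
output = a @ V                       # values define the output: sum_k a_k v_k
```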
List three standard specializations of Transformers for text data.
- Position encodings depending on the location of the text.
- For language models, “causal attention”
- Training code outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
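A small numpy sketch of “causal attention”: each token may only attend to itself and earlier tokens. Masking future positions with a large negative score before the softmax is one common way to implement this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.normal(size=(n, n))            # scores[i, j] = dot(q_i, k_j)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i (the future)
scores = np.where(mask, -1e9, scores)             # block attention to future tokens
attn = softmax(scores, axis=-1)                   # each row sums to 1, lower-triangular
```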
A self-attention model has a controller for every input. (T/F).
True. This means that the size of the controller state grows with the size of the input.
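A minimal numpy sketch of self-attention with one query (controller) per input row, so the amount of controller state grows with n; the 1/sqrt(d) scaling is the usual scaled dot-product convention, not stated on the card:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))           # n inputs of dimension d
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # one query (controller) per input row
A = softmax(Q @ K.T / np.sqrt(d))     # n x n attention weights, one row per input
out = A @ V                           # one output per input; grows with n
```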