Quiz #5 Flashcards

1
Q

Attention (ML definition)

A

A weighting or probability distribution over the inputs that depends on the computation state and the inputs

2
Q

Fill in the blank:

Attention allows information to [1] directly between [2] computation nodes while making [3] structural assumptions

A

1) propagate 2) “distant” 3) minimal

3
Q

Which function is used to implement the most standard form of attention in current neural networks?

A

Softmax

4
Q

When you input a set of numbers into softmax, what does it return?

A

A probability distribution over that set of numbers.

5
Q

Softmax is not permutation equivariant. (T/F)

A

False. Softmax is permutation equivariant. This means that a permutation of the inputs leads to the same permutation of the outputs.
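A minimal NumPy sketch of this property (the vector and the permutation are arbitrary choices): permuting the inputs and then applying softmax gives the same result as applying softmax and then permuting the outputs.

import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the output.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
perm = np.array([2, 0, 1])  # an arbitrary permutation

assert np.allclose(softmax(x[perm]), softmax(x)[perm])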

6
Q

What is the equation to compute the softmax over a set of numbers {x_1, x_2, …, x_n}?

A

Softmax({x_1, x_2, …, x_n}) = {e^{x_1}/Z,…,e^{x_n}/Z}
Where:
Z = Sum_{j=1}^n e^{x_j}
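A direct NumPy transcription of this equation (a sketch; the max subtraction is a standard numerical-stability addition, not part of the formula itself):

import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))  # e^{x_j} for every j
    Z = exps.sum()                # Z = Sum_{j=1}^n e^{x_j}
    return exps / Z               # {e^{x_1}/Z, ..., e^{x_n}/Z}

print(softmax(np.array([1.0, 2.0, 3.0])))  # a distribution summing to 1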

7
Q

The outputs of softmax always sum to [ ].

A

1

8
Q

When you scale your inputs by some positive factor > 1, how do the outputs of softmax change?

A

Softmax becomes “more sure” of its prediction, and puts more weight on the maximum input.
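A quick sketch of this effect (the inputs and the scale factor 5 are arbitrary choices):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))      # ~[0.090, 0.245, 0.665]
print(softmax(5 * x))  # ~[0.000, 0.007, 0.993]: more weight on the max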

9
Q

Softmax interpolates between two distributions. What are they? Hint: this is why Softmax should really be called “SoftArgmax”

A

Softmax interpolates between:

  • A distribution that selects an index uniformly at random
  • A distribution that selects the argmax index with probability 1
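The scaling behavior from the previous card shows both endpoints; here is a sketch (the scale factors are arbitrary):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(0.001 * x))  # ~[0.333, 0.333, 0.333]: the uniform endpoint
print(softmax(100 * x))    # ~[0.000, 0.000, 1.000]: the argmax endpoint
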
10
Q

Is softmax differentiable?

A

Yes!

11
Q

Fill in the blanks:
Given a set of vectors U = {u_1, …, u_n} and a query vector q, the probability of selecting one of the vectors u_i, based on its score dot(u_i, q) among the scores of all vectors dot(U, q), is the [1] of the [2] function.

A
  1. output
  2. softmax
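A sketch of this setup with made-up shapes: the selection probabilities are the softmax of the dot products dot(u_i, q).

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

U = np.random.randn(5, 8)  # five vectors u_1..u_5 of dimension 8
q = np.random.randn(8)     # query vector

p = softmax(U @ q)         # U @ q gives dot(u_i, q) for each i
print(p, p.sum())          # probability of selecting each u_i; sums to 1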

12
Q

Fill in the blanks (1. and 2. are both two words):
When Softmax is applied at the final layer of a neural network, the query vector is the [1.] and {u_1, …, u_n} are the embeddings of the [2.].

A
  1. hidden state
  2. class label

13
Q

When Softmax is used for attention, the query vector is an [1.] [2.] and {u_1, …, u_n} are the [3.] of an [4.]. The resulting distribution gives the weights for a weighted summary of {u_1, …, u_n}.

A
  1. internal
  2. hidden state
  3. embeddings
  4. “input” (e.g. the previous layer, or encoded sentences from lecture: u_1 = encoding of “Brian is a frog”, u_2 = encoding of “Lily is gray”, etc.)
14
Q

Define “hard” attention

A

Samples are drawn from the distribution over the inputs.

15
Q

Define “soft” attention

A

The distribution is used directly to compute a weighted average of the inputs.
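A sketch contrasting the two, reusing the distribution p = softmax(U @ q) from card 11 (shapes are illustrative):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

U = np.random.randn(5, 8)
q = np.random.randn(8)
p = softmax(U @ q)

# "Hard" attention: sample one input from the distribution.
hard = U[np.random.choice(len(U), p=p)]

# "Soft" attention: use the distribution directly as a weighted average.
soft = p @ U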

16
Q

What is the initial controller state (q) in a multi-layer soft attention model?

A

An encoding of the query, e.g. from lecture: “What color is Greg?”

17
Q

How do we update the controller state when querying a multi-layer Softmax attention model?

A

The embedding with the maximum probability under Softmax applied to the inner products of the query and input vectors (i.e. the argmax of softmax(dot(q, U))) becomes the new query vector.
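A sketch of this update rule as stated on the card (hard, max-probability selection; a fully soft variant would instead use the weighted average p @ U as the new query):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

U = np.random.randn(5, 8)  # input embeddings
q = np.random.randn(8)     # initial controller state: the encoded query

for _ in range(3):         # one update per attention layer
    p = softmax(U @ q)     # softmax over inner products dot(u_i, q)
    q = U[np.argmax(p)]    # max-probability embedding becomes the new query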

18
Q

How can we represent an “ordering” of the inputs in an attention model, where all inputs are consumed by the model at the same time?

A

For sequential data, or data with a temporal structure, we can use position encodings to indicate the “position” of each element in the sequence.

19
Q

How do you add position encodings to the inputs M?

A

For each input m_j, add a position vector l(j) to it.

These vectors l(j) can be fixed during training, or learned.
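A sketch of this, with l(j) drawn randomly to stand in for either a fixed table or learned parameters:

import numpy as np

n, d = 6, 8                # sequence length, embedding dimension
M = np.random.randn(n, d)  # inputs m_1 .. m_n
L = np.random.randn(n, d)  # position vectors l(1) .. l(n), fixed or learned

M_pos = M + L              # each row is m_j + l(j)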

20
Q

What are three major improvements introduced with Transformers compared with earlier attention-based architectures?

A
  • Multi-query hidden-state propagation (“self-attention”)
  • Multi-head attention
  • Residual connections and LayerNorm (these have improved performance across all NNs, not just Transformers)
21
Q

What is the output of a single Transformer layer?

A

The weighted sum of the inputs, i.e. output = Sum_k a_k u_k

The weights are obtained with:

a = Softmax({dot(u_1, q), …, dot(u_n, q)})
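A sketch of this computation for a single query (in practice a full Transformer layer also includes the multi-head, residual, and LayerNorm pieces from card 20):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

U = np.random.randn(5, 8)  # inputs u_1 .. u_n
q = np.random.randn(8)     # query

a = softmax(U @ q)         # a = Softmax({dot(u_1, q), ..., dot(u_n, q)})
output = a @ U             # Sum_k a_k u_k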

22
Q

It is fairly common in Transformers to split the input representation into keys and values. What are the keys and values used for when computing softmax attention?

A
  • The keys are used to find the softmax weights
  • The values are used to define the output

For example:
a = Softmax({dot(u_1, q), …, dot(u_n, q)})
output = Sum_k a_k v_k

U = keys
V = values
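A sketch of the key/value split (here U and V are independent random matrices; in a real Transformer they are linear projections of the same input):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

n, d = 5, 8
U = np.random.randn(n, d)  # keys: used to find the softmax weights
V = np.random.randn(n, d)  # values: used to define the output
q = np.random.randn(d)

a = softmax(U @ q)         # weights come from the keys
output = a @ V             # output = Sum_k a_k v_k
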
23
Q

List three standard specializations of Transformers for text data.

A
  • Position encodings that depend on the location of the text
  • For language models, “causal attention” (each token can only attend to earlier tokens; see the sketch below)
  • Training code that outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
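A sketch of a causal attention mask (the -inf trick is a standard implementation choice, not from the card): token i may only attend to positions j <= i.

import numpy as np

n = 5
scores = np.random.randn(n, n)  # attention scores for n tokens

# Mask out "future" positions before the softmax.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax; row i is a distribution over positions j <= i.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
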
24
Q

A self-attention model has a controller for every input. (T/F).

A

True. This means that the size of the controller state grows with the size of the input.