Transformers Flashcards

1
Q

Transformers were developed to solve the problem of _____

A

sequence transduction

2
Q

What is sequence transduction?

A

Any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, etc.

3
Q

For models to perform sequence transduction it is necessary to have what?

A

Some sort of memory

4
Q

RNNs are what

A

Feed-forward networks with the input spread out over time; as such they deal with sequence data, like stock prices

5
Q

List an example of an RNN vector-to-sequence model

A

Labeling images. The image is represented as a vector and the description as a sequence of text

6
Q

RNN sequence to vector example

A

Sentiment analysis. The input is a sequence of text (e.g. a movie review) and the output is a vector, e.g. [0.90, 0.10], indicating how good or bad the movie was

7
Q

Example of RNN sequence to sequence transduction

A

Language translation

8
Q

What is the notable characteristic of sequence input?

A

It has some defined ordering

9
Q

Downsides of RNNs?

A

1) They're slow to train. 2) They don't deal with long sequences well

10
Q

What happens when RNNs process sequences that are too long?

A

The gradients either vanish or explode
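
A minimal numeric sketch (the weight values and timestep count are made up) of why this happens: the same recurrent weight is multiplied into the gradient once per timestep, so its magnitude shrinks or grows geometrically.

# Hypothetical illustration: repeated multiplication by the recurrent weight.
for w in (0.9, 1.1):              # |w| < 1 vanishes, |w| > 1 explodes
    grad = 1.0
    for _ in range(100):          # pretend the sequence has 100 timesteps
        grad *= w
    print(f"w={w}: gradient factor after 100 steps = {grad:.3e}")
# w=0.9: ~2.7e-05 (vanishing)   w=1.1: ~1.4e+04 (exploding)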

11
Q

What is a consequence of RNNs being too slow to train?

A

We end up using a truncated version of backpropagation through time, and even then training is too slow

12
Q

LSTM networks replace neurons with what?

A

An LSTM cell

13
Q

LSTMs' and RNNs' relationship to GPUs

A

Their data must be processed sequentially, meaning the output of one step must be used in the input for the next step. This makes it impossible to take full advantage of GPUs, which are designed for parallel computation
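
A minimal sketch (plain NumPy, with made-up sizes) of that sequential dependency: each hidden state needs the previous one, so the timestep loop cannot be parallelized.

import numpy as np

rng = np.random.default_rng(0)
W_x, W_h = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
h = np.zeros(8)
for x_t in rng.normal(size=(10, 4)):   # 10 timesteps of 4-dim input
    h = np.tanh(W_x @ x_t + W_h @ h)   # must wait for the previous h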

14
Q

Like RNNs, transformers use what architecture?

A

An encoder decoder architecture

15
Q

GloVe

A

An unsupervised learning algorithm for obtaining vector representations for words

16
Q

Embedding space

A

Maps a word (e.g. "dog") to a vector, e.g. [0.22, 0.73, 0.87, 0.17, ...]
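
A toy sketch (the vocabulary and dimensionality are made up) of what an embedding space boils down to: a lookup from a token to a learned vector.

import numpy as np

vocab = {"the": 0, "dog": 1, "cat": 2, "ate": 3}
# In practice this matrix is learned; here it is random stand-in data.
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), 4))

def embed(word):
    return embedding_matrix[vocab[word]]

print(embed("dog"))   # a 4-dim vector for "dog"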

17
Q

Why do transformers need the inputs to have positional encoding added to them?

A

Because unlike other architectures, the transformer doesn't use recurrence or convolution and instead originally treats each data point as independent of the others.

But on a more logical level this is necessary because ordering matters in sequences. "The dog ate the cat" has a different meaning from "the cat ate the dog". [And presumably knowing the order of dog and cat matters for the output language, because each output language may order subjects and objects differently.]

Each element of data in the transformer combines information about other elements via self-attention, but each element does this on its own, independently of what the other elements do.

18
Q

The AIAYN paper’s choice of positional embedding is best understood if you have knowledge of what?

A

Fourier Analysis

19
Q

AIAYN

A

Attention is all you need

20
Q

Positional embedding requirements

A

1) Every position should have the same identifier irrespective of the sequence length or what the input is.
2) Since the position embedding is added to the original vector for a token, and since the value of each dimension in that original vector is bounded, the position embedding should be bounded in the same domain, so that positional similarity doesn't have a much larger effect on the final value than semantic similarity; i.e. each value in each dimension of the position embedding vector should be between 0 and 1.

21
Q

What is the benefit of using sin and cos to construct your positional embedding function as opposed to say using a sigmoid?

A

Since sigmoids are asymptotic, large input values will have very similar output values, so sigmoids would not be good for long sequences [I guess this means the calculated vectors for later tokens in the sequence would end up being very similar]
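
A quick numeric sketch of that saturation (the positions are chosen arbitrarily):

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
for pos in (10, 50, 100):
    print(pos, sigmoid(pos))   # all effectively 1.0, so distant positions become indistinguishable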

22
Q

Why wouldn't a simple cos or sin work to calculate, say, the first dimension value of the positional embedding? [where the parameter is the index of the token in the input sequence]

A

The periodicity of sin and cos would lead to multiple indices receiving the same positional value [which might be OK if we end up using a different formula for other dimensions of the vector, but regardless we still need to come up with some formula that works for those other dimensions]
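
A quick sketch of the collision (indices chosen arbitrarily): two positions a full period apart get the same value, up to floating-point error.

import numpy as np

period = 2 * np.pi
print(np.sin(1.0), np.sin(1.0 + period))   # same positional value for two different token indices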

23
Q

Why wouldn't a very stretched-out version of sin work as the position function? E.g. such that its period was, say, 4x the length of the data, so that it resembles a monotonically increasing function with output values progressing from 0 to 1

A

The positional embedding value deltas between adjacent positions would be too small, and the semantic values would overshadow the positional values

24
Q

The sin and cos relative positional embedding works best for what type of data?

A

Text. Apparently it doesn't work that well for images

25
Q

What is the word vector (with positional information added) passed into?

A

The encoder block (or, more properly, the first core layer of the encoder block) [depending on whether we include the input embedding layer and positional encoding layer as part of the encoder block, and treat the remaining grouped + repeated layers as the "core section" of the encoder block]

26
Q

At a high level, the encoder block is composed of what?

A

A multi headed attention layer and a feed forward layer

27
Q

Attention involves answering what question?

A

What part of the input should I focus on?

28
Q

What is a more formal way to describe the attention question?

A

For the ith word in the sentence, how relevant is it to each other word in the sentence?

29
Q

What does the (multi-head) attention layer of the encoder block output?

A

Attention vectors: a vector representing how relevant the ith word in an English sentence is to the other words in that same English sentence

30
Q

What does the feed forward layer in the encoder block [of a transformer] do at a high level?

A

They are applied to every attention vector.

They convert the attention vector into a form that is digestible by the next encoder or decoder block.

31
Q

What does the decoder take as input and produce as output?

A

In an English-to-French translation it would take the French words generated so far as input and would output the NEXT French word in the sentence OR the "end of sentence" token

32
Q

Desiderata definition

A

things wanted or needed

33
Q

When comparing the efficacy of self attention layers vs recurrence or convolution layers, what is the first thing we consider?

A

The total computational complexity per layer

34
Q

When comparing the efficacy of self attention layers vs recurrence or convolution layers, what is the second thing we consider?

A

The amount of computation that can be parallelized

35
Q

How do we measure the number of computations that can be parallelized in a layer?

A

We measure the minimum number of sequential operations required

36
Q

When comparing the efficacy of self attention layers vs recurrence or convolution layers, what is the third thing we consider?

A

the path length between long-range dependencies in the network

37
Q

When are the attention vectors for a sentence calculated?

A

In the attention block

38
Q

By calculating how relevant each word in a sentence is to the ith word in the sentence, what are we accomplishing?

A

We're determining the contextual relationships between the words in the sentence

39
Q

What is the default positional embedding formula from the attention paper?

A
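
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
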
40
Q

What is the default positional embedding formula in the attention paper, with each variable explained?

A
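
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the index of the token in the sequence, i indexes the dimension pairs of the embedding vector (2i is an even dimension, 2i+1 is the odd dimension next to it), and d_model is the dimensionality of the embeddings.

A minimal NumPy sketch of this formula (the seq_len and d_model values are arbitrary choices for illustration):

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)    # gets added to the token embeddings
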
41
Q

Why do we talk about the Jacobian matrix of the softmax function rather than the gradient?

A

Because softmax is a vector-valued function, and the Jacobian is the matrix of all first-order partial derivatives of a function (that outputs a vector) with respect to its vector input.

Whereas a gradient is just another word for "the (partial) derivatives of a scalar function with respect to its variables".
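
A small sketch of that Jacobian (the standard formula is ds_i/dz_j = s_i * (delta_ij - s_j); the logits below are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)       # J[i, j] = s_i * (delta_ij - s_j)

J = softmax_jacobian(np.array([1.0, 2.0, 0.5]))   # 3x3 matrix of partial derivatives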

42
Q

What is an attention vector?

A

A vector representing how relevant the ith word in a sentence [or I guess string of words] is to the other words in that same sentence

43
Q

Positional embedding requirements (S)

A

1) Every position should have the same identifier irrespective of the sequence length or what the input is
2) each dimension in the position embedding vector should be in [0,1]

44
Q

Seq2seq models contain what two models?

A

An encoder and a decoder

45
Q

What is the encoder’s job?

A

To take an input sequence and output a context vector/thought vector

https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190

46
Q

What is a context vector?

A

The encoder’s final state

https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190

47
Q

How does the decoder use the context vector on a high level?

A

It converts it into an output sequence, e.g. the translated sentence, or a reply to the input text, etc.

48
Q

If the encoder is a bidirectional RNN, what is the value of the context vector for a seq2seq model?

A

The concatenation of both directions' final hidden states
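
A tiny sketch (the hidden sizes and values are made up) of what that concatenation looks like:

import numpy as np

h_forward_final = np.array([0.1, 0.2, 0.3, 0.4])     # final state of the forward pass
h_backward_final = np.array([0.5, 0.6, 0.7, 0.8])    # final state of the backward pass
context_vector = np.concatenate([h_forward_final, h_backward_final])   # length 8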

49
Q

In an encoder each hidden state corresponds to what?

A

An input word

https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190

50
Q

What happens with longer sequences in our naive RNN seq2seq model?

A

The signals from earlier inputs in the encoder get diluted as they are passed down to later elements in the decoder sequence

51
Q

One way to solve the long sequence signal dampening problem in a naive RNN sequence to sequence model?

A

Use skip-connections that feed every hidden state of the encoder RNN into every input of the decoder RNN (rather than just the encoder's final hidden state being fed into the decoder's initial state)

52
Q

What is a key aspect of seq-to-seq models?

A

The correspondence is between the input sequence and the output sequence, not between each individual input word and each individual output word. E.g. one output element could correspond to a combination of two input elements

53
Q

List some more creative uses of sequence to sequence models than language translation and speech recognition

A

Q&A, e.g. the input sequence is a question and the output sequence is the answer to that question; text summarization, where the input is a text document and the output is a short summary of the text's contents

54
Q

Summarize the difference between attention scores, attention weights, and attention vectors

A

• Attention Score: Measures compatibility between a query and every key.
• Attention Weight: Normalized version of the attention scores. Tells us how much each word should contribute to the final representation.
• Attention Vector: The weighted sum of all the value vectors, using the attention weights. It is the aggregated representation of the input sequence in the context of the given query.
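
A toy NumPy sketch of those three quantities for a single query (all shapes and values are made up; this is plain scaled dot-product attention, not code from any particular paper):

import numpy as np

rng = np.random.default_rng(0)
d_k = 4
q = rng.normal(size=d_k)          # one query
K = rng.normal(size=(5, d_k))     # 5 keys
V = rng.normal(size=(5, d_k))     # 5 values

scores = K @ q / np.sqrt(d_k)                        # attention scores (compatibility of q with each key)
weights = np.exp(scores) / np.exp(scores).sum()      # attention weights (softmax of the scores)
attention_vector = weights @ V                       # weighted sum of the values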

55
Q

When/why are we calculating the Jacobian of the softmax function?

A

Since for some of our layers we convert a vector of logits into a vector of probabilities with softmax, when doing backprop we'll need the partial derivative of each softmax output element with respect to each logit element, and the Jacobian provides exactly these partial derivatives