Transformers Flashcards
Transformers were developed to solve the problem of _____
sequence transduction
what is sequence transduction
Any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech, etc.
For models to perform sequence transduction it is necessary to have what?
Some sort of memory
RNNs are what
Feed forward networks with the input spread out over time, and as such deal with sequence data, like stock prices
List an example of an RNN vector-to-sequence model
Image captioning (labeling images). The image is represented as a vector and the description as a sequence of text
RNN sequence to vector example
Sentiment analysis. Input is a sequence of text (e.g. a movie review); output is a vector such as [0.90, 0.10] indicating how good or bad the movie was
Example of RNN sequence to sequence transduction
Language translation
What is the notable characteristic of sequence input?
It has some defined ordering
Downsides of RNNs?
1) they’re slow to train. 2) they don’t deal with long sequences too well
What happens when RNNs process sequences that are too long?
The gradients either vanish or explode
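A toy numpy sketch (illustrative only, not a real RNN; the matrices and step count are made up) of why backprop through many time steps makes gradients vanish or explode:

```python
import numpy as np

# Backprop through T steps multiplies the gradient by (roughly) the same
# Jacobian T times, so its norm scales like (largest singular value)^T.
rng = np.random.default_rng(0)
grad = rng.normal(size=4)

W_small = 0.5 * np.eye(4)   # singular values < 1 -> vanishing
W_large = 1.5 * np.eye(4)   # singular values > 1 -> exploding

g_vanish, g_explode = grad.copy(), grad.copy()
for _ in range(50):         # "50 time steps"
    g_vanish = W_small @ g_vanish
    g_explode = W_large @ g_explode

print(np.linalg.norm(g_vanish))   # ~0: vanished
print(np.linalg.norm(g_explode))  # huge: exploded
```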
What is a consequence of RNNs being too slow to train?
We end up using truncated backpropagation through time, and even then training is slow
LSTM networks replace neurons with what?
An LSTM cell
LSTMs and RNNs relationship to GPUs
Their data must be processed sequentially: the output of one step is needed as input to the next step. This makes it impossible to take full advantage of GPUs, which are designed for parallel computation
Like RNNs, transformers use what architecture?
An encoder decoder architecture
GloVe
An unsupervised learning algorithm for obtaining vector representations for words
Embedding space
Maps a word (e.g “dog”) to a vector e.g [0.22, 0.73, 0.87, 0.17,…]
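A minimal sketch of an embedding lookup, using made-up 4-dimensional vectors (real embeddings such as GloVe are learned from data and have far more dimensions):

```python
import numpy as np

# Hypothetical embedding table; the vectors are invented for illustration.
embedding = {
    "dog": np.array([0.22, 0.73, 0.87, 0.17]),
    "cat": np.array([0.25, 0.70, 0.80, 0.15]),
    "car": np.array([0.90, 0.10, 0.05, 0.60]),
}

def cosine(u, v):
    # Cosine similarity: how close two vectors point in the embedding space
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically similar words sit close together in the embedding space:
print(cosine(embedding["dog"], embedding["cat"]))  # high
print(cosine(embedding["dog"], embedding["car"]))  # lower
```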
Why do transformers need the inputs to have positional encoding added to them?
Because unlike other architectures, the transformer uses no recurrence or convolution, so it initially treats each data point as independent of the others [explain this from two different angles]
On a more logical level this is necessary because ordering matters in sequences. “The dog ate the cat” has a different meaning from “the cat ate the dog”. [and presumably knowing the order of dog and cat matters for the output language, because each output language may order subjects and objects differently]
Each element of data in the transformer combines information about other elements via self attention but each element does this on its own independently of what the other elements do
The AIAYN paper’s choice of positional embedding is best understood if you have knowledge of what?
Fourier Analysis
AIAYN
Attention is all you need
Positional embedding requirements
1) Every position should have the same identifier irrespective of the sequence length or what the input is
2) since the position embedding is added to the original vector for a token, and since the value of each dimension in the original vector is bounded, the position embedding should be bounded in the same range so that positional similarity doesn’t have a much larger effect on the final value than semantic similarity; with sin/cos, each dimension of the positional vector lies in [-1, 1]
What is the benefit of using sin and cos to construct your positional embedding function as opposed to say using a sigmoid?
Since sigmoids are asymptotic, large input values have very similar output values, so sigmoids would not be good for long sequences [I guess this means the positional encodings for the later tokens in the sequence would all end up nearly identical]
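A quick numpy demonstration of this saturation problem (the positions chosen are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# For large positions the sigmoid saturates, so distinct positions get
# nearly identical encodings.
positions = np.array([1.0, 2.0, 50.0, 51.0])
vals = sigmoid(positions)
print(vals[1] - vals[0])   # noticeable gap between positions 1 and 2
print(vals[3] - vals[2])   # ~0: positions 50 and 51 are indistinguishable
```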
Why wouldn’t a simple cos or sin work to calculate say the first dimension value of the positional embedding? [where the parameter is the index of the token in the input sequence]
The periodicity of sin and cos would lead to multiple indices receiving the same positional value [which might be OK if we end up coming up with a different formula for other dimensions of the vector, but regardless we still need to come up with some formula that works for the other dimensions]
Why wouldn’t a very stretched-out version of sin work as the position function? E.g. one whose period is, say, 4x the length of the data, so that over the sequence it resembles a monotonically increasing function progressing from 0 to 1
The positional embedding value deltas between each position would be too small and the semantic values would overshadow the positional values
The sin and cos relative positional embedding works best for what type of data?
Text. Apparently doesn’t work that well for images
What is the word vector with positional information passed into?
The encoder block (or more precisely, the first core layer of the encoder block, depending on whether we count the input embedding and positional encoding layers as part of the encoder block and the remaining grouped + repeated layers as the “core section” of the encoder block)
At a high level The encoder block is composed of what?
A multi headed attention layer and a feed forward layer
Attention involves answering what question?
What part of the input should I focus on?
What is a more formal way to describe the attention question?
For each ith word in the sentence, how relevant is the ith word in the sentence to each other word in the sentence?
What does the (multi-head) attention layer of the encoder block output?
Attention vectors. A vector representing how much the ith word in an English sentence is relevant to the other words in that same English sentence
What does the feed forward layer in the encoder block [of a transformer] do at a high level?
They are applied to every attention vector
They convert each attention vector into a form that is digestible by the next encoder or decoder block
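A minimal sketch of the position-wise feed-forward layer, FFN(x) = max(0, xW1 + b1)W2 + b2, using random stand-in weights (in a real transformer these are learned; AIAYN uses d_model=512, d_ff=2048):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # toy sizes for illustration

# Stand-in weights; learned in a real model.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    # Applied to each position (each attention vector) independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

attention_vectors = rng.normal(size=(5, d_model))  # 5 tokens
out = feed_forward(attention_vectors)
print(out.shape)  # (5, 8): same shape, digestible by the next block
```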
What does the decoder take as input and produce as output?
In an English-to-French translation it takes the French words generated so far as input and outputs the NEXT French word in the sentence, OR the “end of sentence” token
Desiderata definition
things wanted or needed
When comparing the efficacy of self attention layers vs recurrence or convolution layers, what is the first thing we consider?
The total computational complexity per layer
When comparing the efficacy of self attention layers vs recurrence or convolution layers, what is the second thing we consider?
The amount of computation that can be parallelized
How do we measure the number of computations that can be parallelized in a layer?
We measure the minimum number of sequential operations required
When comparing the efficacy of self attention layers vs recurrence or convolution layers, what is the third thing we consider?
the path length between long-range dependencies in the network
When are the attention vectors for a sentence calculated?
In the attention block
By calculating how relevant each word in a sentence is to the ith word in the sentence, what are we accomplishing?
We’re accomplishing determining contextual relationships between words in the sentence
What is the default positional embedding formula from the attention paper?
PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
What is the default positional embedding formula from the attention paper, with each variable explained?
PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)), where pos is the token’s position in the sequence, i indexes pairs of embedding dimensions (even dimensions use sin, odd use cos), and d_model is the dimensionality of the embeddings
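The paper’s sinusoidal formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched in numpy as follows (a minimal illustration, not the paper’s reference code):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # pos: position of the token; i: index over dimension pairs
    pos = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # shape (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
# Every value is bounded in [-1, 1], satisfying requirement 2 above
print(pe.min() >= -1 and pe.max() <= 1)  # True
```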
Why do we talk about the Jacobian matrix of the softmax function rather the gradient?
Because softmax is a vector function, and the Jacobian is the matrix of all first order partial derivatives of a function (that outputs a vector) with respect to another vector
Whereas “gradient” is just another word for the vector of partial derivatives of a scalar function with respect to its variables
What is an attention vector?
A vector representing how much the ith word in a sentence [or I guess string of words] is relevant to the other words in that same sentence
Positional embedding requirements (S)
1) Every position should have the same identifier irrespective of the sequence length or what the input is
2) each dimension of the position embedding vector should be bounded (with sin/cos, in [-1, 1])
Seq2seq models contain what two models?
An encoder and a decoder
What is the encoder’s job?
To take an input sequence and output a context vector/thought vector
https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190
What is a context vector?
The encoder’s final state
https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190
How does the decoder use the context vector on a high level?
It converts it into an output sequence, e.g. the translated sentence, or a reply to the input text, etc.
if the encoder is a bidirectional RNN, what is the value of the context vector for a seq2seq model?
The concatenation of both direction’s final hidden states
In an encoder each hidden state corresponds to what?
An input word
https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190
What happens with longer sequences in our naive RNN seq2seq model?
The signals from earlier inputs in the encoder get diluted as they are passed down to later elements in the decoder sequence
One way to solve the long sequence signal dampening problem in a naive RNN sequence to sequence model?
use skip-connections that feed every hidden state of the encoder RNN into every input of the decoder RNN (rather than just feeding the encoder’s final hidden state into the decoder’s initial state)
What is a key aspect of seq-to-seq models?
The correspondence is between the input sequence and the output sequence, not between each individual input word and each individual output word. E.g. one output element could correspond to a combination of two input elements
List some more creative uses of sequence to sequence models than language translation and speech recognition
Q and A - e.g input sequence is a question and the output sequence is the answer to that question; text summarization - where input is a text document and the output is a short summary of the text’s contents
Summarize the difference between attention scores, attention weights, and attention vectors
• Attention Score: Measures compatibility between a query and every key.
• Attention Weight: Normalized version of the attention scores. Tells us how much each word should contribute to the final representation.
• Attention Vector: The weighted sum of all the value vectors, using the attention weights. It is the aggregated representation of the input sequence in the context of the given query.
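The three quantities above can be sketched with scaled dot-product attention in numpy (random stand-in Q, K, V matrices; a minimal single-head illustration, not a full multi-head implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # attention scores: query-key compatibility
    weights = softmax(scores)         # attention weights: each row sums to 1
    return weights @ V, weights       # attention vectors: weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 4))   # 5 values
vectors, weights = attention(Q, K, V)
print(weights.sum(axis=1))    # each row sums to 1
print(vectors.shape)          # (3, 4): one aggregated vector per query
```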
When /why are we calculating the Jacobian of the softmax function?
Since some of our layers convert a vector of logits into a vector of probabilities with softmax, during backprop we need the partial derivative of each softmax output with respect to each logit, and the Jacobian is exactly the matrix of those partial derivatives
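A small numpy sketch of that Jacobian, using the standard identity J[i][j] = s_i (δ_ij − s_j) where s = softmax(z):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = d softmax(z)_i / d z_j = s_i * (delta_ij - s_j)
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(z)
# Each column sums to ~0: the probabilities always sum to 1, so nudging
# any single logit can't change the total.
print(J.sum(axis=0))
```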