Transformers Flashcards
Transformers were developed to solve the problem of _____
sequence transduction
what is sequence transduction
Any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, etc.
For models to perform sequence transduction it is necessary to have what?
Some sort of memory
RNNs are what
Feed forward networks with the input spread out over time, and as such deal with sequence data, like stock prices
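A minimal sketch of that recurrence in plain numpy (illustrative only; the shapes, names, and toy data are made up, not any particular library's API):

```python
import numpy as np

# One RNN step per element of the sequence: the same weights are reused at every
# time step, and the hidden state h carries "memory" forward through time.
def rnn_forward(inputs, W_xh, W_hh, b_h):
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    hidden_states = []
    for x_t in inputs:                   # process the sequence one step at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Toy sequence: 5 time steps of 3-dimensional features (e.g. daily price data)
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))
W_xh = rng.normal(size=(4, 3)) * 0.1     # input-to-hidden weights
W_hh = rng.normal(size=(4, 4)) * 0.1     # hidden-to-hidden (recurrent) weights
states = rnn_forward(inputs, W_xh, W_hh, np.zeros(4))
```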
List an example of an RNN vector sequence model
Labeling images. The image is represented as a vector and the description as a sequence of text
RNN sequence to vector example
Sentiment analysis. Input is a sequence of text (e.g. a movie review); output is a vector, e.g. [0.90, 0.10], indicating how good or bad the movie was
Example of RNN sequence to sequence transduction
Language translation
What is the notable characteristic of sequence input?
It has some defined ordering
Downsides of RNNs?
1) they’re slow to train. 2) they don’t deal with long sequences too well
What happens when RNNs process sequences that are too long?
The gradients either vanish or explode
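A rough numerical illustration of why: backpropagating through T time steps multiplies the gradient by the recurrent weight matrix roughly T times (the matrix and scales below are made up):

```python
import numpy as np

def gradient_norm_after(T, scale, size=8, seed=1):
    # Repeatedly multiply a gradient vector by the (transposed) recurrent matrix,
    # as backpropagation through time does across T steps.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(size, size)) * scale
    grad = np.ones(size)
    for _ in range(T):
        grad = W.T @ grad
    return np.linalg.norm(grad)

print(gradient_norm_after(T=100, scale=0.1))  # tiny: the gradient vanishes
print(gradient_norm_after(T=100, scale=0.5))  # enormous: the gradient explodes
```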
What is a consequence of RNNs being too slow to train?
We end up using a truncated version of backpropagation through time, and even then it is too slow
LSTM networks replace neurons with what?
An LSTM cell
LSTMs and RNNs relationship to GPUs
Their data must be processed sequentially: the output at one time step must be used as input for the next step. This makes it impossible to take advantage of GPUs, which are designed for parallel computation
Like RNNs, transformers use what architecture?
An encoder-decoder architecture
GloVe
An unsupervised learning algorithm for obtaining vector representations for words
Embedding space
Maps a word (e.g. “dog”) to a vector, e.g. [0.22, 0.73, 0.87, 0.17, …]
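A toy sketch of such a lookup (the vectors here are made up; real embeddings like GloVe are learned and typically have 50-300 dimensions):

```python
import numpy as np

# Hypothetical embedding table: each word maps to a fixed-length vector.
embedding = {
    "dog": np.array([0.22, 0.73, 0.87, 0.17]),
    "cat": np.array([0.25, 0.70, 0.80, 0.20]),
    "car": np.array([0.91, 0.05, 0.11, 0.64]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar words should sit close together in the embedding space.
print(cosine_similarity(embedding["dog"], embedding["cat"]))  # high (~1.0)
print(cosine_similarity(embedding["dog"], embedding["car"]))  # lower
```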
Why do transformers need the inputs to have positional encoding added to them?
Because unlike other architectures, the transformer doesn’t use recurrence or convolution and instead initially treats each data point as independent of the others [explain this from two different angles]
But on a more logical level this is necessary because ordering matters in sequences. “The dog ate the cat” has a different meaning from “the cat ate the dog”. [and presumably knowing the order of dog and cat matters for the output language, because each output language may order subjects and objects differently]
Each element of data in the transformer combines information about the other elements via self-attention, but each element does this on its own, independently of what the other elements do
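A minimal numpy sketch of scaled dot-product self-attention, just to show that every position's output is built from all positions at once via matrix operations, with no step waiting on a previous step's output (the weights are random placeholders):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Each row of X is one token's vector; queries, keys and values are
    # computed for all tokens in parallel.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token attention scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # each output row mixes info from all tokens

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))                # 5 tokens, 8-dimensional embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)     # shape (5, 8)
```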
The AIAYN paper’s choice of positional embedding is best understood if you have knowledge of what?
Fourier Analysis
AIAYN
Attention is all you need
Positional embedding requirements
1) Every position should have the same identifier irrespective of the sequence length or what the input is
2) Since the position embedding is added to the original vector for a token, and since the value of each dimension in the original vector is bounded, the position embedding should be bounded in a similar range, so that positional similarity doesn’t have a much larger effect on the final value than semantic similarity; i.e. each value in each dimension of the positional embedding should stay within a small bounded range (for sin and cos, between -1 and 1)
What is the benefit of using sin and cos to construct your positional embedding function as opposed to say using a sigmoid?
Since sigmoids are asymptotic, large input values will have very similar output values, so sigmoids would not be good for long sequences [I guess this means the calculated positional embeddings for later tokens in the sequence would end up being very similar]
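A quick numerical illustration of that saturation, assuming positions were fed straight into a sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# The sigmoid saturates: large positions become indistinguishable,
# while sin keeps oscillating within [-1, 1].
for pos in [1, 5, 50, 500, 5000]:
    print(pos, sigmoid(pos), np.sin(pos))
```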
Why wouldn’t a simple cos or sin work to calculate, say, the first dimension value of the positional embedding? [where the parameter is the index of the token in the input sequence]
The periodicity of sin and cos would lead to multiple indices receiving the same positional value [which might be OK if we end up coming up with a different formula for other dimensions of the vector, but regardless we still need to come up with some formula that works for those other dimensions]
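For reference, the AIAYN paper resolves this by using a whole vector of sines and cosines at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A small numpy sketch (assuming an even d_model; the sizes below are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions
    return pe                                                # every value bounded in [-1, 1]

pe = positional_encoding(seq_len=50, d_model=8)
# Lower dimensions oscillate quickly, higher dimensions slowly, so the combination
# of all dimensions distinguishes positions even though each single sin/cos repeats.
```

This mix of frequencies is also where the Fourier-analysis intuition helps: the slow, low-frequency dimensions disambiguate positions that the fast ones alias.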