Transformers Flashcards
Translation consists of two tasks:
– Capturing the meaning of x
– Producing a sentence y that captures the meaning and is written in good English
Sequence to sequence models
One model is responsible for encoding x and another decodes this meaning into the sentence y; the decoder's predictions are conditioned on the encoding.
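A minimal sketch of this split, assuming PyTorch and GRU layers (class names, sizes, and the choice of GRU are illustrative, not taken from the course):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence x and compresses it into a hidden state."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, x):                       # x: (batch, src_len)
        _, h = self.rnn(self.embed(x))          # h: (1, batch, hid_dim)
        return h                                # the "meaning" of x

class Decoder(nn.Module):
    """Produces y one word at a time, conditioned on the encoding."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, y_prev, h):               # y_prev: (batch, steps)
        o, h = self.rnn(self.embed(y_prev), h)
        return self.out(o), h                   # logits over the next word
```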
How does the decoder work in Seq2Seq models?
– During training the decoder receives the correct (ground-truth) words as input and has to predict the next word (teacher forcing)
– During testing the decoder predicts each next word conditioned on the words it has already predicted (see the sketch below)
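The two modes can be sketched roughly like this, reusing the hypothetical Encoder/Decoder from the block above; encoder, decoder, x (source batch), and y_true (target batch) are assumed to exist, and BOS_TOKEN and MAX_LEN are made-up constants:

```python
import torch
import torch.nn as nn

BOS_TOKEN, MAX_LEN = 1, 20                                # assumed special id / length limit

# Training: teacher forcing - the correct previous word is fed at every step.
logits, _ = decoder(y_true[:, :-1], encoder(x))           # predict y_true[:, 1:]
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), y_true[:, 1:].reshape(-1))

# Testing: feed the model's own predictions back in (greedy decoding).
h = encoder(x)
y_prev = torch.full((x.size(0), 1), BOS_TOKEN)            # start-of-sentence id
for _ in range(MAX_LEN):
    logits, h = decoder(y_prev, h)
    y_prev = logits.argmax(-1)                            # the word just predicted
```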
What does attention in Seq2Seq help with?
– Translation performance is improved significantly
– Alignment is modeled with attention → interpretability
– Corresponds better with human way of translation, i.e., the model can look back at the source sentence
– Attention solves the bottleneck problem of the single fixed-size encoding (see the sketch after this list)
– Attention helps with the vanishing gradient problem due to the shortcuts to the input
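A rough sketch of what one attention step in Seq2Seq + attention computes, assuming NumPy and simple dot-product scoring (the course may use a different scoring function; shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One hidden state per source word, plus the current decoder state.
enc_states = np.random.randn(5, 128)       # (src_len, hidden)
dec_state  = np.random.randn(128)          # (hidden,)

scores  = enc_states @ dec_state           # how well each source word matches
weights = softmax(scores)                  # alignment - interpretable per word
context = weights @ enc_states             # weighted sum: the "look back" at x
```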
Key feature of Transformer
The Transformer relies entirely on attention mechanisms, allowing the model to focus on the complete context of a word.
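A short illustration of self-attention over the complete context using torch.nn.MultiheadAttention (embedding size, number of heads, and sequence length are made up):

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 10, 64)             # (batch, seq_len, embed_dim)
self_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Every position attends to every other position - the complete context of a word.
out, attn_weights = self_attn(tokens, tokens, tokens)
print(out.shape, attn_weights.shape)        # (1, 10, 64) and (1, 10, 10)
```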
Issues with the Transformer Architecture
– For every word a query has to be computed, and for every query the dot-product with all keys has to be computed
– The number of computations therefore increases quadratically with the sentence length (see the sketch after this list)
– How to maintain context over larger texts?
– Really powerful models have too many parameters (GPT-3 has 175 billion parameters), are trained on a dataset containing nearly a trillion words and do not fit on just one GPU
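A quick way to see the quadratic growth: the attention score matrix alone has seq_len² entries (NumPy sketch; sizes are arbitrary):

```python
import numpy as np

for seq_len in (128, 512, 2048):
    q = np.random.randn(seq_len, 64)
    k = np.random.randn(seq_len, 64)
    scores = q @ k.T                          # one score per (query, key) pair
    print(seq_len, scores.shape)              # (seq_len, seq_len): grows as n^2
```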
Which architecture's structure is most similar to the LSTM's cell state? Why?
One can argue both for the encoding produced by a sequence-to-sequence model and for the attention in a sequence-to-sequence + attention model, as both store contextual information.
Give an intuition about query, key and value. Use a real-world example as illustration.
Take Google search as an example: the query might be "I want to see pictures of dogs", which is matched against the key "dog images", which results in the value, a dog image, being displayed as the search result.
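The same intuition written as a soft dictionary lookup (NumPy sketch; all vectors and values are invented, and the softmax turns the hard "exact key match" of a search engine into a weighted mix of all values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

query  = np.array([1.0, 0.0, 1.0])            # "pictures of dogs"
keys   = np.array([[1.0, 0.0, 0.9],           # "dog images"
                   [0.0, 1.0, 0.0],           # "cat videos"
                   [0.1, 0.9, 0.1]])          # "bird songs"
values = np.array([[10.0], [20.0], [30.0]])   # what each key points to

weights = softmax(keys @ query / np.sqrt(len(query)))   # scaled dot-product
result  = weights @ values                               # dominated by the "dog" value
print(weights.round(2), result.round(2))
```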
What is a baseline classifier? Why is this beneficial?
A baseline classifier is a trivial strategy that does better than random guessing. It is beneficial as a reference point, to see whether a model actually learned something meaningful.
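A typical baseline is to always predict the most frequent class; a sketch with made-up labels is below (scikit-learn's DummyClassifier with strategy="most_frequent" implements the same idea):

```python
import numpy as np

y_train = np.array([1, 1, 1, 0, 1, 0, 1, 1])       # toy training labels
y_test  = np.array([1, 0, 1, 1, 0, 1])             # toy test labels

majority = np.bincount(y_train).argmax()            # most frequent training class
baseline_acc = (y_test == majority).mean()
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
# A real model should beat this number to have learned something meaningful.
```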
In which context are elementwise arithmetic and broadcasting used?
They are used in the context of matrix operations in a neural network, e.g., when a bias vector is added to a whole batch of activations.
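A NumPy example of both ideas in a typical layer computation (all shapes are illustrative):

```python
import numpy as np

activations = np.random.randn(32, 4)      # a batch of 32 examples, 4 features
weights     = np.random.randn(4, 3)       # a linear layer
bias        = np.array([0.1, 0.2, 0.3])   # one bias per output unit

out  = activations @ weights + bias       # bias (3,) is broadcast over all 32 rows
relu = np.maximum(out, 0)                 # elementwise arithmetic
print(out.shape)                          # (32, 3)
```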
Order the following four terms to get a working neural network for image classification:
Convolutional Layer
ReLU
Linear Layer
Sigmoid
Convolutional Layer → ReLU → Linear Layer → Sigmoid. This network can distinguish between two classes, as the sigmoid outputs values between 0 and 1.
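The ordering written as a hypothetical PyTorch model for binary image classification; the flattening step and the concrete sizes are assumptions needed to make it runnable:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # Convolutional Layer
    nn.ReLU(),                                    # ReLU
    nn.Flatten(),                                 # needed before the linear layer
    nn.Linear(8 * 28 * 28, 1),                    # Linear Layer
    nn.Sigmoid(),                                 # Sigmoid -> value in (0, 1)
)

x = torch.randn(4, 1, 28, 28)                     # e.g., a small batch of MNIST images
print(model(x).shape)                             # (4, 1): one probability per image
```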
The way the encoder is used during training and testing is the same.
True
Since neural machine translation consists of two subtasks, there is an encoder and a decoder in seq2seq models.
True
In general, a key is not necessary for the attention mechanism, but it is useful to compute it, since the task of finding an appropriate word and computing the change in the embedding can thus be separated.
True
A positional encoding can either be learned or precomputed.
True
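Both options sketched in PyTorch (d_model, max_len, and the sequence length are arbitrary): a precomputed sinusoidal table as in the original Transformer paper, and a learned embedding over positions.

```python
import math
import torch
import torch.nn as nn

max_len, d_model = 50, 16

# Precomputed: fixed sinusoidal table.
pos = torch.arange(max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

# Learned: an embedding over positions, trained with the rest of the model.
learned_pe = nn.Embedding(max_len, d_model)

tokens = torch.randn(1, 10, d_model)               # (batch, seq_len, d_model)
with_fixed   = tokens + pe[:10]                    # add the precomputed encoding
with_learned = tokens + learned_pe(torch.arange(10))
```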