Transformers Flashcards
What are sequence-to-sequence models?
They consist of two RNNs:
- One RNN encodes the source sentence x, and one decodes this meaning into the target sentence y; the predictions are conditioned on the encoding
- The complete model is trainable end-to-end using the cross-entropy loss on the predicted words
- During training the decoder receives the correct words as input and has to predict the next word (teacher forcing)
- During testing the decoder predicts each next word following the words it has already predicted, as shown in the sketch below
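A minimal sketch of the two decoder modes, assuming PyTorch; the GRU modules, toy dimensions, and greedy decoding are illustrative assumptions, not part of the card:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(0, vocab_size, (1, 7))   # source sentence as token ids
y = torch.randint(0, vocab_size, (1, 5))   # target sentence as token ids

_, h = encoder(embed(x))                   # final hidden state h = the encoding of x

# Training: teacher forcing -- the decoder gets the correct previous words as input
# and has to predict the next word; cross-entropy compares against y shifted by one.
dec_out, _ = decoder(embed(y[:, :-1]), h)
logits = out_proj(dec_out)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), y[:, 1:].reshape(-1))

# Testing: the decoder predicts each next word from the words it has already predicted.
token, state = y[:, :1], h                 # first target token as a stand-in for <BOS>
for _ in range(4):
    dec_out, state = decoder(embed(token), state)
    token = out_proj(dec_out).argmax(-1)   # greedy choice; fed back as the next input
```

The only difference between the two modes is where the decoder's input words come from: the reference sentence during training, its own previous predictions during testing.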
How does neural machine translation work?
The task can be described as finding the most probable target sentence y for a source sentence x:
$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y P(x \mid y)\,P(y)$
(the right-hand side follows from Bayes' rule, because $P(x)$ does not depend on $y$)
How does attention in NMT work?
Use attention to focus on different parts of the input sentence depending on the current state of the decoder
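A minimal sketch of this idea, assuming NumPy; the shapes and the simple dot-product scoring are illustrative assumptions. The decoder state scores every encoder hidden state, and the softmax over these scores decides which input positions to focus on:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

enc_states = np.random.randn(7, 64)   # one encoder hidden state per source word
dec_state = np.random.randn(64)       # current decoder hidden state

scores = enc_states @ dec_state       # one score per input position
weights = softmax(scores)             # attention distribution over the source sentence
context = weights @ enc_states        # weighted sum of encoder states, used for the next prediction
```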
What improvements does using attention in NMT bring?
- Performance improvement
- More interpretability
- Helps with the vanishing gradient problem due to the shortcuts to the input
- Solves the bottleneck problem
What’s an information bottleneck?
The encoding: the entire source sentence has to be squeezed into a single fixed-size vector, and the decoder can only rely on that vector
By what is the hidden state most influenced?
By the neighborhood of a word. Often the direct neighborhood defines the complete context of a word, but not always
What’s the problem with using RNNs for translation? What can we do to solve it?
RNNs hamper the parallelizability of the computations as the hidden states have to be computed sequentially.
Self-attention can remedy this issue, since all positions can be processed in parallel
What’s self-attention?
It’s one of the most important building blocks of the encoder and decoder in the transformer architecture
- For every word a query, a key, and a value are computed via learnable weight matrices (one matrix each for queries, keys, and values)
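A minimal sketch of scaled dot-product self-attention, assuming NumPy and toy dimensions; the random matrices stand in for learned parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 5, 32, 16          # 5 words, embedding size 32, head size 16
X = np.random.randn(n, d_model)      # embeddings of one sentence
W_q = np.random.randn(d_model, d_k)  # learnable weight matrices (random stand-ins here)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # query, key, value for every word
scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every word scores every other word
attn = softmax(scores, axis=-1)      # attention weights per word
out = attn @ V                       # new, context-aware representation of each word
```

The key decides where to look (via the query-key match); the value decides what information gets passed on.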
What’s an encoder?
The encoder’s primary role is to process input data, such as a sequence of words or tokens, and transform it into a meaningful representation that can be used for various downstream tasks.
What’s a decoder?
The component responsible for generating output sequences, typically in tasks like machine translation, text generation, or any sequence-to-sequence problem. The decoder works by taking the encoded representation from the encoder (or previous outputs during generation) and producing a new sequence, such as a translation of a sentence into another language.
At what rate does the number of computations increase with the sentence length?
Quadratically, because attention compares every word with every other word
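A small sketch of why, assuming NumPy; the sizes are arbitrary. Attention computes one score per pair of words, so doubling the sentence length quadruples the work:

```python
import numpy as np

for n in (8, 16, 32):
    X = np.random.randn(n, 32)       # n word embeddings
    scores = X @ X.T                 # n x n pairwise attention scores
    print(n, scores.size)            # 64, 256, 1024 -> grows as n**2
```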
Which statements are true about seq2seq models? (Multiple Choice)
1. Beam search generates too many candidates for long input sentences and is therefore memory inefficient.
2. Since neural machine translation consists of two subtasks, there is an encoder and a decoder in seq2seq models.
3. The information bottleneck in seq2seq models is the embedding of words, because there are too many words with different meanings, so it is difficult to capture them in a vector with only 256 values.
4. The way the encoder is used during training and testing is the same.
5. Languages like German are particularly difficult for neural machine translation models because they can contain arbitrary compound words.
2, 4
Which statements are true about the attention mechanism? (Multiple Choice)
1. Attention is a mechanism introduced with the Transformers and is the main reason for the success of the architecture.
2. In the attention mechanism of transformers, the query is calculated based on the embedding of the current word as well as the key, and the transformed embeddings of the other words serve as the values.
3. In general, a key is not necessary for the attention mechanism, but it is useful to compute it, since the task of finding an appropriate word and computing the change in the embedding can thus be separated.
4. When we use the attention mechanism, we also always need an explicit position embedding, because the position of words in languages matters.
3
Which statements are true about transformer in general? (Multiple Choice)
1. Transformers can be used for many tasks, the samples just need to be representable as a collection of embeddings of objects describing the entire sample.
2. Large transformer models have reached a status where they understand logic and are difficult to trick by humans.
3. Multi-head attention is the same as normal attention with a correspondingly longer feature vector, but it is easier to think of the concept as multi-head attention.
4. If the same training process is used as for seq2seq models, no masking is required.
5. A positional encoding can either be learned or precomputed.
1, 4, 5