Transformers Flashcards

1
Q

Translation consists of two tasks:

A

– Capturing the meaning of x
– Producing a sentence y that captures the meaning and is written in good English

2
Q

Sequence-to-sequence (seq2seq) models

A

One model is responsible for encoding x and another decodes this meaning into the sentence y; the decoder's predictions are conditioned on the encoding.
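
A concrete illustration (not part of the original card): a minimal encoder-decoder sketch in PyTorch, assuming a GRU for both parts; class names, vocabulary sizes, and hidden sizes are made up. The decoder is conditioned on the encoder's final hidden state.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, x, y):
            # encode x; the final hidden state represents the meaning of the sentence
            _, h = self.encoder(self.src_emb(x))
            # decode y conditioned on that encoding (h is the decoder's initial hidden state)
            dec_out, _ = self.decoder(self.tgt_emb(y), h)
            return self.out(dec_out)  # next-word logits at every target position

    logits = Seq2Seq(src_vocab=1000, tgt_vocab=1000)(
        torch.randint(0, 1000, (2, 7)),   # batch of 2 source sentences, 7 tokens each
        torch.randint(0, 1000, (2, 9)),   # batch of 2 target sentences, 9 tokens each
    )
    print(logits.shape)  # torch.Size([2, 9, 1000])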

3
Q

How does the decoder work in seq2seq models?

A

– During training, the decoder receives the correct words as input and has to predict the next word (teacher forcing)
– During testing, the decoder predicts the next word conditioned on the words it has already predicted (see the sketch below)
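
A sketch of the two regimes, assuming a model like the seq2seq sketch above that returns next-word logits for a given target prefix (function names and sizes are invented for illustration).

    import torch

    def train_step(model, x, y, loss_fn):
        # training: feed the correct words y[:, :-1] and predict the shifted targets y[:, 1:]
        logits = model(x, y[:, :-1])
        return loss_fn(logits.reshape(-1, logits.size(-1)), y[:, 1:].reshape(-1))

    def greedy_decode(model, x, bos_id, max_len=20):
        # testing: feed back the model's own predictions, one word at a time
        y = torch.full((x.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            next_logits = model(x, y)[:, -1]                   # logits for the next word
            next_word = next_logits.argmax(dim=-1, keepdim=True)
            y = torch.cat([y, next_word], dim=1)               # append the prediction
        return y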

4
Q

What does attention in seq2seq models help with?

A

– Performance is improved significantly
– Alignment is modeled with attention → interpretability
– Corresponds better to the human way of translating, i.e., the model can look back at the source sentence
– Attention solves the bottleneck problem
– Attention helps with the vanishing gradient problem thanks to the shortcuts to the input

5
Q

Key feature of the Transformer

A

Transformers rely on attention mechanisms, allowing the model to focus on the complete context of a word.

6
Q

Issues with the Transformer Architecture

A

– For every word a query is computed, and for every query the dot product with all keys has to be computed
– The number of computations therefore grows quadratically with the sentence length (see the sketch below)
– How can context be maintained over larger texts?
– Really powerful models have too many parameters (GPT-3 has 175 billion parameters), are trained on datasets containing nearly a trillion words, and do not fit on a single GPU
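
A quick illustration of the quadratic growth (embedding size made up): the query of every word is compared with the key of every word, so the score matrix has n × n entries.

    import torch

    for n in (10, 100, 1000):
        q = torch.randn(n, 64)    # one 64-dimensional query per word
        k = torch.randn(n, 64)    # one 64-dimensional key per word
        scores = q @ k.T          # n x n dot products
        print(n, scores.numel())  # 10 -> 100, 100 -> 10000, 1000 -> 1000000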

7
Q

Which architecture's structure is most similar to the LSTM's cell state? Why?

A

One can argue for both the information stored after the encoding in a sequence-to-sequence model and the attention in a sequence-to-sequence + attention model, as both store contextual information.

8
Q

Give an intuition about query, key, and value. Use a real-world example as an illustration.

A

Using Google search as an example: the query might be “I want to see pictures of dogs”, which is matched against the key “dog images”, and this results in the value, a dog image, being displayed as the search result.
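
The same intuition as a “soft” dictionary lookup, sketched in Python (the keys, values, and vectors below are invented for illustration): the query is compared with every key, and the values are mixed according to how well their keys match.

    import torch

    keys = {"dog images": torch.tensor([1.0, 0.0]),
            "cat images": torch.tensor([0.0, 1.0])}
    values = {"dog images": "a dog picture", "cat images": "a cat picture"}

    query = torch.tensor([0.9, 0.1])  # "I want to see pictures of dogs"
    scores = torch.stack([query @ k for k in keys.values()])
    weights = torch.softmax(scores, dim=0)
    for name, w in zip(keys, weights):
        print(values[name], f"weight {w.item():.2f}")  # the dog picture gets most of the weight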

9
Q

What is a baseline classifier? Why is this beneficial?

A

A baseline classifier is a trivial strategy that does better than random guessing. It is beneficial to check whether a model has actually learned something meaningful.
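
As an illustration (assuming scikit-learn and a toy dataset), a majority-class baseline that a trained model should be able to beat:

    from sklearn.dummy import DummyClassifier

    X_train, y_train = [[0], [1], [2], [3]], [1, 1, 1, 0]
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    print(baseline.score(X_train, y_train))  # 0.75, the accuracy of always predicting class 1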

10
Q

In which context are elementwise arithmetic and broadcasting used?

A

They are used in the context of matrix multiplication in a neural network
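
A small NumPy sketch of how this looks inside a network layer (shapes are made up): the bias vector is broadcast across the batch, and the activation is applied elementwise.

    import numpy as np

    x = np.random.randn(32, 10)  # batch of 32 inputs with 10 features
    W = np.random.randn(10, 4)   # weight matrix of a linear layer
    b = np.zeros(4)              # bias, shape (4,)

    z = x @ W + b                # (32, 4) + (4,): bias is broadcast over the batch
    a = np.maximum(z, 0)         # elementwise ReLU
    print(a.shape)               # (32, 4)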

11
Q

Order the following four terms to get a working neural network for image classification:

A

Convolutional Layer
ReLU
Linear Layer
Sigmoid

This network can distinguish between two classes, as the sigmoid outputs values between 0 and 1 (see the sketch below).
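
A runnable version of this ordering, sketched in PyTorch with made-up channel counts and image size; a Flatten step is added so the convolutional feature maps can be fed into the linear layer.

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolutional layer
        nn.ReLU(),                                  # elementwise non-linearity
        nn.Flatten(),                               # flatten feature maps for the linear layer
        nn.Linear(8 * 28 * 28, 1),                  # linear layer
        nn.Sigmoid(),                               # output in (0, 1): two classes
    )
    print(net(torch.randn(4, 1, 28, 28)).shape)     # torch.Size([4, 1])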

12
Q

The way the encoder is used during training and testing is the same.

A

True

13
Q

Since neural machine translation consists of two subtasks, there is an encoder and a decoder in seq2seq models.

A

True

14
Q

In general, a key is not necessary for the attention mechanism, but it is useful to compute it, since the task of finding an appropriate word and computing the change in the embedding can thus be separated.

A

True

15
Q

A positional encoding can either be learned or precomputed.

A

True
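
Both options as a small PyTorch sketch (dimensions are made up): a precomputed sinusoidal encoding alongside a learned positional embedding table.

    import torch
    import torch.nn as nn

    max_len, dim = 50, 16

    # precomputed: fixed sinusoids, no parameters to train
    pos = torch.arange(max_len).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angles = pos / (10000 ** (i / dim))
    sinusoidal = torch.zeros(max_len, dim)
    sinusoidal[:, 0::2] = torch.sin(angles)
    sinusoidal[:, 1::2] = torch.cos(angles)

    # learned: an embedding table indexed by position, trained with the rest of the model
    learned = nn.Embedding(max_len, dim)

    positions = torch.arange(10)  # positions of a 10-token sentence
    print(sinusoidal[positions].shape, learned(positions).shape)  # both are (10, 16)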

16
Q

If the same training process is used as for seq2seq models, no masking is required.

A

True
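
For context, when the decoder is instead trained on the whole target sequence in parallel, a causal mask keeps each position from attending to future words; a minimal sketch with made-up sizes:

    import torch

    n = 5
    scores = torch.randn(n, n)                         # raw attention scores for 5 target words
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    masked = scores.masked_fill(~mask, float("-inf"))  # future positions get -inf
    weights = torch.softmax(masked, dim=-1)            # rows sum to 1 over past positions only
    print(weights[0])                                  # the first word can only attend to itself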

17
Q

Transformers can be used for many tasks; the samples just need to be representable as a collection of embeddings of objects describing the entire sample.

A

True

18
Q

What is self-attention?

A

The most important building block of the encoder and decoder in the Transformer architecture. For every word, a query, a key, and a value are computed via learnable weight matrices.
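
A minimal self-attention sketch in PyTorch (sizes are made up, and the learnable weight matrices are modelled as bias-free nn.Linear layers): every word gets a query, key, and value, and each output is a weighted mixture of all values.

    import math
    import torch
    import torch.nn as nn

    n, dim = 6, 32                           # 6 words, 32-dimensional embeddings
    x = torch.randn(n, dim)

    W_q = nn.Linear(dim, dim, bias=False)    # learnable weight matrices
    W_k = nn.Linear(dim, dim, bias=False)
    W_v = nn.Linear(dim, dim, bias=False)

    Q, K, V = W_q(x), W_k(x), W_v(x)         # one query/key/value per word
    scores = Q @ K.T / math.sqrt(dim)        # scaled dot products, shape (n, n)
    weights = torch.softmax(scores, dim=-1)  # how much each word attends to every other word
    out = weights @ V                        # new, context-aware representation per word
    print(out.shape)                         # torch.Size([6, 32])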

19
Q

Problems with RNNs?

A

The hidden state is mainly influenced by a word's neighborhood, and there is no parallelization across time steps.