Transformers Flashcards

1
Q

Translation consists of two tasks:

A

– Capturing the meaning of x
– Producing a sentence y that captures the meaning and is written in good English

2
Q

Sequence-to-sequence (seq2seq) models

A

One model is responsible for encoding x and another decodes this meaning into the sentence y; the decoder's predictions are conditioned on the encoding.
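
A concrete illustration (not part of the original card): a minimal encoder-decoder sketch in PyTorch, assuming a GRU for both parts; class names, vocabulary sizes, and hidden sizes are made up. The decoder is conditioned on the encoder's final hidden state.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, x, y):
            # encode x; the final hidden state represents the meaning of the sentence
            _, h = self.encoder(self.src_emb(x))
            # decode y conditioned on that encoding (h is the decoder's initial hidden state)
            dec_out, _ = self.decoder(self.tgt_emb(y), h)
            return self.out(dec_out)  # next-word logits at every target position

    logits = Seq2Seq(src_vocab=1000, tgt_vocab=1000)(
        torch.randint(0, 1000, (2, 7)),   # batch of 2 source sentences, 7 tokens each
        torch.randint(0, 1000, (2, 9)),   # batch of 2 target sentences, 9 tokens each
    )
    print(logits.shape)  # torch.Size([2, 9, 1000])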

3
Q

How does the decoder work in seq2seq models?

A

– During training, the decoder receives the correct words as input and has to predict the next word (teacher forcing)
– During testing, the decoder predicts the next word conditioned on the words it has already predicted (see the sketch below)
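
A sketch of the two regimes, assuming a model like the seq2seq sketch above that returns next-word logits for a given target prefix (function names and sizes are invented for illustration).

    import torch

    def train_step(model, x, y, loss_fn):
        # training: feed the correct words y[:, :-1] and predict the shifted targets y[:, 1:]
        logits = model(x, y[:, :-1])
        return loss_fn(logits.reshape(-1, logits.size(-1)), y[:, 1:].reshape(-1))

    def greedy_decode(model, x, bos_id, max_len=20):
        # testing: feed back the model's own predictions, one word at a time
        y = torch.full((x.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            next_logits = model(x, y)[:, -1]                   # logits for the next word
            next_word = next_logits.argmax(dim=-1, keepdim=True)
            y = torch.cat([y, next_word], dim=1)               # append the prediction
        return y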

4
Q

What does attention in seq2seq models help with?

A

– Performance is improved significantly
– Alignment is modeled with attention → interpretability
– Corresponds better to the human way of translating, i.e., the model can look back at the source sentence
– Attention solves the bottleneck problem
– Attention helps with the vanishing gradient problem thanks to the shortcuts to the input

5
Q

Key feature of the Transformer

A

Transformers rely on attention mechanisms, allowing the model to focus on the complete context of a word.

6
Q

Issues with the Transformer Architecture

A

– For every word a query is computed, and for every query the dot product with all keys has to be computed
– The number of computations therefore grows quadratically with the sentence length (see the sketch below)
– How can context be maintained over larger texts?
– Really powerful models have too many parameters (GPT-3 has 175 billion parameters), are trained on datasets containing nearly a trillion words, and do not fit on a single GPU
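
A quick illustration of the quadratic growth (embedding size made up): the query of every word is compared with the key of every word, so the score matrix has n × n entries.

    import torch

    for n in (10, 100, 1000):
        q = torch.randn(n, 64)    # one 64-dimensional query per word
        k = torch.randn(n, 64)    # one 64-dimensional key per word
        scores = q @ k.T          # n x n dot products
        print(n, scores.numel())  # 10 -> 100, 100 -> 10000, 1000 -> 1000000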

7
Q

Which architecture's structure is most similar to the LSTM's cell state? Why?

A

One can argue for both the information stored after the encoding in a sequence-to-sequence model and the attention in a sequence-to-sequence + attention model, as both store contextual information.

8
Q

Give an intuition about query, key, and value. Use a real-world example as an illustration.

A

Using Google search as an example: the query might be “I want to see pictures of dogs”, which is matched against the key “dog images”, and this results in the value, a dog image, being displayed as the search result.
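
The same intuition as a “soft” dictionary lookup, sketched in Python (the keys, values, and vectors below are invented for illustration): the query is compared with every key, and the values are mixed according to how well their keys match.

    import torch

    keys = {"dog images": torch.tensor([1.0, 0.0]),
            "cat images": torch.tensor([0.0, 1.0])}
    values = {"dog images": "a dog picture", "cat images": "a cat picture"}

    query = torch.tensor([0.9, 0.1])  # "I want to see pictures of dogs"
    scores = torch.stack([query @ k for k in keys.values()])
    weights = torch.softmax(scores, dim=0)
    for name, w in zip(keys, weights):
        print(values[name], f"weight {w.item():.2f}")  # the dog picture gets most of the weight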

9
Q

What is a baseline classifier? Why is this beneficial?

A

A baseline classifier is a trivial strategy that does better than random guessing. It is beneficial to check whether a model has actually learned something meaningful.
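
As an illustration (assuming scikit-learn and a toy dataset), a majority-class baseline that a trained model should be able to beat:

    from sklearn.dummy import DummyClassifier

    X_train, y_train = [[0], [1], [2], [3]], [1, 1, 1, 0]
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    print(baseline.score(X_train, y_train))  # 0.75, the accuracy of always predicting class 1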

10
Q

In which context are elementwise arithmetic and broadcasting used?

A

They are used in the context of matrix multiplication in a neural network
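
A small NumPy sketch of how this looks inside a network layer (shapes are made up): the bias vector is broadcast across the batch, and the activation is applied elementwise.

    import numpy as np

    x = np.random.randn(32, 10)  # batch of 32 inputs with 10 features
    W = np.random.randn(10, 4)   # weight matrix of a linear layer
    b = np.zeros(4)              # bias, shape (4,)

    z = x @ W + b                # (32, 4) + (4,): bias is broadcast over the batch
    a = np.maximum(z, 0)         # elementwise ReLU
    print(a.shape)               # (32, 4)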

11
Q

Order the following four terms to get a working neural network for image classification:

A

Convolutional Layer
ReLU
Linear Layer
Sigmoid

This network can distinguish between two classes, as the sigmoid outputs values between 0 and 1 (see the sketch below).
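
A runnable version of this ordering, sketched in PyTorch with made-up channel counts and image size; a Flatten step is added so the convolutional feature maps can be fed into the linear layer.

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolutional layer
        nn.ReLU(),                                  # elementwise non-linearity
        nn.Flatten(),                               # flatten feature maps for the linear layer
        nn.Linear(8 * 28 * 28, 1),                  # linear layer
        nn.Sigmoid(),                               # output in (0, 1): two classes
    )
    print(net(torch.randn(4, 1, 28, 28)).shape)     # torch.Size([4, 1])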

12
Q

The way the encoder is used during training and testing is the same.

A

True

13
Q

Since neural machine translation consists of two subtasks, there is an encoder and a decoder in seq2seq models.

A

True

14
Q

In general, a key is not necessary for the attention mechanism, but it is useful to compute it, since the task of finding an appropriate word and computing the change in the embedding can thus be separated.

A

True

15
Q

A positional encoding can either be learned or precomputed.

A

True
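
Both options as a small PyTorch sketch (dimensions are made up): a precomputed sinusoidal encoding alongside a learned positional embedding table.

    import torch
    import torch.nn as nn

    max_len, dim = 50, 16

    # precomputed: fixed sinusoids, no parameters to train
    pos = torch.arange(max_len).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angles = pos / (10000 ** (i / dim))
    sinusoidal = torch.zeros(max_len, dim)
    sinusoidal[:, 0::2] = torch.sin(angles)
    sinusoidal[:, 1::2] = torch.cos(angles)

    # learned: an embedding table indexed by position, trained with the rest of the model
    learned = nn.Embedding(max_len, dim)

    positions = torch.arange(10)  # positions of a 10-token sentence
    print(sinusoidal[positions].shape, learned(positions).shape)  # both are (10, 16)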

16
Q

If the same training process is used as for seq2seq models, no masking is required.

A

True
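
For context, when the decoder is instead trained on the whole target sequence in parallel, a causal mask keeps each position from attending to future words; a minimal sketch with made-up sizes:

    import torch

    n = 5
    scores = torch.randn(n, n)                         # raw attention scores for 5 target words
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    masked = scores.masked_fill(~mask, float("-inf"))  # future positions get -inf
    weights = torch.softmax(masked, dim=-1)            # rows sum to 1 over past positions only
    print(weights[0])                                  # the first word can only attend to itself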

17
Q

Transformers can be used for many tasks; the samples just need to be representable as a collection of embeddings of objects describing the entire sample.

A

True

18
Q

What is self-attention?

A

The most important building block of the encoder and decoder in the Transformer architecture. For every word, a query, a key, and a value are computed via learnable weight matrices.
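
A minimal self-attention sketch in PyTorch (sizes are made up, and the learnable weight matrices are modelled as bias-free nn.Linear layers): every word gets a query, key, and value, and each output is a weighted mixture of all values.

    import math
    import torch
    import torch.nn as nn

    n, dim = 6, 32                           # 6 words, 32-dimensional embeddings
    x = torch.randn(n, dim)

    W_q = nn.Linear(dim, dim, bias=False)    # learnable weight matrices
    W_k = nn.Linear(dim, dim, bias=False)
    W_v = nn.Linear(dim, dim, bias=False)

    Q, K, V = W_q(x), W_k(x), W_v(x)         # one query/key/value per word
    scores = Q @ K.T / math.sqrt(dim)        # scaled dot products, shape (n, n)
    weights = torch.softmax(scores, dim=-1)  # how much each word attends to every other word
    out = weights @ V                        # new, context-aware representation per word
    print(out.shape)                         # torch.Size([6, 32])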

19
Q

Problems with RNNs?

A

The hidden state is mainly influenced by a word's neighborhood, and there is no parallelization across time steps.