11 - Transformers Flashcards

1
Q

What is semi-supervised learning?

A

Pretrained on a large unlabeled data set (unsupervised learning) and then fine-tuned through supervised training to get better performance.
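
A minimal sketch of this two-phase recipe, assuming PyTorch; the tensors are synthetic stand-ins for a real unlabeled corpus and a small labeled set, and the autoencoder objective is just one possible pretraining task:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared encoder reused across both phases.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

# Phase 1: unsupervised pretraining on unlabeled data (here: an autoencoder objective).
decoder = nn.Linear(16, 32)
unlabeled = torch.randn(512, 32)                       # stand-in for a large unlabeled set
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(100):
    loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: supervised fine-tuning on a much smaller labeled set.
head = nn.Linear(16, 3)                                # 3 hypothetical classes
labeled_x, labeled_y = torch.randn(64, 32), torch.randint(0, 3, (64,))
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
for _ in range(50):
    loss = nn.functional.cross_entropy(head(encoder(labeled_x)), labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()
```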

2
Q

Explain what the Transformer's Encoder and Decoder do and what layers they have.

A

Encoder:
- maps an input sequence of symbol representations (x1…xn) to a sequence of continuous representations (z1…zn)
- Stack of 6 layers, each containing the following sublayers, with a residual connection around each sublayer followed by layer normalization (a code sketch of one such layer follows below):
  - Multi-Head Self-Attention Mechanism
  - Fully Connected Feed-Forward Layer

Decoder:
- given z, the decoder generates the output sequence (y1…ym) of symbols. It generates one element at a time and is auto-regressive, meaning it also takes the previously generated symbols as input
- Stack of 6 layers, each containing the following sublayers, with a residual connection around each sublayer followed by layer normalization:
  - (Masked) Multi-Head Self-Attention Mechanism over the output generated so far
  - Multi-Head Attention Mechanism over the encoder output (encoder-decoder attention)
  - Fully Connected Feed-Forward Layer
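
A minimal sketch of one encoder layer, assuming PyTorch; d_model = 512, 8 heads and d_ff = 2048 are the base configuration from the original paper. The decoder layer follows the same pattern with the masked self-attention and encoder-decoder attention sublayers added.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sublayer 2: position-wise feed-forward, residual connection, layer norm.
        x = self.norm2(x + self.ff(x))
        return x

x = torch.randn(2, 10, 512)          # (batch, sequence, d_model)
print(EncoderLayer()(x).shape)       # torch.Size([2, 10, 512])
```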

3
Q

Transformer vs RNN

A
  • Until Transformers, RNNs (especially LSTMs and gated RNNs) were the best at processing sequential data and transduction problems such as language modelling.
  • RNNs generate a sequence of hidden states, each a function of the previous state and the current input. This structure is inherently sequential and prevents parallelization within a sequence.
  • Transformers use an attention mechanism, which provides context around each token, and can therefore run in parallel, which speeds them up. Before the Transformer, attention mechanisms were mostly used in conjunction with a recurrent net; the Transformer drops the recurrence entirely.
    • Positional encodings (position identifiers) are added to the embeddings so the model still knows about token order (a minimal sketch follows below).
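
A minimal sketch of the fixed sinusoidal positional encodings used in the original Transformer, assuming NumPy; learned position embeddings are a common alternative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model)[None, :]                       # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                  # odd dimensions: cosine
    return pe                                             # added to the token embeddings

print(positional_encoding(50, 512).shape)                 # (50, 512)
```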
4
Q

What is the attention mechanism?

A

Input: a query, keys and values (all vectors, like the output). They can all simply be copies of the input.
Output: a vector, the weighted sum of the values. The weights are computed by a compatibility function of the query with the corresponding key.

The two most commonly used attention functions are additive attention and (scaled) dot-product attention. Dot-product is faster and more space-efficient, but additive attention outperforms unscaled dot-product for larger values of d_k (hence the scaling).

Application of attention in the Transformer model:
1. Encoder: all keys, queries and values come from the previous encoder layer.
2. Decoder Bottom: the self-attention can attend to all positions of the output generated so far.
3. Decoder Top: keys and values come from the encoder output and serve as memory. Queries come from the output generated so far (the previous decoder layer).

5
Q

Scaled Dot-Product Attention vs Multi-Head Attention

A

Scaled Dot Product Attention

Queries and keys have dimension d_k (values have dimension d_v).

Compute the dot product of the query with each key and divide by the square root of the dimension d_k. A softmax is applied to get the weights, which are multiplied by the values to get the output. The attention is computed for multiple queries, keys and values simultaneously by stacking them into matrices:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
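
A minimal sketch of this formula, assuming NumPy and that the queries, keys and values are already stacked into matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility of each query with each key
    weights = softmax(scores, axis=-1)   # one weight distribution per query
    return weights @ V                   # weighted sum of the values

Q = np.random.randn(4, 64)   # 4 queries, d_k = 64
K = np.random.randn(6, 64)   # 6 keys
V = np.random.randn(6, 64)   # 6 values, d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```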

Multi-Head Attention

The queries, keys and values are linearly projected h times, with different, learned linear projections to d_k, d_k and d_v dimensions respectively. On each of these projections the attention function is performed in parallel, yielding d_v-dimensional output values. These are concatenated and projected once more to give the final output.
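
A minimal NumPy sketch of multi-head attention; the per-head function is the scaled dot-product attention above, the projection matrices are random placeholders for learned weights, and d_k = d_v = d_model / h follows the paper's convention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h

def attention(Q, K, V):                                 # scaled dot-product attention
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# One learned projection per head for Q, K and V, plus the final output projection W_O.
scale = 1 / np.sqrt(d_model)
W_Q = rng.standard_normal((h, d_model, d_k)) * scale
W_K = rng.standard_normal((h, d_model, d_k)) * scale
W_V = rng.standard_normal((h, d_model, d_v)) * scale
W_O = rng.standard_normal((h * d_v, d_model)) * scale

def multi_head_attention(Q, K, V):
    # Project h times, run attention on each projection, concatenate, project once more.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.standard_normal((10, d_model))       # self-attention: Q = K = V = x
print(multi_head_attention(x, x, x).shape)   # (10, 512)
```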

6
Q

Explain the Vision Transformer

A

Normally we use CNNs for images, but Transformers can be used as well. CNNs build in a lot of assumptions (inductive biases) about images, which Transformers don't.

  • The image is split into fixed-size patches (e.g. 16x16 pixels), which form an ordered sequence of input tokens
  • Patches are flattened into vectors and embedded with a single linear layer (the encoder uses an MLP as the feed-forward sublayer and applies the norm before, not after, each sublayer)
  • 1D positional encodings and a class token are added, both trainable (the original Transformer uses fixed sinusoidal positional encodings and no class token)

Formal Pipeline (a code sketch of the input side follows below):
1. A class embedding is prepended; x_class is learned, but is the same for all images and always sits at position 0
2. Each vectorized patch x is projected by E (embedded by a linear projection)
3. Positional embeddings E_pos are added
4. The encoder produces the output z using MSA (Multi-Head Self-Attention), MLP sublayers and residual connections
   - MSA splits its input into several heads so that each head can learn a different kind of self-attention; the head outputs are concatenated and then passed through the MLP sublayer
5. The first output vector of the encoder (the class-token position) is fed to the classification head
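
A minimal sketch of the ViT input side (patchify, linear embedding E, class token, learned 1D positional embeddings), assuming PyTorch; the 224x224 image size, 16x16 patches and d_model = 768 are ViT-Base-style choices used here for illustration.

```python
import torch
import torch.nn as nn

img_size, patch_size, d_model = 224, 16, 768
n_patches = (img_size // patch_size) ** 2             # 14 * 14 = 196 patches

patch_embed = nn.Linear(3 * patch_size * patch_size, d_model)      # single linear projection E
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))                # learned x_class, same for all images
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))    # learned 1D positional embeddings E_pos

def embed(images):                                     # images: (batch, 3, 224, 224)
    b = images.shape[0]
    # Split into 16x16 patches and flatten each patch into a vector.
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, n_patches, -1)
    tokens = patch_embed(patches)                      # (batch, 196, d_model)
    cls = cls_token.expand(b, -1, -1)                  # class token always at position 0
    return torch.cat([cls, tokens], dim=1) + pos_embed

x = torch.randn(2, 3, 224, 224)
z = embed(x)          # (2, 197, 768), fed to the encoder; the encoder's output at index 0 goes to the classification head
print(z.shape)
```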
