05 - Deep Learning for Sequence Processing Flashcards
What is an encoder and a decoder?
The encoder processes the input sequence into an internal representation, whereas the decoder generates the output sequence from that representation.
What is meant by neural attention?
Neural attention refers to neural networks that automatically weight the relevance of different inputs when making predictions - they 'attend' to the inputs. This typically brings performance gains, because the model can focus on the most informative parts of the input.
What is a ‘query vector’, q?
It is a vector derived from the current element of the input sequence; it is compared against the keys to decide how much attention each input element should receive.
What are ‘keys’?
Keys are the elements that are being matched or compared against the query vector, q.
What are ‘values’? (Think keys, values, queries)
Each key has an associated value vector. The values carry the content (context or meaning) of the input elements, and the attention output is a weighted average of them.
How do we normally transform affinity scores into probabilities?
We use the softmax, or the soft argmax, transformation.
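A minimal sketch of this step in NumPy (the affinity scores here are made up for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Example: three affinity scores become a probability distribution.
affinities = np.array([2.0, 0.5, -1.0])
print(softmax(affinities))  # approx. [0.79, 0.18, 0.04], sums to 1
```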
How do we compare a query, q and a key vector, h?
We use dot-product attention: the affinity score is the dot product q·h. For this to work, queries and keys need to have the same dimensionality.
Keys and values need to be the same vectors. True or false?
False! They can correspond to different linear projections of the same vectors, but they do not need to be the same vector.
What is it called when we have multiple neural attention operations that are combined?
Multi-head attention.
What is the process of dot-product self-attention?
First we compute the query vectors.
Then we compute the key vectors.
Then we compute the value vectors.
Then we compute the query-key affinity scores with dot products.
We convert these scores to probabilities through softmax.
Finally we output the weighted average of the values.
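A minimal NumPy sketch of these steps, assuming a toy sequence of token embeddings X and randomly initialised projection matrices (all names and sizes here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                    # toy sizes for illustration
X = rng.normal(size=(seq_len, d_model))    # one embedding per token

# Learned projection matrices (random stand-ins in this sketch).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q                                # 1. query vectors
K = X @ W_k                                # 2. key vectors
V = X @ W_v                                # 3. value vectors

scores = Q @ K.T                           # 4. query-key affinity scores (dot products)

# 5. convert the scores to probabilities with a row-wise softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

output = weights @ V                       # 6. weighted average of the values
print(output.shape)                        # (4, 8): one new vector per token
```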
How do we handle the dot products growing in magnitude as the dimensionality of queries and keys increases?
We scale the scores by the square root of the dimensionality of the query/key vectors before the softmax. Remember: queries and keys have the same dimension.
Attention(Q,K,V) = ?
In the above, Q = queries, K = keys, V = values
softmax( QK^T / sqrt(d_k) ) * V
where d_k is the dimensionality of the key (and query) vectors.
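As a sketch, this packages the steps from the earlier card into a single function and adds the 1/sqrt(d_k) scaling (names are illustrative; a real implementation would use learned projections and batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]                                # query/key dimensionality
    scores = Q @ K.T / np.sqrt(d_k)                  # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax
    return weights @ V                               # weighted sum of the values
```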
What is masked self-attention?
It is similar to self-attention, except that some positions are masked out of the attention computation and then predicted. For example, if we are predicting the next word, we mask that word (and everything after it) so the model cannot attend to it.
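A minimal sketch of the masking step, assuming a causal (next-word) setting; disallowed query-key scores are set to a large negative value so they receive roughly zero probability after the softmax:

```python
import numpy as np

def masked_attention_weights(scores):
    # scores: (seq_len, seq_len) query-key affinities.
    seq_len = scores.shape[0]
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)            # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.arange(9, dtype=float).reshape(3, 3)
print(masked_attention_weights(scores))              # upper triangle is ~0
```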
Why is masked self-attention useful?
It lets the model use the meaning and context of the visible words to predict the masked ones, which is exactly the situation at prediction time and so improves predictive power. It achieves this by focusing on the important visible words and their relationships to the other words in the sentence.
Briefly explain how multi-head self-attention works.
In multi-head self-attention we project the input into several lower-dimensional subspaces (one per head, each with its own projections) and perform self-attention in each head in parallel. To get the final output we concatenate the outputs of the heads (typically followed by one more linear projection).
If we have 6 attention heads (multi-head self-attention) and a token embedding size of 600, how might we project tokens to each attention head?
Each head works with projections of size 100 (600 / 6 = 100 dimensions per head).
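A minimal NumPy sketch using the sizes from this card (6 heads, 600-dimensional embeddings, 100 dimensions per head); the projection matrices are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 600, 6
d_head = d_model // n_heads                          # 600 / 6 = 100 per head

X = rng.normal(size=(seq_len, d_model))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

head_outputs = []
for _ in range(n_heads):
    # Each head has its own projections down to d_head dimensions.
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    head_outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))

# Concatenate the heads back up to the model dimension.
output = np.concatenate(head_outputs, axis=-1)
print(output.shape)                                  # (4, 600)
```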
Why are transformers called transformers and what mechanism do they use to ‘transform’?
Because they are built on the transformer architecture, which repeatedly 'transforms' the input representations layer by layer. The mechanism it uses to do this is self-attention.
Sequence-to-sequence (seq2seq) models are based on self-attention. True or false?
True
Transformers: Deep models with several layers scale quadratically with respect to sequence length. True or false?
True
The encoder layers do not use self-attention and feed-forward transformations. True or false?
False. They use both.