Transformers: Attention Flashcards
Attention formula
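Presumably the scaled dot-product attention from "Attention Is All You Need":

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where Q, K, V are the query, key, and value matrices and d_k is the key dimension.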
How do the keys, values, and query in attention map to a search process, say a YouTube search?
Query = text in search bar
Set of keys = video titles, words in video descriptions, maybe video tags too
Set of values = the actual videos (or I guess the video id?)
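A minimal NumPy sketch of the analogy (all vectors and numbers here are made up for illustration): the query is scored against every key, the scores are softmaxed into weights, and the result is a weighted mix of the values. A real search would return the top hit rather than a blend; attention keeps the soft weighting.

```python
import numpy as np

query = np.array([1.0, 0.0])                      # text in the search bar (toy embedding)
keys = np.array([[0.9, 0.1],                      # embedded titles / descriptions / tags
                 [0.1, 0.9],
                 [0.7, 0.3]])
values = np.array([[10.0], [20.0], [30.0]])       # stand-ins for the videos (or video ids)

scores = keys @ query                             # similarity of the query to each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
result = weights @ values                         # weighted blend of the values
print(weights, result)
```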
In a seq2seq model, we encode the input into a what?
Context vector
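A minimal PyTorch sketch of the idea (the GRU choice and all sizes are hypothetical): the encoder's final hidden state serves as the context vector that summarizes the whole input sequence for the decoder.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 32, 64, 1000

embedding = nn.Embedding(vocab_size, embed_dim)
encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))   # one sequence of 7 token ids
outputs, hidden = encoder(embedding(tokens))    # hidden: (1, 1, hidden_dim)
context_vector = hidden[-1]                     # (1, hidden_dim) summary of the input
```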
What kind of operation is self attention?
A sequence-to-sequence operation: a sequence of n input vectors goes in and a sequence of n output vectors comes out. (This resolves the earlier confusion: a sequence-to-sequence operation always has the same number of inputs and outputs, whereas a sequence-to-sequence model can map between sequences of different lengths.)
What is the high level formula for self attention?
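Presumably the basic (parameter-free) form: each output vector is a weighted sum over all the input vectors,

$$y_i = \sum_j w_{ij}\, x_j$$

where the weights w_ij are not learned parameters but are derived from the input vectors x_i and x_j themselves (see the cards below).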
What is a high level implementation of self attention?
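A minimal PyTorch sketch of basic, parameter-free self attention (the function name and shapes are illustrative; scaling by sqrt(d_k), masking, and learned query/key/value projections are omitted):

```python
import torch
import torch.nn.functional as F

def basic_self_attention(x):
    """x: (batch, seq_len, embed_dim) -> output of the same shape."""
    # raw weights: dot product between every pair of input vectors
    raw_weights = torch.bmm(x, x.transpose(1, 2))   # (batch, seq_len, seq_len)
    # row-wise softmax turns each row into a weight distribution
    weights = F.softmax(raw_weights, dim=2)
    # each output vector is a weighted sum of all input vectors
    return torch.bmm(weights, x)                    # (batch, seq_len, embed_dim)

y = basic_self_attention(torch.randn(2, 5, 16))     # toy example
```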
How would you naively calculate w21 in the self attention formula?
It would be the dot product of x2 and x1.
What is the only operation in the transformer architecture that propagates information between vectors?
The self attention operation
How to fully calculate the weight value in self attention, given the naive calculation of the weight value?
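Presumably: take the naive (raw) dot-product weights and apply a row-wise softmax so the weights for each output position sum to one,

$$w_{ij} = \frac{\exp\left(w'_{ij}\right)}{\sum_k \exp\left(w'_{ik}\right)}, \qquad w'_{ij} = x_i^\top x_j$$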