ML Topics from Attention is All You Need Flashcards
Recurrent neural networks (RNNs)
A type of neural network where the output from previous steps is fed back into the network along with the new input data. This allows RNNs to model sequence data. RNNs generate a sequence of hidden states to represent the input sequence, with each hidden state computed from the previous hidden state and the current input.
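As a rough illustration, here is a minimal NumPy sketch of a vanilla (Elman-style) RNN forward pass with random, untrained weights; the sizes are hypothetical. It shows why the computation is inherently sequential: each hidden state depends on the previous one.

    import numpy as np

    def rnn_forward(inputs, W_xh, W_hh, b_h):
        # Each hidden state depends on the current input AND the previous hidden
        # state, so the sequence has to be processed one step at a time.
        h = np.zeros(W_hh.shape[0])
        hidden_states = []
        for x in inputs:
            h = np.tanh(x @ W_xh + h @ W_hh + b_h)
            hidden_states.append(h)
        return hidden_states

    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(6, 4))   # a sequence of 6 steps, each a 4-dimensional input
    W_xh = rng.normal(size=(4, 8))     # input-to-hidden weights (toy sizes)
    W_hh = rng.normal(size=(8, 8))     # hidden-to-hidden (recurrent) weights
    b_h = np.zeros(8)
    print(len(rnn_forward(inputs, W_xh, W_hh, b_h)))  # 6: one hidden state per step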
Encoder-decoder architecture
A neural network structure consisting of two main components: an encoder, which encodes an input sequence into a vector representation, and a decoder, which generates an output sequence from the encoded vector. Used for sequence transduction problems like machine translation.
Attention mechanism
A mechanism that allows the model to focus on certain parts of the input sequence as needed to generate the output. Attention weights determine how much focus to place on each part of the input.
Self-attention
An attention mechanism where the queries, keys, and values all come from the same sequence, so the model relates different positions of a single sequence to each other in order to compute a representation of that sequence.
Scaled dot-product attention
An attention function where the compatibility between the query (what we want to focus attention on) and each key (what we compare the query against) is computed as their dot product, divided by the square root of the key dimension. These scaled scores are then passed through a softmax to obtain the attention weights, which are used to compute a weighted sum of the values.
Attention mechanisms allow neural networks to focus on specific parts of the input when generating the output. In scaled dot-product attention, the model calculates how compatible each part of the input (the “keys”) is with what it wants to focus on (the “query”). This focuses the model’s attention on the most relevant parts of the input.
To determine compatibility, the model takes the dot product (a measure of similarity) between the query and each key. However, as the dimensions of these vectors get larger, this dot product can become very large, even if the two vectors are not very similar. To counteract this, the dot product is divided by the square root of the dimension of the keys. This “scales” the dot products to a more reasonable range of values.
The scaled dot products between the query and each key indicate how compatible they are. They are then “softmaxed” - converted into probabilities that sum to 1. This gives the attention weights, denoting how much attention should be placed on each key. The values associated with the keys are then weighted by these attention weights and summed to get the result.
For example, if the query represents the concept of “dogs” and the keys represent sentences like “The quick brown fox jumped over the lazy dog” and “The black cat hissed”, the scaled dot product will be higher for the first sentence. After softmaxing, the first sentence may get an attention weight of 0.8 while the second gets 0.2. When these weights are applied, the output will contain most of the information from the first sentence, indicating the model placed more attention on it based on the query.
The square root in the denominator keeps this compatibility measurement stable as the dimensions get very large. Without it, the dot products would grow large even for modest-sized keys and queries, pushing the softmax into regions with extremely small gradients. The square root compresses the scores into a more useful range of values, enabling more meaningful attention weights.
In summary, scaled dot-product attention uses a measure of similarity between what the model wants to focus on and what it has available to focus on, scales that measure to a reasonable range of values, and then creates probabilities for weighting the available information based on relevance. This allows neural networks to put focused attention on the most important parts of the input for a given query.
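To make the mechanics concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy shapes and random inputs are purely illustrative, not taken from the paper.

    import numpy as np

    def softmax(x, axis=-1):
        # Shift by the max for numerical stability before exponentiating.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # compatibility of each query with each key
        weights = softmax(scores, axis=-1)   # attention weights sum to 1 over the keys
        return weights @ V, weights          # weighted sum of the values

    # Toy usage: one query attending over three key/value pairs.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(1, 8))
    K = rng.normal(size=(3, 8))
    V = rng.normal(size=(3, 16))
    output, weights = scaled_dot_product_attention(Q, K, V)
    print(weights)       # three weights summing to 1: how much attention each key gets
    print(output.shape)  # (1, 16): a weighted mix of the value vectors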
Multi-head attention
Attention allows neural networks to focus on specific parts of the input when generating the output. However, a single attention layer computes its weights in only one representation space, and averaging over that one view can wash out different kinds of relationships in the data. Multi-head attention enables the model to jointly attend to information from different representation subspaces of the input.
Multi-head attention works by using not just one, but multiple attention layers in parallel. Each attention layer uses a different learned projection to transform the queries (what the model attends to), keys (what the model attends over), and values (the data of interest) into different representation subspaces. These projections change the way the attention layer views the relationships between the data.
Then, within each representation subspace (called an attention “head”), scaled dot-product attention is used to determine attention weights as usual. The results from each head are then concatenated together and passed through a final linear projection. By stacking several of these multi-head attention layers on top of each other, the model can incorporate many different views of how the data is related.
For example, say the input is a sentence and the queries/keys represent words in the sentence. The first attention head could project the words into a subspace concerned with syntactic relationships, focusing on nouns, verbs and objects. The second attention head could project into a semantic subspace, focusing on the meaning and relationships between words. The third head could view the sentence in a subspace of space/time relationships.
Each head would compute different attention weights, focusing on its own representation subspace. The model could then attend over the input multiple times with these different views, gaining a richer understanding of the relationships in the data before moving on to subsequent layers.
By using multiple attention heads with different projections, the model learns different ways to decompose the task of relating the input to the output. This provides a much more expressive attention mechanism without substantially increasing the number of parameters in the model. Multi-head attention develops a view of the data that is the flexible aggregate of many different linear projections, rather than being limited to just one representation space.
In summary, multi-head attention allows neural networks to attend to information from different representation subspaces in the data. By using multiple attention layers with different learned projections in parallel, then combining them, multi-head attention provides a more sophisticated mechanism for focusing on what’s relevant in the input. This results in models with stronger learning and understanding abilities.
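Below is a minimal NumPy sketch of a single multi-head attention layer, assuming random (untrained) projection matrices; in a real model these projections are learned, and the head count and dimensions here are just illustrative.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multi_head_attention(X, W_q, W_k, W_v, W_o):
        # X: (seq_len, d_model); W_q/W_k/W_v hold one (d_model, d_head) projection per head.
        heads = []
        for Wq, Wk, Wv in zip(W_q, W_k, W_v):
            # Each head projects the same input into its own subspace, then attends there.
            heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
        # Concatenate the heads and apply a final linear projection back to d_model.
        return np.concatenate(heads, axis=-1) @ W_o

    # Toy usage: 4 heads of size 16 over a 5-position sequence with d_model = 64.
    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 5, 64, 4
    d_head = d_model // n_heads
    X = rng.normal(size=(seq_len, d_model))
    W_q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
    W_k = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
    W_v = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
    W_o = rng.normal(size=(n_heads * d_head, d_model))
    print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (5, 64)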
Feed-forward network
Feed-forward networks are a basic type of neural network in which information flows in one direction from input to output; the position-wise feed-forward network used in the Transformer has a single hidden layer. It applies the same linear transformations and nonlinear activation function at every position of the input to produce the output.
In the Transformer model, feed-forward networks are used as sublayers in both the encoder and decoder. They apply the same feed-forward network independently to the output at each position from the previous layer. This allows the model to learn relationships between data at each position, while still keeping the computational complexity relatively low.
A feed-forward network consists of two linear transformations with a nonlinearity in between. A linear transformation involves multiplying the input data by a weight matrix, then adding a bias vector, which projects the data into a new space. The nonlinearity is often a ReLU (rectified linear unit), which takes the element-wise maximum of the linear transformation's output and zero.
In the Transformer, the feed-forward network takes the output of the self-attention sublayer at each position, projects it into a “feed-forward” space with a larger dimensionality, applies the ReLU, then projects back to the original dimensionality. This larger space provides more expressive power to model complex relationships, and the ReLU introduces the nonlinearity needed for more powerful modeling.
For example, say the feed-forward input is a sequence of word embeddings representing a sentence. The first linear transformation may project this into a space that models the relationships between verbs and their objects. The ReLU keeps only the parts of this projection above 0, then the second linear transformation may project back to the input space. The feed-forward network has now modeled information about verbs and objects, which is incorporated back into the input for the next layer.
By applying the same feed-forward network at each position, the model is able to efficiently learn relationships in the data regardless of position. The larger dimensionality space and ReLU provide more expressive power than just a single linear transformation. Using feed-forward networks as sublayers in the encoder and decoder adds modeling power to the self-attention layers in the Transformer.
In summary, the feed-forward networks in the Transformer are simple networks with a single hidden layer, providing an easy way to learn relationships in the data. Applying the same feed-forward network at each sequence position provides a simple yet effective mechanism for adding expressiveness to self-attention models like the Transformer. With two linear transformations and a nonlinearity, feed-forward networks can model complex relationships while keeping computational complexity low.
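Here is a short NumPy sketch of the position-wise feed-forward computation with toy dimensions; in the paper's base model the inner layer is four times wider than the model dimension (2048 vs. 512).

    import numpy as np

    def position_wise_ffn(x, W1, b1, W2, b2):
        # First linear layer projects each position into the wider hidden dimension,
        # ReLU keeps only the positive activations, second layer projects back down.
        hidden = np.maximum(0, x @ W1 + b1)   # ReLU(x W1 + b1)
        return hidden @ W2 + b2

    # Toy usage: d_model = 8 projected up to d_ff = 32 and back, at every position at once.
    rng = np.random.default_rng(0)
    seq_len, d_model, d_ff = 5, 8, 32
    x = rng.normal(size=(seq_len, d_model))
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (5, 8): same weights used at each position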
Embedding
A learned vector representation of each vocabulary item. Used to convert input/output tokens into vectors of a specified dimension.
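A quick sketch of what an embedding lookup amounts to, using a random (untrained) table with toy sizes; in a real model the table is learned along with the rest of the network.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d_model = 100, 16
    embedding_table = rng.normal(size=(vocab_size, d_model))  # one vector per vocabulary item

    token_ids = np.array([4, 17, 92])      # a sentence represented as vocabulary indices
    vectors = embedding_table[token_ids]   # embedding = a simple row lookup
    print(vectors.shape)                   # (3, 16)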
Softmax function
A function that converts a set of values into probabilities by computing the exponential of each value divided by the sum of the exponentials of all values. Used to predict the probability of each next word in the output sequence.
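A small sketch of the softmax computation; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the mathematical definition.

    import numpy as np

    def softmax(values):
        # Exponentiate each value (shifted by the max for numerical stability),
        # then divide by the sum so the results are probabilities that sum to 1.
        e = np.exp(values - np.max(values))
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # largest input gets the largest probability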
Positional encoding
In sequence models, the position of elements in a sequence often matters. Recurrent neural networks inherently model position by carrying forward information from each element to the next in a sequence. Convolutional neural networks also implicitly encode position by applying filters over local windows of the sequence.
The Transformer, however, uses only self-attention, which itself is permutation invariant - it does not model the position of sequence elements. To compensate for this and give the Transformer a sense of position, positional encodings are added to the input embeddings. These encodings provide positional information to the first layer, which is then carried through the rest of the model.
The positional encodings used in the Transformer are sine and cosine functions of different frequencies. Each dimension of the encoding corresponds to a sinusoid of a different frequency, so each position in the sequence receives a distinct pattern of values. The idea is that these functions are relatively smooth while still distinguishing different positions, and the mix of frequencies can encode both relative and absolute position information.
For example, lower-frequency functions (with longer wavelengths) can represent more global, long-range position information, while higher frequencies provide more local information for distinguishing nearby elements. Using a mixture of frequencies allows the model to capture position on different scales. The sinusoids also make relative positions easy to work with: for any fixed offset k, the encoding at position p + k can be expressed as a linear function of the encoding at position p.
The positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. Since the Transformer uses residual connections around each of its sublayers, the positional information provided at the start is carried all the way through the model. Each layer can thus incorporate and build upon the position information from the previous layer.
In effect, the sinusoidal positional encodings provide a fixed encoding of relative and absolute position in a way that generalizes well across sequence lengths. With these encodings, the Transformer can represent and model position without recurrence or convolution. The self-attention layers learn through training to make use of the positional information, which is explicitly added at the input rather than implicitly encoded through the model architecture itself.
In summary, the Transformer adds positional encodings to its input embeddings to inject information about the position of each sequence element. Since the self-attention layers in the Transformer are permutation invariant, position needs to be modeled explicitly. The sinusoidal functions used as positional encodings vary smoothly while still distinguishing different positions, and by combining multiple frequencies the encodings contain both local and global position information. The positional encodings give the model a sense of position that it builds upon in subsequent layers.
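A sketch of the sinusoidal encoding following the sine/cosine formulation described above, assuming an even model dimension; the sequence length and dimension here are illustrative.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        # Even dimensions use sine and odd dimensions use cosine, with wavelengths
        # growing geometrically across the dimensions (the 10000 base is from the paper).
        positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
    print(pe.shape)  # (50, 16): added element-wise to the input embeddings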
Dropout
A regularization technique where some neurons are randomly dropped during training to prevent overfitting.
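A sketch of “inverted” dropout, the common implementation in which surviving activations are rescaled during training so nothing needs to change at inference time; the rate here is illustrative.

    import numpy as np

    def dropout(x, rate, training=True):
        if not training:
            return x  # at inference time every neuron is kept
        # Randomly zero activations with probability `rate`, and scale the survivors
        # by 1 / (1 - rate) so the expected activation stays the same.
        mask = np.random.random(x.shape) >= rate
        return x * mask / (1.0 - rate)

    print(dropout(np.ones((2, 5)), rate=0.4))  # roughly 40% of entries zeroed, the rest scaled up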
Label smoothing
Smoothing the target probabilities during training so that the model becomes less confident in its predictions. This hurts perplexity, since the model learns to be more unsure, but improves accuracy and BLEU score.
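One common way to construct smoothed targets, sketched below; the exact formulation varies between implementations, and the vocabulary size here is a toy value.

    import numpy as np

    def smooth_labels(target_index, vocab_size, epsilon=0.1):
        # Instead of putting probability 1.0 on the correct token, put 1 - epsilon on it
        # and spread epsilon uniformly over the remaining tokens.
        smoothed = np.full(vocab_size, epsilon / (vocab_size - 1))
        smoothed[target_index] = 1.0 - epsilon
        return smoothed

    print(smooth_labels(target_index=2, vocab_size=5))  # [0.025 0.025 0.9 0.025 0.025]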
BLEU score
A metric for evaluating machine translation that measures the similarity between candidate and reference translations, calculated based on n-gram precision. Higher is better.
Beam search
A heuristic search algorithm used at inference time in sequence models to find the sequences with highest probabilities. A “beam” of size n considers the top n best partial solutions at each step.
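A simplified beam search sketch over a hypothetical toy next-token model; real decoders work over a full vocabulary, and the paper also applies a length penalty, which is left out here for brevity. The core loop of expanding each hypothesis and pruning to the top n is the same.

    import math

    def beam_search(next_token_probs, beam_size, max_len, start_token, end_token):
        # Each hypothesis is (tokens, total log-probability); start with just the start token.
        beam = [([start_token], 0.0)]
        for _ in range(max_len):
            candidates = []
            for tokens, score in beam:
                if tokens[-1] == end_token:
                    candidates.append((tokens, score))  # finished hypotheses carry over as-is
                    continue
                for token, prob in next_token_probs(tokens).items():
                    # Sum log-probabilities so long sequences don't underflow to zero.
                    candidates.append((tokens + [token], score + math.log(prob)))
            # Keep only the top `beam_size` partial solutions at each step.
            beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        return beam

    # Hypothetical toy model: a lookup table of next-token probabilities.
    def toy_model(tokens):
        table = {
            "<s>": {"hello": 0.7, "hi": 0.3},
            "hello": {"world": 0.6, "there": 0.4},
            "hi": {"there": 0.9, "world": 0.1},
        }
        return table.get(tokens[-1], {"</s>": 1.0})

    for tokens, score in beam_search(toy_model, beam_size=2, max_len=4,
                                     start_token="<s>", end_token="</s>"):
        print(tokens, round(score, 3))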
Parallelization
Splitting computation across multiple processors/machines, or performing independent computations simultaneously, to speed up training. RNNs are hard to parallelize across sequence positions because each step depends on the previous one; self-attention layers process all positions at once, allowing far greater parallelization.