ML Topics from Attention is All You Need Flashcards

1
Q

Recurrent neural networks (RNNs)

A

A type of neural network where the outputs from previous steps are fed back into the network along with the new input. This allows RNNs to model sequence data. RNNs generate a sequence of hidden states to represent the input sequence.
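
A minimal numpy sketch of one recurrent step, assuming a simple tanh RNN; the weight names and sizes are illustrative, not taken from the paper:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        # The new hidden state mixes the current input with the previous hidden state.
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    # Illustrative sizes: input dimension 4, hidden dimension 3, sequence length 5.
    rng = np.random.default_rng(0)
    W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), np.zeros(3)
    h = np.zeros(3)
    for x_t in rng.normal(size=(5, 4)):
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # the sequence of hidden states represents the input so far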

2
Q

Encoder-decoder architecture

A

A neural network structure consisting of two main components: an encoder, which encodes an input sequence into a vector representation, and a decoder, which generates an output sequence from the encoded vector. Used for sequence transduction problems like machine translation.

3
Q

Attention mechanism

A

A mechanism that allows the model to focus on certain parts of the input sequence as needed to generate the output. Attention weights determine how much focus to place on each part of the input.

4
Q

Self-attention

A

An attention mechanism where the model attends to itself, relating different parts of a single sequence to compute a representation of that sequence.

5
Q

Scaled dot-product attention

A

An attention function where the compatibility between the query (what we want to focus attention on) and key (what we want to compare the query to) is computed as the dot product between the query and key, divided by the square root of the key dimension. This is then softmaxed to obtain the attention weights.

Attention mechanisms allow neural networks to focus on specific parts of the input when generating the output. In scaled dot-product attention, the model calculates how compatible each part of the input (the “keys”) is with what it wants to focus on (the “query”). This focuses the model’s attention on the most relevant parts of the input.

To determine compatibility, the model takes the dot product (a measure of similarity) between the query and each key. However, as the dimensions of these vectors get larger, this dot product can become very large, even if the two vectors are not very similar. To counteract this, the dot product is divided by the square root of the dimension of the keys. This “scales” the dot products to a more reasonable range of values.

The scaled dot products between the query and each key indicate how compatible they are. They are then “softmaxed” - converted into probabilities that sum to 1. This gives the attention weights, denoting how much attention should be placed on each key. The values associated with the keys are then weighted by these attention weights and summed to produce the output.

For example, if the query represents the concept of “dogs” and the keys represent sentences like “The quick brown fox jumped over the lazy dog” and “The black cat hissed”, the scaled dot product will be higher for the first sentence. After softmaxing, the first sentence may get an attention weight of 0.8 while the second gets 0.2. When these weights are applied, the output will contain most of the information from the first sentence, indicating the model placed more attention on it based on the query.

The square root in the denominator keeps this compatibility measurement stable as the dimensions grow large. Without it, the dot products would grow very large even for modest-sized keys and queries, pushing the softmax into regions where it has extremely small gradients. The scaling compresses the dot products into a more useful range of values, enabling more meaningful attention weights.

In summary, scaled dot-product attention uses a measure of similarity between what the model wants to focus on and what it has available to focus on, scales that measure to a reasonable range of values, and then creates probabilities for weighting the available information based on relevance. This allows neural networks to put focused attention on the most important parts of the input for a given query.
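
A minimal numpy sketch of the computation described above, softmax(Q K^T / sqrt(d_k)) V, with random matrices standing in for real queries, keys and values:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract the max for numerical stability
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # scaled compatibility of each query with each key
        weights = softmax(scores, axis=-1)   # attention weights: probabilities summing to 1 over the keys
        return weights @ V                   # weighted sum of the values

    # Illustrative shapes: 2 queries, 3 key/value pairs, dimension d_k = 4.
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
    out = scaled_dot_product_attention(Q, K, V)   # shape (2, 4)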

6
Q

Multi-head attention

A

Attention allows neural networks to focus on specific parts of the input when generating the output. However, a single attention layer computes its weights in just one representation space, so it captures only one view of how the elements of the input relate. Multi-head attention enables the model to jointly attend to information from different representation subspaces of the input.

Multi-head attention works by using not just one, but multiple attention layers in parallel. Each attention layer uses a different learned projection to transform the queries (what the model attends to), keys (what the model attends over), and values (the data of interest) into different representation subspaces. These projections change the way the attention layer views the relationships between the data.

Then, within each representation subspace (called an attention “head”), scaled dot-product attention is used to determine attention weights as usual. The results from each head are concatenated and passed through a final linear projection. By stacking multiple of these multi-head attention layers on top of each other, the model can incorporate many different views of how the data is related.

For example, say the input is a sentence and the queries/keys represent words in the sentence. The first attention head could project the words into a subspace concerned with syntactic relationships, focusing on nouns, verbs and objects. The second attention head could project into a semantic subspace, focusing on the meaning and relationships between words. The third head could view the sentence in a subspace of space/time relationships.

Each head would compute different attention weights, focusing on its own representation subspace. The model could then attend over the input multiple times with these different views, gaining a richer understanding of the relationships in the data before moving on to subsequent layers.

By using multiple attention heads with different projections, the model learns different ways to decompose the task of relating the input to the output. This provides a much more expressive attention mechanism without substantially increasing the number of parameters in the model. Multi-head attention develops a view of the data that is the flexible aggregate of many different linear projections, rather than being limited to just one representation space.

In summary, multi-head attention allows neural networks to attend to information from different representation subspaces in the data. By using multiple attention layers with different learned projections in parallel, then combining them, multi-head attention provides a more sophisticated mechanism for focusing on what’s relevant in the input. This results in models with stronger learning and understanding abilities.
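
A minimal numpy sketch of multi-head self-attention under the description above; the projection matrices are random stand-ins for learned parameters, and the sizes are illustrative:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention within one head.
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multi_head_attention(X, W_q, W_k, W_v, W_o):
        # Each head projects the input into its own subspace and attends there;
        # the heads are then concatenated and projected back to the model dimension.
        heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(len(W_q))]
        return np.concatenate(heads, axis=-1) @ W_o

    # Illustrative sizes: 3 positions, model dimension 8, h = 2 heads of dimension 4 each.
    rng = np.random.default_rng(0)
    n, d_model, h, d_head = 3, 8, 2, 4
    X = rng.normal(size=(n, d_model))            # self-attention: queries, keys and values all come from X
    W_q = rng.normal(size=(h, d_model, d_head))
    W_k = rng.normal(size=(h, d_model, d_head))
    W_v = rng.normal(size=(h, d_model, d_head))
    W_o = rng.normal(size=(h * d_head, d_model))
    out = multi_head_attention(X, W_q, W_k, W_v, W_o)   # shape (n, d_model)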

7
Q

Feed-forward network

A

A feed-forward network is a simple fully connected neural network; the position-wise feed-forward network in the Transformer has a single hidden layer. It applies the same linear transformations and nonlinear activation function at every position of the input to produce the output.

In the Transformer model, feed-forward networks are used as sublayers in both the encoder and decoder. The same feed-forward network is applied independently to the output at each position from the previous layer. This lets the model further transform the representation at every position while keeping computational complexity relatively low.

A feed-forward network consists of two linear transformations with a nonlinearity in between. A linear transformation involves multiplying input data by a weight matrix, then adding a bias vector. This projects the data into a new space. The nonlinearity is often a ReLU (rectified linear unit), which simply takes the maximum of the linear transformation and zero.

In the Transformer, the feed-forward network takes the attention sublayer’s output at each position, projects it into an inner “feed-forward” space with a larger dimensionality, applies the ReLU, then projects back to the original dimensionality. This larger space provides more expressive power to model complex relationships, and the ReLU introduces the nonlinearity needed for more powerful modeling.

For example, say the feed-forward input is a sequence of vectors representing a sentence. The first linear transformation may project these into a space that emphasizes, say, the relationships between verbs and their objects. The ReLU keeps only the parts of this projection above zero, then the second linear transformation projects back to the input space. The feed-forward network has now modeled information about verbs and objects, which is incorporated into the representation passed to the next layer.

By applying the same feed-forward network at each position, the model is able to efficiently learn relationships in the data regardless of position. The larger dimensionality space and ReLU provide more expressive power than just a single linear transformation. Using feed-forward networks as sublayers in the encoder and decoder adds modeling power to the self-attention layers in the Transformer.

In summary, feed-forward networks are simple neural networks with a single hidden layer, providing an easy way to learn relationships between inputs. Applying the same feed-forward network at each sequence position provides a simple yet effective mechanism for adding expressiveness to self-attention models like the Transformer. With two linear transformations and a nonlinearity, feed-forward networks can model complex relationships while still keeping computational complexity low.
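
A minimal numpy sketch of the position-wise feed-forward network described above, FFN(x) = max(0, xW1 + b1)W2 + b2; the dimensions follow the paper’s base model, while the random weights are illustrative:

    import numpy as np

    def position_wise_ffn(x, W1, b1, W2, b2):
        # Two linear transformations with a ReLU in between, applied identically at every position.
        hidden = np.maximum(0, x @ W1 + b1)   # project up to the inner dimension and apply ReLU
        return hidden @ W2 + b2               # project back down to the model dimension

    # Base-model dimensions from the paper: d_model = 512, inner dimension d_ff = 2048.
    rng = np.random.default_rng(0)
    d_model, d_ff, n = 512, 2048, 10
    x = rng.normal(size=(n, d_model))                          # one vector per sequence position
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    out = position_wise_ffn(x, W1, b1, W2, b2)                 # same FFN applied at every position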

8
Q

Embedding

A

A learned vector representation of each vocabulary item. Used to convert input/output tokens into vectors of a specified dimension.

9
Q

Softmax function

A

A function that converts a set of values into probabilities by computing the exponential of each value divided by the sum of the exponentials of all the values. Used to predict the probability of each possible next word in the output sequence.
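
A minimal sketch of this computation on an illustrative set of scores:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())      # subtracting the max leaves the result unchanged but avoids overflow
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)          # roughly [0.66, 0.24, 0.10]; the probabilities sum to 1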

10
Q

Positional encoding

A

In sequence models, the position of elements in a sequence often matters. Recurrent neural networks inherently model position by carrying forward information from each element to the next in a sequence. Convolutional neural networks also implicitly encode position by applying filters over local windows of the sequence.

The Transformer, however, uses only self-attention, which itself is permutation invariant - it does not model the position of sequence elements. To compensate for this and give the Transformer a sense of position, positional encodings are added to the input embeddings. These encodings provide positional information to the first layer, which is then carried through the rest of the model.

The positional encodings used in the Transformer are sine and cosine functions of different frequencies, with each dimension of the encoding corresponding to a sinusoid of a different frequency. The idea is that these functions vary smoothly while still distinguishing different positions, and the mixture of frequencies can encode both relative and absolute position information.

For example, lower-frequency functions (with longer wavelengths) capture more global, long-range position information, while higher frequencies provide finer local information for distinguishing nearby elements. Using a mixture of frequencies allows the model to capture position on different scales. The sinusoids also have the useful property that, for any fixed offset k, the encoding of position pos + k can be expressed as a linear function of the encoding of position pos, which may make it easier for the model to attend by relative position.

The positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. Since the Transformer uses residual connections around each of its sublayers, the positional information provided at the start is carried all the way through the model. Each layer can thus incorporate and build upon the position information from the previous layer.

In effect, the sinusoidal positional encodings provide a fixed encoding of relative and absolute position in a way that generalizes well across sequence lengths. With these encodings, the Transformer can represent and model position even though its attention layers are not inherently position-sensitive. The self-attention layers learn through training how to use the positional information, which is injected explicitly at the input and carried through the model rather than arising implicitly from the architecture itself.

In summary, the Transformer adds positional encodings to its input embeddings to inject information about the position of each sequence element. Since the self-attention layers in the Transformer are permutation invariant, position needs to be modeled explicitly. The sinusoidal positional encodings distinguish different positions while varying smoothly, and by combining multiple frequencies they contain both local and global position information. The positional encodings give the model a sense of position that it builds upon in subsequent layers.
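
A minimal sketch of the sinusoidal encodings, following the paper’s formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length here is illustrative:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]             # positions 0 .. seq_len - 1
        i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                  # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                  # cosine on odd dimensions
        return pe

    pe = positional_encoding(seq_len=50, d_model=512)
    # The encodings are simply added to the input embeddings: x = embeddings + pe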

11
Q

Dropout

A

A regularization technique where some neurons are randomly dropped during training to prevent overfitting.
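
A minimal sketch of (inverted) dropout during training; the rate of 0.1 matches the paper’s base model, and the rescaling convention is an assumption of this sketch:

    import numpy as np

    def dropout(x, rate, training, rng):
        if not training or rate == 0.0:
            return x                               # nothing is dropped at inference time
        mask = rng.random(x.shape) >= rate         # keep each unit with probability 1 - rate
        return x * mask / (1.0 - rate)             # rescale so the expected activation is unchanged

    rng = np.random.default_rng(0)
    x = np.ones((2, 4))
    y = dropout(x, rate=0.1, training=True, rng=rng)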

12
Q

Label smoothing

A

Softening the hard one-hot target probabilities during training so that the model learns to be less confident in its predictions. Hurts perplexity but improves accuracy and BLEU score.
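
A minimal sketch of one common formulation, spreading ε of the probability mass uniformly over the vocabulary; the value ε = 0.1 follows the paper:

    import numpy as np

    def smooth_labels(target_index, vocab_size, eps=0.1):
        # Put 1 - eps on the correct token and spread eps uniformly over the whole vocabulary.
        smoothed = np.full(vocab_size, eps / vocab_size)
        smoothed[target_index] += 1.0 - eps
        return smoothed

    targets = smooth_labels(target_index=2, vocab_size=5)   # [0.02, 0.02, 0.92, 0.02, 0.02]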

13
Q

BLEU score

A

A metric for evaluating machine translation that measures the similarity between candidate and reference translations, calculated based on n-gram precision. Higher is better.

14
Q

Beam search

A

A heuristic search algorithm used at inference time in sequence models to find high-probability output sequences. A “beam” of size n keeps the n best partial sequences at each step.
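
A minimal sketch of beam search over a toy next-token distribution; the step function, token ids and end-of-sequence handling are placeholders rather than the paper’s setup (which used a beam size of 4 together with a length penalty):

    import numpy as np

    def beam_search(step_log_probs, beam_size, max_len, eos_id):
        # step_log_probs(prefix) returns log-probabilities over the next token given a prefix.
        beams = [((), 0.0)]                                   # (token prefix, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if prefix and prefix[-1] == eos_id:           # finished sequences are carried over unchanged
                    candidates.append((prefix, score))
                    continue
                for tok, lp in enumerate(step_log_probs(prefix)):
                    candidates.append((prefix + (tok,), score + lp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]   # keep the n best
        return beams

    # Toy "model": a fixed distribution over a 4-token vocabulary, with token 3 as end-of-sequence.
    def toy_step(prefix):
        return np.log(np.array([0.1, 0.5, 0.2, 0.2]))

    best = beam_search(toy_step, beam_size=4, max_len=5, eos_id=3)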

15
Q

Parallelization

A

Splitting computation across multiple machines/processors to speed up training. RNNs are hard to parallelize due to their sequential nature. Self-attention layers allow for greater parallelization.

16
Q

Batching

A

Processing multiple training examples at once, instead of one example at a time. Allows for greater parallelization and optimization. More difficult with longer sequences due to memory constraints. Self-attention also helps mitigate this.

17
Q

Perplexity

A

A metric for evaluating language models, calculated as the exponential of the negative average log-likelihood of the test tokens. Lower is better.
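
A minimal sketch of the calculation, using the per-token log-probabilities assigned by a model:

    import numpy as np

    def perplexity(token_log_probs):
        # Exponential of the negative average log-likelihood of the observed tokens.
        return np.exp(-np.mean(token_log_probs))

    # A model that assigns probability 0.25 to every observed token has perplexity 4.
    print(perplexity(np.log([0.25, 0.25, 0.25])))   # 4.0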

18
Q

Adam optimizer

A

A popular optimization algorithm for training neural networks. Combines advantages of RMSProp and momentum methods. Uses adaptive learning rates for different parameters.
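
A minimal sketch of Adam updates on a toy quadratic loss; β1 = 0.9, β2 = 0.98 and ε = 1e-9 follow the paper, while the fixed learning rate here stands in for the paper’s warmup-based schedule:

    import numpy as np

    def adam_update(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.98, eps=1e-9):
        m = beta1 * m + (1 - beta1) * grad            # moving average of the gradient (momentum)
        v = beta2 * v + (1 - beta2) * grad**2         # moving average of the squared gradient (RMSProp-like)
        m_hat = m / (1 - beta1**t)                    # bias correction for the early steps
        v_hat = v / (1 - beta2**t)
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
        return param, m, v

    w = np.array([1.0, -2.0])
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, 4):                             # t starts at 1 so the bias correction is defined
        grad = 2 * w                                  # gradient of the toy loss sum(w**2)
        w, m, v = adam_update(w, grad, m, v, t)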

19
Q

Learning rate

A

The step size used to update parameters in model training. Larger learning rates cover more ground each update, but can oscillate or diverge. Smaller learning rates require more updates to reach a minimum. Often decayed over training.

20
Q

Weight tying

A

Sharing the same weight matrix between the input/output embedding layers and the pre-softmax linear transformation. Reduces the number of parameters and can help with generalization.

21
Q

Convolutional neural network (CNN)

A

A type of neural network in which convolution operations are the key tool for processing spatially structured representations. Previously popular for machine translation, but CNNs find it harder to learn long-range dependencies because the number of operations needed to relate two positions grows with the distance between them.

22
Q

Byte-pair encoding

A

A compression-inspired algorithm used to build a compact subword vocabulary for machine translation. Starting from characters, the most frequent pairs of adjacent symbols are iteratively merged until the desired number of merges is reached.

23
Q

Word pieces

A

Similar to BPE, a variable-length encoding of words as sequences of word pieces from a fixed vocabulary.

24
Q

per-wordpiece perplexity

A

Perplexity calculated based on the word piece encoding, should not be directly compared to per-word perplexity.

25
Q

Length penalty

A

A term added to the log-likelihood in beam search to discourage short outputs. Typically a function of the output length.

26
Q

Early stopping

A

Terminating beam search early when possible, rather than always generating up to the maximum output length. Prevents wasting time and memory on excessively long outputs.

27
Q

Residual connection

A

A connection that adds a layer’s input to its output. Helps with training deeper neural networks by mitigating vanishing gradients.
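
A minimal sketch of the idea; the stand-in sublayer is just for illustration:

    import numpy as np

    def residual(sublayer, x):
        # The layer input is added back onto the layer output, so gradients can flow through the addition.
        return x + sublayer(x)

    x = np.ones(4)
    out = residual(lambda v: 0.5 * v, x)   # out == 1.5 * x; the Transformer wraps each sublayer this way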

28
Q

Layer normalization

A

Normalizing the outputs of a layer by subtracting their mean and dividing by their standard deviation, computed across the features for each example. Promotes a stable distribution of layer outputs.
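
A minimal sketch, normalizing over the feature dimension with scalar gain and bias for simplicity (in practice these are learned per-feature parameters):

    import numpy as np

    def layer_norm(x, gain=1.0, bias=0.0, eps=1e-6):
        # Normalize each example's features to zero mean and unit variance, then scale and shift.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return gain * (x - mean) / (std + eps) + bias

    x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
    y = layer_norm(x)    # each row now has mean ~0 and standard deviation ~1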

29
Q

ReLU activation

A

An activation function that takes the maximum of 0 and the input value. Introduces nonlinearity into the network while avoiding the vanishing gradient problem.

30
Q

TensorFlow

A

An open-source software framework from Google for machine learning, used to implement the model in this paper.

31
Q

Tensor2Tensor

A

A library built on top of TensorFlow, used in this paper to help speed up and scale experimentation. Includes predefined models and datasets for many ML problems.

32
Q

k

A

The kernel size of convolutional layers, defining the width of the “window” over the input that is processed at a time. A larger kernel size means longer context but more parameters.

33
Q

n

A

The length of a sequence, i.e. number of tokens. Used to assess computational complexity based on sequence length.

34
Q

d

A

The model dimensionality, i.e. number of units in a representation. Used to assess computational complexity.

35
Q

h

A

The number of heads in multi-head attention, defining how many separate attention layers are calculated in parallel.

36
Q

r

A

The size of the neighborhood in restricted self-attention, which limits attention to nearby positions. Reduces per-layer cost for very long sequences, but increases the maximum path length between positions to O(n/r).

37
Q

Online vs Offline Inference

A

Online inference refers to making predictions using a machine learning model in real-time, as the data is streaming in. Offline inference refers to making predictions on existing data that is static and unchanging.

Some key differences between online and offline inference are:

  • Latency - Online inference needs to be low latency, returning predictions quickly for real-time use cases. Offline inference has more flexibility in latency since the data is static.
  • Data volume - Online inference needs to handle potentially unbounded data streams. Offline inference deals with a fixed, bounded dataset.
  • Adaptability - Models for online inference may need to adapt to changing data over time. Offline models are trained once on static data.
  • Resource usage - Online inference may need to be very optimized and efficient to handle streaming data with limited resources. Offline inference has more flexibility in using resources on static data.

Some examples of online vs offline inference:

  • Fraud detection - Using a machine learning model in real-time to detect fraudulent transactions would be online inference. Re-scoring historical transactions with a model would be offline inference.
  • Product recommendations - Making recommendations to users on a website based on their activity would require an online inference system. Scoring static profiles to generate recommendations in advance would be offline inference.
  • Image classification - Classifying images uploaded by users on a social network would require online inference at scale. Running a model once on a dataset of photos for a research paper would qualify as offline inference.

In summary, the key distinction comes down to whether the data is static or streaming and whether latency and throughput requirements necessitate a real-time prediction system. Both online and offline inference have their uses, and machine learning architectures need to handle the different requirements of these two types of prediction.