Unit 4: Sequential Data Flashcards
Image captioning
A task to take a single image as input and produce a textual description as output.
Sentiment Analysis
A task to take a textual description as input and generate a sentiment value as output.
Machine Translation
Translation from one language to another (e.g. English to French)
Hadamard product
The element-wise product
Symbol: ⊙
(0, 0.2, 1) ⊙ (3, 4, 2) = (0, 0.8, 2)
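A minimal sketch in Python (NumPy is an assumed library choice here) showing that the Hadamard product is just element-wise multiplication:

```python
import numpy as np

a = np.array([0.0, 0.2, 1.0])
b = np.array([3.0, 4.0, 2.0])

# Element-wise (Hadamard) product: each entry is a[i] * b[i]
print(a * b)  # [0.  0.8 2. ]
```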
Gated Recurrent Unit
Does away with the cell state of an LSTM, and simply uses the hidden state hₜ
for the persistent memory.
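A brief sketch, assuming PyTorch, showing that a GRU carries only a hidden state (no separate cell state), unlike an LSTM:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)          # (batch, seq_len, features)
out, h_n = gru(x)                  # only a hidden state is returned
print(out.shape, h_n.shape)        # torch.Size([4, 10, 16]) torch.Size([1, 4, 16])

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)          # the LSTM additionally returns a cell state c_n
```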
Word embedding
A useful technique for use with word tokens. It maps each word into an M-dimensional space where similar words are close by in terms of Euclidean distance.
The value of M is generally much less than the number of tokens in the vocabulary N (M ≪ N).
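A small sketch, assuming PyTorch, of an embedding layer mapping token indices from a vocabulary of size N into an M-dimensional space (M ≪ N); the sizes are illustrative:

```python
import torch
import torch.nn as nn

N, M = 10_000, 64                             # vocabulary size and embedding dimension (illustrative)
embedding = nn.Embedding(num_embeddings=N, embedding_dim=M)

token_ids = torch.tensor([[12, 7, 431, 9]])   # a batch of token indices
vectors = embedding(token_ids)                # shape (1, 4, 64)

# After training, similar words end up close in Euclidean distance
dist = torch.dist(vectors[0, 0], vectors[0, 1])
```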
Distributional hypothesis
First postulated in the 1950s, stating that words in similar contexts tend to have similar meanings.
E.g. ‘oculist’ and ‘eye-doctor’ tend to occur in the same context with words like ‘eye’ and ‘examined’.
Skip-gram Method
For Word2vec
A simple fully-connected encoder-decoder network is trained to model the probability of other words appearing in the same context as a given word.
Each encoded word becomes the desired word embedding.
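A hedged sketch of the skip-gram idea: a centre word is embedded (the 'encoder') and a linear layer (the 'decoder') predicts which words appear in its context. Names and sizes are illustrative, not the reference Word2vec implementation:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embed_dim)   # rows become the word embeddings
        self.decoder = nn.Linear(embed_dim, vocab_size)      # scores every word as a context word

    def forward(self, centre_ids):
        return self.decoder(self.encoder(centre_ids))        # logits over the vocabulary

model = SkipGram(vocab_size=10_000, embed_dim=100)
loss_fn = nn.CrossEntropyLoss()
centre = torch.tensor([12])            # centre word index
context = torch.tensor([431])          # an observed context word index
loss = loss_fn(model(centre), context)
```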
Text classification using a CNN
Convolutional neural networks can be used in a text classifier by applying 1D convolutions on the 1D input sequence of embedded tokens.
Because the single dimension of the convolution often corresponds to time, this form of convolution is sometimes called “temporal convolution” to distinguish it from the familiar 2D convolutions operating on images.
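A minimal sketch, assuming PyTorch, of a 1D ('temporal') convolution over a sequence of embedded tokens; note that nn.Conv1d expects the channel (embedding) dimension before the sequence dimension:

```python
import torch
import torch.nn as nn

embed_dim, seq_len = 64, 20
x = torch.randn(8, seq_len, embed_dim)        # (batch, seq_len, embed_dim)

conv = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=3, padding=1)
features = conv(x.transpose(1, 2))            # (batch, 32, seq_len)
pooled = features.max(dim=2).values           # global max pooling over time
logits = nn.Linear(32, 2)(pooled)             # e.g. two sentiment classes
```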
k-max pooling
Given some fixed k and a numerical sequence p of length n ≥ k, select the subsequence q of the k highest values of p.
The order of the values in q corresponds to their original order in p.
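A short sketch, assuming PyTorch, of k-max pooling over a sequence; sorting the selected indices keeps the values in their original order:

```python
import torch

def k_max_pooling(p, k, dim=-1):
    # Indices of the k largest values along `dim`
    idx = p.topk(k, dim=dim).indices
    # Sort the indices so the selected values keep their original order
    idx = idx.sort(dim=dim).values
    return p.gather(dim, idx)

p = torch.tensor([1.0, 5.0, 2.0, 9.0, 3.0])
print(k_max_pooling(p, k=3))   # tensor([5., 9., 3.])
```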
Multi-head attention
A database consists of (key, value) pairs.
Given a query, we retrieve a desired value by finding the closest key to our query. In our ‘attention’ context, queries, keys and values are all vectors.
Given a query, we measure its similarity to every key vector.
We then sum all value vectors, weighted by the corresponding similarities. The idea is that the attention mechanism is focusing on those values associated with the most similar keys to the query.
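A sketch of the attention computation each head performs, assuming PyTorch and the common scaled dot-product choice of similarity; the multi-head part repeats this with separate linear projections per head:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    # Similarity of each query to every key (scaled dot products)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)       # one weight per (query, key) pair
    return weights @ V                        # weighted sum of the value vectors

Q = torch.randn(5, 16)     # 5 queries of dimension 16
K = torch.randn(7, 16)     # 7 keys
V = torch.randn(7, 16)     # 7 values, one per key
out = attention(Q, K, V)   # shape (5, 16)
```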
Self-attention
The queries, keys and values are all set equal to the input embeddings.
V = K = Q = embedded tensor
Thus each token is compared with every other token, and the degree of similarity is used to weight an average of the (embedded and linearly mapped) tokens themselves.
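A brief sketch, assuming PyTorch, showing self-attention as the case where queries, keys and values all come from the same embedded tensor (here via nn.MultiheadAttention):

```python
import torch
import torch.nn as nn

embed = torch.randn(1, 10, 64)                       # (batch, seq_len, d)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# Q = K = V = embedded tensor: every token attends to every other token
out, attn_weights = mha(embed, embed, embed)
print(out.shape, attn_weights.shape)                 # (1, 10, 64) (1, 10, 10)
```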
Positional encoding in transformers
Unlike in an RNN, there is no intrinsic ordering in the way inputs are processed within the transformer architecture.
Every input is compared with every other input, disregarding the temporal order that is implicit in the input token vector.
A temporal encoding can be added by superimposing a position signature onto the tensor of embedded tokens.
This is a tensor of the same size as the token embedding, with a unique vector of values at each token position.
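A sketch of one common position signature, the sinusoidal encoding from the original Transformer paper, superimposed on the embedded tokens (assuming PyTorch):

```python
import torch

def positional_encoding(seq_len, d):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    i = torch.arange(0, d, 2).float()                  # even dimensions
    angle = pos / (10000 ** (i / d))
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angle)    # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)    # cosine on odd dimensions
    return pe                          # a unique vector of values at each position

embedded = torch.randn(10, 64)                 # (seq_len, d) token embeddings
x = embedded + positional_encoding(10, 64)     # superimpose the position signature
```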
Output layer from the transformer
The output of the final transformer block is the same size as the input to the first transformer block (seqlen x d).
On top of this we add a linear layer operating pointwise (i.e. an affine mapping in the d direction) that expands to vocab_size.
Finally, softmax is applied pointwise over the vocab_size dimension to produce a probability distribution over the vocabulary at each position.
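A minimal sketch of this output head, assuming PyTorch: a pointwise linear expansion from d to vocab_size followed by a softmax at each position:

```python
import torch
import torch.nn as nn

seq_len, d, vocab_size = 10, 64, 10_000
block_out = torch.randn(seq_len, d)      # output of the last transformer block

to_vocab = nn.Linear(d, vocab_size)      # pointwise affine map in the d direction
logits = to_vocab(block_out)             # (seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)    # distribution over the vocabulary at each position
print(probs.sum(dim=-1))                 # each row sums to 1
```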
High-Level Overview of How a Transformer Works
Input Representation
The input sequence is first represented as embeddings.
Each word or token in the input sequence is transformed into a high-dimensional vector.
High-Level Overview of How a Transformer Works
Positional Encoding
Since Transformers don’t inherently understand the order of the input sequence, positional encodings are added to the input embeddings to give the model information about the position of each token in the sequence.
High-Level Overview of How a Transformer Works
Encoder Structure
The encoder processes the input sequence. Each layer in the encoder consists of two main sub-layers:
Multi-Head Self-Attention Mechanism
This allows the model to focus on different parts of the input sequence when encoding each token.
Feedforward Neural Network
After attention, the output passes through a feedforward neural network.
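A sketch of the position-wise feedforward sub-layer, assuming PyTorch and the common expand-then-contract shape (sizes illustrative):

```python
import torch
import torch.nn as nn

d, d_ff = 64, 256                      # model and hidden dimensions (illustrative)
ffn = nn.Sequential(
    nn.Linear(d, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d),
)

x = torch.randn(10, d)                 # one vector per token position
out = ffn(x)                           # applied independently at each position
```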
High-Level Overview of How a Transformer Works
Decoder Structure
The decoder generates the output sequence. Each layer in the decoder has 3 main sub-layers.
Multi-Head Self-Attention Mechanism (masked)
Similar to the encoder, but a mask is applied to prevent attending to future tokens (see the masking sketch after this card).
Multi-Head Attention Mechanism (encoder-decoder)
This layer attends to the encoder’s output and allows the model to focus on different parts of the input sequence when generating each token.
Feedforward Neural Network
After attention, the output passes through a feedforward neural network.
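A brief sketch of the causal mask idea, assuming PyTorch: scores for positions after the current token are set to -inf before the softmax, so future tokens receive zero attention weight:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)    # raw attention scores

# Upper-triangular mask blocks attention to future tokens
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))

weights = torch.softmax(scores, dim=-1)   # zero weight on future positions
```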
High-Level Overview of How a Transformer Works
Attention Mechanism
The attention mechanism is a crucial component of the Transformer. It allows the model to weigh the importance of different parts of the input sequence when processing each token.
The attention mechanism calculates attention scores, and a weighted sum of the input sequence is used to compute the output.
High-Level Overview of How a Transformer Works
Multi-Head Attention
Instead of having a single attention mechanism, the Transformer uses multiple attention heads in parallel.
Each head learns different relationships in the data, and their outputs are concatenated and linearly transformed.
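A sketch of the concatenate-and-project step, assuming PyTorch and illustrative sizes (8 heads of dimension 8 concatenated back to d = 64):

```python
import torch
import torch.nn as nn

num_heads, d_head, seq_len = 8, 8, 10
d = num_heads * d_head

# One output per head, each computed by its own attention mechanism
head_outputs = [torch.randn(seq_len, d_head) for _ in range(num_heads)]

concat = torch.cat(head_outputs, dim=-1)   # (seq_len, d)
W_o = nn.Linear(d, d)                      # final linear transformation
out = W_o(concat)
```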
High-Level Overview of How a Transformer Works
Layer Normalization and Residual Connections
After each sub-layer (like self-attention or feedforward network), layer normalization and residual connections are applied. These help with training stability.
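A sketch of the residual-plus-normalization pattern around a sub-layer, assuming PyTorch (post-norm ordering shown; some variants normalize before the sub-layer):

```python
import torch
import torch.nn as nn

d = 64
x = torch.randn(10, d)         # input to a sub-layer
sublayer = nn.Linear(d, d)     # stands in for self-attention or the feedforward network
norm = nn.LayerNorm(d)

out = norm(x + sublayer(x))    # residual connection, then layer normalization
```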
High-Level Overview of How a Transformer Works
Output Layer
The final layer of the decoder produces the probability distribution over the output vocabulary.
LSTM
A type of recurrent neural network architecture designed to address the vanishing gradient problem in traditional RNNs.
The LSTM introduces a more complex structure than a simple RNN, incorporating memory cells, input gates, forget gates, and output gates.
They are designed to selectively store and retrieve information from the memory cell, making them effective for learning long-term dependencies in sequential data.
LSTM
Memory Cell (Cₜ)
The memory cell is a crucial component that allows LSTM to store information over long periods.
It acts as a conveyor belt, passing information across time steps.
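A brief sketch, assuming PyTorch, of stepping an LSTM cell through time: the memory cell c is passed from one time step to the next like a conveyor belt, alongside the hidden state h:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)            # hidden state
c = torch.zeros(1, 16)            # memory cell, carried across time steps

for t in range(10):               # unroll over a sequence of length 10
    x_t = torch.randn(1, 8)       # input at time step t
    h, c = cell(x_t, (h, c))      # gates decide what is stored in / read from c
```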