Unit 4: Sequential Data Flashcards
Image captioning
A task to take a single image as input and produce a textual description as output.
Sentiment Analysis
A task to take a textual description as input and generate a sentiment value as output.
Machine Translation
Translation from one language to another (e.g. English to French)
Hadamard product
The element-wise product
Symbol: ⊙
(0, 0.2, 1) ⊙ (3, 4, 2) = (0, 0.8, 2)
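A minimal sketch in Python (NumPy is an assumed library choice here) showing that the Hadamard product is just element-wise multiplication:

```python
import numpy as np

a = np.array([0.0, 0.2, 1.0])
b = np.array([3.0, 4.0, 2.0])

# Element-wise (Hadamard) product: each entry is a[i] * b[i]
print(a * b)  # [0.  0.8 2. ]
```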
Gated Recurrent Unit
Does away with the cell state of an LSTM, and simply uses the hidden state hₜ
for the persistent memory.
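A brief sketch, assuming PyTorch, showing that a GRU carries only a hidden state (no separate cell state), unlike an LSTM:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)          # (batch, seq_len, features)
out, h_n = gru(x)                  # only a hidden state is returned
print(out.shape, h_n.shape)        # torch.Size([4, 10, 16]) torch.Size([1, 4, 16])

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)          # the LSTM additionally returns a cell state c_n
```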
Word embedding
A useful technique for use with word tokens. It maps each word into an M-dimensional space where similar words are close by in terms of Euclidean distance.
The value of M is generally much less than the number of tokens in the vocabulary N (M ≪ N).
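A small sketch, assuming PyTorch, of an embedding layer mapping token indices from a vocabulary of size N into an M-dimensional space (M ≪ N); the sizes are illustrative:

```python
import torch
import torch.nn as nn

N, M = 10_000, 64                             # vocabulary size and embedding dimension (illustrative)
embedding = nn.Embedding(num_embeddings=N, embedding_dim=M)

token_ids = torch.tensor([[12, 7, 431, 9]])   # a batch of token indices
vectors = embedding(token_ids)                # shape (1, 4, 64)

# After training, similar words end up close in Euclidean distance
dist = torch.dist(vectors[0, 0], vectors[0, 1])
```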
Distributional hypothesis
First postulated in the 1950s, stating that words in similar contexts tend to have similar meanings.
E.g. ‘oculist’ and ‘eye-doctor’ tend to occur in the same context with words like ‘eye’ and ‘examined’.
Skip-gram Method
For Word2vec
A simple fully-connected encoder-decoder network is trained to model the probability of other words appearing in the same context as a given word.
Each encoded word becomes the desired word embedding.
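A hedged sketch of the skip-gram idea: a centre word is embedded (the 'encoder') and a linear layer (the 'decoder') predicts which words appear in its context. Names and sizes are illustrative, not the reference Word2vec implementation:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embed_dim)   # rows become the word embeddings
        self.decoder = nn.Linear(embed_dim, vocab_size)      # scores every word as a context word

    def forward(self, centre_ids):
        return self.decoder(self.encoder(centre_ids))        # logits over the vocabulary

model = SkipGram(vocab_size=10_000, embed_dim=100)
loss_fn = nn.CrossEntropyLoss()
centre = torch.tensor([12])            # centre word index
context = torch.tensor([431])          # an observed context word index
loss = loss_fn(model(centre), context)
```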
Text classification using a CNN
Convolutional neural networks can be used in a text classifier by applying 1D convolutions on the 1D input sequence of embedded tokens.
Because the single dimension of the convolution often corresponds to time, this form of convolution is sometimes called “temporal convolution” to distinguish it from the familiar 2D convolutions operating on images.
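A minimal sketch, assuming PyTorch, of a 1D ('temporal') convolution over a sequence of embedded tokens; note that nn.Conv1d expects the channel (embedding) dimension before the sequence dimension:

```python
import torch
import torch.nn as nn

embed_dim, seq_len = 64, 20
x = torch.randn(8, seq_len, embed_dim)        # (batch, seq_len, embed_dim)

conv = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=3, padding=1)
features = conv(x.transpose(1, 2))            # (batch, 32, seq_len)
pooled = features.max(dim=2).values           # global max pooling over time
logits = nn.Linear(32, 2)(pooled)             # e.g. two sentiment classes
```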
k-max pooling
Given some fixed k and a numerical sequence p of length n ≥ k, select the subsequence q of the k highest values of p.
The order of the values in q corresponds to their original order in p.
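A short sketch, assuming PyTorch, of k-max pooling over a sequence; sorting the selected indices keeps the values in their original order:

```python
import torch

def k_max_pooling(p, k, dim=-1):
    # Indices of the k largest values along `dim`
    idx = p.topk(k, dim=dim).indices
    # Sort the indices so the selected values keep their original order
    idx = idx.sort(dim=dim).values
    return p.gather(dim, idx)

p = torch.tensor([1.0, 5.0, 2.0, 9.0, 3.0])
print(k_max_pooling(p, k=3))   # tensor([5., 9., 3.])
```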
Multi-head attention
A database consists of (key, value) pairs.
Given a query, we retrieve a desired value by finding the closest key to our query. In our ‘attention’ context, queries, keys and values are all vectors.
Given a query, we measure its similarity to every key vector.
We then sum all value vectors, weighted by the corresponding similarities. The idea is that the attention mechanism is focusing on those values associated with the most similar keys to the query.
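A sketch of the attention computation each head performs, assuming PyTorch and the common scaled dot-product choice of similarity; the multi-head part repeats this with separate linear projections per head:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    # Similarity of each query to every key (scaled dot products)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)       # one weight per (query, key) pair
    return weights @ V                        # weighted sum of the value vectors

Q = torch.randn(5, 16)     # 5 queries of dimension 16
K = torch.randn(7, 16)     # 7 keys
V = torch.randn(7, 16)     # 7 values, one per key
out = attention(Q, K, V)   # shape (5, 16)
```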
Self-attention
The queries, keys and values are all set equal to the input embeddings.
V = K = Q = embedded tensor
Thus each token is compared with every other token, and the degree of similarity is used to weight an average of the (embedded and linearly mapped) tokens themselves.
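A brief sketch, assuming PyTorch, showing self-attention as the case where queries, keys and values all come from the same embedded tensor (here via nn.MultiheadAttention):

```python
import torch
import torch.nn as nn

embed = torch.randn(1, 10, 64)                       # (batch, seq_len, d)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# Q = K = V = embedded tensor: every token attends to every other token
out, attn_weights = mha(embed, embed, embed)
print(out.shape, attn_weights.shape)                 # (1, 10, 64) (1, 10, 10)
```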
Positional encoding in transformers
Unlike in an RNN, there is no intrinsic ordering in the way inputs are processed within the transformer architecture.
Every input is compared with every other input, disregarding the temporal order that is implicit in the input token vector.
A temporal encoding can be added by superimposing a position signature onto the tensor of embedded tokens.
This is a tensor of the same size as the token embedding, with a unique vector of values at each token position.
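A sketch of one common position signature, the sinusoidal encoding from the original Transformer paper, superimposed on the embedded tokens (assuming PyTorch):

```python
import torch

def positional_encoding(seq_len, d):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    i = torch.arange(0, d, 2).float()                  # even dimensions
    angle = pos / (10000 ** (i / d))
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angle)    # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)    # cosine on odd dimensions
    return pe                          # a unique vector of values at each position

embedded = torch.randn(10, 64)                 # (seq_len, d) token embeddings
x = embedded + positional_encoding(10, 64)     # superimpose the position signature
```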
Output layer from the transformer
The output of the final transformer block is the same size as the input to the first transformer block (seqlen x d).
On top of this we add a linear layer operating pointwise (i.e. an affine mapping in the d direction) that expands to vocab_size.
Finally, softmax is applied pointwise over the vocab_size dimension to produce a probability distribution over the vocabulary at each position.
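A minimal sketch of this output head, assuming PyTorch: a pointwise linear expansion from d to vocab_size followed by a softmax at each position:

```python
import torch
import torch.nn as nn

seq_len, d, vocab_size = 10, 64, 10_000
block_out = torch.randn(seq_len, d)      # output of the last transformer block

to_vocab = nn.Linear(d, vocab_size)      # pointwise affine map in the d direction
logits = to_vocab(block_out)             # (seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)    # distribution over the vocabulary at each position
print(probs.sum(dim=-1))                 # each row sums to 1
```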
High-Level Overview of How a Transformer Works
Input Representation
The input sequence is first represented as embeddings.
Each word or token in the input sequence is transformed into a high-dimensional vector.
High-Level Overview of How a Transformer Works
Positional Encoding
Since Transformers don’t inherently understand the order of the input sequence, positional encodings are added to the input embeddings to give the model information about the position of each token in the sequence.
High-Level Overview of How a Transformer Works
Encoder Structure
The encoder processes the input sequence. Each layer in the encoder consists of two main sub-layers:
Multi-Head Self-Attention Mechanism
This allows the model to focus on different parts of the input sequence when encoding each token.
Feedforward Neural Network
After attention, the output passes through a feedforward neural network.
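A sketch of the position-wise feedforward sub-layer, assuming PyTorch and the common expand-then-contract shape (sizes illustrative):

```python
import torch
import torch.nn as nn

d, d_ff = 64, 256                      # model and hidden dimensions (illustrative)
ffn = nn.Sequential(
    nn.Linear(d, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d),
)

x = torch.randn(10, d)                 # one vector per token position
out = ffn(x)                           # applied independently at each position
```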
High-Level Overview of How a Transformer Works
Decoder Structure
The decoder generates the output sequence. Each layer in the decoder has 3 main sub-layers.
Multi-Head Self-Attention Mechanism (masked)
Similar to the encoder, but a mask is applied to prevent attending to future tokens (see the masking sketch after this card).
Multi-Head Attention Mechanism (encoder-decoder)
This layer attends to the encoder’s output and allows the model to focus on different parts of the input sequence when generating each token.
Feedforward Neural Network
After attention, the output passes through a feedforward neural network.
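A brief sketch of the causal mask idea, assuming PyTorch: scores for positions after the current token are set to -inf before the softmax, so future tokens receive zero attention weight:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)    # raw attention scores

# Upper-triangular mask blocks attention to future tokens
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))

weights = torch.softmax(scores, dim=-1)   # zero weight on future positions
```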
High-Level Overview of How a Transformer Works
Attention Mechanism
The attention mechanism is a crucial component of the Transformer. It allows the model to weigh the importance of different parts of the input sequence when processing each token.
The attention mechanism calculates attention scores, and a weighted sum of the input sequence is used to compute the output.
High-Level Overview of How a Transformer Works
Multi-Head Attention
Instead of having a single attention mechanism, the Transformer uses multiple attention heads in parallel.
Each head learns different relationships in the data, and their outputs are concatenated and linearly transformed.
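A sketch of the concatenate-and-project step, assuming PyTorch and illustrative sizes (8 heads of dimension 8 concatenated back to d = 64):

```python
import torch
import torch.nn as nn

num_heads, d_head, seq_len = 8, 8, 10
d = num_heads * d_head

# One output per head, each computed by its own attention mechanism
head_outputs = [torch.randn(seq_len, d_head) for _ in range(num_heads)]

concat = torch.cat(head_outputs, dim=-1)   # (seq_len, d)
W_o = nn.Linear(d, d)                      # final linear transformation
out = W_o(concat)
```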
High-Level Overview of How a Transformer Works
Layer Normalization and Residual Connections
After each sub-layer (like self-attention or feedforward network), layer normalization and residual connections are applied. These help with training stability.
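A sketch of the residual-plus-normalization pattern around a sub-layer, assuming PyTorch (post-norm ordering shown; some variants normalize before the sub-layer):

```python
import torch
import torch.nn as nn

d = 64
x = torch.randn(10, d)         # input to a sub-layer
sublayer = nn.Linear(d, d)     # stands in for self-attention or the feedforward network
norm = nn.LayerNorm(d)

out = norm(x + sublayer(x))    # residual connection, then layer normalization
```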
High-Level Overview of How a Transformer Works
Output Layer
The final layer of the decoder produces the probability distribution over the output vocabulary.
LSTM
A type of recurrent neural network architecture designed to address the vanishing gradient problem in traditional RNNs.
The LSTM introduces a more complex structure than a simple RNN, incorporating memory cells, input gates, forget gates, and output gates.
They are designed to selectively store and retrieve information from the memory cell, making them effective for learning long-term dependencies in sequential data.
LSTM
Memory Cell (Cₜ)
The memory cell is a crucial component that allows LSTM to store information over long periods.
It acts as a conveyor belt, passing information across time steps.
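A brief sketch, assuming PyTorch, of stepping an LSTM cell through time: the memory cell c is passed from one time step to the next like a conveyor belt, alongside the hidden state h:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)            # hidden state
c = torch.zeros(1, 16)            # memory cell, carried across time steps

for t in range(10):               # unroll over a sequence of length 10
    x_t = torch.randn(1, 8)       # input at time step t
    h, c = cell(x_t, (h, c))      # gates decide what is stored in / read from c
```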