Quiz 4 Flashcards
RNN forward update rule
- At each time step, the input and the previous hidden state are fed through a linear layer.
- Followed by a non-linearity (tanh)
- Passed through the output linear layer to get logits
- Softmax to get probabilities
- Compute CE loss
Formally:
U, V, W = weights for input-to-hidden, hidden-to-output, hidden-to-hidden
a_t = Ux_t + Wh_{t-1} + b
h_t = tanh(a_t)
o_t = Vh_t + c
y_hat_t = softmax(o_t)
L_t = CE(y_hat_t, y_t)
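A minimal NumPy sketch of one forward step following the update rule above (the dimensions, random weights, and target index are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes (assumptions): input dim 5, hidden dim 8, vocab size 10
rng = np.random.default_rng(0)
U = rng.normal(size=(8, 5))       # input-to-hidden
W = rng.normal(size=(8, 8))       # hidden-to-hidden
V = rng.normal(size=(10, 8))      # hidden-to-output
b, c = np.zeros(8), np.zeros(10)

def rnn_step(x_t, h_prev):
    a_t = U @ x_t + W @ h_prev + b    # linear combination of input and previous hidden
    h_t = np.tanh(a_t)                # non-linearity
    o_t = V @ h_t + c                 # output logits
    y_hat_t = softmax(o_t)            # probabilities
    return h_t, y_hat_t

h = np.zeros(8)
h, y_hat = rnn_step(rng.normal(size=5), h)
loss_t = -np.log(y_hat[3])            # CE loss if the true class index is 3
```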
RNN: How is loss calculated over entire sequence
The loss at each time step is summed over the sequence: L = sum_t L_t
RNN: True or False. hidden weights are shared across the sequence.
True
RNN: Advantage of sharing parameters across sequence
Sharing allows generalization to sequence lengths that did not appear in the training set.
RNN: RNN architecture must always pass the hidden state to next sequence
False. The Goodfellow book shows examples where the output from t-1 is passed to the hidden layer at t.
RNN: RNN architecture that passes only the output to next time step is likely to be less powerful.
True. If the hidden state is not passed, the output alone lacks important information about the past.
Vanishing gradient
Gradients diminish as they are backpropagated through time, leading to little or no learning.
Exploding gradient
Gradients grow as they are backpropagated through time, leading to unstable learning.
RNN cons
- Vanishing gradients
- Exploding gradients
- Limited “memory” - can handle short-term dependencies but not long-term ones
- Lack of control over memory - unlike LSTM, there is no mechanism to control what information should be kept.
RNN: Advantage over LSTM
Fewer parameters (smaller model size)
LSTM update rule - components and flow
Sequence (FICO):
1. Forget gate
2. Input gate
3. Cell gate
4. Output gate
LSTM: What state is passed throughout the layer
Cell state
LSTM: What does forget gate do
Takes the input and hidden state and decides what information from the cell state should be thrown away or kept.
LSTM: What does input gate do
Decides what new information should be added to the cell state.
LSTM: What does cell gate do
Combines information from forget gate, input gate, and candidate cell state to output new cell state at time step t
LSTM: What does output gate do
Decides the next hidden state based on the updated cell state.
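A minimal NumPy sketch of a single LSTM step tying the four gates above together (dimensions and random weights are illustrative assumptions; real implementations fuse the gate projections):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 5, 8                                  # assumed input and hidden sizes
Wf, Wi, Wg, Wo = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))
bf, bi, bg, bo = (np.zeros(d_h) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(Wf @ z + bf)        # forget gate: what to keep from the old cell state
    i_t = sigmoid(Wi @ z + bi)        # input gate: how much new information to write
    g_t = np.tanh(Wg @ z + bg)        # candidate cell state
    c_t = f_t * c_prev + i_t * g_t    # cell update: combine forget, input, candidate
    o_t = sigmoid(Wo @ z + bo)        # output gate: what to expose as the hidden state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)
```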
LSTM: Pros over RNN
Controls the flow of the gradient so that it neither vanishes nor explodes.
RNN: What is a recursive neural network (RecNNs, not RNN!), and what are its advantages
- Can handle hierarchical structure
- Reduce vanishing gradients because the tree structure nests shorter sub-networks (shorter paths between distant inputs).
GRU: Main difference to LSTM
A single update gate controls both forgetting and the state update (and there is no separate cell state).
Gradient clipping: main use
Control exploding gradients
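A small PyTorch sketch of norm-based gradient clipping between backward() and the optimizer step (the model, the dummy loss, and max_norm=1.0 are placeholder assumptions):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=5, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 7, 5)                 # (batch, seq_len, input_size)
out, h = model(x)
loss = out.pow(2).mean()                 # dummy loss just to produce gradients
loss.backward()
# Rescale gradients so their global norm does not exceed 1.0, then step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```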
Why does MLP not work for modeling sequences
- Can’t support variable-sized inputs or outputs
- No inherent temporal structure
- Cannot maintain a “state” of the sequence
LM: Define Language Models
Models that estimate the probability of a sequence, which lets us compare sequences (e.g., decide which is more likely).
LM: How are probabilities of an input sequence of words calculated?
Chain rule of probability.
Probability of sentence = Product of conditional probabilities over i, which indexes our words.
p(s) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_1, …, w_{n-1})
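A tiny sketch of the chain-rule decomposition with made-up conditional probabilities (the numbers are illustrative, not from a real LM); summing log-probabilities avoids numerical underflow:

```python
import math

# p(s) = p(w_1) * p(w_2 | w_1) * ... for a 4-word sentence
conditionals = [0.2, 0.5, 0.1, 0.4]                 # p(w_i | w_1, ..., w_{i-1})
p_s = math.prod(conditionals)                        # 0.004
log_p_s = sum(math.log(p) for p in conditionals)     # log-space version
print(p_s, math.exp(log_p_s))                        # same value either way
```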
LM: 3 applications of language modeling
- predictive typing
- ASR
- grammar correction
LM: How is “conditional” language modeling different
Adds an extra context “c” to the chain rule of probability:
p(s | c) = product over i of p(w_i | c, w_1, …, w_{i-1})
Conditional LM: What is context and sequence for
Topic-aware language model
C = topic
S = text
Conditional LM: What is context and sequence for
Text summarization
C = long document
S = summary
Conditional LM: What is context and sequence for
Machine Translation
C = French text
S = English text
Conditional LM: What is context and sequence for
Image captioning
C = image
S = caption
Conditional LM: What is context and sequence for
OCR
C = image of a line
S = its content
Conditional LM: What is context and sequence for
Speech Recognition
C = recording
S = its transcription
Teacher forcing
During training, the model is fed the true target tokens as inputs at each step, rather than its own generated output.
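A short PyTorch sketch of teacher forcing with an RNN language model: the inputs at each step are the true previous tokens, and the targets are the same sequence shifted by one (vocabulary size, dimensions, and the token sequence are assumptions):

```python
import torch
import torch.nn as nn

vocab, d_emb, d_h = 10, 8, 16
emb = nn.Embedding(vocab, d_emb)
rnn = nn.RNN(d_emb, d_h, batch_first=True)
head = nn.Linear(d_h, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.tensor([[1, 4, 2, 7, 3]])          # one gold sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next token at each position
h0 = torch.zeros(1, 1, d_h)
out, _ = rnn(emb(inputs), h0)                     # inputs are the *true* previous tokens
logits = head(out)                                # (1, seq_len-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
```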
Knowledge distillation
Smaller model (student) learns to mimic the predictions of a larger, more complex model (teacher) by transferring its knowledge
T or F: Hard labels are passed from teacher model to student
False. Soft labels give more signal than hard labels.
Knowledge distillation loss components: Teacher-student loss
Loss (e.g., CE) between the teacher's and the student's prediction scores;
the teacher's labels can be hard or soft.
Knowledge distillation loss components: Student loss
CE loss between student prediction and ground truth
Knowledge distillation loss components: Combined loss
weight_1 * teacher-student loss + weight_2 * student loss (a weighted sum of the two)
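A sketch of the combined loss. The weight alpha and the softmax temperature T are assumptions, and the teacher-student term is written here as a KL divergence on softened distributions, a common formulation of the soft-label loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    # Teacher-student loss on softened (soft-label) distributions
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Student loss against ground-truth hard labels
    ce = F.cross_entropy(student_logits, targets)
    # Weighted combination of the two terms
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, targets)
```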
Define cross-entropy loss
Expected number of bits required to represent an event from the reference distribution (P*) when using a coding scheme optimal for P.
Define per-word cross-entropy
Cross-entropy averaged over all the words in a sequence.
What is the reference distribution (P*) in per-word cross entropy?
Empirical distribution of the words in the sequence
Define perplexity
Geometric mean of the inverse probability of a sequence of words.
What is the perplexity of choosing 1 outcome of a fair 10-sided die?
10
(the perplexity of a discrete uniform distribution over k outcomes is k)
Define perplexity using law of logarithms
Perplexity is the exponential of the per-word cross-entropy (equivalently, the log of the perplexity equals the per-word cross-entropy).
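A tiny sketch showing that exponentiating the per-word cross-entropy gives the same number as the geometric mean of the inverse probabilities (the per-word probabilities are made up):

```python
import math

probs = [0.25, 0.1, 0.5, 0.2]                                  # model's per-word probabilities
per_word_ce = -sum(math.log(p) for p in probs) / len(probs)    # nats per word
perplexity = math.exp(per_word_ce)
# Geometric mean of the inverse probabilities gives the same value
geo_mean_inverse = math.prod(1 / p for p in probs) ** (1 / len(probs))
print(perplexity, geo_mean_inverse)
```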
Define pretraining task
A task that is not the final task itself, but one that helps us obtain better initial parameters for modeling the final task.
Masked language models: Key idea
Model learns to predict masked tokens of a sequence.
Embeddings: Distributional semantics
A word’s meaning is given by the words that frequently appear close-by
Embeddings: Key idea of “A Neural Probabilistic Language Model” Bengio, 2003
Map each word in the vocabulary to a feature vector and use those vectors to predict the next word.
Embeddings: Key idea of “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning” Collobert & Weston, 2008 & “Natural Language Processing (Almost from Scratch)” Collobert et al., 2011
Use a CNN to create word embeddings.
Nearest neighbors (KNN) of the vectors show similar syntax + semantics.
“Efficient Estimation of Word Representations in Vector Space” Mikolov et al., 2013
Word2Vec, continuous bag of words (CBOW), skip-gram
Collobert & Weston vectors. Key idea:
A word and its context form a positive training sample; a random word in that same context gives a negative training sample.
positive: “cat chills on a mat”
negative: “cat chills Ohio a mat”
Skip gram objective function
(average) negative log-likelihood
Skip gram parameters
word vectors
Skip gram probability P(w_{t+j} | w_t; theta) - how is it defined?
Softmax over the inner products of the center word vector and the context word vectors; it measures how likely the context word is to appear with the center word.
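A NumPy sketch of the skip-gram softmax, assuming separate center (input) and context (output) vector tables and illustrative sizes; the full-vocabulary normalization is the expensive part mentioned in the next card:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                        # vocab size and embedding dim (assumptions)
W_in = rng.normal(size=(V, d))         # center-word vectors
W_out = rng.normal(size=(V, d))        # context-word vectors

def p_context_given_center(center_id, context_id):
    scores = W_out @ W_in[center_id]               # one inner product per vocabulary word
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # normalizing over all V words is expensive
    return probs[context_id]

p = p_context_given_center(center_id=42, context_id=7)
```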
Skip gram probability - what makes computation expensive
The softmax normalization sums over the entire vocabulary.
Skip gram: Ways to reduce computation
- Hierarchical softmax
- Negative sampling
GloVe: Key idea
Embeddings are trained on global word co-occurrence statistics from a corpus.
fastText: Key idea
Handles OOV words (via subword information) and is multilingual.
Intrinsic evaluation of word embeddings
Evaluated on a sub-task
Extrinsic evaluation of word embeddings
Evaluated on a downstream task
Intrinsic evaluation of word embeddings - example
word analogy task (man:woman, king:?)
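A sketch of the analogy test by vector arithmetic; the embedding dictionary here is a random stand-in, so only with trained vectors would the nearest neighbor ideally be "queen":

```python
import numpy as np

# Placeholder embeddings keyed by word (random; real evaluations use trained vectors)
emb = {w: np.random.default_rng(i).normal(size=50)
       for i, w in enumerate(["man", "woman", "king", "queen", "apple"])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]       # man:woman :: king:?
candidates = {w: cosine(query, v) for w, v in emb.items()
              if w not in ("king", "man", "woman")}
answer = max(candidates, key=candidates.get)          # ideally "queen" with trained vectors
```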
Graph embedding definition
Learn features such that connected nodes are more similar than unconnected nodes
Hyperbolic embeddings - what is it good for?
Better at modeling hierarchical structures
t-SNE key idea
Measures pairwise similarities in the high-dimensional space and performs SGD to minimize the divergence between the high-dimensional and low-dimensional similarity distributions.
Encoders can handle language modeling, T/F
False. Encoders see bidirectional context, so they cannot be used for (left-to-right) language modeling.
Self-attention: Sizes of Query, Key, and Value matrices, given:
1. Hidden dimension: d_model
2. Q,K,V dimension: d_q, d_k, d_v
Query: d_model * d_k
Key: d_model * d_k
Value: d_model * d_v
Self-attention: Dimension of self-attention output (O), given:
1. Hidden dimension: d_model
2. Q,K,V dimension: d_q, d_k, d_v
3. Number of heads h
Output projection W^O: in_features = h * d_v, out_features = d_model
Define “cross-attention”
Combines two different input sequences:
1. Sequence returned by the encoder
2. Sequence processed by the decoder.
Attention: big O for each layer, given:
1. seq length “n”
2. representation dimension “d”
O(n^2 * d)
Attention: big O for sequential operations
O(1)
Attention: big O for maximum path length
O(1)
Why is self-attention layer’s sequential operation O(1) compared to RNN’s O(n)?
A self-attention layer connects all positions in parallel (a constant number of sequential operations), while an RNN must step through positions one at a time.
Self-attention layers are faster than recurrent layers when the _____ is smaller than the _________
sequence length,
representation dimensionality
Multi-head attention:
How to get size of d_k and d_v given:
1. dimension of hidden: d_model
2. number of heads (h)
d_model / h
Multi-head attention:
Given d_model (512) and number of heads (8), what is d_k?
64 (d_model / h)
Attention:
What is the formula for scaled dot product?
Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V (where d_k = d_q, the query/key dimension)
Scaled dot product attention:
As Query vector increases in dimension, magnitudes of dot product similarities increase. How do we mitigate this?
Scale by dividing the dot products by the square root of the query/key vector dimension (sqrt(d_k)).
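A NumPy sketch of single-head scaled dot-product attention with the sqrt(d_k) scaling (sequence length and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                  # sequence length, key/query dim, value dim
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

scores = Q @ K.T / np.sqrt(d_k)        # scaling keeps dot products from growing with d_k
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                   # (n, d_v): each position is a weighted sum of values
```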
Cross-attention: Where do K, V values come from
Encoder output
Cross-attention: Where does the Q value come from
Masked multi-head attention from decoder
Self Attention:
Q, K, V matrix shapes
(i.e., A * B, what are A and B)
K = D_X * D_K
Q = D_X * D_Q
V = D_X * D_V
Purpose of Key vectors
Compare inputs to Queries
Purpose of Value vectors
Carry the information that is returned (weighted by the Q-K comparison) to the next layer.
3 types of attention in Transformer
- Cross-attention - Encoder K and V plus Q from decoder
- Self-attention (Encoder) - K, V, Q from output of word + pos embedding
- Self-attention (Decoder) - K, V, Q from (masked) output embedding t-1
Difference of Encoder-Decoder Attention vs self-attention
In E/D attention, Queries come externally (e.g., from the decoder). In self-attention, Queries come from the inputs themselves.
What happens if you permute the order of the inputs in an self-attention layer?
Permutation equivariant: the outputs contain the same values, just permuted in the same order as the inputs.
Since self-attention is permutation equivariant, what must be added beforehand to propagate the order of the input sequence?
Add position embedding to word embedding.
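A small PyTorch sketch of adding a learned position embedding to the word embedding before the self-attention stack (sizes and the token sequence are assumptions):

```python
import torch
import torch.nn as nn

vocab, max_len, d_model = 100, 16, 32       # illustrative sizes
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)

tokens = torch.tensor([[5, 9, 2, 7]])                    # (batch=1, seq_len=4)
positions = torch.arange(tokens.size(1)).unsqueeze(0)    # [[0, 1, 2, 3]]
x = tok_emb(tokens) + pos_emb(positions)                 # input to the attention stack
```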