Quiz 4 - Module 3 Flashcards
LSTM output gate (ot)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- modulates the value of the hidden state
- decides how much of the cell state we want to surface
RNN Language Model: Inference
- Start with first word, in practice use a special symbol to indicate new sentence
- Feed the words in the history until we run out of history
- Take hidden state h, transform
- project h into a high-dimensional space (same dimension as the vocabulary size)
- normalize transformed h
- use softmax
- result: the model's probability distribution over the next word
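A minimal numpy sketch of the projection + softmax steps above; the sizes, W_out, and h are hypothetical placeholders (random values standing in for a trained model):

    import numpy as np

    V, H = 10000, 512                    # hypothetical vocabulary and hidden-state sizes
    h = np.random.randn(H)               # hidden state after feeding in the history
    W_out = np.random.randn(V, H)        # output projection (stands in for trained weights)

    logits = W_out @ h                   # project h into a vocabulary-sized space
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax: probability distribution over the next word
    next_word_id = int(np.argmax(probs)) # or sample from probs during generation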
Why are graph embeddings useful?
- task-agnostic entity representations
- features are useful on downstream tasks without much data
- nearest neighbors are semantically meaningful
Contextualized Word Embedding Algorithms
ELMo, BERT
The most standard form of attention in current neural networks is implemented with the ____
Softmax
Many to many Sequence Modeling examples
- speech recognition
- optical character recognition
Token-level tasks
- ex: named entity recognition
- input a sentence without any masked tokens + positions, go through transformer encoder architecture, output classifications of entities (persons, locations, dates)
Steps of Beam Search Algorithm
- Search exponential space in linear time
- Beam size k determines width of search
- At each step, extend each of k elements by one token
- The top k overall then become the hypotheses for the next step (see the sketch below)
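A minimal Python sketch of these steps, assuming a hypothetical next_token_logprobs(prefix) function that returns (token, log-probability) pairs for possible one-token extensions of a prefix:

    import heapq

    def beam_search(next_token_logprobs, bos_token, num_steps, k=4):
        # each hypothesis is (cumulative log-probability, token sequence)
        beams = [(0.0, [bos_token])]
        for _ in range(num_steps):
            candidates = []
            for score, seq in beams:
                # extend each of the k hypotheses by one token
                for tok, logp in next_token_logprobs(seq):
                    candidates.append((score + logp, seq + [tok]))
            # keep only the top k overall as the hypotheses for the next step
            beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        return beams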

Self-Attention improves on the multi-layer softmax attention method by ___
“Multi-query hidden-state propagation”
Having a controller state for every single input.
The size of the controller state grows with the input
Data Scarcity Issues
- Language Similarity missing
- the language is very different from the source (i.e. not similar to English the way Spanish/French are)
- Domain incorrect
- i.e. medical terms, not social language
- Evaluation
- no access to real test set
Many to One Sequence Modeling examples
- Sentiment Analysis
- Topic Classification
Attention
A weighting or probability distribution over inputs that depends on the computational state and the inputs themselves
Differentiably Selecting a Vector from a set
- Given vectors {u1, …, un} and query vector q
- The most similar vector to q can be found via softmax(Uq)
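A small numpy sketch of this soft, differentiable selection (random U and q as stand-ins):

    import numpy as np

    n, d = 5, 8
    U = np.random.randn(n, d)            # rows are the vectors u1..un
    q = np.random.randn(d)               # query vector

    scores = U @ q                       # inner product of each u_j with q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax(Uq): a soft "selection" over the set
    most_similar = int(np.argmax(weights))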
Alignment in machine translation
For each word in the target, get a distribution over words in the source
Graph embeddings are a form of ____ learning on graphs
unsupervised learning
What makes Non-Local Neural Networks differ from fully connected Neural Networks?
The output is a weighted summation dynamically computed based on the data; in a fully connected layer, the weights are not dynamic (they are learned and then applied regardless of the input).
The similarity function in a non-local neural network is data dependent. This lets the network learn the connectivity pattern, decide for each piece of data what is important, and sum up the contributions across those pieces of data to form the output.
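For reference, the non-local operation is commonly written as (notation assumed, not taken from the cards):

    y_i = (1 / C(x)) * sum over j of f(x_i, x_j) * g(x_j)

where f is the data-dependent similarity between positions i and j, g is a learned transformation of the input at position j, and C(x) is a normalization term.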
Distribution over inputs that depends on computational state and the inputs themselves
Attention
Roll a fair die and guess. Perplexity?
6
T/F: Softmax is useful for random selection
True
Recurrent Neural Networks are typically designed for ____ data
sequential
Sequence Transduction
Sequence to Sequence (Many to Many Sequence Modeling)
what information is important for graph representations?
- state
- compactly representing all the data we have processed thus far
- neighborhood
- what other elements to incorporate?
- selecting from a set of elements with similarity or attention
- propagation of info
- how to update info given selected elements
What dominates computation cost in machine translation
Inference
- Expensive
- step-by-step computation (auto-regressive: predict a different token at each step)
- output projection (vocabulary size * output length * beam size)
- deeper models
- Strategies
- smaller vocabs
- more efficient computation
- reduce depth/increase parallelism
What allows information to propagate directly between distant computational nodes while making minimal structural assumptions?
The attention algorithm
Current (Standard) Approach to (Soft) Attention
- Take a set of vectors u1,…un
- Inner product each of the vectors with controller q
- unordered set
- Take the softmax of the set of numbers to get weights aj
- The output is the weighted sum of the inputs: sum over j of aj * uj
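A numpy sketch of these steps (random u vectors and controller q as placeholders):

    import numpy as np

    n, d = 6, 16
    U = np.random.randn(n, d)            # the unordered set of vectors u1..un
    q = np.random.randn(d)               # controller

    scores = U @ q                       # inner product of each vector with q
    a = np.exp(scores - scores.max())
    a /= a.sum()                         # softmax -> weights a_j
    output = a @ U                       # weighted sum: sum over j of a_j * u_j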
How to evaluate word embeddings
- intrinsic
- evaluation on a specific/intermediate subtask
- ex - nearest neighbor of a particular word vector
- fast to compute
- helps to understand the system
- not clear if really helpful unless correlation to real task is established
- extrinsic
- evaluation on real task
- can take a long time to compute
- unclear whether the subsystem itself is the problem or its interaction with other subsystems
- if replacing exactly one subsystem with another improves accuracy -> winning
RNNs, when unrolled are just ____ with _____ transformations and ____
RNNs, when unrolled are just feed-forward Neural Networks with affine transformations and nonlinearities
How do we define the probability of a context word given a center word?
Use the softmax over the inner product between the context word vector and the center word vector. Both words are represented by vectors.
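In the usual skip-gram notation (v_c for the center word vector, u_o for a context word vector, V the vocabulary), this is:

    p(o | c) = exp(u_o · v_c) / sum over w in V of exp(u_w · v_c)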

Graph Embedding
Optimize the objective that connected nodes have more similar embeddings than unconnected via gradient descent
Multi-layer Soft Attention
Layers of attention, where each layer takes as input the output of the previous attention layer. The controller q is the hidden state
LSTM input gate (gt)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- decides how much the input should affect the cell state
Bleu score
Precision-based metric that measures n-gram overlap with a human reference
fastText
sub-word embeddings
Adds sub-word information to word2vec, which better handles out-of-vocabulary words
Word2Vec Idea/Context
- Idea - use words to predict their context words
- Context - a fixed window of size 2m

Applications of Language Modeling
- predictive typing
- in search fields
- for keyboards
- for assisted typing, e.g. sentence completion
- automatic speech recognition
- how likely is user to have said “my hair is wet” vs “my hairy sweat”
- basic grammar correction
- p(They’re happy together) > p(Their happy together)
Non-Autoregressive Machine Translation
Model generates all the tokens of a sequence in parallel, resulting in faster generation speed compared to auto-regressive models, but at the cost of lower accuracy
Conditional Language Modeling
Condition the language modeling equation on a new context, c

Hierarchical Compositionality for NLP
character -> word -> NP/VP/… -> clause -> sentence -> story
Flip a fair coin and guess. Perplexity?
2
Total loss for knowledge distillation
Weighted sum of the student loss (against the hard labels) and the distillation loss
T/F: Attention is always soft, not hard, where the distribution is used directly as a weighted average.
False - Attention can be soft or hard.
- Hard - where samples are drawn from the distribution over the input
- soft - where the distribution is used directly as a weighted average
Important property of attention as a layer
Representational power grows with the size of the input
Standard specializations of Transformers for text
- Position encodings depending on the location of a token in the text
- For language models: causal attention
- graph structure of a sequence (exclude connections that don't go L -> R)
- Training code outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
- multiplies training speed by the size of the context
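A small numpy sketch of a causal attention mask (random scores as placeholders); positions to the right of the current token are excluded before the softmax:

    import numpy as np

    T = 5
    scores = np.random.randn(T, T)                            # raw attention scores (placeholder)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores[causal_mask] = -np.inf                             # token t may only attend to tokens <= t
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # each row is a distribution over the past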
What architecture was created to attempt to alleviate the vanishing gradient problem?
Long Short-Term Memory (LSTM) Network
Masked Language Modeling
- Take as input a sequence of words
- Cover up some words with special mask tokens (e.g. [MASK])
- Take word embeddings with mask (+ positional embeddings) and feed to transformer encoder
- No notion of position of inputs, hence positional embeddings added
- Make predictions of masked words
- Give a significant boost in performance
Cross-lingual Masked Language Modeling
- don’t have to stick to a single language
- join two sentences in one sequence (e.g. English and French) with a separator between them
- Mask the word(s) of interest in both languages
- Add position and language embeddings to the word embeddings
- Feed through transformer encoder architecture
- Make predictions of the masked words
- Strength: cross-lingual transfer
Quantization
Speed up by performing matrix multiplication in a smaller precision domain
Cross-lingual transfer
Take a pre-trained model and further train it on a classification task with English data. The model is then also able to classify in other languages, even though the training language was only English
Byte-pair encoding
Like compression, where most frequent adjacent pair is iteratively replaced
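A toy Python sketch of the merge loop, assuming words are given as tuples of symbols with frequencies (an illustrative helper, not a production tokenizer):

    from collections import Counter

    def bpe_merges(word_counts, num_merges):
        # word_counts: dict mapping tuples of symbols to corpus frequencies
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in word_counts.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)            # most frequent adjacent pair
            merges.append(best)
            merged = {}
            for word, freq in word_counts.items():      # replace the pair everywhere
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged[tuple(out)] = merged.get(tuple(out), 0) + freq
            word_counts = merged
        return merges

For example, bpe_merges({("l","o","w"): 5, ("l","o","w","e","r"): 2}, 2) repeatedly replaces the most frequent adjacent pair, building up multi-character sub-word units.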
Masked Language Modeling is considered a _____ task
Masked Language Modeling is a pre-training task
T/F: Softmax is differentiable
True - which is why it is the mechanism of choice for attention
Hyperbolic Embeddings
- Learning hierarchical representations by embedding entities into hyperbolic space
- Discover hierarchies from similarity measurements
- Needs fewer dimensions than word2vec (around 2)
- in the circular (disk) visualization, more detailed/specific objects lie near the perimeter
Neural Machine Translation:
The probability of each output token, estimated separately (left-to-right), is based on:
- Entire input sentence (encoder outputs)
- All previously predicted tokens (decoder “state”)
Loss function for student (knowledge distillation)
Cross-entropy between the student's predictions and the ground-truth (hard) labels
In Neural Machine Translation, argmax p(t | s) is considered ____
- intractable
- exponential search space of possible sequences
- estimated by beam search
- typical beam size: 4 to 6
Teacher Forcing
Using the actual word instead of the predicted word to feed to the next time step of the RNN. It allows the model to keep training effectively even if it would have made a mistake in previous time steps.
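A schematic Python sketch, assuming hypothetical rnn_step(token, h) -> (logits, h) and loss_fn(logits, target) functions:

    def train_step_teacher_forcing(rnn_step, loss_fn, h0, target_tokens):
        # target_tokens: the ground-truth sequence w1..wT
        h, total_loss = h0, 0.0
        for t in range(len(target_tokens) - 1):
            # feed the TRUE token at step t, not the model's own prediction
            logits, h = rnn_step(target_tokens[t], h)
            total_loss += loss_fn(logits, target_tokens[t + 1])
        return total_loss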
What is the objective function for Skip-gram?
(average) negative log-likelihood

Multi-head attention
- Combines multiple attention heads being trained in the same way on the same data - but with different weight matrices, and yielding different values
- Each of the L attention heads yields values for each token - these values are then multiplied by trained parameters and added
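A compact numpy sketch with L heads; the projection matrices are random placeholders standing in for trained parameters:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    T, d_model, L = 6, 32, 4
    d_head = d_model // L
    x = np.random.randn(T, d_model)                  # token representations
    Wq, Wk, Wv = (np.random.randn(L, d_model, d_head) for _ in range(3))
    Wo = np.random.randn(L * d_head, d_model)        # output projection

    heads = []
    for i in range(L):                               # same data, different weight matrices per head
        Q, K, V = x @ Wq[i], x @ Wk[i], x @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_head))       # attention weights for this head
        heads.append(A @ V)                          # values produced by this head
    out = np.concatenate(heads, axis=-1) @ Wo        # multiply by trained parameters and combine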
Cross Entropy
The expected number of bits required to represent an event drawn from the reference distribution (p*) when using a coding scheme optimal for p
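In formula form, with reference distribution p* and coding/model distribution p:

    H(p*, p) = - sum over x of p*(x) * log2 p(x)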
LSTM forget gate (ft)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- decides how much of previous cell state to keep around
- ft = 0, forget everything
- ft = 1, remember everything
Applications of Conditional Language Modeling
- Topic aware language model
- c = the topic, s = the text
- Text summarization
- c = long document, s = summary
- Machine Translation
- c = French, s = English
- Image captioning
- c = image, s = caption
- Optical character recognition
- c = image of a line, s = its content
- speech recognition
- c = recording, s = content
What is the problem with modeling sequences with Multi-layer perceptrons?
- Cannot easily support variable-sized sequences as inputs
- Cannot easily support variable-sized sequences as outputs
- No inherent temporal structures
- no notion that input 1 comes before input 2
- No practical way of holding state
- does not generalize when words change order
- The size of the network grows with the maximum allowed size of the input or output sequences
T/F: Embeddings of different types (page, video, or word embeddings) can be combined to perform one task
True
Vocabulary Reduction
- Not all tokens are likely for every input sequence
- IBM alignment models use statistics to model translation probabilities
- lexical probabilities can be used to predict the most likely output tokens for a given input
Distributional Semantics
A word’s meaning is given by the words that frequently appear close-by
Perplexity
The geometric mean of the inverse per-word probabilities of a sequence of words according to the model.
The perplexity of a discrete uniform distribution over k outcomes is k.
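One common way to write it for a sequence w1..wN under model p:

    PPL = ( product over i of 1 / p(wi | w1..wi-1) )^(1/N) = 2^(per-word cross-entropy, in bits)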

Loss function for distillation (knowledge distillation)
Cross-entropy between teacher and student or KL divergence

Softmax is permutation ____
Softmax is permutation equivariant
A permutation of the input leads to the same permutation of the output
Language Modeling
Allows us to estimate probabilities of sequences (ex: p(“I eat an apple”)) and lets us perform comparisons
p(s) = p(w1, w2, …, wn)
= p(w1) p(w2|w1) p(w3|w1,w2)…p(wn | wn-1, …, w1)
Per-word Cross-entropy
Cross-entropy averaged over all words in the sequence, where the reference distribution is the empirical distribution of words in the sequence. This is commonly used as a loss function.

LSTM candidate update (ut)
- result of affine transformation of previous hidden state and current input passed through tanh
- new information coming from the input we’ve just seen
- modulated by the input gate
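Putting the four LSTM gate cards together (forget gate ft, input gate gt, candidate update ut, output gate ot), a minimal numpy sketch of one cell step with random placeholder weights:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    H, D = 8, 4                                          # hypothetical hidden and input sizes
    x_t = np.random.randn(D)                             # current input
    h_prev, c_prev = np.random.randn(H), np.random.randn(H)
    W = {k: np.random.randn(H, H + D) for k in "fgou"}   # one affine transformation per gate
    b = {k: np.zeros(H) for k in "fgou"}
    z = np.concatenate([h_prev, x_t])                    # previous hidden state + current input

    f_t = sigmoid(W["f"] @ z + b["f"])                   # forget gate: how much of c_prev to keep
    g_t = sigmoid(W["g"] @ z + b["g"])                   # input gate: how much the input affects the cell
    u_t = np.tanh(W["u"] @ z + b["u"])                   # candidate update: new information from the input
    o_t = sigmoid(W["o"] @ z + b["o"])                   # output gate: how much cell state to surface
    c_t = f_t * c_prev + g_t * u_t                       # new cell state
    h_t = o_t * np.tanh(c_t)                             # new hidden state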
Truncated backpropagation through time
Updates become expensive for long sequences, so the hidden state is carried forward indefinitely, but gradients are only backpropagated through a fixed number of timesteps.
Non-Local Neural Networks
Networks in which the output at each position is a weighted sum over all positions, with the weights computed dynamically from the data via a similarity function (rather than fixed learned weights)
Knowledge Distillation
- Have pretrained model (teacher)
- too slow or too expensive
- Add a student model
- Both teacher and student models make (soft) predictions
- The discrepancy between the teacher's and student's soft predictions is the distillation loss
- Student model
- minimize distillation loss
- minimize student loss
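A numpy sketch of the combined objective; alpha and the temperature T are hypothetical hyperparameters, and the logits are assumed to come from the teacher and student models:

    import numpy as np

    def softmax(x, T=1.0):
        e = np.exp((x - x.max()) / T)
        return e / e.sum()

    def distillation_total_loss(student_logits, teacher_logits, true_label,
                                alpha=0.5, T=2.0):
        # student loss: cross-entropy against the ground-truth (hard) label
        student_loss = -np.log(softmax(student_logits)[true_label])
        # distillation loss: cross-entropy between teacher and student soft predictions
        p_teacher = softmax(teacher_logits, T)
        p_student = softmax(student_logits, T)
        distill_loss = -np.sum(p_teacher * np.log(p_student))
        # total loss: weighted sum of the two terms
        return alpha * student_loss + (1 - alpha) * distill_loss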
Selecting a vector from a set
Compute softmax(Uq) over the set; use the resulting distribution either as a weighted average (soft) or to sample/argmax a single vector (hard)
T/F: At each timestep for an RNN, the predicted word by the network is fed to the next timestep
False - The true word is fed to the next timestep, not the predicted word. “Teacher Forcing”
RNN components

xt - input at time t
ht - state at time t
ht-1 - state at time t - 1
f_theta - cell
The RNN is recursive with state being passed at each time step to the next one
How to feed words to an RNN?
- One-hot vector representation, with one dimension per word in the vocabulary
The more general way to look at embeddings
- Graphs
- node -> vector
- optimize objective that connected nodes have more similar embeddings than unconnected nodes via gradient descent
- Words
- word -> vector
Vanilla (Elman) RNN
ht = tanh(W xt + U ht-1 + b): a single tanh nonlinearity applied to an affine transformation of the current input and the previous state
T/F: Neural translation quality changes linearly
False - Small change can cause catastrophic error
Translation is often modeled as a _____
Conditional language model
P(t | s ) = P(t1 | s) * … P(tn | t1, …, tn-1, s)
How do the query q, vectors {u1,…,un}, and distributions in softmax attention differ from that in an MLP?
- Softmax at the final layer of a MLP
- q is the last hidden state
- {u1,…,un} are the embeddings of the class labels
- samples from the distribution corresponds to labelings (outputs)
- Softmax Attention
- q is an internal hidden state
- {u1,…,un} are the embeddings of an input (i.e. the previous layer)
- distribution correspond to a summary of {u1,…,un}
- a weighted summary of u
What causes the vanishing gradient problem in Vanilla RNNs?
Backpropagation through time repeatedly multiplies the gradient by the recurrent weight matrix and the derivative of the squashing nonlinearity; over many timesteps these repeated factors shrink the gradient exponentially, so long-range dependencies receive almost no learning signal
Word2vec
Efficient Estimation of Word Representations in Vector space
Word embedding evaluation
a:b :: c: ?
Is an example of a ___ word embedding evaluation
intrinsic
Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
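A small numpy sketch of the a:b :: c:? test, assuming a hypothetical vectors dict mapping words to their embedding vectors:

    import numpy as np

    def analogy(vectors, a, b, c):
        # b - a + c, then nearest neighbor by cosine similarity (excluding a, b, c)
        query = vectors[b] - vectors[a] + vectors[c]
        query /= np.linalg.norm(query)
        best, best_sim = None, -np.inf
        for word, v in vectors.items():
            if word in (a, b, c):
                continue
            sim = float(v @ query) / np.linalg.norm(v)
            if sim > best_sim:
                best, best_sim = word, sim
        return best   # e.g. analogy(vectors, "man", "king", "woman") should ideally return "queen"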
Graph Embeddings
Optimize the objective that connected nodes have more similar embeddings than unconnected nodes, via gradient descent
Graph Data examples
Knowledge Graphs
Recommender Systems
Social graphs
Word2vec: the skip-gram model
- Word Embeddings
- Idea - use words to predict their context words
- context - a fixed window of size 2m

GloVe
Global Vectors
Training of the embedding is performed on aggregated global word co-occurrence statistics from a corpus.
What are less computationally expensive alternatives for the inner product softmax in Word2Vec?
Hierarchical Softmax
Negative Sampling
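For reference, the negative-sampling objective for one (center c, context o) pair with k sampled negative words w1..wk is commonly written as:

    J = -log sigma(u_o · v_c) - sum over i=1..k of log sigma(-u_wi · v_c)

where sigma is the sigmoid, u are the context ("outside") vectors, v_c is the center vector, and the negatives are drawn from a noise distribution.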
Attention-Based Networks can _________ in an ordered or arbitrary set
up or down weight a whole range of elements
Language Models are ____ models of language
generative (can make new sequences of words based off of conditional probabilities)
T/F: Graph embeddings are a task specific entity representation
False - Task-agnostic
Sentence-level tasks
- ex: sentiment analysis
- input a sentence without any masked tokens + positions, go through transformer encoder architecture, output global meaning of sentence
Evaluating LM Performance
- Cross Entropy

Embedding
A learned map from entities to vectors of numbers that encodes similarity
Collobert and Weston vectors
A word and its context is a positive training sample; a random word in that same context gives a negative training sample
T/F: The softmax composed of the inner product between two word vectors is expensive to compute
True - the normalizing sum in the denominator runs over the entire vocabulary.