Quiz 4 - Module 3 Flashcards

1
Q

LSTM output gate (ot)

A
  • result of affine transformation of previous hidden state and current input passed through sigmoid
  • modulates the value of the hidden state
  • decides how much of the cell state we want to surface
2
Q

RNN Language Model: Inference

A
  • Start with the first word; in practice, use a special symbol to indicate a new sentence
  • Feed the words in the history until we run out of history
  • Take hidden state h, transform
    • project h into a high-dimensional space (same dimension as the number of words in the vocabulary)
  • normalize transformed h
    • use softmax
  • result: probability distribution over the model's belief of the next word
3
Q

Why are graph embeddings useful?

A
  • task-agnostic entity representations
  • features are useful on downstream tasks without much data
  • nearest neighbors are semantically meaningful
4
Q

Contextualized Word Embedding Algorithms

A

ELMo, BERT

5
Q

The most standard form of attention in current neural networks is implemented with the ____

A

Softmax

6
Q

Many to many Sequence Modeling examples

A
  • speech recognition
  • optical character recognition
7
Q

Token-level tasks

A
  • ex: named entity recognition
  • input a sentence without any masked tokens + positions, go through the transformer encoder architecture, and output classifications of entities (persons, locations, dates)
8
Q

Steps of Beam Search Algorithm

A
  • Search exponential space in linear time
  • Beam size k determines width of search
  • At each step, extend each of k elements by one token
  • The top k overall then become the hypotheses for the next step (see the sketch below)
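A minimal sketch of these steps, assuming a hypothetical `step_fn(prefix)` callback that returns candidate next tokens with their log-probabilities:

```python
import heapq

def beam_search(start_token, step_fn, beam_size=4, max_len=20, eos=None):
    """Minimal beam search. step_fn(prefix) -> list of (token, log_prob) continuations."""
    beams = [(0.0, [start_token])]                    # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if eos is not None and seq[-1] == eos:
                candidates.append((score, seq))       # finished hypotheses carry over
                continue
            for token, logp in step_fn(seq):          # extend each hypothesis by one token
                candidates.append((score + logp, seq + [token]))
        # keep only the top k overall as the hypotheses for the next step
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams
```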
9
Q

Self-Attention improves on the multi-layer softmax attention method by ___

A

“Multi-query hidden-state propagation”

Having a controller state for every single input.

The size of the controller state grows with the input

10
Q

Data Scarcity Issues

A
  • Language Similarity missing
    • the language is different from the source (i.e. not similar to english the way spanish/french are)
  • Domain incorrect
    • i.e. medical terms rather than social language
  • Evaluation
    • no access to real test set
11
Q

Many to One Sequence Modeling examples

A
  • Sentiment Analysis
  • Topic Classification
12
Q

Attention

A

Weighting or probability distribution over inputs that depends on computational state and inputs

13
Q

Differentiably Selecting a Vector from a set

A
  • Given vectors {u1, …, un} and query vector q
  • The most similar vector to q can be found via softmax(Uq)
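A tiny numpy illustration of this card: softmax(Uq) puts nearly all of its weight on the row of U most similar to q (the vectors and query here are made-up examples):

```python
import numpy as np

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])            # the set {u1, u2, u3}
q = np.array([0.1, 4.0])              # query vector, closest in direction to u2

scores = U @ q                        # inner product of q with each u_i
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # softmax(Uq)
print(weights)                        # almost all weight lands on u2
```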
14
Q

Alignment in machine translation

A

For each word in the target, get a distribution over words in the source

15
Q

Graph embeddings are a form of ____ learning on graphs

A

unsupervised learning

16
Q

What makes Non-Local Neural Networks differ from fully connected Neural Networks?

A

Output is the weighted summation dynamically computed based on the data. In a fully connected layer, the weights are not dynamic (learned and applied regardless of input).

The similarity function in a non-local neural network is data dependent. This allows the network to learn the connectivity pattern, learn for each piece of data what is important, and then sum up the contributions across those pieces of data to form the output.

17
Q

Distribution over inputs that depends on computational state and the inputs themselves

A

Attention

18
Q

Roll a fair die and guess. Perplexity?

A

6

19
Q

T/F: Softmax is useful for random selection

A

True

20
Q

Recurrent Neural Networks are typically designed for ____ data

A

sequential

21
Q

Sequence Transduction

A

Sequence to Sequence (Many to Many Sequence Modeling)

22
Q

What information is important for graph representations?

A
  • state
    • compactly representing all the data we have processed thus far
  • neighborhood
    • what other elements to incorporate?
    • selecting from a set of elements with similarity or attention
  • propagation of info
    • how to update info given selected elements
23
Q

What dominates computation cost in machine translation?

A

Inference

  • Expensive
    • step-by-step computation (auto-regressive: a different token is predicted at each step)
    • output projection (vocab size * output length * beam size)
    • deeper models
  • Strategies
    • smaller vocabs
    • more efficient computation
    • reduce depth/increase parallelism
24
Q

What allows information to propagate directly between distant computational nodes while making minimal structural assumptions?

A

The attention algorithm

25
Q

Current (Standard) Approach to (Soft) Attention

A
  • Take a set of vectors u1, ..., un (an unordered set)
  • Inner product each of the vectors with the controller q
  • Take the softmax of the resulting set of numbers to get weights aj
  • The output is the sum of the inputs uj weighted by aj (see the sketch below)
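A short numpy sketch of the procedure described above:

```python
import numpy as np

def soft_attention(U, q):
    """Soft attention over an unordered set of vectors (the rows of U) with controller q."""
    scores = U @ q                    # inner product of each u_j with the controller
    a = np.exp(scores - scores.max())
    a = a / a.sum()                   # softmax -> weights a_j
    return a @ U                      # output: sum_j a_j * u_j
```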
26
Q

How to evaluate word embeddings

A
  • intrinsic
    • evaluation on a specific/intermediate subtask
    • ex: nearest neighbor of a particular word vector
    • fast to compute
    • helps to understand the system
    • not clear if really helpful unless correlation to a real task is established
  • extrinsic
    • evaluation on a real task
    • can take a long time to compute
    • unclear if the subsystem is the problem or its interaction
    • if replacing exactly one subsystem with another improves accuracy -> winning
27
Q

RNNs, when unrolled, are just ____ with ____ transformations and ____

A

RNNs, when unrolled, are just feed-forward neural networks with affine transformations and nonlinearities

28
Q

How do we define the probability of a context word given a center word?

A

Use the softmax on the inner product between the context word and the center word. Both words are represented by vectors.

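Written out as a formula (notation assumed here: v_c is the center-word vector, u_o the context-word vector, V the vocabulary):

```latex
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```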
29
Q

Graph Embedding

A

Optimize the objective that connected nodes have more similar embeddings than unconnected nodes, via gradient descent

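A toy gradient-descent sketch of that objective, using a logistic loss with one random negative node per edge (the graph, dimensions, and learning rate are made-up examples, not anything prescribed by the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, lr = 5, 8, 0.1
Z = rng.normal(size=(n_nodes, dim))            # one embedding per node
edges = [(0, 1), (1, 2), (3, 4)]               # toy graph

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for i, j in edges:
        k = int(rng.integers(n_nodes))         # random node as a (likely unconnected) negative
        if k in (i, j):
            continue
        # logistic loss: -log sigmoid(zi.zj) - log sigmoid(-zi.zk)
        g_pos = sigmoid(Z[i] @ Z[j]) - 1.0     # gradient coefficient for the connected pair
        g_neg = sigmoid(Z[i] @ Z[k])           # gradient coefficient for the negative pair
        gi, gj, gk = g_pos * Z[j] + g_neg * Z[k], g_pos * Z[i], g_neg * Z[i]
        Z[i] -= lr * gi
        Z[j] -= lr * gj
        Z[k] -= lr * gk

# Connected nodes should end up more similar than unconnected ones.
print(Z[0] @ Z[1], Z[0] @ Z[3])
```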
30
Q

Multi-layer Soft Attention

A

Layers of attention where each layer takes as input the output of the previous attention layer. The controller q is the hidden state.

31
Q

LSTM input gate (gt)

A
  • result of affine transformation of previous hidden state and current input passed through sigmoid
  • decides how much the input should affect the cell state
32
Q

BLEU score

A

Precision-based metric that measures n-gram overlap with a human reference

33
Q

fastText

A

Sub-word embeddings; adds information to word2vec and better handles out-of-vocabulary words

34
Q

Word2Vec Idea/Context

A
  • Idea: use words to predict their context words
  • Context: a fixed window of size 2m
35
Q

Applications of Language Modeling

A
  • predictive typing
    • in search fields
    • for keyboards
    • for assisted typing, e.g. sentence completion
  • automatic speech recognition
    • how likely is the user to have said "my hair is wet" vs "my hairy sweat"
  • basic grammar correction
    • p(They're happy together) > p(Their happy together)
36
Q

Non-Autoregressive Machine Translation

A

The model generates all the tokens of a sequence in parallel, resulting in faster generation speed compared to auto-regressive models, but at the cost of lower accuracy

37
Q

Conditional Language Modeling

A

Condition the language modeling equation on a new context, c

38
Q

Hierarchical Compositionality for NLP

A

character -> word -> NP/VP/... -> clause -> sentence -> story

39
Q

Flip a fair coin and guess. Perplexity?

A

2

40
Q

Total loss for knowledge distillation

A

Sum of the distillation loss and the student loss (typically weighted)

41
Q

T/F: Attention is soft, not hard, where the distribution is used directly as a weighted average.

A

False - Attention can be soft or hard.
  • Hard: samples are drawn from the distribution over the input
  • Soft: the distribution is used directly as a weighted average
42
Q

Important property of attention as a layer

A

Representational power grows with the size of the input

43
Q

Standard specializations of Transformers for text

A
  • Position encodings depending on the location of a token in the text
  • For language models: causal attention
    • graph structure of a sequence (exclude things that don't go left to right)
  • Training code outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
    • multiplies training speed by the size of the context
44
Q

What architecture was created to attempt to alleviate the vanishing gradient problem?

A

Long Short-Term Memory (LSTM) Network

45
Q

Masked Language Modeling

A
  • Take as input a sequence of words
  • Cover up some words with special mask tokens
  • Take word embeddings with mask (+ positional embeddings) and feed to the transformer encoder
  • No notion of position of inputs, hence positional embeddings are added
  • Make predictions of the masked words
  • Gives a significant boost in performance
46
Q

Cross-lingual Masked Language Modeling

A
  • don't have to stick to a single language
  • join two sentences in a sequence (english and french) with a separator between them
  • mask the word(s) of interest in both languages
  • add position and language embeddings to the word embeddings
  • feed through the transformer encoder architecture
  • make predictions of the masked words
  • strength: cross-lingual transfer
47
Q

Quantization

A

Speed up by performing matrix multiplication in a smaller-precision domain

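A rough numpy illustration of the idea, assuming simple symmetric int8 quantization with one scale per tensor (real systems use more refined schemes):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization: int8 values plus a single float scale."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
W, X = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
Wq, sw = quantize_int8(W)
Xq, sx = quantize_int8(X)

# Matrix multiply carried out on small integers, then rescaled back to float.
approx = (Wq.astype(np.int32) @ Xq.astype(np.int32)) * (sw * sx)
print(np.abs(approx - W @ X).max())   # small quantization error
```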
48
Q

Cross-lingual transfer

A

Take a pre-trained model and further train it with classification of english data. The model will then also be able to classify in other languages, even though the training language was only english.

49
Q

Byte-pair encoding

A

Like compression, where the most frequent adjacent pair is iteratively replaced

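A toy sketch of the merge loop on a handful of words (real BPE runs over a corpus and also stores the merge order so new text can be segmented the same way):

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with a single merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def bpe_merges(words, num_merges=3):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): c for w, c in Counter(words).items()}   # word (as symbols) -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                       # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): c for symbols, c in vocab.items()}
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"]))           # e.g. ('l','o'), ('lo','w'), ...
```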
50
Q

Masked Language Modeling is considered a _____ task

A

Masked Language Modeling is a pre-training task

51
Q

T/F: Softmax is differentiable

A

True - this is why it is the mechanism of choice for attention

52
Q

Hyperbolic Embeddings

A
  • Learning hierarchical representations by embedding entities into hyperbolic space
  • Discover hierarchies from similarity measurements
  • Needs fewer dimensions than word2vec (around 2)
  • In the circular shape, more detailed objects are on the perimeter
53
Q

Neural Machine Translation: the probability of each output token, estimated separately (left-to-right), is based on:

A
  • the entire input sentence (encoder outputs)
  • all previously predicted tokens (decoder "state")
54
Q

Loss function for student (knowledge distillation)

A

Cross-entropy between the student predictions and the ground-truth (hard) labels

55
Q

Differentiably Selecting a Vector from a set

A
  • Given vectors {u1, …, un} and query vector q
  • The most similar vector to q can be found via softmax(Uq)
56
Q

In Neural Machine Translation, argmax p(t | s) is considered ____

A
  • intractable
    • exponential search space of possible sequences
  • estimated by beam search
    • beam size: 4 to 6
57
Q

Teacher Forcing

A

Using the actual word instead of the predicted word to feed to the next time step of the RNN. It allows the model to keep training effectively even if it would have made a mistake in previous time steps.

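A PyTorch sketch of the difference; the module sizes and data here are placeholders, and the `teacher_forcing` flag switches between feeding the true word and the model's own prediction:

```python
import torch
import torch.nn as nn

vocab, hidden = 50, 32                                 # made-up sizes
embed = nn.Embedding(vocab, hidden)
cell = nn.GRUCell(hidden, hidden)
out = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()

def sequence_loss(target_tokens, teacher_forcing=True):
    """target_tokens: 1-D LongTensor of ground-truth token ids."""
    h = torch.zeros(1, hidden)
    prev = target_tokens[:1]                           # start from the first true token
    losses = []
    for t in range(1, len(target_tokens)):
        h = cell(embed(prev), h)
        logits = out(h)
        losses.append(loss_fn(logits, target_tokens[t:t + 1]))
        if teacher_forcing:
            prev = target_tokens[t:t + 1]              # feed the actual word to the next step
        else:
            prev = logits.argmax(dim=-1)               # feed the model's own prediction
    return torch.stack(losses).mean()

loss = sequence_loss(torch.randint(0, vocab, (12,)))   # toy usage
```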
58
Q

What is the objective function for Skip-gram?

A

(average) negative log-likelihood

59
Q

Multi-head attention

A
  • Combines multiple attention heads trained in the same way on the same data, but with different weight matrices, yielding different values
  • Each of the L attention heads yields values for each token; these values are then multiplied by trained parameters and added
60
Q

Cross Entropy

A

The expected number of bits required to represent an event drawn from the reference distribution (p*) when using a coding scheme optimal for p

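In symbols (base-2 log gives the answer in bits; p* is the reference distribution and p is the model's distribution):

```latex
H(p^{*}, p) = -\sum_{x} p^{*}(x) \log_2 p(x)
```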
61
Q

LSTM forget gate (ft)

A
  • result of affine transformation of previous hidden state and current input passed through sigmoid
  • decides how much of previous cell state to keep around
    • ft = 0: forget everything
    • ft = 1: remember everything
62
Q

Applications of Conditional Language Modeling

A
  • Topic-aware language model
    • c = the topic, s = the text
  • Text summarization
    • c = long document, s = summary
  • Machine translation
    • c = french, s = english
  • Image captioning
    • c = image, s = caption
  • Optical character recognition
    • c = image of a line, s = its content
  • Speech recognition
    • c = recording, s = content
63
Q

What is the problem with modeling sequences with multi-layer perceptrons?

A
  • Cannot easily support variable-sized sequences as inputs
  • Cannot easily support variable-sized sequences as outputs
  • No inherent temporal structure
    • no notion that input 1 comes before input 2
  • No practical way of holding state
    • does not generalize when words change order
  • The size of the network grows with the maximum allowed size of the input or output sequences
64
Q

T/F: Embeddings of different types (page, video, or word embeddings) can be combined to perform one task

A

True

65
Q

Vocabulary Reduction

A
  • Not all tokens are likely for every input sequence
  • IBM alignment models use statistics to model translation probabilities
  • lexical probabilities can be used to predict the most likely outputs for a given input
66
Q

Distributional Semantics

A

A word's meaning is given by the words that frequently appear close-by

67
Q

Perplexity

A

Geometric mean of the inverse probability of a sequence of words according to the model. The perplexity of a discrete uniform distribution over k events is k.

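A tiny sketch that makes the definition concrete and reproduces the fair-die and fair-coin answers from the earlier cards:

```python
import math

def perplexity(probs):
    """Geometric mean of the inverse probabilities the model assigns to each observed word."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(perplexity([1 / 6] * 10))   # fair die: 6.0
print(perplexity([1 / 2] * 10))   # fair coin: 2.0
```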
68
Q

Loss function for distillation (knowledge distillation)

A

Cross-entropy between teacher and student, or KL divergence

69
Q

Softmax is permutation ____

A

Softmax is permutation equivariant: a permutation of the input leads to the same permutation of the output

70
Q

Language Modeling

A

Allows us to estimate probabilities of sequences (ex: p("I eat an apple")) and lets us perform comparisons:
p(s) = p(w1, w2, ..., wn) = p(w1) p(w2|w1) p(w3|w1,w2) ... p(wn|wn-1, ..., w1)

71
Q

Per-word Cross-entropy

A

Cross entropy averaged over all words in the sequence, where the reference distribution is the empirical distribution of words in the sequence. This is commonly used as a loss function.

72
Q

LSTM candidate update (ut)

A
  • result of affine transformation of previous hidden state and current input passed through tanh
  • new information coming from the input we've just seen
  • modulated by the input gate (see the combined cell sketch below)
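A numpy sketch tying the four LSTM gate cards together (gate names g, f, o, u follow these cards; packing all four affine transformations into one weight matrix W is just a compactness assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; g = input gate, f = forget gate, o = output gate, u = candidate update."""
    z = W @ np.concatenate([h_prev, x_t]) + b    # one affine transform, split four ways
    g, f, o, u = np.split(z, 4)
    g, f, o = sigmoid(g), sigmoid(f), sigmoid(o) # gates: sigmoid of the affine transform
    u = np.tanh(u)                               # candidate update: tanh of the affine transform
    c_t = f * c_prev + g * u                     # keep some old cell state, add gated new info
    h_t = o * np.tanh(c_t)                       # output gate decides how much cell state to surface
    return h_t, c_t

# Toy usage with made-up sizes.
h, d = 3, 2
W, b = np.random.randn(4 * h, h + d) * 0.1, np.zeros(4 * h)
h_t, c_t = lstm_step(np.ones(d), np.zeros(h), np.zeros(h), W, b)
```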
73
Q

Truncated backpropagation through time

A

Updates become expensive, so the state is carried forward forever, but we only backpropagate for a fixed number of steps.

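A PyTorch sketch of the idea; the model, stand-in data, and window size are placeholders, and `detach()` is what keeps the state value while dropping its backward history:

```python
import torch
import torch.nn as nn

rnn, readout = nn.RNNCell(8, 16), nn.Linear(16, 1)        # made-up model sizes
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)
h = torch.zeros(1, 16)
trunc, losses = 20, []

for step in range(1, 1001):
    x, y = torch.randn(1, 8), torch.randn(1, 1)           # stand-in streaming data
    h = rnn(x, h)
    losses.append((readout(h) - y).pow(2).mean())
    if step % trunc == 0:
        torch.stack(losses).mean().backward()             # backprop through at most `trunc` steps
        opt.step()
        opt.zero_grad()
        h = h.detach()                                    # carry the state forward, drop its history
        losses = []
```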
74
Q

Non-Local Neural Networks

A

Networks whose output is a weighted summation of the inputs, dynamically computed from the data via a data-dependent similarity function

75
Q

Knowledge Distillation

A
  • Have a pretrained model (teacher)
    • too slow or too expensive
  • Add a student model
  • Both teacher and student model perform (soft) predictions
  • The difference in loss between teacher and student is the distillation loss
  • Student model objectives (see the sketch below)
    • minimize distillation loss
    • minimize student loss
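A PyTorch sketch of the two objectives combined; the temperature T, weight alpha, and T² factor follow the common Hinton-style recipe, which is an assumption beyond what these cards state:

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of the soft distillation loss and the usual hard-label student loss."""
    distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),        # student's softened prediction
        F.softmax(teacher_logits / T, dim=-1),            # teacher's softened prediction
        reduction="batchmean",
    ) * (T * T)
    student = F.cross_entropy(student_logits, labels)     # loss on the true labels
    return alpha * distill + (1 - alpha) * student
```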
76
Q

Selecting a vector from a set

A

Compute the similarity (inner product) of a query q with each vector in the set; the most similar vector can be selected with an argmax, or softly/differentiably with a softmax

77
Q

T/F: At each timestep for an RNN, the word predicted by the network is fed to the next timestep

A

False - During training, the true word is fed to the next timestep, not the predicted word ("Teacher Forcing")

78
Q

RNN components

A
  • xt - input at time t
  • ht - state at time t
  • ht-1 - state at time t-1
  • f_theta - the cell
The RNN is recursive, with state being passed at each time step to the next one.
79
Q

How to feed words to an RNN?

A
  • One-hot vector representation of all words in the vocabulary
80
Q

The more general way to look at embeddings

A
  • Graphs
    • node -> vector
    • optimize the objective that connected nodes have more similar embeddings than unconnected nodes via gradient descent
  • Words
    • word -> vector
81
Q

Vanilla (Elman) RNN

A

ht = tanh(W ht-1 + U xt + b): the new state is a nonlinearity (tanh) applied to an affine transformation of the previous state and the current input

82
Q

T/F: Neural translation quality changes linearly

A

False - Small change can cause catastrophic error

83
Q

Translation is often modeled as a _____

A

Conditional language model: P(t | s) = P(t1 | s) * ... * P(tn | t1, ..., tn-1, s)

84
Q

How do the query q, vectors {u1, ..., un}, and distributions in softmax attention differ from those in an MLP?

A
  • Softmax at the final layer of an MLP
    • q is the last hidden state
    • {u1, ..., un} are the embeddings of the class labels
    • samples from the distribution correspond to labelings (outputs)
  • Softmax Attention
    • q is an internal hidden state
    • {u1, ..., un} are the embeddings of an input (i.e. a previous layer)
    • the distribution corresponds to a summary of {u1, ..., un}
      • a weighted summary of u
85
Q

What causes the vanishing gradient problem in Vanilla RNNs?

A

Backpropagating through many time steps multiplies the gradient repeatedly by the recurrent weight matrix and the derivative of the nonlinearity, so the gradient can shrink exponentially toward zero

86
Q

Word2vec

A

Efficient Estimation of Word Representations in Vector Space

87
Q

Word embedding evaluation a:b :: c:? is an example of a ___ word embedding evaluation

A

Intrinsic - evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions

88
Q

Graph Embeddings

A

Optimize the objective that connected nodes have more similar embeddings than unconnected nodes via gradient descent

89
Q

Graph Data examples

A
  • Knowledge graphs
  • Recommender systems
  • Social graphs
90
Q

Word2vec: the skip-gram model

A
  • Word embeddings
  • Idea: use words to predict their context words
  • Context: a fixed window of size 2m
91
Q

GloVe

A

Global Vectors - training of the embedding is performed on aggregated global word co-occurrence statistics from a corpus.

92
Q

What are less computationally expensive alternatives for the inner product softmax in Word2Vec?

A
  • Hierarchical softmax
  • Negative sampling
93
Q

Attention-based networks can _________ in an ordered/arbitrary set

A

up- or down-weight a whole range of elements

94
Q

Language Models are ____ models of language

A

generative (can make new sequences of words based on conditional probabilities)

95
Q

T/F: Graph embeddings are a task-specific entity representation

A

False - task-agnostic

96
Q

Sentence-level tasks

A
  • ex: sentiment analysis
  • input a sentence without any masked tokens + positions, go through the transformer encoder architecture, output the global meaning of the sentence
97
Q

Evaluating LM Performance

A
  • Cross Entropy
98
Q

Embedding

A

A learned map from entities to vectors of numbers that encodes similarity

99
Q

Collobert and Weston vectors

A

A word and its context is a positive training sample; a random word in that sample context gives a negative training sample

100
Q

T/F: The softmax composed of the inner product between two word vectors is expensive to compute

A

True.