Quiz 4 - Module 3 Flashcards
LSTM output gate (ot)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- modulates the value of the hidden state
- decides how much of the cell state we want to surface
RNN Language Model: Inference
- Start with first word, in practice use a special symbol to indicate new sentence
- Feed the words in the history until we run out of history
- Take hidden state h, transform
- project h into a high-dimensional space (same dimension as the vocabulary size)
- normalize transformed h
- use softmax
- result: the model's probability distribution over the next word
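A minimal numpy sketch of the projection + softmax steps above; the sizes, W_out, and h are hypothetical placeholders (random values standing in for a trained model):

    import numpy as np

    V, H = 10000, 512                    # hypothetical vocabulary and hidden-state sizes
    h = np.random.randn(H)               # hidden state after feeding in the history
    W_out = np.random.randn(V, H)        # output projection (stands in for trained weights)

    logits = W_out @ h                   # project h into a vocabulary-sized space
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax: probability distribution over the next word
    next_word_id = int(np.argmax(probs)) # or sample from probs during generation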
Why are graph embeddings useful?
- task-agnostic entity representations
- features are useful on downstream tasks without much data
- nearest neighbors are semantically meaningful
Contextualized Word Embedding Algorithms
ELMo, BERT
The most standard form of attention in current neural networks is implemented with the ____
Softmax
Many to many Sequence Modeling examples
- speech recognition
- optical character recognition
Token-level tasks
- ex: named entity recognition
- input a sentence without any masked tokens + positions, go through transformer encoder architecture, output classifications of entities (persons, locations, dates)
Steps of Beam Search Algorithm
- Search exponential space in linear time
- Beam size k determines width of search
- At each step, extend each of k elements by one token
- The top k overall then become the hypotheses for the next step (see the sketch below)
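A minimal Python sketch of these steps, assuming a hypothetical next_token_logprobs(prefix) function that returns (token, log-probability) pairs for possible one-token extensions of a prefix:

    import heapq

    def beam_search(next_token_logprobs, bos_token, num_steps, k=4):
        # each hypothesis is (cumulative log-probability, token sequence)
        beams = [(0.0, [bos_token])]
        for _ in range(num_steps):
            candidates = []
            for score, seq in beams:
                # extend each of the k hypotheses by one token
                for tok, logp in next_token_logprobs(seq):
                    candidates.append((score + logp, seq + [tok]))
            # keep only the top k overall as the hypotheses for the next step
            beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        return beams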

Self-Attention improves on the multi-layer softmax attention method by ___
“Multi-query hidden-state propagation”
Having a controller state for every single input.
The size of the controller state grows with the input
Data Scarcity Issues
- Language Similarity missing
- the language is very different from the source (i.e. not similar to English the way Spanish/French are)
- Domain incorrect
- i.e. medical terms, not social language
- Evaluation
- no access to real test set
Many to One Sequence Modeling examples
- Sentiment Analysis
- Topic Classification
Attention
A weighting or probability distribution over inputs that depends on the computational state and the inputs themselves
Differentiably Selecting a Vector from a set
- Given vectors {u1, …, un} and query vector q
- The most similar vector to q can be found via softmax(Uq)
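A small numpy sketch of this soft, differentiable selection (random U and q as stand-ins):

    import numpy as np

    n, d = 5, 8
    U = np.random.randn(n, d)            # rows are the vectors u1..un
    q = np.random.randn(d)               # query vector

    scores = U @ q                       # inner product of each u_j with q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax(Uq): a soft "selection" over the set
    most_similar = int(np.argmax(weights))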
Alignment in machine translation
For each word in the target, get a distribution over words in the source
Graph embeddings are a form of ____ learning on graphs
unsupervised learning
What makes Non-Local Neural Networks differ from fully connected Neural Networks?
The output is a weighted summation dynamically computed based on the data; in a fully connected layer, the weights are not dynamic (they are learned and then applied regardless of the input).
The similarity function in a non-local neural network is data dependent. This lets the network learn the connectivity pattern, decide for each piece of data what is important, and sum up the contributions across those pieces of data to form the output.
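For reference, the non-local operation is commonly written as (notation assumed, not taken from the cards):

    y_i = (1 / C(x)) * sum over j of f(x_i, x_j) * g(x_j)

where f is the data-dependent similarity between positions i and j, g is a learned transformation of the input at position j, and C(x) is a normalization term.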
Distribution over inputs that depends on computational state and the inputs themselves
Attention
Roll a fair die and guess. Perplexity?
6
T/F: Softmax is useful for random selection
True
Recurrent Neural Networks are typically designed for ____ data
sequential
Sequence Transduction
Sequence to Sequence (Many to Many Sequence Modeling)
what information is important for graph representations?
- state
- compactly representing all the data we have processed thus far
- neighborhood
- what other elements to incorporate?
- selecting from a set of elements with similarity or attention
- propagation of info
- how to update info given selected elements
What dominates computation cost in machine translation
Inference
- Expensive
- step-by-step computation (auto-regressive: predict a different token at each step)
- output projection (vocabulary size * output length * beam size)
- deeper models
- Strategies
- smaller vocabs
- more efficient computation
- reduce depth/increase parallelism
What allows information to propagate directly between distant computational nodes while making minimal structural assumptions?
The attention algorithm
Current (Standard) Approach to (Soft) Attention
- Take a set of vectors u1,…un
- Inner product each of the vectors with controller q
- unordered set
- Take the softmax of the set of numbers to get weights aj
- The output is the weighted sum of the inputs: sum over j of aj * uj
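A numpy sketch of these steps (random u vectors and controller q as placeholders):

    import numpy as np

    n, d = 6, 16
    U = np.random.randn(n, d)            # the unordered set of vectors u1..un
    q = np.random.randn(d)               # controller

    scores = U @ q                       # inner product of each vector with q
    a = np.exp(scores - scores.max())
    a /= a.sum()                         # softmax -> weights a_j
    output = a @ U                       # weighted sum: sum over j of a_j * u_j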
How to evaluate word embeddings
- intrinsic
- evaluation on a specific/intermediate subtask
- ex - nearest neighbor of a particular word vector
- fast to compute
- helps to understand the system
- not clear if really helpful unless correlation to real task is established
- extrinsic
- evaluation on real task
- can take a long time to compute
- unclear whether the subsystem itself is the problem or its interaction with other subsystems
- if replacing exactly one subsystem with another improves accuracy -> winning
RNNs, when unrolled are just ____ with _____ transformations and ____
RNNs, when unrolled are just feed-forward Neural Networks with affine transformations and nonlinearities
How do we define the probability of a context word given a center word?
Use the softmax over the inner product between the context word vector and the center word vector. Both words are represented by vectors.
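In the usual skip-gram notation (v_c for the center word vector, u_o for a context word vector, V the vocabulary), this is:

    p(o | c) = exp(u_o · v_c) / sum over w in V of exp(u_w · v_c)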

Graph Embedding
Optimize the objective that connected nodes have more similar embeddings than unconnected via gradient descent
Multi-layer Soft Attention
Layers of attention, where each layer takes as input the output of the previous attention layer. The controller q is the hidden state
LSTM input gate (gt)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- decides how much the input should affect the cell state
Bleu score
Precision-based metric that measures n-gram overlap with a human reference
fastText
sub-word embeddings
Adds sub-word information to word2vec, which better handles out-of-vocabulary words
Word2Vec Idea/Context
- Idea - use words to predict their context words
- Context - a fixed window of size 2m

Applications of Language Modeling
- predictive typing
- in search fields
- for keyboards
- for assisted typing, e.g. sentence completion
- automatic speech recognition
- how likely is user to have said “my hair is wet” vs “my hairy sweat”
- basic grammar correction
- p(They’re happy together) > p(Their happy together)
Non-Autoregressive Machine Translation
Model generates all the tokens of a sequence in parallel, resulting in faster generation speed compared to auto-regressive models, but at the cost of lower accuracy
Conditional Language Modeling
Condition the language modeling equation on a new context, c

Hierarchical Compositionality for NLP
character -> word -> NP/VP/… -> clause -> sentence -> story
Flip a fair coin and guess. Perplexity?
2
Total loss for knowledge distillation
Weighted sum of the student loss (against the hard labels) and the distillation loss
T/F: Attention is always soft, not hard, where the distribution is used directly as a weighted average.
False - Attention can be soft or hard.
- Hard - where samples are drawn from the distribution over the input
- soft - where the distribution is used directly as a weighted average
Important property of attention as a layer
Representational power grows with the size of the input
Standard specializations of Transformers for text
- Position encodings depending on the location of a token in the text
- For language models: causal attention
- graph structure of a sequence (exclude connections that don't go L -> R)
- Training code outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
- multiplies training speed by the size of the context
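A small numpy sketch of a causal attention mask (random scores as placeholders); positions to the right of the current token are excluded before the softmax:

    import numpy as np

    T = 5
    scores = np.random.randn(T, T)                            # raw attention scores (placeholder)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores[causal_mask] = -np.inf                             # token t may only attend to tokens <= t
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # each row is a distribution over the past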
What architecture was created to attempt to alleviate the vanishing gradient problem?
Long Short-Term Memory (LSTM) Network
Masked Language Modeling
- Take as input a sequence of words
- Cover up some words with special mask tokens (e.g. [MASK])
- Take word embeddings with mask (+ positional embeddings) and feed to transformer encoder
- No notion of position of inputs, hence positional embeddings added
- Make predictions of masked words
- Give a significant boost in performance
Cross-lingual Masked Language Modeling
- don’t have to stick to a single language
- join two sentences in one sequence (e.g. English and French) with a separator between them
- Mask the word(s) of interest in both languages
- Add position and language embeddings to the word embeddings
- Feed through transformer encoder architecture
- Make predictions of the masked words
- Strength: cross-lingual transfer
Quantization
Speed up by performing matrix multiplication in a smaller precision domain
Cross-lingual transfer
Take a pre-trained model and further train it on a classification task with English data. The model is then also able to classify in other languages, even though the training language was only English
Byte-pair encoding
Like compression, where most frequent adjacent pair is iteratively replaced
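A toy Python sketch of the merge loop, assuming words are given as tuples of symbols with frequencies (an illustrative helper, not a production tokenizer):

    from collections import Counter

    def bpe_merges(word_counts, num_merges):
        # word_counts: dict mapping tuples of symbols to corpus frequencies
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in word_counts.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)            # most frequent adjacent pair
            merges.append(best)
            merged = {}
            for word, freq in word_counts.items():      # replace the pair everywhere
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged[tuple(out)] = merged.get(tuple(out), 0) + freq
            word_counts = merged
        return merges

For example, bpe_merges({("l","o","w"): 5, ("l","o","w","e","r"): 2}, 2) repeatedly replaces the most frequent adjacent pair, building up multi-character sub-word units.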
Masked Language Modeling is considered a _____ task
Masked Language Modeling is a pre-training task
T/F: Softmax is differentiable
True - which is why it is the mechanism of choice for attention
Hyperbolic Embeddings
- Learning hierarchical representations by embedding entities into hyperbolic space
- Discover hierarchies from similarity measurements
- Needs fewer dimensions than word2vec (around 2)
- in the circular (disk) visualization, more detailed/specific objects lie near the perimeter
Neural Machine Translation:
The probability of each output token, estimated separately (left-to-right), is based on:
- Entire input sentence (encoder outputs)
- All previously predicted tokens (decoder “state”)
Loss function for student (knowledge distillation)
Cross-entropy between the student's predictions and the ground-truth (hard) labels
In Neural Machine Translation, argmax p(t | s) is considered ____
- intractable
- exponential search space of possible sequences
- estimated by beam search
- typical beam size: 4 to 6
Teacher Forcing
Using the actual word instead of the predicted word to feed to the next time step of the RNN. It allows the model to keep training effectively even if it would have made a mistake in previous time steps.
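A schematic Python sketch, assuming hypothetical rnn_step(token, h) -> (logits, h) and loss_fn(logits, target) functions:

    def train_step_teacher_forcing(rnn_step, loss_fn, h0, target_tokens):
        # target_tokens: the ground-truth sequence w1..wT
        h, total_loss = h0, 0.0
        for t in range(len(target_tokens) - 1):
            # feed the TRUE token at step t, not the model's own prediction
            logits, h = rnn_step(target_tokens[t], h)
            total_loss += loss_fn(logits, target_tokens[t + 1])
        return total_loss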
What is the objective function for Skip-gram?
(average) negative log-likelihood

Multi-head attention
- Combines multiple attention heads being trained in the same way on the same data - but with different weight matrices, and yielding different values
- Each of the L attention heads yields values for each token - these values are then multiplied by trained parameters and added
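A compact numpy sketch with L heads; the projection matrices are random placeholders standing in for trained parameters:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    T, d_model, L = 6, 32, 4
    d_head = d_model // L
    x = np.random.randn(T, d_model)                  # token representations
    Wq, Wk, Wv = (np.random.randn(L, d_model, d_head) for _ in range(3))
    Wo = np.random.randn(L * d_head, d_model)        # output projection

    heads = []
    for i in range(L):                               # same data, different weight matrices per head
        Q, K, V = x @ Wq[i], x @ Wk[i], x @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_head))       # attention weights for this head
        heads.append(A @ V)                          # values produced by this head
    out = np.concatenate(heads, axis=-1) @ Wo        # multiply by trained parameters and combine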
Cross Entropy
The expected number of bits required to represent an event drawn from the reference distribution (p*) when using a coding scheme optimal for p
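In formula form, with reference distribution p* and coding/model distribution p:

    H(p*, p) = - sum over x of p*(x) * log2 p(x)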
LSTM forget gate (ft)
- result of affine transformation of previous hidden state and current input passed through sigmoid
- decides how much of previous cell state to keep around
- ft = 0, forget everything
- ft = 1, remember everything
Applications of Conditional Language Modeling
- Topic aware language model
- c = the topic, s = the text
- Text summarization
- c = long document, s = summary
- Machine Translation
- c = French, s = English
- Image captioning
- c = image, s = caption
- Optical character recognition
- c = image of a line, s = its content
- speech recognition
- c = recording, s = content
What is the problem with modeling sequences with Multi-layer perceptrons?
- Cannot easily support variable-sized sequences as inputs
- Cannot easily support variable-sized sequences as outputs
- No inherent temporal structures
- no notion that input 1 comes before input 2
- No practical way of holding state
- does not generalize when words change order
- The size of the network grows with the maximum allowed size of the input or output sequences
T/F: Embeddings of different types (page, video, or word embeddings) can be combined to perform one task
True
Vocabulary Reduction
- Not all tokens are likely for every input sequence
- IBM alignment models use statistics to model translation probabilities
- lexical probabilities can be used to predict the most likely output tokens for a given input
Distributional Semantics
A word’s meaning is given by the words that frequently appear close-by
Perplexity
The geometric mean of the inverse per-word probabilities of a sequence of words according to the model.
The perplexity of a discrete uniform distribution over k outcomes is k.
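One common way to write it for a sequence w1..wN under model p:

    PPL = ( product over i of 1 / p(wi | w1..wi-1) )^(1/N) = 2^(per-word cross-entropy, in bits)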

Loss function for distillation (knowledge distillation)
Cross-entropy between teacher and student or KL divergence

Softmax is permutation ____
Softmax is permutation equivariant
A permutation of the input leads to the same permutation of the output
Language Modeling
Allows us to estimate probabilities of sequences (ex: p(“I eat an apple”)) and lets us perform comparisons
p(s) = p(w1, w2, …, wn)
= p(w1) p(w2|w1) p(w3|w1,w2)…p(wn | wn-1, …, w1)
Per-word Cross-entropy
Cross-entropy averaged over all words in the sequence, where the reference distribution is the empirical distribution of words in the sequence. This is commonly used as a loss function.

LSTM candidate update (ut)
- result of affine transformation of previous hidden state and current input passed through tanh
- new information coming from the input we’ve just seen
- modulated by the input gate
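Putting the four LSTM gate cards together (forget gate ft, input gate gt, candidate update ut, output gate ot), a minimal numpy sketch of one cell step with random placeholder weights:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    H, D = 8, 4                                          # hypothetical hidden and input sizes
    x_t = np.random.randn(D)                             # current input
    h_prev, c_prev = np.random.randn(H), np.random.randn(H)
    W = {k: np.random.randn(H, H + D) for k in "fgou"}   # one affine transformation per gate
    b = {k: np.zeros(H) for k in "fgou"}
    z = np.concatenate([h_prev, x_t])                    # previous hidden state + current input

    f_t = sigmoid(W["f"] @ z + b["f"])                   # forget gate: how much of c_prev to keep
    g_t = sigmoid(W["g"] @ z + b["g"])                   # input gate: how much the input affects the cell
    u_t = np.tanh(W["u"] @ z + b["u"])                   # candidate update: new information from the input
    o_t = sigmoid(W["o"] @ z + b["o"])                   # output gate: how much cell state to surface
    c_t = f_t * c_prev + g_t * u_t                       # new cell state
    h_t = o_t * np.tanh(c_t)                             # new hidden state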
Truncated backpropagation through time
Updates become expensive for long sequences, so the hidden state is carried forward indefinitely, but gradients are only backpropagated through a fixed number of timesteps.
Non-Local Neural Networks
Networks in which the output at each position is a weighted sum over all positions, with the weights computed dynamically from the data via a similarity function (rather than fixed learned weights)
Knowledge Distillation
- Have pretrained model (teacher)
- too slow or too expensive
- Add a student model
- Both teacher and student models make (soft) predictions
- The discrepancy between the teacher's and student's soft predictions is the distillation loss
- Student model
- minimize distillation loss
- minimize student loss
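A numpy sketch of the combined objective; alpha and the temperature T are hypothetical hyperparameters, and the logits are assumed to come from the teacher and student models:

    import numpy as np

    def softmax(x, T=1.0):
        e = np.exp((x - x.max()) / T)
        return e / e.sum()

    def distillation_total_loss(student_logits, teacher_logits, true_label,
                                alpha=0.5, T=2.0):
        # student loss: cross-entropy against the ground-truth (hard) label
        student_loss = -np.log(softmax(student_logits)[true_label])
        # distillation loss: cross-entropy between teacher and student soft predictions
        p_teacher = softmax(teacher_logits, T)
        p_student = softmax(student_logits, T)
        distill_loss = -np.sum(p_teacher * np.log(p_student))
        # total loss: weighted sum of the two terms
        return alpha * student_loss + (1 - alpha) * distill_loss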
Selecting a vector from a set
Compute softmax(Uq) over the set; use the resulting distribution either as a weighted average (soft) or to sample/argmax a single vector (hard)
T/F: At each timestep for an RNN, the predicted word by the network is fed to the next timestep
False - The true word is fed to the next timestep, not the predicted word. “Teacher Forcing”
RNN components

xt - input at time t
ht - state at time t
ht-1 - state at time t - 1
f_theta - cell
The RNN is recursive with state being passed at each time step to the next one
How to feed words to an RNN?
- One-hot vector representation, with one dimension per word in the vocabulary
The more general way to look at embeddings
- Graphs
- node -> vector
- optimize objective that connected nodes have more similar embeddings than unconnected nodes via gradient descent
- Words
- word -> vector
Vanilla (Elman) RNN
ht = tanh(W xt + U ht-1 + b): a single tanh nonlinearity applied to an affine transformation of the current input and the previous state
T/F: Neural translation quality changes linearly
False - Small change can cause catastrophic error
Translation is often modeled as a _____
Conditional language model
P(t | s ) = P(t1 | s) * … P(tn | t1, …, tn-1, s)
How do the query q, vectors {u1,…,un}, and distributions in softmax attention differ from that in an MLP?
- Softmax at the final layer of a MLP
- q is the last hidden state
- {u1,…,un} are the embeddings of the class labels
- samples from the distribution corresponds to labelings (outputs)
- Softmax Attention
- q is an internal hidden state
- {u1,…,un} are the embeddings of an input (i.e. the previous layer)
- distribution correspond to a summary of {u1,…,un}
- a weighted summary of u
What causes the vanishing gradient problem in Vanilla RNNs?
Backpropagation through time repeatedly multiplies the gradient by the recurrent weight matrix and the derivative of the squashing nonlinearity; over many timesteps these repeated factors shrink the gradient exponentially, so long-range dependencies receive almost no learning signal
Word2vec
Efficient Estimation of Word Representations in Vector space
Word embedding evaluation
a:b :: c: ?
Is an example of a ___ word embedding evaluation
intrinsic
Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
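A small numpy sketch of the a:b :: c:? test, assuming a hypothetical vectors dict mapping words to their embedding vectors:

    import numpy as np

    def analogy(vectors, a, b, c):
        # b - a + c, then nearest neighbor by cosine similarity (excluding a, b, c)
        query = vectors[b] - vectors[a] + vectors[c]
        query /= np.linalg.norm(query)
        best, best_sim = None, -np.inf
        for word, v in vectors.items():
            if word in (a, b, c):
                continue
            sim = float(v @ query) / np.linalg.norm(v)
            if sim > best_sim:
                best, best_sim = word, sim
        return best   # e.g. analogy(vectors, "man", "king", "woman") should ideally return "queen"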
Graph Embeddings
Optimize the objective that connected nodes have more similar embeddings than unconnected nodes, via gradient descent
Graph Data examples
Knowledge Graphs
Recommender Systems
Social graphs
Word2vec: the skip-gram model
- Word Embeddings
- Idea - use words to predict their context words
- context - a fixed window of size 2m

GloVe
Global Vectors
Training of the embedding is performed on aggregated global word co-occurrence statistics from a corpus.
What are less computationally expensive alternatives for the inner product softmax in Word2Vec?
Hierarchical Softmax
Negative Sampling
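For reference, the negative-sampling objective for one (center c, context o) pair with k sampled negative words w1..wk is commonly written as:

    J = -log sigma(u_o · v_c) - sum over i=1..k of log sigma(-u_wi · v_c)

where sigma is the sigmoid, u are the context ("outside") vectors, v_c is the center vector, and the negatives are drawn from a noise distribution.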
Attention-Based Networks can _________ in an ordered or arbitrary set
up or down weight a whole range of elements
Language Models are ____ models of language
generative (can make new sequences of words based off of conditional probabilities)
T/F: Graph embeddings are a task specific entity representation
False - Task-agnostic
Sentence-level tasks
- ex: sentiment analysis
- input a sentence without any masked tokens + positions, go through transformer encoder architecture, output global meaning of sentence
Evaluating LM Performance
- Cross Entropy

Embedding
A learned map from entities to vectors of numbers that encodes similarity
Collobert and Weston vectors
A word and its context is a positive training sample; a random word in that same context gives a negative training sample
T/F: The softmax composed of the inner product between two word vectors is expensive to compute
True - the normalizing sum in the denominator runs over the entire vocabulary.