Quiz #5 Flashcards
What is ‘attention’?
Weighting or probability distribution over inputs that depends on computational state and inputs.
What does attention allow us to do?
It allows information to propagate between “distant” computational nodes while making minimal structural assumptions.
What is one of the most popular tools used for building attention mechanisms?
Softmax
Softmax is differentiable? (True/False)
True
What are the inputs to softmax attention?
A set of vectors {u1, u2, … u_n} and a query vector ‘q’. What we want to do is (softly) select the vector most similar to q via p = softmax(U·q).
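A minimal NumPy sketch of this selection step (the function names and toy vectors below are illustrative, not from the notes):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def softmax_attention(U, q):
    """U: (n, d) matrix whose rows are u_1..u_n; q: (d,) query vector."""
    p = softmax(U @ q)             # weighting / probability distribution over inputs
    return p, p @ U                # soft "selection": weighted summary of the u_i

# The summary is pulled toward the row of U most similar to q.
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
q = np.array([1.0, 0.2])
print(softmax_attention(U, q))
```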
What is the difference between softmax applied to the final layer of an MLP and softmax attention?
- When Softmax is applied at the final layer of an MLP:
- q is the last hidden state, {u1, u2, … u_n} are the embeddings of the class labels
- Samples from the distribution correspond to labelings (outputs)
- In Softmax attention:
- q is an internal hidden state, {u1, u2, … u_n} are the embeddings of an “input” (e.g. the previous layer)
- Samples from the distribution correspond to a summary of {u1, u2, … u_n}
What was the biological inspiration for attention mechanisms?
Saccades (basically rapid, discontinuous eye movements between salient objects in the visual field).
At a conceptual level, what are visual attention mechanisms trying to do?
Given the current state/history of glimpses, where and what scale should we look at next?
The representational power of a softmax attention layer (or more generally, any attention layer) decreases as the input size grows? (True/False)
False. It increases, because the attention distribution (and the summary it produces) is computed over all of the inputs, so the layer’s capacity grows with the input size (unlike a fixed-size hidden layer).
How is the similarity between the query vector ‘q’ and the set of hidden state vectors {u1, u2, … u_n} typically computed?
By the inner (dot) product of q with each of the u vectors (cosine similarity when the vectors are normalized).
What is the ‘controller state’ in a Softmax Attention layer?
Initially, it’s just the input query vector q0. We use q0 to compute the next controller state (i.e. hidden state) q1, then use q1 to compute q2, and so on.
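A hedged sketch of iterating the controller state; the tanh/W update rule here is an assumed placeholder, since the notes only say that each q_t is used to compute q_{t+1}:

```python
import numpy as np

def attention_step(U, q, W):
    scores = U @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax attention over the inputs
    summary = p @ U                       # weighted summary of u_1..u_n
    return np.tanh(W @ summary)           # next controller state q_{t+1} (assumed update rule)

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))               # inputs u_1..u_5
q = rng.normal(size=4)                    # initial controller state q0
W = rng.normal(size=(4, 4))               # illustrative update weights
for _ in range(3):                        # q0 -> q1 -> q2 -> q3
    q = attention_step(U, q, W)
```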
What are transformer models?
Models that are made up of multiple attention layers.
What are three important architectural distinctions that result in the superior performance of transformer models compared to previous attention-based architectures?
- Multi-query hidden state propagation (“self-attention”)
- Multi-head attention
- Residual connections, LayerNorm
What is self-attention?
It uses a separate controller state (i.e. query/hidden state) for every single input, so the size of the controller state grows with the size of the input. This gives it even more representational power than traditional attention networks (remember that the representational power of an attention layer grows with its input size).
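A minimal sketch of single-head self-attention using the standard Q/K/V projection formulation (assumed here, not spelled out in the notes); note the output has one row per input:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) inputs; returns (n, d_v): one output row per input position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)            # softmax over each query's row
    return P @ V                                  # one attention summary per input

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                       # 6 inputs of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (6, 8)
```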
What is multi-head attention?
Multi-head attention combines multiple attention ‘heads’ being trained in the same way on the same data - but with different weight matrices, thus yielding different values.
Each of the ‘L’ attention heads yields values for each token - these values are then multiplied by trained parameters and added.
(To me this seems kind of similar to the idea in convnets of using multiple “filters” for each convolutional layer so we can learn different feature representations. )
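A self-contained sketch of multi-head attention along the lines described above; the shapes, the number of heads, and the output matrix Wo are illustrative assumptions:

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def one_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) tuples, one per attention head.
    Head outputs are concatenated, then mixed by the trained matrix Wo
    (the "multiplied by trained parameters and added" step)."""
    return np.concatenate([one_head(X, *h) for h in heads], axis=-1) @ Wo

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))                                # 6 tokens, dimension 8
heads = [tuple(rng.normal(size=(8, 8)) for _ in range(3)) for _ in range(4)]
Wo = rng.normal(size=(8 * 4, 8))
print(multi_head_attention(X, heads, Wo).shape)            # (6, 8)
```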
What are some of the major reasons why machine translation is difficult?
- Language is ambiguous
- Language depends on context
- Languages are very different (e.g. structure, what is implicit vs. explicit, etc.)
Translation is often modeled as a conditional language model? (True/False)
True. Typically Prob(target tokens | source)
The probability of each output token is estimated together based on the source material in a machine translation model? (True/False)
False. They are estimated separately from left-to-right.
In a machine translation model, we calculate the probability of each output token estimated separately (left-to-right) based on what two things?
- Entire input sequence (encoder outputs)
- All previously predicted tokens (decoder “state”)
In the context of machine translation models, the argmax[p(t | s)] is intractable? (True/False)
True.
In the context of machine translation models, why is argmax[p(t | s)] intractable to compute exactly, and what technique do we use to remedy this?
- The problem: Exponential search space of possible sequences
- Remedy is to use beam search (typical beam size of 4 to 6)
What does the beam search algorithm allow us to do?
To search an exponential space in linear time.
How does the beam search algorithm work for machine translation?
We explore a limited number of hypotheses ‘k’ at a time. At each step, we extend each of ‘k’ elements by one token. Top ‘k’ overall then become the hypotheses for the next step.
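A toy sketch of this loop; `next_token_logprobs` is a stand-in for a real decoder, and the vocabulary and scores are made up:

```python
import math

def beam_search(next_token_logprobs, k=4, max_len=10, eos="</s>"):
    beams = [([], 0.0)]                               # (tokens, total log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))    # finished hypothesis, carried forward
                continue
            for tok, lp in next_token_logprobs(tokens):
                candidates.append((tokens + [tok], score + lp))
        # keep only the top-k hypotheses overall for the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]

# Illustrative "model": prefers "a", then "b", then end-of-sentence.
def next_token_logprobs(tokens):
    table = {0: [("a", math.log(0.6)), ("b", math.log(0.4))],
             1: [("b", math.log(0.7)), ("</s>", math.log(0.3))],
             2: [("</s>", math.log(0.9)), ("a", math.log(0.1))]}
    return table.get(len(tokens), [("</s>", 0.0)])

print(beam_search(next_token_logprobs, k=2))          # ['a', 'b', '</s>']
```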
Each beam element has a different _____. For a transformer decoder this is the _________ input for _______ steps?
- State
- Self-Attention
- Previous steps
Total computation scales linearly with beam width? (True/False)
True
Computation over the beam is difficult to parallelize on GPUs when using the beam search algorithm? (True/False)
False. It is highly parallelizable over the beam.
What are the three reasons inference can be computationally expensive for machine translation models?
- Step-by-step computation (auto-regressive inference)
- Output projection: Θ(vocab_size × output_len × beam_size) (see the worked example after this list)
- Deeper models
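A back-of-the-envelope illustration of the output-projection term, with made-up but plausible numbers (none of these figures come from the notes):

```python
# Illustrative numbers for Θ(vocab_size · output_len · beam_size): the softmax
# over the vocabulary is recomputed at every decoding step, for every beam element.
vocab_size, output_len, beam_size, hidden = 32_000, 25, 5, 512
logit_rows = vocab_size * output_len * beam_size      # logits computed per sentence
mults = logit_rows * hidden                           # multiply-adds for the projection
print(f"{logit_rows:,} logits, ~{mults / 1e9:.1f}B multiply-adds per sentence")
```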
What are three strategies we can use to overcome some of the inherent inefficiencies associated with machine translation inference?
- Use smaller vocabularies
- More efficient computation
- Reduce depth / increase parallelism
To improve computational efficiency for machine translation models, what is one good reason why it’s reasonable to use smaller vocabularies?
Because while a vocabulary may be huge, for any given input sequence, the likely outputs will generally be constrained to a fairly small set.
IBM alignment models use statistical techniques to model the probability of one word being translated into another. Alternatively, lexical probabilities can be used to predict most likely output tokens for a given input. Using these approaches can achieve up to 60% speedup.
What is one way we can overcome the challenge of how to model rare or unseen words?
Model most frequent words as their own token, less frequent words get broken up into their constituent parts. One popular algorithm for doing this is byte-pair encoding.
How does byte-pair encoding work?
BPE comes from the idea of compression, where the most frequent adjacent pair is iteratively replaced.
Example:
Consider the string “abcdeababce”
Step 1: Replace most frequent pair “ab” with “X” (and add replacement rule)
“XcdeXXce”
X=”ab”
Step 2: Replace next most frequent pair (here including the replacement byte)
“YdeXYe”
X=”ab”
Y=”Xc”
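A minimal sketch of the merge loop used in the example above (real BPE tokenizers operate over a word-frequency corpus rather than a single string, and use new subword symbols instead of the letters X, Y used here for illustration):

```python
from collections import Counter

def bpe_compress(s, num_merges=2, new_symbols="XYZW"):
    rules = {}
    symbols = list(s)
    for new_sym in new_symbols[:num_merges]:
        pairs = Counter(zip(symbols, symbols[1:]))        # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]               # most frequent adjacent pair
        rules[new_sym] = a + b                            # record the replacement rule
        merged, i = [], 0
        while i < len(symbols):                           # replace every occurrence
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(new_sym)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return "".join(symbols), rules

print(bpe_compress("abcdeababce"))   # ('YdeXYe', {'X': 'ab', 'Y': 'Xc'})
```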
When using parallel computation to speedup machine translation inference, what is the major bottleneck?
Autoregressive inference: time is dominated by the decoder.
For machine translation models, average efficiency can be increased by translating multiple inputs at once? (True/False)
True (however, may not be practical for real-time systems)
What is the biggest slowdown typically the result of in machine translation models?
It comes from the requirement that the decoder must predict each token of the output sequence one at a time (i.e. autoregressively).
Neural Machine Translation (NMT) systems have a _______ learning curve with respect to the amount of training data, resulting in _______ quality in _________ settings, but better performance in ___________ settings.
- Steeper
- Worse
- Low-resource
- High-resource
What is one of the primary challenges of Neural Machine Translation (NMT) models?
Data Scarcity. There is lots of data available for, say, English —> French, but what about Pashto —> Swahili? This is a big challenge.
Besides data scarcity, what are some of the other challenges associated with machine translation?
- Language similarity - many (most) languages are very different from English
- Domain - the training data that is available might not be closely related enough to your task of interest. (for example, Facebook trying to develop models for newsfeed using Ubuntu training manuals)
- Evaluation - no access to a test set (this is one of main blockers to research in low-resource settings)
What are some of the most powerful techniques for translation in low-resource settings?
- Multi-lingual training (i.e. exploiting the relatedness between languages)
- Backtranslation - using an intermediate model trained in the reverse direction (target → source, typically needing a fairly high-resource language pair) to translate monolingual target-language text into synthetic parallel data, which can then be used to bootstrap a better model.
- Language-Agnostic Sentence Representations - using a model to map sentences from a low-resource language into a shared embedding space with high-resource languages (done using a bidirectional recurrent encoder-decoder style network).