Quiz #5 Flashcards
What is ‘attention’?
Weighting or probability distribution over inputs that depends on computational state and inputs.
What does attention allow us to do?
It allows information to propagate between “distant” computational nodes while making minimal structural assumptions.
What is one of the most popular tools used for building attention mechanisms?
Softmax
Softmax is differentiable? (True/False)
True
What are the inputs to softmax attention?
A set of vectors {u1, u2, … u_n} and a query vector ‘q’. What we want to do is (softly) select the vector most similar to q via p = softmax(U·q).
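A minimal NumPy sketch of this selection step (the function names and toy vectors below are illustrative, not from the notes):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def softmax_attention(U, q):
    """U: (n, d) matrix whose rows are u_1..u_n; q: (d,) query vector."""
    p = softmax(U @ q)             # weighting / probability distribution over inputs
    return p, p @ U                # soft "selection": weighted summary of the u_i

# The summary is pulled toward the row of U most similar to q.
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
q = np.array([1.0, 0.2])
print(softmax_attention(U, q))
```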
What is the difference between softmax applied to the final layer of an MLP and softmax attention?
- When Softmax is applied at the final layer of an MLP:
- q is the last hidden state, {u1, u2, … u_n} are the embeddings of the class labels
- Samples from the distribution correspond to labelings (outputs)
- In Softmax attention:
- q is an internal hidden state, {u1, u2, … u_n} are the embeddings of an “input” (e.g. the previous layer)
- Samples from the distribution correspond to a summary of {u1, u2, … u_n}
What was the biological inspiration for attention mechanisms?
Saccades (basically rapid, discontinuous eye movements between salient objects in the visual field).
At a conceptual level, what are visual attention mechanisms trying to do?
Given the current state/history of glimpses, where and what scale should we look at next?
The representational power of a softmax attention layer (or more generally, any attention layer) decreases as the input size grows? (True/False)
False. It increases, because the attention distribution (and the summary it produces) is computed over all of the inputs, so the layer’s capacity grows with the input size (unlike a fixed-size hidden layer).
How is the similarity between the query vector ‘q’ and the set of hidden state vectors {u1, u2, … u_n} typically computed?
By the inner (dot) product of q with each of the u vectors (cosine similarity when the vectors are normalized).
What is the ‘controller state’ in a Softmax Attention layer?
Initially, it’s just the input query vector q0. We use q0 to compute the next controller state (i.e. hidden state) q1, then use q1 to compute q2, and so on.
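A hedged sketch of iterating the controller state; the tanh/W update rule here is an assumed placeholder, since the notes only say that each q_t is used to compute q_{t+1}:

```python
import numpy as np

def attention_step(U, q, W):
    scores = U @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax attention over the inputs
    summary = p @ U                       # weighted summary of u_1..u_n
    return np.tanh(W @ summary)           # next controller state q_{t+1} (assumed update rule)

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))               # inputs u_1..u_5
q = rng.normal(size=4)                    # initial controller state q0
W = rng.normal(size=(4, 4))               # illustrative update weights
for _ in range(3):                        # q0 -> q1 -> q2 -> q3
    q = attention_step(U, q, W)
```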
What are transformer models?
Models that are made up of multiple attention layers.
What are three important architectural distinctions that result in the superior performance of transformer models compared to previous attention-based architectures?
- Multi-query hidden state propagation (“self-attention”)
- Multi-head attention
- Residual connections, LayerNorm
What is self-attention?
It uses a separate controller state (i.e. query/hidden state) for every single input, so the size of the controller state grows with the size of the input. This gives it even more representational power than traditional attention networks (remember that the representational power of an attention layer grows with its input size).
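A minimal sketch of single-head self-attention using the standard Q/K/V projection formulation (assumed here, not spelled out in the notes); note the output has one row per input:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) inputs; returns (n, d_v): one output row per input position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)            # softmax over each query's row
    return P @ V                                  # one attention summary per input

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                       # 6 inputs of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (6, 8)
```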
What is multi-head attention?
Multi-head attention combines multiple attention ‘heads’ being trained in the same way on the same data - but with different weight matrices, thus yielding different values.
Each of the ‘L’ attention heads yields values for each token - these values are then multiplied by trained parameters and added.
(To me this seems kind of similar to the idea in convnets of using multiple “filters” for each convolutional layer so we can learn different feature representations. )
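A self-contained sketch of multi-head attention along the lines described above; the shapes, the number of heads, and the output matrix Wo are illustrative assumptions:

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def one_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) tuples, one per attention head.
    Head outputs are concatenated, then mixed by the trained matrix Wo
    (the "multiplied by trained parameters and added" step)."""
    return np.concatenate([one_head(X, *h) for h in heads], axis=-1) @ Wo

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))                                # 6 tokens, dimension 8
heads = [tuple(rng.normal(size=(8, 8)) for _ in range(3)) for _ in range(4)]
Wo = rng.normal(size=(8 * 4, 8))
print(multi_head_attention(X, heads, Wo).shape)            # (6, 8)
```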
What are some of the major reasons why machine translation is difficult?
- Language is ambiguous
- Language depends on context
- Languages are very different (e.g. structure, what is implicit vs. explicit, etc.)
Translation is often modeled as a conditional language model? (True/False)
True. Typically Prob(target tokens | source)
The probability of each output token is estimated together based on the source material in a machine translation model? (True/False)
False. They are estimated separately from left-to-right.
In a machine translation model, we calculate the probability of each output token estimated separately (left-to-right) based on what two things?
- Entire input sequence (encoder outputs)
- All previously predicted tokens (decoder “state”)
In the context of machine translation models, the argmax[p(t | s)] is intractable? (True/False)
True.
In the context of machine translation models, why is argmax[p(t | s)] intractable to compute exactly, and what technique do we use to remedy this?
- The problem: Exponential search space of possible sequences
- Remedy is to use beam search (typical beam size of 4 to 6)
What does the beam search algorithm allow us to do?
To search an exponential space in linear time.
How does the beam search algorithm work for machine translation?
We explore a limited number of hypotheses ‘k’ at a time. At each step, we extend each of ‘k’ elements by one token. Top ‘k’ overall then become the hypotheses for the next step.
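A toy sketch of this loop; `next_token_logprobs` is a stand-in for a real decoder, and the vocabulary and scores are made up:

```python
import math

def beam_search(next_token_logprobs, k=4, max_len=10, eos="</s>"):
    beams = [([], 0.0)]                               # (tokens, total log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))    # finished hypothesis, carried forward
                continue
            for tok, lp in next_token_logprobs(tokens):
                candidates.append((tokens + [tok], score + lp))
        # keep only the top-k hypotheses overall for the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]

# Illustrative "model": prefers "a", then "b", then end-of-sentence.
def next_token_logprobs(tokens):
    table = {0: [("a", math.log(0.6)), ("b", math.log(0.4))],
             1: [("b", math.log(0.7)), ("</s>", math.log(0.3))],
             2: [("</s>", math.log(0.9)), ("a", math.log(0.1))]}
    return table.get(len(tokens), [("</s>", 0.0)])

print(beam_search(next_token_logprobs, k=2))          # ['a', 'b', '</s>']
```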
Each beam element has a different _____. For a transformer decoder this is the _________ input for _______ steps?
- State
- Self-Attention
- Previous steps
Total computation scales linearly with beam width? (True/False)
True
Computation over the beam is difficult to parallelize on GPUs when using the beam search algorithm? (True/False)
False. It is highly parallelizable over the beam.
What are the three reasons inference can be computationally expensive for machine translation models?
- Step-by-step computation (auto-regressive inference)
- Output projection: Θ(vocab_size × output_len × beam_size) (see the worked example after this list)
- Deeper models
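A back-of-the-envelope illustration of the output-projection term, with made-up but plausible numbers (none of these figures come from the notes):

```python
# Illustrative numbers for Θ(vocab_size · output_len · beam_size): the softmax
# over the vocabulary is recomputed at every decoding step, for every beam element.
vocab_size, output_len, beam_size, hidden = 32_000, 25, 5, 512
logit_rows = vocab_size * output_len * beam_size      # logits computed per sentence
mults = logit_rows * hidden                           # multiply-adds for the projection
print(f"{logit_rows:,} logits, ~{mults / 1e9:.1f}B multiply-adds per sentence")
```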
What are three strategies we can use to overcome some of the inherent inefficiencies associated with machine translation inference?
- Use smaller vocabularies
- More efficient computation
- Reduce depth / increase parallelism
To improve computational efficiency for machine translation models, what is one good reason why it’s reasonable to use smaller vocabularies?
Because while a vocabulary may be huge, for any given input sequence, the likely outputs will generally be constrained to a fairly small set.
IBM alignment models use statistical techniques to model the probability of one word being translated into another. Alternatively, lexical probabilities can be used to predict most likely output tokens for a given input. Using these approaches can achieve up to 60% speedup.
What is one way we can overcome the challenge of how to model rare or unseen words?
Model most frequent words as their own token, less frequent words get broken up into their constituent parts. One popular algorithm for doing this is byte-pair encoding.
How does byte-pair encoding work?
BPE comes from the idea of compression, where the most frequent adjacent pair is iteratively replaced.
Example:
Consider the string “abcdeababce”
Step 1: Replace most frequent pair “ab” with “X” (and add replacement rule)
“XcdeXXce”
X=”ab”
Step 2: Replace next most frequent pair (here including the replacement byte)
“YdeXYe”
X=”ab”
Y=”Xc”
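A minimal sketch of the merge loop used in the example above (real BPE tokenizers operate over a word-frequency corpus rather than a single string, and use new subword symbols instead of the letters X, Y used here for illustration):

```python
from collections import Counter

def bpe_compress(s, num_merges=2, new_symbols="XYZW"):
    rules = {}
    symbols = list(s)
    for new_sym in new_symbols[:num_merges]:
        pairs = Counter(zip(symbols, symbols[1:]))        # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]               # most frequent adjacent pair
        rules[new_sym] = a + b                            # record the replacement rule
        merged, i = [], 0
        while i < len(symbols):                           # replace every occurrence
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(new_sym)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return "".join(symbols), rules

print(bpe_compress("abcdeababce"))   # ('YdeXYe', {'X': 'ab', 'Y': 'Xc'})
```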
When using parallel computation to speedup machine translation inference, what is the major bottleneck?
Autoregressive inference: time is dominated by the decoder.
For machine translation models, average efficiency can be increased by translating multiple inputs at once? (True/False)
True (however, may not be practical for real-time systems)
What is the biggest slowdown typically the result of in machine translation models?
It comes from the requirement that the decoder must predict each token of the output sequence one at a time (i.e. autoregressively).
Neural Machine Translation (NMT) systems have a _______ learning curve with respect to the amount of training data, resulting in _______ quality in _________ settings, but better performance in ___________ settings.
- Steeper
- Worse
- Low-resource
- High-resource
What is one of the primary challenges of Neural Machine Translation (NMT) models?
Data Scarcity. There is lots of data available for, say, English —> French, but what about Pashto —> Swahili? This is a big challenge.
Besides data scarcity, what are some of the other challenges associated with machine translation?
- Language similarity - many (most) languages are very different from English
- Domain - the training data that is available might not be closely related enough to your task of interest. (for example, Facebook trying to develop models for newsfeed using Ubuntu training manuals)
- Evaluation - no access to a test set (this is one of main blockers to research in low-resource settings)
What are some of the most powerful techniques for translation in low-resource settings?
- Multi-lingual training (i.e. exploiting the relatedness between languages)
- Backtranslation - using an intermediate model trained in the reverse direction (target → source, typically needing a fairly high-resource language pair) to translate monolingual target-language text into synthetic parallel data, which can then be used to bootstrap a better model.
- Language-Agnostic Sentence Representations - using a model to map sentences from a low-resource language into a shared embedding space with high-resource languages (done using a bidirectional recurrent encoder-decoder style network).