Machine Translation Flashcards
Word ordering and Subject-Verb-Object order
Languages differ in the basic word order of verbs, subjects and objects in simple declarative clauses.
Examples:
- French, English, and Mandarin are SVO (Subject-Verb-Object) order
- Hindi and Japanese are SOV order
- Arabic is VSO order
Word alignment, spurious words
Word alignment is the correspondence between words in the source and target sentences.
Types of alignments:
- many-to-one
- one-to-many
- many-to-many
Spurious words are words that have no counterpart in the other sentence: they are not aligned to any word.
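As an illustration (not from the slides), an alignment can be stored as a set of (source position, target position) pairs; the sentence pair below is invented:

```python
# A word alignment as a set of (source_position, target_position) pairs.
# Hypothetical sentence pair, 0-based indices.
source = ["Le", "chat", "noir", "dort"]                 # French
target = ["The", "black", "cat", "is", "sleeping"]      # English

alignment = {
    (0, 0),          # Le   -> The          (one-to-one)
    (1, 2),          # chat -> cat          (one-to-one, reordered)
    (2, 1),          # noir -> black        (one-to-one, reordered)
    (3, 3), (3, 4),  # dort -> is sleeping  (one-to-many)
}

# A target word aligned to nothing would be a spurious word:
spurious = [target[j] for j in range(len(target))
            if all(t != j for _, t in alignment)]
print(spurious)  # [] here; non-empty when the target contains spurious words
```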
Statistical machine translation (SMT) using language model and translation model
Machine translation can be formulated as a structured prediction task. Given a source sentence x, find the most probable target sentence ŷ:
ŷ = argmax_y P(y|x)
Statistical machine translation (SMT) uses Bayes’ rule to decompose the probability model into two components that can be learned separately:
ŷ = argmax_y P(x|y) * P(y)
where:
- P(x|y) is a translation model
- P(y) is a language model
We have that:
- P(x|y) assigns large probability to strings that have the necessary words (roughly at the right places)
- P(y) assigns large probability to well-formed strings y, regardless of the connection to x
P(x|y) and P(y) collaborate to produce a large probability for well-formed translation pairs.
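As a sketch of how the two models collaborate, here is the decision rule applied to a toy candidate set; every probability below is invented for illustration:

```python
# Noisy-channel SMT decision rule over a toy hypothesis space.
# All probabilities are made up for illustration only.
candidates = ["the black cat sleeps", "the cat black sleeps", "black the cat sleeps"]

translation_model = {  # P(x | y): has the right words, roughly in the right places
    "the black cat sleeps": 0.30,
    "the cat black sleeps": 0.32,   # slightly closer to the source word order
    "black the cat sleeps": 0.31,
}
language_model = {     # P(y): fluency of the target string, regardless of x
    "the black cat sleeps": 0.020,
    "the cat black sleeps": 0.001,
    "black the cat sleeps": 0.002,
}

# y_hat = argmax_y P(x|y) * P(y)
y_hat = max(candidates, key=lambda y: translation_model[y] * language_model[y])
print(y_hat)  # "the black cat sleeps": the two models collaborate
```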
Neural machine translation (NMT)
Neural machine translation (NMT) models the translation task through a single artificial neural network.
Let x be the source text and y = y1 … ym be the target text.
In contrast to SMT, in NMT we directly model P(y|x), using an approach similar to the one adopted for language modeling:
P(y|x) = P(y1|x) * P(y2|y1, x) * … * P(ym|y1, …, ym-1, x)
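As a sketch, this factorization means the log-probability of the whole sentence is the sum of per-token conditional log-probabilities; the numbers below are hypothetical decoder outputs:

```python
import math

# Hypothetical per-token conditional probabilities P(y_t | y_1..y_{t-1}, x),
# one value per target token, as an NMT decoder might assign them.
token_probs = [0.42, 0.61, 0.88, 0.35, 0.90]

log_p_y_given_x = sum(math.log(p) for p in token_probs)
p_y_given_x = math.exp(log_p_y_given_x)   # equals the product of the factors
print(p_y_given_x)
```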
Encoder-decoder neural architecture (seq2seq) general idea, components
Encoder-decoder networks, also called sequence-to-sequence (seq2seq) networks, are models capable of generating contextually appropriate, arbitrary-length sequences.
The encoder-decoder model consists of two components:
- The encoder is a neural network that produces a representation of the source sentence.
- The decoder is an autoregressive language model that generates the target sentence, conditioned on the output of the encoder (Autoregressive = takes its own output as new input).
The key idea underlying encoder-decoder networks: the output of the encoder, called the context, is passed to the decoder and drives the generation of the translation together with the decoder's own previous outputs.
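A minimal sketch of the two components using PyTorch GRUs; the class names, dimensions, and the choice of GRU are my own assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns its hidden states."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        states, last = self.rnn(self.embed(src_ids))
        return states, last                      # all states + final state (the context)

class Decoder(nn.Module):
    """Autoregressive LM over the target vocabulary, conditioned on the context."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, context):         # context: (1, batch, hidden_dim)
        states, last = self.rnn(self.embed(tgt_ids), context)
        return self.out(states), last            # logits over the next tokens
```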
Autoregressive encoder-decoder using RNN for machine translation: greedy inference algorithm
P(y|x) can be computed as follows:
- run the RNN encoder over x = x1 … xn, performing forward inference and generating hidden states h^e_t for t from 1 to n
- run the RNN decoder, performing autoregressive generation; to generate y_t at step t, use:
- the final encoder hidden state h^e_n (the context)
- the previous decoder hidden state h^d_{t-1}
- the embedding of the previous word y_{t-1}
Generation stops when the end-of-sentence marker is predicted.
Draw the inference process (slide 27, pdf 12).
Write the model equations (slide 28); hint: g and f are affine functions.
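A sketch of greedy inference using the Encoder/Decoder sketched above; here the context simply initializes the decoder hidden state (a common simplification), and bos_id, eos_id, and max_len are assumed token ids and a cutoff I added:

```python
import torch

def greedy_decode(encoder, decoder, src_ids, bos_id, eos_id, max_len=50):
    """Greedy autoregressive generation: at each step pick the most probable token."""
    _, context = encoder(src_ids)                 # final encoder state h^e_n
    hidden = context                              # decoder starts from the context
    y_prev = torch.tensor([[bos_id]])             # start-of-sentence token
    output = []
    for _ in range(max_len):
        logits, hidden = decoder(y_prev, hidden)  # uses h^d_{t-1} and y_{t-1}
        y_next = logits[:, -1, :].argmax(dim=-1)  # locally optimal (greedy) choice
        if y_next.item() == eos_id:               # stop at the end-of-sentence marker
            break
        output.append(y_next.item())
        y_prev = y_next.unsqueeze(0)              # feed the prediction back as input
    return output
```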
Encoder-decoder with RNN: training, teacher forcing
Given the source text and the gold translation, we compute the average loss of our predictions of the next word in the translation.
At each training step, the decoder computes a loss Li that measures how far the predicted distribution is from the gold one.
The total loss is the average cross-entropy loss L = 1/T * sum (i=1 to T) Li.
During training, the decoder uses gold translation words as the input for the next step prediction. This is called teacher forcing.
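A sketch of one teacher-forced training step for the modules sketched above; the helper name and the assumption that gold_ids starts with <s> and ends with </s> are mine:

```python
import torch.nn.functional as F

def training_step(encoder, decoder, src_ids, gold_ids, optimizer):
    """One teacher-forced step: gold words (not predictions) are fed to the decoder."""
    _, context = encoder(src_ids)
    decoder_in = gold_ids[:, :-1]              # <s> y1 ... yT      (gold inputs)
    decoder_gold = gold_ids[:, 1:]             # y1 ... yT </s>     (gold next words)
    logits, _ = decoder(decoder_in, context)   # (batch, T, vocab)

    # Average cross-entropy over the T steps: L = 1/T * sum_i Li
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           decoder_gold.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```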
Attention-based neural architecture idea
The context vector c must represent the whole source sentence in one fixed-length vector. This is called the bottleneck problem.
The attention mechanism allows the decoder to get information from all the hidden states of the encoder.
The idea is to compute a context c_i at each decoding step i, as a weighted sum of all the encoder hidden states h^e_j.
Attention-based neural architecture: dynamic context vector
Attention replaces the static context c with a context c_i dynamically computed from all encoder hidden states:
h^d_i = g(c_i, h^d_{i-1}, y_{i-1})
How is c_i computed?
- At each step i during decoding, we compute relevance scores score(h^d_{i-1}, h^e_j) for each encoder hidden state h^e_j.
- We normalize the scores to create weights α_ij for each j (using softmax).
- We finally compute a fixed-length context vector that takes into account information from all of the encoder hidden states and that is dynamically updated:
c_i = sum (over j) α_ij h^e_j
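A minimal sketch of this context computation with dot-product scores; the function name and tensor shapes are my own assumptions:

```python
import torch

def attention_context(dec_prev, enc_states):
    """Dynamic context c_i = sum_j alpha_ij * h^e_j.

    dec_prev:   (hidden,)          previous decoder state h^d_{i-1}
    enc_states: (src_len, hidden)  all encoder states h^e_j
    """
    scores = enc_states @ dec_prev           # dot-product score for each h^e_j
    alphas = torch.softmax(scores, dim=0)    # normalized weights alpha_ij
    return alphas @ enc_states               # weighted sum: the context c_i
```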
Attention-based neural architecture: scoring functions for creating weights
- The simplest score is dot-product attention: score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j
- Bilinear model: score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j, where W_s is a matrix of learnable parameters. This score allows the encoder and decoder to use different dimensions for their hidden states.
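A sketch of the bilinear score as a learnable linear layer; W_s lets the two sides use different dimensions (the sizes below are assumptions):

```python
import torch
import torch.nn as nn

dec_dim, enc_dim, src_len = 128, 96, 7
W_s = nn.Linear(enc_dim, dec_dim, bias=False)   # learnable bilinear parameters

dec_prev = torch.randn(dec_dim)                 # h^d_{i-1}
enc_states = torch.randn(src_len, enc_dim)      # h^e_1 ... h^e_n

# score(h^d_{i-1}, h^e_j) = h^d_{i-1} · (W_s h^e_j), computed for every j at once
scores = W_s(enc_states) @ dec_prev             # (src_len,)
alphas = torch.softmax(scores, dim=0)           # attention weights alpha_ij
```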
Search tree for the decoder
A greedy algorithm makes choices that are locally optimal, but this may not find the highest probability translation.
We define the search tree for the decoder:
- the branches are the actions of generating a token
- the nodes are the states, representing the generated prefix of the translation
Unfortunately, dynamic programming is not applicable to this search tree, because of long-distance dependencies between the output decisions.
Beam search for the decoder
In beam search we keep K possible hypotheses at each step; parameter K is called the beam width.
Beam search algorithm:
- start with K initial best hypotheses
- at each step, expand each of the K hypotheses, resulting in V * K new hypotheses (V is the vocabulary size), which are all scored
- prune the V * K hypotheses down to the K best hypotheses
- when a complete hypothesis is found, remove it from the frontier and reduce the beam size by one, stopping when K = 0
Find the complete algorithm in the literature.
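A compact sketch of the beam search variant described above, where completed hypotheses leave the frontier and shrink the beam; step_logprobs is a stand-in for the decoder and is an assumption of this sketch:

```python
def beam_search(step_logprobs, bos_id, eos_id, k=4, max_len=50):
    """Beam search sketch: keep the K best partial hypotheses at each step.

    step_logprobs(prefix) must return {token_id: log P(token | prefix, x)};
    it stands in for the NMT decoder.
    """
    frontier = [(0.0, [bos_id])]                  # (log-probability, token prefix)
    completed = []
    while k > 0 and frontier and len(frontier[0][1]) < max_len:
        # Expand each of the (up to) K hypotheses over the vocabulary: V * K candidates.
        candidates = [(logp + lp, prefix + [tok])
                      for logp, prefix in frontier
                      for tok, lp in step_logprobs(prefix).items()]
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = []
        for logp, prefix in candidates:           # prune down to the K best
            if k == 0 or len(frontier) >= k:
                break
            if prefix[-1] == eos_id:              # complete hypothesis:
                completed.append((logp, prefix))  # remove it from the frontier
                k -= 1                            # and shrink the beam
            else:
                frontier.append((logp, prefix))
    # Real decoders usually length-normalize the scores before picking the best.
    best = max(completed or frontier, key=lambda c: c[0])
    return best[1]
```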
Evaluation and the BLEU metric
The most popular automatic metric for MT systems is called BLEU, for BiLingual Evaluation Understudy.
The N-gram precision for a candidate translation is the percentage of N-grams in the candidate that also occur in the reference translation (counts are clipped, so a candidate N-gram is not credited more often than it appears in the reference).
BLEU combines the 1-, 2-, 3-, and 4-gram precisions by means of the geometric mean. A brevity penalty for too-short translations is also applied.
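A simplified sentence-level BLEU sketch with clipped counts and the brevity penalty; real BLEU is computed at corpus level and typically uses multiple references and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped 1..4-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(bleu("the black cat sleeps on the sofa".split(),
           "the black cat sleeps on the mat".split()))
```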