Machine Translation Flashcards
Word ordering and Subject-Verb-Object order
Languages differ in the basic word order of verbs, subjects and objects in simple declarative clauses.
Examples:
- French, English, and Mandarin are SVO (Subject-Verb-Object) order
- Hindi and Japanese are SOV order
- Arabic is VSO order
Word alignment, spurious words
Word alignment is the correspondence between words in the source and target sentences.
Types of alignments:
- many-to-one
- one-to-many
- many-to-many
Spurious words are words that have no counterpart in the other sentence: they are not aligned to any word.
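As an illustration (not from the slides), an alignment can be stored as a set of (source position, target position) pairs; the sentence pair below is invented:

```python
# A word alignment as a set of (source_position, target_position) pairs.
# Hypothetical sentence pair, 0-based indices.
source = ["Le", "chat", "noir", "dort"]                 # French
target = ["The", "black", "cat", "is", "sleeping"]      # English

alignment = {
    (0, 0),          # Le   -> The          (one-to-one)
    (1, 2),          # chat -> cat          (one-to-one, reordered)
    (2, 1),          # noir -> black        (one-to-one, reordered)
    (3, 3), (3, 4),  # dort -> is sleeping  (one-to-many)
}

# A target word aligned to nothing would be a spurious word:
spurious = [target[j] for j in range(len(target))
            if all(t != j for _, t in alignment)]
print(spurious)  # [] here; non-empty when the target contains spurious words
```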
Statistical machine translation (SMT) using language model and translation model
Machine translation can be formulated as a structured prediction task. Given a source sentence x, find the most probable target sentence ŷ:
ŷ = argmax_y P(y|x)
Statistical machine translation (SMT) uses Bayes’ rule to decompose the probability model into two components that can be learned separately:
ŷ = argmax_y P(x|y) * P(y)
where:
- P(x|y) is a translation model
- P(y) is a language model
We have that:
- P(x|y) assigns large probability to strings that have the necessary words (roughly at the right places)
- P(y) assigns large probability to well-formed strings y, regardless of the connection to x
P(x|y) and P(y) collaborate to produce a large probability for well-formed translation pairs.
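As a sketch of how the two models collaborate, here is the decision rule applied to a toy candidate set; every probability below is invented for illustration:

```python
# Noisy-channel SMT decision rule over a toy hypothesis space.
# All probabilities are made up for illustration only.
candidates = ["the black cat sleeps", "the cat black sleeps", "black the cat sleeps"]

translation_model = {  # P(x | y): has the right words, roughly in the right places
    "the black cat sleeps": 0.30,
    "the cat black sleeps": 0.32,   # slightly closer to the source word order
    "black the cat sleeps": 0.31,
}
language_model = {     # P(y): fluency of the target string, regardless of x
    "the black cat sleeps": 0.020,
    "the cat black sleeps": 0.001,
    "black the cat sleeps": 0.002,
}

# y_hat = argmax_y P(x|y) * P(y)
y_hat = max(candidates, key=lambda y: translation_model[y] * language_model[y])
print(y_hat)  # "the black cat sleeps": the two models collaborate
```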
Neural machine translation (NMT)
Neural machine translation (NMT) models the translation task through a single artificial neural network.
Let x be the source text and y = y1 … ym be the target text.
In contrast to SMT, in NMT we directly model P(y|x), using an approach similar to the one adopted for language modeling:
P(y|x) = P(y1|x) * P(y2|y1, x) * … * P(ym|y1, …, ym-1, x)
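As a sketch, this factorization means the log-probability of the whole sentence is the sum of per-token conditional log-probabilities; the numbers below are hypothetical decoder outputs:

```python
import math

# Hypothetical per-token conditional probabilities P(y_t | y_1..y_{t-1}, x),
# one value per target token, as an NMT decoder might assign them.
token_probs = [0.42, 0.61, 0.88, 0.35, 0.90]

log_p_y_given_x = sum(math.log(p) for p in token_probs)
p_y_given_x = math.exp(log_p_y_given_x)   # equals the product of the factors
print(p_y_given_x)
```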
Encoder-decoder neural architecture (seq2seq) general idea, components
Encoder-decoder networks, also called sequence-to-sequence (seq2seq) networks, are models capable of generating contextually appropriate, arbitrary-length sequences.
The encoder-decoder model consists of two components:
- The encoder is a neural network that produces a representation of the source sentence.
- The decoder is an autoregressive language model that generates the target sentence, conditioned on the output of the encoder (Autoregressive = takes its own output as new input).
The key idea underlying encoder-decoder networks: the output of the encoder, called the context, is passed to the decoder and drives the generation of the translation together with the decoder's own previous outputs.
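A minimal sketch of the two components using PyTorch GRUs; the class names, dimensions, and the choice of GRU are my own assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns its hidden states."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        states, last = self.rnn(self.embed(src_ids))
        return states, last                      # all states + final state (the context)

class Decoder(nn.Module):
    """Autoregressive LM over the target vocabulary, conditioned on the context."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, context):         # context: (1, batch, hidden_dim)
        states, last = self.rnn(self.embed(tgt_ids), context)
        return self.out(states), last            # logits over the next tokens
```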
Autoregressive encoder-decoder using RNN for machine translation: greedy inference algorithm
P(y|x) can be computed as follows:
- run the RNN encoder over x = x1 … xn, performing forward inference and generating hidden states h^e_t for t from 1 to n
- run the RNN decoder, performing autoregressive generation; to generate y_t at step t, use:
- the final encoder hidden state h^e_n (the context)
- the previous decoder hidden state h^d_{t-1}
- the embedding of the previous word y_{t-1}
Generation stops when the end-of-sentence marker is predicted.
Draw the inference process (slide 27, pdf 12).
Write the model equations (slide 28); hint: g and f are affine functions.
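A sketch of greedy inference using the Encoder/Decoder sketched above; here the context simply initializes the decoder hidden state (a common simplification), and bos_id, eos_id, and max_len are assumed token ids and a cutoff I added:

```python
import torch

def greedy_decode(encoder, decoder, src_ids, bos_id, eos_id, max_len=50):
    """Greedy autoregressive generation: at each step pick the most probable token."""
    _, context = encoder(src_ids)                 # final encoder state h^e_n
    hidden = context                              # decoder starts from the context
    y_prev = torch.tensor([[bos_id]])             # start-of-sentence token
    output = []
    for _ in range(max_len):
        logits, hidden = decoder(y_prev, hidden)  # uses h^d_{t-1} and y_{t-1}
        y_next = logits[:, -1, :].argmax(dim=-1)  # locally optimal (greedy) choice
        if y_next.item() == eos_id:               # stop at the end-of-sentence marker
            break
        output.append(y_next.item())
        y_prev = y_next.unsqueeze(0)              # feed the prediction back as input
    return output
```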
Encoder-decoder with RNN: training, teacher forcing
Given the source text and the gold translation, we compute the average loss of our predictions of the next word in the translation.
At each training step, the decoder computes a loss Li that measures how far the predicted distribution is from the gold one.
The total loss is the average cross-entropy loss L = 1/T * sum (i=1 to T) Li.
During training, the decoder uses gold translation words as the input for the next step prediction. This is called teacher forcing.
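A sketch of one teacher-forced training step for the modules sketched above; the helper name and the assumption that gold_ids starts with <s> and ends with </s> are mine:

```python
import torch.nn.functional as F

def training_step(encoder, decoder, src_ids, gold_ids, optimizer):
    """One teacher-forced step: gold words (not predictions) are fed to the decoder."""
    _, context = encoder(src_ids)
    decoder_in = gold_ids[:, :-1]              # <s> y1 ... yT      (gold inputs)
    decoder_gold = gold_ids[:, 1:]             # y1 ... yT </s>     (gold next words)
    logits, _ = decoder(decoder_in, context)   # (batch, T, vocab)

    # Average cross-entropy over the T steps: L = 1/T * sum_i Li
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           decoder_gold.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```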
Attention-based neural architecture idea
The context vector c must represent the whole source sentence in one fixed-length vector. This is called the bottleneck problem.
The attention mechanism allows the decoder to get information from all the hidden states of the encoder.
The idea is to compute a context c_i at each decoding step i, as a weighted sum of all the encoder hidden states h^e_j.
Attention-based neural architecture: dynamic context vector
Attention replaces the static context c with a context c_i dynamically computed from all encoder hidden states:
h^d_i = g(c_i, h^d_{i-1}, y_{i-1})
How is c_i computed?
- At each step i during decoding, we compute relevance scores score(h^d_{i-1}, h^e_j) for each encoder hidden state h^e_j.
- We normalize the scores to create weights α_ij for each j (using softmax).
- We finally compute a fixed-length context vector that takes into account information from all of the encoder hidden states and that is dynamically updated:
c_i = sum (over j) α_ij h^e_j
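A minimal sketch of this context computation with dot-product scores; the function name and tensor shapes are my own assumptions:

```python
import torch

def attention_context(dec_prev, enc_states):
    """Dynamic context c_i = sum_j alpha_ij * h^e_j.

    dec_prev:   (hidden,)          previous decoder state h^d_{i-1}
    enc_states: (src_len, hidden)  all encoder states h^e_j
    """
    scores = enc_states @ dec_prev           # dot-product score for each h^e_j
    alphas = torch.softmax(scores, dim=0)    # normalized weights alpha_ij
    return alphas @ enc_states               # weighted sum: the context c_i
```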
Attention-based neural architecture: scoring functions for creating weights
- The simplest score is dot-product attention: score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j
- Bilinear model: score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j, where W_s is a matrix of learnable parameters. This score allows the encoder and decoder to use different dimensions for their hidden states.
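A sketch of the bilinear score as a learnable linear layer; W_s lets the two sides use different dimensions (the sizes below are assumptions):

```python
import torch
import torch.nn as nn

dec_dim, enc_dim, src_len = 128, 96, 7
W_s = nn.Linear(enc_dim, dec_dim, bias=False)   # learnable bilinear parameters

dec_prev = torch.randn(dec_dim)                 # h^d_{i-1}
enc_states = torch.randn(src_len, enc_dim)      # h^e_1 ... h^e_n

# score(h^d_{i-1}, h^e_j) = h^d_{i-1} · (W_s h^e_j), computed for every j at once
scores = W_s(enc_states) @ dec_prev             # (src_len,)
alphas = torch.softmax(scores, dim=0)           # attention weights alpha_ij
```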
Search tree for the decoder
A greedy algorithm makes choices that are locally optimal, but this may not find the highest probability translation.
We define the search tree for the decoder:
- the branches are the actions of generating a token
- the nodes are the states, representing the generated prefix of the translation
Unfortunately, dynamic programming is not applicable to this search tree, because of long-distance dependencies between the output decisions.
Beam search for the decoder
In beam search we keep K possible hypotheses at each step; parameter K is called the beam width.
Beam search algorithm:
- start with K initial best hypotheses
- at each step, expand each of the K hypotheses, resulting in V * K new hypotheses (V is the vocabulary size), which are all scored
- prune the V * K hypotheses down to the K best hypotheses
- when a complete hypothesis is found, remove it from the frontier and reduce the beam size by one, stopping when K = 0
Find the complete algorithm in the literature.
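A compact sketch of the beam search variant described above, where completed hypotheses leave the frontier and shrink the beam; step_logprobs is a stand-in for the decoder and is an assumption of this sketch:

```python
def beam_search(step_logprobs, bos_id, eos_id, k=4, max_len=50):
    """Beam search sketch: keep the K best partial hypotheses at each step.

    step_logprobs(prefix) must return {token_id: log P(token | prefix, x)};
    it stands in for the NMT decoder.
    """
    frontier = [(0.0, [bos_id])]                  # (log-probability, token prefix)
    completed = []
    while k > 0 and frontier and len(frontier[0][1]) < max_len:
        # Expand each of the (up to) K hypotheses over the vocabulary: V * K candidates.
        candidates = [(logp + lp, prefix + [tok])
                      for logp, prefix in frontier
                      for tok, lp in step_logprobs(prefix).items()]
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = []
        for logp, prefix in candidates:           # prune down to the K best
            if k == 0 or len(frontier) >= k:
                break
            if prefix[-1] == eos_id:              # complete hypothesis:
                completed.append((logp, prefix))  # remove it from the frontier
                k -= 1                            # and shrink the beam
            else:
                frontier.append((logp, prefix))
    # Real decoders usually length-normalize the scores before picking the best.
    best = max(completed or frontier, key=lambda c: c[0])
    return best[1]
```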
Evaluation and the BLEU metric
The most popular automatic metric for MT systems is called BLEU, for BiLingual Evaluation Understudy.
The N-gram precision for a candidate translation is the percentage of N-grams in the candidate that also occur in the reference translation (counts are clipped, so a candidate N-gram is not credited more often than it appears in the reference).
BLEU combines the 1-, 2-, 3-, and 4-gram precisions by means of the geometric mean. A brevity penalty for too-short translations is also applied.
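A simplified sentence-level BLEU sketch with clipped counts and the brevity penalty; real BLEU is computed at corpus level and typically uses multiple references and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped 1..4-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(bleu("the black cat sleeps on the sofa".split(),
           "the black cat sleeps on the mat".split()))
```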