Machine Translation and Encoder-Decoder Models Flashcards
What is machine translation?
It is the use of computers to translate one language to another
What machine translation models exist?
Statistical phrase alignment models, Encoder-Decoder models and Transformer models
What type of task is machine translation?
It is a sequence to sequence task (seq2seq)
What is the input, output and their lengths for a seq2seq task?
The input X is a sequence of words and the output Y is a sequence of words, but the length of X does not necessarily equal the length of Y
Besides machine translation, what are some other seq2seq tasks?
Question → Answer
Sentence → Clause
Document → Abstract
What are universal aspects of human language?
These are aspects that are true, or statistically mostly true, for all languages
What are some examples of universal aspects in the human language?
Nouns/verbs, greetings, politeness/rudeness
What are translation divergences?
These are areas where languages differ
What are some examples of translation divergences?
- Idiosyncratic and lexical differences
- Systematic differences
What is the study of translation divergences called?
Linguistic Typology
What is Word Order Typology?
It classifies languages by the basic ordering of subject, verb, and object in a sentence. Common orders include:
- Subject-Verb-Object (SVO)
- Subject-Object-Verb (SOV)
- Verb-Subject-Object (VSO)
What is the Encoder-Decoder model?
An encoder encodes the input sequence into a context vector, which is then passed to a decoder that generates the output sequence.
What can an encoder be?
LSTM, GRU, CNN, Transformers
What is a context vector?
It is the final hidden state of the encoder, which is used as the input to the decoder
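A minimal PyTorch sketch of this idea, assuming a GRU encoder and decoder with toy dimensions (all sizes and variable names here are illustrative, not taken from the flashcards):

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative only)
VOCAB, EMB, HID = 1000, 32, 64

embed = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRU(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, VOCAB)

src = torch.randint(0, VOCAB, (1, 7))   # a "source sentence" of 7 token ids
tgt = torch.randint(0, VOCAB, (1, 5))   # a "target sentence" of 5 token ids

# Encoder: read the whole source; its final hidden state is the context vector.
_, context = encoder(embed(src))          # context: (1, 1, HID)

# Decoder: initialised with the context vector, produces one hidden state per
# target step, each projected to logits over the vocabulary.
dec_states, _ = decoder(embed(tgt), context)
logits = out_proj(dec_states)             # (1, 5, VOCAB)
```

Note that the decoder only sees the source through `context`; this bottleneck is what attention (covered later) relaxes.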
What does a language model try to do?
Predict the next word in a sequence Y based on the previous words
How is a translation model different to a language model?
It predicts the next word in the sequence Y based on the previous target words AND the full source sequence X
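In standard notation (not specific to these flashcards), the difference is only in what the model conditions on:

```latex
% Language model: condition only on the previous target words
P(y_t \mid y_1, \dots, y_{t-1})

% Translation model: condition on the previous target words AND the full source X
P(y_t \mid y_1, \dots, y_{t-1}, X)
```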
Explain how the encoder-decoder model shown in the image works
We have a single hidden layer that takes as input the embeddings of the source text, followed by a separator token and then the target words. Each predicted word is fed back in to predict the next word, until the end of the sequence is reached. The key point is that the final hidden state after the last source word is fed into the decoder, which then predicts the target words
By using the hidden layer at the end of the sentence in the machine translation model, what is avoided?
It avoids word order typology problems, because the model has seen the whole source sentence before it starts translating: before generating the first target word, it already knows how the source sentence ends
How is the encoder trained in machine translation models?
The input words are embedded using an embedding layer and are fed in one at a time to the encoder until the full input has been seen.
Into which decoder states is the final hidden state of the encoder fed?
It is fed into every single state of the decoder
What are the inputs at each step in the decoder?
The context vector c (the final hidden state of the encoder), the previous output y_{t-1}, and the previous hidden state of the decoder h_{t-1}
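In the usual RNN-decoder notation, one decoding step can be written as (a standard formulation, with g the RNN update and W an output projection):

```latex
h_t^{d} = g\big(\hat{y}_{t-1},\; h_{t-1}^{d},\; c\big), \qquad
\hat{y}_t = \operatorname{softmax}\big(W\, h_t^{d}\big)
```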
What is the typical loss function for a machine translation model?
It is a cross entropy loss function
What is used during training of machine translation but not inference?
Teacher forcing: the decoder is fed the gold (reference) previous target word rather than its own prediction, so that training stays anchored to the exact reference translation
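A minimal PyTorch sketch of the difference, assuming a GRU decoder cell and toy sizes (names, shapes and token ids are illustrative):

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 32, 64
embed = nn.Embedding(VOCAB, EMB)
cell = nn.GRUCell(EMB, HID)
out_proj = nn.Linear(HID, VOCAB)

gold = torch.randint(0, VOCAB, (6,))   # reference target token ids (toy data)
h = torch.zeros(1, HID)                # decoder hidden state (context vector omitted for brevity)
prev = gold[:1]                        # the separator/start token stands in here

for t in range(1, len(gold)):
    h = cell(embed(prev), h)
    logits = out_proj(h)
    pred = logits.argmax(dim=-1)
    # Teacher forcing (training): feed the GOLD previous word, not the model's prediction.
    prev = gold[t:t+1]
    # At inference there is no gold target, so we would feed back `pred` instead:
    # prev = pred
```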
What is the total loss per sentence in machine translation?
It is the average loss across all target words
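Written out for a sentence with T target words (standard cross-entropy notation):

```latex
L_{\text{sentence}} = -\frac{1}{T} \sum_{t=1}^{T} \log P\big(y_t \mid y_1, \dots, y_{t-1}, X\big)
```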
Why is a separator token added to the start of the target sequence Y?
The decoder needs a previous word embedding to compute a prediction; since nothing has been predicted yet, the separator provides the initial input for the first target word
Given that the influence of earlier tokens decays as the sequence is processed, how can previous hidden states be accessed without this decay?
It is achieved using attention
What can be used instead of the static context vector that has been shown to be better?
An attention vector
What are some types of attention layers?
Dot-product attention
Additive attention
Self-attention
What does the image show?
It shows how the static context vector is replaced by attention. Each encoder hidden state is multiplied by an attention weight, and the weighted sum of these states gives the context vector
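A minimal PyTorch sketch of dot-product attention producing the context vector as a weighted sum (shapes and names are illustrative):

```python
import torch

HID = 64
enc_states = torch.randn(7, HID)   # one hidden state per source word
dec_state = torch.randn(HID)       # current decoder hidden state h_{t-1}

# Dot-product attention: score each encoder state against the decoder state,
# normalise with softmax, then take the weighted sum as the context vector.
scores = enc_states @ dec_state           # (7,) one score per source position
weights = torch.softmax(scores, dim=0)    # attention weights, sum to 1
context = weights @ enc_states            # (HID,) weighted sum of encoder states
```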
What problem can using an attention vector over the last encoder layer overcome?
In some languages the start (or end) of the sentence carries more information, so some models reversed the source sentence to compensate. Attention avoids this by learning which parts of the source to focus on
What is a problem of using the argmax function with machine translation?
There is no hindsight to allow us to revisit choices at previous time steps if we have got something wrong.
What is greedy decoding?
It is where argmax is used at each step, so we cannot revisit previous choices - we are stuck with the choice we made
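A minimal sketch of the greedy loop, using a toy stand-in for the decoder step (the `step` function, token ids and sizes are all assumptions for illustration):

```python
import torch

VOCAB, EOS, MAX_LEN = 1000, 2, 20
torch.manual_seed(0)
W = torch.randn(VOCAB, VOCAB)   # stand-in for a real decoder (illustrative only)

def step(prev_token, state):
    """Toy stand-in for one decoder step: returns log-probs over the vocabulary and a new state."""
    logits = W[prev_token] + state
    return torch.log_softmax(logits, dim=-1), logits

token, state = 1, torch.zeros(VOCAB)   # 1 = assumed start/separator token id
output = []
for _ in range(MAX_LEN):
    log_probs, state = step(token, state)
    token = int(log_probs.argmax())    # greedy: commit to the single best token
    if token == EOS:
        break
    output.append(token)
```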
What does Beam Search allow for?
It allows you to return to previously discarded options if they become better than the hypothesis currently being pursued
How does a beam search decoder work?
It keeps a memory of the k-best sequence options (hypotheses) at any decoding step
What is the beam width?
It is the memory size k
What happens at each step in a beam search?
Each of the k hypotheses is extended with every one of the V possible tokens in the vocabulary.
The best k sequences out of the resulting k x V candidates are then kept in memory
Explain what the image below shows, using a beam width of 2.
We start with the start-of-sentence token and take the two most probable first words, giving the sequences ‘start arrived’ and ‘start the’. At the next step there are four possible extensions to choose from. Because we work with log probabilities, the scores of the words in a sequence can simply be added together; doing so shows that the two most probable sequences are now ‘start the green’ and ‘start the witch’. Repeating this, the next most probable sequences are ‘start the green witch’ and ‘start the witch arrived’. This continues until the end token is reached.
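A compact sketch of the beam search loop itself, with a toy uniform next-token scorer standing in for a real decoder (the scorer and the tiny vocabulary are invented for illustration):

```python
import math

VOCAB = ["<eos>", "the", "green", "witch", "arrived"]

def next_token_log_probs(prefix):
    """Toy stand-in for the decoder: uniform log-probabilities over the vocabulary."""
    return {tok: math.log(1.0 / len(VOCAB)) for tok in VOCAB}

def beam_search(beam_width=2, max_len=6):
    # Each hypothesis is (token_sequence, cumulative_log_prob).
    beams = [(["<start>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":             # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in next_token_log_probs(seq).items():
                # Adding log-probs multiplies the underlying probabilities.
                candidates.append((seq + [tok], score + lp))
        # Keep only the k best of the k * V candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams

print(beam_search())
```

With a beam width of 1, this loop reduces to greedy decoding.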
What is the typical beam width in actual systems, and what is the problem of going above this size?
The typical width is 4 to 10; going above 10 makes decoding slow, which slows down development cycles.
What type of vocabulary is used in a seq2seq model?
A fixed vocabulary (e.g. ~50k tokens, limited by GPU memory), although subword vocabularies such as BPE or WordPiece can also be used
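A simplified sketch of how learned BPE merges might be applied to split a word into subwords (the merge list is invented for illustration, and real implementations differ in detail):

```python
# Ordered list of learned merge rules (highest priority first) - invented for illustration.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def bpe_segment(word):
    """Simplified BPE: apply each learned merge, in order, to the word's symbols."""
    symbols = list(word)
    for a, b in MERGES:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair into one subword
            else:
                i += 1
    return symbols

print(bpe_segment("lower"))   # ['low', 'er']
print(bpe_segment("lowest"))  # ['low', 'e', 's', 't']
```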
What decoder should be used to get fast results, and what should be used to get the best results?
Greedy decoder for fast, Beam decoder for best
What are the training data options for machine translation?
Parallel Corpus - A corpus of the two languages in which the sentences are aligned
Monolingual Corpora - Large collections of text in each language separately, with no alignment between them; they are very large but require extra techniques (such as backtranslation) to exploit
What is backtranslation?
Train a reverse (target-to-source) model on the parallel corpus, use it to translate the massive monolingual target-language corpus back into the source language, and treat the resulting synthetic sentence-aligned pairs as additional training data for the forward model
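A schematic sketch of the backtranslation recipe; `reverse_model` and the tiny corpora below are hypothetical placeholders, not real data or APIs:

```python
# Hypothetical placeholder standing in for a real target->source MT model
# trained on the small parallel corpus.
def reverse_model(target_sentence):
    return "<synthetic source for: " + target_sentence + ">"

parallel = [("ich bin muede", "i am tired")]              # small aligned corpus
monolingual_target = ["the witch arrived", "it is late"]  # large target-only corpus

# Backtranslate the monolingual target data to create synthetic aligned pairs.
synthetic = [(reverse_model(t), t) for t in monolingual_target]

# The forward (source->target) model is then trained on real + synthetic pairs.
training_data = parallel + synthetic
```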
How can Machine Translation systems be evaluated?
Human Assessment, BLEU metric, Precision, Recall, NIST, TER, METEOR
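For BLEU specifically, a small sketch using the sacrebleu package, assuming it is installed (the sentences are invented):

```python
import sacrebleu

hypotheses = ["the green witch arrived"]        # one system output per source sentence
references = [["the green witch has arrived"]]  # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, reported as a percentage
```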