Machine Translation and Encoder-Decoder Models Flashcards
What is machine translation?
It is the use of computers to translate one language to another
What machine translation models exist?
Statistical phrase alignment models, Encoder-Decoder models and Transformer models
What type of task is machine translation?
It is a sequence to sequence task (seq2seq)
What is the input, output and their lengths for a seq2seq task?
The input X is a sequence of words and the output Y is a sequence of words, but the length of X does not necessarily equal the length of Y
Besides machine translation, what are some other seq2seq tasks?
Question → Answer
Sentence → Clause
Document → Abstract
What are universal aspects of human language?
These are aspects that are true, or statistically mostly true, for all languages
What are some examples of universal aspects in the human language?
Nouns/verbs, greetings, politeness/rudeness
What are translation divergences?
These are areas where languages differ
What are some examples of translation divergences?
- Idiosyncratic and lexical differences
- Systematic differences
What is the study of translation divergences called?
Linguistic Typology
What is Word Order Typology?
It classifies languages by the basic ordering of subject, verb, and object in a sentence. Common orders include:
- Subject-Verb-Object (SVO)
- Subject-Object-Verb (SOV)
- Verb-Subject-Object (VSO)
What is the Encoder-Decoder model?
An encoder encodes the input sequence into a context vector, which is then passed to a decoder that generates the output sequence.
What can an encoder be?
LSTM, GRU, CNN, Transformers
What is a context vector?
It is the final hidden state of the encoder, which is used as the input to the decoder
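A minimal PyTorch sketch of this idea, assuming a GRU encoder and decoder with toy dimensions (all sizes and variable names here are illustrative, not taken from the flashcards):

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative only)
VOCAB, EMB, HID = 1000, 32, 64

embed = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRU(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, VOCAB)

src = torch.randint(0, VOCAB, (1, 7))   # a "source sentence" of 7 token ids
tgt = torch.randint(0, VOCAB, (1, 5))   # a "target sentence" of 5 token ids

# Encoder: read the whole source; its final hidden state is the context vector.
_, context = encoder(embed(src))          # context: (1, 1, HID)

# Decoder: initialised with the context vector, produces one hidden state per
# target step, each projected to logits over the vocabulary.
dec_states, _ = decoder(embed(tgt), context)
logits = out_proj(dec_states)             # (1, 5, VOCAB)
```

Note that the decoder only sees the source through `context`; this bottleneck is what attention (covered later) relaxes.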
What does a language model try to do?
Predict the next word in a sequence Y based on the previous words
How is a translation model different to a language model?
It predicts the next word in the sequence Y based on the previous target words AND the full source sequence X
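In standard notation (not specific to these flashcards), the difference is only in what the model conditions on:

```latex
% Language model: condition only on the previous target words
P(y_t \mid y_1, \dots, y_{t-1})

% Translation model: condition on the previous target words AND the full source X
P(y_t \mid y_1, \dots, y_{t-1}, X)
```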
Explain how the encoder-decoder model shown in the image works
We have a single hidden layer that takes as input the embeddings of the source text, followed by a separator token and then the target words. Each predicted word is fed back in to predict the next word, until the end of the sequence is reached. The key point is that the final hidden state after the last source word is fed into the decoder, which then predicts the target words
By using the hidden layer at the end of the sentence in the machine translation model, what is avoided?
It avoids word order typology problems, because the model has seen the whole source sentence before it starts translating: before generating the first target word, it already knows how the source sentence ends
How is the encoder trained in machine translation models?
The input words are embedded using an embedding layer and are fed in one at a time to the encoder until the full input has been seen.
Into which decoder states is the final hidden state of the encoder fed?
It is fed into every single state of the decoder
What are the inputs at each step in the decoder?
The context vector c (the final hidden state of the encoder), the previous output y_{t-1}, and the previous hidden state of the decoder h_{t-1}
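In the usual RNN-decoder notation, one decoding step can be written as (a standard formulation, with g the RNN update and W an output projection):

```latex
h_t^{d} = g\big(\hat{y}_{t-1},\; h_{t-1}^{d},\; c\big), \qquad
\hat{y}_t = \operatorname{softmax}\big(W\, h_t^{d}\big)
```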
What is the typical loss function for a machine translation model?
It is a cross entropy loss function
What is used during training of machine translation but not inference?
Teacher forcing: the decoder is fed the gold (reference) previous target word rather than its own prediction, so that training stays anchored to the exact reference translation
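A minimal PyTorch sketch of the difference, assuming a GRU decoder cell and toy sizes (names, shapes and token ids are illustrative):

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 32, 64
embed = nn.Embedding(VOCAB, EMB)
cell = nn.GRUCell(EMB, HID)
out_proj = nn.Linear(HID, VOCAB)

gold = torch.randint(0, VOCAB, (6,))   # reference target token ids (toy data)
h = torch.zeros(1, HID)                # decoder hidden state (context vector omitted for brevity)
prev = gold[:1]                        # the separator/start token stands in here

for t in range(1, len(gold)):
    h = cell(embed(prev), h)
    logits = out_proj(h)
    pred = logits.argmax(dim=-1)
    # Teacher forcing (training): feed the GOLD previous word, not the model's prediction.
    prev = gold[t:t+1]
    # At inference there is no gold target, so we would feed back `pred` instead:
    # prev = pred
```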
What is the total loss per sentence in machine translation?
It is the average loss across all target words
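Written out for a sentence with T target words (standard cross-entropy notation):

```latex
L_{\text{sentence}} = -\frac{1}{T} \sum_{t=1}^{T} \log P\big(y_t \mid y_1, \dots, y_{t-1}, X\big)
```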
Why is a separator token added to the start of the target sequence Y?
The decoder needs a previous word embedding to compute a prediction; since nothing has been predicted yet, the separator provides the initial input for the first target word
Given that the influence of earlier tokens decays as the sequence is processed, how can previous hidden states be accessed without this decay?
It is achieved using attention
What can be used instead of the static context vector that has been shown to be better?
An attention vector
What are some types of attention layers?
Dot-product attention
Additive attention
Self-attention
What does the image show?
It shows how the static context vector is replaced by attention. Each encoder hidden state is multiplied by an attention weight, and the weighted sum of these states gives the context vector
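A minimal PyTorch sketch of dot-product attention producing the context vector as a weighted sum (shapes and names are illustrative):

```python
import torch

HID = 64
enc_states = torch.randn(7, HID)   # one hidden state per source word
dec_state = torch.randn(HID)       # current decoder hidden state h_{t-1}

# Dot-product attention: score each encoder state against the decoder state,
# normalise with softmax, then take the weighted sum as the context vector.
scores = enc_states @ dec_state           # (7,) one score per source position
weights = torch.softmax(scores, dim=0)    # attention weights, sum to 1
context = weights @ enc_states            # (HID,) weighted sum of encoder states
```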
What problem can using an attention vector over the last encoder layer overcome?
In some languages the start (or end) of the sentence carries more information, so some models reversed the source sentence to compensate. Attention avoids this by learning which parts of the source to focus on
What is a problem of using the argmax function with machine translation?
There is no hindsight to allow us to revisit choices at previous time steps if we have got something wrong.
What is greedy decoding?
It is where argmax is used at each step, so we cannot revisit previous choices - we are stuck with the choice we made
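A minimal sketch of the greedy loop, using a toy stand-in for the decoder step (the `step` function, token ids and sizes are all assumptions for illustration):

```python
import torch

VOCAB, EOS, MAX_LEN = 1000, 2, 20
torch.manual_seed(0)
W = torch.randn(VOCAB, VOCAB)   # stand-in for a real decoder (illustrative only)

def step(prev_token, state):
    """Toy stand-in for one decoder step: returns log-probs over the vocabulary and a new state."""
    logits = W[prev_token] + state
    return torch.log_softmax(logits, dim=-1), logits

token, state = 1, torch.zeros(VOCAB)   # 1 = assumed start/separator token id
output = []
for _ in range(MAX_LEN):
    log_probs, state = step(token, state)
    token = int(log_probs.argmax())    # greedy: commit to the single best token
    if token == EOS:
        break
    output.append(token)
```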
What does Beam Search allow for?
It allows you to return to previously discarded options if they become better than the hypothesis currently being pursued
How does a beam search decoder work?
It keeps a memory of the k-best sequence options (hypotheses) at any decoding step
What is the beam width?
It is the memory size k
What happens at each step in a beam search?
Each of the k hypotheses is extended with every one of the V possible tokens in the vocabulary.
The best k sequences out of the resulting k x V candidates are then kept in memory
Explain what the image below shows, using a beam width of 2.
We start with the start-of-sentence token and take the two most probable first words, giving the sequences ‘start arrived’ and ‘start the’. At the next step there are four possible extensions to choose from. Because we work with log probabilities, the scores of the words in a sequence can simply be added together; doing so shows that the two most probable sequences are now ‘start the green’ and ‘start the witch’. Repeating this, the next most probable sequences are ‘start the green witch’ and ‘start the witch arrived’. This continues until the end token is reached.
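A compact sketch of the beam search loop itself, with a toy uniform next-token scorer standing in for a real decoder (the scorer and the tiny vocabulary are invented for illustration):

```python
import math

VOCAB = ["<eos>", "the", "green", "witch", "arrived"]

def next_token_log_probs(prefix):
    """Toy stand-in for the decoder: uniform log-probabilities over the vocabulary."""
    return {tok: math.log(1.0 / len(VOCAB)) for tok in VOCAB}

def beam_search(beam_width=2, max_len=6):
    # Each hypothesis is (token_sequence, cumulative_log_prob).
    beams = [(["<start>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":             # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in next_token_log_probs(seq).items():
                # Adding log-probs multiplies the underlying probabilities.
                candidates.append((seq + [tok], score + lp))
        # Keep only the k best of the k * V candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams

print(beam_search())
```

With a beam width of 1, this loop reduces to greedy decoding.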
What is the typical beam width in actual systems, and what is the problem of going above this size?
The typical width is 4 to 10; going above 10 makes decoding slow, which slows down development cycles.
What type of vocabulary is used in a seq2seq model?
A fixed vocabulary (e.g. ~50k tokens, limited by GPU memory), although subword vocabularies such as BPE or WordPiece can also be used
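A simplified sketch of how learned BPE merges might be applied to split a word into subwords (the merge list is invented for illustration, and real implementations differ in detail):

```python
# Ordered list of learned merge rules (highest priority first) - invented for illustration.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def bpe_segment(word):
    """Simplified BPE: apply each learned merge, in order, to the word's symbols."""
    symbols = list(word)
    for a, b in MERGES:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair into one subword
            else:
                i += 1
    return symbols

print(bpe_segment("lower"))   # ['low', 'er']
print(bpe_segment("lowest"))  # ['low', 'e', 's', 't']
```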
What decoder should be used to get fast results, and what should be used to get the best results?
Greedy decoder for fast, Beam decoder for best
What are the training data options for machine translation?
Parallel Corpus - A corpus of the two languages in which the sentences are aligned
Monolingual Corpora - Large collections of text in each language separately, with no alignment between them; they are very large but require extra techniques (such as backtranslation) to exploit
What is backtranslation?
Train a reverse (target-to-source) model on the parallel corpus, use it to translate the massive monolingual target-language corpus back into the source language, and treat the resulting synthetic sentence-aligned pairs as additional training data for the forward model
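A schematic sketch of the backtranslation recipe; `reverse_model` and the tiny corpora below are hypothetical placeholders, not real data or APIs:

```python
# Hypothetical placeholder standing in for a real target->source MT model
# trained on the small parallel corpus.
def reverse_model(target_sentence):
    return "<synthetic source for: " + target_sentence + ">"

parallel = [("ich bin muede", "i am tired")]              # small aligned corpus
monolingual_target = ["the witch arrived", "it is late"]  # large target-only corpus

# Backtranslate the monolingual target data to create synthetic aligned pairs.
synthetic = [(reverse_model(t), t) for t in monolingual_target]

# The forward (source->target) model is then trained on real + synthetic pairs.
training_data = parallel + synthetic
```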
How can Machine Translation systems be evaluated?
Human Assessment, BLEU metric, Precision, Recall, NIST, TER, METEOR
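For BLEU specifically, a small sketch using the sacrebleu package, assuming it is installed (the sentences are invented):

```python
import sacrebleu

hypotheses = ["the green witch arrived"]        # one system output per source sentence
references = [["the green witch has arrived"]]  # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, reported as a percentage
```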