Sequence-to-sequence transformations in text processing Flashcards
When is sequence to sequence (seq2seq) a reasonable approach?
When we have variable-length input and variable-length output. A constant-sized neural network module is applied repeatedly over the input data.
Mention three possible seq2seq approaches
- Recurrent networks: apply the NN module in a serial fashion
- Convolutional networks: apply the NN modules in a hierarchical fashion
- Self-attention: apply the NN module in a parallel fashion
Explain the steps in a RNN-encoder-decoder structure (seq2seq with RNN) in reference to a text translation task.
- Sequentially feed the input sentence you want to translate into your encoder (word by word); each consecutive step takes in the previous hidden state and the next word.
- Use the final state vector from your encoder to initialize the decoder's state. During training, feed the ground-truth target words as decoder input (teacher forcing); during inference, select the next word from the softmaxed output of the previous time step and feed it in as input to the next time step. A sketch of this structure follows below.
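A minimal PyTorch sketch of this encoder-decoder structure (all sizes and class names here are illustrative, not from the source):

```python
import torch
import torch.nn as nn

# Illustrative sizes; not from the source.
VOCAB_SRC, VOCAB_TGT, EMB, HID = 1000, 1200, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SRC, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                # src: (batch, src_len)
        _, h = self.rnn(self.emb(src))     # h: final state, (1, batch, HID)
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_TGT, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB_TGT)

    def forward(self, tgt, h):             # tgt: (batch, tgt_len)
        y, h = self.rnn(self.emb(tgt), h)  # state initialized by the encoder
        return self.out(y), h              # logits per time step

# Training uses ground-truth tokens (teacher forcing); at inference,
# feed the word picked from the previous softmax back in, step by step.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, VOCAB_SRC, (2, 7))
tgt = torch.randint(0, VOCAB_TGT, (2, 5))
logits, _ = dec(tgt, enc(src))             # (2, 5, VOCAB_TGT)
```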
Explain the basic concept of beam search
Beam search keeps the N (beam width) most probable partial sequences at each word prediction. Each of the N hypotheses is extended with its N most probable continuations, giving up to N^2 candidates; the N highest-scoring sequences are kept for the next time step.
Simply put: beam search tries to increase the accuracy of seq2seq decoding by not only picking the argmax of each prediction and feeding it to the next time step, but keeping the top N candidate sequences, as in the sketch below.
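A minimal sketch of the loop (the `step_logprobs_fn` interface is a hypothetical stand-in for one decoder step returning log-probabilities over the vocabulary):

```python
import torch

def beam_search(step_logprobs_fn, bos, eos, beam=3, max_len=20):
    """step_logprobs_fn(seq) -> log-probs over vocab; hypothetical interface."""
    beams = [([bos], 0.0)]                      # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                  # finished hypothesis survives as-is
                candidates.append((seq, score))
                continue
            logp = step_logprobs_fn(seq)        # (vocab,)
            top = torch.topk(logp, beam)        # N continuations per hypothesis
            for lp, tok in zip(top.values, top.indices):
                candidates.append((seq + [tok.item()], score + lp.item()))
        # from up to N*N candidates, keep the N best sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]
```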
Give a brief explanation of the feed-forward network with self-attention (the Transformer) from the paper "Attention Is All You Need".
Consists of an encoder and a decoder. The encoder maps the input sequence to an intermediate representation using stacked self-attention and feed-forward layers. The decoder generates the output one word at a time, using masked self-attention over the previously generated words and encoder-decoder attention over the encoder's representation. Since there is no recurrence, all positions are processed in parallel.
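For reference, a minimal sketch using PyTorch's built-in `nn.Transformer` (sizes are illustrative; embeddings and positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

# Illustrative sizes; inputs are assumed to be already-embedded sequences.
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.rand(2, 7, 64)                 # embedded source sequence
tgt = torch.rand(2, 5, 64)                 # embedded target sequence
# Causal mask so each target position attends only to earlier positions.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(5)
out = model(src, tgt, tgt_mask=tgt_mask)   # (2, 5, 64)
```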
How can you use reinforcement learning in seq2seq for machine translation?
We want to sample the next word in the decoder from the previous softmax output. However, sampling is non-differentiable, so we need an alternative scheme for training through this word selection. Reinforcement learning lets us treat the word choice as an action, as in the sketch below.
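One common instantiation is REINFORCE: the decoder acts as a policy, sampled words are actions, and a sentence-level score (e.g. BLEU against the reference) is the reward. A minimal sketch of the loss, where `reward_fn` is a hypothetical scorer:

```python
import torch

def reinforce_loss(logits, reward_fn):
    """logits: (tgt_len, vocab) decoder outputs.
    reward_fn: hypothetical sentence-level scorer, e.g. BLEU."""
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()              # non-differentiable sampling step
    reward = reward_fn(sampled)          # scalar reward for the whole sentence
    # Policy gradient: scale log-probs of the sampled words by the reward.
    return -(dist.log_prob(sampled).sum() * reward)
```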
Briefly explain the concept of co-attention and when it is appropriate to use it.
Co-attention can be used in question answering, where we feed both a context matrix (a paragraph of text) and a corresponding question matrix and want the network to make a prediction based on both inputs.
Co-attention is a method of combining the context and the question in such a way that each word in the question matrix can be viewed in the light of each word in the context matrix, and vice versa. This is done using dot products between the context and the question encodings, with a normalization scheme corresponding to softmax operations, as in the sketch below.
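A minimal sketch of the affinity computation (dimensions and variable names are illustrative, loosely following coattention-style QA models):

```python
import torch
import torch.nn.functional as F

# C: context encodings (m, d); Q: question encodings (n, d). Illustrative sizes.
C, Q = torch.rand(40, 64), torch.rand(10, 64)

L = C @ Q.T                    # affinity: dot product of every (context, question) pair
A_q = F.softmax(L, dim=0)      # normalize over context words for each question word
A_c = F.softmax(L, dim=1)      # normalize over question words for each context word

ctx_for_q = C.T @ A_q          # each question word in the light of the context, (d, n)
q_for_ctx = Q.T @ A_c.T        # each context word in the light of the question, (d, m)
```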