Lecture 12 Flashcards
Sequence to Sequence, Attention, Transformer
Encoder:
an LSTM that encodes the input sequence into a fixed-length internal
representation W.
Decoder:
another LSTM that takes the internal representation W and generates the output sequence from that vector.
Question Answering
We ingest a sentence of several words, processing it with a recurrent unit such as an LSTM. We preserve the “state” that results from ingesting that sentence into our trained model. The resulting context vector then serves as the context for a decoder module, also made
of LSTMs or GRUs. If we prompt the decoder with a particular “state” vector and a “start of sentence” marker, we can generate output tokens (plus an end of sentence marker).
Seq2Seq
At each time step in the encoder, the RNN takes a word vector (x_i) from the input sequence and the hidden state (H_(i-1)) from the previous time step; the hidden state is updated to (H_i).
The context vector passed to the decoder is the hidden state from the last unit of the encoder
(without the attention mechanism) or a weighted sum of the hidden states of the encoder (with the attention mechanism).
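A minimal numpy sketch of this recurrence and of the two ways of forming the context vector. The tanh cell stands in for an LSTM/GRU, and the names W_x, W_h and all dimensions are illustrative assumptions, not the lecture's notation:

```python
# Encoder recurrence: H_i = f(x_i, H_(i-1)), then build the decoder's context.
import numpy as np

hidden_size, embed_size, seq_len = 8, 6, 5
rng = np.random.default_rng(0)

W_x = rng.normal(size=(hidden_size, embed_size))   # input-to-hidden weights (illustrative)
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights (illustrative)
x = rng.normal(size=(seq_len, embed_size))         # word vectors x_1 .. x_|x|

h = np.zeros(hidden_size)
hidden_states = []
for x_i in x:                                      # one encoder time step per word
    h = np.tanh(W_x @ x_i + W_h @ h)               # update the hidden state
    hidden_states.append(h)
H = np.stack(hidden_states)                        # (seq_len, hidden_size)

# Without attention: the context is simply the final hidden state.
context_no_attention = H[-1]

# With attention: the context is a weighted sum of all encoder hidden states
# (weights shown here as uniform purely for illustration).
alpha = np.full(seq_len, 1.0 / seq_len)
context_with_attention = alpha @ H
print(context_no_attention.shape, context_with_attention.shape)
```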
Inference
The task of applying a trained model to generate a translation is
called inference
MT Keras Overview – Input/Output
- Source sequence for the encoder:
  x = (x1, x2, …, x|x|), initially one-hot encoded and usually fed into a
  word embedding layer
- Target sequence:
  y = (y1, y2, …, y|y|), which exists in two versions: the decoder input
  starts with a start-of-sentence token and the decoder output ends with an
  end-of-sentence token; the two sequences are offset by one time step
- The final decoder output goes through a softmax layer that gives the
  probability of each entry in the vocabulary (see the sketch after this
  list)
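A hedged Keras sketch of this input/output setup, not the exact lecture model: vocabulary sizes and layer widths are placeholder values, and the decoder input/output tensors are assumed to be the shifted target sequences described above.

```python
import tensorflow as tf

src_vocab, tgt_vocab, embed_dim, units = 8000, 8000, 128, 256  # placeholders

# Encoder: embed the source tokens and keep only the final LSTM states.
enc_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="encoder_tokens")
enc_embed = tf.keras.layers.Embedding(src_vocab, embed_dim)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(enc_embed)

# Decoder: consumes the start-token-prefixed target, initialized with the
# encoder's final states (the context).
dec_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="decoder_tokens")
dec_embed = tf.keras.layers.Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_seq, _, _ = tf.keras.layers.LSTM(units, return_sequences=True,
                                     return_state=True)(
    dec_embed, initial_state=[state_h, state_c])

# Softmax over the target vocabulary at every decoder time step.
probs = tf.keras.layers.Dense(tgt_vocab, activation="softmax")(dec_seq)

model = tf.keras.Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Training would pass the source tokens and the start-token-prefixed targets as inputs, with the end-token-suffixed targets (offset by one step) as labels.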
Greedy search:
At each time step, choose the output word with the highest probability.
Beam search:
At each time step, keep the k most probable candidate sequences
(typically 5 <= k <= 10); assemble the overall sequence with the maximum probability.
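A toy sketch contrasting the two decoding strategies. The function step() is a stand-in for the decoder softmax (not a real model), and vocab, eos, and max_len are illustrative values:

```python
import numpy as np

vocab, eos, max_len = 10, 0, 6

def step(prefix):
    """Stand-in for the decoder softmax: P(next token | prefix)."""
    local = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    p = local.random(vocab)
    return p / p.sum()

def greedy():
    seq = []
    while len(seq) < max_len:
        tok = int(np.argmax(step(seq)))      # pick the single best token
        seq.append(tok)
        if tok == eos:
            break
    return seq

def beam_search(k=5):
    beams = [([], 0.0)]                      # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:       # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            p = step(seq)
            for tok in np.argsort(p)[-k:]:   # expand the k best continuations
                candidates.append((seq + [int(tok)], score + np.log(p[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]                       # highest-probability sequence found

print("greedy:", greedy())
print("beam  :", beam_search())
```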
Attention Mechanisms
Unfortunately, the single context vector passed from the encoder to the decoder is not always sufficient to produce a good result. Attention addresses this bottleneck.
Drawback of the “Vanilla” Encoder-Decoder
- In the “vanilla” seq2seq model shown earlier, the decoder takes the final
  hidden state of the encoder (the context vector) and uses that to produce
  the target sentence.
- The fixed-size context vector represents the final time step. Loosely
  speaking, the encoding process gives slightly more weight to each
  successive term in the input sentence.
- Earlier terms may be more important than later ones, though, in driving
  the accuracy of the decoder's output.
Attention Mechanism
An “attention mechanism” makes all hidden states of the encoder
visible to the decoder:
- embedding all the words in the input (represented by hidden states) while
  creating the context vector
- a learned mechanism that helps the decoder identify where to pay
  attention in the encoding when predicting at each time step (see the
  sketch after this list)
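A minimal numpy sketch of dot-product attention at one decoder time step (one common scoring choice; the lecture does not fix a particular score function): the decoder state is scored against every encoder hidden state, the scores are softmax-normalized into attention weights, and the context is the weighted sum. Shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 5, 8
H = rng.normal(size=(seq_len, hidden))   # all encoder hidden states, visible to the decoder
s_t = rng.normal(size=hidden)            # current decoder hidden state

scores = H @ s_t                                 # relevance of each source position
alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, sum to 1
context = alpha @ H                              # weighted sum of encoder hidden states

print(alpha.round(3), context.shape)
```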
Fully Attention-Based Approaches
What if we avoided recurrent layers (such as LSTMs) entirely and used attention layers instead? This was the insight behind BERT and other transformer-based approaches. Transformer-based models have improved performance on some tasks in comparison to recurrent networks with attention.
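A brief Keras sketch of the idea, using the built-in MultiHeadAttention layer (the core operation of transformer blocks) in place of recurrence; the dimensions are placeholder values, and a full transformer block would also add residual connections, layer normalization, and a feed-forward sublayer.

```python
import tensorflow as tf

x = tf.keras.Input(shape=(None, 128))            # (batch, seq_len, model_dim)
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)
y = attn(query=x, value=x, key=x)                # self-attention: no recurrence;
                                                 # every position attends to all others
model = tf.keras.Model(x, y)
model.summary()
```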
Transformer Architectures Proliferate
- The basic transformer architecture from Vaswani et al. (2017, Advances in
  Neural Information Processing Systems) has morphed into numerous
  variations applied to a variety of tasks
- In addition to NLP, transformers have been applied to genetics, computer
  vision, signal processing, and video analysis
- In 2021 alone, more than 6000 papers were published on applications of
  and improvements to BERT