Sequence Processing, Transformers and Attention Flashcards
What does LSTM stand for?
Long Short-Term Memory
Why is long-distance information critical to many language applications?
Words at the start of a sentence can often have a large impact on the meaning of words at the end of the sentence
What is an issue of using RNNs?
Information from the start of the sentence can degrade as we move further along the sentence, becoming less and less accessible
What is the vanishing gradients problem?
The vanishing gradients problem is where gradients get smaller and smaller as Gradient Descent propagates back through deeper stacked layers, eventually ‘vanishing’. The connection weights of those layers are left virtually unchanged, and therefore the training loss does not converge
What is the exploding gradients problem?
It is the opposite of the vanishing gradients problem: as Gradient Descent progresses, the gradients get larger and larger, causing the weight updates to diverge and the training loss not to converge.
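A toy numeric sketch (not from the cards) of why repeated multiplication through many layers makes gradients vanish or explode; the per-layer factors 0.5 and 1.5 are assumptions chosen purely for illustration:

```python
# Toy illustration: a gradient is roughly a product of per-layer derivatives,
# so factors below 1 shrink it towards zero and factors above 1 blow it up.
depth = 50
small_grad = 1.0
large_grad = 1.0
for _ in range(depth):
    small_grad *= 0.5   # hypothetical per-layer derivative < 1 -> vanishing
    large_grad *= 1.5   # hypothetical per-layer derivative > 1 -> exploding

print(f"after {depth} layers: {small_grad:.3e} (vanished) vs {large_grad:.3e} (exploded)")
```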
What does the image show?
This is an LSTM cell; an LSTM layer consists of many of these cells
Explain how an LSTM cell, similar to the one shown in the image, works.
The previous hidden state, ht-1, is still provided as an input alongside the current input xt. There is also an additional memory input, the cell state ct-1, a form of secondary memory that can record longer-term information. The LSTM cell can be trained to learn what to remember and what to forget, which is what the gates are for: each gate has its own fully connected dense layer whose weights are learned during training. Given the inputs, each gate passes them through its activation function to compute its value, and these gate values in turn produce the output vector yt, the next long-term state ct and the next short-term state ht.
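A minimal numpy sketch of the standard LSTM gate equations for one time step; the weight names (Wf, Uf, bf, etc.) and shapes are assumptions for illustration, not taken from the image:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b each hold the weights of the four fully connected (dense)
    # layers: forget, input, output and candidate.
    Wf, Wi, Wo, Wg = W          # input weights per gate
    Uf, Ui, Uo, Ug = U          # recurrent weights per gate
    bf, bi, bo, bg = b          # biases per gate

    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)   # forget gate: what to erase from c
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)   # input gate: what to write to c
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)   # output gate: what to expose as h
    g_t = np.tanh(Wg @ x_t + Ug @ h_prev + bg)   # candidate new memory

    c_t = f_t * c_prev + i_t * g_t               # next long-term state
    h_t = o_t * np.tanh(c_t)                     # next short-term state (also used as the output y_t)
    return h_t, c_t
```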
What is a GRU?
It is a Gated Recurrent Unit
What does a GRU do?
It merges the long-term and short-term state vectors into a single state vector
Are GRUs simpler or more complex than an LSTM?
They are simpler and more efficient to compute while still producing similar results.
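A matching numpy sketch of the GRU update, under the same assumed shapes as the LSTM sketch above; note there are only three dense layers instead of four and a single state vector h:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    Wz, Wr, Wh = W              # input weights per gate
    Uz, Ur, Uh = U              # recurrent weights per gate
    bz, br, bh = b              # biases per gate

    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)            # update gate: keep old state vs. take new
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)            # reset gate: how much history feeds the candidate
    h_hat = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat             # merged long/short-term state
```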
In simple terms, explain the difference between the 4 concepts shown in the image.
The MLP simply uses the input to produce the output
The RNN aggregates the previous hidden state as well as the input X
The LSTM cell aggregates the previous hidden state, takes the input X, and adds a long-term memory while learning what to remember
The GRU merges the long-term memory with the short-term memory and learns what to remember and what to forget
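A hypothetical Keras sketch of the same comparison: the only line that changes between the recurrent models is the choice of layer (the layer sizes are arbitrary assumptions):

```python
from tensorflow import keras

def seq_model(recurrent_layer):
    # Same sequence-classification shape for RNN, LSTM and GRU.
    return keras.Sequential([
        keras.Input(shape=(None, 16)),       # (time steps, features)
        recurrent_layer,
        keras.layers.Dense(1, activation="sigmoid"),
    ])

mlp = keras.Sequential([                      # no state at all: input -> output
    keras.Input(shape=(16,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
rnn = seq_model(keras.layers.SimpleRNN(32))   # short-term state h only
lstm = seq_model(keras.layers.LSTM(32))       # h plus long-term state c with learned gates
gru = seq_model(keras.layers.GRU(32))         # gated state with h and c merged
```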
What does the transformer architecture show?
That, by using the concept of self-attention, LSTMs and GRUs could be outperformed
What are the problems of GRUs and RNNs that transformers solve?
While they were designed to cope with longer-distance dependencies, they still could not handle very long sequences
What is the key concept of the transformer architecture?
It uses an attention layer that is not constrained by the length of the sequence: it has access to all inputs of the sequence and learns which ones to attend to.
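A single-head scaled dot-product self-attention sketch in numpy (the projection matrices Wq, Wk, Wv are assumed to be given); every position attends over every position of the sequence, whatever its length:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (sequence length, d_model); Wq/Wk/Wv: (d_model, d_k) projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values for every position
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of each position with all others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: learned attention weights
    return weights @ V                          # each output mixes information from the whole sequence
```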
Explain the architecture of the image showing the Transformer model.
It is an encoder-decoder model. The encoder is a stack of multi-head attention (and feed-forward) layers, and the decoder is another such stack, each repeated N times. The encoder's output feeds into the decoder, and a final linear layer with softmax produces the output for the problem.
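A minimal PyTorch sketch of that encoder-decoder shape using the built-in nn.Transformer; the hyperparameters and random embeddings are assumptions for illustration, and real use would add token embeddings and positional encodings:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 512, 8, 6, 10000   # assumed hyperparameters

# N encoder layers and N decoder layers, each built around multi-head attention.
model = nn.Transformer(d_model=d_model, nhead=n_heads,
                       num_encoder_layers=n_layers, num_decoder_layers=n_layers)
to_logits = nn.Linear(d_model, vocab_size)   # final linear layer before the softmax

src = torch.rand(20, 1, d_model)   # (source length, batch, d_model) embeddings
tgt = torch.rand(15, 1, d_model)   # (target length, batch, d_model) embeddings
out = model(src, tgt)              # encoder output feeds the decoder internally
probs = torch.softmax(to_logits(out), dim=-1)   # distribution over the vocabulary
```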