Sequence Models Flashcards
Notes on Sequence Models that may help with the exam.
What is a one-line summary of Sequential Learning?
One data item depends on those that come before or after it; the data are not independent and identically distributed (i.i.d.)
What are some common applications of Sequence Learning?
Speech/Voice Recognition
Weather forecasting
Language translation
DNA Sequence Analysis
What are the four types of application scenarios in regards to Recurrent Neural Networks?
One-to-one
One-to-many
Many-to-one
Many-to-many
What is a one-line summary of One-to-one in the context of Recurrent Neural Networks?
A classical feed-forward neural network with one input and one output, e.g. image classification
What is a one-line summary of One-to-many in the context of Recurrent Neural Networks?
A single input (e.g. an image) produces an output sequence of variable length, e.g. Image Captioning
What are some common applications of Many-to-one in the context of Recurrent Neural Networks?
Sentiment Classification
Share Price Predictions
What are some common applications of Many-to-many in the context of Recurrent Neural Networks?
Language Translation - input and output are sequences of variable (and generally different) length
Video Clip Classification - input and output have the same length (e.g. one label per frame)
What is the equation for a Basic Recurrent Neural Network unit?
h_t = f(W_h · h_(t-1) + W_x · x_t + b) - the current hidden state is an activation function f (e.g. tanh) applied to a weighted combination of the previous state and the input vector at the present time step
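A minimal sketch of one such recurrent step in NumPy (the sizes, weights, and initialisation below are illustrative assumptions, not from the notes):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One basic RNN step: h_t = tanh(W_h @ h_prev + W_x @ x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Illustrative sizes: a hidden state of 4 units, input vectors of 3 features.
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)

h = np.zeros(4)                      # initial hidden state
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 input vectors
    h = rnn_step(h, x_t, W_h, W_x, b)
print(h)                             # final state summarising the sequence
```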
What challenge do Recurrent Neural Networks face?
They suffer from the vanishing/exploding gradient problem: back-propagation through time multiplies the gradient by the recurrent weights at every step, so over long sequences it either shrinks towards zero or grows without bound
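A toy numerical illustration of this effect (not from the notes): treating the recurrent weight as a scalar w and ignoring the activation derivative, the gradient through T time steps scales roughly like w**T.

```python
# Toy sketch: the gradient factor through 100 time steps for a scalar
# recurrent weight w, ignoring the activation-function derivative.
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(100):
        grad *= w
    print(f"w={w}: gradient factor after 100 steps ~ {grad:.3e}")
# w=0.9 -> ~2.7e-05 (vanishing); w=1.1 -> ~1.4e+04 (exploding)
```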
What are Long Short Term Memory Networks?
They are variants of Recurrent Neural Networks specifically designed to capture long-term dependencies.
What are Long Short Term Memory networks better at doing compared to Recurrent Neural Networks?
They back-propagate the gradient much more efficiently than the standard Recurrent Neural Network, because the cell state gives the gradient a path through time with little attenuation
What does a Long Short Term Memory consist of in regards to its two main components?
Long-term memory: the cell state, whose update path has no learnable weights applied directly to it (non-learnable)
Short-term memory: the hidden state, which is computed through learnable weights.
What connects the two main blocks of a Long Short Term Memory Network, and determines whether information passes through to either side?
Multiple sigmoid functions act as gates: their outputs lie between 0 and 1 and switch the information-passing paths on and off.
What three types of gates exist within the Long Short Term Memory Network Unit?
Forget Gate
Input Gate
Output Gate
What does the Forget Gate determine in LSTMs?
The forget gate determines how much long term memory is retained (% amount)
What does the Input Gate determine in LSTMs?
The Input Gate determines how much of the short-term information should be contributed to the long-term memory (% amount)
What does the Output Gate determine in LSTMs?
The Output Gate determines how much of the new long-term memory should be contributed to the current output. (% amount)
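A minimal NumPy sketch of one LSTM step showing all three gates (the weight layout and sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x_t, W, b):
    """One LSTM step; W maps [h_prev; x_t] to the four internal signals."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)            # forget gate: % of long-term memory retained
    i = sigmoid(i)            # input gate: % of candidate added to memory
    o = sigmoid(o)            # output gate: % of memory sent to the output
    g = np.tanh(g)            # candidate values for the cell state
    c_t = f * c_prev + i * g  # long-term memory (cell state) update
    h_t = o * np.tanh(c_t)    # short-term memory (hidden state / output)
    return c_t, h_t

# Illustrative sizes: 4 hidden units, 3 input features.
rng = np.random.default_rng(0)
H, X = 4, 3
W, b = rng.normal(size=(4 * H, H + X)), np.zeros(4 * H)
c, h = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, X)):
    c, h = lstm_step(c, h, x_t, W, b)
print(h)
```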
What is the challenge in regards to LSTMs?
LSTMs capture long-term memory far better than plain RNNs, but over very long sequences the memory still fades, so very distant dependencies are not retained long enough to be useful
What is a Transformer especially good at?
Transformers capture dependencies between all pairs of input words, and between every input word and every output word, regardless of their distance in the sentence
What is the overall architecture of a Transformer?
It is modelled as an Encoder-Decoder architecture:
Encoder - The output is a continuous vector representation of the inputs
Decoder - Takes the continuous vector representation from the Encoder and generates the output one token at a time
What are some common applications of Transformers?
Language Translation
Chatbots
What is the overall pipeline for a Transformer?
The encoder and decoder each start with their own chain:
- Encoder: self-attention -> Feed-forward network
- Decoder: masked self-attention
The two chains then merge inside the decoder:
- Decoder-Encoder attention -> Feed-forward network
What are the details surrounding the Input Embedding step in Transformers?
Each word is mapped to a vector
We then add a positional encoding to the vector using sine/cosine functions
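A sketch of this sinusoidal positional encoding (the sequence length and model dimension below are illustrative; the sine/cosine formula follows the original Transformer paper):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]        # word positions, (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the word-embedding vectors, e.g. for
# 10 tokens with 16-dimensional embeddings:
embeddings = np.random.default_rng(0).normal(size=(10, 16))
encoded = embeddings + positional_encoding(10, 16)
```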
What are the details surrounding the Encoder Training area of the Transformer?
The multi-head attention calculates the relationship between all pairwise inputs
The Feed-Forward network maps the attention vectors into a form that can be fed to the Transformer decoder.
What are the details surrounding the Multi-Head Attention area of the Transformer?
Input - Each word has a Query, Key and Value assigned to it
Scaled Dot-Product Attention: the dot products of Queries and Keys are scaled by the square root of the key dimension and passed through the Softmax function, i.e. Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V.
For each input, compute multiple attention vectors and use the weighted average as the final attention vector for each word.
The output vector is the product of the attention weights and the Values. Multiple words can be processed in parallel.
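A minimal sketch of scaled dot-product attention for a single head (the sizes are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all words in parallel."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise word relationships
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                       # weighted sum of Value vectors

# 5 words, each with an 8-dimensional Query, Key and Value vector:
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # (5, 8): one vector per word
```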
What are the details surrounding the Decoder Training area of the Transformer?
During training, the target sentence is input to the Masked Multi-Head Attention, which masks out relationships to future words.
Then another Multi-Head Attention (the Decoder-Encoder attention) learns the interactions between input words and target words
The final layer outputs a probability distribution over the vocabulary (e.g. a 1000-dimensional vector for a 1000-word vocabulary); the training target is the one-hot encoding of the correct word
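A sketch of the causal mask used by the Masked Multi-Head Attention (future positions are set to -inf before the softmax, so each word can attend only to itself and earlier words; sizes are illustrative):

```python
import numpy as np

seq_len = 5
# Upper-triangular mask: entry (i, j) with j > i marks a future word.
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores[mask] = -np.inf        # masked scores become 0 after the softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # row i has non-zero weights only for j <= i
```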
What does Model Inference mean in regards to the Transformer?
To estimate the second word and onwards, the model uses the whole input sentence plus all previously generated target words to infer the next word.
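A pseudocode-style sketch of this autoregressive loop with greedy decoding (`model`, `bos_id`, and `eos_id` are hypothetical placeholders, not part of the notes):

```python
# Hypothetical greedy decoding loop; `model` is assumed to return a
# probability distribution over the vocabulary for the next word.
def greedy_decode(model, input_sentence, bos_id, eos_id, max_len=50):
    target = [bos_id]                        # start-of-sentence token
    for _ in range(max_len):
        # The whole input and all previously generated words are used.
        probs = model(input_sentence, target)
        next_id = int(probs.argmax())        # pick the most likely word
        target.append(next_id)
        if next_id == eos_id:                # stop at end-of-sentence
            break
    return target[1:]                        # drop the BOS token
```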