RNNs Flashcards
How do Hidden Markov Models input words?
They process words one at a time
What are some limitations of state-based models?
Supervised ML techniques take fixed-length inputs, but sentences vary in length
What is the workaround for sentences not having a fixed length?
We use a sliding window of words
What is a problem with using a sliding window of words?
It is hard to learn semantic patterns involving long-range dependencies, since any dependency longer than the window falls outside the input
Why can a single sentence generate lots of inputs?
This is because of the sliding window - if we have the sentence “and thanks for all the fish” and a window size of 3, we can have inputs of “and thanks for”, “thanks for all”, “for all the” and “all the fish”.
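A minimal sketch of generating these windows in Python (the sentence and window size are from the example above):

```python
# Minimal sketch: generate fixed-size sliding windows over a sentence.
sentence = "and thanks for all the fish".split()
window_size = 3

windows = [sentence[i:i + window_size]
           for i in range(len(sentence) - window_size + 1)]
print(windows)
# [['and', 'thanks', 'for'], ['thanks', 'for', 'all'],
#  ['for', 'all', 'the'], ['all', 'the', 'fish']]
```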
In the image, what is the size of the input?
It is 3 times the dimension of the embedding (window size × embedding dimension), as the three word embeddings in the window are concatenated together
What are Recurrent Neural Networks based on?
Elman Networks
What is different about RNNs compared to NNs?
We don't just take the immediate input; we also factor in the hidden layer values from the previous time step.
What happens to the values of the hidden layer at time t-1 when an input is received at time t?
The values are provided as input in addition to the current input vector
What type of network does the image show?
A simple RNN
Explain what the image shows
It shows how a simple RNN works: the current input vector is multiplied by the input weights W, the hidden layer values from the previous time step are multiplied by the weights U, and the two are combined to give the new hidden layer values. These are multiplied by the output weights V to produce the output.
How are the hidden layer values computed?
An activation function g is applied to the weighted sum of the current input and the previous hidden layer values
Explain what the image shows
It shows that to get the hidden layer values ht, we multiply the previous hidden layer ht-1 by weights U, add the current input xt multiplied by weights W, and pass the result through an activation function g. Then to get the output yt, we multiply ht by the output weights V and apply a function f, usually softmax
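A minimal NumPy sketch of this update; the dimensions and the choice of tanh for g are illustrative assumptions:

```python
import numpy as np

d_in, d_h, d_out = 4, 5, 3                 # illustrative dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))           # input -> hidden weights
U = rng.normal(size=(d_h, d_h))            # previous hidden -> hidden weights
V = rng.normal(size=(d_out, d_h))          # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    h_t = np.tanh(U @ h_prev + W @ x_t)    # ht = g(U ht-1 + W xt)
    y_t = softmax(V @ h_t)                 # yt = f(V ht)
    return h_t, y_t

h = np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):       # a toy sequence of 6 inputs
    h, y = rnn_step(x, h)
```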
How does the loss function work in an RNN?
Computing the loss at time t needs ht, which depends on ht-1, which in turn depends on ht-2, and so on back to the start of the sequence; training therefore uses backpropagation through time
When using an RNN for language models, what is the input?
The input is a sequence of L words from vocabulary V, where L is the length of the sequence so far; each word is one-hot encoded as a vector of size |V|, so the input has size L × |V|
What is a one-hot vector in regards to a language model?
It is a vector of size |V| filled with 0s, apart from a 1 at the index of that word in the vocabulary
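A minimal sketch of one-hot encoding against a toy vocabulary:

```python
import numpy as np

vocab = ["and", "thanks", "for", "all", "the", "fish"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))        # vector of size |V|, all 0s...
    v[word_to_idx[word]] = 1.0      # ...except a 1 at the word's index
    return v

print(one_hot("for"))               # [0. 0. 1. 0. 0. 0.]
```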
When using an RNN as a language model, what is the output Y?
It is a probability distribution over the vocabulary V, from which the predicted next word in the sequence is chosen
What does cross-entropy measure?
It measures how well a set of estimated class probabilities matches the true (target) class
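A minimal sketch: with a one-hot target, cross-entropy reduces to the negative log of the probability assigned to the correct class:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])   # estimated distribution over classes
target = 1                          # index of the true class
loss = -np.log(probs[target])
print(loss)                         # ~0.357; lower when probs[target] is higher
```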
What is teacher forcing?
During training, when making the next prediction, the model is fed the ground-truth token from the previous step rather than its own prior prediction; this keeps training on track and helps the model converge
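A minimal sketch of teacher forcing; `rnn_step(token_id, h)` is a stand-in for one RNN step returning (new hidden state, probabilities over the vocabulary), and is an assumption for illustration:

```python
import numpy as np

def train_on_sequence(rnn_step, gold, h0):
    h, loss = h0, 0.0
    for t in range(len(gold) - 1):
        # Teacher forcing: the input is the ground-truth token gold[t],
        # not the model's own prediction from the previous step.
        h, probs = rnn_step(gold[t], h)
        loss += -np.log(probs[gold[t + 1]])  # cross-entropy vs next gold token
    return loss
```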
How does sequence labelling with RNNs work (e.g. POS tagging)?
The input X is a sequence of words
The output Y is a probability distribution over POS tags (the most likely tag is chosen by argmax)
Pre-trained word embeddings can be used
The loss function is a cross-entropy loss; a sketch of this setup follows below
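A minimal PyTorch sketch of RNN sequence labelling; the vocabulary size, tagset size, and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, tagset_size, emb_dim, hidden_dim = 1000, 17, 50, 64

class Tagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # could be pre-trained
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, x):              # x: (batch, seq_len) word indices
        h, _ = self.rnn(self.emb(x))   # h: (batch, seq_len, hidden_dim)
        return self.out(h)             # tag scores for every position

model = Tagger()
words = torch.randint(0, vocab_size, (1, 6))    # one toy sentence
tags = torch.randint(0, tagset_size, (1, 6))    # its gold tags
logits = model(words)
loss = nn.CrossEntropyLoss()(logits.view(-1, tagset_size), tags.view(-1))
pred = logits.argmax(dim=-1)           # most likely tag per word (argmax)
```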
How does autoregressive generation using an RNN work (e.g. text generator)?
Input X is the sequence of words so far, starting with a start token
Output Y is the next word to be added to X
Pre-trained word embeddings can be used
The loss function is a cross-entropy loss; a sketch of the generation loop follows below
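A minimal sketch of autoregressive generation; `model` is assumed to be a next-word predictor shaped like the tagger above (scoring the vocabulary instead of tags), and the start/end token ids are illustrative assumptions:

```python
import torch

def generate(model, start_id, end_id, max_len=20):
    seq = [start_id]                       # start with the start token
    for _ in range(max_len):
        x = torch.tensor([seq])            # the sequence of words so far
        logits = model(x)                  # (1, len(seq), vocab_size)
        next_id = logits[0, -1].argmax().item()  # most likely next word
        seq.append(next_id)                # add it to X and repeat
        if next_id == end_id:
            break
    return seq
```

Sampling from the softmax distribution could be used in place of argmax for more varied output.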
How does sequence classification work with an RNN (e.g. sentence/document classifier)?
Input X is a sequence of words in sentence/document
Output Y is a class probability
An RNN is combined with an MLP: the RNN's final hidden state is fed to an MLP classifier
A cross-entropy loss function based on the classification result; a sketch follows below
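A minimal PyTorch sketch of the RNN + MLP combination; sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, n_classes = 1000, 50, 64, 2

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, x):
        _, h_n = self.rnn(self.emb(x))   # h_n: final hidden state
        return self.mlp(h_n[-1])         # classify from the last hidden state

model = Classifier()
doc = torch.randint(0, vocab_size, (1, 12))   # one toy document
loss = nn.CrossEntropyLoss()(model(doc), torch.tensor([1]))
```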
How do stacked RNNs work?
The entire output sequence of one RNN is used as an input for another RNN
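In PyTorch this stacking is the `num_layers` argument; each layer's full output sequence feeds the next layer (a minimal sketch, with illustrative sizes):

```python
import torch.nn as nn

# Three RNN layers stacked: layer 1's output sequence is layer 2's
# input sequence, and so on.
stacked = nn.RNN(input_size=50, hidden_size=64, num_layers=3,
                 batch_first=True)
```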
What is a positive of using a stacked RNN?
It encodes different levels of abstraction, which allows more sophisticated patterns to be encoded
What is a drawback of using a stacked RNN?
Adding more RNN layers increases training time
What are stacked RNNs an example of?
Deep Learning
How does a bi-directional RNN work?
We have one RNN layer that processes the sequence in a forward pass, a separate RNN layer that processes it in a backward pass, and then we concatenate the two sets of hidden layer values for each position t in the sequence
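A minimal PyTorch sketch; with `bidirectional=True` the forward and backward hidden states are concatenated at each position (sizes are illustrative):

```python
import torch
import torch.nn as nn

birnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True,
               bidirectional=True)
x = torch.randn(1, 6, 50)      # a toy sequence of 6 embeddings
out, _ = birnn(x)
print(out.shape)               # (1, 6, 128): 64 forward + 64 backward
```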
In an RNN, how many sets of weights do we have to update?
3: W, the weights from the input layer to the hidden layer; U, the weights from the previous hidden layer to the current hidden layer; and V, the weights from the hidden layer to the output layer