Lecture 4 Flashcards
What is a recurrent neural network?
A recurrent neural network (RNN) is a deep neural network trained on sequential or time-series data to create a machine learning (ML) model.
In what situations may inputs to the network be independent?
Some image processing, e.g. where we show the network one image at a time: it doesn't matter what the previous inputs/predictions were.
If two inputs to the network are independent of each other, their outputs will also be independent.
In what situations may inputs to the network be dependent?
Eg autonomous cars - we want to interpret what we see in terms of what we saw previously. Eg if we saw a lorry previously that is no longer in direct sight, we want to still have the knowledge that it is there.
When inputs/predictions are dependent on previous data, how do we describe this correlation?
Time correlation
What contexts are time correlations very relevant for?
Language modelling (eg predicting the next word, translating).
We build up the sentence word by word, narrowing down the possibilities one word at a time. The prediction of the next word will be different if we use only the current word versus the context of the whole sentence.
Particularly in languages like German, where the meaning of a sentence isn’t always clear until the last words, we need to know the context (eg the previous sentences). Knowing context is also a problem for human translators.
In what situations are time correlations often critical?
- Speech recognition - e.g. if you miss a word, you can work it out from the surrounding words
- Handwriting recognition - make a guess based on the words around it
- Machine translation
- Object classification
- Prediction of stock market prices - supposedly predict based on what has happened before
Does the standard neural network account for time correlation?
No, the output zi only depends on the input xi, where i represents a time step; i.e. the prediction the ith time we run the network depends only on the ith inputs.
How do we make the output Z(i+1) depend on inputs x(i+1) and x(i)?
What scenarios may we want this?
We could give the network inputs from the earlier timepoints as well.
- Self-driving car sees now and what it saw previously
- Sentence prediction: we need more than one word to infer context (though not all sentences have the same length, so we would need a different-sized network each time).
What is the problem with simply having the output Z(i+1) depend on inputs x(i+1) and x(i)?
It depends only on one timestep previously. We generally have to look much further back. We want to account for correlations between points in a sequence.
How can we account for correlations between points in a sequence?
Use the output from point i as an input to point i + 1
Linking up networks, we can take inputs of the current step and outputs of previous runs of the network. This means that the output will depend on the outputs at all previous time steps.
This acts as a very deep neural network
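A minimal NumPy sketch of this feedback idea (the function and variable names are illustrative assumptions, not the lecture's exact notation): the same weights Wx and Wz are reused at every time step, and the previous output is fed back in, starting from zero.

```python
import numpy as np

def rnn_forward(xs, Wx, Wz, b):
    """Run an RNN over a sequence xs; each output depends on all earlier inputs."""
    z = np.zeros(Wz.shape[0])           # assume the initial fed-back output is zero
    outputs = []
    for x in xs:                        # the same Wx, Wz, b are shared across steps
        z = np.tanh(Wx @ x + Wz @ z + b)   # tanh is one common choice of activation
        outputs.append(z)
    return outputs
```

Unrolling this loop step by step is exactly what gives the very deep network described above.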
Define a recurrent neural network.
A deep network where the weights are shared between layers.
How are recurrent neural networks often represented?
A single network with the output feeding back in.
It is no longer necessarily a feed-forward network, you’re also feeding back into the network.
Can also be shown with several networks side-by-side
Describe the architecture of a RNN.
Unlike traditional deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across time steps, allowing them to remember information over sequences.
If we consider the perceptron, we have some set of features xi connected to a single hidden layer of nodes. Each feature is connected to each of the nodes, and each node has its own output. This enables features to be combined and activations applied to produce outputs. All of those outputs go into the next layer.
[See flashcard]
In the most general case, we have n1 inputs and n0 outputs. For text recognition, an input xi is a representation of a word and zi is a probability distribution over the vocabulary.
What are the sets of weights associated with RNNs?
There are two sets of weights (Wx for the inputs and Wz for the outputs):
- Input features (xi) -> hidden nodes, e.g. 7 input features and 5 nodes = 35 weights
- Hidden nodes -> outputs (Zi), e.g. 5 nodes to 5 outputs = 25 weights
- Total connections needed = 60
- Then we may also have biases: +5 = 65
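The weight count in this example can be checked with a quick bit of arithmetic (a sketch of the counting only, assuming fully connected layers as in the flashcard):

```python
n_inputs, n_hidden, n_outputs = 7, 5, 5

wx = n_inputs * n_hidden      # input features -> hidden nodes: 7 * 5 = 35
wz = n_hidden * n_outputs     # hidden nodes  -> outputs:       5 * 5 = 25
biases = n_hidden             # one bias per hidden node:           + 5

print(wx + wz)                # 60 connections
print(wx + wz + biases)       # 65 parameters in total
```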
What is the set of activations associated with the RNN for a perceptron?
Z(0) = 0 - we assume that the first time we run the network, the fed-back output is zero
[See flashcard]
What function do we apply in an RNN to get probabilities that add to 1?
Softmax function
What is the formula for the softmax function for a two-state classification problem?
[See flashcard]
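The softmax itself is standard: exponentiate each score and normalise so the results sum to 1. For two states this reduces to the logistic sigmoid of the difference between the two scores (a standard identity; the numbers below are made up for illustration).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 0.5])          # two-state classification scores
p = softmax(z)
print(p, p.sum())                 # probabilities that add to 1

# Two-state case: softmax(z)[0] == sigmoid(z[0] - z[1])
print(1.0 / (1.0 + np.exp(-(z[0] - z[1]))))
```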
How do we train an RNN?
An RNN is essentially just a (very) deep network and is trained like one.
What does an RNN “unroll” to give?
The network is "unrolled" to give a deep network.
What is a difference between a standard deep neural network and an RNN?
In a standard deep neural network, you would have different weights between nodes but here we have shared weights.
E.g. the weight between node 1 and node 2 in "layer 1" is the same as the weight between node 1 and node 2 in "layer 2".
How would we find the error of an RNN?
Back-propagation gives the error derivatives.
- The same weight appears several times in the unrolled network, so its derivative contributions are summed.
We are propagating backwards through the network and through time (back-propagation through time).
What may we encounter with RNNs?
An RNN is a very deep network, trained by back-propagation.
We may encounter the vanishing gradient problem - the further back we go through the network, the more likely the gradient is to be (almost) zero. [Because the unrolled RNN is such a deep network, there are even more opportunities for the gradient to go to zero.]
This limits the memory of the network - the model is more sensitive to recent inputs than to ones from further back in time. In reality we only change the weights based on the current input and a few previous time steps (rather than all of them).
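A toy illustration of why this happens: back-propagating through many time steps multiplies many per-step factors together, and if each factor is a little below 1 the product shrinks towards zero (the factor of 0.9 is purely an illustrative assumption).

```python
grad_factor = 0.9                      # assumed per-step gradient factor, just below 1
for steps_back in (1, 10, 50, 100):
    print(steps_back, grad_factor ** steps_back)
# 1 -> 0.9, 10 -> ~0.35, 50 -> ~0.005, 100 -> ~0.00003:
# inputs far back in time barely influence the weight update
```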
How can we solve the vanishing gradient problem in RNNs?
Solved using long short-term memory (LSTM)
What is long short-term memory (LSTM)?
LSTM (long short-term memory) is a recurrent neural network (RNN) architecture widely used in deep learning. It has a more complex architecture that can remember (and forget) over arbitrary intervals.
A traditional RNN has a single hidden state that is passed through time, which can make it difficult for the network to learn long-term dependencies. LSTM models address this problem by introducing a memory cell, which is a container that can hold information for an extended period.
What kind of memory do LSTMs have?
An explicit memory - this keeps updating each time
Describe the elements of the LSTM network.
Bring up diagram and describe it.
What is a disadvantage of LSTM?
Extra computational expense.
How does LSTM avoid the vanishing gradient problem?
It has an explicit memory which is insensitive to gradients.
How do LSTM layers compare to standard RNN layers?
LSTM layers have more components than standard RNN layers.
What “goes in” to the LSTM layer?
An input, as well as the previous hidden node.
This is the same as for the RNN.
Where do we get the output from in a LSTM?
Output comes from a hidden node.
Describe the explicit memory of the LSTM layer and how it is updated.
There is an explicit memory, with several slots for previous information.
We update the memory every time we encounter a new data point. E.g. in the example, the memory has 7 slots. This contains information on what we have seen before.
Firstly, we decide what we can forget. The previous output and the current input are combined to give multiple outputs, forming a "forget gate". The output of this gate is a set of numbers between 0 and 1, e.g. 1 = remember, 0 = forget.
We multiply the memory unit element-wise by the forget gate to produce an updated memory.
Next, we want to put new information into the memory. Two nodes take input from xi and mi-1 (the previous output). One applies a sigmoid and the other a tanh activation function. The sigmoid gives an output between 0 and 1, whereas the tanh gives a value between -1 and 1, determining whether the update should be positive or negative. These are multiplied together to give the input gate, and the resulting values are added to the memory to update it.
Everything is then combined in the output gate, which passes information to a final unit that makes a prediction.
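A minimal NumPy sketch of one LSTM step along these lines (the weight names, the dictionary layout and the concatenation of previous output with current input are assumptions for illustration, not the lecture's exact notation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, m_prev, c_prev, W, b):
    """One LSTM step: x = current input, m_prev = previous output, c_prev = memory."""
    v = np.concatenate([m_prev, x])      # combine previous output and current input
    f = sigmoid(W["f"] @ v + b["f"])     # forget gate: 1 = remember, 0 = forget
    i = sigmoid(W["i"] @ v + b["i"])     # input gate: how much new information to add
    g = np.tanh(W["g"] @ v + b["g"])     # candidate values, positive or negative
    c = f * c_prev + i * g               # memory update: forget some slots, add new info
    o = sigmoid(W["o"] @ v + b["o"])     # output gate
    m = o * np.tanh(c)                   # new output, passed on to make the prediction
    return m, c
```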
What are the advantages of the LSTM?
- Explicit memory - the memory of an LSTM is only changed by the forget or input gates.
- Safe from vanishing gradients - there are no nonlinearities in the memory update itself, so the gradient does not vanish when back-propagated along the memory path.
Which gates change the memory of an LSTM?
The forget or input gates.
What is a disadvantage of the LSTM?
LSTMs have a much more complex architecture than standard RNNs: they involve more weights and take longer to compute, so they are more computationally expensive.
As there are extra weights, they are more prone to overfitting.
What is an example use of LSTMs?
Google’s lm_1b - it was trained on 1 billion words. It is a combination of convolutional and recurrent layers.
What are GRUs?
Gated recurrent units
How do GRUs and LSTMs compare?
GRUs are similar to LSTMs, but simpler (they have no output gate).
With fewer components there are fewer parameters, so GRUs are less computationally expensive.
The memory is now transmitted via the hidden units.
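A minimal sketch of one GRU step in the same style (this is the standard formulation with reset and update gates; the exact naming and blending convention vary between references):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W, b):
    """One GRU step: the memory is carried entirely in the hidden state h."""
    v = np.concatenate([h_prev, x])
    r = sigmoid(W["r"] @ v + b["r"])                                   # reset gate
    u = sigmoid(W["u"] @ v + b["u"])                                   # update gate
    h_cand = np.tanh(W["h"] @ np.concatenate([r * h_prev, x]) + b["h"])
    return (1.0 - u) * h_prev + u * h_cand    # blend old state with candidate state
```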
What gates do LSTMs and GRUs have?
LSTMs have input, forget and output gates
GRUs have reset and update gates.
How do LSTMs and GRUs compare?
- LSTMs have input, forget and output gates. GRUs have reset and update gates.
- LSTMs have an explicit memory that is updated, GRUs do not
- Performance of the two is very similar
What are some real-world applications of RNNs?
- Semantic search - eg web search engines, extract meaning from keywords
- Anomaly detection - finding unusual behaviour in eg banking, detection of fraud
- Stock prices - prediction of the prices of stocks based on previous behaviour
What are Bi-Directional RNNs?
Bidirectional recurrent neural networks (BRNNs) process sequential data in both directions, forwards and backwards through the sequence.
This is important when the "context" of our input matters, such as recognition of handwriting - a word in a sentence may require what comes before and after it to be interpreted.
We can improve the current prediction by using future inputs as well as the preceding output to understand the context, i.e. knowing the words/letters that come after can help "narrow down" the prediction.
Bi-directional recurrent neural networks include information from future as well as past inputs.
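A minimal sketch of the bidirectional idea (the plain RNN pass and the concatenation of the two directions are illustrative assumptions): run one RNN forwards over the sequence and another backwards, then combine the two outputs at each step.

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Plain RNN pass over a sequence, returning the hidden state at each step."""
    h, out = np.zeros(Wh.shape[0]), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        out.append(h)
    return out

def birnn_forward(xs, fwd, bwd):
    forward = rnn_pass(xs, *fwd)                  # left to right: past context
    backward = rnn_pass(xs[::-1], *bwd)[::-1]     # right to left: future context
    # Each combined output now sees both past and future inputs.
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```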
How do BRNNs compare to standard RNNs?
They are much slower to implement - a forward pass through the network requires us to propagate both ways through the sequence.
This means gradient calculations are also more expensive.
How often are BRNNs used?
Generally used quite sparingly
What are recursive neural networks?
They are a generalisation of recurrent neural networks.
While a recurrent neural network is a network with feedback that unrolls into a linear chain, a recursive neural network can have any kind of hierarchical structure.