RNN Flashcards
For what kind of problems could we use a many-to-one RNN?
Sentence classification, speech recognition (sound waves to one word), video classification
For what kind of problems could we use a one-to-many RNN?
Image captioning, music composition, natural text generation.
For what kind of problems could we use a many-to-many RNN?
Frame-level video classification, speech enhancement, continuous emotion prediction
What makes an RNN different from feedforward (FW) networks?
It has feedback: hidden states from earlier timesteps are fed back as inputs.
What is the formula for updating the hidden state and calculating the output at a timestep in a simple RNN cell?
h_t = tanh(W_{hh} * h_{t-1} + W_{xh} * x_t)
y_t = W_{hy} * h_t
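A minimal NumPy sketch of one such timestep (function and variable names are illustrative, and bias terms are omitted):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy):
    """One timestep of a simple RNN cell."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # update hidden state
    y_t = W_hy @ h_t                           # compute output
    return h_t, y_t

# Illustrative shapes: input size 3, hidden size 4, output size 2
rng = np.random.default_rng(0)
h_t, y_t = rnn_step(rng.normal(size=3), np.zeros(4),
                    rng.normal(size=(4, 4)), rng.normal(size=(4, 3)),
                    rng.normal(size=(2, 4)))
```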
What is direct feedback?
The hidden state is used in the same cell at the next timestep
What is indirect feedback?
The hidden state is connected to a previous cell at the next timestep
What is lateral feedback?
A cell is connected to a cell in the same layer.
How does lateral feedback often affect the output of a layer?
Cells strengthen themselves while weakening others, so the strongest cell becomes active (winner-take-all).
What is an RNN with symmetrical connections to all other cells called?
A Hopfield network
What is the main challenge of using deep RNNs?
Vanishing/exploding gradients. Batch normalization and dropout layers help.
What is a bidirectional RNN?
Cells see inputs both from the past and the future.
What is an LSTM (long short-term memory) cell?
A cell with a separate path for the cell state, ensuring better gradient propagation.
What is the forget gate in an LSTM cell?
Controls how much of the previous cell state to remember.
What is the input gate in an LSTM cell?
Controls how much to write to the cell state.
What is the output gate in an LSTM cell?
Controls how much to output from the cell.
How are the values for the different gates calculated in an LSTM cell?
i = sigmoid(W_i * [h_{t-1}, x_t])
f = sigmoid(W_f * [h_{t-1}, x_t])
o = sigmoid(W_o * [h_{t-1}, x_t])
g = tanh(W_g * [h_{t-1}, x_t])
c_t = f * c_{t-1} + i * g
h_t = o * tanh(c_t)
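A minimal NumPy sketch of one LSTM step following these equations (weight names and the omitted biases are simplifications):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_g):
    """One LSTM timestep on the concatenated input [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W_i @ z)        # input gate: how much to write
    f = sigmoid(W_f @ z)        # forget gate: how much of c_{t-1} to keep
    o = sigmoid(W_o @ z)        # output gate: how much to reveal
    g = np.tanh(W_g @ z)        # candidate cell content
    c_t = f * c_prev + i * g    # elementwise only: no matrix multiply on c
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Note that the c_t update uses only elementwise operations, which is exactly why gradients propagate well along the cell-state path.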
How do the weight matrices in an RNN differ for different timesteps?
They are the same for all timesteps.
Why is the gradient flow improved in an LSTM?
The cell state is updated with elementwise operations only (no matrix multiplication), so gradients can flow back through time without being repeatedly multiplied by a weight matrix.
What are peephole connections in LSTM?
c_{t-1} is connected to the forget, input and output gates
What is a GRU (Gated Recurrent Unit)?
Compared to an LSTM, the cell state is eliminated and only two gates are used, reset and update. The hidden state path still avoids matrix multiplication, allowing efficient gradient propagation.
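A sketch of one GRU step, assuming the common update/reset formulation (conventions for z versus 1 - z vary between sources; biases omitted):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU timestep: update gate z, reset gate r, no separate cell state."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                      # update gate
    r = sigmoid(W_r @ hx)                                      # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    h_t = (1 - z) * h_prev + z * h_cand  # direct h_prev -> h_t path is elementwise
    return h_t
```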
What is the main advantage of GRU compared to LSTM?
GRU cells have fewer parameters and often perform comparably to LSTMs.
What is pooling over time?
The output can be an average, max, sum, etc. over time of the outputs of the individual cells (for many-to-one problems).
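For example, with the per-timestep outputs stacked into an array (shapes are dummy values for illustration):

```python
import numpy as np

outputs = np.random.randn(10, 4)    # (T, d): one d-dimensional output per timestep

mean_pooled = outputs.mean(axis=0)  # average over time
max_pooled = outputs.max(axis=0)    # max over time
sum_pooled = outputs.sum(axis=0)    # sum over time
```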
How can we solve problems with a high input dimension at each timestep, over many timesteps?
CNN + RNN: use a CNN to extract compact features from each timestep's input, then feed them into the RNN.
How can the vanishing/exploding gradient in a basic RNN be alleviated?
Exploding: clipping the gradient
Vanishing: changing to LSTM (or GRU)
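A common form of gradient clipping is rescaling by the global norm; a sketch (the threshold 5.0 is an arbitrary illustrative value):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```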
What is BPTT (backpropagation through time)?
In BPTT we run the forward phase for the entire sequence (over time) without updating the weights. We then calculate the loss over all outputs of the RNN and let the gradient propagate back to earlier states (through time).
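A compact NumPy sketch of BPTT for the simple tanh RNN above, assuming a squared-error loss at each timestep (loss choice and names are illustrative; biases omitted):

```python
import numpy as np

def bptt(xs, ys, W_hh, W_xh, W_hy, h0):
    """Forward over the whole sequence, then backpropagate through time."""
    T = len(xs)
    hs, y_hats = {-1: h0}, {}
    for t in range(T):                          # forward phase, no weight updates
        hs[t] = np.tanh(W_hh @ hs[t - 1] + W_xh @ xs[t])
        y_hats[t] = W_hy @ hs[t]
    dW_hh, dW_xh, dW_hy = (np.zeros_like(W) for W in (W_hh, W_xh, W_hy))
    dh_next = np.zeros_like(h0)
    for t in reversed(range(T)):                # backward phase, through time
        dy = y_hats[t] - ys[t]                  # gradient of squared error at t
        dW_hy += np.outer(dy, hs[t])
        dh = W_hy.T @ dy + dh_next              # from output and from the future
        dz = (1 - hs[t] ** 2) * dh              # through tanh
        dW_hh += np.outer(dz, hs[t - 1])
        dW_xh += np.outer(dz, xs[t])
        dh_next = W_hh.T @ dz                   # propagate to earlier states
    return dW_hh, dW_xh, dW_hy
```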
What is sequence representation learning with RNNs?
Use an RNN "encoder" to get a single output from a sequence, and an RNN "decoder" to turn that output into a sequence. One can also use "attention": a combination of the outputs from each cell in the encoder, used as additional input to each cell in the decoder.
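A minimal sketch of the encoder-decoder idea without attention (real decoders usually also feed back the previous output; this is a simplification):

```python
import numpy as np

def encode(xs, W_hh, W_xh, h0):
    """RNN encoder: compress an input sequence into a single hidden state."""
    h = h0
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
    return h  # the sequence representation

def decode(h_enc, W_hh, W_hy, steps):
    """RNN decoder: unroll the encoded state into an output sequence."""
    h, outputs = h_enc, []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)       # simplified: no external decoder input
        outputs.append(W_hy @ h)
    return outputs
```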
What is the motivation behind CTC (Connectionist Temporal Classification)?
If we want to translate audio into e.g. text with a normal RNN, we need a frame-level alignment between the audio frames and the output characters, which is usually not available. CTC removes this requirement by introducing a blank token and summing over all possible alignments during training.