Class 16-17 Flashcards
Explain the parameters used in real-time recurrent neural networks in detail.
h(0) = 0
a_h(t) = W h(t-1) + U x(t) + b
h(t) = tanh(a_h(t))
a_o(t) = V h(t) + c
o(t) = tanh(a_o(t))
L = ||y - o(t)||^2
Here W is the hidden-to-hidden (recurrent) weight matrix, U the input-to-hidden weight matrix, V the hidden-to-output weight matrix, and b and c the hidden and output bias vectors. h(t) is the hidden state (initialized to zero), o(t) is the output, and L is the squared-error loss against the target y.
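A minimal NumPy sketch of this forward pass (the parameter names W, U, V, b, c follow the equations above; the function and variable names are illustrative assumptions):

```python
import numpy as np

def rnn_forward(x_seq, W, U, V, b, c):
    """Vanilla RNN forward pass following the equations above."""
    h = np.zeros(W.shape[0])           # h(0) = 0
    outputs = []
    for x_t in x_seq:                  # x_seq: sequence of input vectors x(1)..x(T)
        a_h = W @ h + U @ x_t + b      # hidden pre-activation a_h(t)
        h = np.tanh(a_h)               # h(t) = tanh(a_h(t))
        a_o = V @ h + c                # output pre-activation a_o(t)
        outputs.append(np.tanh(a_o))   # o(t) = tanh(a_o(t))
    return outputs

def squared_error(y, o_t):
    return np.sum((y - o_t) ** 2)      # L = ||y - o(t)||^2
```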
What is a long-term dependency in an RNN? Why is it a problem to learn long-term dependencies in an RNN, and how can we cope with it?
Learning long-term dependencies is really hard. A long-term dependency arises when the output at time t depends on the input at time t-T, where T >> 1.
x(t-T) -> y(t)
x(t-100) -> y(t)
In order to output the correct y(t), the network needs to:
- recognize that y(t) depends on x(t-T)
- store x(t-T) and use it when generating y(t)
Because of the long gap, such dependencies are hard to learn by gradient descent: the gradient vanishes when propagated back in time,
lim_{(t-T) -> ∞} ∂h(t)/∂h(T) = 0
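A rough numerical illustration of this (a toy setup, not from the lecture): unrolling the chain rule gives ∂h(t)/∂h(T) as a product of per-step Jacobians diag(1 - h(k)^2) W, and its norm shrinks towards zero as the gap t - T grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
W = rng.normal(scale=0.4 / np.sqrt(n), size=(n, n))  # small random recurrent weights
U = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))  # random input weights

h = np.zeros(n)
J = np.eye(n)                            # accumulated Jacobian d h(t) / d h(T)
for step in range(1, 101):
    h = np.tanh(W @ h + U @ rng.normal(size=n))
    J = np.diag(1.0 - h ** 2) @ W @ J    # chain rule through one more time step
    if step % 20 == 0:
        print(step, np.linalg.norm(J))   # the norm decays towards zero
```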
To cope with this problem, a few architectures have been introduced.
One is the (vanilla) LSTM:
The LSTM has a memory cell that acts as a linear memory; because of this linearity, the gradient can flow back through time without vanishing.
The LSTM also has gates: input, output and forget units. The gates have sigmoidal activation functions, i.e. a nonlinearity. If a gate receives a very large input, the sigmoid outputs (approximately) 1, and if the input is very negative it outputs (approximately) 0; in this way the gates control the flow of information through the cell.
If the input gate is on, the input to the memory cell is accepted (written into the cell).
If the output gate is on, the current content of the memory cell is read out as the output.
If the forget gate is off, the content of the memory cell is reset to zero.
Along the memory cell the gradient is not repeatedly scaled down by the Jacobian of a squashing nonlinearity, so it does not vanish.
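A minimal sketch of one LSTM step in NumPy, assuming a params dict with recurrent weights W_*, input weights U_* and biases b_* (the naming convention is an assumption, not fixed by the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    i = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])  # input gate: accept new input?
    f = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])  # forget gate: keep old cell content?
    o = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])  # output gate: expose cell content?
    g = np.tanh(p["W_g"] @ h_prev + p["U_g"] @ x_t + p["b_g"])  # candidate cell input
    c = f * c_prev + i * g        # linear memory cell update
    h = o * np.tanh(c)            # gated read-out of the cell
    return h, c
```

The key line is c = f * c_prev + i * g: the cell state is updated additively (linearly), so the backward pass is not forced through a squashing nonlinearity at every time step.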
Which gate is crucial in the LSTM? Why? What structure is obtained by using only the forget gate?
The forget gate is crucial, because forgetting things once in a while helps memory and learning; in particular, it is the gate that decides whether information should be thrown away or kept.
The structure obtained by keeping only forget-style gating is the GRU. The GRU has no separate input or output gate; it has two gates, z and r: z is the update gate and r is the reset gate. z is used to choose whether the current hidden state h should be updated to the candidate h'.
The reset gate, on the other hand, decides whether the previous hidden state is ignored when computing the candidate.
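A corresponding GRU step sketch, with the same assumed naming convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    z = sigmoid(p["W_z"] @ h_prev + p["U_z"] @ x_t + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x_t + p["b_r"])  # reset gate
    h_cand = np.tanh(p["W_h"] @ (r * h_prev) + p["U_h"] @ x_t + p["b_h"])  # candidate h'
    return (1.0 - z) * h_prev + z * h_cand   # z chooses between keeping h and moving to h'
```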
What is reservoir computing? What are echo state networks and liquid state machines? What is the main idea?
They are used to side-step the vanishing-gradient problem of RNNs.
The main idea is that the input-to-hidden and hidden-to-hidden connections are fixed to random values, and the only learnable part is the hidden-to-output connections.
This avoids the vanishing gradient because there is no backpropagation through time: the recurrent connections are fixed and never change. Learning only concerns the hidden-to-output connections (the readout).
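A minimal echo state network sketch (toy one-step-ahead prediction task; the sizes, scalings and the ridge-regression readout are illustrative assumptions): the reservoir weights W_in and W are random and fixed, and only W_out is fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 200

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))   # fixed random input-to-hidden weights
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))     # fixed random hidden-to-hidden weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # keep spectral radius below 1

u = np.sin(np.arange(1000) * 0.1).reshape(-1, 1)    # toy input signal
y = np.roll(u, -1, axis=0)                          # target: the next value of the signal

h = np.zeros(n_res)
states = []
for t in range(len(u)):                             # run the (untrained) reservoir
    h = np.tanh(W_in @ u[t] + W @ h)
    states.append(h.copy())
X = np.array(states)

# Only the hidden-to-output readout is learned, here by ridge regression (no backprop).
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```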