LSTM Flashcards
Define LSTM
Long Short-Term Memory
It is a more complex form of recurrent unit. Instead of just adding the effects of the current input and the past, an LSTM has gating units that can turn these effects on or off based on the input.
These gates have their own parameters that are trained during backpropagation.
What are the three different gates?
- Forget Gate - removes information from the cell state
- Input Gate - adds information to the cell state
- Output Gate - calculates the hidden state
The forget gate and input gate together update the cell state, as sketched below.
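A minimal NumPy sketch of the three gate computations (the weight and bias names W_f, W_i, W_o, b_f, b_i, b_o and the concatenated-input formulation are illustrative assumptions, not a specific library's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, W_f, W_i, W_o, b_f, b_i, b_o):
    """Compute the three gate activations for one time step."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to remove from the cell state
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what to add to the cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose as the hidden state
    return f_t, i_t, o_t
```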
What is the hyperparameter in an LSTM?
The number of nodes in the neural network inside each gate (the hidden size).
The dimensions of the cell state, the hidden state, and all the gate outputs are the same and equal to this number of nodes.
However, the dimension of the input may be different.
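A quick shape check in PyTorch (the sizes 10 and 32 are illustrative; hidden_size is the hyperparameter and input_size is the input dimension):

```python
import torch
import torch.nn as nn

# input_size (10) can differ from hidden_size (32);
# hidden and cell states both have dimension hidden_size.
lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
x = torch.randn(4, 7, 10)        # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)                 # torch.Size([4, 7, 32])
print(h_n.shape, c_n.shape)      # both torch.Size([1, 4, 32])
```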
Why is f_t * c_{t-1} called the forget step?
Since the output layer of f_t uses a sigmoid, its values lie in [0, 1], so the element-wise product f_t * c_{t-1} is scaled between 0 and c_{t-1}. For example, if an entry of f_t is 0.1, only 10% of the corresponding entry of the old cell state is kept; if it is 0, that entry is forgotten entirely.
How is c_t (the cell state) calculated?
c_t = f_t * c_{t-1} + i_t * c̃_t, where c̃_t is the candidate cell state.
All multiplications and the addition are element-wise.
How is h_t (the hidden state) calculated?
h_t = o_t * tanh(c_t), again element-wise.
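Putting the pieces together, a self-contained NumPy sketch of one full LSTM step (again with illustrative weight names, including W_c and b_c for the candidate cell state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM step: gates -> new cell state c_t -> new hidden state h_t."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state c~_t
    c_t = f_t * c_prev + i_t * c_tilde    # c_t = f_t * c_{t-1} + i_t * c~_t (element-wise)
    h_t = o_t * np.tanh(c_t)              # h_t = o_t * tanh(c_t)
    return h_t, c_t
```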
What is GRU and why was it needed?
A Gated Recurrent Unit is a simpler version of the LSTM with fewer gates and less computation.
It was needed because the LSTM's extra gates and larger number of parameters make it complex and computationally expensive.
A GRU has two gates:
* Reset gate
* Update gate
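A NumPy sketch of one GRU step under the same assumptions (illustrative weight names; this follows one common formulation where the update gate blends the old hidden state with a candidate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step: reset and update gates, then the new hidden state."""
    v = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ v + b_r)          # reset gate: how much of the past to use in the candidate
    z_t = sigmoid(W_z @ v + b_z)          # update gate: blend of old state vs. candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    h_t = (1 - z_t) * h_prev + z_t * h_tilde   # only one state: the hidden state
    return h_t
```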
What is the difference between LSTM and GRU?
LSTM: Three gates - forget, input, output
GRU: Two gates - reset, update
LSTM: Two states - cell & hidden
GRU: One state - Hidden
LSTM has more parameters than GRU.
LSTM is computationally more expensive due to the extra gate and the cell state.
GRU is preferred for simpler tasks.
LSTM is preferred for complex tasks.
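A quick way to see the parameter difference in PyTorch (sizes are illustrative; at the same input and hidden sizes the LSTM has 4 weight blocks per layer versus 3 for the GRU):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Same sizes for both; only the cell type differs.
lstm = nn.LSTM(input_size=10, hidden_size=32)
gru = nn.GRU(input_size=10, hidden_size=32)
print("LSTM params:", count_params(lstm))   # more parameters (4 gate/candidate blocks)
print("GRU params: ", count_params(gru))    # fewer parameters (3 blocks)
```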
When to use deep RNNs?
Deep RNNs stack multiple RNNs (recurrent layers) on top of one another. Use them:
- For advanced tasks like speech recognition and machine translation
- When we have a large dataset
- If we have enough computational power
A deep RNN maintains a hierarchical structure: the initial layers capture word-level dependencies, and as we go deeper, sentence- and paragraph-level dependencies are captured.
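In PyTorch, stacking is just the num_layers argument (sizes illustrative):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: the second layer consumes the
# hidden-state sequence produced by the first.
deep_lstm = nn.LSTM(input_size=10, hidden_size=32, num_layers=2, batch_first=True)
x = torch.randn(4, 7, 10)        # (batch, seq_len, input_size)
out, (h_n, c_n) = deep_lstm(x)
print(out.shape)                 # torch.Size([4, 7, 32]) - outputs of the top layer
print(h_n.shape)                 # torch.Size([2, 4, 32]) - one hidden state per layer
```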
What is BiRNN/BiLSTM/BiGRU
Bi-directional: one pass reads the sequence forward, another reads it backward, and their hidden states are combined at each time step.
Used when the context of a word depends on future words.
Ex - "I love Amazon" (the website) vs. "I love the Amazon" (the river); the following words are needed to disambiguate "Amazon".
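A PyTorch sketch of a BiLSTM (sizes illustrative); the forward and backward hidden states are concatenated, so the output dimension doubles:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True, bidirectional=True)
x = torch.randn(4, 7, 10)        # (batch, seq_len, input_size)
out, (h_n, c_n) = bilstm(x)
print(out.shape)                 # torch.Size([4, 7, 64]) - 2 * hidden_size
print(h_n.shape)                 # torch.Size([2, 4, 32]) - forward and backward states
```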
Applications of BiRNN/BiLSTM/BiGRU
NER, POS tagging, Machine Translation, Sentiment Analysis, Time Series Forecasting
Disadvantages of BiRNN/BiLSTM/BiGRU
- Overfitting (due to increased complexity)
- Latency issues in real-time applications like speech recognition, since the backward pass needs the future input before it can run