LSTM Flashcards
Define LSTM
Long Short-Term Memory
It is a more complex form of recurrent unit. Instead of just adding the effects of the current input and the past, an LSTM has gating units that can turn these effects on or off based on the input.
These gates have their own parameters that are trained during backpropagation.
What are the three different gates?
- Forget Gate - removes information from the cell state
- Input Gate - adds information to the cell state
- Output Gate - calculates the hidden state
The forget gate and input gate together update the cell state, as sketched below.
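A minimal NumPy sketch of the three gate computations (the weight and bias names W_f, W_i, W_o, b_f, b_i, b_o and the concatenated-input formulation are illustrative assumptions, not a specific library's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, W_f, W_i, W_o, b_f, b_i, b_o):
    """Compute the three gate activations for one time step."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to remove from the cell state
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what to add to the cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose as the hidden state
    return f_t, i_t, o_t
```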
What is the hyperparameter in an LSTM?
The number of nodes in the neural network inside each gate (the hidden size).
The dimensions of the cell state, the hidden state, and all the gate outputs are the same and equal to this number of nodes.
However, the dimension of the input may be different.
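A quick shape check in PyTorch (the sizes 10 and 32 are illustrative; hidden_size is the hyperparameter and input_size is the input dimension):

```python
import torch
import torch.nn as nn

# input_size (10) can differ from hidden_size (32);
# hidden and cell states both have dimension hidden_size.
lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
x = torch.randn(4, 7, 10)        # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)                 # torch.Size([4, 7, 32])
print(h_n.shape, c_n.shape)      # both torch.Size([1, 4, 32])
```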
Why is f_t * c_{t-1} called the forget step?
Since the output layer of f_t uses a sigmoid, its values lie in [0, 1], so the element-wise product f_t * c_{t-1} is scaled between 0 and c_{t-1}. For example, if an entry of f_t is 0.1, only 10% of the corresponding entry of the old cell state is kept; if it is 0, that entry is forgotten entirely.
How is c_t (the cell state) calculated?
c_t = f_t * c_{t-1} + i_t * c̃_t, where c̃_t is the candidate cell state.
All multiplications and the addition are element-wise.
How is h_t (the hidden state) calculated?
h_t = o_t * tanh(c_t), again element-wise.
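Putting the pieces together, a self-contained NumPy sketch of one full LSTM step (again with illustrative weight names, including W_c and b_c for the candidate cell state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM step: gates -> new cell state c_t -> new hidden state h_t."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state c~_t
    c_t = f_t * c_prev + i_t * c_tilde    # c_t = f_t * c_{t-1} + i_t * c~_t (element-wise)
    h_t = o_t * np.tanh(c_t)              # h_t = o_t * tanh(c_t)
    return h_t, c_t
```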
What is GRU and why was it needed?
A Gated Recurrent Unit is a simpler version of the LSTM with fewer gates and less computation.
It was needed because the LSTM's extra gates and larger number of parameters make it complex and computationally expensive.
A GRU has two gates:
* Reset gate
* Update gate
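A NumPy sketch of one GRU step under the same assumptions (illustrative weight names; this follows one common formulation where the update gate blends the old hidden state with a candidate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step: reset and update gates, then the new hidden state."""
    v = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ v + b_r)          # reset gate: how much of the past to use in the candidate
    z_t = sigmoid(W_z @ v + b_z)          # update gate: blend of old state vs. candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    h_t = (1 - z_t) * h_prev + z_t * h_tilde   # only one state: the hidden state
    return h_t
```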
What is the difference between LSTM and GRU?
LSTM: Three gates - forget, input, output
GRU: Two gates - reset, update
LSTM: Two states - cell & hidden
GRU: One state - Hidden
LSTM has more parameters than GRU.
LSTM is computationally more expensive due to the extra gate and the cell state.
GRU is preferred for simpler tasks.
LSTM is preferred for complex tasks.
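A quick way to see the parameter difference in PyTorch (sizes are illustrative; at the same input and hidden sizes the LSTM has 4 weight blocks per layer versus 3 for the GRU):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Same sizes for both; only the cell type differs.
lstm = nn.LSTM(input_size=10, hidden_size=32)
gru = nn.GRU(input_size=10, hidden_size=32)
print("LSTM params:", count_params(lstm))   # more parameters (4 gate/candidate blocks)
print("GRU params: ", count_params(gru))    # fewer parameters (3 blocks)
```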
When to use deep RNNs?
Deep RNNs stack multiple RNNs (recurrent layers) on top of one another. Use them:
- For advanced tasks like speech recognition and machine translation
- When we have a large dataset
- If we have enough computational power
A deep RNN maintains a hierarchical structure: the initial layers capture word-level dependencies, and as we go deeper, sentence- and paragraph-level dependencies are captured.
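In PyTorch, stacking is just the num_layers argument (sizes illustrative):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: the second layer consumes the
# hidden-state sequence produced by the first.
deep_lstm = nn.LSTM(input_size=10, hidden_size=32, num_layers=2, batch_first=True)
x = torch.randn(4, 7, 10)        # (batch, seq_len, input_size)
out, (h_n, c_n) = deep_lstm(x)
print(out.shape)                 # torch.Size([4, 7, 32]) - outputs of the top layer
print(h_n.shape)                 # torch.Size([2, 4, 32]) - one hidden state per layer
```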
What is BiRNN/BiLSTM/BiGRU
Bi-directional: one pass reads the sequence forward, another reads it backward, and their hidden states are combined at each time step.
Used when the context of a word depends on future words.
Ex - "I love Amazon" (the website) vs. "I love the Amazon" (the river); the following words are needed to disambiguate "Amazon".
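A PyTorch sketch of a BiLSTM (sizes illustrative); the forward and backward hidden states are concatenated, so the output dimension doubles:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True, bidirectional=True)
x = torch.randn(4, 7, 10)        # (batch, seq_len, input_size)
out, (h_n, c_n) = bilstm(x)
print(out.shape)                 # torch.Size([4, 7, 64]) - 2 * hidden_size
print(h_n.shape)                 # torch.Size([2, 4, 32]) - forward and backward states
```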
Applications of BiRNN/BiLSTM/BiGRU
NER, POS tagging, Machine Translation, Sentiment Analysis, Time Series Forecasting
Disadvantages of BiRNN/BiLSTM/BiGRU
- Overfitting (due to increased complexity)
- Latency issues in real-time applications like speech recognition, since the backward pass needs the future input before it can run