6. Recurrent Language Model Flashcards
What are some advantages and disadvantages of n-gram based modeling?
Advantages:
+ Highly Scalable; Simple assumptions
+ Computationally Tractable (count based)
Disadvantages:
- Sparsity: any n-gram not seen in training gets probability 0, which gets worse as n grows (see the sketch below)
- Poor at capturing long-range dependencies
- Purely symbolic units, so no generalisation across similar words
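A toy count-based bigram model makes both points concrete; the corpus and words here are made up purely for illustration:

```python
from collections import Counter

# Tiny count-based bigram model: probabilities are just counts (tractable),
# but any unseen bigram gets probability 0 (the sparsity problem).
corpus = "the cat sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(p("cat", "the"))   # 0.5 -> seen bigram
print(p("dog", "the"))   # 0.0 -> unseen bigram, probability mass is simply missing
```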
What is Neural Sequence Modelling? Describe its structure
It is a Neural-Network-based Language Model (sketched in code after the list). It consists of:
- 1-hot vectors: w_t
- word vectors: v_t = W^T * w_t, where W ∈ R^(|V| x d)
- Hidden Layer: h = sigmoid(U^T [v_t-3; v_t-2; v_t-1])
- Output: y = V^T * h
- Softmax Normalization
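A minimal sketch of this fixed-window architecture, assuming a 3-word history; the vocabulary, vector and hidden sizes are made-up placeholders:

```python
import numpy as np

V_SIZE, D, H = 10_000, 64, 128                   # |V|, word-vector dim d, hidden dim
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(V_SIZE, D))      # word-vector matrix W in R^(|V| x d)
U = rng.normal(scale=0.1, size=(3 * D, H))       # hidden-layer weights
V = rng.normal(scale=0.1, size=(H, V_SIZE))      # output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def next_word_distribution(w3, w2, w1):
    """P(w_t | w_{t-3}, w_{t-2}, w_{t-1}) for word indices w3, w2, w1."""
    v = np.concatenate([W[w3], W[w2], W[w1]])    # W^T * (1-hot) = row lookup, then concatenate
    h = sigmoid(v @ U)                           # hidden layer
    y = h @ V                                    # unnormalised output scores
    e = np.exp(y - y.max())                      # softmax normalisation
    return e / e.sum()

p = next_word_distribution(12, 7, 301)           # arbitrary example word indices
print(p.shape, p.sum())                          # (10000,) 1.0
```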
What are some advantages and disadvantages of Neural Sequence Modelling?
Advantages:
+ Better generalisation on unseen n-grams
+ Smaller memory footprint (in practice usually interpolated with an n-gram Language Model)
Disadvantages:
- n-gram history is finite, so it can’t capture relationships between words too far apart.
- Poorer performance on seen n-grams due to no explicit frequency information.
Describe Full Gradient Computation using Back Prop Through Time for RNNs and discuss its advantages and disadvantages.
Use dynamic programming to compute the gradient over the entire sequence: run the whole network forward, storing every hidden state, then propagate the errors backwards through all time steps before updating the weights (see the sketch after the disadvantages).
Advantages:
+ Gradients are exact: every output's error is propagated back through every earlier time step, so long-range dependencies receive correct credit
Disadvantages:
- Memory hog (the entire sequence of hidden states must be stored)
- Slow: every weight update requires a pass over the whole sequence
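A minimal sketch of full BPTT using PyTorch autograd; the RNN cell, sizes and random data are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Full BPTT: run the WHOLE sequence forward, then one backward pass over all steps.
rnn, readout = nn.RNNCell(8, 16), nn.Linear(16, 8)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)

xs = torch.randn(100, 1, 8)            # full input sequence (T=100 steps)
ys = torch.randn(100, 1, 8)            # toy targets for each step

h = torch.zeros(1, 16)
loss = 0.0
for t in range(xs.size(0)):            # forward through the entire sequence,
    h = rnn(xs[t], h)                  # keeping every hidden state in the graph
    loss = loss + ((readout(h) - ys[t]) ** 2).mean()

opt.zero_grad()
loss.backward()                        # one backward pass through all 100 steps
opt.step()
```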
Describe Truncated Back Prop Through Time for RNNs and discuss its advantages and disadvantages.
For each output step, propagate errors back through only a limited number of recurrent transitions.
Instead of waiting for a full forward pass over the whole sequence, we run a few steps forward, compute the gradient for that chunk, update the weights, carry the hidden state forward without gradients, and repeat until the end of the sequence (see the sketch after the disadvantages).
Advantages:
+ Only a few hidden states need to be kept in memory
Disadvantages:
- Less accurate for long-range dependencies (errors never propagate further back than the truncation window)
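A sketch of the truncated loop, again with hypothetical sizes; detaching the hidden state every K steps is what cuts the graph and bounds memory:

```python
import torch
import torch.nn as nn

K = 10                                 # truncation length (hypothetical)
rnn, readout = nn.RNNCell(8, 16), nn.Linear(16, 8)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)

xs = torch.randn(100, 1, 8)
ys = torch.randn(100, 1, 8)

h = torch.zeros(1, 16)
loss = 0.0
for t in range(xs.size(0)):
    h = rnn(xs[t], h)
    loss = loss + ((readout(h) - ys[t]) ** 2).mean()
    if (t + 1) % K == 0:               # every K steps: update, then cut the graph
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h.detach()                 # gradients will not flow past this point
        loss = 0.0
```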
Describe the Vanishing Gradient and Exploding Gradient Problems.
Vanishing Gradients:
The gradients become very small (near 0), so the weight updates are negligible and the network effectively stops learning, especially about long-range dependencies.
Exploding Gradient:
The gradients become very large, causing huge weight updates; the parameters jump around the loss surface (often to numerical overflow), so optimisation fails to settle into a minimum (a toy illustration of both regimes follows).
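A toy illustration: during BPTT the error signal is multiplied by the recurrent weight matrix once per time step, so its size is governed by that matrix's largest singular value. The matrices and scales below are chosen only to show the two regimes:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=8)                        # error signal at the last time step

for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    W_rec = scale * np.eye(8)                    # recurrent weights with known spectral norm
    g = grad.copy()
    for _ in range(50):                          # backprop through 50 time steps
        g = W_rec.T @ g
    print(label, np.linalg.norm(g))              # ~1e-15 vs ~1e+9
```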
What was introduced in Gated RNNs?
We introduce gates that let the network learn skip connections through time. An update gate and a reset gate (as in the GRU) let the network decide, at each step, how much of the previous hidden state to carry forward and how much of the newly computed candidate state to use (a minimal sketch follows).
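A minimal sketch of such a gated cell (GRU-style); the weight shapes and sizes are made-up placeholders:

```python
import numpy as np

D, H = 8, 16                                     # input and hidden sizes (hypothetical)
rng = np.random.default_rng(0)
Wz, Uz = rng.normal(scale=0.1, size=(D, H)), rng.normal(scale=0.1, size=(H, H))
Wr, Ur = rng.normal(scale=0.1, size=(D, H)), rng.normal(scale=0.1, size=(H, H))
Wh, Uh = rng.normal(scale=0.1, size=(D, H)), rng.normal(scale=0.1, size=(H, H))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate: old state vs. new candidate
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate: how much history to use
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # gated mix: a learned skip connection

h = np.zeros(H)
for x in rng.normal(size=(5, D)):                # run a toy 5-step sequence
    h = gru_step(x, h)
print(h.shape)                                   # (16,)
```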
What is a variant of RNNs?
LSTMs (Long Short-Term Memory networks), which add an explicit memory cell controlled by input, forget and output gates.
What is a bi-directional RNN? What are its benefits?
An RNN that makes two passes over the input: a forward pass reading the sequence left to right and a backward pass reading it right to left; the two hidden states are then combined (usually by concatenation or addition). This gives the network both left and right context for every word (see the sketch after the benefits).
Benefits:
• Access to the entire sequence beforehand
• Access to context on both sides of each word
• Better gradient propagation
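A sketch using PyTorch's built-in bidirectional flag, which concatenates the forward and backward hidden states per time step; the sizes and random input are hypothetical:

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True)

xs = torch.randn(20, 1, 8)             # (seq_len, batch, features)
out, _ = birnn(xs)
print(out.shape)                       # torch.Size([20, 1, 32]) -> 16 forward + 16 backward
```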