Deep Learning 3 Flashcards
- What is an example of a sequence modeling task where you have an image of a ball and want to predict its future location? Provide context for why this is a sequence problem.
You have a sequence of images or frames showing a ball in motion. The model needs to look at the current or previous frames (the ball’s trajectory so far) and then predict where the ball will appear next. This is a sequence problem because it involves temporal (or sequential) steps where each frame depends on previous frames.
- Besides images of a moving ball, what are other real-world examples of sequential data that might benefit from sequence modeling?
Common examples include audio signals (speech recognition), text data (language modeling), music generation, sensor data, and time-series analysis (like stock prices). These all have an inherent ordering that must be accounted for.
- In a language modeling task, such as predicting the next word in a sentence, why do we need to capture long-term dependencies?
Language often relies on context from many words earlier to predict future words accurately. For example, if the sentence is ‘I grew up in France … I speak fluent ___,’ the crucial clue (France) might be far back in the sequence. If the model fails to capture this distant context, it may guess incorrectly.
- Why is a fixed window approach (Idea #1) problematic for predicting the next word in a sentence?
A fixed window only sees a short recent history of words, which fails when the necessary context is outside that small window. It can’t handle variable sentence lengths, nor can it maintain relevant long-range context.
- What is the ‘bag of words’ approach (Idea #2), and why does it lose important information for sequence prediction?
‘Bag of words’ counts how often each word appears without regard to order. While it captures which words occurred, it loses the sequence information. The sentences ‘The food was good, not bad’ and ‘The food was bad, not good’ would look similar as counts, even though their meanings differ.
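As a tiny illustration (the two sentences are just the stand-ins from the answer above), a bag-of-words count cannot tell the orderings apart:
```python
# Bag-of-words loses ordering: both sentences have identical word counts.
from collections import Counter

a = "the food was good not bad".split()
b = "the food was bad not good".split()

print(Counter(a) == Counter(b))  # True: same counts, opposite meanings
```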
- Describe why using a very large fixed window (Idea #3) still poses problems for sequence modeling.
A large window leads to many parameters (one set of weights for each position in that window). The model can’t share what it learns about certain positions with others, and thus doesn’t transfer knowledge if the same context words appear at different positions in the sequence. It’s also very memory-intensive.
- Summarize the four design criteria needed for a sequence model to effectively handle sequences.
(1) Handle variable-length inputs, (2) capture long-term dependencies, (3) preserve information about ordering, (4) share parameters across time steps so that patterns learned at one position apply to other positions.
- How does a Recurrent Neural Network (RNN) differ from a standard feed-forward network in how it processes input?
An RNN has a recurrent cell that takes both the current input xₜ and a hidden state hₜ₋₁ from the previous time step, producing a new hidden state hₜ. This allows the network to maintain a memory of what has come before, unlike a feed-forward network that processes inputs independently.
- Illustrate the ‘unrolled’ computational graph of a simple RNN across three time steps and explain why we say RNNs ‘share parameters.’
At time steps t = 1, 2, 3, you see repeated usage of the same weight matrices Wₓₕ (input-to-hidden) and Wₕₕ (hidden-to-hidden). Unrolling means drawing the RNN cell multiple times, one per time step. Since each step uses the same Wₓₕ and Wₕₕ, we say parameters are ‘shared’ across time.
- Write down the mathematical update equations for a ‘vanilla’ RNN’s hidden state and output. Mention what each term represents.
A common formulation is:
hₜ = tanh(Wₕₕ hₜ₋₁ + Wₓₕ xₜ + bₕ)
ŷₜ = Wₕₒ hₜ + bₒ
Here, Wₓₕ maps the input xₜ into the hidden state, Wₕₕ maps the previous hidden state hₜ₋₁ to the new hidden state, and Wₕₒ maps the hidden state to the output ŷₜ.
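A minimal sketch of these equations, assuming NumPy and toy dimensions chosen only for illustration:
```python
# Vanilla RNN forward pass; sizes and random initialization are illustrative.
import numpy as np

input_dim, hidden_dim, output_dim = 8, 16, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_ho = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output
b_h = np.zeros(hidden_dim)
b_o = np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_ho @ h_t + b_o          # y_hat_t = W_ho h_t + b_o
    return h_t, y_t

h = np.zeros(hidden_dim)            # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):   # a toy 5-step input sequence
    h, y = rnn_step(x_t, h)         # the same weights are reused at every step
```
Note that the loop reuses the same weight matrices at every step, which is exactly the parameter sharing described in the unrolled-graph card.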
- Describe how ‘Backpropagation Through Time (BPTT)’ works for training RNNs.
We unroll the RNN over the sequence length, then compute the forward pass for all time steps. For the backward pass, we propagate gradients backwards through each time step, taking into account the repeated usage of parameters. This way, we update all the parameters with respect to errors at every time step.
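A hedged sketch of BPTT using PyTorch autograd; the `RNNCell`, sequence length, and per-step targets are toy assumptions, not a specific training recipe:
```python
# Unroll forward over all time steps, sum the per-step losses, then backprop
# once so gradients flow through every step into the shared weights.
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
readout = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()

xs = torch.randn(5, 1, 8)                 # 5 time steps, batch of 1
targets = torch.randint(0, 4, (5, 1))     # a label at every step

h = torch.zeros(1, 16)
loss = 0.0
for t in range(xs.size(0)):               # forward pass, unrolled in time
    h = cell(xs[t], h)
    loss = loss + loss_fn(readout(h), targets[t])

loss.backward()                           # gradients propagate back through all
                                          # steps, accumulating into shared weights
```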
- In RNNs, what are exploding gradients and how can we mitigate them?
Exploding gradients occur when gradients grow excessively large, for example when the hidden-to-hidden weight matrix has a norm greater than 1 and repeated multiplication compounds across time steps. This leads to unstable training. A standard mitigation is ‘gradient clipping,’ where we rescale gradients whenever their norm exceeds a threshold, preventing excessively large updates.
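A short sketch of gradient clipping in PyTorch; the tiny model, dummy loss, and the threshold of 1.0 are illustrative assumptions:
```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(1, 5, 8)                 # (batch, time, features)
out, h_n = model(x)
loss = out.pow(2).mean()                 # dummy loss just to produce gradients

loss.backward()
# If the global gradient norm exceeds 1.0, rescale all gradients so it equals 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```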
- In RNNs, what does it mean for gradients to ‘vanish,’ and why does it happen?
Vanishing gradients occur when the repeated multiplication of values < 1 in magnitude drives the gradients toward zero. Over many time steps, the gradient contributions from the distant past become extremely small, making it difficult to train on long-term dependencies.
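A back-of-the-envelope illustration, assuming an arbitrary per-step factor of 0.9 and a 50-step horizon:
```python
# Multiplying many factors smaller than 1 drives the product toward zero.
factor = 0.9           # stand-in for the per-step gradient factor (|W_hh| * |tanh'|)
steps = 50
gradient_scale = factor ** steps
print(gradient_scale)  # ~0.005: the distant past barely influences the update
```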
- Why are vanishing gradients a problem for learning long-term dependencies, for example in a sentence like ‘I grew up in France … I speak fluent ___’?
If the gradient from the distant word ‘France’ shrinks to near zero by the time it reaches the output, the model cannot effectively learn that ‘French’ is the correct next word. The memory of that distant context is essentially lost during backprop.
- What is one simple trick to help reduce vanishing gradients in RNNs?
Use ReLU activations (or variants) instead of sigmoid/tanh for the hidden state. ReLU derivatives are 1 for inputs > 0, which can help keep gradient magnitudes from shrinking too quickly.
- What role does parameter initialization play in preventing vanishing gradients?
If the hidden-to-hidden matrix Wₕₕ is initialized close to the identity matrix, it can help preserve the scale of the hidden state across time steps, preventing rapid decay of gradients. Biases are usually set to 0 or small values to avoid saturating non-linearities initially.
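A sketch of this initialization trick in PyTorch, combined with the ReLU idea from the previous card; the hidden size and use of `nn.RNNCell` are assumptions:
```python
import torch
import torch.nn as nn

hidden_dim = 16
cell = nn.RNNCell(input_size=8, hidden_size=hidden_dim, nonlinearity="relu")

with torch.no_grad():
    cell.weight_hh.copy_(torch.eye(hidden_dim))  # W_hh ~ I preserves hidden-state scale
    cell.bias_hh.zero_()                         # zero biases avoid early saturation
    cell.bias_ih.zero_()
```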
- Why were gated cells such as LSTM and GRU introduced into RNN architectures?
They provide built-in mechanisms (gates) to control which information to keep or forget over longer time spans. This helps mitigate vanishing gradients by maintaining a more constant error flow over many time steps, enabling the network to capture long-term dependencies more effectively.
- Explain how an LSTM cell differs from a basic RNN cell in terms of structure.
An LSTM has multiple gates: input, forget, and output gates, along with a cell state that can carry information across many time steps. By gating updates to the cell state, it prevents gradients from vanishing or exploding, allowing the network to remember or forget information more selectively.
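To make the gates concrete, here is a minimal NumPy sketch of one LSTM step; weight shapes and random values are illustrative assumptions:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold per-gate parameters keyed by 'i', 'f', 'o', 'g'."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell update
    c_t = f * c_prev + i * g       # cell state: gated mix of old memory and new input
    h_t = o * np.tanh(c_t)         # hidden state exposed to the rest of the network
    return h_t, c_t

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in "ifog"}
U = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in "ifog"}
b = {k: np.zeros(d_h) for k in "ifog"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```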
- Name three practical applications of RNNs and briefly describe each one.
(1) Music Generation: input is a sequence of musical notes, the RNN predicts the next note to create new melodies. (2) Sentiment Classification: input is a sequence of words, the output is the probability of positive/negative sentiment. (3) Machine Translation: input is a sentence in one language, output is a translated sentence in another language.
- Show how RNNs can be used in a ‘many-to-one’ scenario such as sentiment classification.
We feed each word xₜ of the sentence sequentially into the RNN, updating hidden states hₜ. After the last word, we take hₜ (the final hidden state) and pass it to a classifier layer that outputs a single sentiment label or probability.
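A many-to-one sketch, assuming standard PyTorch layers and toy sizes:
```python
# Feed the whole word sequence through the RNN, then classify from the final
# hidden state only.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, 2)            # positive / negative

tokens = torch.randint(0, vocab_size, (1, 7))    # one sentence of 7 word ids
_, h_n = rnn(embed(tokens))                      # h_n: final hidden state (1, 1, hidden_dim)
logits = classifier(h_n.squeeze(0))              # a single sentiment prediction
```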
- In machine translation, why do we often adopt an ‘encoder-decoder’ architecture with RNNs?
The ‘encoder’ RNN reads the source sentence into a final hidden state that summarizes the entire input. The ‘decoder’ RNN then generates the target sentence word by word, conditioned on that hidden representation. This approach handles variable-length inputs and outputs.
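An encoder-decoder sketch, assuming GRU layers, teacher forcing, and toy vocabulary sizes:
```python
# The encoder's final hidden state initializes the decoder, which then scores
# target-language words at each position.
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hid = 1000, 1200, 32, 64
src_embed = nn.Embedding(src_vocab, emb)
tgt_embed = nn.Embedding(tgt_vocab, emb)
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
readout = nn.Linear(hid, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 9))        # source sentence (9 tokens)
_, h = encoder(src_embed(src))                   # h summarizes the whole source

tgt_in = torch.randint(0, tgt_vocab, (1, 6))     # shifted target tokens (teacher forcing)
dec_out, _ = decoder(tgt_embed(tgt_in), h)       # condition on the encoder summary
logits = readout(dec_out)                        # next-word scores at each position
```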
- How might an RNN-based system handle special tokens like ‘<start>’ in tasks like translation?
The decoder RNN often receives a ‘<start>’ token to signal the beginning of output sequence generation. Once it sees ‘<start>’, it begins producing words in the target language one at a time until it emits an ‘<end>’ token.
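A greedy decoding sketch built around ‘<start>’ and ‘<end>’ tokens; the token ids, the standalone decoder cell, and the length cap are illustrative assumptions:
```python
import torch
import torch.nn as nn

tgt_vocab, emb, hid = 1200, 32, 64
START, END = 1, 2                                 # assumed ids for '<start>' and '<end>'
tgt_embed = nn.Embedding(tgt_vocab, emb)
decoder_cell = nn.GRUCell(emb, hid)
readout = nn.Linear(hid, tgt_vocab)

h = torch.zeros(1, hid)                           # would come from the encoder in practice
token = torch.tensor([START])                     # decoding starts from '<start>'
generated = []
for _ in range(20):                               # cap the output length
    h = decoder_cell(tgt_embed(token), h)
    token = readout(h).argmax(dim=-1)             # pick the most likely next word
    if token.item() == END:                       # stop once '<end>' is emitted
        break
    generated.append(token.item())
```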
- What kind of design variations might we see in RNN architectures for tasks like image captioning?
One design is to use a CNN to encode the image into a feature vector, then feed that vector as the initial hidden state of an RNN (decoder) that generates captions. Another approach includes attention mechanisms to dynamically attend to different parts of the image over time.
- Summarize the key points about RNNs for sequence modeling and how they solve the major problems in naive sequence approaches.
RNNs keep a hidden state that is updated at every time step with the same parameters (solving the parameter-sharing problem). They can handle variable-length sequences, track order, and in principle capture long-term dependencies better than fixed-window or bag-of-words approaches. However, they can suffer from vanishing/exploding gradients and often use LSTM/GRU cells to alleviate that.
- Give a concise definition of a Recurrent Neural Network (RNN) and why it’s well-suited for sequence tasks.
An RNN is a neural network that processes sequences by maintaining a hidden state across time steps. At each step, it takes an input xₜ and the previous hidden state hₜ₋₁ to produce a new hidden state hₜ. Because it shares parameters and updates state in a temporal manner, it efficiently models sequential data like text, audio, or time-series signals.