Recurrent Neural Networks Flashcards
Why do we need RNNs that accept variable-length input?
Many real-world sequences (like sentences) vary in length, so fixed-size models cannot handle them directly; RNNs process inputs one step at a time, so they can handle any length while maintaining temporal information.
What is the Seq2Seq architecture used for?
It is used for transforming one sequence into another, such as translating a sentence from one language to another.
What are the components of a Seq2Seq model?
An encoder RNN and a decoder RNN.
What does the encoder in a Seq2Seq model do?
It processes the input sequence and produces a context vector summarizing the sequence.
What does the decoder in a Seq2Seq model do?
It generates the output sequence using the context vector.
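A minimal sketch of this encoder/decoder split, assuming PyTorch and GRU cells (the cards don't prescribe a framework or cell type); module and variable names are illustrative.

```python
# Minimal Seq2Seq sketch: encoder summarizes the source, decoder generates step by step.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                   # hidden serves as the context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_token, hidden):        # tgt_token: (batch, 1)
        output, hidden = self.rnn(self.embed(tgt_token), hidden)
        return self.out(output), hidden          # logits over the target vocabulary

# Usage: encode the source once, then decode one token at a time from the context.
enc, dec = Encoder(vocab_size=100), Decoder(vocab_size=100)
src = torch.randint(0, 100, (2, 7))              # batch of 2 source sequences
_, context = enc(src)
logits, _ = dec(torch.zeros(2, 1, dtype=torch.long), context)
```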
What problem arises from using a fixed-size context vector in Seq2Seq?
It may not be sufficient to capture all the information for long input sequences.
How is this problem addressed?
With attention mechanisms that allow the decoder to access all encoder hidden states, not just the final one.
What does the attention mechanism do?
It computes a context vector dynamically for each decoder step by focusing on different parts of the input sequence.
What are the three components involved in attention computation?
Query (decoder hidden state), Keys (encoder hidden states), and Values (encoder hidden states).
How is the attention weight computed?
By computing a similarity score between the query and each key and normalizing the scores, usually with a softmax.
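A sketch of that computation, assuming dot-product scoring (shapes and names are illustrative): score the query against each key, softmax the scores, then use the weights to average the values.

```python
# Attention weights: similarity scores between the query and each key, then a softmax.
import torch
import torch.nn.functional as F

hid_dim, src_len = 64, 7
query = torch.randn(1, hid_dim)        # current decoder hidden state
keys = torch.randn(src_len, hid_dim)   # encoder hidden states
values = keys                          # keys and values are both the encoder states here

scores = query @ keys.T                # (1, src_len) similarity scores
weights = F.softmax(scores, dim=-1)    # attention weights sum to 1
context = weights @ values             # (1, hid_dim) context vector
```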
What are common types of attention score functions?
Dot-product, multiplicative (general), and additive (Bahdanau) attention.
How is dot-product attention computed?
It’s the dot product of the query and key vectors.
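In code (shapes illustrative); the 1/sqrt(d) scaling at the end is the common "scaled" variant, an addition beyond this card.

```python
# Dot-product scoring: one score per encoder state via a dot product with the query.
import math
import torch

d = 64
query = torch.randn(d)
keys = torch.randn(7, d)

scores = keys @ query             # plain dot-product attention scores, shape (7,)
scaled = scores / math.sqrt(d)    # scaled variant, keeps the softmax well-behaved for large d
```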
What is additive attention (Bahdanau attention)?
It uses a feedforward network with a single hidden layer to combine the query and key.
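A sketch of the Bahdanau score, assuming the usual form score(q, k) = vᵀ tanh(W_q q + W_k k); layer names and dimensions are illustrative.

```python
# Additive (Bahdanau) scoring: a one-hidden-layer feedforward net over query and key.
import torch
import torch.nn as nn

hid_dim, attn_dim, src_len = 64, 32, 7
W_q = nn.Linear(hid_dim, attn_dim, bias=False)
W_k = nn.Linear(hid_dim, attn_dim, bias=False)
v = nn.Linear(attn_dim, 1, bias=False)

query = torch.randn(1, hid_dim)          # decoder hidden state
keys = torch.randn(src_len, hid_dim)     # encoder hidden states

# score(q, k) = v^T tanh(W_q q + W_k k), one scalar per encoder position
scores = v(torch.tanh(W_q(query) + W_k(keys))).squeeze(-1)   # (src_len,)
weights = torch.softmax(scores, dim=-1)
```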
How does the decoder use attention in each time step?
It calculates attention weights, computes a context vector, and uses it along with the decoder hidden state to generate output.
What is concatenated in the decoder before generating the output token?
The context vector and the current decoder hidden state.
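A sketch of one decoder step that puts these two cards together, assuming dot-product scoring; layer names are illustrative.

```python
# One decoder step with attention: weights -> context vector -> concatenate with
# the decoder hidden state -> project to the vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, vocab_size, src_len = 64, 100, 7
out_proj = nn.Linear(hid_dim * 2, vocab_size)

enc_states = torch.randn(src_len, hid_dim)   # keys/values from the encoder
dec_hidden = torch.randn(1, hid_dim)         # query: current decoder hidden state

weights = F.softmax(dec_hidden @ enc_states.T, dim=-1)        # (1, src_len)
context = weights @ enc_states                                # (1, hid_dim)
logits = out_proj(torch.cat([context, dec_hidden], dim=-1))   # (1, vocab_size)
next_token = logits.argmax(dim=-1)
```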
What is teacher forcing?
A training technique where the decoder receives the ground truth token from the previous time step as input rather than its own previous prediction.
What is the advantage of teacher forcing?
It speeds up convergence and reduces error propagation during training.
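A sketch of teacher forcing with a mixing ratio; the ratio and the decoder interface (e.g. the Decoder sketched earlier) are illustrative.

```python
# Teacher forcing: during training, feed the ground-truth previous token to the
# decoder instead of its own prediction, here with probability teacher_forcing_ratio.
import random
import torch

def decode_with_teacher_forcing(decoder, hidden, tgt, teacher_forcing_ratio=0.5):
    batch_size, tgt_len = tgt.shape
    inp = tgt[:, :1]                              # start token
    all_logits = []
    for t in range(1, tgt_len):
        logits, hidden = decoder(inp, hidden)     # logits: (batch, 1, vocab)
        all_logits.append(logits)
        use_teacher = random.random() < teacher_forcing_ratio
        # ground-truth token if teacher forcing, otherwise the model's own prediction
        inp = tgt[:, t:t+1] if use_teacher else logits.argmax(dim=-1)
    return torch.cat(all_logits, dim=1)
```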
Name some applications of Seq2Seq with attention.
Machine translation, summarization, speech recognition, and question answering.
What is a Bidirectional RNN?
An RNN that processes the sequence in both forward and backward directions.
What is the benefit of Bidirectional RNNs?
They capture both past and future context for each time step.
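A sketch using PyTorch's built-in bidirectional flag; dimensions are illustrative.

```python
# Bidirectional RNN: each time step's output concatenates the forward and backward states.
import torch
import torch.nn as nn

birnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(2, 10, 32)                 # (batch, seq_len, features)
outputs, hidden = birnn(x)

print(outputs.shape)   # (2, 10, 128): forward and backward states concatenated per step
print(hidden.shape)    # (2, 2, 64): final hidden state for each direction
```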
What is the vanishing gradient problem in RNNs?
Gradients become too small to update weights effectively, especially over long sequences.
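A tiny numeric illustration: backpropagation through time multiplies one per-step derivative per time step, so magnitudes below 1 shrink the gradient exponentially (the 0.9 factor is illustrative).

```python
# Vanishing gradients: a product of many per-step derivatives with magnitude < 1
# shrinks toward zero, so early time steps receive almost no learning signal.
per_step_derivative = 0.9
for steps in (10, 50, 100):
    print(steps, per_step_derivative ** steps)
# 10 -> ~0.35, 50 -> ~0.005, 100 -> ~2.7e-05
```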
How is this issue addressed?
Using LSTM or GRU architectures that include gating mechanisms.
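A simplified LSTM-style cell step showing the gating idea (biases omitted, weights random, names illustrative): the additive cell-state update gives gradients a path that is not repeatedly squashed.

```python
# Simplified LSTM-style cell: forget/input/output gates plus an additive cell update.
import torch

def lstm_cell_step(x, h, c, W):
    z = torch.cat([x, h], dim=-1)
    f = torch.sigmoid(z @ W["f"])              # forget gate
    i = torch.sigmoid(z @ W["i"])              # input gate
    o = torch.sigmoid(z @ W["o"])              # output gate
    g = torch.tanh(z @ W["g"])                 # candidate cell update
    c = f * c + i * g                          # additive cell-state update
    h = o * torch.tanh(c)
    return h, c

dim = 8
W = {k: torch.randn(2 * dim, dim) * 0.1 for k in "fiog"}
h = c = torch.zeros(1, dim)
x = torch.randn(1, dim)
h, c = lstm_cell_step(x, h, c, W)
```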