Recurrent Neural Networks Flashcards

1
Q

Why do we need RNNs that accept variable-length input?

A

Many real-world sequences (like sentences) have variable lengths, and RNNs can process them while maintaining temporal information.

2
Q

What is the Seq2Seq architecture used for?

A

It is used for transforming one sequence into another, such as translating a sentence from one language to another.

3
Q

What are the components of a Seq2Seq model?

A

An encoder RNN and a decoder RNN.

4
Q

What does the encoder in a Seq2Seq model do?

A

It processes the input sequence and produces a context vector summarizing the sequence.
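
A minimal PyTorch sketch of such an encoder (the GRU choice, class name, and dimensions are illustrative assumptions, not part of the cards):

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Reads the input sequence and summarizes it in a context vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)   # (batch, src_len, embed_dim)
        outputs, hidden = self.gru(embedded)    # outputs: hidden state at every step
        # hidden (the final hidden state) serves as the fixed-size context vector
        return outputs, hidden
```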

5
Q

What does the decoder in a Seq2Seq model do?

A

It generates the output sequence using the context vector.
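
Continuing the sketch above, a matching decoder might look like this (again an illustrative assumption; here the encoder's context vector initializes the decoder's hidden state):

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Generates the output sequence one token at a time from the context vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1) previous output (or ground-truth) token id
        # hidden: starts as the encoder's context vector
        embedded = self.embedding(prev_token)    # (batch, 1, embed_dim)
        output, hidden = self.gru(embedded, hidden)
        logits = self.out(output.squeeze(1))     # (batch, vocab_size)
        return logits, hidden
```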

6
Q

What problem arises from using a fixed-size context vector in Seq2Seq?

A

A single fixed-size vector may not be sufficient to capture all the information in long input sequences.

7
Q

A fixed-size context vector may not be sufficient to capture all the information in long input sequences. How is this problem addressed?

A

With attention mechanisms that allow the decoder to access all encoder hidden states, not just the final one.

8
Q

What does the attention mechanism do?

A

It computes a context vector dynamically for each decoder step by focusing on different parts of the input sequence.

9
Q

What are the three components involved in attention computation?

A

Query (decoder hidden state), Keys (encoder hidden states), and Values (encoder hidden states).

10
Q

How is the attention weight computed?

A

By taking a similarity score between the query and each key, usually followed by a softmax.
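
A sketch of this computation with dot-product scores (an assumption; the function name and shapes are illustrative), tying together the query, keys, and values from the previous card:

```python
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    # query:  (batch, hidden)          -- current decoder hidden state
    # keys:   (batch, src_len, hidden) -- encoder hidden states
    # values: (batch, src_len, hidden) -- encoder hidden states (same tensors as keys here)
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)       # similarity per source position
    weights = F.softmax(scores, dim=1)                            # attention weights sum to 1
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # weighted sum of the values
    return context, weights
```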

11
Q

What are common types of attention score functions?

A

Dot-product, multiplicative (general), and additive (Bahdanau) attention.

12
Q

How is dot-product attention computed?

A

The score is simply the dot product of the query and key vectors, which requires them to have the same dimensionality.

13
Q

What is additive attention (Bahdanau attention)?

A

It uses a feedforward network with a single hidden layer to combine the query and key.
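
The three score functions side by side, as a hedged sketch for a single query/key pair (the parameter names W, W_q, W_k, and v are illustrative):

```python
import torch
import torch.nn as nn

hidden = 8
q = torch.randn(hidden)   # query: decoder hidden state
k = torch.randn(hidden)   # key: one encoder hidden state

# Dot-product: score = q . k
score_dot = torch.dot(q, k)

# Multiplicative ("general"): score = q^T W k
W = nn.Linear(hidden, hidden, bias=False)
score_general = torch.dot(q, W(k))

# Additive (Bahdanau): a feedforward net with one hidden layer over q and k
W_q = nn.Linear(hidden, hidden, bias=False)
W_k = nn.Linear(hidden, hidden, bias=False)
v = nn.Linear(hidden, 1, bias=False)
score_additive = v(torch.tanh(W_q(q) + W_k(k)))
```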

14
Q

How does the decoder use attention in each time step?

A

It calculates attention weights, computes a context vector, and uses it along with the decoder hidden state to generate output.

15
Q

What is concatenated in the decoder before generating the output token?

A

The context vector and the current decoder hidden state.
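
Putting the last two cards together, one decoder time step might look like this (a sketch that reuses the attention() helper from the earlier attention card; all names are illustrative):

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)  # sees [context ; hidden]

    def forward(self, prev_token, hidden, encoder_outputs):
        embedded = self.embedding(prev_token)          # (batch, 1, embed_dim)
        output, hidden = self.gru(embedded, hidden)    # update the decoder state
        query = hidden[-1]                             # (batch, hidden_dim)
        # attention(): scores -> softmax -> weighted sum, as sketched earlier
        context, weights = attention(query, encoder_outputs, encoder_outputs)
        combined = torch.cat([context, query], dim=1)  # concatenate context and hidden state
        logits = self.out(combined)                    # (batch, vocab_size)
        return logits, hidden, weights
```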

16
Q

What is teacher forcing?

A

A training technique where the decoder receives the ground truth token from the previous time step as input rather than its own previous prediction.
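
A sketch of decoding during training with teacher forcing (the decoder interface follows the earlier decoder sketch; the teacher-forcing ratio is an illustrative extra, with tf_ratio=1.0 giving pure teacher forcing):

```python
import random
import torch

def decode_with_teacher_forcing(decoder, hidden, target_tokens, tf_ratio=1.0):
    # target_tokens: (batch, tgt_len) ground-truth output ids, first column = start token
    inp = target_tokens[:, 0:1]
    step_logits = []
    for t in range(1, target_tokens.size(1)):
        logits, hidden = decoder(inp, hidden)
        step_logits.append(logits)
        if random.random() < tf_ratio:
            inp = target_tokens[:, t:t+1]             # feed the ground-truth token
        else:
            inp = logits.argmax(dim=1, keepdim=True)  # feed the model's own prediction
    return torch.stack(step_logits, dim=1)            # (batch, tgt_len - 1, vocab_size)
```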

17
Q

What is the advantage of teacher forcing?

A

It speeds up convergence and reduces error propagation during training.

18
Q

Name some applications of Seq2Seq with attention.

A

Machine translation, summarization, speech recognition, question answering.

19
Q

What is a Bidirectional RNN?

A

An RNN that processes the sequence in both forward and backward directions.
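
A minimal illustration with PyTorch's built-in bidirectional flag (sizes are arbitrary):

```python
import torch
import torch.nn as nn

birnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)   # (batch, seq_len, features)
outputs, hidden = birnn(x)

# Each time step now carries forward (past) and backward (future) context,
# so the output feature dimension doubles.
print(outputs.shape)         # torch.Size([4, 10, 64]) = (batch, seq_len, 2 * hidden_size)
print(hidden.shape)          # torch.Size([2, 4, 32])  = (num_directions, batch, hidden_size)
```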

20
Q

What is the benefit of Bidirectional RNNs?

A

They capture both past and future context for each time step.

21
Q

What is the vanishing gradient problem in RNNs?

A

Gradients become too small to update weights effectively, especially over long sequences.

22
Q

In plain RNNs, gradients become too small to update weights effectively over long sequences (the vanishing gradient problem). How is this issue addressed?

A

Using LSTM or GRU architectures that include gating mechanisms.
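
A quick sketch showing the gated architectures as drop-in replacements for a vanilla RNN in PyTorch (sizes are arbitrary):

```python
import torch
import torch.nn as nn

vanilla = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 100, 16)           # a long sequence: (batch, seq_len, features)
out_rnn, h_rnn = vanilla(x)
out_lstm, (h_lstm, c_lstm) = lstm(x)  # the LSTM also carries a cell state
out_gru, h_gru = gru(x)
# The gates in LSTM/GRU regulate how information (and gradients) flow across
# time steps, mitigating the vanishing gradient problem of the vanilla RNN.
```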