RNN & Attention Flashcards

1
Q

Neural Nets

What is a Feedforward Neural Net?

A
n input features -> W matrix (d x n) -> nonlinear function -> d hidden units -> V matrix (C x d) -> softmax -> C class probabilities

Representing text by summing/averaging word embeddings
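
A minimal NumPy sketch of this pipeline (sizes are illustrative, not from the card), with x standing in for a summed/averaged word-embedding vector:

  import numpy as np

  # Illustrative sizes: n input features, d hidden units, C output classes
  n, d, C = 6, 4, 3
  rng = np.random.default_rng(0)
  W = rng.normal(size=(d, n))            # input-to-hidden matrix (d x n)
  V = rng.normal(size=(C, d))            # hidden-to-output matrix (C x d)

  x = rng.normal(size=n)                 # e.g., summed/averaged word embeddings
  h = np.tanh(W @ x)                     # nonlinear function -> d hidden units
  logits = V @ h                         # C scores
  probs = np.exp(logits - logits.max())
  probs /= probs.sum()                   # softmax -> C probabilities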

2
Q

Neural Nets

Word Embeddings in NNs

A

Similar input words get similar vectors

3
Q

Neural Nets

W Matrix in NNs

A

Similar output words get similar vectors (rows) in the softmax matrix

4
Q

Neural Nets

Hidden States in NNs

A

Similar contexts get similar hidden states

5
Q

Neural Nets

What problems ARE handled by NNs? (vs count-based LMs)

A
  • Can share strength among similar words and contexts
  • Can condition on context with intervening (interconnected) words
6
Q

Neural Nets

What problems ARE NOT handled by NNs?

A

Cannot handle long-distance dependencies

7
Q

Recurrent Neural Networks

Sequential Data Examples

A
  • Words in sentences
  • Characters in words
  • Sentences in a sample
8
Q

Recurrent Neural Networks

Long Distance Dependencies Examples

A
  • Agreement in number, gender
  • Selectional preference (e.g., determining the meaning of rain/reign from clouds/queen)
9
Q

Recurrent Neural Networks

Recurrent Neural Networks

A
  • Retains information from previous inputs due to the use of hidden states
  • Designed to process sequential data, where the order of the data matters (FFNN treats inputs independently)
  • At each step, RNN takes in current input with the hidden state from the previous step and updates the hidden state (“remembers”)
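
A minimal sketch of the recurrence described above; step_fn is a hypothetical update function (a concrete one appears in the RNN Training cards below):

  def run_rnn(inputs, step_fn, h0):
      """Apply step_fn(x_t, h_prev) -> h_t over a sequence (step_fn is hypothetical)."""
      h = h0
      states = []
      for x_t in inputs:        # order matters: each step sees the current input...
          h = step_fn(x_t, h)   # ...plus the hidden state "remembered" from the previous step
          states.append(h)
      return states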
10
Q

Recurrent Neural Networks

Unrolling in Time

A
  • RNN updates its hidden vector after each input
  • Unrolling means breaking the recurrent “cell” down into one copy per time step
  • Makes it easier to see how the network processes the sequence step by step
11
Q

Recurrent Neural Networks

What can RNNs do?

A
  • Sentence classification
  • Conditional generation
  • Language modeling
  • POS tagging
12
Q

Recurrent Neural Networks

Sentence Classification

A

Read whole sentence and represent it

13
Q

Recurrent Neural Networks

Conditional Generation

A

Use sentence representation to make prediction

14
Q

Recurrent Neural Networks

Language Modeling

A

Read context up until a point and represent the context

15
Q

Recurrent Neural Networks

POS Tagging

A

Use sentence and context representation to determine the POS of a word

16
Q

RNN Training

How to update hidden state?

A
h_t = tanh( W x_t + V h_{t-1} + b_h )
17
Q

RNN Training

How to compute output vector?

A
y_t = tanh( U h_t + b_y )
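
A minimal NumPy sketch combining the two update rules from this card and the previous one (sizes are illustrative; note the hidden update uses the previous hidden state h_{t-1}):

  import numpy as np

  d_in, d_h, d_out = 5, 8, 3
  rng = np.random.default_rng(0)
  W = rng.normal(size=(d_h, d_in))
  V = rng.normal(size=(d_h, d_h))
  U = rng.normal(size=(d_out, d_h))
  b_h, b_y = np.zeros(d_h), np.zeros(d_out)

  def rnn_step(x_t, h_prev):
      h_t = np.tanh(W @ x_t + V @ h_prev + b_h)   # hidden-state update
      y_t = np.tanh(U @ h_t + b_y)                # output vector (tanh as given on the card)
      return h_t, y_t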
18
Q

RNN Training

How to calculate the loss?

A
  • A loss is calculated against the label each time the hidden state is updated (i.e., at each input)
  • Sum all the per-step losses to get the total loss
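
A tiny illustration with made-up numbers: one loss per input position (here, cross-entropy against a hypothetical label), summed into the total loss.

  import numpy as np

  # Hypothetical per-step softmax outputs and labels for a 3-class, 4-step sequence
  probs = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4],
                    [0.6, 0.3, 0.1]])
  labels = [0, 1, 2, 0]
  step_losses = [-np.log(probs[t, labels[t]]) for t in range(len(labels))]  # one loss per input
  total_loss = sum(step_losses)                                             # total loss = sum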
19
Q

RNN Training

Unrolling

A

The unrolled graph (of the summed loss) is a well-formed computation graph that we can use to run backpropagation through time
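
A minimal PyTorch sketch, assuming the update rules from the previous cards and a softmax cross-entropy loss at each step (the loss choice is an assumption, not stated on the cards): summing the per-step losses and calling backward() runs backpropagation through time over the unrolled graph.

  import torch
  import torch.nn.functional as F

  # Illustrative sizes: n inputs, d hidden units, C classes, T time steps
  n, d, C, T = 5, 8, 3, 4
  W = torch.randn(d, n, requires_grad=True)
  V = torch.randn(d, d, requires_grad=True)
  U = torch.randn(C, d, requires_grad=True)
  b_h = torch.zeros(d, requires_grad=True)
  b_y = torch.zeros(C, requires_grad=True)

  xs = torch.randn(T, n)               # an input sequence
  ys = torch.randint(0, C, (T,))       # hypothetical per-step labels

  h = torch.zeros(d)
  total_loss = torch.zeros(())
  for t in range(T):                   # the loop unrolled in time
      h = torch.tanh(W @ xs[t] + V @ h + b_h)
      logits = U @ h + b_y
      total_loss = total_loss + F.cross_entropy(logits.unsqueeze(0), ys[t].unsqueeze(0))

  total_loss.backward()                # backpropagation through time over the unrolled graph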

20
Q

RNN Training

Parameter Tying

A
  • All parameter updates are tied together because the per-step losses are summed into a single loss
  • This is how the model is able to capture context, POS, and long-distance dependencies
21
Q

Recurrent Neural Networks

Bi-RNNs

A
  • Runs the RNN in both directions
  • Needs two different models (one per direction) to run the input through
  • Helpful because context flows in both directions
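
A minimal sketch, reusing the hypothetical run_rnn helper from the RNN card above: two separate step functions (one per direction), with the per-position hidden states concatenated.

  import numpy as np

  def bi_rnn(inputs, fwd_step, bwd_step, h0_fwd, h0_bwd):
      fwd = run_rnn(inputs, fwd_step, h0_fwd)                  # left-to-right pass
      bwd = run_rnn(list(reversed(inputs)), bwd_step, h0_bwd)  # right-to-left pass
      bwd = list(reversed(bwd))                                # realign with the input order
      # Each position now carries context from both directions
      return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]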
22
Q

RNN Long Short-Term Memory

Vanishing Gradient

A
  • Occurs during backpropagation
  • Gradients (used to update the network’s weights) shrink exponentially as they are propagated backward through time
  • Makes it hard to determine long-distance dependencies
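
A toy illustration with made-up numbers: if the factor multiplying the gradient at each step has norm below 1, the gradient shrinks exponentially as it is propagated backward through time.

  import numpy as np

  grad = np.ones(4)
  jacobian_like = 0.5 * np.eye(4)   # stand-in for dh_t/dh_{t-1} with norm < 1
  for _ in range(20):               # propagate back through 20 time steps
      grad = jacobian_like @ grad
  print(grad)                       # ~1e-6: the long-distance signal has nearly vanished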
23
Q

RNN LSTM

Long Short-Term Memory (LSTM)

A
  • Overcomes vanishing gradients and helps to learn long-term dependencies
  • Has memory cell and gates that regulate the flow of information
  • Allows network to retain or forget information over long sequences
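
A minimal NumPy sketch of one standard LSTM cell step (this particular formulation is an assumption, not from the card): the input (i), forget (f), and output (o) gates regulate what is written to, kept in, and read out of the memory cell c.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x_t, h_prev, c_prev, p):
      """p is a hypothetical dict of weight matrices and biases, one set per gate."""
      i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
      f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
      o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
      g = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate update
      c_t = f * c_prev + i * g       # memory cell: forget old info (f) and/or write new info (i)
      h_t = o * np.tanh(c_t)         # expose part of the cell as the new hidden state
      return h_t, c_t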
24
Q

Attention

Why do we need attention?

A
  • In a normal RNN, as the input sequence becomes longer, the model struggles to compress all the relevant information into a single fixed-size hidden vector
  • Results in poor performance
25
Q

Attention

Attention

A
  • Attention computes weighted combination of all hidden states at every time step of the output sequence
  • Allows the model to focus more on relevant parts of the input when generating the output
  • Higher attention weight when the query is closer to a key (whenever the query is close to a key/feature, the attention for that key increases)
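
As a worked equation (this notation is assumed, not on the card): with decoder query q_t, encoder keys k_i, and encoder hidden states h_i,

  a_{t,i} = softmax_i( score(q_t, k_i) ),   context_t = sum_i a_{t,i} * h_i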
26
Q

Attention

Calculating Attention

A
  1. Use the “query” vector (from the decoder) and the “key” vectors (from the encoder) to calculate weights, normalized with softmax
  2. These weights indicate how much attention the model should pay to each input position
  3. Use the weights to take a weighted combination of the encoder hidden states
  4. This creates the context vector, which summarizes the important parts of the input for generating the next output (used to pick the next word)
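
A minimal NumPy sketch of these steps (dot-product scoring is an assumption; the card does not name a scoring function):

  import numpy as np

  def softmax(z):
      e = np.exp(z - z.max())
      return e / e.sum()

  def attention(query, keys, values):
      scores = keys @ query        # step 1: compare the decoder query with each encoder key
      weights = softmax(scores)    # ...and normalize with softmax
      # step 2: weights[i] is how much attention to pay to input position i
      context = weights @ values   # steps 3-4: weighted combination -> context vector
      return context, weights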
27
Q

Attention

Self-Attention

A
  • Normal attention involves attention between two different sequences
  • Self-attention computes attention within a single sequence: each position attends to all other positions in the sequence
  • Each word forms a “query” that computes attention over every other word
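
A minimal NumPy sketch of (single-head) self-attention over one sequence X of T word vectors; the projection matrices W_q, W_k, W_v and the 1/sqrt(d) scaling are standard assumptions, not from the card.

  import numpy as np

  def self_attention(X, W_q, W_k, W_v):
      """X: (T, d_model); each position forms a query over all positions."""
      Q, K, V = X @ W_q, X @ W_k, X @ W_v
      scores = Q @ K.T / np.sqrt(K.shape[-1])          # every query scored against every key
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
      return weights @ V                               # each position: weighted mix of all positions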
28
Q

Attention

Multi-Head Self-Attention

A
  • Used in the Transformer
  • Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
  • Each “head” focuses on different parts of the input / different dependencies, giving the model a richer understanding of the input
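
A minimal sketch building on the self_attention function above: each head has its own (hypothetical) projection matrices, the heads run in parallel over the same input, and their per-position outputs are concatenated.

  import numpy as np

  def multi_head_self_attention(X, heads):
      """heads: list of (W_q, W_k, W_v) triples, one per head (hypothetical parameters)."""
      # Each head attends with its own projections, so each can capture different relationships
      outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
      return np.concatenate(outputs, axis=-1)          # concatenate the head outputs per position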