RNN & Attention Flashcards

1
Q

Neural Nets

What is a Feedforward Neural Net?

A
n input features -> W matrix (d x n) -> nonlinear function -> d hidden units -> V matrix (C x d) -> softmax -> C class probabilities

Representing text by summing/averaging word embeddings
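
A minimal NumPy sketch of this pipeline (sizes are illustrative, not from the card), with x standing in for a summed/averaged word-embedding vector:

  import numpy as np

  # Illustrative sizes: n input features, d hidden units, C output classes
  n, d, C = 6, 4, 3
  rng = np.random.default_rng(0)
  W = rng.normal(size=(d, n))            # input-to-hidden matrix (d x n)
  V = rng.normal(size=(C, d))            # hidden-to-output matrix (C x d)

  x = rng.normal(size=n)                 # e.g., summed/averaged word embeddings
  h = np.tanh(W @ x)                     # nonlinear function -> d hidden units
  logits = V @ h                         # C scores
  probs = np.exp(logits - logits.max())
  probs /= probs.sum()                   # softmax -> C probabilities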

2
Q

Neural Nets

Word Embeddings in NNs

A

Similar input words get similar vectors

3
Q

Neural Nets

W Matrix in NNs

A

Similar output words get similar vectors (rows) in the softmax matrix

4
Q

Neural Nets

Hidden States in NNs

A

Similar contexts get similar hidden states

5
Q

Neural Nets

What problems ARE handled by NNs? (vs count-based LMs)

A
  • Can share strength among similar words and contexts
  • Can condition on context with intervening (interconnected) words
6
Q

Neural Nets

What problems ARE NOT handled by NNs?

A

Cannot handle long-distance dependencies

7
Q

Recurrent Neural Networks

Sequential Data Examples

A
  • Words in sentences
  • Characters in words
  • Sentences in a sample
8
Q

Recurrent Neural Networks

Long Distance Dependencies Examples

A
  • Agreement in number, gender
  • Selectional preference (e.g., determining the meaning of rain/reign from clouds/queen)
9
Q

Recurrent Neural Networks

Recurrent Neural Networks

A
  • Retains information from previous inputs due to the use of hidden states
  • Designed to process sequential data, where the order of the data matters (FFNN treats inputs independently)
  • At each step, RNN takes in current input with the hidden state from the previous step and updates the hidden state (“remembers”)
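
A minimal sketch of the recurrence described above; step_fn is a hypothetical update function (a concrete one appears in the RNN Training cards below):

  def run_rnn(inputs, step_fn, h0):
      """Apply step_fn(x_t, h_prev) -> h_t over a sequence (step_fn is hypothetical)."""
      h = h0
      states = []
      for x_t in inputs:        # order matters: each step sees the current input...
          h = step_fn(x_t, h)   # ...plus the hidden state "remembered" from the previous step
          states.append(h)
      return states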
10
Q

Recurrent Neural Networks

Unrolling in Time

A
  • RNN updates its hidden vector after each input
  • Unrolling means breaking the recurrent “cell” down into one copy per time step
  • Makes it easier to see how the network processes the sequence step by step
11
Q

Recurrent Neural Networks

What can RNNs do?

A
  • Sentence classification
  • Conditional generation
  • Language modeling
  • POS tagging
12
Q

Recurrent Neural Networks

Sentence Classification

A

Read whole sentence and represent it

13
Q

Recurrent Neural Networks

Conditional Generation

A

Use sentence representation to make prediction

14
Q

Recurrent Neural Networks

Language Modeling

A

Read context up until a point and represent the context

15
Q

Recurrent Neural Networks

POS Tagging

A

Use sentence and context representation to determine the POS of a word

16
Q

RNN Training

How to update hidden state?

A
h_t = tanh( W x_t + V h_{t-1} + b_h )
17
Q

RNN Training

How to compute output vector?

A
y_t = tanh( U h_t + b_y )
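
A minimal NumPy sketch combining the two update rules from this card and the previous one (sizes are illustrative; note the hidden update uses the previous hidden state h_{t-1}):

  import numpy as np

  d_in, d_h, d_out = 5, 8, 3
  rng = np.random.default_rng(0)
  W = rng.normal(size=(d_h, d_in))
  V = rng.normal(size=(d_h, d_h))
  U = rng.normal(size=(d_out, d_h))
  b_h, b_y = np.zeros(d_h), np.zeros(d_out)

  def rnn_step(x_t, h_prev):
      h_t = np.tanh(W @ x_t + V @ h_prev + b_h)   # hidden-state update
      y_t = np.tanh(U @ h_t + b_y)                # output vector (tanh as given on the card)
      return h_t, y_t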
18
Q

RNN Training

How to calculate the loss?

A
  • A loss is calculated against the label each time the hidden state is updated (i.e., at each input)
  • Sum all the per-step losses to get the total loss
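
A tiny illustration with made-up numbers: one loss per input position (here, cross-entropy against a hypothetical label), summed into the total loss.

  import numpy as np

  # Hypothetical per-step softmax outputs and labels for a 3-class, 4-step sequence
  probs = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4],
                    [0.6, 0.3, 0.1]])
  labels = [0, 1, 2, 0]
  step_losses = [-np.log(probs[t, labels[t]]) for t in range(len(labels))]  # one loss per input
  total_loss = sum(step_losses)                                             # total loss = sum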
19
Q

RNN Training

Unrolling

A

The unrolled graph (of the summed loss) is a well-formed computation graph that we can use to run backpropagation through time
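
A minimal PyTorch sketch, assuming the update rules from the previous cards and a softmax cross-entropy loss at each step (the loss choice is an assumption, not stated on the cards): summing the per-step losses and calling backward() runs backpropagation through time over the unrolled graph.

  import torch
  import torch.nn.functional as F

  # Illustrative sizes: n inputs, d hidden units, C classes, T time steps
  n, d, C, T = 5, 8, 3, 4
  W = torch.randn(d, n, requires_grad=True)
  V = torch.randn(d, d, requires_grad=True)
  U = torch.randn(C, d, requires_grad=True)
  b_h = torch.zeros(d, requires_grad=True)
  b_y = torch.zeros(C, requires_grad=True)

  xs = torch.randn(T, n)               # an input sequence
  ys = torch.randint(0, C, (T,))       # hypothetical per-step labels

  h = torch.zeros(d)
  total_loss = torch.zeros(())
  for t in range(T):                   # the loop unrolled in time
      h = torch.tanh(W @ xs[t] + V @ h + b_h)
      logits = U @ h + b_y
      total_loss = total_loss + F.cross_entropy(logits.unsqueeze(0), ys[t].unsqueeze(0))

  total_loss.backward()                # backpropagation through time over the unrolled graph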

20
Q

RNN Training

Parameter Tying

A
  • All parameter updates are tied together because the per-step losses are summed into a single loss
  • This is how the model is able to capture context, POS, and long-distance dependencies
21
Q

Recurrent Neural Networks

Bi-RNNs

A
  • Runs the RNN in both directions
  • Needs two different models (one per direction) to run the input through
  • Helpful because context flows in both directions
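
A minimal sketch, reusing the hypothetical run_rnn helper from the RNN card above: two separate step functions (one per direction), with the per-position hidden states concatenated.

  import numpy as np

  def bi_rnn(inputs, fwd_step, bwd_step, h0_fwd, h0_bwd):
      fwd = run_rnn(inputs, fwd_step, h0_fwd)                  # left-to-right pass
      bwd = run_rnn(list(reversed(inputs)), bwd_step, h0_bwd)  # right-to-left pass
      bwd = list(reversed(bwd))                                # realign with the input order
      # Each position now carries context from both directions
      return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]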
22
Q

RNN Long Short-Term Memory

Vanishing Gradient

A
  • Occurs during backpropagation
  • Gradients (used to update the network’s weights) shrink exponentially as they are propagated backward through time
  • Makes it hard to determine long-distance dependencies
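
A toy illustration with made-up numbers: if the factor multiplying the gradient at each step has norm below 1, the gradient shrinks exponentially as it is propagated backward through time.

  import numpy as np

  grad = np.ones(4)
  jacobian_like = 0.5 * np.eye(4)   # stand-in for dh_t/dh_{t-1} with norm < 1
  for _ in range(20):               # propagate back through 20 time steps
      grad = jacobian_like @ grad
  print(grad)                       # ~1e-6: the long-distance signal has nearly vanished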
23
Q

RNN LSTM

Long Short-Term Memory (LSTM)

A
  • Overcomes vanishing gradients and helps to learn long-term dependencies
  • Has memory cell and gates that regulate the flow of information
  • Allows network to retain or forget information over long sequences
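
A minimal NumPy sketch of one standard LSTM cell step (this particular formulation is an assumption, not from the card): the input (i), forget (f), and output (o) gates regulate what is written to, kept in, and read out of the memory cell c.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x_t, h_prev, c_prev, p):
      """p is a hypothetical dict of weight matrices and biases, one set per gate."""
      i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
      f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
      o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
      g = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate update
      c_t = f * c_prev + i * g       # memory cell: forget old info (f) and/or write new info (i)
      h_t = o * np.tanh(c_t)         # expose part of the cell as the new hidden state
      return h_t, c_t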
24
Q

Attention

Why do we need attention?

A
  • In a normal RNN, as the input sequence becomes longer, the model struggles to compress all the relevant information into a single fixed-size hidden vector
  • Results in poor performance
25
Q

Attention

Attention

A
  • Attention computes weighted combination of all hidden states at every time step of the output sequence
  • Allows the model to focus more on relevant parts of the input when generating the output
  • Higher attention weight when the query is closer to a key (whenever the query is close to a key/feature, the attention for that key increases)
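
As a worked equation (this notation is assumed, not on the card): with decoder query q_t, encoder keys k_i, and encoder hidden states h_i,

  a_{t,i} = softmax_i( score(q_t, k_i) ),   context_t = sum_i a_{t,i} * h_i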
26
Q

Attention

Calculating Attention

A
  1. Use the “query” vector (from the decoder) and the “key” vectors (from the encoder) to calculate weights, normalized with softmax
  2. These weights indicate how much attention the model should pay to each input position
  3. Use the weights to take a weighted combination of the encoder hidden states
  4. This creates the context vector, which summarizes the important parts of the input for generating the next output (used to pick the next word)
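
A minimal NumPy sketch of these steps (dot-product scoring is an assumption; the card does not name a scoring function):

  import numpy as np

  def softmax(z):
      e = np.exp(z - z.max())
      return e / e.sum()

  def attention(query, keys, values):
      scores = keys @ query        # step 1: compare the decoder query with each encoder key
      weights = softmax(scores)    # ...and normalize with softmax
      # step 2: weights[i] is how much attention to pay to input position i
      context = weights @ values   # steps 3-4: weighted combination -> context vector
      return context, weights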
27
Q

Attention

Self-Attention

A
  • Normal attention involves attention between two different sequences
  • Self-attention computes attention within a single sequence: each position attends to all other positions in the sequence
  • Each word forms a “query” that computes attention over every other word
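
A minimal NumPy sketch of (single-head) self-attention over one sequence X of T word vectors; the projection matrices W_q, W_k, W_v and the 1/sqrt(d) scaling are standard assumptions, not from the card.

  import numpy as np

  def self_attention(X, W_q, W_k, W_v):
      """X: (T, d_model); each position forms a query over all positions."""
      Q, K, V = X @ W_q, X @ W_k, X @ W_v
      scores = Q @ K.T / np.sqrt(K.shape[-1])          # every query scored against every key
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
      return weights @ V                               # each position: weighted mix of all positions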
28
Q

Attention

Multi-Head Self-Attention

A
  • Used in the Transformer
  • Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
  • Each “head” focuses on different parts of the input / different dependencies, giving the model a richer understanding of the input
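
A minimal sketch building on the self_attention function above: each head has its own (hypothetical) projection matrices, the heads run in parallel over the same input, and their per-position outputs are concatenated.

  import numpy as np

  def multi_head_self_attention(X, heads):
      """heads: list of (W_q, W_k, W_v) triples, one per head (hypothetical parameters)."""
      # Each head attends with its own projections, so each can capture different relationships
      outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
      return np.concatenate(outputs, axis=-1)          # concatenate the head outputs per position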