RNN & Attention Flashcards
Neural Nets
What is a Feedforward Neural Net?
n features -> W matrix (d×n) -> nonlinear function -> d hidden units -> V matrix (C×d) -> softmax -> C probabilities
Representing text by summing/averaging word embeddings
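A minimal numpy sketch of this card (the vocabulary size, dimensions, and random weights are illustrative assumptions, not from the flashcards):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n, d, C = 10, 4, 5, 3        # vocab size, embedding dim (n features), hidden units, classes
E = rng.normal(size=(vocab, n))     # word embedding table
W = rng.normal(size=(d, n))         # n features -> d hidden units
V = rng.normal(size=(C, d))         # d hidden units -> C output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(word_ids):
    x = E[word_ids].mean(axis=0)    # represent text by averaging word embeddings
    h = np.tanh(W @ x)              # nonlinear function -> d hidden units
    return softmax(V @ h)           # C probabilities

print(classify([1, 5, 7]))
```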
Neural Nets
Word Embeddings in NNs
Similar input words get similar vectors
Neural Nets
W Matrix in NNs
Similar output words get similar vectors in the softmax (output) matrix
Neural Nets
Hidden States in NNs
Similar contexts get similar hidden states
Neural Nets
What problems ARE handled by NNs? (vs count-based LMs)
- Can share statistical strength among similar words and contexts
- Can condition on context even with intervening words
Neural Nets
What problems ARE NOT handled by NNs?
Cannot handle long-distance dependencies
Recurrent Neural Networks
Sequential Data Examples
- Words in sentences
- Characters in words
- Sentences in a text sample
Recurrent Neural Networks
Long Distance Dependencies Examples
- Agreement in number, gender
- Selectional preference (e.g., determine the meaning of rain/reign from clouds/queen)
Recurrent Neural Networks
Recurrent Neural Networks
- Retains information from previous inputs due to the use of hidden states
- Designed to process sequential data, where the order of the data matters (FFNN treats inputs independently)
- At each step, RNN takes in current input with the hidden state from the previous step and updates the hidden state (“remembers”)
Recurrent Neural Networks
Unrolling in Time
- RNN updates its hidden vector upon each input
- Unrolling means breaking down the “cell” into one copy per time step
- Makes it easier to see how the network processes the sequence step by step
Recurrent Neural Networks
What can RNNs do?
- Sentence classification
- Conditional generation
- Language modeling
- POS tagging
Recurrent Neural Networks
Sentence Classification
Read whole sentence and represent it
Recurrent Neural Networks
Conditional Generation
Use sentence representation to make prediction
Recurrent Neural Networks
Language Modeling
Read context up until a point and represent the context
Recurrent Neural Networks
POS Tagging
Use sentence and context representation to determine the POS of a word
RNN Training
How to update hidden state?
h_t = tanh( W x_t + V h_{t-1} + b_h )
RNN Training
How to compute output vector?
y_t = tanh( U h_t + b_y )
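A minimal numpy sketch of these two update rules, taking h_{t-1} as the previous hidden state (the dimensions x_dim/h_dim/y_dim and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim = 3, 4, 2
W = rng.normal(size=(h_dim, x_dim))   # input -> hidden
V = rng.normal(size=(h_dim, h_dim))   # previous hidden -> hidden
U = rng.normal(size=(y_dim, h_dim))   # hidden -> output
b_h, b_y = np.zeros(h_dim), np.zeros(y_dim)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W @ x_t + V @ h_prev + b_h)   # update hidden state ("remembers")
    y_t = np.tanh(U @ h_t + b_y)                # compute output vector
    return h_t, y_t

h = np.zeros(h_dim)
for x_t in rng.normal(size=(5, x_dim)):         # a sequence of 5 inputs
    h, y = rnn_step(x_t, h)
```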
RNN Training
How to calculate the loss?
- A loss (against the label) is calculated each time the hidden state is updated (at each input)
- Add up all losses to get total loss
RNN Training
Unrolling
Unrolled graph (of sum of loss) is a well-formed computation graph that we can use to run backpropagation through time
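A sketch of the unrolled objective: one loss per time step, summed into a single value that backpropagation through time runs on (per-step cross-entropy over a softmax of the output is an assumed choice of loss; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim = 3, 4, 2
W = rng.normal(size=(h_dim, x_dim))
V = rng.normal(size=(h_dim, h_dim))
U = rng.normal(size=(y_dim, h_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def total_loss(xs, labels):
    h, total = np.zeros(h_dim), 0.0
    for x_t, label in zip(xs, labels):
        h = np.tanh(W @ x_t + V @ h)     # hidden-state update at this step
        p = softmax(U @ h)               # prediction at this step
        total += -np.log(p[label])       # loss at this step
    return total                         # backprop through time runs on this sum

xs = rng.normal(size=(5, x_dim))
print(total_loss(xs, labels=[0, 1, 1, 0, 1]))
```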
RNN Training
Parameter Tying
- The same parameters are reused (tied) at every time step, and their gradients from each step's loss are summed
- This is how we are able to capture context, POS, and long-distance dependencies
Recurrent Neural Networks
Bi-RNNs
- Runs an RNN in both directions
- Needs two separate RNNs (forward and backward) to run the input through
- Helpful because context comes from both directions
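A minimal Bi-RNN sketch: two separate RNNs read the input left-to-right and right-to-left, and their hidden states are concatenated per position (shapes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 3, 4
Wf, Vf = rng.normal(size=(h_dim, x_dim)), rng.normal(size=(h_dim, h_dim))  # forward RNN
Wb, Vb = rng.normal(size=(h_dim, x_dim)), rng.normal(size=(h_dim, h_dim))  # backward RNN

def run(xs, W, V):
    h, states = np.zeros(h_dim), []
    for x_t in xs:
        h = np.tanh(W @ x_t + V @ h)
        states.append(h)
    return states

xs = rng.normal(size=(5, x_dim))
fwd = run(xs, Wf, Vf)                   # left-to-right pass
bwd = run(xs[::-1], Wb, Vb)[::-1]       # right-to-left pass, re-aligned to positions
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # context from both sides
```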
RNN Long Short-Term Memory
Vanishing Gradient
- Occurs during backpropagation
- Gradients (used to update the network’s weights) shrink exponentially as they are propagated backward through time
- Makes it hard to determine long-distance dependencies
RNN LSTM
Long Short-Term Memory (LSTM)
- Overcomes vanishing gradients and helps to learn long-term dependencies
- Has memory cell and gates that regulate the flow of information
- Allows network to retain or forget information over long sequences
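A sketch of one LSTM step: sigmoid gates regulate what is written to, kept in, and read out of the memory cell (the gate layout follows the standard LSTM formulation; shapes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 3, 4
# one weight matrix per gate, acting on the concatenation [x_t; h_{t-1}]
Wi, Wf, Wo, Wc = (rng.normal(size=(h_dim, x_dim + h_dim)) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(Wi @ z)                       # input gate: what to write to the cell
    f = sigmoid(Wf @ z)                       # forget gate: what to keep from c_{t-1}
    o = sigmoid(Wo @ z)                       # output gate: what to expose as h_t
    c_t = f * c_prev + i * np.tanh(Wc @ z)    # additive cell update helps gradients flow
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(h_dim), np.zeros(h_dim)
for x_t in rng.normal(size=(5, x_dim)):
    h, c = lstm_step(x_t, h, c)
```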
Attention
Why do we need attention?
- In a normal RNN, as the input sequence becomes longer, the model struggles to compress all relevant information into a single fixed-size vector
- Results in poor performance
Attention
Attention
- Attention computes weighted combination of all hidden states at every time step of the output sequence
- Allows the model to focus more on relevant parts of the input when generating the output
- Attention is higher when the query is closer to a key (the closer the query is to a key/feature, the more weight that key's position receives)
Attention
Calculating Attention
- Use a “query” vector (from the decoder) and “key” vectors (from the encoder) to calculate weights, normalized with softmax
- These weights indicate how much attention the model should pay to each input position
- Use the weights to take a weighted sum of the encoder hidden states
- This creates the context vector, which summarizes the important parts of the input for generating the next output (used to pick the next word)
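A sketch of this calculation with dot-product scoring (an assumed choice; other score functions exist). The encoder states play the role of both keys and values here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
encoder_states = rng.normal(size=(6, d))   # one key/value per input position
query = rng.normal(size=d)                 # current decoder state

scores = encoder_states @ query            # how well each input position matches the query
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # softmax -> attention weights (sum to 1)
context = weights @ encoder_states         # weighted sum of encoder states = context vector
```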
Attention
Self-Attention
- Normal attention involves attention between two different sequences
- Self-attention computes attention within a single sequence: each position attends to all other positions in the sequence
- Each word forms a “query” that computes attention over every other word
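A self-attention sketch: queries, keys, and values all come from the same sequence, so every position attends to every position (the projection matrices and sqrt(d) scaling follow the usual Transformer-style formulation and are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                                 # sequence length, vector dim
X = rng.normal(size=(T, d))                 # one vector per token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each token forms a query, key, and value
scores = Q @ K.T / np.sqrt(d)               # T x T: every position scored against every position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
out = weights @ V                           # each output row mixes information from all tokens
```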
Attention
Multi-Head Self-Attention
- Used in the Transformer
- Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
- Each “head” focuses on different parts of the input/different dependencies, giving model richer understanding of input
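A sketch of multi-head self-attention: several independent heads attend in parallel over smaller projections, and their outputs are concatenated (head count and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_heads = 5, 8, 2
d_head = d // n_heads
X = rng.normal(size=(T, d))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

heads = []
for _ in range(n_heads):                        # each head has its own projections
    Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
Wo = rng.normal(size=(d, d))
out = np.concatenate(heads, axis=-1) @ Wo       # combine what the different heads found
```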