DEEP LEARNING FOR NLP Flashcards
What is Deep Learning (DL)?
subset of machine learning that involves neural networks with multiple layers
The solution system is a neural network
Builds end-to-end systems, which take raw objects as the input (no initial feature extraction)
eg raw image = input is pixel values
DL vs NN: DL emphasises networks with a higher number of layers
NLP tasks: Sequence distribution
Model probability distribution of a sequence
p(xn | x1, …, x(n-1)) or the joint p(x1, …, xn)
text generation / completion
eg modelling a chatbot
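As a sketch of what "modelling a probability distribution over a sequence" means in code, here is a toy Python example; the bigram table is an invented stand-in for whatever probabilities a trained language model would supply:

```python
# Toy bigram table standing in for a learned model:
# probs[prev][nxt] = p(nxt | prev). A trained language model would supply these numbers.
probs = {
    "<s>": {"the": 0.9, "cat": 0.05, "sat": 0.05},
    "the": {"the": 0.05, "cat": 0.8, "sat": 0.15},
    "cat": {"the": 0.1, "cat": 0.1, "sat": 0.8},
    "sat": {"the": 0.4, "cat": 0.3, "sat": 0.3},
}

def sequence_probability(tokens):
    """p(x1, ..., xn) = product over i of p(xi | x1, ..., x(i-1)); here the history is just x(i-1)."""
    p, prev = 1.0, "<s>"
    for tok in tokens:
        p *= probs[prev][tok]
        prev = tok
    return p

print(sequence_probability(["the", "cat", "sat"]))  # 0.9 * 0.8 * 0.8 = 0.576
```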
NLP tasks: Sequence Classification
to learn a representation vector of a sequence and use it to classify the sequence
f(x1 -> x2 -> ..-> xk) = class
(sequence of input only 1 class output)
eg sentiment analysis, spam filtering
NLP tasks: Sequence labeling
Learn a representation vector for each state (element) in a sequence and use it to predict the class label for each state
f(x1 -> x2 -> … -> xk) = class1 -> class2 -> … -> classk
(sequence of input, many class outputs)
eg POS tagging, named entity recognition
NLP tasks: seq2seq learning
To encode information in an input sequence (seq) and decode it to generate an output sequence (2seq)
f(x1 -> x2 -> … -> xk) = y1 -> y2 -> … -> ym (the output length m can differ from the input length k)
eg language translation, question answering
What is sentiment analysis
Classifying the sentiment (eg positive or negative) expressed in a text sequence
eg “I liked the film a lot” = positive class
(Uses sequence classification)
What is Vanilla RNN
The simplest RNN design
For the recurrence function f, we use a single perceptron (the standard neuron operation) to compute the hidden representation vector
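A minimal NumPy sketch of that vanilla recurrence; the names Wx, Wh, b and the tanh non-linearity are the usual choices and are assumed here rather than taken from the flashcards:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                 # input and hidden dimensions (arbitrary)
Wx = rng.normal(scale=0.5, size=(d_h, d_in))     # input-to-hidden weights
Wh = rng.normal(scale=0.5, size=(d_h, d_h))      # hidden-to-hidden weights
b = np.zeros(d_h)

def vanilla_rnn(xs):
    """h_t = tanh(Wx @ x_t + Wh @ h_(t-1) + b): one perceptron-style update per state."""
    h = np.zeros(d_h)
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

sequence = [rng.normal(size=d_in) for _ in range(5)]
hidden_states = vanilla_rnn(sequence)
print(hidden_states[-1])   # h_k: the representation of the whole sequence
```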
What is the issue with Vanilla RNN
Can result in the vanishing gradient problem, which negatively affects training
What is the Vanishing Gradient
In deep learning, most training is gradient-descent based
The gradient information is what is used to update the neural network
The term (Wh)^(k-i) in the gradient equation is problematic
k is the state of interest
i is a previous state
Numbers in the gradient matrix can become very small for long-distance past states, so those states barely contribute to learning the correct weights
Vanishing gradients cause a loss of dependency between the current state and long-distance past states
Meaning the model is biased towards information in recent past states
“The writer of the books …”
Biased by the nearby plural “books”, the model predicts “are”, but the correct verb (agreeing with “writer”) is “is”
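A small numerical illustration (not from the course materials) of why the (Wh)^(k-i) term shrinks: repeatedly multiplying by a recurrent weight matrix whose largest singular value is below 1 drives the contribution of distant states towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(4, 4))
Wh = 0.9 * Wh / np.linalg.norm(Wh, 2)    # rescale so the largest singular value is 0.9

power = np.eye(4)
for distance in range(1, 21):            # distance = k - i between the states
    power = power @ Wh                   # accumulates (Wh)^(k-i)
    if distance in (1, 5, 10, 20):
        # the gradient contribution of a state 'distance' steps back shrinks roughly geometrically
        print(distance, np.linalg.norm(power))
```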
How do we fix Vanishing gradient
Challenging
Requires new cell designs -> LSTM cells or Gated Recurrent Units
modify the recurrence function f used to compute the hidden state
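A sketch of the fix in PyTorch, assuming PyTorch is available: swapping the vanilla cell for an LSTM or GRU changes only the recurrence function f, not the sequence-processing interface.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 7, 16)            # (batch, sequence length, input dim) dummy input

vanilla = nn.RNN(input_size=16, hidden_size=32, batch_first=True)   # tanh perceptron-style cell
lstm    = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)  # gated cell with a memory cell
gru     = nn.GRU(input_size=16, hidden_size=32, batch_first=True)   # simpler gated cell

for model in (vanilla, lstm, gru):
    outputs, _ = model(x)            # same interface; only the cell (the function f) differs
    print(type(model).__name__, outputs.shape)   # all: torch.Size([2, 7, 32])
```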
What is Information Bottleneck
All the information in the encoder is accumulated in the final hk and sent to start the decoder
We assume hk is good enough to hold all this information - dangerous
What is the Attention RNN
Used to solve the information bottleneck problem
Concerned with the loss of information between states in the encoder
Automatically searches for parts of a source sequence that are relevant to the target prediction
Selectively builds direct connections between each state in the decoder and states in the encoder
Autoregressive structure
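A minimal NumPy sketch of one attention step at a single decoder state; the dot-product scoring function is an assumption here, since different attention designs use different score functions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))            # encoder states h1..h6 as rows, dimension 8
s = rng.normal(size=8)                 # current decoder state

scores = H @ s                         # relevance of each encoder state to the decoder state
weights = softmax(scores)              # positive, sum to 1 -> usable as attention weights
context = weights @ H                  # weighted sum of encoder states (the direct connection)

print(weights.round(3), context.shape)
```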
What can we say about the softmax function
each output is monotonically increasing in its corresponding input (larger scores get larger weights)
maps any real numbers to positive values between 0 and 1 that sum to 1, proportionally to their exponentials
so the results are well suited to be used as weights in the attention RNN
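A minimal sketch of the softmax itself, showing why its outputs behave as described above:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())            # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

w = softmax([2.0, 1.0, -3.0])
print(w)                               # all positive, each between 0 and 1
print(w.sum())                         # 1.0 -> a proper weighting over the inputs
```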
What is the benefit of Attention RNN
Improves model performance
Solves the information bottleneck problem
Helps with the vanishing gradient problem (the direct connections shorten gradient paths)
Provides interpretability (the attention weights show which past states the prediction relied on)
What is the motivation of multi head attention
To increase the model's capacity: several attention heads run in parallel, each learning its own attention pattern over the states
What is the difference between RNN vs Attention
RNN connects each state with the previous state
Only takes current and previous into account
Attention automatically identifies the relevant past states
Does not care about order - only about similarity
What is a Transformer
the state-of-the-art neural network architecture for NLP, used in almost all recent language models
Concerned with attention only “Attention Is All You Need”
Transformers: What is positional encoding
First step of transformer encoding
motivation: injects order information into the model
Adds an encoding of the position i of each state to its input vector
For even dimensions of the encoding vector we use sine, for odd dimensions cosine:
PE(i, 2j) = sin(i / 10000^(2j/d)), PE(i, 2j+1) = cos(i / 10000^(2j/d))
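A NumPy sketch of the sinusoidal positional encoding above; the base 10000 and the even/odd split over dimensions follow the original Transformer paper:

```python
import numpy as np

def positional_encoding(num_positions, d):
    """PE[i, 2j] = sin(i / 10000^(2j/d)), PE[i, 2j+1] = cos(i / 10000^(2j/d))."""
    positions = np.arange(num_positions)[:, None]        # state order i
    dims = np.arange(0, d, 2)[None, :]                   # even dimension indices 2j
    angles = positions / np.power(10000.0, dims / d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                         # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                         # cosine on odd dimensions
    return pe

X = np.zeros((5, 8))                                     # 5 input state vectors of dimension 8
G = X + positional_encoding(5, 8)                        # order information added to the inputs
print(G.shape)
```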
Transformers: What is Encoder Multi-head Attention
After positional encoding, we store the output vectors g1, g2, …, gk as rows of a matrix G
Pass this G as query, key and value (self-attention) into the encoder multi-head attention
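A NumPy sketch of multi-head self-attention where G supplies the query, key and value; the projection matrices WQ, WK, WV, the scaled dot-product score and the head count are assumptions in line with the original Transformer, not details taken from the flashcards:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
k, d, heads = 5, 8, 2                  # sequence length, model dimension, number of heads
d_head = d // heads
G = rng.normal(size=(k, d))            # rows g1..gk from the positional-encoding step

outputs = []
for h in range(heads):                 # each head has its own learned projections
    WQ, WK, WV = (rng.normal(scale=0.3, size=(d, d_head)) for _ in range(3))
    Q, K, V = G @ WQ, G @ WK, G @ WV   # self-attention: query, key and value all come from G
    scores = Q @ K.T / np.sqrt(d_head)             # scaled dot-product similarities
    weights = softmax(scores, axis=-1)             # each state attends over all states
    outputs.append(weights @ V)

Z = np.concatenate(outputs, axis=-1)   # heads are concatenated (a final projection usually follows)
print(Z.shape)                         # (k, d): one output vector per input state
```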
Transformers: Add and Layer Normalisation
Motivation: prevent information loss/change caused by previous attention layer (assume there is loss)
So the output of layer i-2 (the input to the previous sub-layer) is added to the output of layer i-1, the sum is layer-normalised, and the result is passed on to layer i (the current layer)
(Instead of just taking the i-1 output)
This helps stabilise the training process
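A NumPy sketch of the Add & Norm step, assuming sublayer_input is what went into the attention layer and sublayer_output is what came out of it (the learned scale and shift of layer normalisation are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each state vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
sublayer_input = rng.normal(size=(5, 8))    # output of layer i-2 (input to the attention layer)
sublayer_output = rng.normal(size=(5, 8))   # output of layer i-1 (the attention layer)

out = layer_norm(sublayer_input + sublayer_output)   # "Add" keeps the original information, then normalise
print(out.shape, out.mean(axis=-1).round(6))         # per-state mean ~0 after normalisation
```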
Transformers: What is the whole encoder
1) positional encoding
2) multi head attention
3) add and layer normalisation
4) fully connected NN
5) add and layer normalisation
Steps 2-5 form one building block of the encoder, which can be repeated N times (positional encoding is applied once to the input)
The fully connected NN uses a hidden layer with ReLU
The output vector for each state in the input sequence has the same dimension as its input vector, so the blocks can be stacked
Transformers: What is Decoder Multi-head Attention
Same structure as the encoder attention except in decoding we do not know the subsequent states (we MASK them = assume they don’t exist)
Each state may only attend to itself and earlier states; in practice the attention scores towards future states are masked (set to -inf before the softmax) so they receive zero weight, which is equivalent to removing the future states and just using the previous ones
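A sketch of how the masking is usually realised in practice: scores towards future states are set to -inf before the softmax, which has the same effect as removing those states:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

k = 4
scores = np.zeros((k, k))                          # dummy equal scores between all state pairs
mask = np.triu(np.ones((k, k)), 1).astype(bool)    # True above the diagonal = future positions
scores[mask] = -np.inf                             # future states get -inf scores

weights = softmax(scores, axis=-1)
print(weights.round(2))
# row t attends only to states 1..t: the first row is [1. 0. 0. 0.],
# the last row is [0.25 0.25 0.25 0.25]
```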
Transformers: What is Encoder-decoder Attention
Comes after the decoder attention layer
Offers us the opportunity to inject the information from the encoder into the decoder
The encoder output is used as the value and key input into the encoder-decoder attention
The query is from the decoder output
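A NumPy sketch of one encoder-decoder attention head: the query comes from the decoder states while the key and value come from the encoder output (the projection matrices and dot-product scoring are assumptions, as before):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
encoder_out = rng.normal(size=(6, d))        # one vector per source (encoder) state
decoder_states = rng.normal(size=(4, d))     # one vector per target (decoder) state so far

WQ, WK, WV = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
Q = decoder_states @ WQ                      # queries come from the decoder
K, V = encoder_out @ WK, encoder_out @ WV    # keys and values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (4, 6): each target state attends over the source
context = weights @ V                              # encoder information injected into the decoder
print(context.shape)                               # (4, 8)
```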
Transformers: What is the whole decoder
1) Masked decoder attention
2) Add and layer normalisation
3) encoder-decoder attention
4) add and layer normalisation
5) fully connected feedforward NN
6) add and layer normalisation
Steps 1-6 form one building block of the decoder, which can be repeated N times; the output of the final block is sent to the prediction layer
What building blocks do Transformers involve?
- Encoder
- Decoder
with: - multi-head attentions
- fully-connected feedforward neural networks
- add&Norm operation
- positional encoding