DEEP LEARNING FOR NLP Flashcards
What is Deep Learning (DL)?
subset of machine learning that involves neural networks with multiple layers
The solution system is a neural network
Builds end-to-end systems, which take raw objects as the input (no initial feature extraction)
eg raw image = input is pixel values
DL vs NN: DL emphasises networks with a higher number of layers
NLP tasks: Sequence distribution
Model the probability distribution of a sequence
p(xn | x1, …, xn-1) (predict the next element) or the joint p(x1, …, xn)
text generation / completion
eg modelling a chatbot
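For completeness, the joint distribution is usually factorised with the chain rule (the standard language-modelling formulation, not stated explicitly on this card):
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
so text generation proceeds by repeatedly sampling the next element from p(xi | x1, …, xi-1).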
NLP tasks: Sequence Classification
to learn a representation vector of a sequence and use it to classify the sequence
f(x1 -> x2 -> … -> xk) = class
(sequence of input only 1 class output)
eg sentiment analysis, spam filtering
NLP tasks: Sequence labeling
Learn a representation vector for each state (element) in a sequence and use it to predict the class label for each state
f(x1 -> x2 -> … -> xk) = class1 -> class2 -> … -> classk
(sequence of input, many class outputs)
eg POS tagging, named entity recognition
NLP tasks: seq2seq learning
To encode information in an input sequence (seq) and decode it to generate an output sequence (2seq)
f(x1 -> x2 -> … -> xk) = y1 -> y2 -> … -> ym (the output length m can differ from the input length k)
eg language translation, question answering
What is sentiment analysis
“I liked the film a lot” = positive class
(Uses sequence classification)
What is Vanilla RNN
The simplest RNN design
When we choose f, we use a single perceptron (a standard neuron operation) to compute the hidden representation vector
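A minimal NumPy sketch of the vanilla RNN recurrence (the weight names Wx, Wh, b and the tanh non-linearity are the usual textbook choices, assumed here rather than taken from these cards):

    import numpy as np

    def vanilla_rnn(x_seq, Wx, Wh, b):
        # h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b): one perceptron-style update per state
        h = np.zeros(Wh.shape[0])      # initial hidden state h_0
        states = []
        for x_t in x_seq:              # step through the sequence
            h = np.tanh(Wx @ x_t + Wh @ h + b)
            states.append(h)
        return states                  # one hidden representation vector per state

    # toy usage with random weights (hidden size 4, input size 3)
    rng = np.random.default_rng(0)
    hs = vanilla_rnn([rng.normal(size=3) for _ in range(5)],
                     rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))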
What is the issue with Vanilla RNN
Can result in vanishing gradient problems, which negatively affects the training
What is the Vanishing Gradient
In deep learning, most training is gradient-descent based
The gradient information is what is used to update the neural network
The term (Wh)^(k−i) in the gradient equation is problematic (see the sketch at the end of this card)
k is the state of interest
i is a previous state
Numbers in the gradient matrix can become very small for long-distance past states; consequently, those states do not contribute to learning the correct weights
Vanishing gradients cause a loss of dependency between the current state and long-distance past states
meaning the model is biased towards information in recent past states
“The writer of the books …”
the recent word “books” biases the prediction towards “are” (plural), but the correct answer is “is” (agreeing with “writer”)
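A sketch of where the (Wh)^(k−i) term comes from, following the standard backpropagation-through-time derivation (the notation below is assumed, not quoted from these cards):
\frac{\partial L_k}{\partial h_i} = \frac{\partial L_k}{\partial h_k} \prod_{j=i+1}^{k} \frac{\partial h_j}{\partial h_{j-1}}, \qquad \frac{\partial h_j}{\partial h_{j-1}} = \mathrm{diag}\big(\tanh'(\cdot)\big)\, W_h
The product therefore contains Wh raised to the power k − i; if the relevant singular values of Wh are below 1, the gradient shrinks exponentially with the distance k − i and long-distance states stop contributing.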
How do we fix Vanishing gradient
Challenging
Requires new cell designs -> LSTM cells or Gated Recurrent Units
modify the function used
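As one concrete example of a modified recurrence, the standard GRU update replaces the single perceptron step with gated updates (the equations below are the commonly published GRU form, included here as an illustration):
z_t = \sigma(W_z x_t + U_z h_{t-1}), \quad r_t = \sigma(W_r x_t + U_r h_{t-1})
\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
The additive interpolation in the last equation gives gradients a path to distant states that does not go through repeated multiplications by Wh, which mitigates the vanishing gradient.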
What is Information Bottleneck
All the information in the encoder is accumulated in the final hk and sent to start the decoder
We assume hk is good enough to hold all this information - dangerous
What is the Attention RNN
Used to solve the information bottleneck problem
Concerned with the loss of information between states in the encoder
Automatically searches for parts of a source sequence that are relevant to the target prediction
Selectively builds direct connections between each state in the decoder and states in the encoder
Autoregressive structure
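A sketch of the usual attention computation at decoder step t (the score / alpha / context notation below is the common convention, assumed here):
e_{t,i} = \mathrm{score}(s_t, h_i), \quad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \quad c_t = \sum_i \alpha_{t,i} h_i
The softmax over the scores e gives the weights alpha, and the context vector c_t is combined with the decoder state when making the prediction; this is the direct connection between decoder and encoder states described above.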
What can we say about the softmax function
order-preserving (monotonically increasing): a larger score always gets a larger weight
maps all numbers to positive values between 0 and 1 that sum to 1, in proportion to the exponential of the inputs
the outputs are therefore well suited to be used as weights in the attention RNN
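A minimal NumPy check of these properties (the max-subtraction is a standard numerical-stability trick, added here as an assumption):

    import numpy as np

    def softmax(scores):
        exp = np.exp(scores - np.max(scores))   # stable; shifting does not change the output
        return exp / exp.sum()

    weights = softmax(np.array([2.0, 1.0, -3.0]))
    print(weights)           # all positive and between 0 and 1
    print(weights.sum())     # sums to 1
    print(weights.argmax())  # 0: the largest score keeps the largest weight (order preserved)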
What is the benefit of Attention RNN
Improve model performance
Solve the information bottleneck problem
Help with the vanishing gradient problem
provide interpretability
What is the motivation of multi head attention
To make the attention mechanism more expressive: each head can attend to different positions and capture different kinds of relationships, rather than relying on a single attention distribution