DEEP LEARNING FOR NLP Flashcards
What is Deep Learning (DL)?
subset of machine learning that involves neural networks with multiple layers
The solution system is a neural network
Builds end-to-end systems, which take raw objects as the input (no initial feature extraction)
eg raw image = input is pixel values
DL vs NN: DL emphasises networks with a higher number of layers
NLP tasks: Sequence distribution
Model probability distribution of a sequence
p(xn | x1, …, x(n-1)) or the joint p(x1, …, xn)
text generation / completion
eg modelling a chatbot
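As a sketch of what "modelling a probability distribution over a sequence" means in code, here is a toy Python example; the bigram table is an invented stand-in for whatever probabilities a trained language model would supply:

```python
# Toy bigram table standing in for a learned model:
# probs[prev][nxt] = p(nxt | prev). A trained language model would supply these numbers.
probs = {
    "<s>": {"the": 0.9, "cat": 0.05, "sat": 0.05},
    "the": {"the": 0.05, "cat": 0.8, "sat": 0.15},
    "cat": {"the": 0.1, "cat": 0.1, "sat": 0.8},
    "sat": {"the": 0.4, "cat": 0.3, "sat": 0.3},
}

def sequence_probability(tokens):
    """p(x1, ..., xn) = product over i of p(xi | x1, ..., x(i-1)); here the history is just x(i-1)."""
    p, prev = 1.0, "<s>"
    for tok in tokens:
        p *= probs[prev][tok]
        prev = tok
    return p

print(sequence_probability(["the", "cat", "sat"]))  # 0.9 * 0.8 * 0.8 = 0.576
```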
NLP tasks: Sequence Classification
to learn a representation vector of a sequence and use it to classify the sequence
f(x1 -> x2 -> ..-> xk) = class
(sequence of input only 1 class output)
eg sentiment analysis, spam filtering
NLP tasks: Sequence labeling
Learn a representation vector for each state (element) in a sequence and use it to predict the class label for each state
f(x1 -> x2 -> … -> xk) = class1 -> class2 -> … -> classk
(sequence of input, many class outputs)
eg POS tagging, named entity recognition
NLP tasks: seq2seq learning
To encode information in an input sequence (seq) and decode it to generate an output sequence (2seq)
f(x1 -> x2 -> … -> xk) = y1 -> y2 -> … -> ym (the output length m can differ from the input length k)
eg language translation, question answering
What is sentiment analysis
Classifying the sentiment (eg positive or negative) expressed in a text sequence
eg “I liked the film a lot” = positive class
(Uses sequence classification)
What is Vanilla RNN
The simplest RNN design
For the recurrence function f, we use a single perceptron (the standard neuron operation) to compute the hidden representation vector
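A minimal NumPy sketch of that vanilla recurrence; the names Wx, Wh, b and the tanh non-linearity are the usual choices and are assumed here rather than taken from the flashcards:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                 # input and hidden dimensions (arbitrary)
Wx = rng.normal(scale=0.5, size=(d_h, d_in))     # input-to-hidden weights
Wh = rng.normal(scale=0.5, size=(d_h, d_h))      # hidden-to-hidden weights
b = np.zeros(d_h)

def vanilla_rnn(xs):
    """h_t = tanh(Wx @ x_t + Wh @ h_(t-1) + b): one perceptron-style update per state."""
    h = np.zeros(d_h)
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

sequence = [rng.normal(size=d_in) for _ in range(5)]
hidden_states = vanilla_rnn(sequence)
print(hidden_states[-1])   # h_k: the representation of the whole sequence
```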
What is the issue with Vanilla RNN
Can result in the vanishing gradient problem, which negatively affects training
What is the Vanishing Gradient
In deep learning, most training is gradient-descent based
The gradient information is what is used to update the neural network
The term (Wh)^(k-i) in the gradient equation is problematic
k is the state of interest
i is a previous state
Numbers in the gradient matrix can become very small for long-distance past states, so those states barely contribute to learning the correct weights
Vanishing gradients cause a loss of dependency between the current state and long-distance past states
Meaning the model is biased towards information in recent past states
“The writer of the books …”
Biased by the nearby plural “books”, the model predicts “are”, but the correct verb (agreeing with “writer”) is “is”
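A small numerical illustration (not from the course materials) of why the (Wh)^(k-i) term shrinks: repeatedly multiplying by a recurrent weight matrix whose largest singular value is below 1 drives the contribution of distant states towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(4, 4))
Wh = 0.9 * Wh / np.linalg.norm(Wh, 2)    # rescale so the largest singular value is 0.9

power = np.eye(4)
for distance in range(1, 21):            # distance = k - i between the states
    power = power @ Wh                   # accumulates (Wh)^(k-i)
    if distance in (1, 5, 10, 20):
        # the gradient contribution of a state 'distance' steps back shrinks roughly geometrically
        print(distance, np.linalg.norm(power))
```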
How do we fix Vanishing gradient
Challenging
Requires new cell designs -> LSTM cells or Gated Recurrent Units
modify the recurrence function f used to compute the hidden state
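A sketch of the fix in PyTorch, assuming PyTorch is available: swapping the vanilla cell for an LSTM or GRU changes only the recurrence function f, not the sequence-processing interface.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 7, 16)            # (batch, sequence length, input dim) dummy input

vanilla = nn.RNN(input_size=16, hidden_size=32, batch_first=True)   # tanh perceptron-style cell
lstm    = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)  # gated cell with a memory cell
gru     = nn.GRU(input_size=16, hidden_size=32, batch_first=True)   # simpler gated cell

for model in (vanilla, lstm, gru):
    outputs, _ = model(x)            # same interface; only the cell (the function f) differs
    print(type(model).__name__, outputs.shape)   # all: torch.Size([2, 7, 32])
```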
What is Information Bottleneck
All the information in the encoder is accumulated in the final hk and sent to start the decoder
We assume hk is good enough to hold all this information - dangerous
What is the Attention RNN
Used to solve the information bottleneck problem
Concerned with the loss of information between states in the encoder
Automatically searches for parts of a source sequence that are relevant to the target prediction
Selectively builds direct connections between each state in the decoder and states in the encoder
Autoregressive structure
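A minimal NumPy sketch of one attention step at a single decoder state; the dot-product scoring function is an assumption here, since different attention designs use different score functions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))            # encoder states h1..h6 as rows, dimension 8
s = rng.normal(size=8)                 # current decoder state

scores = H @ s                         # relevance of each encoder state to the decoder state
weights = softmax(scores)              # positive, sum to 1 -> usable as attention weights
context = weights @ H                  # weighted sum of encoder states (the direct connection)

print(weights.round(3), context.shape)
```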
What can we say about the softmax function
each output is monotonically increasing in its corresponding input (larger scores get larger weights)
maps any real numbers to positive values between 0 and 1 that sum to 1, proportionally to their exponentials
so the results are well suited to be used as weights in the attention RNN
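A minimal sketch of the softmax itself, showing why its outputs behave as described above:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())            # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

w = softmax([2.0, 1.0, -3.0])
print(w)                               # all positive, each between 0 and 1
print(w.sum())                         # 1.0 -> a proper weighting over the inputs
```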
What is the benefit of Attention RNN
Improves model performance
Solves the information bottleneck problem
Helps with the vanishing gradient problem (the direct connections shorten gradient paths)
Provides interpretability (the attention weights show which past states the prediction relied on)
What is the motivation of multi head attention
To increase the model's capacity: several attention heads run in parallel, each learning its own attention pattern over the states
What is the difference between RNN vs Attention
RNN connects each state with the previous state
Only takes current and previous into account
Attention automatically identifies the relevant past states
Does not care about order - only about similarity
What is a Transformer
the state-of-the-art neural network architecture for NLP, used in almost all recent language models
Concerned with attention only “Attention Is All You Need”
Transformers: What is positional encoding
First step of transformer encoding
motivation: injects order information into the model
Adds an encoding of the position i of each state to its input vector
For even dimensions of the encoding vector we use sine, for odd dimensions cosine:
PE(i, 2j) = sin(i / 10000^(2j/d)), PE(i, 2j+1) = cos(i / 10000^(2j/d))
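A NumPy sketch of the sinusoidal positional encoding above; the base 10000 and the even/odd split over dimensions follow the original Transformer paper:

```python
import numpy as np

def positional_encoding(num_positions, d):
    """PE[i, 2j] = sin(i / 10000^(2j/d)), PE[i, 2j+1] = cos(i / 10000^(2j/d))."""
    positions = np.arange(num_positions)[:, None]        # state order i
    dims = np.arange(0, d, 2)[None, :]                   # even dimension indices 2j
    angles = positions / np.power(10000.0, dims / d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                         # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                         # cosine on odd dimensions
    return pe

X = np.zeros((5, 8))                                     # 5 input state vectors of dimension 8
G = X + positional_encoding(5, 8)                        # order information added to the inputs
print(G.shape)
```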
Transformers: What is Encoder Multi-head Attention
After positional encoding, we store the output vectors g1, g2, …, gk as rows of a matrix G
Pass this G as query, key and value (self-attention) into the encoder multi-head attention
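A NumPy sketch of multi-head self-attention where G supplies the query, key and value; the projection matrices WQ, WK, WV, the scaled dot-product score and the head count are assumptions in line with the original Transformer, not details taken from the flashcards:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
k, d, heads = 5, 8, 2                  # sequence length, model dimension, number of heads
d_head = d // heads
G = rng.normal(size=(k, d))            # rows g1..gk from the positional-encoding step

outputs = []
for h in range(heads):                 # each head has its own learned projections
    WQ, WK, WV = (rng.normal(scale=0.3, size=(d, d_head)) for _ in range(3))
    Q, K, V = G @ WQ, G @ WK, G @ WV   # self-attention: query, key and value all come from G
    scores = Q @ K.T / np.sqrt(d_head)             # scaled dot-product similarities
    weights = softmax(scores, axis=-1)             # each state attends over all states
    outputs.append(weights @ V)

Z = np.concatenate(outputs, axis=-1)   # heads are concatenated (a final projection usually follows)
print(Z.shape)                         # (k, d): one output vector per input state
```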
Transformers: Add and Layer Normalisation
Motivation: prevent information loss/change caused by previous attention layer (assume there is loss)
So the output of layer i-2 (the input to the previous sub-layer) is added to the output of layer i-1, the sum is layer-normalised, and the result is passed on to layer i (the current layer)
(Instead of just taking the i-1 output)
This helps stabilise the training process
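A NumPy sketch of the Add & Norm step, assuming sublayer_input is what went into the attention layer and sublayer_output is what came out of it (the learned scale and shift of layer normalisation are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each state vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
sublayer_input = rng.normal(size=(5, 8))    # output of layer i-2 (input to the attention layer)
sublayer_output = rng.normal(size=(5, 8))   # output of layer i-1 (the attention layer)

out = layer_norm(sublayer_input + sublayer_output)   # "Add" keeps the original information, then normalise
print(out.shape, out.mean(axis=-1).round(6))         # per-state mean ~0 after normalisation
```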
Transformers: What is the whole encoder
1) positional encoding
2) multi head attention
3) add and layer normalisation
4) fully connected NN
5) add and layer normalisation
Steps 2-5 form one building block of the encoder, which can be repeated N times (positional encoding is applied once to the input)
The fully connected NN uses a hidden layer with ReLU
The output vector for each state in the input sequence has the same dimension as its input vector, so the blocks can be stacked
Transformers: What is Decoder Multi-head Attention
Same structure as the encoder attention except in decoding we do not know the subsequent states (we MASK them = assume they don’t exist)
Each state may only attend to itself and earlier states; in practice the attention scores towards future states are masked (set to -inf before the softmax) so they receive zero weight, which is equivalent to removing the future states and just using the previous ones
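A sketch of how the masking is usually realised in practice: scores towards future states are set to -inf before the softmax, which has the same effect as removing those states:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

k = 4
scores = np.zeros((k, k))                          # dummy equal scores between all state pairs
mask = np.triu(np.ones((k, k)), 1).astype(bool)    # True above the diagonal = future positions
scores[mask] = -np.inf                             # future states get -inf scores

weights = softmax(scores, axis=-1)
print(weights.round(2))
# row t attends only to states 1..t: the first row is [1. 0. 0. 0.],
# the last row is [0.25 0.25 0.25 0.25]
```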
Transformers: What is Encoder-decoder Attention
Comes after the decoder attention layer
Offers us the opportunity to inject the information from the encoder into the decoder
The encoder output is used as the value and key input into the encoder-decoder attention
The query is from the decoder output
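A NumPy sketch of one encoder-decoder attention head: the query comes from the decoder states while the key and value come from the encoder output (the projection matrices and dot-product scoring are assumptions, as before):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
encoder_out = rng.normal(size=(6, d))        # one vector per source (encoder) state
decoder_states = rng.normal(size=(4, d))     # one vector per target (decoder) state so far

WQ, WK, WV = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
Q = decoder_states @ WQ                      # queries come from the decoder
K, V = encoder_out @ WK, encoder_out @ WV    # keys and values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (4, 6): each target state attends over the source
context = weights @ V                              # encoder information injected into the decoder
print(context.shape)                               # (4, 8)
```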
Transformers: What is the whole decoder
1) Masked decoder attention
2) Add and layer normalisation
3) encoder-decoder attention
4) add and layer normalisation
5) fully connected feedforward NN
6) add and layer normalisation
Steps 1-6 form one building block of the decoder, which can be repeated N times; the output of the final block is sent to the prediction layer
What building blocks do Transformers involve?
- Encoder
- Decoder
with: - multi-head attentions
- fully-connected feedforward neural networks
- add&Norm operation
- positional encoding