DEEP LEARNING FOR NLP Flashcards

1
Q

What is Deep Learning (DL)?

A

A subset of machine learning that uses neural networks with multiple layers
The solution system is a neural network
Builds end-to-end systems, which take raw objects as the input (no initial feature extraction)
e.g. for a raw image, the input is the pixel values

DL vs NN: DL emphasises networks with a higher number of layers

2
Q

NLP tasks: Sequence distribution

A

Model the probability distribution of a sequence
p(x1 | x2, …, xn) or p(x1, …, xn)
Used for text generation / completion
e.g. modelling a chatbot
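
A minimal sketch of the idea, using an invented bigram table (not from the lecture): the joint probability p(x1, …, xn) is built up from next-word conditionals, which is what text generation / completion relies on:

import numpy as np

# Hypothetical bigram model: p(next_word | previous_word).
# The probabilities below are invented for illustration only.
bigram = {
    ("<s>", "the"): 0.5, ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3, ("sat", "</s>"): 0.4,
}

def sequence_log_prob(words):
    """log p(x1, ..., xn) = sum_i log p(x_i | x_{i-1}) under the bigram model."""
    logp = 0.0
    for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
        logp += np.log(bigram.get((prev, cur), 1e-6))  # tiny probability for unseen pairs
    return logp

print(sequence_log_prob(["the", "cat", "sat"]))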

3
Q

NLP tasks: Sequence Classification

A

To learn a representation vector of a sequence and use it to classify the sequence
f(x1 -> x2 -> … -> xk) = class
(sequence input, only one class output)
e.g. sentiment analysis, spam filtering
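
A minimal numpy sketch of the idea, with invented dimensions and random weights: the whole sequence is compressed into one representation vector, which is then mapped to a single class:

import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(7, 16))      # 7 tokens, each a 16-dim embedding
W = rng.normal(size=(16, 2))        # classifier weights for 2 classes (e.g. pos/neg)

h = seq.mean(axis=0)                # one representation vector for the whole sequence
scores = h @ W                      # one set of class scores for the whole sequence
pred = scores.argmax()              # single class label, e.g. 0 = negative, 1 = positive
print(pred)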

4
Q

NLP tasks: Sequence labeling

A

Learn a representation vector for each state (element) in a sequence and use it to predict the class label for each state
f(x1 -> x2 -> … -> xk) = class1 -> class2 -> … -> classk
(sequence input, one class output per state)
e.g. POS tagging, named entity recognition
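
For contrast with sequence classification, a sketch (again with invented dimensions and random weights) where every state keeps its own representation vector and gets its own label:

import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(7, 16))      # 7 tokens, each a 16-dim representation
W = rng.normal(size=(16, 5))        # weights for 5 tag classes (e.g. POS tags)

scores = seq @ W                    # shape (7, 5): one score vector per token
labels = scores.argmax(axis=1)      # one class label per token
print(labels)                       # 7 labels for 7 input tokens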

5
Q

NLP tasks: seq2seq learning

A

To encode the information in an input sequence (seq) and decode it to generate an output sequence (2seq)
f(x1 -> x2 -> … -> xk) = y1 -> y2 -> … -> ym (the output length need not match the input length)
e.g. language translation, question answering
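
A compact sketch of the encode-then-decode pattern, with a toy recurrence standing in for real encoder/decoder cells (all weights and sizes here are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
Wenc, Wdec, Wout = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))

def encode(xs):
    h = np.zeros(8)
    for x in xs:                        # read the whole input sequence
        h = np.tanh(Wenc @ h + x)
    return h                            # summary of the input

def decode(h, steps=4):
    ys = []
    for _ in range(steps):              # generate the output sequence step by step
        h = np.tanh(Wdec @ h)
        ys.append((Wout @ h).argmax())  # pick an output symbol (index) per step
    return ys

xs = [rng.normal(size=8) for _ in range(6)]   # 6 input states
print(decode(encode(xs)))                     # output length need not equal input length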

6
Q

What is sentiment analysis

A

Classifying the sentiment expressed in a text, e.g. “I liked the film a lot” = positive class
(Uses sequence classification)

7
Q

What is Vanilla RNN

A

The simplest RNN design
For the function f we use a single perceptron (a standard neuron operation) to compute the hidden representation vector
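
A minimal numpy sketch of the vanilla RNN recurrence with invented weights, where f is the single-neuron-style update h_t = tanh(Wh·h_(t-1) + Wx·x_t + b):

import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(8, 8)) * 0.1   # hidden-to-hidden weights
Wx = rng.normal(size=(8, 4)) * 0.1   # input-to-hidden weights
b = np.zeros(8)

def vanilla_rnn(xs):
    h = np.zeros(8)                  # initial hidden state
    for x in xs:                     # one update per state in the sequence
        h = np.tanh(Wh @ h + Wx @ x + b)
    return h                         # final hidden representation vector

print(vanilla_rnn([rng.normal(size=4) for _ in range(5)]))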

8
Q

What is the issue with Vanilla RNN

A

Can result in the vanishing gradient problem, which negatively affects training

9
Q

What is the Vanishing Gradient

A

In deep learning, most training is gradient-descent based
The gradient information is what is used to update the neural network weights
The term (Wh)^(k-i) in the gradient equation is problematic
k is the state of interest, i is a previous state
The entries of this matrix power can become very small for long-distance past states, so those states do not contribute to learning the correct weights
Vanishing gradients cause a loss of dependency between the current state and long-distance past states
Meaning the model is biased towards information in recent past states
e.g. “The writer of the books …”
“books” biases the prediction towards “are” (plural), but the correct answer is “is” (the subject is “writer”)
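
A small numeric illustration, using an arbitrary random weight matrix, of why the (Wh)^(k-i) factor shrinks as the distance k-i grows:

import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(8, 8)) * 0.25         # recurrent weights with small spectral radius

for distance in [1, 5, 20, 50]:             # distance k - i between states
    factor = np.linalg.matrix_power(Wh, distance)
    print(distance, np.abs(factor).max())   # entries shrink towards 0 as distance grows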

10
Q

How do we fix the vanishing gradient

A

Challenging
Requires new cell designs -> LSTM cells or Gated Recurrent Units (GRUs)
i.e. we modify the function f used to compute the hidden representation vector
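
As one example of such a modified cell, a minimal GRU-style update in numpy (weights invented): the gates z and r control how much of the old state is kept, which helps gradients survive over long distances:

import numpy as np

rng = np.random.default_rng(0)
d = 8
Wz, Wr, Wc = (rng.normal(size=(d, 2 * d)) * 0.1 for _ in range(3))
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x):
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                           # update gate: keep old state vs take new
    r = sigmoid(Wr @ hx)                           # reset gate: how much of the past to use
    c = np.tanh(Wc @ np.concatenate([r * h, x]))   # candidate state
    return (1 - z) * h + z * c                     # blend old state and candidate

h = np.zeros(d)
for x in [rng.normal(size=d) for _ in range(5)]:
    h = gru_step(h, x)
print(h)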

11
Q

What is Information Bottleneck

A

All the information in the encoder is accumulated in the final hidden state hk, which is sent to start the decoder
We assume hk is good enough to hold all of this information, which is a dangerous assumption

12
Q

What is the Attention RNN

A

Used to solve the information bottleneck problem
It is concerned with the loss of information from the encoder states
It automatically searches for the parts of the source sequence that are relevant to the target prediction
It selectively builds direct connections between each state in the decoder and the states in the encoder
Autoregressive structure
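
A sketch of the attention step itself, assuming the encoder states h1…hk and a decoder state s are already available (values invented): score each encoder state against s, turn the scores into weights with softmax, and take the weighted sum as the context vector:

import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                       # 6 encoder states, 8-dim each
s = rng.normal(size=8)                            # current decoder state

scores = H @ s                                    # dot-product relevance of each encoder state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
context = weights @ H                             # weighted sum: direct connection to all encoder states
print(weights.round(3), context.shape)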

13
Q

What can we say about the softmax function

A

Monotonically increasing (order-preserving)
Maps any set of numbers to positive values between 0 and 1 that sum to 1 (larger inputs get larger weights)
The results are therefore well suited as weights in the attention RNN
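
A minimal softmax in numpy (the max-subtraction is the usual numerical-stability trick and does not change the result):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()                  # positive values that sum to 1

print(softmax(np.array([2.0, 1.0, -1.0])))   # larger score -> larger weight (order preserved)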

14
Q

What is the benefit of Attention RNN

A

Improves model performance
Solves the information bottleneck problem
Helps with the vanishing gradient problem
Provides interpretability (the attention weights show which states the model attends to)

15
Q

What is the motivation for multi-head attention

A

To make the model more expressive (more complex): each head can learn its own attention pattern and focus on different parts / aspects of the sequence

16
Q

What is the difference between RNN and Attention

A

An RNN connects each state with the previous state
It only takes the current and previous states into account

Attention automatically identifies the relevant past states
It does not care about order, only about similarity

17
Q

What is a Transformer

A

The state-of-the-art neural network architecture for NLP, used in almost all recent language models
It relies on attention only: “Attention Is All You Need”

18
Q

Transformers: What is positional encoding

A

The first step of the transformer encoder
Motivation: inject order information into the model
Adds the order (position) of each state i to the input vector
For the even dimensions of the encoding (e.g. dimension 4) we use sine
for the odd dimensions we use cosine
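
A sketch of the sinusoidal positional encoding from the original Transformer paper, for an assumed even model dimension d (even dimensions use sine, odd dimensions use cosine):

import numpy as np

def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]                 # positions 0..seq_len-1
    i = np.arange(0, d, 2)[None, :]                   # even dimension indices
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe                                         # added to the input vectors

print(positional_encoding(4, 8).round(2))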

19
Q

Transformers: What is Encoder Multi-head Attention

A

After positional encoding, we store the output vectors g1, g2, g3, … as the rows of a matrix G
This G is passed as the query, key and value (self-attention) into the encoder multi-head attention
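
A sketch of scaled dot-product self-attention over G with a simple two-head split (real implementations use learned projection matrices per head, omitted here):

import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scaled dot-product scores
    return softmax_rows(scores) @ V             # weighted sum of the values

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 8))                     # rows g1..g5 after positional encoding

# Self-attention: G is used as query, key and value; two heads over split dimensions.
heads = [attention(Gh, Gh, Gh) for Gh in np.split(G, 2, axis=1)]
out = np.concatenate(heads, axis=1)             # concatenate the heads back to 8 dims
print(out.shape)                                # (5, 8): one output vector per state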

20
Q

Transformers: Add and Layer Normalisation

A

Motivation: prevent the information loss/change caused by the previous attention layer (we assume there is some loss)
So instead of just taking the output of layer i-1, we add the outputs of layers i-2 and i-1 together as the input to the current layer i (a residual/skip connection) and normalise the result
This helps stabilise the training process
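
A sketch of the Add & Norm step, assuming x is the sub-layer input and y its output (a real LayerNorm also has a learned gain and bias, omitted here):

import numpy as np

def add_and_layer_norm(x, y, eps=1e-6):
    z = x + y                                    # residual: add sub-layer input to its output
    mean = z.mean(axis=-1, keepdims=True)
    std = z.std(axis=-1, keepdims=True)
    return (z - mean) / (std + eps)              # normalise each state's vector

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # input to the attention sub-layer
y = rng.normal(size=(5, 8))                      # output of the attention sub-layer
print(add_and_layer_norm(x, y).shape)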

21
Q

Transformers: What is the whole encoder

A

1) positional encoding
2) multi head attention
3) add and layer normalisation
4) fully connected NN
5) add and layer normalisation

These steps form one building block of the encoder, which can be repeated N times (positional encoding is applied once, at the input)
The fully connected NN uses a hidden layer with ReLU
The output vector for each state in the input sequence has the same dimensions

22
Q

Transformers: What is Decoder Multi-head Attention

A

Same structure as the encoder attention, except that when decoding we do not know the subsequent states (we MASK them = assume they do not exist yet)
We simply remove the future states from the query, key and value matrices (only the previous states are used)
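
A sketch of the masking itself: a common implementation does not physically remove the future states but sets their scores to -inf before the softmax, so they receive zero weight (the same effect):

import numpy as np

def masked_self_attention(X):
    scores = X @ X.T / np.sqrt(X.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)   # True above the diagonal
    scores[mask] = -np.inf                                    # hide future (subsequent) states
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)               # future states get weight 0
    return weights @ X

rng = np.random.default_rng(0)
print(masked_self_attention(rng.normal(size=(4, 8))).shape)   # (4, 8)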

23
Q

Transformers: What is Encoder-decoder Attention

A

It comes after the masked decoder attention layer
It offers the opportunity to inject information from the encoder into the decoder

The encoder output is used as the key and value input to the encoder-decoder attention
The query comes from the decoder output
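
A sketch of the cross-attention wiring with invented arrays: the decoder states supply the queries, while the encoder output supplies the keys and values:

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(6, 8))    # encoder output: 6 source states
dec = rng.normal(size=(3, 8))        # decoder states so far: 3 target states

out = attention(Q=dec, K=enc_out, V=enc_out)   # query from decoder, key/value from encoder
print(out.shape)                               # (3, 8): one context-enriched vector per target state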

24
Q

Transformers: What is the whole decoder

A

1) Masked decoder attention
2) Add and layer normalisation
3) encoder-decoder attention
4) add and layer normalisation
5) fully connected feedforward NN
6) sent to the prediction layer

This is one building block of the decoder, which can be repeated N times

25
Q

What building blocks do Transformers involve?

A
  • Encoder
  • Decoder
  both built from:
  • multi-head attention
  • fully-connected feedforward neural networks
  • Add & Norm operations
  • positional encoding