Deep Learning Flashcards

1
Q

How can we use deep learning for NLP

A

data-driven approach

take the input text and embed it in a high-dimensional vector space

run a prediction model on this representation

both the embedding and the prediction model are usually neural networks
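
a minimal sketch of this pipeline (PyTorch assumed; the vocabulary, sizes and class count are made up): embed token ids into vectors, then run a small prediction model on top

import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 1000, 64, 3       # hypothetical sizes
embedding = nn.Embedding(vocab_size, embed_dim)        # map token ids to vectors
classifier = nn.Linear(embed_dim, num_classes)         # prediction model on top

token_ids = torch.tensor([[5, 42, 7, 99]])             # one toy "sentence" as token ids
vectors = embedding(token_ids)                         # shape (1, 4, 64)
sentence_vector = vectors.mean(dim=1)                  # crude pooling over the sequence
class_probs = classifier(sentence_vector).softmax(dim=-1)  # distribution over classes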

2
Q

How can we use time series for NLP

A

a time series is a set of data points in time order. for us, the data points are words (tokens) and the time order is their order of appearance in the text

3
Q

What tasks can we perform using deep learning (4)

A

sequence classification
sequence labelling
sequence extraction
sequence to sequence translation

4
Q

what is sequence classification

A

the output is a probability distribution over the classes the text can belong to.

5
Q

what is span extraction and how is it treated as a classification problem

A

return 2 probability distributions over token positions:
one for a token being the start of the span
one for a token being the end of the span
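
a minimal sketch (PyTorch assumed; the encoder output is a random stand-in and the sizes are hypothetical) of treating span extraction as two classifications over token positions:

import torch
import torch.nn as nn

seq_len, hidden_dim = 12, 64                           # hypothetical sizes
token_states = torch.randn(1, seq_len, hidden_dim)     # stand-in for an encoder's per-token output

start_head = nn.Linear(hidden_dim, 1)                  # scores each token as a possible span start
end_head = nn.Linear(hidden_dim, 1)                    # scores each token as a possible span end

start_probs = start_head(token_states).squeeze(-1).softmax(dim=-1)  # (1, seq_len)
end_probs = end_head(token_states).squeeze(-1).softmax(dim=-1)      # (1, seq_len)
span = (start_probs.argmax(dim=-1), end_probs.argmax(dim=-1))       # most likely start and end positions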

6
Q

what does an ML model for sequence labelling output

A

the output is a probability distribution over the classes for each token
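
a minimal sketch (PyTorch assumed; encoder output and sizes are stand-ins) of the per-token output:

import torch
import torch.nn as nn

seq_len, hidden_dim, num_labels = 8, 64, 5             # hypothetical sizes
token_states = torch.randn(1, seq_len, hidden_dim)     # stand-in for an encoder's per-token output

label_head = nn.Linear(hidden_dim, num_labels)
per_token_probs = label_head(token_states).softmax(dim=-1)  # (1, seq_len, num_labels)
# one probability distribution over the label classes for every token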

7
Q

examples of sequence labelling

A

such as POS tagging, named entity recognition, open information extraction, question type classification

8
Q

how is sequence to sequence translation performed

A

the input is a sequence and the output is another sequence

9
Q

applications for sequence to sequence translation

A

translation, summarisation, text generation, question answering

10
Q

what is the major disadvantage of bag of words

A

we lose the ordering
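
a tiny illustration in plain Python (the toy sentences are made up): two sentences with different meanings have identical bags of words

from collections import Counter

a = "the dog bit the man".split()
b = "the man bit the dog".split()
print(Counter(a) == Counter(b))   # True: identical bags of words, yet opposite meanings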

11
Q

what is an RNN

A

recurrent neural network: when processing the current token, we use the output (hidden state) computed for the previous token
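
a minimal sketch of the recurrence (PyTorch assumed; the embeddings and sizes are stand-ins):

import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64                         # hypothetical sizes
cell = nn.RNNCell(embed_dim, hidden_dim)

tokens = torch.randn(10, 1, embed_dim)                 # 10 time steps, batch of 1 (stand-in embeddings)
h = torch.zeros(1, hidden_dim)                         # initial hidden state
for x in tokens:                                       # process tokens in order
    h = cell(x, h)                                     # current step reuses the previous step's output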

12
Q

what is the vanishing gradient problem

A

gradients become very small as they are propagated back through many steps, so the weight updates become effectively zero and early parts of the sequence stop learning
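
a toy numerical illustration in plain Python (the 0.5 factor is made up):

grad = 1.0
for _ in range(50):        # backpropagating through 50 time steps
    grad *= 0.5            # each step scales the gradient by a factor < 1 (value is made up)
print(grad)                # ~8.9e-16: effectively zero, so early tokens get no update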

13
Q

why is the vanishing gradient problem prevalent in NLP

A

because of the long-term dependencies in text: the relevant context for a token can be many tokens away

14
Q

what is the solution to the vanishing gradient problem

A

LSTM (long short-term memory)

15
Q

what is an LSTM

A

uses a context vector that retains information from past calculations.
at each step it decides how much to add to the context vector and whether anything should be discarded
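
a minimal usage sketch (PyTorch assumed; sizes are hypothetical); nn.LSTM carries the context (cell) vector alongside the hidden state:

import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64                         # hypothetical sizes
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randn(1, 10, embed_dim)                 # batch of 1, 10 token embeddings
outputs, (h_n, c_n) = lstm(tokens)                     # c_n is the context (cell) vector the gates update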

16
Q

what is a GRU

A

gated recurrent unit. similar to an LSTM in performance but with fewer parameters.
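
a quick check (PyTorch assumed; sizes are hypothetical) that a GRU of the same size has fewer parameters than an LSTM (3 weight sets vs the LSTM's 4):

import torch.nn as nn

embed_dim, hidden_dim = 32, 64                         # hypothetical sizes
lstm = nn.LSTM(embed_dim, hidden_dim)
gru = nn.GRU(embed_dim, hidden_dim)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(lstm), count(gru))                         # the GRU has roughly 3/4 of the LSTM's parameters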

17
Q

what is cataphora

A

use of a word or phrase that refers forward to a later word or phrase (e.g. "when she arrived, Anna sat down": "she" refers to "Anna")

18
Q

what is the main disadvantage of RNNs, LSTMs and GRUs

A

they only take into account previous, not future, data

19
Q

what is a BiRNN, how do we create one

A

bidirectional RNN. stack forward and backward RNNs on top of each other
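
a minimal sketch (PyTorch assumed; sizes are hypothetical): bidirectional=True runs forward and backward LSTMs and concatenates their per-token outputs

import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64                         # hypothetical sizes
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randn(1, 10, embed_dim)                 # batch of 1, 10 token embeddings
outputs, _ = bilstm(tokens)                            # (1, 10, 128): forward and backward states concatenated per token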

20
Q

what is self-attention

A

rather than processing a sequence left to right, learn the relative importance of each token with respect to the others.

To maintain the ordering, we encode the position of each token
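
a minimal sketch of scaled dot-product self-attention (PyTorch assumed; the embeddings, sizes and positional encoding are random stand-ins):

import math
import torch
import torch.nn as nn

seq_len, dim = 10, 64                                  # hypothetical sizes
x = torch.randn(1, seq_len, dim)                       # stand-in token embeddings
x = x + torch.randn(1, seq_len, dim)                   # stand-in positional encoding, so ordering is not lost

q_proj, k_proj, v_proj = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
q, k, v = q_proj(x), k_proj(x), v_proj(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(dim)      # relative importance of each token w.r.t. the others
weights = scores.softmax(dim=-1)                       # (1, seq_len, seq_len) attention weights
attended = weights @ v                                 # each token becomes a weighted mix of all tokens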

21
Q

what is a word embedding

A

a static map from string to vector, trained on co-occurrence in a training corpus

22
Q

what is language modelling

A

autocomplete

predict the next token given the previous tokens
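
a minimal sketch of the prediction step (PyTorch assumed; the context encoding and sizes are stand-ins):

import torch
import torch.nn as nn

vocab_size, hidden_dim = 1000, 64                      # hypothetical sizes
context_state = torch.randn(1, hidden_dim)             # stand-in encoding of the previous tokens
to_vocab = nn.Linear(hidden_dim, vocab_size)

next_token_probs = to_vocab(context_state).softmax(dim=-1)  # distribution over the next token
predicted_id = next_token_probs.argmax(dim=-1)              # "autocomplete": most likely next token id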

23
Q

what information does language modelling combine (3)

A

syntax, semantics, and self-supervision

24
Q

how do we perform language modelling using a BiLSTM

A

predict the missing word using the context to the left and right of the blank

25
Q

which BiLSTM model performs language modelling

A

ELMo

26
Q

what does a BiLSTM model for language modelling return, and how is it optimised

A

returns 2 probability distributions that we combine

we want to maximise the log likelihood of the expected token jointly for both directions
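
a minimal sketch (PyTorch assumed; the two distributions are random stand-ins) of the joint objective, written as a loss to minimise:

import torch

vocab_size, target_id = 1000, 7                        # hypothetical vocabulary size and expected token id
forward_probs = torch.softmax(torch.randn(vocab_size), dim=-1)   # p(token | left context), stand-in
backward_probs = torch.softmax(torch.randn(vocab_size), dim=-1)  # p(token | right context), stand-in

# joint objective: maximise log p_fwd(target) + log p_bwd(target)
log_likelihood = forward_probs[target_id].log() + backward_probs[target_id].log()
loss = -log_likelihood                                 # minimised during training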

27
Q

what is a transformer language model

A

use self-attention to replace the RNN in ELMo

28
Q

why do we replace the RNN with self-attention

A

does not require backpropagation through time (no recurrence), so computation can be parallelised
this makes optimisation faster
meaning bigger models can be trained on larger data
bigger language models mean better performance

29
Q

what is the problem when using deep learning for NLP

A

spurious correlations. there is no way for a neural network to distinguish spurious from robust correlations

30
Q

how do we overcome the issue of spurious correlations

A

more data

31
Q

what are the two problems with neural networks

A
out of distribution generalisation
interpretability