Deep Learning Flashcards
How can we use deep learning for nlp
data driven approach
take input text, embed it in high dimensional vector space
run prediction model on this
both the embedding and the prediction model are usually neural networks
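A minimal sketch of this pipeline (toy vocabulary and sizes are assumptions, not from the cards): embed the tokens, pool them into one vector, and run a small prediction model on top.
```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}           # toy vocabulary (assumption)
embedding = rng.normal(size=(len(vocab), 8))     # token id -> 8-dim vector
W, b = rng.normal(size=(8, 2)), np.zeros(2)      # tiny prediction "model": one linear layer

tokens = ["the", "cat", "sat"]
vectors = embedding[[vocab[t] for t in tokens]]  # embed the input text
pooled = vectors.mean(axis=0)                    # collapse the sequence to one vector
logits = pooled @ W + b
probs = np.exp(logits) / np.exp(logits).sum()    # probability distribution over classes
print(probs)
```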
How can we use time series for nlp
time series is a set of data points in time order. these datapoints for us are words and the time order is the appearance order in the text
What tasks can we perform using deep learning (4)
sequence classification
sequence labelling
sequence extraction
sequence to sequence translation
what is sequence classification
the output is the probability distribution of classes the text belongs to.
what is span extraction and how is it treated as a classification problem
identify a contiguous span of tokens in the input. treated as classification by returning 2 probability distributions over the tokens,
one for each token being the start of the span
one for each token being the end of the span
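A sketch of the start/end distributions (sizes and weight names are illustrative assumptions): per-token scores are normalised into one start distribution and one end distribution.
```python
import numpy as np

token_reprs = np.random.default_rng(1).normal(size=(6, 4))  # 6 tokens, 4-dim representations
w_start = np.random.default_rng(2).normal(size=4)           # assumed start-scoring weights
w_end = np.random.default_rng(3).normal(size=4)             # assumed end-scoring weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_start = softmax(token_reprs @ w_start)  # P(token i is the span start)
p_end = softmax(token_reprs @ w_end)      # P(token i is the span end)
span = (int(p_start.argmax()), int(p_end.argmax()))
print(p_start, p_end, span)
```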
what does a ml model for sequence labelling output
the output is a probability distribution for each token over classes
examples of sequence labelling
such as POS tagging, named entity recognition, open information extraction, question type classification
how is sequence to sequence translation performed
input is the sequence and output is another sequence
applications for sequence to sequence translation
translation, summarisation, text generation, question answering
what is the major disadvantage of bag of words
we lose the ordering
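A tiny illustration of the lost ordering: two sentences with opposite meanings get the same bag-of-words representation, because only token counts are kept.
```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()
print(Counter(a) == Counter(b))  # True: the ordering information is gone
```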
what is a rnn
when processing each token we also use the hidden state output from the previous token
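A minimal sketch of an RNN forward pass (toy sizes, tanh cell; weight names are assumptions): the hidden state from the previous token is fed into the current step.
```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 16))   # input -> hidden weights
W_hh = rng.normal(size=(16, 16))  # previous hidden -> hidden weights
h = np.zeros(16)                  # initial hidden state

inputs = rng.normal(size=(5, 8))  # 5 token embeddings (toy data)
for x in inputs:
    h = np.tanh(x @ W_xh + h @ W_hh)  # reusing h carries information forward in time
print(h.shape)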
what is the vanishing gradient problem
gradients shrink as they are propagated back through many time steps, so the weight updates become effectively zero
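Rough intuition only (the 0.5 factor is an assumed local gradient, not a measured value): backpropagating through many steps multiplies many small factors, so the product shrinks towards zero.
```python
grad = 1.0
for _ in range(100):   # 100 time steps
    grad *= 0.5        # assumed per-step gradient factor < 1
print(grad)            # ~7.9e-31, effectively zero
```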
why is the vanishing gradient problem prevalent in nlp
because of the long term dependencies
what is the solution to the vanishing gradient problem
lstm
what is a lstm
uses a context vector that retains information from past calculations.
at each step it decides how much to add to the context vector and whether anything should be discarded
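A sketch of a single LSTM step using the standard gate equations (toy sizes; not a specific library implementation): the forget gate decides what to discard from the context (cell) vector, the input gate decides how much new information to add.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = {g: rng.normal(size=(d_in + d_h, d_h)) * 0.1 for g in "fico"}  # assumed gate weights

def lstm_step(x, h, c):
    z = np.concatenate([x, h])
    f = sigmoid(z @ W["f"])        # forget gate: what to discard from the context
    i = sigmoid(z @ W["i"])        # input gate: how much new information to add
    c_tilde = np.tanh(z @ W["c"])  # candidate context update
    c = f * c + i * c_tilde        # updated context vector
    o = sigmoid(z @ W["o"])        # output gate
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```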
what is a gru
gated recurrent unit. similar to lstm in performance but fewer parameters.
what is cataphora
use of a word or phrase that refers forward to a word or phrase appearing later in the text
what is the main disadvantage of rnn, lstm and gru
they only take into account previous, not future, data
what is a birnn, how do we create one
bidirectional rnn. run a forward rnn and a backward rnn over the sequence and combine (e.g. concatenate) their outputs at each position
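A sketch of a bidirectional RNN (toy sizes; names are illustrative): one RNN reads the sequence left to right, another reads it right to left, and the per-token outputs are concatenated so each position sees both past and future context.
```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 5
inputs = rng.normal(size=(T, d_in))

def run_rnn(xs, W_xh, W_hh):
    h, out = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh)
        out.append(h)
    return np.stack(out)

fwd = run_rnn(inputs, rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)))
bwd = run_rnn(inputs[::-1], rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)))[::-1]
states = np.concatenate([fwd, bwd], axis=-1)  # (T, 2 * d_h): past + future context per token
print(states.shape)
```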
what is self attention
rather than processing a sequence left to right, learn the relative importance of each token with respect to the others.
To maintain the ordering, we encode the position
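A sketch of single-head scaled dot-product self-attention with position encodings added to the inputs (toy sizes; the random position vectors stand in for a real positional encoding scheme).
```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 16
x = rng.normal(size=(T, d))        # token vectors
pos = rng.normal(size=(T, d)) * 0.1  # placeholder position encodings (assumption)
x = x + pos                          # inject ordering information

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d)                       # relative importance of every token pair
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                                # each token is a weighted mix of all tokens
print(weights.shape, output.shape)
```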
what is a word embedding
a static map from a string (word) to a vector, trained on co-occurrence statistics in a training corpus
what is language modelling
autocomplete
predict the next token given the previous tokens
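A tiny count-based illustration of next-token prediction (the corpus here is a toy assumption; real language models use neural networks instead of counts).
```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat slept".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1            # count which token follows which

def next_token_distribution(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_distribution("the"))  # {'cat': 0.666..., 'mat': 0.333...}
```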
what information does language modelling combine (3)
syntax, semantics, and self-supervision
how do we perform language modelling using a bilstm
predict missing word using left and right of blank
what bilstm model performs language modelling
elmo
what does a bilstm model for language modelling return, how is it optimised
returns 2 probability distributions that we combine
we want to maximise the log likelihood of the expected token jointly for both directions
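The joint objective in the standard ELMo-style bidirectional form (parameter details omitted; symbols are illustrative): maximise, over the tokens of the sequence,
```latex
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_N) \Big)
```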
what is a transformer language model
use self attention to replace the rnn in elmo
why do we replace the rnn with self attention
self attention does not require processing the sequence token by token, unlike an rnn
this makes computation parallelisable and optimisation faster
meaning bigger models can be trained on larger data
bigger language models mean better performance
what is the problem when using deep learning on nlp
spurious correlations. a neural network has no way to distinguish spurious from robust correlations
how do we overcome the issue of spurious correlations
more data
what are the two problems with neural networks
- out of distribution generalisation
- interpretability