Week 5 - Sequence Classification Flashcards
What is a multinomial classifier
many possible classes, but only one is correct for each input
what is multi-class multi-label classification
many possible classes; anywhere between 0 and K of them can be assigned to the text
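A minimal NumPy sketch (my own illustration, with made-up scores) of the difference: a multinomial classifier applies a softmax and picks exactly one class, while a multi-label classifier applies element-wise sigmoids and may assign anywhere from 0 to K labels.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 1.5])  # scores for K = 4 classes

# Multinomial: softmax -> probabilities sum to 1, exactly one class is chosen
softmax = np.exp(logits) / np.exp(logits).sum()
predicted_class = int(np.argmax(softmax))          # -> 0

# Multi-label: independent sigmoids -> each label decided on its own
sigmoid = 1 / (1 + np.exp(-logits))
predicted_labels = np.where(sigmoid > 0.5)[0]      # -> [0 2 3], could also be empty
```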
Sequence classification tasks
Sentiment analysis
fact verification
relations (between two texts):
- paraphrase detection
- semantic similarity
- textual entailment
What is relation classification
identifying the relation between two entities mentioned in a text; it can be treated as text classification
if the set of possible relations is constrained (a fixed label set)
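A hedged sketch of the idea (the entity-marker formatting and the relation names are my own illustration, not necessarily the lecture's convention): mark the two entities in the sentence and classify the marked sentence into a fixed relation label set with any sequence classifier.

```python
# Constrained set of possible relations -> ordinary text classification labels
RELATIONS = ["founded_by", "born_in", "employed_by", "no_relation"]

def mark_entities(sentence: str, e1: str, e2: str) -> str:
    """Wrap the two entities in markers so the classifier knows which pair to relate."""
    return (sentence.replace(e1, f"[E1] {e1} [/E1]")
                    .replace(e2, f"[E2] {e2} [/E2]"))

text = mark_entities("Steve Jobs founded Apple in 1976.", "Apple", "Steve Jobs")
# -> "[E2] Steve Jobs [/E2] founded [E1] Apple [/E1] in 1976."
# This marked sentence is then fed to a sequence classifier over RELATIONS,
# e.g. a fine-tuned BERT with a softmax over len(RELATIONS) classes.
```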
Textual entailment for fact verification
a claim is sent to an evidence retrieval module (which queries Wikipedia or a search engine)
the retrieved evidence + the claim are sent to a textual entailment classifier
which decides whether the evidence entails or contradicts the claim
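A hedged sketch of the entailment step, assuming the Hugging Face transformers library and the off-the-shelf roberta-large-mnli model (not necessarily the setup from the lecture); evidence retrieval is skipped and the evidence sentence is given directly.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained NLI model; its labels are contradiction / neutral / entailment
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

evidence = "The Eiffel Tower is located in Paris, France."   # retrieved evidence
claim = "The Eiffel Tower is in Paris."                      # claim to verify

# Premise = evidence, hypothesis = claim
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

label = model.config.id2label[int(logits.softmax(dim=-1).argmax())]
print(label)  # expected: ENTAILMENT (the evidence supports the claim)
```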
what are the main parts of deep learning text classification
use embeddings and neural networks
to obtain a distributed and contextualised representation
this representation is used to predict a probability distribution over the K classes for the input
What is the text classification encoder usually based on
averaging of embeddings
a CNN/RNN over embeddings
or a pre-trained language model, e.g. BERT
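A minimal PyTorch sketch (my own illustration, with made-up dimensions) of the simplest encoder option: average the word embeddings and classify the averaged vector with a linear layer.

```python
import torch
import torch.nn as nn

class AverageEmbeddingClassifier(nn.Module):
    """Embed tokens, average the embeddings, classify with a linear layer."""
    def __init__(self, vocab_size=10_000, embed_dim=100, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        embeds = self.embedding(token_ids)         # (batch, seq_len, embed_dim)
        mask = (token_ids != 0).float().unsqueeze(-1)
        avg = (embeds * mask).sum(1) / mask.sum(1).clamp(min=1)  # masked average
        return self.classifier(avg)                # (batch, num_classes) logits

logits = AverageEmbeddingClassifier()(torch.tensor([[5, 42, 7, 0, 0]]))
print(logits.shape)  # torch.Size([1, 3])
```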
architecture of transformers for text classification
prepend a special classification token (CLS) to the input sequence
the tokens pass through the self-attention layers, giving contextualised embeddings
the CLS embedding is passed through a FFNN or a softmaxed linear layer
cross-entropy loss is minimised between the predicted and ground-truth labels
the weights of the language model and of the feed-forward classification head are updated via backpropagation (fine-tuning)
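A hedged sketch of that pipeline with Hugging Face transformers and PyTorch (model name, label count, and data are placeholders): the CLS embedding from BERT goes through a linear layer, and cross-entropy is minimised against the gold label.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)   # 2 classes as an example
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(
    list(bert.parameters()) + list(classifier.parameters()), lr=2e-5)

texts = ["great movie", "terrible movie"]
labels = torch.tensor([1, 0])                        # toy ground-truth labels

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
cls_embedding = bert(**inputs).last_hidden_state[:, 0]  # embedding of the [CLS] token
logits = classifier(cls_embedding)

loss = loss_fn(logits, labels)   # cross-entropy between prediction and ground truth
loss.backward()                  # backpropagation through the head and BERT
optimizer.step()                 # fine-tuning: the LM weights are updated too
```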
What is the CLS
the special classification token prepended to the input; via self-attention, its embedding gathers information from all the other tokens in the sequence
so it can loosely be said to hold the “average” sentence representation
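A small sketch (my own illustration) of why the CLS position sees the whole sequence: inspect the last-layer attention weights from CLS to every token.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.attentions[-1]              # (batch, heads, seq_len, seq_len)
cls_attention = last_layer[0, :, 0, :].mean(0)   # average over heads: CLS -> every token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in zip(tokens, cls_attention):
    print(f"{token:>8s}  {weight:.3f}")          # non-zero weight on every token
```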
What is the difference in architecture for multi-label multi-class classification
softmax is not used, because softmax assumes the classes are mutually exclusive (the probabilities must sum to 1)
instead, each label's probability is modelled independently with an element-wise sigmoid, and the per-label losses are reduced to a single scalar for training
a decision threshold other than 0.5 is often used
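A minimal PyTorch sketch of the multi-label head (dimensions, data, and threshold are illustrative): independent sigmoids with a binary cross-entropy per label, and a tuned threshold at prediction time.

```python
import torch
import torch.nn as nn

num_labels = 5
head = nn.Linear(768, num_labels)              # e.g. on top of a CLS embedding
loss_fn = nn.BCEWithLogitsLoss()               # sigmoid + binary cross-entropy per label

cls_embedding = torch.randn(2, 768)            # stand-in for the encoder output
targets = torch.tensor([[1., 0., 1., 0., 0.],  # each example can have 0..K labels
                        [0., 0., 0., 0., 0.]])

logits = head(cls_embedding)
loss = loss_fn(logits, targets)                # per-label losses reduced to one scalar

probs = torch.sigmoid(logits)                  # element-wise sigmoid
threshold = 0.3                                # often tuned rather than fixed at 0.5
predicted = (probs > threshold).int()
```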
What is the issue with long document classification
computing self-attention is quadratic in the input length
so LMs are bounded to a maximum input size (e.g. 512 tokens for BERT)
but many documents are much longer than this
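A quick illustration of the quadratic cost (the sequence lengths are just examples): each layer and head needs an n × n attention matrix.

```python
for n in (512, 4096):
    print(f"n = {n}: attention matrix has {n * n:,} entries per head per layer")
# n = 512:  262,144 entries
# n = 4096: 16,777,216 entries  (8x longer input -> 64x more attention entries)
```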
How do transformers handle the long-document issue
- Truncation: keep only the first/last tokens (see the sketch after this list)
- hierarchical approaches
- a dedicated architecture: the Longformer
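A hedged sketch of the truncation option with a Hugging Face tokenizer: keeping the first tokens is built in, keeping the last tokens needs manual slicing of the token ids.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_document = "some very long document ... " * 1000

# Keep only the first tokens up to BERT's maximum input size
first = tokenizer(long_document, truncation=True, max_length=512, return_tensors="pt")
print(first["input_ids"].shape)  # torch.Size([1, 512])

# Keep the last tokens instead: tokenize without special tokens, then slice manually
ids = tokenizer(long_document, add_special_tokens=False)["input_ids"]
last = [tokenizer.cls_token_id] + ids[-510:] + [tokenizer.sep_token_id]
print(len(last))  # 512
```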
What is the Longformer
it doesn't compute attention between every pair of tokens
instead, each token attends to a window of neighbours to its left and right
variants: sliding window, dilated sliding window, global + sliding window attention
at lower layers, attention focuses on the close neighbourhood (syntactic features)
at higher layers, attention is spread out more widely (semantic features)
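A hedged sketch using the Hugging Face Longformer implementation (the checkpoint name and usage are assumptions based on the public allenai/longformer-base-4096 model): sliding-window attention everywhere, plus global attention on the CLS token.

```python
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

model_name = "allenai/longformer-base-4096"          # handles inputs up to 4096 tokens
tokenizer = LongformerTokenizer.from_pretrained(model_name)
model = LongformerForSequenceClassification.from_pretrained(model_name, num_labels=2)

long_document = "a very long document ... " * 500
inputs = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")

# Sliding-window attention for every token; give the CLS token global attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

logits = model(**inputs, global_attention_mask=global_attention_mask).logits
```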
What is the hierarchical approach
long documents are split into n chunks of size m with stride (overlap) s
each chunk is embedded with BERT
the chunk embeddings are used as input to an RNN classifier
the RNN output is fed into a FFNN and then a softmax
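A hedged PyTorch sketch of the hierarchy (chunk size, stride, and the GRU are illustrative choices): the document is split into overlapping chunks, each chunk is encoded by BERT into its CLS vector, and an RNN plus FFNN classifies the sequence of chunk vectors.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def chunk_tokens(token_ids, m=510, s=50):
    """Split a long token sequence into chunks of size m (510 + [CLS] + [SEP] = 512) with overlap s."""
    step = m - s
    return [token_ids[i:i + m] for i in range(0, max(len(token_ids) - s, 1), step)]

class HierarchicalClassifier(nn.Module):
    def __init__(self, hidden=256, num_classes=2):
        super().__init__()
        self.rnn = nn.GRU(bert.config.hidden_size, hidden, batch_first=True)
        self.ffnn = nn.Linear(hidden, num_classes)

    def forward(self, chunk_embeddings):            # (1, n_chunks, 768)
        _, h = self.rnn(chunk_embeddings)           # final RNN state summarises the chunks
        return self.ffnn(h[-1])                     # logits; softmax is applied in the loss

ids = tokenizer("a long document ... " * 600, add_special_tokens=False)["input_ids"]
chunk_vecs = []
with torch.no_grad():                               # BERT used here only as an embedder
    for chunk in chunk_tokens(ids):
        batch = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        chunk_vecs.append(bert(input_ids=batch).last_hidden_state[:, 0])

logits = HierarchicalClassifier()(torch.stack(chunk_vecs, dim=1))  # (1, num_classes)
```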
how do we train the hierarchical approach
- first fine-tune BERT as a classifier on the text chunks of fixed size m
- discard this classifier and fine-tune a new RNN on the BERT outputs
- optimise the RNN to predict the correct classes
- (the BERT parameters are frozen at this stage)
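A short self-contained sketch of the second training stage (the RNN/FFNN dimensions and data are placeholders): BERT's parameters are frozen and the optimiser only updates the RNN classifier on top of the fixed chunk embeddings.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")

# Stage 2: freeze the (already chunk-fine-tuned) BERT encoder
for param in bert.parameters():
    param.requires_grad = False

# Small RNN classifier over the per-chunk CLS embeddings
rnn = nn.GRU(bert.config.hidden_size, 256, batch_first=True)
ffnn = nn.Linear(256, 2)

# Optimise only the RNN + FFNN; BERT contributes fixed chunk embeddings
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(ffnn.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

chunk_embeddings = torch.randn(1, 6, bert.config.hidden_size)  # stand-in: 6 frozen BERT chunk vectors
label = torch.tensor([1])                                      # toy ground-truth class

_, h = rnn(chunk_embeddings)
loss = loss_fn(ffnn(h[-1]), label)
loss.backward()        # gradients reach the RNN/FFNN only, not the frozen BERT
optimizer.step()
```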