NLP Tips and Tricks Flashcards
How does the softmax classifier work?
It gives a probability distribution over classes from a set of scores. For each input x, predict the probability of class k by taking the dot product of the weight vector for that class and input x, taking the exponential of it, and dividing by the sum of the exponentials of the scores for all the classes: P(k | x) = exp(w_k · x) / Σ_c exp(w_c · x)
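A minimal sketch of this in NumPy (the function name softmax_probs and the variables W and x are illustrative, not from the cards):

```python
import numpy as np

def softmax_probs(W, x):
    # one score per class: dot product of each class's weight vector with x
    scores = W @ x
    # exponentiate and normalise; subtracting the max is a standard numerical
    # stability trick, an addition not mentioned on the card
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()
```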
What is the input and output for softmax?
The input is a feature vector of length n representing the word(s)
The output is a probability distribution over labels (POS tags, NE BIO tags)
What is the size of the softmax weight matrix?
It has dimensions of (C x n), where C is the number of classes and n is the number of entries in the input vector
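A quick shape check, assuming illustrative values C = 5 and n = 10:

```python
import numpy as np

C, n = 5, 10               # 5 classes, 10-dimensional input
W = np.random.randn(C, n)  # weight matrix: one row of n weights per class
x = np.random.randn(n)     # input vector
scores = W @ x             # (C x n) @ (n,) -> (C,): one score per class
```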
What is cross-entropy loss?
It is a loss function used to train the weights: it measures how far the predicted probability distribution is from the true class distribution. We want to minimise the cross-entropy loss
How do we compute cross entropy?
H(p, q) = -Σ_c p(c) log q(c): for each class c, the true probability p(c), multiplied by the log of the predicted probability q(c), summed across all possible classes, multiplied by minus 1. Because the true distribution p is one-hot, this simplifies to -log q(k), where k is the true class and q is the predicted probability distribution
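A short worked sketch checking the simplification (the example numbers are made up):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum over classes of p(c) * log q(c)
    return -np.sum(p * np.log(q))

q = np.array([0.7, 0.2, 0.1])  # predicted distribution
p = np.array([1.0, 0.0, 0.0])  # one-hot true distribution, true class k = 0
full = cross_entropy(p, q)     # -log 0.7, about 0.357
shortcut = -np.log(q[0])       # simplified form -log q(k), same value
```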
What do activation functions do?
Given the weighted input to a hidden layer, an activation function computes that layer's output values, introducing non-linearity into the network
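A sketch of two common activation functions applied to a hidden layer's weighted input (the cards don't name specific activations; ReLU and sigmoid here are assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, 0.5, 2.0])  # weighted input to a hidden layer
h = relu(z)                     # hidden layer values: [0.0, 0.5, 2.0]
```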
How are model parameters modelled?
They form a C x n matrix of weights, where C is the number of classes and n is the length of the input vector
What is shown in the image?
It shows the gradient of the loss, computed with respect to the parameters: a matrix with the same shape as the weight matrix
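For the softmax classifier with cross-entropy loss, this gradient matrix has a well-known closed form; a sketch with made-up values:

```python
import numpy as np

# gradient of the loss w.r.t. the weight matrix W: (q - p) outer x,
# a C x n matrix with the same shape as W itself
q = np.array([0.7, 0.2, 0.1])        # predicted probabilities (C = 3)
p = np.array([1.0, 0.0, 0.0])        # one-hot true distribution
x = np.array([0.5, -1.0, 2.0, 0.0])  # input vector (n = 4)
grad_W = np.outer(q - p, x)          # shape (3, 4), matching W's (C x n)
```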
What is backpropagation?
It is a way to compute the gradients of the loss with respect to all the parameters efficiently, by applying the chain rule backwards through the network using matrix computation.
Why is matrix computation good?
It is highly parallelizable
What does forward propagation do?
It computes the hidden layer values by applying the weights and activation functions, moving from the input towards the output
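A minimal forward pass through one hidden layer (all shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])    # input vector
W1 = np.random.randn(3, 2)  # input -> hidden weights
W2 = np.random.randn(2, 3)  # hidden -> output weights
h = sigmoid(W1 @ x)         # hidden layer values via the activation function
scores = W2 @ h             # output scores (e.g. fed into softmax)
```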
What does back propagation do?
It calculates partial derivatives at each step and passes gradients back through the graph
Compute local gradients + apply chain rule
The downstream gradient = upstream gradient x local gradient
Having multiple inputs leads to multiple local gradients, one per input
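A minimal numeric sketch of these rules on a two-node graph (the graph q = x * y followed by L = q + z is an assumption chosen for illustration):

```python
# forward pass through the tiny graph
x, y, z = 2.0, -3.0, 5.0
q = x * y  # -6.0
L = q + z  # -1.0

# backward pass, starting from dL/dL = 1 at the output
dL = 1.0
dq = dL * 1.0  # add node: local gradient of L w.r.t. q is 1
dz = dL * 1.0  # ... and w.r.t. z is also 1
# the multiply node has two inputs, hence two local gradients:
# dq/dx = y and dq/dy = x
dx = dq * y    # dL/dx = -3.0  (downstream = upstream x local)
dy = dq * x    # dL/dy =  2.0
```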