Training Flashcards
What is the generality of backpropagation?
It applies to any mathematical expression (i.e., any composition of differentiable functions), not just neural networks
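A minimal sketch of that generality, in the spirit of micrograd (the class below is illustrative, not micrograd's actual API): backprop applied to a plain arithmetic expression with no neural network in sight.

```python
class Value:
    """Minimal scalar autograd node (illustrative sketch, not micrograd's real API)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then apply the chain rule from the output back
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Backprop on an arbitrary expression, not a neural net:
x, y = Value(2.0), Value(3.0)
z = x * y + x          # z = xy + x
z.backward()
print(x.grad)          # dz/dx = y + 1 = 4.0
print(y.grad)          # dz/dy = x = 2.0
```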
What is the difference between micrograd and production-grade neural network libraries?
Micrograd operates on scalars rather than tensors, but otherwise uses the same math
In classification problems, how can you use cross entropy loss if a target value and a predicted value are, say, both 0.8?
In that case we really shouldn’t use cross entropy loss, because it never reaches 0: −0.8·log(0.8) ≈ 0.18 (natural log), which is not 0 even though there was zero prediction error! With soft targets, the minimum of cross entropy is the entropy of the target distribution; KL divergence, by contrast, is 0 at a perfect prediction.
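A quick check of the arithmetic, assuming a hypothetical two-class soft target (0.8, 0.2) that the model predicts perfectly:

```python
import math

# Hypothetical two-class soft target, predicted perfectly: q = p
p = [0.8, 0.2]
q = [0.8, 0.2]

# Cross entropy H(p, q) = -sum p_i * log(q_i)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
# KL(p || q) = sum p_i * log(p_i / q_i)
kl_divergence = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(round(cross_entropy, 3))  # ~0.5 -- the entropy of the target, not 0
print(kl_divergence)            # 0.0 -- KL is zero when p == q
```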
What do you call it when working with classification problems if your target distributions aren’t one-hot encoded?
You’re working with “soft” target distributions
Why do we call it “multi class classification?”
It is in contrast to binary classification, where there are only two classes; multi-class classification allows more than two
What is the restriction on the domain of values MSE is able to be used on?
No restriction; the values just need to be real numbers
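A small illustration that MSE happily accepts any real values, negatives included (the numbers here are made up):

```python
# Mean squared error over arbitrary real-valued targets and predictions
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([-2.0, 0.0, 3.5], [-1.5, 0.5, 3.0]))  # (0.25 + 0.25 + 0.25) / 3 = 0.25
```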
What are the possible input values for KL divergence?
The inputs must be probability distributions, and the predicted values must be greater than zero wherever the corresponding target values are nonzero (otherwise the log term is undefined)
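A sketch of why that restriction exists: the log(p/q) term blows up when a predicted value q is zero where the target p is not (the function name and values below are illustrative):

```python
import math

def kl(p, q):
    # KL(p || q) = sum p_i * log(p_i / q_i); terms with p_i == 0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl([0.5, 0.5], [0.9, 0.1]))   # finite, positive
# kl([0.5, 0.5], [1.0, 0.0])        # would raise ZeroDivisionError: q must be > 0 where p > 0
```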
In sequence-to-sequence transformer models, what loss function is usually used?
Cross entropy loss
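A sketch of how that loss is computed per token: the negative log-softmax probability of the correct next token, using the log-sum-exp trick for numerical stability (the logits and tiny vocabulary here are made up):

```python
import math

# Per-token cross entropy over a vocabulary (illustrative sketch)
def cross_entropy(logits, target_idx):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp
    return log_z - logits[target_idx]   # = -log softmax(logits)[target_idx]

# Three-word "vocabulary"; the correct next token is index 0
print(round(cross_entropy([2.0, 0.5, -1.0], 0), 3))  # ~0.241
```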
Explain how, given dL/dWi, a particular modification to that weight Wi is chosen
Wi ← Wi − η · (dL/dWi)
where η is the learning rate and dL/dWi is the computed gradient
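The update rule as code (the names and the learning-rate value are illustrative):

```python
# One gradient-descent step for a single weight: w_i <- w_i - lr * dL/dw_i
def sgd_step(w_i, grad_w_i, lr=0.01):
    return w_i - lr * grad_w_i

# Positive gradient means the loss increases with w_i, so the step moves w_i down
print(sgd_step(0.5, 2.0, lr=0.1))  # 0.5 - 0.1 * 2.0 = 0.3
```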
Is ReLU differentiable?
Not everywhere: at 0 the left derivative is 0 and the right derivative is 1, so no single derivative exists there. In practice, frameworks simply pick a value (usually 0) for the derivative at 0.
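A numerical illustration: the one-sided difference quotients of ReLU at 0 disagree, so no single derivative exists there.

```python
def relu(x):
    return max(0.0, x)

# One-sided difference quotients at x = 0
h = 1e-6
right = (relu(0 + h) - relu(0)) / h   # slope approaching from the right: 1.0
left  = (relu(0) - relu(0 - h)) / h   # slope approaching from the left: 0.0
print(right, left)                    # they disagree, so ReLU is not differentiable at 0
```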
Are the intermediate values of each input, output, and weight necessary to compute the gradient for a particular weight? Can’t you just compute the gradients analytically?
Yes, they are necessary, and the second question misses a step: an analytical expression for the derivative of the loss with respect to a weight is written in terms of variables (inputs, outputs, weights, etc.), and you need the values of those variables to evaluate the gradient
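A concrete example, using a toy loss L = (w·x − y)² chosen purely for illustration: the analytic gradient is an expression, but evaluating it still requires the stored variable values.

```python
# For L = (w*x - y)**2, the analytic gradient is dL/dw = 2*(w*x - y)*x.
# It is an expression in w, x, y: you still need their values to evaluate it.
def grad_w(w, x, y):
    return 2 * (w * x - y) * x

print(grad_w(w=3.0, x=2.0, y=1.0))  # 2 * (6 - 1) * 2 = 20.0
```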
Do we store a matrix of analytical expressions for each gradient during training?
No
Rather, modern deep learning frameworks like PyTorch and TensorFlow compute each gradient numerically, as needed, via backpropagation (reverse-mode automatic differentiation)
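A minimal sketch using PyTorch's autograd (assuming torch is installed; the toy loss is the same illustrative (w·x − y)² as above): the computation graph is recorded during the forward pass, and gradients are produced on demand by backward(), with no table of symbolic derivative expressions ever materialized.

```python
import torch

# PyTorch records operations as they run and derives gradients on demand
w = torch.tensor(3.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

loss = (w * x - y) ** 2   # forward pass builds the graph
loss.backward()           # backprop fills in w.grad numerically

print(w.grad)             # dL/dw = 2*(w*x - y)*x = 20.0
```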
What is a Hessian?
The square matrix of all second-order partial derivatives of a scalar-valued function: the entry in row i, column j is ∂²f/∂xi∂xj
How can you have a partial derivative with respect to two variables?
As a mixed second-order partial derivative: differentiate with respect to one variable, then differentiate the result with respect to the other, e.g. ∂²f/∂x∂y
What is the formula for a particular entry in the Jacobian matrix?
For a vector-valued function f: Rⁿ → Rᵐ, the entry in row i, column j is ∂fi/∂xj, the derivative of the i-th output with respect to the j-th input
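A finite-difference sketch of one mixed second-order partial (a Hessian entry), using a toy function chosen purely for illustration:

```python
# Toy function: f(x, y) = x^2 * y + y^3
def f(x, y):
    return x**2 * y + y**3

h = 1e-5

# First partial df/dx via a central difference (analytically, df/dx = 2*x*y)
def fx(x, y):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

# Mixed second partial d^2f/dxdy: differentiate fx with respect to y
# (analytically, d^2f/dxdy = 2*x, so it should be ~2.0 at x = 1)
H_xy = (fx(1.0, 2.0 + h) - fx(1.0, 2.0 - h)) / (2 * h)
print(round(H_xy, 4))
```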