Training Flashcards
What is the generality of backpropagation?
It applies to any differentiable mathematical expression (i.e., any chain of differentiable functions), not just neural networks
Difference between micrograd and production grade nn libraries
Micrograd operates on scalars rather than tensors, but otherwise uses the same math
In classification problems, how can you use cross entropy loss if a target value and predicted value are, say, both 0.8?
In that case we really shouldn't use plain cross entropy loss: with a soft target, the loss never reaches zero even for a perfect prediction. E.g. −0.8·ln(0.8) ≈ 0.18, which is not 0 even though there was zero prediction error! The minimum of cross entropy equals the entropy of the target distribution; KL divergence subtracts that entropy off, so it does go to zero for a perfect prediction.
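A quick numeric check of this point (plain Python, natural log; `cross_entropy` is just a throwaway helper, not a library function):

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_i p_i * log(q_i), using the natural log."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Soft target and a *perfect* prediction: both distributions are [0.8, 0.2].
p = [0.8, 0.2]
q = [0.8, 0.2]

loss = cross_entropy(p, q)
print(loss)                          # ≈ 0.5004, not 0, despite zero prediction error
print(loss == cross_entropy(p, p))  # True: the minimum is the entropy of p
```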
What do you call it when working with classification problems if your target distributions aren’t one-hot encoded?
You’re working with “soft” target distributions
Why do we call it "multi-class classification?"
It is in contrast to binary classification, where there are only two classes rather than multiple
What is the restriction on the domain of values MSE is able to be used on?
No restriction; the values just need to be real numbers
What are the possible input values for KL divergence?
The inputs must be probability distributions, and q(x) must be greater than zero wherever p(x) > 0, otherwise the log term blows up
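A minimal sketch of this constraint (plain Python; `kl_divergence` is a hypothetical helper, not a library call):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i).

    Requires q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.8, 0.2], [0.8, 0.2]))  # 0.0: identical distributions
print(kl_divergence([0.8, 0.2], [0.5, 0.5]))  # ≈ 0.193: strictly positive
```

Note that, unlike cross entropy with a soft target, KL divergence is exactly zero when the two distributions match.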
In seq to seq transformer models what loss function is usually used?
Cross entropy loss
Explain how, given dL/dWi, a particular modification to that weight Wi is chosen
Wi ← Wi − η · (dL/dWi)
where η is the learning rate and dL/dWi is the computed gradient
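The update rule can be sketched in a few lines of Python (`sgd_step` is a hypothetical name for illustration):

```python
def sgd_step(weights, grads, lr=0.01):
    """Apply w_i <- w_i - lr * dL/dw_i to every weight."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -0.3]
grads = [2.0, -1.0]
print(sgd_step(weights, grads, lr=0.1))  # [0.3, -0.2] (up to float rounding)
```

A positive gradient means increasing the weight increases the loss, so the weight is nudged in the opposite direction; η controls the step size.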
Is ReLU differentiable?
Not everywhere: it is differentiable for all x ≠ 0, but not at x = 0. In practice, frameworks simply pick a subgradient (typically 0 or 1) at that point
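A small sketch of the convention (choosing the subgradient 0 at x = 0, which is an assumption about one common convention, not the only valid choice):

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    # Derivative is 0 for x < 0 and 1 for x > 0; at exactly x == 0 the
    # function has a kink, so we pick the subgradient 0 by convention.
    return 1.0 if x > 0 else 0.0

print(relu_grad(3.0))   # 1.0
print(relu_grad(-2.0))  # 0.0
print(relu_grad(0.0))   # 0.0, by convention
```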
Are the intermediate values of each input, output, and weight necessary to compute the gradient for a particular weight? Or can you just compute the gradients analytically?
Yes, they are necessary, and the second question misses the point:
an analytical expression for the derivative of the loss with respect to a weight is written in terms of variables (inputs, outputs, weights, etc.), and you need the values of those variables to evaluate the gradient
Do we store a matrix of analytical expressions for each gradient during training?
No
Rather, modern deep learning frameworks like PyTorch and TensorFlow calculate each gradient automatically as needed via backpropagation
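A micrograd-style sketch of the idea: the forward pass stores intermediate values on graph nodes, and `backward()` walks the graph applying the chain rule. This is a hypothetical minimal class, not PyTorch's actual implementation:

```python
class Value:
    """A scalar that remembers how it was computed, micrograd-style."""

    def __init__(self, data, children=()):
        self.data = data          # the stored intermediate value
        self.grad = 0.0
        self._children = children
        self._grad_fn = None      # closure that propagates self.grad to children

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def grad_fn():
            # d(a*b)/da = b: needs the *stored* value of the other operand.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._grad_fn = grad_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def grad_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._grad_fn = grad_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule output-first.
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._grad_fn:
                v._grad_fn()

w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = w * x + b      # forward pass stores 6.0 and 7.0 as intermediates
loss.backward()
print(w.grad)         # dL/dw = x = 2.0
```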
What is a Hessian?
The square matrix of second-order partial derivatives of a scalar-valued function; entry (i, j) is ∂²f/∂xi∂xj
How can you have a partial derivative with respect to two variables?
By differentiating twice, once with respect to each variable, giving a mixed second-order partial such as ∂²f/∂x∂y; these are the off-diagonal entries of the Hessian
What is the formula for a particular entry in the Jacobian matrix?
Jij = ∂fi/∂xj: the partial derivative of the i-th output component with respect to the j-th input variable
The Jacobian generalizes what?
It generalizes the concept of a gradient from scalar-valued functions to vector-valued functions
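A numerical sketch: estimating the Jacobian of a vector-valued function f: Rⁿ → Rᵐ with central differences (`jacobian` and `f` are hypothetical names for illustration):

```python
def jacobian(f, x, eps=1e-6):
    """Numerically estimate J[i][j] = df_i / dx_j via central differences."""
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp = list(x); xp[j] += eps
        xm = list(x); xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def f(v):  # f(x, y) = (x*y, x + y)
    x, y = v
    return [x * y, x + y]

print(jacobian(f, [2.0, 3.0]))  # ≈ [[3, 2], [1, 1]]
```

Each row of the result is the gradient of one output component, which is exactly the sense in which the Jacobian generalizes the gradient.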
In standard backprop, do you update a weight as soon as you calculate its gradient?
No, you calculate all the gradients first and then perform the updates
Is it more proper to say gradient of a neuron or gradient of a weight?
Gradient of a weight, specifically “gradient with respect to the weight”
How do you calculate a weight's gradient if its neuron is connected in a forward manner to two neurons?
That weight can affect the loss along two paths, one through each of the two downstream neurons. By the multivariate chain rule, you compute the gradient contribution along each path and sum them
What are the two main computational differences between training and inference?
In both you're conducting forward passes to generate the next token. But 1) in training you're calculating the loss function on the predicted token versus the target token, and 2) you're also performing backpropagation and updating the weights.
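The contrast can be sketched with a toy one-parameter linear "model" (hypothetical, purely to show the control flow, not a real language model):

```python
def forward(w, x):
    return w * x

def train_step(w, x, target, lr=0.1):
    pred = forward(w, x)             # forward pass (shared with inference)
    loss = (pred - target) ** 2      # 1) compute loss vs. the target
    grad = 2 * (pred - target) * x   # 2) backprop: dL/dw ...
    return w - lr * grad, loss       #    ... and update the weight

def inference_step(w, x):
    return forward(w, x)             # forward pass only: no loss, no update

w = 0.0
for _ in range(50):
    w, loss = train_step(w, x=1.0, target=3.0)
print(round(w, 3))                   # ≈ 3.0: training moved the weight
print(inference_step(w, 2.0))        # inference just reads it
```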
What are some additional computational differences between training and inference?
. . .. [see ChatGPT logs from Oct28?]
Is it right to say that if a model has 1B parameters, the model has 500M neurons, because each neuron usually has one weight and one bias?
Not necessarily. The number of parameters in a model doesn’t directly translate to the number of neurons, especially in deep learning models.
In a fully connected layer (also known as a dense layer), each neuron does have a weight associated with each input it receives and a single bias term. So, the number of parameters in a fully connected layer is (number of inputs) * (number of neurons) + (number of neurons) for the bias terms.
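That parameter count can be sketched directly (`dense_params` is a throwaway helper; the 784/128 sizes are arbitrary examples):

```python
def dense_params(n_inputs, n_neurons):
    """Parameters in a fully connected layer: one weight per input per
    neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

print(dense_params(784, 128))  # 100480 = 784*128 weights + 128 biases
```

Note that 128 neurons here carry over 100k parameters, which is already far more than the one weight plus one bias per neuron the question assumes.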
However, many deep learning models, especially those used in natural language processing tasks like GPT-3, use other types of layers as well, such as convolutional layers or transformer layers. These layers have different parameter structures.
For example, in a transformer layer used in GPT-3, there are multiple weight matrices and bias vectors per block, associated with the self-attention mechanism and the feed-forward network, rather than a simple one-weight-one-bias structure per neuron.
So, a model with 1 billion parameters could have far fewer than 500 million neurons, depending on the architecture of the model.