Training Flashcards

1
Q

What is the generality of backpropagation?

A

Applies to any differentiable mathematical expression (i.e., any chain of composed functions), not just neural networks

2
Q

Difference between micrograd and production-grade neural network libraries

A

Micrograd operates on scalars rather than tensors, but otherwise uses all the same math
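
For illustration, here is a minimal sketch of a micrograd-style scalar autodiff node; this is an assumed simplification covering only multiplication, not micrograd's exact API:

```python
class Value:
    """Minimal micrograd-style scalar autograd node (illustrative sketch)."""
    def __init__(self, data, children=()):
        self.data = data                 # scalar payload
        self.grad = 0.0                  # dL/d(this node), set by backward()
        self._children = children
        self._backward = lambda: None    # propagates out.grad to children

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(ab)/da = b
            other.grad += self.data * out.grad   # d(ab)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Order the graph topologically, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
(a * b).backward()
print(a.grad, b.grad)  # 3.0 2.0 (the same math a tensor library applies elementwise)
```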

3
Q

In classification problems, how can you use cross-entropy loss if a target value and a predicted value are, say, both 0.8?

A

Well, in that case we really shouldn’t use cross-entropy loss naively, because −0.8 · ln(0.8) ≈ −0.8 · (−0.22) ≈ 0.18, which is not 0 even though there was zero prediction error! With soft targets, the minimum of the cross-entropy is the entropy of the target distribution, not 0.
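
A quick numeric check in plain Python (the two-class target [0.8, 0.2] is an assumed example):

```python
import math

p = [0.8, 0.2]   # soft target distribution
q = [0.8, 0.2]   # predicted distribution, exactly matching the target

# Cross-entropy H(p, q) = -sum_i p_i * ln(q_i)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
print(cross_entropy)  # ~0.5004 nats: nonzero despite a perfect prediction
```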

4
Q

What do you call it when working with classification problems if your target distributions aren’t one-hot encoded?

A

You’re working with “soft” target distributions

5
Q

Why do we call it “multi class classification?”

A

It’s in contrast to binary classification, where there are only two classes; in multi-class classification there are more than two

6
Q

What is the restriction on the domain of values MSE can be used on?

A

No restriction; the values just need to be real numbers
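
For reference, the standard definition, which is well-defined for any real targets and predictions:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad y_i,\, \hat{y}_i \in \mathbb{R} \]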

7
Q

What are the possible input values for KL divergence?

A

Input values must be greater than zero: the inputs are probability distributions, and since the formula takes a log of the ratio, qᵢ must be positive wherever pᵢ > 0
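
The standard definition, which shows where the restriction comes from:

\[ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i} \]

The ratio inside the log is only defined when qᵢ > 0 (and terms with pᵢ = 0 are taken to contribute 0).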

8
Q

In sequence-to-sequence transformer models, what loss function is usually used?

A

Cross entropy loss

9
Q

Explain how, given dL/dWᵢ, a particular modification to that weight Wᵢ is chosen

A

Wᵢ ← Wᵢ − η · dL/dWᵢ

where η is the learning rate and dL/dWᵢ (the “gradient”) is the computed derivative
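
A minimal sketch of this update in plain Python (the numeric values are illustrative):

```python
learning_rate = 0.01   # eta
w = 0.5                # current weight W_i
grad = 2.0             # computed gradient dL/dW_i

w = w - learning_rate * grad  # step opposite the gradient to reduce the loss
print(w)  # 0.48
```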

10
Q

Is ReLU differentiable?

A

Not everywhere: it is non-differentiable at x = 0 (and differentiable everywhere else). In practice, frameworks simply assign a value such as 0 to the derivative at x = 0.
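
The function and its derivative:

\[ \mathrm{ReLU}(x) = \max(0, x), \qquad \frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ \text{undefined} & x = 0 \end{cases} \]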

11
Q

Are the intermediate values of each input, output, and weight necessary to compute the gradient for a particular weight? Or can you just compute the gradients analytically?

A

Yes to the first question. And no, the analytical route alone doesn’t work:

an analytical expression for the derivative of the loss with respect to a weight is written in terms of variables (inputs, outputs, weights, etc.), and you need the values of those variables to actually evaluate the gradient

12
Q

Do we store a matrix of analytical expressions for each gradient during training?

A

No

Rather, modern deep learning frameworks like PyTorch and TensorFlow compute each gradient automatically as needed via backpropagation
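
A minimal PyTorch example of this (the function and numbers are illustrative):

```python
import torch

# No stored analytical expression for dL/dx: autograd records the
# operations during the forward pass and computes gradients on backward().
x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2 + 3 * x   # forward pass: L(x) = x^2 + 3x

loss.backward()         # backpropagation
print(x.grad)           # tensor(7.) since dL/dx = 2x + 3 = 7 at x = 2
```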

13
Q

What is a Hessian?

A

The square matrix of all second-order partial derivatives of a scalar-valued function: Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ. It describes the function’s local curvature.

14
Q

How can you have a partial derivative with respect to two variables?

A

That’s a mixed second-order partial derivative, e.g. ∂²f/∂x∂y: differentiate with respect to one variable first, then differentiate the result with respect to the other.

15
Q

What is the formula for a particular entry in the Jacobian matrix?

A

Jᵢⱼ = ∂fᵢ/∂xⱼ: the partial derivative of the i-th output component with respect to the j-th input variable.

16
Q

The Jacobian generalizes what?

A

The concept of a gradient: the gradient is defined for scalar-valued functions, and the Jacobian extends it to vector-valued functions
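
For a function f: ℝⁿ → ℝᵐ, written out (each row is the gradient of one output component):

\[ J = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \]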

17
Q

In standard backprop, do you update a weight as soon as you calculate its gradient?

A

No, you calculate all the gradients first (the backward pass) and then perform the updates (the optimizer step)

18
Q

Is it more proper to say gradient of a neuron or gradient of a weight?

A

Gradient of a weight, specifically “gradient with respect to the weight”

19
Q

How do you calculate a weight’s gradient if it is connected in a forward manner to two neurons?

A

Well, that weight’s neuron can affect the loss in two ways, via its connections to the two following neurons, so you sum the gradient contributions from both paths (the multivariate chain rule)
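
Writing the neuron’s output as a and the two downstream pre-activations as z₁ and z₂ (symbols assumed for illustration):

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial z_1}\,\frac{\partial z_1}{\partial a}\,\frac{\partial a}{\partial w} + \frac{\partial L}{\partial z_2}\,\frac{\partial z_2}{\partial a}\,\frac{\partial a}{\partial w} \]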

20
Q

What are the two main computational differences between training and inference?

A

In both you’re conducting forward passes to generate the next token.

1) In training you’re additionally calculating the loss function on the predicted token versus the target token. 2) You are also performing backpropagation and updating the weights.
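
A schematic PyTorch-style sketch of the two modes (model, optimizer, and data names are assumed placeholders):

```python
import torch

# Training step: forward pass + loss + backward pass + weight update.
def train_step(model, optimizer, loss_fn, inputs, targets):
    logits = model(inputs)            # forward pass
    loss = loss_fn(logits, targets)   # 1) compare prediction vs. target
    optimizer.zero_grad()
    loss.backward()                   # 2) backpropagation
    optimizer.step()                  #    ... and weight update
    return loss.item()

# Inference: forward pass only, so no loss, no gradients, no updates.
@torch.no_grad()
def inference_step(model, inputs):
    return model(inputs)              # predict the next token
```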

21
Q

What are some additional computational differences between training and inference?

A

… [see ChatGPT logs from Oct28?]

22
Q

Is it right to say that if a model is 1B parameters, the model has 500M neurons? Because each neuron usually has one weight and one bias, right?

A

Not necessarily. The number of parameters in a model doesn’t directly translate to the number of neurons, especially in deep learning models.

In a fully connected layer (also known as a dense layer), each neuron does have a weight associated with each input it receives and a single bias term. So, the number of parameters in a fully connected layer is (number of inputs) * (number of neurons) + (number of neurons) for the bias terms.

However, many deep learning models, especially those used in natural language processing tasks like GPT-3, use other types of layers as well, such as convolutional layers or transformer layers. These layers have different parameter structures.

For example, in a transformer layer used in GPT-3, there are multiple weight matrices and bias vectors per block, associated with the self-attention mechanism and the feed-forward network within each transformer block.

So, a model with 1 billion parameters could have far fewer than 500 million neurons, depending on the architecture of the model.
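
To make the dense-layer formula concrete (the layer sizes are assumed for illustration):

```python
import torch.nn as nn

# Fully connected layer: params = in_features * out_features + out_features
layer = nn.Linear(1024, 4096)
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)            # 4198400
print(1024 * 4096 + 4096)  # 4198400, matching the formula above
```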
