Training Flashcards

1
Q

What is the generality of backpropagation?

A

Applies to any differentiable mathematical expression (i.e., any chain of composed functions), not just neural networks

2
Q

Difference between micrograd and production-grade neural network libraries

A

Micrograd operates on scalars rather than tensors, but otherwise uses all the same math
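
For illustration, here is a minimal sketch of a micrograd-style scalar autodiff node; this is an assumed simplification covering only multiplication, not micrograd's exact API:

```python
class Value:
    """Minimal micrograd-style scalar autograd node (illustrative sketch)."""
    def __init__(self, data, children=()):
        self.data = data                 # scalar payload
        self.grad = 0.0                  # dL/d(this node), set by backward()
        self._children = children
        self._backward = lambda: None    # propagates out.grad to children

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(ab)/da = b
            other.grad += self.data * out.grad   # d(ab)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Order the graph topologically, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
(a * b).backward()
print(a.grad, b.grad)  # 3.0 2.0 (the same math a tensor library applies elementwise)
```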

3
Q

In classification problems, how can you use cross-entropy loss if a target value and a predicted value are, say, both 0.8?

A

Well, in that case we really shouldn’t use cross-entropy loss naively, because −0.8 · ln(0.8) ≈ −0.8 · (−0.22) ≈ 0.18, which is not 0 even though there was zero prediction error! With soft targets, the minimum of the cross-entropy is the entropy of the target distribution, not 0.
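
A quick numeric check in plain Python (the two-class target [0.8, 0.2] is an assumed example):

```python
import math

p = [0.8, 0.2]   # soft target distribution
q = [0.8, 0.2]   # predicted distribution, exactly matching the target

# Cross-entropy H(p, q) = -sum_i p_i * ln(q_i)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
print(cross_entropy)  # ~0.5004 nats: nonzero despite a perfect prediction
```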

4
Q

What do you call it when working with classification problems if your target distributions aren’t one-hot encoded?

A

You’re working with “soft” target distributions

5
Q

Why do we call it “multi class classification?”

A

It’s in contrast to binary classification, where there are only two classes; in multi-class classification there are more than two

6
Q

What is the restriction on the domain of values MSE can be used on?

A

No restriction; the values just need to be real numbers
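
For reference, the standard definition, which is well-defined for any real targets and predictions:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad y_i,\, \hat{y}_i \in \mathbb{R} \]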

7
Q

What are the possible input values for KL divergence?

A

Input values must be greater than zero: the inputs are probability distributions, and since the formula takes a log of the ratio, qᵢ must be positive wherever pᵢ > 0
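
The standard definition, which shows where the restriction comes from:

\[ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i} \]

The ratio inside the log is only defined when qᵢ > 0 (and terms with pᵢ = 0 are taken to contribute 0).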

8
Q

In sequence-to-sequence transformer models, what loss function is usually used?

A

Cross entropy loss

9
Q

Explain how, given dL/dWᵢ, a particular modification to that weight Wᵢ is chosen

A

Wᵢ ← Wᵢ − η · dL/dWᵢ

where η is the learning rate and dL/dWᵢ (the “gradient”) is the computed derivative
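
A minimal sketch of this update in plain Python (the numeric values are illustrative):

```python
learning_rate = 0.01   # eta
w = 0.5                # current weight W_i
grad = 2.0             # computed gradient dL/dW_i

w = w - learning_rate * grad  # step opposite the gradient to reduce the loss
print(w)  # 0.48
```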

10
Q

Is ReLU differentiable?

A

Not everywhere: it is non-differentiable at x = 0 (and differentiable everywhere else). In practice, frameworks simply assign a value such as 0 to the derivative at x = 0.
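
The function and its derivative:

\[ \mathrm{ReLU}(x) = \max(0, x), \qquad \frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ \text{undefined} & x = 0 \end{cases} \]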

11
Q

Are the intermediate values of each input, output, and weight necessary to compute the gradient for a particular weight? Or can you just compute the gradients analytically?

A

Yes to the first question. And no, the analytical route alone doesn’t work:

an analytical expression for the derivative of the loss with respect to a weight is written in terms of variables (inputs, outputs, weights, etc.), and you need the values of those variables to actually evaluate the gradient

12
Q

Do we store a matrix of analytical expressions for each gradient during training?

A

No

Rather, modern deep learning frameworks like PyTorch and TensorFlow compute each gradient automatically as needed via backpropagation
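
A minimal PyTorch example of this (the function and numbers are illustrative):

```python
import torch

# No stored analytical expression for dL/dx: autograd records the
# operations during the forward pass and computes gradients on backward().
x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2 + 3 * x   # forward pass: L(x) = x^2 + 3x

loss.backward()         # backpropagation
print(x.grad)           # tensor(7.) since dL/dx = 2x + 3 = 7 at x = 2
```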

13
Q

What is a Hessian?

A

The square matrix of all second-order partial derivatives of a scalar-valued function: Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ. It describes the function’s local curvature.

14
Q

How can you have a partial derivative with respect to two variables?

A

That’s a mixed second-order partial derivative, e.g. ∂²f/∂x∂y: differentiate with respect to one variable first, then differentiate the result with respect to the other.

15
Q

What is the formula for a particular entry in the Jacobian matrix?

A

Jᵢⱼ = ∂fᵢ/∂xⱼ: the partial derivative of the i-th output component with respect to the j-th input variable.

16
Q

The Jacobian generalizes what?

A

The concept of a gradient: the gradient is defined for scalar-valued functions, and the Jacobian extends it to vector-valued functions
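
For a function f: ℝⁿ → ℝᵐ, written out (each row is the gradient of one output component):

\[ J = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \]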

17
Q

In standard backprop, do you update a weight as soon as you calculate its gradient?

A

No, you calculate all the gradients first (the backward pass) and then perform the updates (the optimizer step)

18
Q

Is it more proper to say gradient of a neuron or gradient of a weight?

A

Gradient of a weight, specifically “gradient with respect to the weight”

19
Q

How do you calculate a weight’s gradient if it is connected in a forward manner to two neurons?

A

Well, that weight’s neuron can affect the loss in two ways, via its connections to the two following neurons, so you sum the gradient contributions from both paths (the multivariate chain rule)
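
Writing the neuron’s output as a and the two downstream pre-activations as z₁ and z₂ (symbols assumed for illustration):

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial z_1}\,\frac{\partial z_1}{\partial a}\,\frac{\partial a}{\partial w} + \frac{\partial L}{\partial z_2}\,\frac{\partial z_2}{\partial a}\,\frac{\partial a}{\partial w} \]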

20
Q

What are the two main computational differences between training and inference?

A

In both you’re conducting forward passes to generate the next token.

1) In training you’re additionally calculating the loss function on the predicted token versus the target token. 2) You are also performing backpropagation and updating the weights.
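
A schematic PyTorch-style sketch of the two modes (model, optimizer, and data names are assumed placeholders):

```python
import torch

# Training step: forward pass + loss + backward pass + weight update.
def train_step(model, optimizer, loss_fn, inputs, targets):
    logits = model(inputs)            # forward pass
    loss = loss_fn(logits, targets)   # 1) compare prediction vs. target
    optimizer.zero_grad()
    loss.backward()                   # 2) backpropagation
    optimizer.step()                  #    ... and weight update
    return loss.item()

# Inference: forward pass only, so no loss, no gradients, no updates.
@torch.no_grad()
def inference_step(model, inputs):
    return model(inputs)              # predict the next token
```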

21
Q

What are some additional computational differences between training and inference?

A

… [see ChatGPT logs from Oct28?]

22
Q

Is it right to say that if a model is 1B parameters, the model has 500M neurons? Because each neuron usually has one weight and one bias, right?

A

Not necessarily. The number of parameters in a model doesn’t directly translate to the number of neurons, especially in deep learning models.

In a fully connected layer (also known as a dense layer), each neuron does have a weight associated with each input it receives and a single bias term. So, the number of parameters in a fully connected layer is (number of inputs) * (number of neurons) + (number of neurons) for the bias terms.

However, many deep learning models, especially those used in natural language processing tasks like GPT-3, use other types of layers as well, such as convolutional layers or transformer layers. These layers have different parameter structures.

For example, in a transformer layer used in GPT-3, there are multiple weight matrices and bias vectors per block, associated with the self-attention mechanism and the feed-forward network within each transformer block.

So, a model with 1 billion parameters could have far fewer than 500 million neurons, depending on the architecture of the model.
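
To make the dense-layer formula concrete (the layer sizes are assumed for illustration):

```python
import torch.nn as nn

# Fully connected layer: params = in_features * out_features + out_features
layer = nn.Linear(1024, 4096)
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)            # 4198400
print(1024 * 4096 + 4096)  # 4198400, matching the formula above
```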
