04 - Backpropagation Flashcards

1
Q

Using Maximum Likelihood to get loss

A

Model f(x) to predict the output y given x (p(y|x)):
- Linear Regression: f(x) = w^T x
- Logistic Regression: f(x) = 1/(1 + e^{-w^T x})
- if we know the distribution, we can use maximum likelihood to optimize for w_{ML} = arg max L(w) = arg max sum_{i=1}^{m} log p_model(y_i|x_i) → Log Likelihood
- if we take the negative of it, we get the cross entropy: -sum_{i=1}^{m} log p_model(y_i|x_i)
- the CE quantifies the statistical divergence between the outputs of the model and the examples in the training set
- maximizing the log likelihood is the same as minimizing the CE
- in machine learning the CE is the loss or cost function, denoted L
- often the average over the presented samples is taken, so the loss becomes:
L(y_hat, y) = -1/m sum_{i=1}^{m} log p_model(y_i|x_i)
- MSE loss: L(y_hat, y) = 1/m sum_{i=1}^{m} ||y_i - y_hat_i||^2
- Log Loss: L(y_hat, y) = -1/m sum_{i=1}^{m} y_i^T log y_hat_i

Now that we have the loss, we can derive the gradient with respect to our prediction, but what we need is the gradient with respect to the parameters (weights and biases).
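A minimal NumPy sketch of the average cross-entropy loss described above (the function name and the small epsilon for numerical stability are my own choices, not from the card):

import numpy as np

def cross_entropy(p_model, y_true):
    """p_model: (m, K) predicted class probabilities, y_true: (m,) class indices."""
    m = y_true.shape[0]
    # log p_model(y_i | x_i): pick out the probability of the observed class per sample
    log_likelihood = np.log(p_model[np.arange(m), y_true] + 1e-12)
    return -log_likelihood.mean()   # minimizing this = maximizing the log likelihood

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))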

2
Q

MSE loss vs Log Loss (backprop)

A

MSE loss:
- Used for regression problems with linear outputs (prediction of continuous variables).
L(y_hat, y) = 1/2 ||y - y_hat||^2
Log Loss:
- Used for classification problems, with outputs being probability distributions over multiple classes (softmax).
L(y_hat, y) = -y^T log y_hat
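A small NumPy sketch of the two losses on single examples (function names and example values are illustrative assumptions, not from the card):

import numpy as np

def mse_loss(y_hat, y):
    return 0.5 * np.sum((y - y_hat) ** 2)        # regression, linear output

def log_loss(y_hat, y):
    return -np.dot(y, np.log(y_hat + 1e-12))     # classification, y one-hot, y_hat softmax output

y_reg, y_hat_reg = np.array([1.5, -0.3]), np.array([1.2, 0.1])
print(mse_loss(y_hat_reg, y_reg))

y_cls = np.array([0.0, 1.0, 0.0])                # one-hot target
y_hat_cls = np.array([0.2, 0.7, 0.1])            # softmax probabilities
print(log_loss(y_hat_cls, y_cls))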

3
Q

Draw the computational graph (backprop)

A

x → [* W1] → [+ b1] → a1 → f1 → h → [* W2] → [+ b2] → a2 → f2 → y_hat → loss function (with target y) → L
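A NumPy sketch of the forward pass through this graph (layer sizes, tanh for f1, and softmax for f2 are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                   # input
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

a1 = W1 @ x + b1                              # x -> *W1 -> +b1 -> a1
h = np.tanh(a1)                               # f1 (here: tanh)
a2 = W2 @ h + b2                              # h -> *W2 -> +b2 -> a2
y_hat = np.exp(a2) / np.exp(a2).sum()         # f2 (here: softmax)

y = np.array([[1.0], [0.0]])                  # one-hot target
L = -np.sum(y * np.log(y_hat))                # log loss
print(L)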

4
Q

Backpropagation

A

We know the loss, y, and y_hat, and propagate back through the computational graph to compute the gradients with respect to the weights and biases. We apply the chain rule through the whole graph.

  1. Loss: dL/dy_hat = nabla_{y_hat} L(y_hat, y)
  2. Delta Rule: dL/da2 = y_hat - y
  3. Hidden layer: dL/dh = dL/da2 da2/dh = W2^T dL/da2
  4. dL/da1 = dL/dh dh/da1 = dL/dh ⊙ f1'(a1)

Now we just need to extract the gradients for the weights and biases:
Bias: dL/db1 = dL/da1 da1/db1 = dL/da1, because da1/db1 = 1
Weights: dL/dW1 = dL/da1 da1/dW1 = dL/da1 x^T
(for W2 and b2 it works the same, with h in place of x)
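A NumPy sketch of the full backward pass for the two-layer network from the computational-graph card, assuming tanh for f1 and softmax + log loss at the output (shapes and the random seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))
y = np.array([[1.0], [0.0]])                 # one-hot target

# forward
a1 = W1 @ x + b1
h = np.tanh(a1)
a2 = W2 @ h + b2
y_hat = np.exp(a2) / np.exp(a2).sum()        # softmax

# backward (chain rule through the graph)
dL_da2 = y_hat - y                           # delta rule
dL_dW2 = dL_da2 @ h.T
dL_db2 = dL_da2
dL_dh  = W2.T @ dL_da2                       # da2/dh = W2
dL_da1 = dL_dh * (1 - h ** 2)                # elementwise f1'(a1) for tanh
dL_dW1 = dL_da1 @ x.T
dL_db1 = dL_da1
print(dL_dW1.shape, dL_dW2.shape)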

5
Q

The Delta Rule (Backpropagation)

A

The delta rule describes how the weights of a neural network should be adjusted to minimize the error between the predicted output and the target output during the training process.

So the gradient of the loss with respect to the pre-activation (a2) is just y_hat - y, both for linear outputs with MSE loss and for softmax outputs with log loss.
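A quick numerical check of the delta rule for the softmax + log loss case (a sketch with made-up logits; it compares y_hat - y against a central-difference gradient):

import numpy as np

def loss_from_logits(a2, y):
    y_hat = np.exp(a2) / np.exp(a2).sum()
    return -float(y @ np.log(y_hat))

a2 = np.array([1.0, -0.5, 0.3])
y = np.array([0.0, 1.0, 0.0])

y_hat = np.exp(a2) / np.exp(a2).sum()
analytic = y_hat - y                              # delta rule

eps = 1e-6
numeric = np.array([(loss_from_logits(a2 + eps * np.eye(3)[i], y)
                     - loss_from_logits(a2 - eps * np.eye(3)[i], y)) / (2 * eps)
                    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))  # True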
