Math/Statistics | Priority Flashcards
Derive the (binary) cross-entropy loss function (in log form). 1. p(y|x) 2. log p(y|x) 3. -log p(y|x) 4. plug in definition of yhat.
(See source material.) Eqs. (5.20) - (5.23)
Jurafsky SLP3E Chapter 5 Logistic regression 5.5 The cross-entropy loss function
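Worked sketch of the four steps, assuming the usual notation (not necessarily the book's exact typesetting): y is the true label in {0, 1}, yhat = sigma(w . x + b) is the model's estimate of p(y = 1 | x), and L_CE is the loss.

```latex
\begin{align}
p(y \mid x) &= \hat{y}^{\,y}\,(1-\hat{y})^{1-y} \\
\log p(y \mid x) &= y \log \hat{y} + (1-y)\log(1-\hat{y}) \\
L_{CE}(\hat{y}, y) = -\log p(y \mid x) &= -\bigl[\, y \log \hat{y} + (1-y)\log(1-\hat{y}) \,\bigr] \\
L_{CE}(\hat{y}, y) &= -\bigl[\, y \log \sigma(w \cdot x + b) + (1-y)\log\bigl(1-\sigma(w \cdot x + b)\bigr) \bigr]
\end{align}
```

The first line is a Bernoulli likelihood: it reduces to yhat when y = 1 and to 1 - yhat when y = 0.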
Equation for optimal weights using cross-entropy loss for a dataset.
(See source material.) Eq. (5.24)
Jurafsky SLP3E Chapter 5 Logistic regression 5.5 The cross-entropy loss function
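One common way to write this objective, given as a sketch with assumed notation (m training examples, f(x; theta) the model output, L_CE the per-example cross-entropy loss); it requires the amsmath package:

```latex
\hat{\theta} = \operatorname*{argmin}_{\theta} \; \frac{1}{m} \sum_{i=1}^{m}
  L_{CE}\bigl(f(x^{(i)}; \theta),\, y^{(i)}\bigr)
```

That is, the optimal weights are the ones that minimize the average loss over the dataset.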
Explain convex vs. non-convex functions informally.
A convex function has at most one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum. (By contrast, the loss for multi-layer neural networks is non-convex, and gradient descent may get stuck in local minima for neural network training and never find the global optimum.)
Jurafsky SLP3E Chapter 5 Logistic regression 5.6 Gradient descent
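A minimal sketch of the contrast, using two illustrative one-dimensional functions chosen here (they are not from the source): gradient descent on a convex function reaches the same minimum from either start, while on a non-convex function the end point depends on where it starts.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex example (illustrative): f(x) = (x - 3)^2, single minimum at x = 3.
def convex_grad(x):
    return 2.0 * (x - 3.0)

# Non-convex example (illustrative): g(x) = x^4 - 3x^2 + x has two minima,
# a deeper one at negative x and a shallower one at positive x.
def nonconvex_grad(x):
    return 4.0 * x**3 - 6.0 * x + 1.0

starts = (-2.0, 2.0)
convex_ends = [gradient_descent(convex_grad, x0) for x0 in starts]
nonconvex_ends = [gradient_descent(nonconvex_grad, x0) for x0 in starts]

print("convex end points:    ", convex_ends)
print("non-convex end points:", nonconvex_ends)
```

In the non-convex case the run started at 2.0 settles in the shallower minimum, which is the "stuck in a local minimum" behavior described above.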
Informal intuition of gradient descent.
“How shall we find the minimum of this (or any) loss function? Gradient descent is a
method that finds a minimum of a function by figuring out in which direction (in the
space of the parameters θ) the function’s slope is rising the most steeply, and moving
in the opposite direction. The intuition is that if you are hiking in a canyon and trying
to descend most quickly down to the river at the bottom, you might look around
yourself 360 degrees, find the direction where the ground is sloping the steepest,
and walk downhill in that direction.”
Jurafsky SLP3E Chapter 5 Logistic regression 5.6 Gradient descent
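The hiking intuition can be sketched on a toy two-parameter loss. The bowl-shaped loss, the starting point, and the learning rate below are assumptions made for illustration only.

```python
def loss(w, b):
    # Illustrative bowl-shaped loss with its minimum at (w, b) = (1, -2).
    return (w - 1.0) ** 2 + 2.0 * (b + 2.0) ** 2

def gradient(w, b):
    # Partial derivatives of the loss: the direction of steepest ascent.
    return 2.0 * (w - 1.0), 4.0 * (b + 2.0)

def descend(w, b, lr=0.1, steps=50):
    """Walk downhill: move each parameter opposite to its partial derivative."""
    history = [loss(w, b)]
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= lr * dw
        b -= lr * db
        history.append(loss(w, b))
    return w, b, history

w_final, b_final, history = descend(w=4.0, b=3.0)
print(w_final, b_final, history[0], history[-1])
```

Each step looks only at the local slope, yet the recorded loss falls at every step and the parameters approach the bottom of the bowl.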
Basic equation for updating the optimal parameters theta based on the gradient including momentum.
(See source material.) Eq. (5.27)
Jurafsky SLP3E Chapter 5 Logistic regression 5.6 Gradient descent
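A hedged sketch of one common momentum variant. The plain update is theta <- theta - lr * gradient; the velocity form below, the names `momentum_step` and `grad_fn`, and the values of `lr` and `beta` are assumptions for illustration and may not match the form of the cited equation.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One update: velocity accumulates past gradients, theta moves against it."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity

def grad_fn(theta):
    # Gradient of the illustrative loss L(theta) = sum(theta_i ** 2).
    return [2.0 * t for t in theta]

theta, velocity = [4.0, -3.0], [0.0, 0.0]
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad_fn(theta))

print(theta)
```

With `beta=0.0` the velocity is just the current gradient, so the step reduces to plain gradient descent.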
Equation for derivative of binary cross-entropy loss (for logistic regression).
(See source material.) Eq. (5.29)
Jurafsky SLP3E Chapter 5 Logistic regression 5.6 Gradient descent
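For a single example, the standard result is dL/dw_j = (sigma(w . x + b) - y) * x_j. The sketch below compares that formula with a finite-difference estimate; the weights, inputs, and helper names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce_loss(w, b, x, y):
    y_hat = predict(w, b, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_gradient(w, b, x, y):
    """Analytic gradient: dL/dw_j = (y_hat - y) * x_j and dL/db = y_hat - y."""
    err = predict(w, b, x) - y
    return [err * xi for xi in x], err

def numeric_gradient_w(w, b, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic formula."""
    grads = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + eps] + w[j + 1:]
        down = w[:j] + [w[j] - eps] + w[j + 1:]
        grads.append((bce_loss(up, b, x, y) - bce_loss(down, b, x, y)) / (2 * eps))
    return grads

# Hypothetical weights, bias, features, and label.
w, b, x, y = [0.5, -1.2, 0.3], 0.1, [1.0, 2.0, -0.5], 1
analytic_w, analytic_b = bce_gradient(w, b, x, y)
numeric_w = numeric_gradient_w(w, b, x, y)
print(analytic_w)
print(numeric_w)
```

The gradient is the prediction error (y_hat - y) scaled by the corresponding input value, so features with larger magnitude receive larger weight updates.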
Derive the derivative of the cross-entropy loss (with sigmoid) (i.e. 2 terms of backpropagation with chain rule).
(See source material.) Eqs. (7.41) - (7.43)
Jurafsky SLP3E Chapter 7 Neural networks and neural language models 7.6 Training neural nets
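A sketch of the two-term chain rule, with assumed notation that may differ from the book's: z is the input to the final sigmoid, yhat = sigma(z) is the output, and L is the binary cross-entropy loss.

```latex
\begin{align}
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \\
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}},
  \qquad
  \frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1-\hat{y}) \\
\frac{\partial L}{\partial z}
  &= \Bigl(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\Bigr)\,\hat{y}\,(1-\hat{y})
   = -y\,(1-\hat{y}) + (1-y)\,\hat{y}
   = \hat{y} - y
\end{align}
```

The product collapses to the prediction error yhat - y, which is why the sigmoid and cross-entropy pair gives such a simple gradient.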
Backpropagation in a 2-layer MLP: What are the (3) terms in an equation for the partial derivative of the loss w.r.t. a weight w^out_1,1?
Raschka MLWPT Chapter 11 Implementing a Multilayer Artificial Neural Network from Scratch
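A sketch of the usual chain-rule expansion, with assumed notation that may differ from the book's: a^out_1 is the activation of output unit 1, z^out_1 is its net input, and L is the loss.

```latex
\frac{\partial L}{\partial w^{out}_{1,1}}
  = \frac{\partial L}{\partial a^{out}_{1}}
    \cdot \frac{\partial a^{out}_{1}}{\partial z^{out}_{1}}
    \cdot \frac{\partial z^{out}_{1}}{\partial w^{out}_{1,1}}
```

Reading left to right: how the loss changes with the output activation, how that activation changes with its net input, and how the net input changes with the weight.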
Backpropagation in a 2-layer MLP: What are the (3) terms in an equation for the partial derivative of the loss w.r.t. a weight w^h_1,1 (h=hidden)?
(See source material p366 “As before, we can expand it to include the net inputs z and then solve the individual terms:”)
Raschka MLWPT Chapter 11 Implementing a Multilayer Artificial Neural Network from Scratch
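A minimal numeric sketch of both backpropagation products, under assumptions chosen here for illustration: a tiny network with 2 inputs, 2 hidden units, a single output unit, sigmoid activations, and a squared-error loss. With a single output there is no sum over output units. All names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_h, b_h, w_out, b_out):
    # w_h[j][i] connects input i to hidden unit j; w_out[j] connects hidden j to the output.
    z_h = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_h, b_h)]
    a_h = [sigmoid(z) for z in z_h]
    a_out = sigmoid(sum(w * a for w, a in zip(w_out, a_h)) + b_out)
    return a_h, a_out

def loss(x, y, w_h, b_h, w_out, b_out):
    _, a_out = forward(x, w_h, b_h, w_out, b_out)
    return (a_out - y) ** 2

def chain_rule_grads(x, y, w_h, b_h, w_out, b_out):
    a_h, a_out = forward(x, w_h, b_h, w_out, b_out)
    dL_da_out = 2.0 * (a_out - y)
    da_out_dz_out = a_out * (1.0 - a_out)
    # Output weight w_out[0]: dL/da_out * da_out/dz_out * dz_out/dw_out.
    grad_w_out = dL_da_out * da_out_dz_out * a_h[0]
    # Hidden weight w_h[0][0]: dL/da_out * da_out/da_h * da_h/dw_h,
    # where each of the last two factors expands through a net input z.
    da_out_da_h = da_out_dz_out * w_out[0]
    da_h_dw_h = a_h[0] * (1.0 - a_h[0]) * x[0]
    grad_w_h = dL_da_out * da_out_da_h * da_h_dw_h
    return grad_w_out, grad_w_h

def numeric_grads(x, y, w_h, b_h, w_out, b_out, eps=1e-6):
    """Central finite differences for w_out[0] and w_h[0][0], as a check."""
    def l_out(v):
        return loss(x, y, w_h, b_h, [v] + w_out[1:], b_out)
    def l_hid(v):
        return loss(x, y, [[v] + w_h[0][1:]] + w_h[1:], b_h, w_out, b_out)
    g_out = (l_out(w_out[0] + eps) - l_out(w_out[0] - eps)) / (2 * eps)
    g_hid = (l_hid(w_h[0][0] + eps) - l_hid(w_h[0][0] - eps)) / (2 * eps)
    return g_out, g_hid

x, y = [0.5, -1.0], 1.0
w_h, b_h = [[0.2, -0.4], [0.7, 0.1]], [0.0, 0.1]
w_out, b_out = [0.3, -0.6], 0.05
analytic = chain_rule_grads(x, y, w_h, b_h, w_out, b_out)
numeric = numeric_grads(x, y, w_h, b_h, w_out, b_out)
print(analytic)
print(numeric)
```

The hidden-weight gradient reuses the first factor of the output-weight gradient, which is the saving that backpropagation exploits.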