SGD Flashcards
1
Q
What optimizer is used for a nonlinear function of θ?
A
A gradient-based optimizer
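A minimal sketch of what a gradient-based optimizer does, assuming plain gradient descent on an arbitrary nonlinear scalar function of θ (the function, learning rate, and step count below are illustrative, not from the cards):

```python
import numpy as np

# Illustrative nonlinear function of theta and its derivative (assumed example).
def f(theta):
    return (theta - 3.0) ** 2 + np.sin(theta)

def grad_f(theta):
    return 2.0 * (theta - 3.0) + np.cos(theta)

theta = 0.0   # initial parameter value
lr = 0.1      # learning rate (assumed value)
for _ in range(100):
    theta -= lr * grad_f(theta)   # gradient step: theta <- theta - lr * dL/dtheta

print(theta, f(theta))            # theta near the minimizer of f
```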
2
Q
How is the gradient of a NN computed?
A
Compute the partial derivatives of the loss with respect to all parameters θ_k (i.e., the weights and biases of all layers): ∇_θ L = (∂L/∂θ_1, …, ∂L/∂θ_K).
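As a concrete illustration of computing these partial derivatives by hand, here is a forward and backward pass for a one-hidden-layer MLP with tanh activation and squared-error loss; the shapes and the names w1, b1, w2, b2 are assumptions for the sketch, not from the cards:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))        # minibatch of 8 inputs, 4 features (assumed sizes)
y = rng.normal(size=(8, 1))        # targets

w1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# forward pass
h = np.tanh(x @ w1 + b1)           # hidden activations
pred = h @ w2 + b2                 # network output
loss = ((pred - y) ** 2).mean()    # squared-error loss

# backward pass: partial derivatives of the loss w.r.t. every parameter
dpred = 2.0 * (pred - y) / y.size  # dL/dpred
dw2 = h.T @ dpred                  # dL/dw2
db2 = dpred.sum(axis=0)            # dL/db2
dh = dpred @ w2.T                  # dL/dh
dz = dh * (1.0 - h ** 2)           # back through tanh
dw1 = x.T @ dz                     # dL/dw1
db1 = dz.sum(axis=0)               # dL/db1
```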
3
Q
What does splitting the training set into B minibatches do?
A
- reduces the computation cost of one gradient by a factor of B
- increases the standard deviation of the gradient estimate by only a factor of √B
- requires more iterations, but fewer epochs (hence a smaller total computation cost); see the sketch below
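A minimal sketch of one epoch of minibatch SGD, assuming a simple linear least-squares model (N, D, B, the learning rate, and the use of np.array_split are illustrative choices, not the course's code): each of the B minibatch gradients is B times cheaper to compute than the full-batch gradient, but noisier.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B = 1024, 10, 16                     # N samples, D features, B minibatches (assumed)
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

theta = np.zeros(D)
lr = 0.05                                  # learning rate (assumed value)

perm = rng.permutation(N)                  # shuffle once per epoch
for batch_idx in np.array_split(perm, B):  # B minibatches of ~N/B samples each
    Xb, yb = X[batch_idx], y[batch_idx]
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(batch_idx)   # minibatch gradient
    theta -= lr * grad                     # one cheaper, noisier gradient step
```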
4
Q
A