Optimization for training deep models: Flashcards
Optimization for neural networks: what are empirical risk minimization and surrogate loss?
Optimizing neural networks starts from the goal of optimizing a performance measure, so the first attempt is empirical risk minimization, i.e., minimizing the loss on the training set. This alone is not enough: the models are very expressive, so driving the training error down does not guarantee good generalization and can easily lead to overfitting.
Surrogate loss: optimizing the original loss function is sometimes too hard, so we use another loss function as a proxy for it and optimize that proxy instead. For example, minimizing the negative log-likelihood instead of the 0-1 loss.
Note that training often halts (for example due to early stopping) while the derivatives of the surrogate loss are still large, which is different from pure optimization, where we would stop only when the gradient becomes very small.
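A minimal NumPy sketch comparing the 0-1 loss with its negative log-likelihood surrogate on a toy binary classification example (the scores and labels are made up for illustration, not from these notes):

import numpy as np

scores = np.array([2.0, -0.5, 0.3, -1.2])   # toy model outputs (logits), an assumption
labels = np.array([1, 0, 1, 1])

probs = 1.0 / (1.0 + np.exp(-scores))        # sigmoid -> P(y = 1 | x)
preds = (probs >= 0.5).astype(int)

# 0-1 loss: piecewise constant, its gradient is zero almost everywhere,
# so it gives no useful signal to gradient-based optimizers.
zero_one_loss = np.mean(preds != labels)

# Negative log-likelihood (cross-entropy): a smooth, differentiable surrogate
# that can actually be minimized with gradient descent.
nll = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print("0-1 loss:", zero_one_loss)
print("negative log-likelihood:", nll)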
What are batch/minibatch algorithms? What is online learning? What is stochastic gradient descent?
Most optimization algorithms converge faster (in wall-clock time) when they use an approximation of the exact gradient.
Batch / full GD: compute the gradient of the loss function on the WHOLE training set, then update the weights.
Pros: uses the exact gradient.
Cons: if we have 1 million training examples, one epoch processes all 1 million examples but performs only a single weight update.
Stochastic gradient descent (online learning: one example at a time): compute the gradient of the loss on a SINGLE example, then update the weights.
Pros: with 1 million examples, one epoch performs 1 million weight updates.
Cons: each update uses only a noisy estimate of the exact gradient.
Mini-batch: compute the gradient of the loss on a small subset of examples drawn from the training set, then update the weights.
The gradient estimate is more accurate than with a single example, while each update stays much cheaper than full-batch GD.
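A minimal NumPy sketch of the variants on a toy linear-regression problem (the dataset, learning rate, and batch size are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # toy dataset (assumption)
y = X @ np.array([1., -2., 0.5, 3., -1.]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.1

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the given (mini)batch.
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# Full-batch GD: one update per pass over all 1000 examples.
w_batch = w - lr * grad(w, X, y)

# Mini-batch SGD: many updates per epoch, each from a small subset.
batch_size = 32
for start in range(0, len(y), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    w = w - lr * grad(w, Xb, yb)
# batch_size = 1 would recover online / single-example SGD.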
What is the difference between backpropagation and stochastic gradient descent?
Forward propagation: the input is fed through the network, and the output and the loss are obtained.
Back-propagation: information from the cost flows backwards through the network, and the derivative of the cost is computed with respect to each parameter.
Back-propagation does not do any learning by itself: it only computes the gradients. To actually learn, we need an optimization algorithm such as stochastic gradient descent, which uses those gradients to update the weights.
Stochastic gradient descent is used for learning. In its online form it processes one example at a time:
it computes the gradient of the loss on a SINGLE example and updates the weights.
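A hedged PyTorch sketch of this separation (the toy model, data, and learning rate are assumptions): loss.backward() is the back-propagation step that only computes gradients, and optimizer.step() is the SGD step that actually updates the weights.

import torch

model = torch.nn.Linear(10, 1)               # toy model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(1, 10)                       # a single example (online setting)
y = torch.randn(1, 1)

loss = loss_fn(model(x), y)                  # forward propagation
optimizer.zero_grad()
loss.backward()                              # back-propagation: computes gradients only
optimizer.step()                             # SGD: the actual learning / weight update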
What are the challenges of optimization in neural networks? Explain them.
1) Ill-conditioned Hessian:
The Hessian is the matrix of all second-order derivatives of the loss. Its condition number is the ratio of its largest eigenvalue to its smallest eigenvalue; the Hessian is ill-conditioned when this ratio is very large, i.e., the curvature is much stronger in some directions than in others.
Generally in NNs we do not use second-order derivatives, so SGD has no information about the curvature and can get stuck or make very slow progress in ill-conditioned regions.
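A small NumPy sketch of the condition number for a hypothetical 2x2 Hessian (the numbers are made up for illustration):

import numpy as np

# Hessian of a toy quadratic loss f(w) = 0.5 * w^T H w (values are an assumption).
H = np.array([[100.0, 0.0],
              [0.0,    1.0]])

eigvals = np.linalg.eigvalsh(H)              # eigenvalues of the symmetric Hessian
cond = eigvals.max() / eigvals.min()         # condition number = lambda_max / lambda_min
print(cond)                                  # 100 -> ill-conditioned: curvature differs a lot by direction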
2) Local minima
In convex optimization, any local minimum is also a global minimum.
But because NNs use nonlinear activation functions, the loss surface is non-convex.
For many years local minima were thought to be a major problem for NNs, but in practice the network often does not even reach a minimum, and reaching a region of sufficiently low cost is usually enough. To check whether the network is at (or near) a critical point, we look at the gradient norm.
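A hedged PyTorch sketch of the gradient-norm check (toy model and data are assumptions): if the norm stays well above zero, the network is not at a critical point.

import torch

model = torch.nn.Linear(10, 1)               # toy model (assumption)
x, y = torch.randn(64, 10), torch.randn(64, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Global L2 norm over all parameter gradients.
grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
print(grad_norm)                             # far from 0 -> not at a critical point (minimum or saddle)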
3) Flat regions, saddle points, and plateaus
SGD often escapes these regions fairly easily. At a local minimum the Hessian has only positive eigenvalues, while at a saddle point it has both positive and negative eigenvalues; empirically, the eigenvalues of the Hessian tend to be all positive in low-cost regions. Even though SGD manages to escape saddle points, second-order methods explicitly look for points with zero gradient and are therefore more likely to jump to a saddle point or flat region. This is one reason why first-order methods are preferred for NNs.
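A tiny NumPy illustration with the textbook saddle f(x, y) = x^2 - y^2 (this example is mine, not from the notes): at its critical point the Hessian has mixed-sign eigenvalues, while a true minimum has only positive ones.

import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at its critical point (0, 0).
H_saddle = np.array([[2.0,  0.0],
                     [0.0, -2.0]])
print(np.linalg.eigvalsh(H_saddle))          # [-2.  2.] -> mixed signs: a saddle, not a minimum

# Hessian of f(x, y) = x^2 + y^2 at (0, 0): a true local minimum.
H_min = np.array([[2.0, 0.0],
                  [0.0, 2.0]])
print(np.linalg.eigvalsh(H_min))             # [2. 2.] -> all positive: local minimum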
4) Cliffs and exploding gradients
Cliffs (very steep regions) are common in neural network loss surfaces. Near a cliff the gradient becomes very large and a single update can throw the weights very far away, so we use gradient clipping to limit the effective step size.
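A hedged PyTorch sketch of gradient clipping placed between back-propagation and the update step (the toy model and clipping threshold are assumptions):

import torch

model = torch.nn.Linear(10, 1)               # toy model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()

# Rescale the gradient if its norm exceeds the threshold, so a cliff
# cannot throw the parameters far away in a single step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()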
5) Exploding / vanishing gradients
Assume a weight matrix has the eigendecomposition W = V diag(λ) V^-1.
If we keep multiplying by this matrix, after t steps we get
W^t = V diag(λ)^t V^-1
If the eigenvalues are smaller than 1, diag(λ)^t shrinks exponentially: vanishing gradients.
If they are larger than 1, it grows exponentially: exploding gradients.
In short: weights and gradients that become too large give exploding gradients (cliffs); weights and gradients that become too small give vanishing gradients.
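A small NumPy sketch of the effect (diagonal matrices chosen for simplicity, values are assumptions): repeated multiplication shrinks vectors exponentially when the eigenvalues are below 1 and blows them up when they are above 1.

import numpy as np

v = np.ones(2)
W_small = np.diag([0.9, 0.8])                # eigenvalues < 1
W_big   = np.diag([1.1, 1.2])                # eigenvalues > 1

for t in (10, 50):
    print(t,
          np.linalg.norm(np.linalg.matrix_power(W_small, t) @ v),   # shrinks toward 0: vanishing
          np.linalg.norm(np.linalg.matrix_power(W_big, t) @ v))     # grows exponentially: exploding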
What is parameter initialization, and what does breaking symmetry mean?
In neural networks, initialization matters a great deal: some bad initial points can cause the algorithm to fail entirely. The one property we know for certain we need is to break symmetry.
Breaking symmetry means: hidden units connected to the same inputs must start with different parameters. Otherwise they compute the same function, receive the same gradient, and keep being updated identically, so they stay redundant. Initializing the parameters differently (typically randomly) lets the units learn different things, so the network does not waste capacity.
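A minimal NumPy sketch of the symmetry problem (a hypothetical 2-unit hidden layer and toy input, not from the notes): with identical initial weights both units compute the same output and would receive identical gradients, which is why random initialization is used to break the symmetry.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # one toy input (assumption)

# Identical initialization: both hidden units have the same weights.
W_same = np.zeros((2, 3))
h_same = np.tanh(W_same @ x)
print(h_same)                                # both units identical -> identical gradients forever

# Random initialization breaks the symmetry: units start (and stay) different.
W_rand = 0.01 * rng.normal(size=(2, 3))
h_rand = np.tanh(W_rand @ x)
print(h_rand)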