! S9: Gradient Descent Flashcards
Gradient Descent - Definition
iterative optimization algorithm used to minimize a cost function to find the optimal values of the parameters (weights) in a machine learning model
Gradient Descent - Steps
- Define a cost function (difference between predicted & actual values)
- Calculate the partial derivatives of the cost function with respect to each weight (weights are randomly initialized at first)
- Update weights: if derivative > 0 -> decrease weight; if derivative < 0 -> increase weight
- Repeat until the minimum of the cost function is reached (see the Python sketch below)
Gradient Descent - Step Size
- how much the weights are updated in each iteration
- step size = slope * learning rate
- steps gradually get smaller as the parameters approach the minimum (illustrated in the sketch below)
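A minimal Python sketch of these steps and of the step-size rule, assuming a simple linear model y ≈ w·x on made-up data; the learning rate and iteration count are illustrative choices, not prescribed values:

```python
import numpy as np

# Hypothetical 1-D data: y is roughly 2x, so the optimal weight is near 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0                # arbitrarily initialized weight
learning_rate = 0.05   # illustrative value
n_iterations = 100     # illustrative; in practice stop when updates become tiny

for _ in range(n_iterations):
    y_pred = w * x
    # partial derivative (slope) of the mean squared error cost with respect to w
    grad = np.mean(2 * (y_pred - y) * x)
    # step size = slope * learning rate; shrinks as w approaches the minimum
    # positive derivative -> w decreases, negative derivative -> w increases
    w -= learning_rate * grad

print(w)  # close to 2
```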
Gradient Descent - Con
- all features must be on a similar scale (otherwise convergence is slow)
- high computational cost: O(n*d*t) (e.g. 10,000 data points * 10 features * 1,000 iterations)
- might only find a local minimum (if the cost function is non-convex)
Gradient
= vector of partial derivatives of a function that has more than one input variable
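Worked example: for f(w1, w2) = w1² + w2², the gradient is the vector of partial derivatives ∇f = (2·w1, 2·w2); at the point (3, −1) it equals (6, −2).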
Stochastic GD
= solution to reduce computational cost: randomly picks one data point at each iteration, so the per-iteration cost no longer depends on n (effectively n = 1)
Batch & Mini Batch GD
- solution to reduce computational cost: randomly picks a mini-batch at each iteration & calculates the gradient on it (see the sketch below)
- improves stability by reducing the variance of the updates (compared to plain SGD)
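A minimal sketch of the two sampling schemes under the same assumptions as the sketch above (made-up data, illustrative batch size of 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and weight, as in the sketch above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w, learning_rate = 0.0, 0.05

# Stochastic GD: gradient of the squared error of ONE randomly picked data point
i = rng.integers(len(x))
grad_sgd = 2 * (w * x[i] - y[i]) * x[i]

# Mini-batch GD: gradient averaged over a small random batch -> lower variance than SGD
batch = rng.choice(len(x), size=2, replace=False)
grad_mini = np.mean(2 * (w * x[batch] - y[batch]) * x[batch])

w -= learning_rate * grad_mini  # same update rule; only the gradient estimate differs
```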
Stochastic GD - Pro
- efficient -> per-iteration cost independent of n -> O(d*t)
- easy to implement
- noisy weight updates can help escape local minima
- each training step is faster (but has more variance)
Stochastic GD - Con
- needs hyperparameters (regularization parameter, number of iterations)
- sensitive to feature scaling
- gradient of a single random example might point in the wrong direction -> acceptable as long as most gradients point in the right direction (correct on average)
- noisy updates & high variance (optimization is less stable)
- slow convergence: might need more iterations to find the minimum
- sensitive to learning rate: too high -> can overshoot the minimum; too low -> very slow convergence
Regularization
- methods to prevent overfitting
- goal: low bias (training error) & low variance (test error)
Methods to prevent overfitting
- Methods minimizing loss function + penalty during training (regularization)
- Early Stopping
- Dropout
- BatchNorm
Regularization - Methods minimizing loss function + penalty
- techniques to select features by shrinking weights: the penalty pushes a weight toward zero if its feature is not necessary for the model
- add a penalty to the loss function to select features
- goal: avoid under- & overfitting
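In general form (with λ as the regularization strength): regularized loss(w) = loss(w) + λ · penalty(w); the penalty term is what differs between the techniques on the next card.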
Regularization - Methods minimizing loss function + penalty - Techniques
- L0 regularization
- L1 regularization (Lasso)
- L2 regularization (Ridge)
- Elastic Net (L1 + L2)
L0 regularization
penalty adds a fixed amount (lambda) for each weight that is not 0, i.e. it counts the non-zero weights
L1 Regularization (Lasso)
- penalty added to the loss function: sum of the absolute values of the coefficients (penalty on the L1 norm of w)
- loss function + lambda * sum |w_i|
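A minimal sketch of the L1-penalized (Lasso) loss, with hypothetical weights, a made-up base loss value, and an illustrative lambda:

```python
import numpy as np

w = np.array([0.0, 2.5, -1.2, 0.0])   # hypothetical model weights
mse = 0.8                             # made-up base loss (e.g. mean squared error)
lam = 0.1                             # regularization strength lambda

l1_penalty = lam * np.sum(np.abs(w))  # lambda * sum of |w_i|  (L1 norm of w)
lasso_loss = mse + l1_penalty         # loss function + penalty, as on the card
print(lasso_loss)                     # 0.8 + 0.1 * 3.7 = 1.17
```

In practice this is available directly in libraries, e.g. sklearn.linear_model.Lasso, where the alpha parameter plays the role of lambda.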