! S9: Gradient Descent Flashcards
Gradient Descent - Definition
iterative optimization algorithm used to minimize a cost function to find the optimal values of the parameters (weights) in a machine learning model
Gradient Descent - Steps
- Define a cost function (difference between predicted & actual values)
- Calculate the partial derivatives of the cost function with respect to each weight (weights randomly initialized first)
- Update weights in the opposite direction of the derivative: if derivative > 0 -> decrease weight, if derivative < 0 -> increase weight
- Repeat until reaching the minimum of the cost function
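The steps above can be sketched for linear regression with an MSE cost; all names here (X, y, w, lr) are illustrative assumptions, not fixed notation from the cards:

```python
import numpy as np

# Minimal gradient-descent sketch for linear regression (MSE cost).
np.random.seed(0)
X = np.c_[np.ones(20), np.random.rand(20, 1)]  # bias column + 1 feature
true_w = np.array([2.0, 3.0])
y = X @ true_w                                 # noiseless targets

w = np.random.randn(2)   # weights randomly initialized first
lr = 0.1                 # learning rate (hyperparameter)
for _ in range(1000):    # repeat until (near) the minimum
    grad = 2 / len(X) * X.T @ (X @ w - y)  # partial derivatives of MSE
    w -= lr * grad       # update: step size = slope * learning rate
```

Because the step is proportional to the gradient, updates shrink automatically as the slope flattens near the minimum.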
Gradient Descent - Step Size
- how much the weights are updated in each iteration
- step size = slope * learning rate
- steps gradually get smaller as parameters approach minimum
Gradient Descent - Con
- all features must have similar scale
- high computational cost O(ndt) (e.g., n = 10,000 data points * d = 10 features * t = 1,000 iterations)
- might only find local minimum (if non-convex)
Gradient
= vector of partial derivatives of a function that has > 1 input variable
Stochastic GD
= solution to reduce computational cost: randomly picks one data point at each iteration to compute the gradient -> n = 1
Batch & Mini Batch GD
- Mini-Batch GD: solution to reduce computational cost: randomly picks a mini-batch at each iteration & calculates the gradient on it (Batch GD uses all data points per update)
- improves performance by reducing variance (compared to SGD)
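A mini-batch SGD sketch for the same linear-regression setup; with batch_size = 1 this is pure SGD, with batch_size = n it is batch GD (names and values are illustrative assumptions):

```python
import numpy as np

# Mini-batch SGD sketch: noisy but cheap gradient updates.
np.random.seed(1)
X = np.c_[np.ones(100), np.random.rand(100, 1)]
true_w = np.array([2.0, 3.0])
y = X @ true_w

w = np.zeros(2)
lr, batch_size = 0.1, 10
for epoch in range(500):
    idx = np.random.permutation(len(X))       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]     # random mini-batch
        grad = 2 / len(b) * X[b].T @ (X[b] @ w - y[b])
        w -= lr * grad                        # update from mini-batch only
```

Each update costs O(batch_size * d) instead of O(n * d), which is where the savings in the cards' O(ndt) come from.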
Stochastic GD - pro
- efficient -> independent of n -> O(dt)
- easy implementation
- can escape local minima due to noisy weight updates
- each training step is faster (but more variance)
Stochastic GD - Con
- needs hyperparameters (regularization parameter, number of iterations)
- sensitive to feature scaling
- gradient of a random example might point in the wrong direction -> acceptable as long as most gradients point in the right direction
- noisy updates & high variance (optimization less stable)
- slow convergence: might need more iterations to find the minimum
- sensitive to learning rate: too high -> can overshoot the minimum, too low -> slow convergence
Regularization
- methods to prevent overfitting
- goal: low bias (training error) & low variance (test error)
Methods to prevent overfitting
- Regularization: minimizing loss function + penalty (during training)
- Early Stopping
- Dropout
- BatchNorm
Regularization - Methods with minimizing loss function + penalty
- techniques that shrink weights by adding a penalty on their magnitude to the loss function
- can select features by driving the weights of unnecessary features toward zero
- goal: avoid under- & overfitting
Regularization - Methods with minimizing loss function + penalty - Techniques
- L0 regularization
- L1 regularization (lasso)
- L2 Regularization (Ridge)
- Elastic Net (L1 + L2)
L0 regularization
penalty counts the non-zero weights: add lambda for each weight w that is not 0
L1 Regularization (Lasso)
- penalty added to loss function: sum of absolute values of the coefficients (penalty on the L1-norm of w)
- Loss function + lambda * sum |w_i|
L2 Regularization (Ridge)
- penalty added to loss function: sum of squared magnitudes of the coefficients (penalty on the L2-norm of w)
- Loss function + lambda * sum w_i^2
- Lambda = hyperparameter controlling how strongly to penalize
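The three penalty terms can be written out directly; the weight vector and lambda below are made-up illustrative values:

```python
import numpy as np

# Illustrative L0 / L1 / L2 penalty terms for one weight vector.
w = np.array([0.5, -2.0, 0.0, 1.5])
lam = 0.1                                  # lambda: penalty strength

l0_penalty = lam * np.count_nonzero(w)     # L0: lambda per non-zero weight
l1_penalty = lam * np.sum(np.abs(w))       # Lasso: lambda * sum |w_i|
l2_penalty = lam * np.sum(w ** 2)          # Ridge: lambda * sum w_i^2
```

Each penalty is added to the training loss; only the weights are penalized, the fit term stays unchanged.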
L2 & L1 Regularization - Similarity
- decrease variance: lower test error
L2 & L1 Regularization - Difference
- L1: encourages weights to be exactly 0 (a weight of 0 removes the feature from the model -> feature selection), solution not unique
- L2: all w tend to be non-zero, unique solution
Early Stopping
= technique to prevent overfitting by monitoring the model's performance on a validation set during training & stopping at the lowest validation error (training error can keep decreasing)
Early Stopping - Method
- Algorithm learns in "epochs": passes over the whole dataset multiple times during training, makes predictions, compares them to the actual labels & updates weights & biases to minimize the loss
- Monitor the validation error while running Stochastic Gradient Descent
- Stop the algorithm when the validation error reaches a minimum / starts increasing again (= overfitting)
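The stopping logic can be sketched with a patience counter; the per-epoch validation errors below are made-up stand-ins for a real training run:

```python
# Early-stopping sketch: track the best validation error, stop when it
# has not improved for `patience` epochs.
best_val, best_epoch, patience, bad_epochs = float("inf"), None, 3, 0

# stand-in validation errors per epoch: decrease, then rise again (overfitting)
val_errors = [1.0, 0.7, 0.5, 0.45, 0.47, 0.55, 0.6]

for epoch, val_err in enumerate(val_errors):
    # (in a real loop: train one epoch with SGD here, measure val_err,
    #  and checkpoint the weights whenever val_err improves)
    if val_err < best_val:
        best_val, best_epoch, bad_epochs = val_err, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation error keeps rising -> stop
            break
```

The model restored at the end is the checkpoint from `best_epoch`, not the last epoch trained.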
Dropout
= technique to prevent overfitting by randomly dropping out (i.e., setting to zero) a proportion of the output features or activations of a layer during training – for deep neural networks
Dropout - Method
- At each training step: each neuron has probability p (dropout rate) of being dropped
- Tune p via training loss: model overfits -> increase p, underfits -> decrease p
- After training: use all neurons
- Compensation for dropout: 1. multiply each input connection weight by the keep probability (1 - p) after training, or 2. divide each neuron's output by the keep probability (1 - p) during training (inverted dropout)
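Inverted dropout (option 2 above) for one layer's activations can be sketched as follows; the activation values and shapes are illustrative:

```python
import numpy as np

# Inverted dropout sketch: dividing by the keep probability (1 - p) during
# training preserves the expected output scale, so the test-time network
# uses all neurons with no rescaling.
np.random.seed(3)
p = 0.5                                         # dropout rate
activations = np.ones((4, 8))                   # fake layer outputs
mask = np.random.rand(*activations.shape) > p   # keep each unit with prob 1-p
dropped = activations * mask / (1 - p)          # zero out & rescale survivors
```

Each surviving activation is scaled up so that the expected value of `dropped` matches `activations`.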
Dropout - Time-Effort-trade-off
a well-tuned dropout rate slows convergence but yields a better model
Batch Norm
technique used to mitigate the negative effects of internal covariate shift by stabilizing the distribution of inputs (over a mini-batch) to each layer during training
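The training-mode normalization over a mini-batch can be sketched as below; gamma and beta stand for the learnable scale and shift parameters, and all values are illustrative:

```python
import numpy as np

# Batch-norm sketch (training mode): standardize each feature over the
# mini-batch, then apply a learnable scale (gamma) and shift (beta).
np.random.seed(4)
x = np.random.randn(32, 5) * 3 + 7        # layer inputs, shifted & scaled
mean = x.mean(axis=0)                     # per-feature batch mean
var = x.var(axis=0)                       # per-feature batch variance
eps, gamma, beta = 1e-5, 1.0, 0.0         # eps avoids division by zero
x_hat = (x - mean) / np.sqrt(var + eps)   # standardized inputs
out = gamma * x_hat + beta                # learnable scale & shift
```

At test time, running averages of mean and variance collected during training are used instead of the current batch statistics.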