S9: Gradient Descent Flashcards

1
Q

Gradient Descent - Definition

A

iterative optimization algorithm used to minimize a cost function to find the optimal values of the parameters (weights) in a machine learning model

2
Q

Gradient Descent - Steps

A
  1. Define a cost function (difference between predicted & actual values)
  2. Calculate the partial derivatives of the cost function with respect to each weight (weights randomly initialized at first)
  3. Update weights: if derivative > 0 -> decrease weight; if derivative < 0 -> increase weight
  4. Repeat until reaching the minimum of the cost function
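A minimal numpy sketch of these four steps for linear regression with an MSE cost (function and parameter names are illustrative, not from the cards):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    n, d = X.shape
    w = np.random.randn(d)              # step 2: weights randomly initialized
    for _ in range(n_iters):            # step 4: repeat
        # partial derivatives of the MSE cost with respect to each weight
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad                  # step 3: move against the gradient
    return w
```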
3
Q

Gradient Descent - Step Size

A
  • how much the weights are updated in each iteration
  • step size = slope * learning rate
  • steps gradually get smaller as the parameters approach the minimum
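  • e.g. slope 2 * learning rate 0.1 -> step size 0.2; near the minimum the slope flattens toward 0, so the step shrinks with it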
4
Q

Gradient Descent - Cons

A
  • all features must have a similar scale
  • high computational cost O(ndt) (e.g. 10,000 data points * 10 features * 1,000 iterations = 10^8 operations)
  • might only find a local minimum (if the cost function is non-convex)
5
Q

Gradient

A

= vector of partial derivatives of a function that has more than one input variable
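e.g. for f(x, y) = x^2 + y^2 the gradient is (∂f/∂x, ∂f/∂y) = (2x, 2y), which points in the direction of steepest ascent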

6
Q

Stochastic GD

A

= solution to reduce computational cost: randomly picks one data point at each iteration, so the per-iteration cost no longer depends on n
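A sketch of the SGD inner loop under the same linear-regression setup as above (names illustrative): the update rule is unchanged, only the gradient is now estimated from a single random data point, so each step costs O(d) instead of O(nd).

```python
import numpy as np

def sgd(X, y, lr=0.01, n_iters=1000):
    n, d = X.shape
    w = np.random.randn(d)
    for _ in range(n_iters):
        i = np.random.randint(n)                  # pick one random data point
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]     # gradient on that point only
        w -= lr * grad
    return w
```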

7
Q

Batch & Mini Batch GD

A
  • solution to reduce computational cost: randomly picks a mini-batch of data points at each iteration & computes the gradient on it
  • improves performance by reducing variance (in contrast to SGD)
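Relative to the SGD sketch above, only the sampling changes — one mini-batch update step might look like this (batch_size is an illustrative choice):

```python
import numpy as np

def minibatch_step(X, y, w, lr=0.01, batch_size=32):
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]                           # random mini-batch
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)  # averaged gradient
    return w - lr * grad                              # less noisy than SGD
```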
8
Q

Stochastic GD - Pros

A
  • efficient -> independent of n -> O(dt)
  • easy implementation
  • avoidance of local minima due to noisy updates of the weights
  • each training step is faster (but has more variance)
9
Q

Stochastic GD - Cons

A
  • needs hyperparameters (regularization parameter, number of iterations)
  • sensitive to feature scaling
  • gradient of a random example might point in the wrong direction -> acceptable as long as most gradients point in the right direction
  • noisy updates & high variance (optimization less stable)
  • slow convergence: might need more iterations to find the minimum
  • sensitive to learning rate: too high -> can overshoot the minimum
10
Q

Regularization

A
  • methods to prevent overfitting
  • goal: low bias (training error) & low variance (test error)
11
Q

Methods to prevent overfitting

A
  1. Regularization: minimizing the loss function + penalty (during training)
  2. Early Stopping
  3. Dropout
  4. BatchNorm
12
Q

Regularization - Minimizing loss function + penalty

A
  • techniques to select features by shrinking weights toward 0 if a feature is not necessary for the model
  • add a penalty term to the loss function to select features
  • goal: avoid under- & overfitting
13
Q

Regularization - Minimizing loss function + penalty - Techniques

A
  1. L0 regularization
  2. L1 regularization (Lasso)
  3. L2 regularization (Ridge)
  4. Elastic Net (L1 + L2)
14
Q

L0 regularization

A

penalty is a fixed amount per non-zero weight (lambda added for each w ≠ 0, i.e. lambda * number of non-zero weights)

15
Q

L1 Regularization (Lasso)

A
  • penalty added to loss function: sum of the “absolute values of the magnitudes” of the coefficients (penalty on the L1-norm of w)
  • Loss function + lambda * sum |w|
16
Q

L2 Regularization (Ridge)

A
  • penalty added to loss function: sum of the “squared magnitudes” of the coefficients (penalty on the L2-norm of w)
  • Loss function + lambda * sum w^2
  • lambda = hyperparameter controlling how strongly to penalize
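The two penalized losses from these cards, written out in plain numpy (a sketch; mse, w, and lam are illustrative names, with lam playing the role of lambda):

```python
import numpy as np

def lasso_loss(mse, w, lam):
    return mse + lam * np.sum(np.abs(w))    # L1: loss + lambda * sum |w|

def ridge_loss(mse, w, lam):
    return mse + lam * np.sum(w ** 2)       # L2: loss + lambda * sum w^2
```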
17
Q

L2 & L1 Regularization - Similarity

A
  • decrease variance: lower test error
18
Q

L2 & L1 Regularization - Difference

A
  • L1: encourages weights to be exactly 0 (a weight of 0 effectively excludes the feature from the model); solution not unique
  • L2: all weights tend to be non-zero but small; unique solution
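A quick way to see this difference, assuming scikit-learn is available (its alpha parameter plays the role of lambda): on data where only the first feature matters, Lasso typically zeroes the irrelevant weights while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(size=100)      # only feature 0 is relevant

print(Lasso(alpha=0.5).fit(X, y).coef_)       # irrelevant weights exactly 0
print(Ridge(alpha=0.5).fit(X, y).coef_)       # all weights small but non-zero
```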
19
Q

Early Stopping

A

= technique to prevent overfitting by monitoring the model's performance on a validation set during training & stopping at the lowest validation error (the training error can keep decreasing)

20
Q

Early Stopping - Method

A
  • Algorithm learns in “epochs”: passes over the whole dataset multiple times during training, makes predictions, compares them to the actual labels & updates weights & biases to minimize the loss
  • Monitor the validation error while running Stochastic Gradient Descent
  • Stop the algorithm when the validation error reaches its minimum / starts increasing again (= overfitting)
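A minimal early-stopping loop as a sketch — train_step and val_error are hypothetical callables for one epoch of training and one validation evaluation:

```python
def early_stopping(train_step, val_error, max_epochs=100, patience=5):
    best, bad = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                      # one epoch of training
        err = val_error()                 # monitor validation error
        if err < best:
            best, bad = err, 0            # new minimum -> keep this model
        else:
            bad += 1                      # validation error rising again
            if bad >= patience:
                break                     # stop: likely overfitting
    return best
```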
21
Q

Dropout

A

= technique to prevent overfitting by randomly dropping out (i.e., setting to zero) a proportion of the output features or activations of a layer during training – for deep neural networks

22
Q

Dropout - Method

A
  • At each training step: each input neuron has probability p (dropout rate) of being dropped
  • Evaluate the training loss: if the model overfits -> increase p, if it underfits -> decrease p
  • After training: use all neurons
  • Compensation for dropout: 1. multiply each input connection weight by the keep probability (1 - p) after training, or 2. divide each neuron's output by the keep probability (1 - p) during training (see the sketch below)
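A numpy sketch of the second compensation option (“inverted dropout”): drop each activation with probability p during training and divide the survivors by the keep probability (1 - p), so nothing needs rescaling after training. Names are illustrative:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                       # after training: use all neurons
    keep = np.random.rand(*activations.shape) >= p   # drop with probability p
    return activations * keep / (1.0 - p)        # compensate: divide by keep prob
```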
23
Q

Dropout - Time-Effort Trade-off

A

well-tuned dropout -> slows convergence but yields a better model

24
Q

Batch Norm

A

technique used to mitigate the negative effects of internal covariate shift by stabilizing the distribution of the inputs (over a mini-batch) to each layer during training

25
Q

Batch Norm - Importance

A
  • In deep neural networks: each layer receives input -> gives output to the next layer
  • During training: the values of the inputs to each layer can change
  • change in the distribution of input values = internal covariate shift (= the statistical properties of a layer's inputs, e.g. mean & variance, shift as the network learns)
  • challenging for subsequent layers to learn effectively because they need to continuously adapt to changing distributions
26
Q

Batch Norm - Method

A
  • operates on a mini-batch of inputs -> for each mini-batch, the mean & variance of the activations across the batch are computed
  • at each hidden layer:
  • normalize & zero-center the input
  • scale & shift the result
  • pass it through the activation function
  • feed it into the next layer
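A numpy sketch of the normalize/scale/shift part for one mini-batch (rows = samples; gamma and beta are the learned scale & shift, eps avoids division by zero; names illustrative):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                     # per-feature mean over the mini-batch
    var = X.var(axis=0)                     # per-feature variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)   # normalize & zero-center
    return gamma * X_hat + beta             # scale & shift
```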
27
Q

Batch Norm - Pros

A
  • Regularization effect (adds some noise via mini-batch statistics -> helps prevent overfitting)
  • Reduced sensitivity to initialization
  • no standardization of the training set required if a batch normalization layer is the first layer
  • better performance
  • faster convergence
28
Q

Batch Norm - Cons

A
  • Adds a runtime penalty to the NN (training is slower because each epoch takes extra time for BN)
  • Tricky to use in RNNs (easier in e.g. CNNs)
29
Q

Gradient Clipping

A
  • often used in RNNs
  • solution to the exploding gradient problem
  • sets a threshold on gradients: if too large -> gradients are scaled down or clipped
  • stabilizes the training process & prevents extreme updates to the model's parameters
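A numpy sketch of clipping by norm (the threshold value is an illustrative choice): if the gradient's norm exceeds the threshold, the whole vector is scaled down to that norm while keeping its direction:

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)    # scale down, keep the direction
    return grad
```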