S9: Gradient Descent Flashcards

1
Q

Gradient Descent - Definition

A

iterative optimization algorithm used to minimize a cost function to find the optimal values of the parameters (weights) in a machine learning model

2
Q

Gradient Descent - Steps

A
  1. Define a cost function (difference between predicted & actual values)
  2. Calculate the partial derivatives of the cost function with respect to each weight (weights randomly initialized at first)
  3. Update weights: if derivative > 0 -> decrease weight; if derivative < 0 -> increase weight
  4. Repeat until reaching the minimum of the cost function
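A minimal numpy sketch of these four steps for linear regression with an MSE cost (function and parameter names are illustrative, not from the cards):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    n, d = X.shape
    w = np.random.randn(d)              # step 2: weights randomly initialized
    for _ in range(n_iters):            # step 4: repeat
        # partial derivatives of the MSE cost with respect to each weight
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad                  # step 3: move against the gradient
    return w
```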
3
Q

Gradient Descent - Step Size

A
  • how much the weights are updated in each iteration
  • step size = slope * learning rate
  • steps gradually get smaller as the parameters approach the minimum
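  • e.g. slope 2 * learning rate 0.1 -> step size 0.2; near the minimum the slope flattens toward 0, so the step shrinks with it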
4
Q

Gradient Descent - Cons

A
  • all features must have a similar scale
  • high computational cost O(ndt) (e.g. 10,000 data points * 10 features * 1,000 iterations = 10^8 operations)
  • might only find a local minimum (if the cost function is non-convex)
5
Q

Gradient

A

= vector of partial derivatives of a function that has more than one input variable
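e.g. for f(x, y) = x^2 + y^2 the gradient is (∂f/∂x, ∂f/∂y) = (2x, 2y), which points in the direction of steepest ascent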

6
Q

Stochastic GD

A

= solution to reduce computational cost: randomly picks one data point at each iteration, so the per-iteration cost no longer depends on n
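A sketch of the SGD inner loop under the same linear-regression setup as above (names illustrative): the update rule is unchanged, only the gradient is now estimated from a single random data point, so each step costs O(d) instead of O(nd).

```python
import numpy as np

def sgd(X, y, lr=0.01, n_iters=1000):
    n, d = X.shape
    w = np.random.randn(d)
    for _ in range(n_iters):
        i = np.random.randint(n)                  # pick one random data point
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]     # gradient on that point only
        w -= lr * grad
    return w
```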

7
Q

Batch & Mini Batch GD

A
  • solution to reduce computational cost: randomly picks a mini-batch of data points at each iteration & computes the gradient on it
  • improves performance by reducing variance (in contrast to SGD)
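Relative to the SGD sketch above, only the sampling changes — one mini-batch update step might look like this (batch_size is an illustrative choice):

```python
import numpy as np

def minibatch_step(X, y, w, lr=0.01, batch_size=32):
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]                           # random mini-batch
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)  # averaged gradient
    return w - lr * grad                              # less noisy than SGD
```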
8
Q

Stochastic GD - Pros

A
  • efficient -> independent of n -> O(dt)
  • easy implementation
  • avoidance of local minima due to noisy updates of the weights
  • each training step is faster (but has more variance)
9
Q

Stochastic GD - Cons

A
  • needs hyperparameters (regularization parameter, number of iterations)
  • sensitive to feature scaling
  • gradient of a random example might point in the wrong direction -> acceptable as long as most gradients point in the right direction
  • noisy updates & high variance (optimization less stable)
  • slow convergence: might need more iterations to find the minimum
  • sensitive to learning rate: too high -> can overshoot the minimum
10
Q

Regularization

A
  • methods to prevent overfitting
  • goal: low bias (training error) & low variance (test error)
11
Q

Methods to prevent overfitting

A
  1. Regularization: minimizing the loss function + penalty (during training)
  2. Early Stopping
  3. Dropout
  4. BatchNorm
12
Q

Regularization - Minimizing loss function + penalty

A
  • techniques to select features by shrinking weights toward 0 if a feature is not necessary for the model
  • add a penalty term to the loss function to select features
  • goal: avoid under- & overfitting
13
Q

Regularization - Minimizing loss function + penalty - Techniques

A
  1. L0 regularization
  2. L1 regularization (Lasso)
  3. L2 regularization (Ridge)
  4. Elastic Net (L1 + L2)
14
Q

L0 regularization

A

penalty is a fixed amount per non-zero weight (lambda added for each w ≠ 0, i.e. lambda * number of non-zero weights)

15
Q

L1 Regularization (Lasso)

A
  • penalty added to loss function: sum of the “absolute values of the magnitudes” of the coefficients (penalty on the L1-norm of w)
  • Loss function + lambda * sum |w|
16
Q

L2 Regularization (Ridge)

A
  • penalty added to loss function: sum of the “squared magnitudes” of the coefficients (penalty on the L2-norm of w)
  • Loss function + lambda * sum w^2
  • lambda = hyperparameter controlling how strongly to penalize
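The two penalized losses from these cards, written out in plain numpy (a sketch; mse, w, and lam are illustrative names, with lam playing the role of lambda):

```python
import numpy as np

def lasso_loss(mse, w, lam):
    return mse + lam * np.sum(np.abs(w))    # L1: loss + lambda * sum |w|

def ridge_loss(mse, w, lam):
    return mse + lam * np.sum(w ** 2)       # L2: loss + lambda * sum w^2
```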
17
Q

L2 & L1 Regularization - Similarity

A
  • decrease variance: lower test error
18
Q

L2 & L1 Regularization - Difference

A
  • L1: encourages weights to be exactly 0 (a weight of 0 effectively excludes the feature from the model); solution not unique
  • L2: all weights tend to be non-zero but small; unique solution
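A quick way to see this difference, assuming scikit-learn is available (its alpha parameter plays the role of lambda): on data where only the first feature matters, Lasso typically zeroes the irrelevant weights while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(size=100)      # only feature 0 is relevant

print(Lasso(alpha=0.5).fit(X, y).coef_)       # irrelevant weights exactly 0
print(Ridge(alpha=0.5).fit(X, y).coef_)       # all weights small but non-zero
```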
19
Q

Early Stopping

A

= technique to prevent overfitting by monitoring the model's performance on a validation set during training & stopping at the lowest validation error (the training error can keep decreasing)

20
Q

Early Stopping - Method

A
  • Algorithm learns in “epochs”: passes over the whole dataset multiple times during training, makes predictions, compares them to the actual labels & updates weights & biases to minimize the loss
  • Monitor the validation error while running Stochastic Gradient Descent
  • Stop the algorithm when the validation error reaches its minimum / starts increasing again (= overfitting)
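A minimal early-stopping loop as a sketch — train_step and val_error are hypothetical callables for one epoch of training and one validation evaluation:

```python
def early_stopping(train_step, val_error, max_epochs=100, patience=5):
    best, bad = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                      # one epoch of training
        err = val_error()                 # monitor validation error
        if err < best:
            best, bad = err, 0            # new minimum -> keep this model
        else:
            bad += 1                      # validation error rising again
            if bad >= patience:
                break                     # stop: likely overfitting
    return best
```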
21
Q

Dropout

A

= technique to prevent overfitting by randomly dropping out (i.e., setting to zero) a proportion of the output features or activations of a layer during training – for deep neural networks

22
Q

Dropout - Method

A
  • At each training step: each input neuron has probability p (dropout rate) of being dropped
  • Evaluate the training loss: if the model overfits -> increase p, if it underfits -> decrease p
  • After training: use all neurons
  • Compensation for dropout: 1. multiply each input connection weight by the keep probability (1 - p) after training, or 2. divide each neuron's output by the keep probability (1 - p) during training (see the sketch below)
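A numpy sketch of the second compensation option (“inverted dropout”): drop each activation with probability p during training and divide the survivors by the keep probability (1 - p), so nothing needs rescaling after training. Names are illustrative:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                       # after training: use all neurons
    keep = np.random.rand(*activations.shape) >= p   # drop with probability p
    return activations * keep / (1.0 - p)        # compensate: divide by keep prob
```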
23
Q

Dropout - Time-Effort Trade-off

A

well-tuned dropout -> slows convergence but yields a better model

24
Q

Batch Norm

A

technique used to mitigate the negative effects of internal covariate shift by stabilizing the distribution of the inputs (over a mini-batch) to each layer during training

25
Q

Batch Norm - Importance

A
  • In deep neural networks: each layer receives input -> gives output to the next layer
  • During training: the values of the inputs to each layer can change
  • change in the distribution of input values = internal covariate shift (= the statistical properties of a layer's inputs, e.g. mean & variance, shift as the network learns)
  • challenging for subsequent layers to learn effectively because they need to continuously adapt to changing distributions
26
Q

Batch Norm - Method

A
  • operates on a mini-batch of inputs -> for each mini-batch, the mean & variance of the activations across the batch are computed
  • at each hidden layer:
  • normalize & zero-center the input
  • scale & shift the result
  • pass it through the activation function
  • feed it into the next layer
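A numpy sketch of the normalize/scale/shift part for one mini-batch (rows = samples; gamma and beta are the learned scale & shift, eps avoids division by zero; names illustrative):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                     # per-feature mean over the mini-batch
    var = X.var(axis=0)                     # per-feature variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)   # normalize & zero-center
    return gamma * X_hat + beta             # scale & shift
```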
27
Q

Batch Norm - Pros

A
  • Regularization effect (adds some noise via mini-batch statistics -> helps prevent overfitting)
  • Reduced sensitivity to initialization
  • no standardization of the training set required if a batch normalization layer is the first layer
  • better performance
  • faster convergence
28
Q

Batch Norm - Cons

A
  • Adds a runtime penalty to the NN (training is slower because each epoch takes extra time for BN)
  • Tricky to use in RNNs (easier in e.g. CNNs)
29
Q

Gradient Clipping

A
  • often used in RNNs
  • solution to the exploding gradient problem
  • sets a threshold on gradients: if too large -> gradients are scaled down or clipped
  • stabilizes the training process & prevents extreme updates to the model's parameters
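A numpy sketch of clipping by norm (the threshold value is an illustrative choice): if the gradient's norm exceeds the threshold, the whole vector is scaled down to that norm while keeping its direction:

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)    # scale down, keep the direction
    return grad
```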