! S10 & 14: Backpropagation, Hyperparameters & Learning Rate Flashcards
1
Q
Backpropagation
A
- technique to compute the gradient of a NN (loss w.r.t. every connection weight) via the chain rule
- Initialize the hidden layers' connection weights randomly
- Forward pass: for each training instance, predict & measure the error
- Backward pass: go through the layers in reverse -> measure the error contribution of each connection
- GD: adjust the connection weights (beginning with the most influential ones); see the sketch below
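A minimal NumPy sketch of these steps, assuming a tiny one-hidden-layer regression net with squared-error loss (all names, sizes and data here are illustrative, not from the cards):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                    # 32 toy instances, 3 features
y = X.sum(axis=1, keepdims=True)                # toy regression target

# Random initialization of the connection weights
W1, b1 = rng.normal(scale=0.1, size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=0.1, size=(4, 1)), np.zeros((1, 1))
lr = 0.1

for step in range(200):
    # Forward pass: predict & measure the error
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    err = y_hat - y

    # Backward pass: chain rule, layer by layer in reverse
    d_out = 2 * err / len(X)                    # d(mean squared error)/d(y_hat)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * (1 - h ** 2)         # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0, keepdims=True)

    # Gradient Descent step: adjust the connection weights
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2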
2
Q
Backpropagation - Con
A
- must be done for every single training example -> very slow, computationally expensive
- can get stuck in a local minimum
3
Q
Ways to make optimizer faster
A
- Using a good initialization strategy for the connection weights
- Using good activation function
- Using Batch Normalization
- Reusing parts of pretrained network
- Using a faster optimizer, e.g. Momentum optimization or Nesterov Accelerated Gradient (see the sketch below)
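A sketch of how these speed-ups look in Keras, assuming the TensorFlow/Keras API; the pretrained base, layer sizes and loss are illustrative placeholders:

import tensorflow as tf

# Reusing parts of a pretrained network (arbitrary example base)
pretrained_base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg")
pretrained_base.trainable = False

model = tf.keras.Sequential([
    pretrained_base,
    tf.keras.layers.Dense(100, activation="elu",              # good activation function
                          kernel_initializer="he_normal"),    # good initialization strategy
    tf.keras.layers.BatchNormalization(),                     # Batch Normalization
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Faster optimizer: SGD with momentum / Nesterov Accelerated Gradient
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)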
4
Q
Momentum Optimization
A
- technique used in gradient-based optimization algorithms
- introduces a momentum term that accumulates the gradients of previous iterations
- navigates better through areas with high curvature and can escape local minima
- exponentially smoothed average: keeps a decaying average of the previous gradients (gradients further in the past become less important); see the sketch below
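A minimal sketch of the momentum update rule (m <- beta*m - eta*grad, theta <- theta + m), assuming a toy quadratic cost; the values of eta and beta are illustrative:

def grad(theta):                       # gradient of the toy cost J(theta) = theta^2
    return 2 * theta

theta, m = 5.0, 0.0                    # parameter and momentum vector
eta, beta = 0.1, 0.9                   # learning rate, momentum factor

for _ in range(50):
    m = beta * m - eta * grad(theta)   # accumulate exponentially smoothed gradients
    theta = theta + m                  # the update uses the momentum, not just the last gradient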
5
Q
Nesterov Accelerated Gradient (NAG)
A
- measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm (the gradient is applied after the momentum step, whereas plain momentum applies it before); see the sketch below
- even faster than momentum optimization
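A sketch of the NAG update, assuming the same toy quadratic cost as in the momentum sketch above; only the point where the gradient is measured changes:

def grad(theta):
    return 2 * theta

theta, m = 5.0, 0.0
eta, beta = 0.1, 0.9

for _ in range(50):
    m = beta * m - eta * grad(theta + beta * m)   # gradient measured slightly ahead, at theta + beta*m
    theta = theta + m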
6
Q
Optimization of parameters vs hyperparameters
A
- Parameters (weights): optimized with backpropagation
- Hyperparameters: set manually or in a tuning phase (test different values on a validation dataset -> no overfitting of the training data)
7
Q
Learning Rate
A
- hyperparameter that determines the step size / magnitude of the parameter updates during training
- higher -> larger updates & potentially faster convergence
- lower -> smaller updates & more cautious learning (see the sketch below)
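A tiny sketch of how the learning rate scales a single Gradient Descent update, assuming a toy quadratic cost J(theta) = theta^2 (so the gradient is 2*theta); the rates are illustrative:

theta = 5.0
gradient = 2 * theta                    # gradient at the current position

for eta in (0.01, 0.1, 0.9):            # low, medium, high learning rate
    print(eta, theta - eta * gradient)  # theta_new = theta - eta * gradient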
8
Q
Learning Rate - Method
A
- Run GD for a while with a fixed step size
- Measure the error & plot the progress
- If the error is not decreasing -> decrease the step size
9
Q
Bias step-size multiplier
A
use a bigger step size for the bias variables
10
Q
Momentum
A
- technique where a term is added to the parameter updates that accounts for the previous direction of movement
- e.g. a momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient
11
Q
Learning Rate - add momentum
A
Add a term that keeps moving in the previous direction (β = 0.9)
12
Q
Finding good learning rate
A
- train the model for a few hundred iterations
- exponentially increase the learning rate from a small to a large value
- look at the learning curve (loss = y, learning rate = x)
- pick a learning rate slightly below the point where the loss stops decreasing (see the sketch below)
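A sketch of such a learning-rate finder in Keras; the model, data, batch count and LR bounds are all illustrative assumptions:

import numpy as np
import tensorflow as tf

# Toy data and model so the sketch is self-contained
X_train = np.random.rand(16000, 20).astype("float32")
y_train = (X_train.sum(axis=1) > 10).astype("int32")
model = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                             tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5))

class ExponentialLR(tf.keras.callbacks.Callback):
    """Multiply the learning rate by a constant factor after each batch and record the loss."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates, self.losses = [], []
    def on_train_batch_end(self, batch, logs=None):
        lr = self.model.optimizer.learning_rate
        self.rates.append(float(lr.numpy()))
        self.losses.append(logs["loss"])
        lr.assign(lr * self.factor)

# Grow the LR from 1e-5 towards ~10 over one epoch (500 batches of 32); afterwards,
# plot lr_finder.losses against lr_finder.rates and pick a rate slightly below the
# point where the loss stops dropping.
lr_finder = ExponentialLR(factor=(10 / 1e-5) ** (1 / 500))
model.fit(X_train, y_train, epochs=1, batch_size=32, callbacks=[lr_finder])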
13
Q
Learning schedule - set learning rate to…
A
- strategies to reduce the learning rate during training (sketches below)
- Power Scheduling: … function of the iteration number
- Exponential Scheduling: … gradually drop by a factor of 10 every s steps
- Piecewise Constant Scheduling: … one value for some epochs, then a smaller one for the next
- Performance Scheduling: … measure the validation error every N steps & reduce the learning rate by a factor of lambda when the error stops dropping
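Sketches of the four schedules, assuming Keras; eta0, s and the other constants are illustrative:

import tensorflow as tf

def power_scheduling(epoch, eta0=0.01, s=20, c=1):
    return eta0 / (1 + epoch / s) ** c               # function of the iteration number

def exponential_scheduling(epoch, eta0=0.01, s=20):
    return eta0 * 0.1 ** (epoch / s)                 # drops by a factor of 10 every s steps

def piecewise_constant_scheduling(epoch):
    return 0.01 if epoch < 15 else 0.001             # one value for some epochs, then a smaller one

# Performance scheduling: reduce the LR by a factor when the validation error stops dropping
performance_scheduling = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# Usage: pass one of the functions to a LearningRateScheduler callback, e.g.
# model.fit(..., callbacks=[tf.keras.callbacks.LearningRateScheduler(exponential_scheduling)])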
14
Q
Learning rate - Challenges
A
- Function evaluations can be very expensive for large models, ML pipelines or datasets
- Configuration space = often complex (mix of continuous, categorical & conditional hyperparameters) & high-dimensional (not always clear which hyperparameters to optimize)
- No access to gradient of hp loss function (other properties of function used in classical optimization do not apply, e.g. convexity, smoothness)
- Can’t directly optimize for generalization performance (because training data = limited size)
15
Q
HP Optimization Techniques
A
- Babysitting (Grad student descent)
- Grid Search
- Random Search (Grid & Random Search = model-free blackbox optimization methods; see the sketch below)
- Gradient-based optimization
- Bayesian optimization (BO)
- Multi-fidelity optimization algorithms
- Metaheuristic algorithms
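A sketch of the two model-free blackbox methods, Grid Search and Random Search, assuming scikit-learn; the estimator, dataset and parameter ranges are illustrative:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid Search: try every combination in a fixed grid of hyperparameter values
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=3)
grid.fit(X, y)

# Random Search: sample hyperparameter values from distributions for a fixed budget
rnd = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
                         n_iter=20, cv=3, random_state=42)
rnd.fit(X, y)

print(grid.best_params_, rnd.best_params_)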