! S10 & 14: Backpropagation, Hyperparameters & Learning Rate Flashcards
Backpropagation
- technique to compute NN gradient via chain rule
- Initialize the hidden layers' connection weights randomly
- Forward pass: for each training instance, make a prediction & measure the error
- Backward pass: go through the layers in reverse -> measure each connection's contribution to the error
- GD step: adjust the connection weights to reduce the error (beginning with the most influential ones)
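A minimal numpy sketch of these steps for a toy 1-hidden-layer network (illustrative only; the layer sizes, sigmoid activation and squared error are my own assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # 4 training instances, 3 features
y = rng.normal(size=(4, 1))            # targets
W1 = rng.normal(size=(3, 5)) * 0.1     # random init of connection weights
W2 = rng.normal(size=(5, 1)) * 0.1
sigmoid = lambda z: 1 / (1 + np.exp(-z))
eta = 0.1                              # learning rate

for _ in range(1000):
    # Forward pass: predict & measure the error
    h = sigmoid(X @ W1)
    y_hat = h @ W2
    error = y_hat - y
    # Backward pass: chain rule, layer by layer in reverse
    grad_W2 = h.T @ error
    grad_h = error @ W2.T * h * (1 - h)   # each hidden unit's contribution
    grad_W1 = X.T @ grad_h
    # GD step: adjust the connection weights
    W2 -= eta * grad_W2
    W1 -= eta * grad_W1
```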
Backpropagation - Cons
- has to be done for every single training example -> very slow, computationally expensive
- can get stuck in a local minimum
Ways to make optimizer faster
- Using a good initialization strategy for the connection weights
- Using good activation function
- Using Batch Normalization
- Reusing parts of pretrained network
- Use faster optimizer, e.g. Momentum optimization or Nesterov Accelerated Gradient
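A hedged tf.keras sketch (the framework and layer sizes are my own example choices, not from the slides) combining several of these speed-ups in one model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),   # good weight initialization
    tf.keras.layers.BatchNormalization(),                         # Batch Normalization
    tf.keras.layers.Activation("elu"),                            # good activation function
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),  # faster optimizer
)
```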
Momentum Optimization
- technique used in gradient-based optimization algorithms
- introduces a momentum term that accumulates the gradients of previous iterations
- navigates better through areas with high curvature and can escape local minima
- exponentially smoothed average: computes a running average of the previous gradients (older gradients count less and less)
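A toy sketch of the momentum update rule on f(θ) = θ² (β = momentum term, η = learning rate; the 1-D quadratic is just an assumption for illustration):

```python
beta, eta = 0.9, 0.1
theta, m = 5.0, 0.0
for _ in range(200):
    grad = 2 * theta           # gradient of the cost function at theta
    m = beta * m - eta * grad  # accumulate: exponentially smoothed gradient average
    theta += m                 # update moves in the accumulated direction
print(theta)                   # converges toward the minimum at 0
```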
Nesterov Accelerated Gradient (NAG)
- measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm (applies the gradient computed after the momentum step <-> regular momentum: before)
- even faster than momentum optimization
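Same toy setup as the momentum sketch above, changed only where NAG differs: the gradient is measured at the lookahead point θ + βm:

```python
beta, eta = 0.9, 0.1
theta, m = 5.0, 0.0
for _ in range(200):
    grad = 2 * (theta + beta * m)  # gradient slightly ahead, at theta + beta*m
    m = beta * m - eta * grad
    theta += m
print(theta)                       # converges toward the minimum at 0
```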
Optimization of parameters vs hyperparameters
- Parameters (weights): optimized with backpropagation
- Hyperparameters: set manually or in a tuning phase (test different values on a validation dataset -> no overfitting of the training data)
Learning Rate
- hyperparameter that determines the step size / magnitude of the parameter updates during training
- higher -> larger updates & potentially faster convergence (but too high can diverge)
- lower -> smaller updates & more cautious learning (but can be very slow)
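For reference, the plain Gradient Descent update where the learning rate η sets the step size (standard textbook formula, with cost function J):

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta)
```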
Learning Rate - Method
- Run GD for a while with a fixed step size
- Measure the error & plot the progress
- If the error is not decreasing -> decrease the step size
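This can be automated, e.g. with Keras's ReduceLROnPlateau callback (a sketch; `model` and the training data come from your own pipeline and are not shown here):

```python
import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation error
    factor=0.5,          # halve the step size...
    patience=5,          # ...when the error has not improved for 5 epochs
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[reduce_lr])
```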
Bias step-size multiplier
- use a bigger step size for the bias variables
Momentum
- technique where a term is added to the parameter updates that accounts for the previous direction of movement
- e.g. momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient
Learning Rate - add momentum
- Add a term that keeps moving in the previous direction (β = 0.9)
Finding good learning rate
- train the model for a few hundred iterations
- exponentially increase the learning rate from a small to a large value
- look at the learning curve (loss = y, learning rate / iteration = x)
- pick a learning rate slightly below the point where the loss stops decreasing and starts to climb
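A toy sketch of this range test on f(θ) = θ² (the quadratic loss, growth factor and stopping threshold are assumptions for illustration):

```python
theta, lr = 5.0, 1e-5
lrs, losses = [], []
for _ in range(500):
    grad = 2 * theta
    theta -= lr * grad
    lrs.append(lr)
    losses.append(theta ** 2)
    if losses[-1] > 1e6:   # stop once the loss blows up
        break
    lr *= 1.05             # exponential increase from small to large
# Plot losses vs lrs (e.g. with matplotlib) and pick a learning rate slightly
# below the point where the loss starts to climb.
```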
Learning schedule - set learning rate to…
- strategies to reduce the learning rate during training
- Power Scheduling: … a decreasing function of the iteration number
- Exponential Scheduling: … gradually drop by a factor of 10 every s steps
- Piecewise Constant Scheduling: … one constant value for some epochs, then a smaller one for the next epochs, etc.
- Performance Scheduling: … measure the validation error every N steps & reduce the learning rate by a factor of λ when the error stops dropping
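Sketches of the four schedules as plain functions of the step/epoch t (η0, s, c and λ are the schedule hyperparameters; the concrete default values below are just placeholders):

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    return eta0 / (1 + t / s) ** c                 # decreasing function of the iteration number

def exponential_schedule(t, eta0=0.01, s=20):
    return eta0 * 0.1 ** (t / s)                   # drops by a factor of 10 every s steps

def piecewise_constant_schedule(t):
    return 0.01 if t < 5 else 0.005 if t < 15 else 0.001   # one constant value per epoch range

def performance_schedule(eta, val_errors, lam=0.5, patience=5):
    # Reduce eta by a factor lam when the validation error has stopped dropping.
    if len(val_errors) > patience and min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return eta * lam
    return eta
```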
HP Optimization - Challenges
- Function evaluations can be very expensive for large models, ML pipelines or datasets
- Configuration space is often complex (mix of continuous, categorical & conditional hp) & high-dimensional (not always clear which hp to optimize)
- No access to the gradient of the hp loss function (other properties used in classical optimization do not apply either, e.g. convexity, smoothness)
- Can't directly optimize for generalization performance (because the training data is of limited size)
HP Optimization Techniques
- Babysitting (Grad student descent)
- Grid Search
- Random Search (Grid & Random Search = model-free blackbox optimization methods)
- Gradient-based optimization
- Bayesian optimization (BO)
- Multi-fidelity optimization algorithm
- Metaheuristic algorithm
HP Optimization Techniques - 1. Babysitting
- Babysitting = “Trial & Error” = Grad student descent (GSD)
- implemented by 100% manual tuning
- Problem: impractical with a large number of hp, complex models, time-consuming model evaluations & non-linear hp interactions
HP Optimization Techniques - 2. Grid Search
- exhaustive search (= brute-force method): evaluate all combinations of the candidate values of >= 2 hp -> the candidates form a grid of values
- Problem: inefficient for many hp (the number of evaluations grows exponentially with the number of hp) + computationally expensive
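A minimal scikit-learn sketch of grid search (the SVC model, its two hp and the candidate values are my own example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}   # 3 x 3 grid = 9 combinations
search = GridSearchCV(SVC(), param_grid, cv=3)              # evaluates every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```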
HP Optimization Techniques - 3. Random Search
- evaluate random combinations of hp values (sampled from the specified ranges / distributions)
- Problem: can miss important values in the search space
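The same example with random search instead, sampling 20 random combinations from log-uniform ranges (again my own example choices):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```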
HP Optimization Techniques - 5. Bayesian Optimization
- determines future evaluation points based on previously-obtained results
- build a probability model of the objective function and use it to select the most promising hyperparameters to evaluate in the true objective function
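A sketch using Optuna (my own library choice; its default TPE sampler is a sequential model-based, i.e. Bayesian-style, optimizer that proposes new hp values based on past trials):

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(trial):
    # The sampler suggests promising hp values based on previous results.
    C = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-3, 1e1, log=True)
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```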
HP Optimization Techniques - 6. Multi-fidelity optimization algorithm
- combines high- & low-fidelity evaluations over iterations:
- low: hp evaluated on a small data subset (low cost, poor generalization) -> keep the promising values
- high: the promising hp region from the low-fidelity step evaluated on a larger data subset -> fine-tune the hp
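A toy sketch of one multi-fidelity scheme (successive halving); `evaluate(config, budget)` is a hypothetical function that trains on a data subset of the given size and returns a validation score:

```python
import random

def successive_halving(configs, evaluate, min_budget=100, eta=2, rounds=3):
    budget = min_budget
    for _ in range(rounds):
        # Low fidelity first: score every surviving config on a small budget.
        scored = [(evaluate(cfg, budget), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)                 # best scores first
        configs = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]  # keep promising values
        budget *= eta                                                       # higher fidelity next round
    return configs[0]

# Hypothetical usage: 16 random learning-rate candidates, dummy evaluation.
candidates = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
best = successive_halving(candidates, lambda cfg, b: -abs(cfg["lr"] - 0.01) * b)
print(best)
```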
HP Optimization Techniques - 7. Metaheuristic algorithm
- set of algorithms inspired by biology & widely used (e.g. genetic algorithms, particle swarm optimization)
- employ heuristics and iterative search procedures to explore the search space
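A toy sketch of a genetic-algorithm-style hp search (one metaheuristic); `fitness(config)` is a hypothetical function returning a validation score, here a dummy that peaks at lr = 0.01:

```python
import random

def evolve(fitness, population_size=10, generations=5, mutation_scale=0.3):
    # Random initial population of learning-rate candidates (log-uniform).
    population = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)        # selection: best first
        parents = population[: population_size // 2]
        children = [{"lr": p["lr"] * 10 ** random.gauss(0, mutation_scale)}  # mutation
                    for p in parents]
        population = parents + children
    return max(population, key=fitness)

print(evolve(lambda cfg: -abs(cfg["lr"] - 0.01)))
```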