S10 & 14: Backpropagation, Hyperparameters & Learning Rate Flashcards

1
Q

Backpropagation

A
  • technique to compute the gradient of a NN via the chain rule
  • Initialize the hidden layers' connection weights randomly
  • Forward pass: for each training instance, predict & measure the error
  • Backward pass: go through the layers in reverse -> measure the error contribution from each connection
  • GD: adjust the connection weights (beginning with the most influential ones) (sketch below)
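A minimal NumPy sketch of one such forward pass, backward pass and GD update for a single-hidden-layer network; the layer sizes, tanh activation and squared-error loss are illustrative assumptions, not part of the card:

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialize the hidden-layer connection weights (illustrative sizes)
W1, b1 = rng.normal(scale=0.1, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.1, size=(3, 1)), np.zeros(1)

x = np.array([[0.5, -1.0]])  # one training instance
y = np.array([[1.0]])        # its target

# Forward pass: predict & measure the error
h = np.tanh(x @ W1 + b1)
y_hat = h @ W2 + b2
loss = 0.5 * ((y_hat - y) ** 2).sum()

# Backward pass: chain rule, going through the layers in reverse
d_yhat = y_hat - y                  # dL/dy_hat
dW2 = h.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_h = d_yhat @ W2.T * (1 - h ** 2)  # back through the tanh activation
dW1 = x.T @ d_h
db1 = d_h.sum(axis=0)

# GD: adjust the connection weights
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```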
2
Q

Backpropagation - Con

A
  • must be done for every single training example -> very slow, computationally expensive
  • can get stuck in a local minimum
3
Q

Ways to make optimizer faster

A
  • Using a good initialization strategy for the connection weights
  • Using a good activation function
  • Using Batch Normalization
  • Reusing parts of a pretrained network
  • Using a faster optimizer, e.g. Momentum optimization or Nesterov Accelerated Gradient
4
Q

Momentum Optimization

A
  • technique used in gradient-based optimization algorithms
  • introduces a momentum term that accumulates the gradients of previous iterations
  • navigates better through areas with high curvature and escapes local minima more easily
  • exponentially smoothed average: calculates an average of the previous gradients, where gradients further in the past get less weight (sketch below)
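A minimal sketch of the momentum update rule; the learning rate, β = 0.9 and the toy objective are illustrative:

```python
def momentum_step(theta, g, m, lr=0.01, beta=0.9):
    # m is an exponentially smoothed average of past gradients:
    # each step, older contributions decay by a factor of beta
    m = beta * m - lr * g
    return theta + m, m

# toy usage on f(theta) = theta**2, whose gradient is 2 * theta
theta, m = 5.0, 0.0
for _ in range(100):
    theta, m = momentum_step(theta, 2 * theta, m)
print(theta)  # close to the minimum at 0
```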
5
Q

Nesterov Accelerated Gradient (NAG)

A
  • measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm (applies gradients computed after the momentum step <-> regular momentum: before)
  • even faster than momentum optimization (sketch below)
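A minimal sketch of the NAG update; compared to plain momentum, only the point where the gradient is measured changes (constants are again illustrative):

```python
def nag_step(theta, grad, m, lr=0.01, beta=0.9):
    # measure the gradient slightly ahead, at theta + beta*m,
    # instead of at the current position theta
    g = grad(theta + beta * m)
    m = beta * m - lr * g
    return theta + m, m

# same toy objective f(theta) = theta**2 as above
theta, m = 5.0, 0.0
for _ in range(100):
    theta, m = nag_step(theta, lambda t: 2 * t, m)
```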
6
Q

Optimization of parameters vs hyperparameters

A
  • Parameters (weights): optimized with backpropagation
  • Hyperparameters: set manually or in a tuning phase (on a validation dataset, testing different values -> no overfitting to the training data)
7
Q

Learning Rate

A
  • hyperparameter that determines the step size / magnitude of parameter updates during training
  • higher -> larger updates & potentially faster convergence
  • lower -> smaller updates & more cautious learning (sketch below)
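A one-line illustration of how the learning rate scales an update (values are illustrative):

```python
def gd_step(theta, grad, lr):
    # lr scales the magnitude of the update
    return theta - lr * grad

# same gradient, different step sizes
print(gd_step(1.0, 2.0, lr=0.5))   # larger update:  0.0
print(gd_step(1.0, 2.0, lr=0.01))  # smaller update: 0.98
```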
8
Q

Learning Rate - Method

A
  • Run GD on the whole dataset with a fixed step size
  • Measure the error & plot the progress
  • If the error is not decreasing -> decrease the step size (sketch below)
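A hedged sketch of this procedure; run_gd is a hypothetical callback that runs GD with the given fixed step size and returns the resulting error, and halving is an assumed reduction factor:

```python
def tune_step_size(run_gd, lr=1.0, rounds=5):
    # run_gd(lr) is assumed to run GD on the whole dataset with a fixed
    # step size and return the resulting error
    prev = float("inf")
    for _ in range(rounds):
        err = run_gd(lr)
        if err >= prev:  # error not decreasing -> decrease step size
            lr /= 2      # halving is an assumed reduction factor
        prev = err
    return lr
```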
9
Q

Bias step-size multiplier

A

use a bigger step size for the bias variables (sketch below)
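A minimal sketch, assuming a plain SGD update; the multiplier value 2.0 is illustrative:

```python
def sgd_update(w, b, grad_w, grad_b, lr=0.01, bias_mult=2.0):
    # the bias gets a bigger step size via a multiplier (2.0 is illustrative)
    w = w - lr * grad_w
    b = b - lr * bias_mult * grad_b
    return w, b
```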

10
Q

Momentum

A
  • technique where a term is added to the parameter updates that accounts for the previous direction of movement
  • e.g. with a momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient
11
Q

Learning Rate - add momentum

A

Add a term that keeps moving in the previous direction (β = 0.9)

12
Q

Finding good learning rate

A
  • train the model for a few hundred iterations
  • exponentially increase the learning rate from a small to a large value
  • look at the learning curve (loss = y, epoch = x)
  • pick the learning rate where the loss decreased for the longest (just before it starts climbing again) (sketch below)
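A minimal sketch of this procedure; train_step is a hypothetical callback that performs one training iteration at the given learning rate and returns the loss:

```python
import numpy as np

def lr_range_test(train_step, lr_min=1e-5, lr_max=1.0, iters=300):
    # exponentially increase the learning rate over a few hundred
    # iterations and record the loss at each one
    lrs = np.geomspace(lr_min, lr_max, iters)
    losses = [train_step(lr) for lr in lrs]
    return lrs, losses  # plot losses vs. lrs, pick lr just before loss climbs
```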
13
Q

Learning schedule - set learning rate to…

A
  • strategies to reduce the learning rate during training
  • Power Scheduling: … a function of the iteration number
  • Exponential Scheduling: … gradually dropping by a factor of 10 every s steps
  • Piecewise Constant Scheduling: … one value for some epochs, then a smaller one for the next
  • Performance Scheduling: … measure the validation error every N steps & reduce the learning rate by a factor of λ when the error stops dropping (sketches below)
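Minimal sketches of the first three schedules; the initial rate, s and c are illustrative constants, and performance scheduling is omitted since it needs a validation loop:

```python
def power_schedule(t, lr0=0.01, s=1000, c=1.0):
    # learning rate as a function of the iteration number t
    return lr0 / (1 + t / s) ** c

def exponential_schedule(t, lr0=0.01, s=1000):
    # gradually drops by a factor of 10 every s steps
    return lr0 * 0.1 ** (t / s)

def piecewise_constant_schedule(epoch):
    # one value for some epochs, then a smaller one for the next
    return 0.01 if epoch < 10 else 0.001
```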
14
Q

HP Optimization - Challenges

A
  • Function evaluations can be very expensive for large models, ML pipelines or datasets
  • Configuration space is often complex (mix of continuous, categorical & conditional hp) & high-dimensional (not always clear which hp to optimize)
  • No access to the gradient of the hp loss function (other properties used in classical optimization do not apply either, e.g. convexity, smoothness)
  • Can't directly optimize for generalization performance (because the training data is of limited size)
15
Q

HP Optimization Techniques

A
  1. Babysitting (Grad student descent)
  2. Grid Search
    -> model-free blackbox optimization methods
  3. Random Search
  4. Gradient-based optimization
  5. Bayesian optimization (BO)
  6. Multi-fidelity optimization algorithms
  7. Metaheuristic algorithms
16
Q

HP Optimization Techniques - 1. Babysitting

A
  • Babysitting = “Trial & Error” = Grad student descent (GSD)
  • implemented by 100% manual tuning
  • Problem: large number of hp, complex models, time-consuming model evaluation, non-linear hp interactions
17
Q

HP Optimization Techniques - 2. Grid Search

A
  • exhaustive search (= brute-force method): evaluate all combinations of the values of >= 2 hp -> forming a grid of values
  • Problem: inefficient for many hp (the number of evaluations grows exponentially with the number of hp) + computationally expensive (sketch below)
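A minimal grid-search sketch; the grid values and the evaluate callback (assumed to return a validation score for one hp combination) are illustrative:

```python
from itertools import product

def grid_search(evaluate):
    # evaluate all combinations of the hp values -> a grid of values
    grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
    best = None
    for combo in product(*grid.values()):
        hp = dict(zip(grid.keys(), combo))
        score = evaluate(hp)  # assumed validation score
        if best is None or score > best[0]:
            best = (score, hp)
    return best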
18
Q

HP Optimization Techniques - 3. Random Search

A
  • evaluates random combinations of hp values (sketch below)
  • Problem: can miss important values in the search space
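A minimal random-search sketch with the same assumed evaluate callback; the sampling ranges are illustrative:

```python
import random

def random_search(evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # sample a random hp combination (ranges are illustrative)
        hp = {"lr": 10 ** rng.uniform(-4, -1),  # log-uniform
              "batch_size": rng.choice([32, 64, 128])}
        score = evaluate(hp)
        if best is None or score > best[0]:
            best = (score, hp)
    return best
```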
19
Q

HP Optimization Techniques - 5. Bayesian Optimization

A
  • determines future evaluation points based on previously obtained results
  • builds a probability model of the objective function and uses it to select the most promising hyperparameters to evaluate on the true objective function (sketch below)
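A minimal sketch of such a loop, using a scikit-learn Gaussian process as the probability model and a lower confidence bound to pick the next point; the one-dimensional search space and all constants are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayesian_opt(objective, bounds=(1e-4, 1e-1), n_init=3, n_iter=10):
    # objective(hp) is the true (expensive) evaluation, lower is better
    rng = np.random.default_rng(0)
    X = rng.uniform(*bounds, size=(n_init, 1))  # initial random hps
    y = np.array([objective(x[0]) for x in X])
    candidates = np.linspace(*bounds, 200).reshape(-1, 1)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor().fit(X, y)        # probability model
        mu, std = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmin(mu - 1.96 * std)]  # most promising hp
        X = np.vstack([X, [x_next]])
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y)][0]
```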
20
Q

HP Optimization Techniques - 6. Multi-fidelity optimization algorithm

A
  • combines high- & low-fidelity evaluations in iterations (sketch below):
    1. low: hp evaluated on a small data subset (low cost, poor generalization) -> keep the promising values
    2. high: the promising hp region from 1. evaluated on a larger data subset -> fine-tune the hp
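A minimal sketch of the two fidelity levels; eval_on(hp, data) is a hypothetical callback returning a validation error:

```python
def multi_fidelity_search(candidates, eval_on, small_data, full_data, keep=5):
    # 1. low fidelity: score every hp candidate on a small data subset
    #    (low cost, poor generalization) and keep the promising ones
    promising = sorted(candidates, key=lambda hp: eval_on(hp, small_data))[:keep]
    # 2. high fidelity: re-evaluate only those on a larger data subset
    return min(promising, key=lambda hp: eval_on(hp, full_data))
```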
21
Q

HP Optimization Techniques - 7. Metaheuristic algorithm

A
  • set of algorithms inspired by biology & widely used
  • employ heuristics and iterative search procedures to explore the search space