! S10 & 14: Backpropagation, Hyperparameters & Learning Rate Flashcards
Backpropagation
- technique to compute NN gradient via chain rule
- Initialize the hidden layers' connection weights randomly
- Forward pass: for each training instance, make a prediction & measure the error
- Backward pass: go through the layers in reverse -> measure each connection's contribution to the error
- GD step: adjust the connection weights to reduce the error (beginning with the most influential ones)
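A minimal numpy sketch of these steps for a toy 1-hidden-layer network (illustrative only; the layer sizes, sigmoid activation and squared error are my own assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # 4 training instances, 3 features
y = rng.normal(size=(4, 1))            # targets
W1 = rng.normal(size=(3, 5)) * 0.1     # random init of connection weights
W2 = rng.normal(size=(5, 1)) * 0.1
sigmoid = lambda z: 1 / (1 + np.exp(-z))
eta = 0.1                              # learning rate

for _ in range(1000):
    # Forward pass: predict & measure the error
    h = sigmoid(X @ W1)
    y_hat = h @ W2
    error = y_hat - y
    # Backward pass: chain rule, layer by layer in reverse
    grad_W2 = h.T @ error
    grad_h = error @ W2.T * h * (1 - h)   # each hidden unit's contribution
    grad_W1 = X.T @ grad_h
    # GD step: adjust the connection weights
    W2 -= eta * grad_W2
    W1 -= eta * grad_W1
```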
Backpropagation - Cons
- has to be done for every single training example -> very slow, computationally expensive
- can get stuck in a local minimum
Ways to make optimizer faster
- Using a good initialization strategy for the connection weights
- Using good activation function
- Using Batch Normalization
- Reusing parts of pretrained network
- Use faster optimizer, e.g. Momentum optimization or Nesterov Accelerated Gradient
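A hedged tf.keras sketch (the framework and layer sizes are my own example choices, not from the slides) combining several of these speed-ups in one model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),   # good weight initialization
    tf.keras.layers.BatchNormalization(),                         # Batch Normalization
    tf.keras.layers.Activation("elu"),                            # good activation function
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),  # faster optimizer
)
```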
Momentum Optimization
- technique used in gradient-based optimization algorithms
- introduces a momentum term that accumulates the gradients of previous iterations
- navigates better through areas with high curvature and can escape local minima
- exponentially smoothed average: computes a running average of the previous gradients (older gradients count less and less)
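A toy sketch of the momentum update rule on f(θ) = θ² (β = momentum term, η = learning rate; the 1-D quadratic is just an assumption for illustration):

```python
beta, eta = 0.9, 0.1
theta, m = 5.0, 0.0
for _ in range(200):
    grad = 2 * theta           # gradient of the cost function at theta
    m = beta * m - eta * grad  # accumulate: exponentially smoothed gradient average
    theta += m                 # update moves in the accumulated direction
print(theta)                   # converges toward the minimum at 0
```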
Nesterov Accelerated Gradient (NAG)
- measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm (applies the gradient computed after the momentum step <-> regular momentum: before)
- even faster than momentum optimization
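Same toy setup as the momentum sketch above, changed only where NAG differs: the gradient is measured at the lookahead point θ + βm:

```python
beta, eta = 0.9, 0.1
theta, m = 5.0, 0.0
for _ in range(200):
    grad = 2 * (theta + beta * m)  # gradient slightly ahead, at theta + beta*m
    m = beta * m - eta * grad
    theta += m
print(theta)                       # converges toward the minimum at 0
```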
Optimization of parameters vs hyperparameters
- Parameters (weights): optimized with backpropagation
- Hyperparameters: set manually or in a tuning phase (test different values on a validation dataset -> no overfitting of the training data)
Learning Rate
- hyperparameter that determines the step size / magnitude of the parameter updates during training
- higher -> larger updates & potentially faster convergence (but too high can diverge)
- lower -> smaller updates & more cautious learning (but can be very slow)
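For reference, the plain Gradient Descent update where the learning rate η sets the step size (standard textbook formula, with cost function J):

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta)
```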
Learning Rate - Method
- Run GD for a while with a fixed step size
- Measure the error & plot the progress
- If the error is not decreasing -> decrease the step size
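This can be automated, e.g. with Keras's ReduceLROnPlateau callback (a sketch; `model` and the training data come from your own pipeline and are not shown here):

```python
import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation error
    factor=0.5,          # halve the step size...
    patience=5,          # ...when the error has not improved for 5 epochs
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[reduce_lr])
```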
Bias step-size multiplier
- use a bigger step size for the bias variables
Momentum
- technique where a term is added to the parameter updates that accounts for the previous direction of movement
- e.g. momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient
Learning Rate - add momentum
- Add a term that keeps moving in the previous direction (β = 0.9)
Finding good learning rate
- train the model for a few hundred iterations
- exponentially increase the learning rate from a small to a large value
- look at the learning curve (loss = y, learning rate / iteration = x)
- pick a learning rate slightly below the point where the loss stops decreasing and starts to climb
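A toy sketch of this range test on f(θ) = θ² (the quadratic loss, growth factor and stopping threshold are assumptions for illustration):

```python
theta, lr = 5.0, 1e-5
lrs, losses = [], []
for _ in range(500):
    grad = 2 * theta
    theta -= lr * grad
    lrs.append(lr)
    losses.append(theta ** 2)
    if losses[-1] > 1e6:   # stop once the loss blows up
        break
    lr *= 1.05             # exponential increase from small to large
# Plot losses vs lrs (e.g. with matplotlib) and pick a learning rate slightly
# below the point where the loss starts to climb.
```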
Learning schedule - set learning rate to…
- strategies to reduce the learning rate during training
- Power Scheduling: … a decreasing function of the iteration number
- Exponential Scheduling: … gradually drop by a factor of 10 every s steps
- Piecewise Constant Scheduling: … one constant value for some epochs, then a smaller one for the next epochs, etc.
- Performance Scheduling: … measure the validation error every N steps & reduce the learning rate by a factor of λ when the error stops dropping
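Sketches of the four schedules as plain functions of the step/epoch t (η0, s, c and λ are the schedule hyperparameters; the concrete default values below are just placeholders):

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    return eta0 / (1 + t / s) ** c                 # decreasing function of the iteration number

def exponential_schedule(t, eta0=0.01, s=20):
    return eta0 * 0.1 ** (t / s)                   # drops by a factor of 10 every s steps

def piecewise_constant_schedule(t):
    return 0.01 if t < 5 else 0.005 if t < 15 else 0.001   # one constant value per epoch range

def performance_schedule(eta, val_errors, lam=0.5, patience=5):
    # Reduce eta by a factor lam when the validation error has stopped dropping.
    if len(val_errors) > patience and min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return eta * lam
    return eta
```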
HP Optimization - Challenges
- Function evaluations can be very expensive for large models, ML pipelines or datasets
- Configuration space is often complex (mix of continuous, categorical & conditional hp) & high-dimensional (not always clear which hp to optimize)
- No access to the gradient of the hp loss function (other properties used in classical optimization do not apply either, e.g. convexity, smoothness)
- Can't directly optimize for generalization performance (because the training data is of limited size)
HP Optimization Techniques
- Babysitting (Grad student descent)
- Grid Search
- Random Search (Grid & Random Search = model-free blackbox optimization methods)
- Gradient-based optimization
- Bayesian optimization (BO)
- Multi-fidelity optimization algorithm
- Metaheuristic algorithm
HP Optimization Techniques - 1. Babysitting
- Babysitting = “Trial & Error” = Grad student descent (GSD)
- implemented by 100% manual tuning
- Problem: impractical with a large number of hp, complex models, time-consuming model evaluations & non-linear hp interactions
HP Optimization Techniques - 2. Grid Search
- exhaustive search (= brute-force method): evaluate all combinations of the candidate values of >= 2 hp -> the candidates form a grid of values
- Problem: inefficient for many hp (the number of evaluations grows exponentially with the number of hp) + computationally expensive
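A minimal scikit-learn sketch of grid search (the SVC model, its two hp and the candidate values are my own example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}   # 3 x 3 grid = 9 combinations
search = GridSearchCV(SVC(), param_grid, cv=3)              # evaluates every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```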
HP Optimization Techniques - 3. Random Search
- evaluate random combinations of hp values (sampled from the specified ranges / distributions)
- Problem: can miss important values in the search space
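The same example with random search instead, sampling 20 random combinations from log-uniform ranges (again my own example choices):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```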
HP Optimization Techniques - 5. Bayesian Optimization
- determines future evaluation points based on previously-obtained results
- build a probability model of the objective function and use it to select the most promising hyperparameters to evaluate in the true objective function
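A sketch using Optuna (my own library choice; its default TPE sampler is a sequential model-based, i.e. Bayesian-style, optimizer that proposes new hp values based on past trials):

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(trial):
    # The sampler suggests promising hp values based on previous results.
    C = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-3, 1e1, log=True)
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```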
HP Optimization Techniques - 6. Multi-fidelity optimization algorithm
- combines high- & low-fidelity evaluations over iterations:
- low: hp evaluated on a small data subset (low cost, poor generalization) -> keep the promising values
- high: the promising hp region from the low-fidelity step evaluated on a larger data subset -> fine-tune the hp
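A toy sketch of one multi-fidelity scheme (successive halving); `evaluate(config, budget)` is a hypothetical function that trains on a data subset of the given size and returns a validation score:

```python
import random

def successive_halving(configs, evaluate, min_budget=100, eta=2, rounds=3):
    budget = min_budget
    for _ in range(rounds):
        # Low fidelity first: score every surviving config on a small budget.
        scored = [(evaluate(cfg, budget), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)                 # best scores first
        configs = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]  # keep promising values
        budget *= eta                                                       # higher fidelity next round
    return configs[0]

# Hypothetical usage: 16 random learning-rate candidates, dummy evaluation.
candidates = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
best = successive_halving(candidates, lambda cfg, b: -abs(cfg["lr"] - 0.01) * b)
print(best)
```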
HP Optimization Techniques - 7. Metaheuristic algorithm
- set of algorithms inspired by biology & widely used (e.g. genetic algorithms, particle swarm optimization)
- employ heuristics and iterative search procedures to explore the search space
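A toy sketch of a genetic-algorithm-style hp search (one metaheuristic); `fitness(config)` is a hypothetical function returning a validation score, here a dummy that peaks at lr = 0.01:

```python
import random

def evolve(fitness, population_size=10, generations=5, mutation_scale=0.3):
    # Random initial population of learning-rate candidates (log-uniform).
    population = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)        # selection: best first
        parents = population[: population_size // 2]
        children = [{"lr": p["lr"] * 10 ** random.gauss(0, mutation_scale)}  # mutation
                    for p in parents]
        population = parents + children
    return max(population, key=fitness)

print(evolve(lambda cfg: -abs(cfg["lr"] - 0.01)))
```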