Lecture 11 - Hyper-parameter Optimization - Deep Generative Models - Autoencoders Flashcards
Why is optimization important?
NN objectives are highly non-convex (and worse with depth)
Optimization has a huge influence on the quality of the model
- Factors: Convergence speed, convergence quality
Standard training method is…
Stochastic Gradient (SG)
Explain the stochastic gradient training method
- Choose a random example ‘i’
- Update both ‘v’ and ‘W’
Computing the gradient is known as ‘backpropagation’
Backpropagation computes neural network gradient via which rule?
Chain rule
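A minimal NumPy sketch of one stochastic-gradient step for a tiny one-hidden-layer network, with the gradients obtained by the chain rule (i.e. backpropagation). The naming is an assumption matching the card above: W for the hidden-layer weights, v for the output weights; all values are toy data.

```python
import numpy as np

# One stochastic-gradient step for a tiny 1-hidden-layer regression net
# y_hat = v . tanh(W x). Naming assumption: W = hidden-layer weights,
# v = output weights (matching the 'v' and 'W' on the card above).

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)   # toy data
W = rng.normal(size=(4, 3)) * 0.1
v = rng.normal(size=4) * 0.1
lr = 0.01

i = rng.integers(len(X))            # choose a random example i
x_i, y_i = X[i], y[i]

h = np.tanh(W @ x_i)                # forward pass
y_hat = v @ h
err = y_hat - y_i                   # squared-error loss: 0.5 * err**2

# Backpropagation = the chain rule applied layer by layer
grad_v = err * h                                            # dL/dv
grad_h = err * v                                            # dL/dh
grad_W = (grad_h * (1 - h ** 2))[:, None] * x_i[None, :]    # dL/dW (tanh')

v -= lr * grad_v                    # update both v and W
W -= lr * grad_W
```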
How can we speed up the training process of deep neural networks?
Faster Optimizers!
Four ways:
- Using good initialization strategy for connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network
Another way: use a faster optimizer than the regular Gradient Descent optimizer, such as:
- Momentum Optimization
- Nesterov Accelerated Gradient
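A hedged sketch of how these two optimizers could be configured, assuming TensorFlow/Keras is the framework in use; the learning-rate and momentum values are illustrative, not from the slides.

```python
import tensorflow as tf

# Momentum optimization: plain SGD plus a momentum term (beta = 0.9 is a common default)
momentum_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

# Nesterov Accelerated Gradient: measures the gradient slightly ahead,
# in the direction of the momentum
nesterov_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

# model.compile(optimizer=nesterov_opt, loss="mse")   # drop-in replacement for plain SGD
```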
What are the general ideas of the momentum optimizer?
Primary idea:
- Let a bowling ball roll down a gentle slope on a smooth surface:
- It will start slowly
- Then it quickly picks up momentum until it eventually reaches terminal velocity (limited by friction or air resistance)
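A minimal sketch of the momentum update rule in its common formulation (the exact notation in the slides may differ); it shows how the gradient acts as an acceleration and why the velocity levels off at a "terminal" value.

```python
import numpy as np

# Momentum update rule (common formulation):
#   m     <- beta * m - lr * grad    # the gradient acts as an acceleration
#   theta <- theta + m               # m is the current "velocity"
beta, lr = 0.9, 0.01

def momentum_step(theta, m, grad):
    m = beta * m - lr * grad
    theta = theta + m
    return theta, m

# With a constant gradient the velocity approaches a terminal value of
# -lr * grad / (1 - beta), i.e. 10x the plain-SGD step when beta = 0.9.
theta, m = np.zeros(2), np.zeros(2)
for _ in range(100):
    theta, m = momentum_step(theta, m, grad=np.array([1.0, 1.0]))
```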
In deep neural networks, if you cannot use Batch Normalization, you can apply…
Momentum optimization!
True or False: Momentum Optimization uses gradient for acceleration, not for speed
True
True or False: Momentum can be a hyperparameter of Stochastic Gradient Descent (SGD)
True
Hyperparameters regulate the design of a model.
Name some hyperparameters in different machine learning models
Machine Learning algorithms have different hyperparameters (some main, some optional):
- In NNs, these include the learning rate and the regularization weight
- In Random Forest, these include n_estimators, max_depth, criterion, etc.
- In KMeans, these include n_clusters
- In SVM, these include the C value and the kernel
- etc.
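A short scikit-learn sketch showing where such hyperparameters are set, at model construction time; the specific values are illustrative only, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hyperparameters are chosen before training, when the model is constructed
# (the values below are illustrative, not recommendations):
forest = RandomForestClassifier(n_estimators=200, max_depth=10, criterion="gini")
kmeans = KMeans(n_clusters=8)
svm = SVC(C=1.0, kernel="rbf")
```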
Why is a good learning rate important?
If learning rate is too high -> Training may diverge
If it is set too low -> Training will eventually converge to the optimum, but at the cost of a very long training time
How can you fit a good learning rate?
You can fit a good learning rate by:
- Training a model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large value
- Next, look at the learning curve and pick a learning rate slightly lower than the one at which the learning curve starts shooting back up
- Reinitialize the model and train it with that learning rate (sketched below)
-> There’s a good graph in the slides on this
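A hedged sketch of this procedure as a Keras callback, assuming a compiled tf.keras model; the class name, growth factor, and starting learning rate are illustrative assumptions.

```python
import tensorflow as tf

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    """Multiply the learning rate by a constant factor after each batch
    and record (rate, loss) pairs so the curve can be inspected."""
    def __init__(self, factor):
        self.factor = factor
        self.rates, self.losses = [], []

    def on_batch_end(self, batch, logs=None):
        lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr * self.factor)

# Usage (assumes a compiled model and training data X_train, y_train exist):
# model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5))
# exp_lr = ExponentialLearningRate(factor=1.005)
# model.fit(X_train, y_train, epochs=1, callbacks=[exp_lr])
# Plot exp_lr.losses against exp_lr.rates, then reinitialize the model and
# train with a rate slightly below where the loss starts shooting back up.
```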
There are different strategies to reduce learning rate during training. These are called Learning schedules.
Name a few of them
- Power Scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
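A sketch of what these schedules might look like with tf.keras callbacks; the decay constants and epoch thresholds are illustrative assumptions.

```python
import tensorflow as tf

# Power scheduling: lr(epoch) = lr0 / (1 + epoch / s) ** c
def power_decay(lr0=0.01, s=20, c=1.0):
    return lambda epoch: lr0 / (1 + epoch / s) ** c

# Exponential scheduling: lr(epoch) = lr0 * 0.1 ** (epoch / s)
def exponential_decay(lr0=0.01, s=20):
    return lambda epoch: lr0 * 0.1 ** (epoch / s)

exp_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay())

# Piecewise constant scheduling: hold the rate for a while, then drop it
def piecewise_constant(epoch):
    return 0.01 if epoch < 5 else 0.005 if epoch < 15 else 0.001

pc_scheduler = tf.keras.callbacks.LearningRateScheduler(piecewise_constant)

# Performance scheduling: reduce the rate when the validation loss plateaus
perf_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# model.fit(..., callbacks=[exp_scheduler])   # pass the chosen schedule to fit()
```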
What are the challenges of Hyper-parameter optimization?
Summary = resource intensive, the configuration space is complex (lots of different variables to tweak), and one can't directly optimize for generalization performance
- Function evaluations can be extremely expensive for large models (e.g. in deep learning), complex machine learning pipelines, or large datasets
- Configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high dimensional (not always clear which hyperparameters to optimize)
- No access to the gradient of the hyperparameter loss function (it can be a black box)
- One cannot directly optimize for generalization performance as training datasets are of limited size
What are the model-free blackbox optimization methods?
- Babysitting or ‘Trial and error’ or Grad Student Descent (Manual tuning)
- Problems arise due to the large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyperparameter interactions
- Grid Search (GS) is one of the most commonly used methods
- An exhaustive search or brute-force method
- Works by evaluating the Cartesian product of user-specified finite sets of values
- Problem: inefficient for high-dimensional hyperparameter configuration spaces:
- The number of evaluations grows exponentially with the number of hyperparameters (see the example below)
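A minimal scikit-learn sketch of grid search; the estimator, parameter values, and variable names are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search evaluates the Cartesian product of the user-specified value sets:
# 3 values of C x 2 kernels = 6 configurations, each cross-validated here.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["rbf", "linear"],
}
search = GridSearchCV(SVC(), param_grid, cv=5)

# search.fit(X_train, y_train)    # assumes training data X_train, y_train exist
# print(search.best_params_)      # best configuration found on the grid
# Each extra hyperparameter with k candidate values multiplies the number of
# evaluations by k, which is the exponential blow-up described above.
```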