Lecture 11 - Hyper-parameter Optimization - Deep Generative Models - Autoencoders Flashcards
Why is optimization important?
NN objectives are highly non-convex (and worse with depth)
Optimization has huge influence on quality of model
- Factors: Convergence speed, convergence quality
Standard training method is…
Stochastic Gradient (SG)
Explain the stochastic gradient training method
- Choose a random example ‘i’
- Update both ‘v’ and ‘W’
Computing the gradient is known as ‘backpropagation’
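A minimal numpy sketch of one such training loop, assuming a one-hidden-layer regression network with hidden weights W and output weights v; the toy data, layer sizes, and learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 3 features, scalar regression target
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

# One hidden tanh layer: hidden weights W (3x5), output weights v (5,)
W, v = 0.1 * rng.normal(size=(3, 5)), 0.1 * rng.normal(size=5)
lr = 0.01

for step in range(1000):
    i = rng.integers(len(X))                       # choose a random example i
    h = np.tanh(X[i] @ W)                          # hidden activations
    err = h @ v - y[i]                             # prediction error on example i
    grad_v = err * h                               # dLoss/dv
    grad_W = np.outer(X[i], err * v * (1 - h**2))  # dLoss/dW via the chain rule (backpropagation)
    v -= lr * grad_v                               # update v
    W -= lr * grad_W                               # update W
```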
Backpropagation computes neural network gradient via which rule?
Chain rule
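A tiny worked example of the chain rule for a single weight w, using the loss L = (sigmoid(w*x) - y)^2; all values are illustrative:

```python
import numpy as np

x, y, w = 2.0, 1.0, 0.5            # illustrative input, target, and weight
a = 1 / (1 + np.exp(-w * x))       # forward pass: a = sigmoid(z), with z = w*x
dL_da = 2 * (a - y)                # outer derivative of the squared error
da_dz = a * (1 - a)                # derivative of the sigmoid
dz_dw = x                          # derivative of z = w*x w.r.t. w
dL_dw = dL_da * da_dz * dz_dw      # chain rule: dL/dw = dL/da * da/dz * dz/dw
print(dL_dw)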
How can we speed up the training process of deep neural networks?
Faster Optimizers!
Four ways:
- Using good initialization strategy for connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network
Another way: Use a faster optimizer than the regular Gradient Descent optimizer, such as:
- Momentum Optimization
- Nesterov Accelerated Gradient
What are the general ideas of the momentum optimizer?
Primary idea:
- Let a bowling ball roll down a gentle slope on a smooth surface:
- It will start slowly
- Then quickly pick up momentum until it eventually reaches terminal velocity (limited by friction or air resistance)
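A minimal sketch of the momentum update rule on a toy quadratic loss; the constants (lr = 0.1, beta = 0.9) are illustrative:

```python
def grad(theta):
    return 2 * theta                  # gradient of the toy loss J(theta) = theta**2

theta, m = 5.0, 0.0                   # parameter and momentum ("velocity")
lr, beta = 0.1, 0.9                   # learning rate and momentum hyperparameter

for step in range(50):
    m = beta * m - lr * grad(theta)   # gradient changes the velocity (acceleration)...
    theta = theta + m                 # ...and the accumulated velocity moves the parameter

# Keras equivalent: tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
```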
In deep neural networks, if you cannot use Batch Normalization, you can apply…
Momentum optimization!
True or False: Momentum Optimization uses gradient for acceleration, not for speed
True
True or False: Momentum can be a hyperparameter of Stochastic Gradient Descent (SGD)
True
Hyperparameters regulate the design of a model.
Name some hyperparameters in different machine learning models
Machine Learning algorithms have different hyperparameters (some main, some optional):
- In NNs, these include the learning rate and the regularization weight
- In Random Forests, these include n_estimators, max_depth, criterion, etc.
- In KMeans these include n_clusters
- In SVM, these include the C value and kernel
- etc.
Why is a good learning rate important?
If the learning rate is too high -> training may diverge
If it is set too low -> training will eventually converge to the optimum, but at the cost of a very long training time
How can you fit a good learning rate?
You can fit a good learning rate by:
- Training the model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large value
- Next, look at the learning curve and pick a learning rate slightly lower than the one at which the learning curve starts shooting back up
- Reinitialize model and train it with that learning rate
-> There’s a good graph in the slides on this
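A rough Keras sketch of this procedure, assuming a compiled `model` and data `X_train, y_train` exist; the callback name and the growth factor are illustrative, and the exact way to read/write the optimizer's learning rate can differ slightly between Keras versions:

```python
import tensorflow as tf

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    """Multiply the learning rate by `factor` after every batch and record the loss."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates, self.losses = [], []

    def on_batch_end(self, batch, logs=None):
        lr = self.model.optimizer.learning_rate.numpy()
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        self.model.optimizer.learning_rate.assign(lr * self.factor)

# Usage (model, X_train, y_train assumed to exist):
# expo_lr = ExponentialLearningRate(factor=1.005)
# model.fit(X_train, y_train, epochs=1, callbacks=[expo_lr])
# Then plot expo_lr.losses against expo_lr.rates, pick a learning rate a bit below the
# point where the loss shoots up, reinitialize the model, and train with that rate.
```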
There are different strategies to reduce learning rate during training. These are called Learning schedules.
Name a few of them
- Power Scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
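Minimal sketches of these schedules as plain Python functions (the constants are illustrative, not from the slides); performance scheduling is typically handled by a callback that watches a validation metric, such as Keras's ReduceLROnPlateau:

```python
# eta0 = initial learning rate, t = iteration/epoch, s = decay steps (all illustrative)
def power_scheduling(t, eta0=0.01, s=20):
    return eta0 / (1 + t / s)                  # decays as 1 / (1 + t/s)

def exponential_scheduling(t, eta0=0.01, s=20):
    return eta0 * 0.1 ** (t / s)               # drops by a factor of 10 every s steps

def piecewise_constant_scheduling(epoch):
    return 0.01 if epoch < 5 else 0.005 if epoch < 15 else 0.001

# Performance scheduling, e.g.:
# tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
```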
What are the challenges of Hyper-parameter optimization?
Summary = Resource intensive, configuration space is complex (loads of different variables to tweak), can’t optimize for generalization performance
- Function evaluations can be extremely expensive for large models (e.g. in deep learning), complex machine learning pipelines, or large datasets
- Configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high dimensional (not always clear which hyperparameters to optimize)
- No access to gradient of hyperparameter loss function (Can be a black box)
- One cannot directly optimize for generalization performance as training datasets are of limited size
What are the model-free blackbox optimization methods?
- Babysitting, or ‘Trial and error’, or Grad Student Descent (manual tuning)
- Problematic due to the large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions
- Grid Search (GS): one of the most commonly used methods
- An exhaustive search or brute-force method
- Works by evaluating the Cartesian product of user-specified finite sets of values
- Problem: inefficient for high-dimensional hyper-parameter configuration spaces:
- The number of evaluations grows exponentially with the number of hyper-parameters
How does Grid Search work?
All combinations of selected values of hyperparameters are tested to determine the optimal choice
-> The number of points in the grid increases exponentially with the number of hyperparameters
-> Example: 5 hyperparameters with 10 candidate values each means 10^5 = 100,000 models must be trained and evaluated
Compute-heavy, but very easy to implement
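A short scikit-learn sketch of grid search, using the SVM hyperparameters (C and kernel) mentioned earlier; the grid values and dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # evaluates all 4 x 2 = 8 combinations
search.fit(X, y)
print(search.best_params_, search.best_score_)
```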
How can we limit the downsides of the heavy requirements in Grid Search computation?
By searching in logarithmic space:
The logarithms of the hyperparameters are sampled uniformly, rather than the hyperparameters themselves
Example: Instead of sampling the learning rate (alpha) directly between 0.1 and 0.001, first sample log10(alpha) uniformly between -1 and -3 and then raise 10 to that power
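A tiny numpy sketch of this log-space sampling, using the same bounds as the example above:

```python
import numpy as np

rng = np.random.default_rng(42)
log_alpha = rng.uniform(-3, -1, size=5)   # sample log10(alpha) uniformly in [-3, -1]
alpha = 10 ** log_alpha                   # learning rates between 0.001 and 0.1
print(alpha)
```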
What are some other Hyper-parameter optimization techniques?
- Random Search (RS): Like GS (see the sketch after this list)
- Randomly selects a pre-defined number of samples between upper and lower bounds as candidate hyper-parameter values, and then trains these candidates until the defined budget is exhausted
- RS can explore a larger search space than GS
- Problem: unnecessary function evaluations, since it does not exploit previously well-performing regions
- Gradient-based optimization: traditional technique
- After randomly selecting a data point, it moves in the opposite direction of the largest gradient to locate the next data point
- A local optimum can be reached after convergence
- Gradient-based algorithms converge quickly to a local optimum
- Can be used to optimize learning rate in neural networks
- Bayesian Optimization: Popular iterative algorithm
- Determines future evaluation points based on previously-obtained results (Unlike GS and RS)
- Multi-fidelity optimization algorithms: To solve constraint of limited time and resources (use a subset of the original dataset/features)
- Metaheuristic algorithms: Set of algorithms inspired by biological theories and widely used for optimization problems
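As referenced in the Random Search item above, a short scikit-learn sketch of random search over the Random Forest hyperparameters mentioned earlier; the distributions, dataset, and the budget of 20 samples are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_distributions = {
    "n_estimators": randint(10, 200),   # candidates are sampled, not exhaustively enumerated
    "max_depth": randint(2, 20),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20, cv=5, random_state=0)  # budget of 20 samples
search.fit(X, y)
print(search.best_params_, search.best_score_)
```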
Are Deep Generative Models used for supervised or unsupervised learning?
Unsupervised.
Deep Generative Models are an efficient way to analyse and understand unlabeled data
What are some examples of use cases for Deep Generative Models?
Visual Recognition, Speech recognition and generation, natural language processing
Deep Generative Models can be divided into two main categories - These are…
Cost-function-based models, such as autoencoders and generative adversarial networks
Energy-based models, where the joint probability is defined using an energy function
What are Boltzmann Machines(BMs)?
BMs are a popular, fully connected ANN architecture
- Based on stochastic neurons
- These neurons output 1 with some probability, and 0 otherwise
- The probability function these ANNs use is based on the Boltzmann distribution
- There is no efficient technique to train Boltzmann machines
What are the differences between BMs and Restricted Boltzmann Machines(RBMs)?
As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.
(Taken from Wikipedia)
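A short scikit-learn sketch: BernoulliRBM trains a Restricted Boltzmann Machine using persistent contrastive divergence; the digits dataset and the settings below are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

X, _ = load_digits(return_X_y=True)
X = X / 16.0                               # scale pixel values to [0, 1]
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)
rbm.fit(X)                                 # unsupervised training (no labels used)
hidden = rbm.transform(X)                  # hidden-unit activation probabilities
```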
Essentially, there are different variations of the Boltzmann Machine. What are these variations?
Boltzmann Machines
- Restricted Boltzmann Machines
- Deep Boltzmann Machines (Undirected model with several layers of latent variables)
- Deep Belief Networks (Combine multiple RBMs)
Summary: Whenever “Deep” is involved, it just means that there are more layers
Autoencoders are…
Unsupervised models
- An example of a generative model
- Useful for dimensionality reduction -> Good for visualisation purposes
The main idea of a generative model is…
That given some input, a generative model should be able to return an output similar to the input data (like speech, for example)
An autoencoder works by…
looking at the inputs, converting them to an efficient latent representation, and then outputting something that (hopefully) looks very close to the inputs
An autoencoder is composed of two parts. Name them
- An encoder (or recognition network) that converts inputs to a latent representation
- A decoder (or generative network) that converts internal representation to outputs
An autoencoder has almost the same architecture as which ML model?
A Multi-Layer Perceptron (MLP), except that the number of neurons in the output layer must be equal to the number of inputs
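A minimal Keras sketch of such an autoencoder for 784-dimensional inputs (e.g. flattened 28x28 images); the layer sizes, loss, and optimizer are illustrative, and adding more hidden layers to each part gives the stacked autoencoder described next:

```python
import tensorflow as tf

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=[784]),      # latent representation
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(784, activation="sigmoid", input_shape=[30]),   # output size == input size
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(loss="binary_crossentropy", optimizer="adam")
# autoencoder.fit(X_train, X_train, epochs=10)  # note: the inputs are also the targets
```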
Stacked Autoencoders (or deep autoencoders) differ from normal autoencoders in that
they can have multiple hidden layers
The advantage of this is that it can help the autoencoder learn more complex codings
Building autoencoders for images requires you to build a…
Convolutional autoencoder (the encoder is a regular CNN composed of convolutional layers and pooling layers) -> it reduces the spatial dimensionality of the inputs
The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions); for this, use transpose convolutional layers
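A hedged Keras sketch of a convolutional autoencoder for 28x28x1 images; the filter counts and kernel sizes are illustrative:

```python
import tensorflow as tf

conv_encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(2),                            # 28x28 -> 14x14
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(2),                            # 14x14 -> 7x7
])
conv_decoder = tf.keras.Sequential([
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same",
                                    activation="relu", input_shape=[7, 7, 32]),
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                    activation="sigmoid"),   # upscale back to 28x28x1
])
conv_autoencoder = tf.keras.Sequential([conv_encoder, conv_decoder])
conv_autoencoder.compile(loss="binary_crossentropy", optimizer="adam")
```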
Building an autoencoder for sequences (such as time series), e.g. for dimensionality reduction, requires you to build a…
Recurrent Autoencoder
- Similar to an RNN
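A hedged Keras sketch of a recurrent autoencoder for sequences of 28 time steps with 1 feature each; the layer sizes are illustrative:

```python
import tensorflow as tf

recurrent_encoder = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=[28, 1]),        # encodes the whole sequence into one vector
])
recurrent_decoder = tf.keras.Sequential([
    tf.keras.layers.RepeatVector(28, input_shape=[32]),   # feed the code to the decoder at every step
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),  # reconstruct one value per time step
])
recurrent_autoencoder = tf.keras.Sequential([recurrent_encoder, recurrent_decoder])
recurrent_autoencoder.compile(loss="mse", optimizer="adam")
```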