Lecture 11 - Hyper-parameter Optimization - Deep Generative Models - Autoencoders Flashcards
Why is optimization important?
NN objectives are highly non-convex (and worse with depth)
Optimization has huge influence on quality of model
- Factors: Convergence speed, convergence quality
Standard training method is…
Stochastic Gradient (SG)
Explain the stochastic gradient training method
- Choose a random example ‘i’
- Update both ‘v’ and ‘W’
Computing the gradient is known as ‘backpropagation’
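A minimal numpy sketch of one such training loop, assuming a one-hidden-layer regression network with hidden weights W and output weights v; the toy data, layer sizes, and learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 3 features, scalar regression target
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

# One hidden tanh layer: hidden weights W (3x5), output weights v (5,)
W, v = 0.1 * rng.normal(size=(3, 5)), 0.1 * rng.normal(size=5)
lr = 0.01

for step in range(1000):
    i = rng.integers(len(X))                       # choose a random example i
    h = np.tanh(X[i] @ W)                          # hidden activations
    err = h @ v - y[i]                             # prediction error on example i
    grad_v = err * h                               # dLoss/dv
    grad_W = np.outer(X[i], err * v * (1 - h**2))  # dLoss/dW via the chain rule (backpropagation)
    v -= lr * grad_v                               # update v
    W -= lr * grad_W                               # update W
```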
Backpropagation computes neural network gradient via which rule?
Chain rule
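A tiny worked example of the chain rule for a single weight w, using the loss L = (sigmoid(w*x) - y)^2; all values are illustrative:

```python
import numpy as np

x, y, w = 2.0, 1.0, 0.5            # illustrative input, target, and weight
a = 1 / (1 + np.exp(-w * x))       # forward pass: a = sigmoid(z), with z = w*x
dL_da = 2 * (a - y)                # outer derivative of the squared error
da_dz = a * (1 - a)                # derivative of the sigmoid
dz_dw = x                          # derivative of z = w*x w.r.t. w
dL_dw = dL_da * da_dz * dz_dw      # chain rule: dL/dw = dL/da * da/dz * dz/dw
print(dL_dw)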
How can we speed up the training process of deep neural networks?
Faster Optimizers!
Four ways:
- Using good initialization strategy for connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network
Another way: Use a faster optimizer than the regular Gradient Descent optimizer, such as:
- Momentum Optimization
- Nesterov Accelerated Gradient
What are the general ideas of the momentum optimizer?
Primary idea:
- Let a bowling ball roll down a gentle slope on a smooth surface:
- It will start slowly
- Then quickly pick up momentum until it eventually reaches terminal velocity (limited by friction or air resistance)
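A minimal sketch of the momentum update rule on a toy quadratic loss; the constants (lr = 0.1, beta = 0.9) are illustrative:

```python
def grad(theta):
    return 2 * theta                  # gradient of the toy loss J(theta) = theta**2

theta, m = 5.0, 0.0                   # parameter and momentum ("velocity")
lr, beta = 0.1, 0.9                   # learning rate and momentum hyperparameter

for step in range(50):
    m = beta * m - lr * grad(theta)   # gradient changes the velocity (acceleration)...
    theta = theta + m                 # ...and the accumulated velocity moves the parameter

# Keras equivalent: tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
```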
In deep neural networks, if you cannot use Batch Normalization, you can apply…
Momentum optimization!
True or False: Momentum Optimization uses gradient for acceleration, not for speed
True
True or False: Momentum can be a hyperparameter of Stochastic Gradient Descent (SGD)
True
Hyperparameters regulate the design of a model.
Name some hyperparameters in different machine learning models
Machine Learning algorithms have different hyperparameters (some main, some optional):
- In NNs, these include the learning rate and the regularization weight
- In Random Forests, these include n_estimators, max_depth, criterion, etc.
- In KMeans these include n_clusters
- In SVM, these include the C value and kernel
- etc.
Why is a good learning rate important?
If the learning rate is too high -> training may diverge
If it is set too low -> training will eventually converge to the optimum, but at the cost of a very long training time
How can you fit a good learning rate?
You can fit a good learning rate by:
- Training the model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large value
- Next, look at the learning curve and pick a learning rate slightly lower than the one at which the learning curve starts shooting back up
- Reinitialize model and train it with that learning rate
-> There’s a good graph in the slides on this
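A rough Keras sketch of this procedure, assuming a compiled `model` and data `X_train, y_train` exist; the callback name and the growth factor are illustrative, and the exact way to read/write the optimizer's learning rate can differ slightly between Keras versions:

```python
import tensorflow as tf

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    """Multiply the learning rate by `factor` after every batch and record the loss."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates, self.losses = [], []

    def on_batch_end(self, batch, logs=None):
        lr = self.model.optimizer.learning_rate.numpy()
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        self.model.optimizer.learning_rate.assign(lr * self.factor)

# Usage (model, X_train, y_train assumed to exist):
# expo_lr = ExponentialLearningRate(factor=1.005)
# model.fit(X_train, y_train, epochs=1, callbacks=[expo_lr])
# Then plot expo_lr.losses against expo_lr.rates, pick a learning rate a bit below the
# point where the loss shoots up, reinitialize the model, and train with that rate.
```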
There are different strategies to reduce learning rate during training. These are called Learning schedules.
Name a few of them
- Power Scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
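Minimal sketches of these schedules as plain Python functions (the constants are illustrative, not from the slides); performance scheduling is typically handled by a callback that watches a validation metric, such as Keras's ReduceLROnPlateau:

```python
# eta0 = initial learning rate, t = iteration/epoch, s = decay steps (all illustrative)
def power_scheduling(t, eta0=0.01, s=20):
    return eta0 / (1 + t / s)                  # decays as 1 / (1 + t/s)

def exponential_scheduling(t, eta0=0.01, s=20):
    return eta0 * 0.1 ** (t / s)               # drops by a factor of 10 every s steps

def piecewise_constant_scheduling(epoch):
    return 0.01 if epoch < 5 else 0.005 if epoch < 15 else 0.001

# Performance scheduling, e.g.:
# tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
```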
What are the challenges of Hyper-parameter optimization?
Summary = Resource intensive, configuration space is complex (loads of different variables to tweak), can’t optimize for generalization performance
- Function evaluations can be extremely expensive for large models (e.g. in deep learning), complex machine learning pipelines, or large datasets
- Configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high dimensional (not always clear which hyperparameters to optimize)
- No access to gradient of hyperparameter loss function (Can be a black box)
- One cannot directly optimize for generalization performance as training datasets are of limited size
What are the model-free blackbox optimization methods?
- Babysitting, or ‘Trial and error’, or Grad Student Descent (manual tuning)
- Problematic due to the large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions
- Grid Search (GS): one of the most commonly used methods
- An exhaustive search or brute-force method
- Works by evaluating the Cartesian product of user-specified finite sets of values
- Problem: inefficient for high-dimensional hyper-parameter configuration spaces:
- The number of evaluations grows exponentially with the number of hyper-parameters
How does Grid Search work?
All combinations of selected values of hyperparameters are tested to determine the optimal choice
-> The number of points in the grid increases exponentially with the number of hyperparameters
-> Example: 5 hyperparameters with 10 candidate values each means 10^5 = 100,000 models must be trained and evaluated
Compute-heavy, but very easy to implement
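A short scikit-learn sketch of grid search, using the SVM hyperparameters (C and kernel) mentioned earlier; the grid values and dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # evaluates all 4 x 2 = 8 combinations
search.fit(X, y)
print(search.best_params_, search.best_score_)
```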
How can we limit the downsides of the heavy requirements in Grid Search computation?
By searching in logarithmic space:
The logarithms of the hyperparameters are sampled uniformly, rather than the hyperparameters themselves
Example: Instead of sampling the learning rate (alpha) directly between 0.1 and 0.001, first sample log10(alpha) uniformly between -1 and -3 and then raise 10 to that power
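A tiny numpy sketch of this log-space sampling, using the same bounds as the example above:

```python
import numpy as np

rng = np.random.default_rng(42)
log_alpha = rng.uniform(-3, -1, size=5)   # sample log10(alpha) uniformly in [-3, -1]
alpha = 10 ** log_alpha                   # learning rates between 0.001 and 0.1
print(alpha)
```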
What are some other Hyper-parameter optimization techniques?
- Random Search (RS): Like GS (see the sketch after this list)
- Randomly selects a pre-defined number of samples between upper and lower bounds as candidate hyper-parameter values, and then trains these candidates until the defined budget is exhausted
- RS can explore a larger search space than GS
- Problem: unnecessary function evaluations, since it does not exploit previously well-performing regions
- Gradient-based optimization: traditional technique
- After randomly selecting a data point, it moves in the opposite direction of the largest gradient to locate the next data point
- A local optimum can be reached after convergence
- Gradient-based algorithms converge quickly to a local optimum
- Can be used to optimize learning rate in neural networks
- Bayesian Optimization: Popular iterative algorithm
- Determines future evaluation points based on previously-obtained results (Unlike GS and RS)
- Multi-fidelity optimization algorithms: To solve constraint of limited time and resources (use a subset of the original dataset/features)
- Metaheuristic algorithms: Set of algorithms inspired by biological theories and widely used for optimization problems
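As referenced in the Random Search item above, a short scikit-learn sketch of random search over the Random Forest hyperparameters mentioned earlier; the distributions, dataset, and the budget of 20 samples are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_distributions = {
    "n_estimators": randint(10, 200),   # candidates are sampled, not exhaustively enumerated
    "max_depth": randint(2, 20),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20, cv=5, random_state=0)  # budget of 20 samples
search.fit(X, y)
print(search.best_params_, search.best_score_)
```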
Are Deep Generative Models used for supervised or unsupervised learning?
Unsupervised.
Deep Generative Models are an efficient way to analyse and understand unlabeled data
What are some examples of use cases for Deep Generative Models?
Visual Recognition, Speech recognition and generation, natural language processing
Deep Generative Models can be divided into two main categories - These are…
Cost-function-based models, such as autoencoders and generative adversarial networks
Energy-based models, where the joint probability is defined using an energy function
What are Boltzmann Machines(BMs)?
BMs are a popular, fully connected ANN architecture
- Based on stochastic neurons
- These neurons output 1 with some probability, and 0 otherwise
- The probability function these ANNs use is based on the Boltzmann distribution
- There is no efficient technique to train Boltzmann machines
What are the differences between BMs and Restricted Boltzmann Machines(RBMs)?
As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.
(Taken from Wikipedia)
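A short scikit-learn sketch: BernoulliRBM trains a Restricted Boltzmann Machine using persistent contrastive divergence; the digits dataset and the settings below are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

X, _ = load_digits(return_X_y=True)
X = X / 16.0                               # scale pixel values to [0, 1]
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)
rbm.fit(X)                                 # unsupervised training (no labels used)
hidden = rbm.transform(X)                  # hidden-unit activation probabilities
```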
Essentially, there are different variations of the Boltzmann Machine. What are these variations?
Boltzmann Machines
- Restricted Boltzmann Machines
- Deep Boltzmann Machines (Undirected model with several layers of latent variables)
- Deep Belief Networks (Combine multiple RBMs)
Summary: Whenever “Deep” is involved, it just means that there are more layers
Autoencoders are…
Unsupervised models
- An example of a generative model
- Useful for dimensionality reduction -> Good for visualisation purposes
The main idea of a generative model is…
That given some input, a generative model should be able to return an output similar to the input data (like speech, for example)
An autoencoder works by…
looking at the inputs, converting them to an efficient latent representation, and then outputting something that (hopefully) looks very close to the inputs
An autoencoder is composed of two parts. Name them
- An encoder (or recognition network) that converts inputs to a latent representation
- A decoder (or generative network) that converts internal representation to outputs
An autoencoder has almost the same architecture as which ML model?
A Multi-Layer Perceptron (MLP), except that the number of neurons in the output layer must be equal to the number of inputs
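A minimal Keras sketch of such an autoencoder for 784-dimensional inputs (e.g. flattened 28x28 images); the layer sizes, loss, and optimizer are illustrative, and adding more hidden layers to each part gives the stacked autoencoder described next:

```python
import tensorflow as tf

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=[784]),      # latent representation
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(784, activation="sigmoid", input_shape=[30]),   # output size == input size
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(loss="binary_crossentropy", optimizer="adam")
# autoencoder.fit(X_train, X_train, epochs=10)  # note: the inputs are also the targets
```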
Stacked Autoencoders (or deep autoencoders) differ from normal autoencoders in that
they can have multiple hidden layers
The advantage of this is that it can help the autoencoder learn more complex codings
Building autoencoders for images requires you to build a…
Convolutional autoencoder (the encoder is a regular CNN composed of convolutional layers and pooling layers) -> it reduces the spatial dimensionality of the inputs
The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions); for this, use transpose convolutional layers
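A hedged Keras sketch of a convolutional autoencoder for 28x28x1 images; the filter counts and kernel sizes are illustrative:

```python
import tensorflow as tf

conv_encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(2),                            # 28x28 -> 14x14
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(2),                            # 14x14 -> 7x7
])
conv_decoder = tf.keras.Sequential([
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same",
                                    activation="relu", input_shape=[7, 7, 32]),
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                    activation="sigmoid"),   # upscale back to 28x28x1
])
conv_autoencoder = tf.keras.Sequential([conv_encoder, conv_decoder])
conv_autoencoder.compile(loss="binary_crossentropy", optimizer="adam")
```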
Building an autoencoder for sequences (such as time series), e.g. for dimensionality reduction, requires you to build a…
Recurrent Autoencoder
- Similar to an RNN
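A hedged Keras sketch of a recurrent autoencoder for sequences of 28 time steps with 1 feature each; the layer sizes are illustrative:

```python
import tensorflow as tf

recurrent_encoder = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=[28, 1]),        # encodes the whole sequence into one vector
])
recurrent_decoder = tf.keras.Sequential([
    tf.keras.layers.RepeatVector(28, input_shape=[32]),   # feed the code to the decoder at every step
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),  # reconstruct one value per time step
])
recurrent_autoencoder = tf.keras.Sequential([recurrent_encoder, recurrent_decoder])
recurrent_autoencoder.compile(loss="mse", optimizer="adam")
```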