Lecture 11 - Hyper-parameter Optimization - Deep Generative Models - Autoencoders Flashcards

1
Q

Why is optimization important?

A

NN objectives are highly non-convex (and worse with depth)

Optimization has a huge influence on the quality of the model
- Factors: Convergence speed, convergence quality

2
Q

Standard training method is…

A

Stochastic Gradient (SG)

3
Q

Explain the stochastic gradient training method

A
  • Choose a random example ‘i’
  • Compute the gradient of the loss on that example and update both ‘v’ and ‘W’ in the direction that reduces the loss

Computing the gradient is known as ‘backpropagation’
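
A minimal numeric sketch of one such step (a sketch only, assuming a single-hidden-layer network with squared-error loss; the names W and v follow the card, everything else is illustrative):

```python
import numpy as np

def sgd_step(W, v, X, y, lr=0.01):
    """One stochastic gradient step: pick a random example i, update W and v."""
    i = np.random.randint(len(X))                   # choose a random example 'i'
    h = np.tanh(W @ X[i])                           # hidden activations
    y_hat = v @ h                                   # prediction
    err = y_hat - y[i]                              # residual of the squared-error loss
    grad_v = err * h                                # dL/dv
    grad_W = np.outer(err * v * (1 - h**2), X[i])   # dL/dW via the chain rule (backpropagation)
    v -= lr * grad_v                                # update both 'v' and 'W'
    W -= lr * grad_W
    return W, v
```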

4
Q

Backpropagation computes neural network gradient via which rule?

A

Chain rule
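
As a short worked form (notation is mine, not from the slides): for a loss L computed from an output y, which depends on a hidden activation h, which depends on weights W, backpropagation applies the chain rule layer by layer:

```latex
\frac{\partial L}{\partial W}
  = \frac{\partial L}{\partial y}\,
    \frac{\partial y}{\partial h}\,
    \frac{\partial h}{\partial W}
```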

5
Q

How can we speed up the training process of deep neural networks?

A

Faster Optimizers!

Four ways:

  • Using a good initialization strategy for the connection weights
  • Using a good activation function
  • Using Batch Normalization
  • Reusing parts of a pretrained network

Another way: Use a faster optimizer than the regular Gradient Descent optimizer, such as:

  • Momentum Optimization
  • Nesterov Accelerated Gradient
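
As a concrete illustration (assuming the tf.keras API; not code from the lecture), both of these are a one-line change to the optimizer:

```python
from tensorflow import keras

# Momentum optimization
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

# Nesterov Accelerated Gradient: the same optimizer with nesterov=True
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
```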
6
Q

What are the general ideas of the momentum optimizer?

A

Primary idea:

  • Let a bowling ball roll down a gentle slope on a smooth surface:
    • It will start slowly
    • Then it quickly picks up momentum until it eventually reaches terminal velocity (limited by friction or air resistance)
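
A minimal sketch of the update rule behind this analogy (the standard momentum form; the variable names are mine, not from the slides): the gradient feeds the velocity, and the velocity moves the weights.

```python
def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
    """Momentum update: the gradient acts as acceleration, the velocity as speed."""
    velocity = beta * velocity - lr * grad   # accumulate momentum; (1 - beta) plays the role of friction
    theta = theta + velocity                 # move the parameters by the velocity
    return theta, velocity
```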
7
Q

In deep neural networks, if you cannot use Batch Normalization, you can apply…

A

Momentum optimization!

8
Q

True or False: Momentum Optimization uses gradient for acceleration, not for speed

A

True

9
Q

True or False: Momentum can be a hyperparameter of Stochastic Gradient Descent

A

True

10
Q

Hyperparameters regulate the design of a model.

Name some hyperparameters in different machine learning models

A

Machine Learning algorithms have different hyperparameters (some main, some optional):

  • In NNs, these include the learning rate and the regularization weight
  • In Random Forests, these include n_estimators, max_depth, criterion, etc.
  • In KMeans these include n_clusters
  • In SVM, these include the C value and kernel
  • etc.
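
A hedged illustration of where some of these hyperparameters appear in scikit-learn (the values are arbitrary examples, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rf  = RandomForestClassifier(n_estimators=200, max_depth=10, criterion="gini")
km  = KMeans(n_clusters=8)
svm = SVC(C=1.0, kernel="rbf")
# In a NN, the learning rate and regularization weight play the same role
# (passed to the optimizer and to the layers' regularizers).
```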
11
Q

Why is a good learning rate important?

A

If the learning rate is too high -> Training may diverge

If it is set too low -> Training will eventually converge to the optimum, but at the cost of a very long training time

12
Q

How can you fit a good learning rate?

A

You can fit a good learning rate by:

  • Training the model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large value
  • Next, look at the learning curve and pick a learning rate slightly lower than the one at which the learning curve starts shooting back up
  • Reinitialize the model and train it with that learning rate

-> There’s a good graph in the slides on this
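
A rough sketch of that procedure (assumptions: the tf.keras API, an already-built model, and training data X_train/y_train; none of this is taken from the slides):

```python
from tensorflow import keras

class ExponentialLearningRate(keras.callbacks.Callback):
    """Multiply the learning rate by a constant factor after every batch and record the loss."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates, self.losses = [], []

    def on_batch_end(self, batch, logs):
        lr = float(keras.backend.get_value(self.model.optimizer.learning_rate))
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        keras.backend.set_value(self.model.optimizer.learning_rate, lr * self.factor)

# Grow the learning rate exponentially over one epoch, then plot losses vs. rates
# and pick a rate slightly below the point where the loss starts shooting back up.
exp_lr = ExponentialLearningRate(factor=1.005)
# model.fit(X_train, y_train, epochs=1, callbacks=[exp_lr])
```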

13
Q

There are different strategies to reduce the learning rate during training. These are called learning schedules.
Name a few of them

A
  1. Power Scheduling
  2. Exponential scheduling
  3. Piecewise constant scheduling
  4. Performance scheduling
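
For example, exponential scheduling drops the learning rate by a constant factor every s epochs. A minimal sketch (the function and the tf.keras LearningRateScheduler callback are my assumptions, not taken from the slides):

```python
from tensorflow import keras

def exponential_decay(lr0=0.01, s=20):
    """Return a schedule that divides the learning rate by 10 every s epochs."""
    def schedule(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return schedule

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay(lr0=0.01, s=20))
# model.fit(..., callbacks=[lr_scheduler])
```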
14
Q

What are the challenges of Hyper-parameter optimization?

A

Summary = Resource-intensive, the configuration space is complex (loads of different variables to tweak), and one cannot directly optimize for generalization performance

  • Function evaluations can be extremely expensive for large models (e.g. in deep learning), complex machine learning pipelines, or large datasets
  • Configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high dimensional (not always clear which hyperparameters to optimize)
  • No access to gradient of hyperparameter loss function (Can be a black box)
  • One cannot directly optimize for generalization performance as training datasets are of limited size
15
Q

What are the model-free blackbox optimization methods?

A
  • Babysitting, ‘Trial and error’, or Grad Student Descent (manual tuning)
    • Problems due to the large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions
  • Grid Search (GS): the most commonly used method
    • An exhaustive search or brute-force method
    • Works by evaluating the Cartesian product of a user-specified finite set of values
    • Problem: inefficient for high-dimensional hyper-parameter configuration spaces, since the number of evaluations increases exponentially with the number of hyper-parameters
16
Q

How does Grid Search work?

A

All combinations of selected values of hyperparameters are tested to determine the optimal choice

  • -> The number of points in the grid increases exponentially with the number of hyperparameters
  • -> Example: with 5 hyperparameters and 10 values to test for each, the model must be trained and evaluated 10^5 = 100,000 times

Compute-heavy, but very easy to implement
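
A hedged sketch of the idea with scikit-learn's GridSearchCV (the estimator and grid values are placeholders):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10],             # 3 values
    "kernel": ["linear", "rbf"],   # x 2 values -> 6 combinations in total
}
search = GridSearchCV(SVC(), param_grid, cv=5)   # every combination is cross-validated
# search.fit(X_train, y_train); search.best_params_
```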

17
Q

How can we limit the heavy computational requirements of Grid Search?

A

By searching only in logarithmic space

The logarithms of the hyperparameters are sampled uniformly, rather than the hyperparameters themselves

Example: Instead of sampling the learning rate (alpha) uniformly between 0.1 and 0.001, first sample log10(alpha) uniformly between -3 and -1 and then raise 10 to that power
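
A minimal sketch of that log-uniform sampling (plain NumPy; the bounds follow the example above):

```python
import numpy as np

log_alpha = np.random.uniform(-3, -1)   # sample the exponent uniformly between -3 and -1
alpha = 10 ** log_alpha                 # learning rate between 0.001 and 0.1, uniform in log space
```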

18
Q

What are some other Hyper-parameter optimization techniques?

A
  • Random Search (RS): like GS, but randomly selects a pre-defined number of samples between upper and lower bounds as candidate hyper-parameter values, and then trains these candidates until the defined budget is exhausted
    • RS can explore a larger search space than GS
    • Problem: unnecessary function evaluations, since it does not exploit previously well-performing regions
  • Gradient-based optimization: traditional technique
    • After randomly selecting a starting point, it moves in the opposite direction of the largest gradient to locate the next point
    • A local optimum can be reached after convergence
    • Gradient-based algorithms have fast convergence to a local optimum
    • Can be used to optimize the learning rate in neural networks
  • Bayesian Optimization: popular iterative algorithm
    • Determines future evaluation points based on previously obtained results (unlike GS and RS)
  • Multi-fidelity optimization algorithms: address the constraint of limited time and resources (use a subset of the original dataset/features)
  • Metaheuristic algorithms: a set of algorithms inspired by biological theories, widely used for optimization problems
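
For comparison with GS, a hedged Random Search sketch using scikit-learn's RandomizedSearchCV (the distributions and budget are placeholders):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(50, 500),   # candidates are sampled between the bounds
    "max_depth": randint(2, 20),
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=30,   # the pre-defined number of sampled candidates (the budget)
    cv=5,
)
# search.fit(X_train, y_train); search.best_params_
```
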
19
Q

Are Deep Generative Models used for supervised or unsupervised learning?

A

Unsupervised.

Deep Generative Models are an efficient way to analyse and understand unlabeled data

20
Q

What are some examples of use cases for Deep Generative Models?

A

Visual Recognition, Speech recognition and generation, natural language processing

21
Q

Deep Generative Models can be divided into two main categories - These are…

A

Cost-function-based models - such as autoencoders and generative adversarial networks

Energy-based models - where the joint probability is defined using an energy function

22
Q

What are Boltzmann Machines (BMs)?

A

BMs are a popular, fully connected ANN architecture

  • Based on stochastic neurons
    • These neurons output 1 with some probability, and 0 otherwise
    • Probability function that these ANNs use is based on the Boltzmann distribution
  • There is no efficient technique to train Boltzmann Machines
23
Q

What are the differences between BMs and Restricted Boltzmann Machines (RBMs)?

A

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.

  • Adapted from Wikipedia
24
Q

Essentially, there are different variations of the Boltzmann Machine. What are the variations?

A

Boltzmann Machines

  • Restricted Boltzmann Machines
  • Deep Boltzmann Machines (Undirected model with several layers of latent variables)
  • Deep Belief Networks (combine multiple RBMs)

Summary: Whenever “Deep” is involved, it just means that there are more layers

25
Q

Autoencoders are…

A

Unsupervised models
- An example of a generative model

  • Useful for dimensionality reduction -> Good for visualisation purposes
26
Q

The main idea of a generative model is…

A

That given an input, a generative model should be able to return an output similar to the input data (like speech, for example)

27
Q

An autoencoder works by…

A

looking at the inputs, converting them to an efficient latent representation, and then outputting something that (hopefully) looks very close to the inputs

28
Q

An autoencoder is composed of two parts. Name them

A
  1. An encoder (or recognition network) that converts inputs to a latent representation
  2. A decoder (or generative network) that converts internal representation to outputs
29
Q

Autoencoder almost has the same architecture as which ML model?

A

A Multi-Layer Perceptron (MLP), except that the number of neurons in the output layer must be equal to the number of inputs

30
Q

Stacked Autoencoders (or deep autoencoders) differ from normal autoencoders in that

A

they can have multiple hidden layers

The advantage of this is that it can help the autoencoder learn more complex codings
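
A minimal stacked-autoencoder sketch in tf.keras (the layer sizes are arbitrary and the 28x28 input shape is my assumption, not from the slides):

```python
from tensorflow import keras

# Encoder (recognition network): inputs -> progressively smaller latent representation
encoder = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),   # the latent representation ("codings")
])

# Decoder (generative network): latent representation -> reconstruction of the input
decoder = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape=[30]),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28]),
])

stacked_ae = keras.models.Sequential([encoder, decoder])
stacked_ae.compile(loss="binary_crossentropy", optimizer="adam")
# stacked_ae.fit(X_train, X_train, epochs=10)   # note: the target is the input itself
```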

31
Q

Building autoencoders for images requires you to build a

A

Convolutional autoencoder (a regular CNN composed of convolutional layers and pooling layers) -> Reduces the spatial dimensionality of the inputs

The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions), and for this it uses transpose convolutional layers
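
A hedged convolutional-autoencoder sketch along the same lines (tf.keras; the shapes assume 28x28 grayscale images, which is my assumption):

```python
from tensorflow import keras

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, 3, padding="same", activation="selu"),
    keras.layers.MaxPool2D(2),   # 28x28 -> 14x14: spatial dimensionality shrinks, depth grows
    keras.layers.Conv2D(32, 3, padding="same", activation="selu"),
    keras.layers.MaxPool2D(2),   # 14x14 -> 7x7
])

conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="selu",
                                 input_shape=[7, 7, 32]),                                 # upscale to 14x14
    keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),  # back to 28x28x1
    keras.layers.Reshape([28, 28]),   # back to the original dimensions
])

conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])
```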

32
Q

Building an autoencoder for sequences (such as time series), you must build a

A

Recurrent Autoencoder

Similar in structure to an RNN