Chapter 19 How to Implement Bayesian Optimization Flashcards

1
Q

What’s Bayesian Optimization?

P 179

A

Bayesian Optimization is an approach that uses Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.

2
Q

What’s an objective function?

External, P 178

A

The function that we want to maximize or minimize.
P 178: Objective Function. A function that takes a sample and returns a cost.

3
Q

What’s global function optimization?

P 178

A

Global function optimization, or function optimization for short, involves finding the minimum or maximum of an objective function.

4
Q

What are the definitions of Samples, Search Space, and Cost?

P 178

A

• Samples: One example from the domain, represented as a vector.
• Search Space: Extent of the domain from which samples can be drawn.
• Cost: Numeric score for a sample, calculated via the objective function.

5
Q

The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time. True/False

P 178

A

True

6
Q

The form (shape) of the objective function is unknown and is often highly nonlinear and highly multi-dimensional, with the dimensionality defined by the number of input variables. True/False

P 178

A

True

7
Q

What does the below statement mean?

“The objective function is also probably non-convex.”

P 178

A

This means that local extrema may or may not be the global extrema (e.g. they could be misleading and result in premature convergence), hence the task is named global rather than local optimization.

Non-convex functions are functions that have multiple minimum points: local minima as well as the global minimum (e.g. f(x) = sin(x) + sin(3x) has several local minima, not all of which are global).

8
Q

Little is known about the objective function, and as such, it is often referred to as a ____ function and the search process as ____. Further, the objective function is sometimes called an ____, given its ability only to give answers.

P 178

A

black box, black box optimization, oracle

9
Q

Is the objective function easy to specify?

P 178

A

The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time.

10
Q

Define:

• Algorithm Training.
• Algorithm Tuning.
• Predictive Modeling.

P 178

A

• Algorithm Training: Optimization of model parameters.
• Algorithm Tuning: Optimization of model hyperparameters.
• Predictive Modeling: Optimization of data, data preparation, and algorithm selection.

11
Q

A directed approach to global optimization that uses probability (Bayes Theorem) is called ____.

P 178

A

Bayesian Optimization

12
Q

For what kind of objective function is Bayesian Optimization most useful?

P 179

A

It is an approach that is most useful for objective functions that are complex, noisy, and/or expensive to evaluate.

Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand is non-convex.

13
Q

Given the Bayesian Optimization formula below, why is the posterior called a surrogate function for the objective function? P 179

P(f|D) = P(D|f) × P(f)

A

The posterior represents everything we know about the objective function. It is an approximation of the objective function and can be used to estimate the cost of different candidate samples that we may want to evaluate. In this way, the posterior probability is a surrogate objective function.

14
Q

An important hyperparameter in the GP (Gaussian Process) model is the ____.

P 184

from sklearn.gaussian_process import GaussianProcessRegressor

A

kernel

Gaussian processes are non-parametric, kernel-based Bayesian tools for performing inference (the kernel is the heart of the model).

Kernel methods use kernels (or basis functions) to map the input data into a different space. After this mapping, simple models can be trained on the new feature space, instead of the input space, which can improve model performance.
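
A minimal sketch of setting the kernel when constructing the model; the RBF kernel with length_scale=1.0 is an illustrative choice (scikit-learn falls back to an RBF-based default kernel if none is given):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fit a GP with an explicit kernel on toy data; the kernel (its family and
# its length_scale) is the hyperparameter that shapes the model's covariance.
X = np.random.random((20, 1))
y = np.sin(6 * X).ravel()
model = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
model.fit(X, y)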

15
Q

What is the acquisition function?

P 180

A

Acquisition Function: Technique by which the posterior is used to select the next sample from the search space.

Web: Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process.

16
Q

“The surrogate function gives us an estimate of the objective function, which can be used to direct future sampling.” What is meant by sampling here?

P 180

A

Sampling involves careful use of the posterior in a function known as the acquisition function, e.g. for acquiring more samples. We want to use our belief about the objective function to sample the area of the search space that is most likely to pay off; therefore the acquisition function will optimize the conditional probability of locations in the search space to generate the next sample.

17
Q

What are the steps of Bayesian optimization?

P 180

A

The Bayesian Optimization algorithm can be summarized as follows:
1. Select a Sample by Optimizing the Acquisition Function.
2. Evaluate the Sample With the Objective Function.
3. Update the Data and, in turn, the Surrogate Function.
4. Go To 1.

Based on the worked example P 189:

First, we fit a surrogate function (GP regression) to the dataset we have. Then, inside a loop, we do the following (see the sketch after this list):
1) Use a search strategy to sample the domain; in the example, random search is used.
2) Use the acquisition function to find the instance in the sample whose surrogate score is closest to the max/min of the surrogate scores for the original dataset.
3) Evaluate that instance with the actual objective function, then add the instance and the function's output to the dataset (its value x is usually close to the value at which the extremum occurs).
4) Fit the surrogate model to the updated dataset and target.
5) Go back to 1 and repeat the process.

In the end we have many values x clustered around the actual value at which the extremum occurs, and this loop helps the surrogate function locate that value with increasing precision. With every iteration (n_calls in gp_minimize) of hyperparameter tuning using Bayesian Optimization, we get closer to the optimal hyperparameters, because the search moves from coarse to fine with each iteration.
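
Below is a minimal runnable sketch of this loop, modeled on the book's worked example (P 189) but simplified; the noisy 1-D objective on [0, 1], the Probability of Improvement acquisition, and the helper names surrogate/acquisition/opt_acquisition are illustrative assumptions, not a fixed API:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x, noise=0.1):
    # noisy 1-D toy objective; the optimizer never sees its true form
    return (x ** 2) * np.sin(5 * np.pi * x) ** 6 + np.random.normal(0, noise)

def surrogate(model, X):
    # GP posterior mean and standard deviation at the points X
    return model.predict(X, return_std=True)

def acquisition(X, Xsamples, model):
    # Probability of Improvement over the best surrogate mean seen so far
    yhat, _ = surrogate(model, X)
    best = np.max(yhat)
    mu, std = surrogate(model, Xsamples)
    return norm.cdf((mu - best) / (std + 1e-9))

def opt_acquisition(X, model, n_candidates=100):
    # random-search the domain and return the highest-scoring candidate
    Xsamples = np.random.random((n_candidates, 1))
    scores = acquisition(X, Xsamples, model)
    return Xsamples[np.argmax(scores)]

# fit the surrogate to a small initial dataset
X = np.random.random((5, 1))
y = np.array([objective(xi[0]) for xi in X])
model = GaussianProcessRegressor()
model.fit(X, y)

for _ in range(50):
    x_next = opt_acquisition(X, model)        # 1. select a sample via acquisition
    y_next = objective(x_next[0])             # 2. evaluate with the true objective
    X = np.vstack((X, x_next.reshape(1, 1)))  # 3. update the data ...
    y = np.append(y, y_next)
    model.fit(X, y)                           # ... and, in turn, the surrogate; go to 1

ix = np.argmax(y)
print('best x=%.3f, y=%.3f' % (X[ix, 0], y[ix]))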

18
Q

The surrogate function is a technique used to best approximate the mapping of input examples to an output score. Probabilistically, it summarizes the conditional probability of an objective function (f), given the available data (D) or P(f|D).

A number of techniques can be used for this, although the most popular is to treat the problem as a regression predictive modeling problem with the data representing the input and the score representing the output to the model.

This is often best modeled using a ____ or a ____.

P 183

A

Random Forest, Gaussian Process

Using a GP regression model is often preferred.

19
Q

What’s Multivariate Normal Distribution?

External

A

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions.

A Gaussian Process, or GP, is a model that constructs a joint probability distribution over the variables, assuming a multivariate Gaussian distribution.

20
Q

What does .predict() method from GaussianProcessRegressor return?

P 184

A

We can call this function any time to estimate the cost of one or more samples. The result for a given sample will be the mean of the distribution at that point. We can also get the standard deviation of the distribution at that point by specifying the argument return_std=True. (Me: it's a probabilistic model, so it gives a distribution over the possible outcome, which is the output of the estimated cost function.)

The GaussianProcessRegressor model will estimate the cost for one or more samples provided to it.
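
A small sketch of querying a fitted model (the toy training data is an assumption for illustration):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# fit on toy data, then query the posterior at a new point
X = np.random.random((20, 1))
y = np.sin(6 * X).ravel()
model = GaussianProcessRegressor().fit(X, y)

mean = model.predict(np.array([[0.5]]))                        # posterior mean only
mean, std = model.predict(np.array([[0.5]]), return_std=True)  # mean + uncertainty
print(mean[0], std[0])  # estimated cost at x=0.5 and its standard deviation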

21
Q

We would expect the surrogate function to have a crude approximation of the ____. P 185

A

The true, non-noisy objective function

22
Q

What does the acquisition function do?

P 188

A

The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.

23
Q

Given a Gaussian Process model as the surrogate function, we can use the ____ from this model in the acquisition function to calculate the probability that a given sample is worth evaluating.

P 188

A

probabilistic information

24
Q

What are 3 common examples of probabilistic acquisition functions?
Which is the simplest and which is the most commonly used?

P 188

A

• Probability of Improvement (PI): the simplest.
• Expected Improvement (EI): the most common.
• Lower Confidence Bound (LCB).

The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.
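
A small sketch, assuming a maximization setting, of how each score could be computed from the GP posterior mean and standard deviation; the xi exploration margin and kappa multiplier are illustrative knobs, not values from the book:

import numpy as np
from scipy.stats import norm

def acquisition_scores(mu, std, best, xi=0.01, kappa=1.96):
    # mu, std: GP posterior mean/std at candidate points; best: best mean so far
    std = std + 1e-9                  # guard against zero predicted uncertainty
    z = (mu - best - xi) / std
    pi = norm.cdf(z)                                          # Probability of Improvement
    ei = (mu - best - xi) * norm.cdf(z) + std * norm.pdf(z)   # Expected Improvement
    lcb = mu - kappa * std            # Lower Confidence Bound (the score to
                                      # minimize when searching for a minimum)
    return pi, ei, lcb

pi, ei, lcb = acquisition_scores(np.array([0.5, 0.7]), np.array([0.1, 0.3]), best=0.6)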

25
Q

Define a strategy for sampling the surrogate function. (Me: sampling its domain)

P 187

A
  • The search strategy used to navigate the domain in response to the surrogate function can be: random sampling, grid-based sampling, or local search strategies (more common).

Note: after sampling, the acquisition function is used to interpret and score the response from the surrogate function. That is, after drawing samples from the domain, we evaluate them using the acquisition function, then optimize the acquisition by choosing the candidate(s) from the sample with the best scores; we evaluate those with the objective function, add them to the dataset, and refit the surrogate function to the updated dataset.

26
Q

Two popular libraries for Bayesian Optimization include ____ and ____. In machine learning, these libraries are often used for ____.

P 193

A

Scikit-Optimize, HyperOpt, tuning the hyperparameters of algorithms

27
Q

There are two ways that scikit-optimize can be used to optimize the hyperparameters of a scikit-learn algorithm. What are they?

P 193

A
  • Perform the optimization directly on a search space.
  • Use the BayesSearchCV class, a sibling of the scikit-learn native classes for random and grid searching.
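
A hedged sketch of the BayesSearchCV route; the SVC estimator and the search space below are illustrative choices, not the book's example:

from skopt import BayesSearchCV
from skopt.space import Categorical, Real
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# BayesSearchCV mirrors GridSearchCV/RandomizedSearchCV, but samples the
# search space using Bayesian Optimization instead of a grid or at random.
opt = BayesSearchCV(
    SVC(),
    {'C': Real(1e-3, 1e+3, prior='log-uniform'),
     'kernel': Categorical(['linear', 'rbf'])},
    n_iter=16,
    cv=3,
)
opt.fit(X, y)
print(opt.best_params_, opt.best_score_)
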
28
Q

Why is Bayesian Optimization used for Hyperparameter tuning?

P 193

A

Hyperparameter tuning is a good fit for Bayesian Optimization because:

The evaluation function is computationally expensive (e.g. training models for each set of hyperparameters) and noisy (e.g. noise in training data and stochastic learning algorithms).

29
Q

By default, gp_minimize() will use a ____ acquisition function that tries to figure out the best strategy, but this can be configured via the ____ argument. The optimization will also run for 100 iterations by default, but this can be controlled via the ____ argument.

P 194

A

gp_hedge, acq_func, n_calls

gp_minimize(): Bayesian optimization using Gaussian Processes.
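
A minimal sketch of calling gp_minimize; the one-dimensional objective and the search bounds are illustrative:

from skopt import gp_minimize

def objective(params):
    # gp_minimize passes a list of parameter values; here we minimize (x - 0.3)^2
    x = params[0]
    return (x - 0.3) ** 2

result = gp_minimize(
    objective,
    dimensions=[(0.0, 1.0)],   # search space: one real-valued dimension
    acq_func='gp_hedge',       # the default acquisition strategy
    n_calls=30,                # override the default of 100 iterations
)
print(result.x, result.fun)    # best input found and its objective value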

30
Q

Why do we add data to the dataset when using Bayesian Optimization?

External

A

Because this would guide the search for the optimal data point in the search space.

How?

1. Each time we sample the search space (using some search strategy) and use the surrogate function to estimate the score of each data point, we choose one data point by optimizing the acquisition function.

2. This data point in the sample is the one most similar to the data point in the dataset that currently maximizes/minimizes the surrogate function.

3. We add this point (with its true objective score) to the dataset and fit the surrogate function on the new dataset.

4. We take a new sample from the search space and find the surrogate function's output for that set of data points.

5. We again find the point(s) most similar to the current point optimizing the surrogate function and add them to the dataset.

Repeating this a few times, we accumulate many data points around the point that actually optimizes the objective function; therefore the surrogate function can produce the best estimate of the optimizing point.

The Hiring:

We want to hire someone for a company, someone the CEO will be crazy about!
There is:
CEO (Objective function)
Consultant (Surrogate function)
HR (Acquisition function)
Employees (Data points in the dataset; we also have the CEO's love score for each employee)
Candidates (Data points in the search space)

We can’t ask the CEO each time we want to hire someone out of a lot of candidates if they like them or not, but we can ask the consultant what they think the CEO would say.
What we do is:

1. HR searches for some candidates and chooses a bunch,

2. then asks the consultant to rate them and the current employees on how much the CEO would love them (the CEO doesn't talk to HR; the consultant does this according to how they think the CEO's love score works, because the CEO has talked about it with them: the surrogate is trained on the dataset of data points and their true objective scores),

3. then uses the consultant's opinion to choose someone similar to the employee whom the CEO would love the most.

4. After choosing one or more candidates, HR adds them to the employees and asks the CEO to say how much they love the new hires.

5. HR, after talking to the consultant, now knows a bit more about what sort of candidates would have a chance, so the search is a bit more refined; they choose a new set of candidates based on what they know now.

6. They go to the consultant, and this process is repeated a few times until they get tired; then they choose the newly hired employee the CEO loves the most and celebrate 🥳🥂