Chapter 19 How to Implement Bayesian Optimization Flashcards

1
Q

What’s Bayesian Optimization?

P 179

A

Bayesian Optimization is an approach that uses Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.

2
Q

What’s an objective function?

External, P 178

A

The function that we want to maximize or minimize.
P 178: Objective Function. A function that takes a sample and returns a cost.

3
Q

What’s global function optimization?

P 178

A

Global function optimization, or function optimization for short, involves finding the minimum or maximum of an objective function.

4
Q

What are the definitions of Samples, Search Space, and Cost?

P 178

A

• Samples: One example from the domain, represented as a vector.
• Search Space: Extent of the domain from which samples can be drawn.
• Cost: Numeric score for a sample, calculated via the objective function.

5
Q

The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time. True/False

P 178

A

True

6
Q

The form (shape) of the objective function is unknown and is often highly nonlinear and highly multi-dimensional, with the dimensionality defined by the number of input variables. True/False

P 178

A

True

7
Q

What does the below statement mean?

“The objective function is also probably non-convex.”

P 178

A

This means that local extrema may or may not be the global extrema (e.g. they could be misleading and result in premature convergence), hence the task is named global rather than local optimization.

Non-convex functions are functions that have multiple minimum points: local minima as well as the global minimum (e.g. f(x) = sin(x) + sin(3x) has several local minima, not all of which are global).

8
Q

Little is known about the objective function, and as such, it is often referred to as a ____ function and the search process as ____. Further, the objective function is sometimes called an ____, given its ability only to give answers.

P 178

A

black box, black box optimization, oracle

9
Q

Is the objective function easy to specify?

P 178

A

The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time.

10
Q

Define:

• Algorithm Training.
• Algorithm Tuning.
• Predictive Modeling.

P 178

A

• Algorithm Training: Optimization of model parameters.
• Algorithm Tuning: Optimization of model hyperparameters.
• Predictive Modeling: Optimization of data, data preparation, and algorithm selection.

11
Q

A directed approach to global optimization that uses probability (Bayes Theorem) is called ____.

P 178

A

Bayesian Optimization

12
Q

For what kind of objective function is Bayesian Optimization most useful?

P 179

A

It is an approach that is most useful for objective functions that are complex, noisy, and/or expensive to evaluate.

Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand is non-convex.

13
Q

Given the Bayesian Optimization formula below, why is the posterior called a surrogate function for the objective function? P 179

P(f|D) = P(D|f) × P(f)

A

The posterior represents everything we know about the objective function. It is an approximation of the objective function and can be used to estimate the cost of different candidate samples that we may want to evaluate. In this way, the posterior probability is a surrogate objective function.

14
Q

An important hyperparameter in the GP (Gaussian Process) model is the ____.

P 184

from sklearn.gaussian_process import GaussianProcessRegressor

A

kernel

Gaussian processes are non-parametric, kernel-based Bayesian tools for performing inference (the kernel is the heart of the model).

Kernel methods use kernels (or basis functions) to map the input data into a different space. After this mapping, simple models can be trained on the new feature space, instead of the input space, which can improve model performance.
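
A minimal sketch of setting the kernel when constructing the model; the RBF kernel with length_scale=1.0 is an illustrative choice (scikit-learn falls back to an RBF-based default kernel if none is given):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fit a GP with an explicit kernel on toy data; the kernel (its family and
# its length_scale) is the hyperparameter that shapes the model's covariance.
X = np.random.random((20, 1))
y = np.sin(6 * X).ravel()
model = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
model.fit(X, y)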

15
Q

What is the acquisition function?

P 180

A

Acquisition Function: Technique by which the posterior is used to select the next sample from the search space.

Web: Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process.

16
Q

“The surrogate function gives us an estimate of the objective function, which can be used to direct future sampling.” What is meant by sampling here?

P 180

A

Sampling involves careful use of the posterior in a function known as the acquisition function, e.g. for acquiring more samples. We want to use our belief about the objective function to sample the area of the search space that is most likely to pay off; therefore the acquisition function will optimize the conditional probability of locations in the search space to generate the next sample.

17
Q

What are the steps of Bayesian optimization?

P 180

A

The Bayesian Optimization algorithm can be summarized as follows:
1. Select a Sample by Optimizing the Acquisition Function.
2. Evaluate the Sample With the Objective Function.
3. Update the Data and, in turn, the Surrogate Function.
4. Go To 1.

Based on the worked example P 189:

First, we fit a surrogate function (GP regression) to the dataset we have. Then, inside a loop, we do the following (see the sketch after this list):
1) Use a search strategy to sample the domain; in the example, random search is used.
2) Use the acquisition function to find the instance in the sample whose surrogate score is closest to the max/min of the surrogate scores for the original dataset.
3) Evaluate that instance with the actual objective function, then add the instance and the function's output to the dataset (its value x is usually close to the value at which the extremum occurs).
4) Fit the surrogate model to the updated dataset and target.
5) Go back to 1 and repeat the process.

In the end we have many values x clustered around the actual value at which the extremum occurs, and this loop helps the surrogate function locate that value with increasing precision. With every iteration (n_calls in gp_minimize) of hyperparameter tuning using Bayesian Optimization, we get closer to the optimal hyperparameters, because the search moves from coarse to fine with each iteration.
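
Below is a minimal runnable sketch of this loop, modeled on the book's worked example (P 189) but simplified; the noisy 1-D objective on [0, 1], the Probability of Improvement acquisition, and the helper names surrogate/acquisition/opt_acquisition are illustrative assumptions, not a fixed API:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x, noise=0.1):
    # noisy 1-D toy objective; the optimizer never sees its true form
    return (x ** 2) * np.sin(5 * np.pi * x) ** 6 + np.random.normal(0, noise)

def surrogate(model, X):
    # GP posterior mean and standard deviation at the points X
    return model.predict(X, return_std=True)

def acquisition(X, Xsamples, model):
    # Probability of Improvement over the best surrogate mean seen so far
    yhat, _ = surrogate(model, X)
    best = np.max(yhat)
    mu, std = surrogate(model, Xsamples)
    return norm.cdf((mu - best) / (std + 1e-9))

def opt_acquisition(X, model, n_candidates=100):
    # random-search the domain and return the highest-scoring candidate
    Xsamples = np.random.random((n_candidates, 1))
    scores = acquisition(X, Xsamples, model)
    return Xsamples[np.argmax(scores)]

# fit the surrogate to a small initial dataset
X = np.random.random((5, 1))
y = np.array([objective(xi[0]) for xi in X])
model = GaussianProcessRegressor()
model.fit(X, y)

for _ in range(50):
    x_next = opt_acquisition(X, model)        # 1. select a sample via acquisition
    y_next = objective(x_next[0])             # 2. evaluate with the true objective
    X = np.vstack((X, x_next.reshape(1, 1)))  # 3. update the data ...
    y = np.append(y, y_next)
    model.fit(X, y)                           # ... and, in turn, the surrogate; go to 1

ix = np.argmax(y)
print('best x=%.3f, y=%.3f' % (X[ix, 0], y[ix]))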

18
Q

The surrogate function is a technique used to best approximate the mapping of input examples to an output score. Probabilistically, it summarizes the conditional probability of an objective function (f), given the available data (D) or P(f|D).

A number of techniques can be used for this, although the most popular is to treat the problem as a regression predictive modeling problem with the data representing the input and the score representing the output to the model.

This is often best modeled using a ____ or a ____.

P 183

A

Random Forest, Gaussian Process

Using a GP regression model is often preferred.

19
Q

What’s Multivariate Normal Distribution?

External

A

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions.

A Gaussian Process, or GP, is a model that constructs a joint probability distribution over the variables, assuming a multivariate Gaussian distribution.

20
Q

What does .predict() method from GaussianProcessRegressor return?

P 184

A

We can call this function any time to estimate the cost of one or more samples. The result for a given sample will be the mean of the distribution at that point. We can also get the standard deviation of the distribution at that point by specifying the argument return_std=True. (Me: it's a probabilistic model, so it gives a distribution over the possible outcome, which is the output of the estimated cost function.)

The GaussianProcessRegressor model will estimate the cost for one or more samples provided to it.
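
A small sketch of querying a fitted model (the toy training data is an assumption for illustration):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# fit on toy data, then query the posterior at a new point
X = np.random.random((20, 1))
y = np.sin(6 * X).ravel()
model = GaussianProcessRegressor().fit(X, y)

mean = model.predict(np.array([[0.5]]))                        # posterior mean only
mean, std = model.predict(np.array([[0.5]]), return_std=True)  # mean + uncertainty
print(mean[0], std[0])  # estimated cost at x=0.5 and its standard deviation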

21
Q

We would expect the surrogate function to have a crude approximation of the ____. P 185

A

The true, non-noisy objective function

22
Q

What does the acquisition function do?

P 188

A

The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.

23
Q

Given a Gaussian Process model as the surrogate function, we can use the ____ from this model in the acquisition function to calculate the probability that a given sample is worth evaluating.

P 188

A

probabilistic information

24
Q

What are 3 common examples of probabilistic acquisition functions?
Which is the simplest and which is the most commonly used?

P 188

A

• Probability of Improvement (PI): the simplest.
• Expected Improvement (EI): the most common.
• Lower Confidence Bound (LCB).

The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.
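
A small sketch, assuming a maximization setting, of how each score could be computed from the GP posterior mean and standard deviation; the xi exploration margin and kappa multiplier are illustrative knobs, not values from the book:

import numpy as np
from scipy.stats import norm

def acquisition_scores(mu, std, best, xi=0.01, kappa=1.96):
    # mu, std: GP posterior mean/std at candidate points; best: best mean so far
    std = std + 1e-9                  # guard against zero predicted uncertainty
    z = (mu - best - xi) / std
    pi = norm.cdf(z)                                          # Probability of Improvement
    ei = (mu - best - xi) * norm.cdf(z) + std * norm.pdf(z)   # Expected Improvement
    lcb = mu - kappa * std            # Lower Confidence Bound (the score to
                                      # minimize when searching for a minimum)
    return pi, ei, lcb

pi, ei, lcb = acquisition_scores(np.array([0.5, 0.7]), np.array([0.1, 0.3]), best=0.6)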

25
Q

Define a strategy for sampling the surrogate function. (Me: sampling its domain)

P 187

A
  • The search strategy used to navigate the domain in response to the surrogate function can be: random sampling, grid-based sampling, or local search strategies (more common).

Note: after sampling, the acquisition function is used to interpret and score the response from the surrogate function. That is, after drawing samples from the domain, we evaluate them using the acquisition function, then optimize the acquisition by choosing the candidate(s) from the sample with the best scores; we evaluate those with the objective function, add them to the dataset, and refit the surrogate function to the updated dataset.

26
Q

Two popular libraries for Bayesian Optimization include ____ and ____. In machine learning, these libraries are often used for ____.

P 193

A

Scikit-Optimize, HyperOpt, tuning the hyperparameters of algorithms

27
Q

There are two ways that scikit-optimize can be used to optimize the hyperparameters of a scikit-learn algorithm. What are they?

P 193

A
  • Perform the optimization directly on a search space.
  • Use the BayesSearchCV class, a sibling of the scikit-learn native classes for random and grid searching.
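
A hedged sketch of the BayesSearchCV route; the SVC estimator and the search space below are illustrative choices, not the book's example:

from skopt import BayesSearchCV
from skopt.space import Categorical, Real
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# BayesSearchCV mirrors GridSearchCV/RandomizedSearchCV, but samples the
# search space using Bayesian Optimization instead of a grid or at random.
opt = BayesSearchCV(
    SVC(),
    {'C': Real(1e-3, 1e+3, prior='log-uniform'),
     'kernel': Categorical(['linear', 'rbf'])},
    n_iter=16,
    cv=3,
)
opt.fit(X, y)
print(opt.best_params_, opt.best_score_)
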
28
Q

Why is Bayesian Optimization used for Hyperparameter tuning?

P 193

A

Hyperparameter tuning is a good fit for Bayesian Optimization because:

The evaluation function is computationally expensive (e.g. training models for each set of hyperparameters) and noisy (e.g. noise in training data and stochastic learning algorithms).

29
Q

By default, gp_minimize() will use a ____ acquisition function that tries to figure out the best strategy, but this can be configured via the ____ argument. The optimization will also run for 100 iterations by default, but this can be controlled via the ____ argument.

P 194

A

gp_hedge, acq_func, n_calls

gp_minimize(): Bayesian optimization using Gaussian Processes.
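
A minimal sketch of calling gp_minimize; the one-dimensional objective and the search bounds are illustrative:

from skopt import gp_minimize

def objective(params):
    # gp_minimize passes a list of parameter values; here we minimize (x - 0.3)^2
    x = params[0]
    return (x - 0.3) ** 2

result = gp_minimize(
    objective,
    dimensions=[(0.0, 1.0)],   # search space: one real-valued dimension
    acq_func='gp_hedge',       # the default acquisition strategy
    n_calls=30,                # override the default of 100 iterations
)
print(result.x, result.fun)    # best input found and its objective value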

30
Q

Why do we add data to the dataset when using Bayesian Optimization?

External

A

Because this would guide the search for the optimal data point in the search space.

How?

1. Each time we sample the search space (using some search strategy) and use the surrogate function to estimate the score of each data point, we choose one data point by optimizing the acquisition function.

2. This data point in the sample is the one most similar to the data point in the dataset that currently maximizes/minimizes the surrogate function.

3. We add this point (with its true objective score) to the dataset and fit the surrogate function on the new dataset.

4. We take a new sample from the search space and find the surrogate function's output for that set of data points.

5. We again find the point(s) most similar to the current point optimizing the surrogate function and add them to the dataset.

Repeating this a few times, we accumulate many data points around the point that actually optimizes the objective function; therefore the surrogate function can produce the best estimate of the optimizing point.

The Hiring:

We want to hire someone for a company, someone the CEO will be crazy about!
There is:
CEO (Objective function)
Consultant (Surrogate function)
HR (Acquisition function)
Employees (Data points in the dataset; we also have the CEO's love score for each employee)
Candidates (Data points in the search space)

We can’t ask the CEO each time we want to hire someone out of a lot of candidates if they like them or not, but we can ask the consultant what they think the CEO would say.
What we do is:

1. HR searches for some candidates and chooses a bunch,

2. then asks the consultant to rate them and the current employees on how much the CEO would love them (the CEO doesn't talk to HR; the consultant does this according to how they think the CEO's love score works, because the CEO has talked about it with them: the surrogate is trained on the dataset of data points and their true objective scores),

3. then uses the consultant's opinion to choose someone similar to the employee whom the CEO would love the most.

4. After choosing one or more candidates, HR adds them to the employees and asks the CEO to say how much they love the new hires.

5. HR, after talking to the consultant, now knows a bit more about what sort of candidates would have a chance, so the search is a bit more refined; they choose a new set of candidates based on what they know now.

6. They go to the consultant, and this process is repeated a few times until they get tired; then they choose the newly hired employee the CEO loves the most and celebrate 🥳🥂