Chapter 17 Bayes Theorem and Machine Learning Flashcards
How can we frame a hypothesis and data as a Bayes Theorem problem? (formula) P 159
P(h|D) = P(D|h) × P(h) / P(D)
Breaking this down: the probability of a hypothesis being true given some observed data can be calculated as the probability of observing the data given the hypothesis, multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.
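A minimal numeric sketch (made-up probabilities, not from the book) showing how the three terms combine:

```python
# Hypothetical numbers: a hypothesis h with prior 0.3, and data that is
# twice as likely under h as under not-h.
p_h = 0.3                      # P(h): prior probability of the hypothesis
p_d_given_h = 0.8              # P(D|h): likelihood of the data under h
p_d_given_not_h = 0.4          # P(D|not h)

# P(D): total probability of the data, marginalised over h and not-h
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes Theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # 0.24 / 0.52 ≈ 0.462
```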
Under the Bayes framework, each piece of the calculation has a specific name; for example: P 159
P(h|D):____
P(h): ____
Posterior probability of the hypothesis (the thing we want to calculate), Prior probability of the hypothesis.
What do we want to maximize when using Bayes Theorem in applied machine learning? P 159
The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, ... ∈ H) being true given the observed data. Under this framework, the probability of the data (D) is constant as it is used in the assessment of each hypothesis. Therefore, it can be removed from the calculation (of the Bayes Theorem formula) to give the simplified unnormalized estimate as follows:
max h ∈ H P(h|D) = P(D|h) × P(h)
If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform probability, and this term too becomes a constant and can be removed from the calculation to give the following:
max h ∈ H P(h|D) = P(D|h)
Under these simplifications, maximizing P(h|D) reduces to maximizing P(D|h) (a density estimation problem), so the goal is to locate the hypothesis that best explains the observed data.
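A small sketch of this idea using a hypothetical set of coin-bias hypotheses and a binomial likelihood; with P(D) constant and a uniform prior, picking the hypothesis with the largest P(D|h) is maximum likelihood estimation:

```python
# Hypothetical example: three candidate coin-bias hypotheses and observed
# data of 7 heads out of 10 flips. With a uniform prior and constant P(D),
# maximizing P(D|h) selects the same hypothesis as maximizing P(h|D).
from math import comb

heads, flips = 7, 10
hypotheses = {"h1": 0.3, "h2": 0.5, "h3": 0.7}  # assumed coin biases

def likelihood(p):
    # P(D|h): binomial probability of the observed heads given bias p
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

best = max(hypotheses, key=lambda h: likelihood(hypotheses[h]))
print(best)  # 'h3' best explains the observed data
```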
A popular replacement for maximizing the likelihood is maximizing the ____of the parameters instead. P 160
Bayesian posterior probability density
The maximum likelihood hypothesis might not be the maximum a posteriori (MAP), but if one assumes ____over the hypotheses then it is. P 162
uniform prior probabilities
Which is better, MLE (Maximum Likelihood Estimation) or MAP (Maximum A Posteriori)? P 162
One framework is not better than the other; as mentioned, in many cases both frame the same optimization problem from different perspectives. MAP is appropriate for problems where there is some prior information, e.g. where a meaningful prior can be set to weight the choice of different distributions and their parameters (or model parameters). MLE is more appropriate where there is no such prior.
For MAP the optimization problem is: max h ∈ H P(h|D) = P(D|h) × P(h). If the hypothesis prior P(h) is not uniform, then it plays a role in the optimization.
In fact, if we assume that all values of θ (all the different hypotheses) are equally likely because we don't have any prior information (i.e. a uniform prior), then both calculations are equivalent. Because of this equivalence, MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the MLE and MAP optimization problems differ, the MLE and MAP solutions found for an algorithm may also differ. P 162
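A toy illustration (made-up numbers) of that equivalence, and of how a non-uniform prior can change the MAP choice:

```python
# Sketch: with a uniform prior, MLE and MAP pick the same hypothesis;
# a non-uniform prior can shift the MAP choice.
likelihoods = {"h1": 0.20, "h2": 0.35, "h3": 0.30}       # P(D|h)
uniform_prior = {h: 1 / 3 for h in likelihoods}          # P(h) uniform
informative_prior = {"h1": 0.1, "h2": 0.2, "h3": 0.7}    # assumed prior

def map_choice(prior):
    # max over h of P(D|h) * P(h)
    return max(likelihoods, key=lambda h: likelihoods[h] * prior[h])

mle_choice = max(likelihoods, key=likelihoods.get)
print(mle_choice)                      # 'h2'
print(map_choice(uniform_prior))       # 'h2' -- same as MLE
print(map_choice(informative_prior))   # 'h3' -- the prior changes the answer
```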
In MAP we calculate a point estimate such as a moment of the distribution, like the mode, the most common value (which is the same as the mean for the normal distribution). True/False, why? P 162
True. We are not calculating the full posterior probability distribution; instead, we calculate a point estimate such as a moment of the distribution.
MAP: max P(X|θ) × P(θ), which is equivalent to max h ∈ H P(h|D) = P(D|h) × P(h)
Note: often estimating the density is too challenging, so we’re happy with a point estimate from the target distribution, such as the mean P 160
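A rough sketch of a MAP point estimate: a grid search for the posterior mode of a coin's bias, assuming a made-up Beta(2, 2)-style prior:

```python
# Sketch: grid search for the MAP point estimate (the posterior mode) of a
# coin's bias theta, given 7 heads in 10 flips and an assumed Beta(2, 2)-like
# prior proportional to theta * (1 - theta).
thetas = [i / 100 for i in range(1, 100)]

def unnormalized_posterior(theta, heads=7, tails=3):
    likelihood = theta**heads * (1 - theta)**tails   # P(D|theta) up to a constant
    prior = theta * (1 - theta)                      # P(theta) up to a constant
    return likelihood * prior

theta_map = max(thetas, key=unnormalized_posterior)
print(theta_map)  # about 0.67 -- a single point estimate, not the full posterior
```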
The addition of the prior to the MLE can be thought of as a type of regularization of the MLE calculation. This insight allows other regularization methods (e.g. L2 norm in models that use a weighted sum of inputs) to be interpreted under a framework of MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a ____. P 162
Gaussian prior on the weights
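A quick numeric check of this connection (a sketch, not the book's derivation): the negative log of a zero-mean Gaussian prior on the weights equals an L2 penalty with λ = 1/(2σ²), plus a constant:

```python
# The negative log of a zero-mean Gaussian prior on the weights is an
# L2 penalty plus a constant, which is why L2 regularization can be read
# as MAP Bayesian inference with a Gaussian prior (lambda = 1 / (2*sigma**2)).
import math

def neg_log_gaussian_prior(weights, sigma=1.0):
    return sum(0.5 * math.log(2 * math.pi * sigma**2) + w**2 / (2 * sigma**2)
               for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w**2 for w in weights)

w = [0.5, -1.2, 2.0]                                      # made-up weights
sigma = 1.0
constant = len(w) * 0.5 * math.log(2 * math.pi * sigma**2)
print(neg_log_gaussian_prior(w, sigma) - constant)        # 2.845
print(l2_penalty(w, lam=1 / (2 * sigma**2)))              # 2.845 -- identical
```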
What is the difference between analytical and numerical solutions? External
Analytical solutions are exact; numerical solutions are approximate. For example, some differential equations cannot be solved exactly (no analytic or closed-form solution), so we must rely on numerical techniques to approximate them.
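A toy contrast (my own example, not from the book): the exact closed-form root of x² = 2 versus a numerical approximation with Newton's method:

```python
# Analytical: the exact closed-form answer. Numerical: an iterative approximation.
import math

analytic = math.sqrt(2)          # closed-form solution of x**2 = 2

x = 1.0                          # Newton's method: x <- x - f(x) / f'(x)
for _ in range(5):
    x = x - (x**2 - 2) / (2 * x)

print(analytic, x)               # the numerical value converges to the exact one
```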
What is the Bayes classifier? P 163
Bayes Classifier: Probabilistic model that makes the most probable prediction for new examples.
Specifically, the Bayes optimal classifier answers the question: What is the most probable classification of the new instance given the training data?
Are Bayes classifier and MAP (Maximum A Posteriori) the same? P 163
The Bayes classifier is different from the MAP framework, which seeks the most probable hypothesis (model). With the Bayes classifier, we are interested in making a specific prediction for a new instance.
What is the formula that Bayes optimal classifier uses to produce class probabilities for a new instance? P 163
max vj Σ hi∈H ( P(vj|hi) × P(hi|D) )
Where vj is a new instance to be classified, H is the set of hypotheses for classifying the instance, hi is a given hypothesis, P(vj|hi) is the posterior probability for vj given hypothesis hi, and P(hi|D) is the posterior probability of the hypothesis hi given the data D. Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.
Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique on average. Hence the name optimal classifier.
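A minimal sketch of the calculation with made-up posteriors over three hypotheses and two classes:

```python
# Bayes optimal classification over three hypotheses (made-up numbers).
# Each hypothesis gives P(class | new instance); hypotheses are weighted by
# their posterior P(h|D), and the class with the largest weighted sum wins.
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(hi|D)
p_class_given_h = {                                       # P(vj|hi) for vj in {pos, neg}
    "h1": {"pos": 0.9, "neg": 0.1},
    "h2": {"pos": 0.2, "neg": 0.8},
    "h3": {"pos": 0.3, "neg": 0.7},
}

def class_score(v):
    return sum(p_class_given_h[h][v] * posterior_h[h] for h in posterior_h)

prediction = max(["pos", "neg"], key=class_score)
print(class_score("pos"), class_score("neg"), prediction)  # 0.51 0.49 pos
```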
The Bayes classifier produces the lowest possible test error rate, called the ____. P 164
Bayes error rate
Because of the computational cost of the Bayes optimal classification strategy, we can instead work with direct simplifications of the approach. Two of the most commonly used approaches are sampling hypotheses with an algorithm such as ____, or using the simplifying assumptions of the ____ classifier. P 164
Gibbs sampling, Naive Bayes
Definitions: P 164
Gibbs Algorithm.
Naive Bayes.
- Randomly sample hypotheses based on their posterior probability.
- Assume that the variables in the input data are conditionally independent of each other given the class (see the sketch below).
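A minimal Naive Bayes sketch on a tiny made-up categorical dataset, illustrating the conditional-independence assumption (an illustration, not the book's code):

```python
# Naive Bayes on a toy dataset:
# P(class | x1, x2) is proportional to P(class) * P(x1 | class) * P(x2 | class).
from collections import Counter, defaultdict

data = [  # (feature1, feature2, class) -- hypothetical toy rows
    ("sunny", "hot", "no"), ("sunny", "mild", "no"),
    ("rainy", "mild", "yes"), ("rainy", "hot", "yes"), ("sunny", "mild", "yes"),
]

class_counts = Counter(row[2] for row in data)
feature_counts = defaultdict(Counter)
for x1, x2, label in data:
    feature_counts[(0, label)][x1] += 1
    feature_counts[(1, label)][x2] += 1

def predict(x1, x2):
    scores = {}
    for label, count in class_counts.items():
        prior = count / len(data)                        # P(class)
        lik1 = feature_counts[(0, label)][x1] / count    # P(x1 | class)
        lik2 = feature_counts[(1, label)][x2] / count    # P(x2 | class)
        scores[label] = prior * lik1 * lik2
    return max(scores, key=scores.get)

print(predict("sunny", "hot"))  # 'no' for this toy data
```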