Chapter 17 Bayes Theorem and Machine Learning Flashcards

1
Q

How can we frame a hypothesis and data as a bayes theorem problem? (formula) P 159

A

P(h|D) = P(D|h) × P(h) / P(D)

Breaking this down, it says that the probability of a hypothesis being true given some observed data can be calculated as the probability of observing the data given the hypothesis, multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.
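
A minimal sketch in Python, using made-up numbers purely to show how the three terms combine (these values are illustrative, not from the book):

    # Posterior = Likelihood * Prior / Evidence, with illustrative numbers.
    p_h = 0.01          # P(h): prior probability of the hypothesis
    p_d_given_h = 0.90  # P(D|h): probability of the data if the hypothesis is true
    p_d = 0.10          # P(D): probability of the data regardless of the hypothesis

    p_h_given_d = p_d_given_h * p_h / p_d  # P(h|D): posterior
    print(p_h_given_d)  # 0.09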

2
Q

Under the Bayes framework, each piece of the calculation has a specific name; for example: P 159
• P(h|D): ____
• P(h): ____

A

• P(h|D): the posterior probability of the hypothesis (the quantity we want to calculate).
• P(h): the prior probability of the hypothesis.

3
Q

What do we want to maximize when using Bayes Theorem in applied machine learning? P 159

A

The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, ... ∈ H) being true given the observed data. Under this framework, the probability of the data (D) is constant, as it is used in the assessment of every hypothesis, so it can be removed from the Bayes Theorem formula to give the simplified, unnormalized estimate:

max h∈H P(h|D) = P(D|h) × P(h)

If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform probability, in which case this term too is a constant and can be removed from the calculation to give:

max h∈H P(h|D) = P(D|h)

Because maximizing P(h|D) under these assumptions reduces to maximizing P(D|h) (a density-estimation, or likelihood, problem), the goal is to locate the hypothesis that best explains the observed data.
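
A toy sketch of this idea in Python, assuming each hypothesis is a Gaussian with fixed, made-up parameters and a uniform prior, so selecting the best hypothesis reduces to comparing likelihoods:

    import math

    # Observed data and two candidate hypotheses (mean, std); values are made up.
    data = [4.8, 5.1, 5.3, 4.9, 5.0]
    hypotheses = {"h1": (5.0, 1.0), "h2": (2.0, 1.0)}

    def log_likelihood(xs, mean, std):
        # log P(D|h) for i.i.d. Gaussian observations under hypothesis h
        return sum(-0.5 * math.log(2 * math.pi * std ** 2)
                   - (x - mean) ** 2 / (2 * std ** 2) for x in xs)

    # With a uniform prior, maximizing P(D|h) * P(h) reduces to maximizing P(D|h).
    best = max(hypotheses, key=lambda h: log_likelihood(data, *hypotheses[h]))
    print(best)  # h1: it best explains the observed data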

4
Q

A popular replacement for maximizing the likelihood is maximizing the ____ of the parameters instead. P 160

A

Bayesian posterior probability density

5
Q

The maximum likelihood hypothesis might not be the maximum a posteriori (MAP), but if one assumes ____over the hypotheses then it is. P 162

A

uniform prior probabilities

6
Q

Which is better, MLE (Maximum Likelihood Estimation) or MAP (Maximum A Posteriori)? P 162

A

One framework is not better than another, and as mentioned, in many cases, both frameworks frame the same optimization problem from different perspectives. Instead, MAP is appropriate for those problems where there is some prior information, e.g. where a meaningful prior can be set to weight the choice of different distributions and parameters or model parameters. MLE is more appropriate where there is no such prior.

For MAP the optimization problem is max h∈H P(h|D) = P(D|h) × P(h). If the hypothesis prior P(h) is not uniform, then it plays a role in the optimization.

In fact, if we assume that all values of θ (all the different hypotheses) are equally likely because we don't have any prior information (e.g. a uniform prior), then both calculations are equivalent. Because of this equivalence, MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the MLE and MAP optimization problems differ, the MLE and MAP solutions found for an algorithm may also differ. P 162
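
A small illustrative sketch (a coin-flip model with a Beta prior, not an example from the book) showing that MAP with a uniform prior gives the same estimate as MLE, while an informative prior changes it:

    def mle_bernoulli(heads, n):
        # Maximum likelihood estimate of theta: argmax P(D|theta)
        return heads / n

    def map_bernoulli(heads, n, a, b):
        # MAP estimate with a Beta(a, b) prior: mode of the posterior Beta(heads + a, n - heads + b)
        return (heads + a - 1) / (n + a + b - 2)

    heads, n = 7, 10
    print(mle_bernoulli(heads, n))        # 0.7
    print(map_bernoulli(heads, n, 1, 1))  # 0.7   (uniform prior: MAP == MLE)
    print(map_bernoulli(heads, n, 5, 5))  # 0.611 (informative prior pulls the estimate toward 0.5)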

7
Q

In MAP, we calculate a point estimate such as a moment of the distribution, like the mode, the most common value (which is the same as the mean for the normal distribution). True/False, why? P 162

A

True. We are not calculating the full posterior probability distribution; instead, we calculate a point estimate such as a moment of the distribution.

MAP: max P(X|θ) × P(θ), which is equivalent to max h∈H P(h|D) = P(D|h) × P(h)

Note: often estimating the full density is too challenging, so we are happy with a point estimate from the target distribution, such as the mean. P 160
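
A rough sketch of this point-estimate idea, using made-up counts and a hypothetical Beta(2, 2) prior: the unnormalized posterior P(D|θ) × P(θ) is evaluated on a grid and only its mode is kept, so the full posterior (and P(D)) is never computed:

    # Made-up coin-flip data
    heads, tails = 7, 3
    grid = [i / 1000 for i in range(1, 1000)]

    def unnormalized_posterior(theta):
        likelihood = theta ** heads * (1 - theta) ** tails  # P(D|theta)
        prior = theta * (1 - theta)                         # Beta(2, 2) prior, up to a constant
        return likelihood * prior                           # P(D) is ignored

    theta_map = max(grid, key=unnormalized_posterior)       # point estimate: the mode
    print(theta_map)  # ~0.667, the mode of the Beta(9, 5) posterior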

8
Q

The addition of the prior to the MLE can be thought of as a type of regularization of the MLE calculation. This insight allows other regularization methods (e.g. L2 norm in models that use a weighted sum of inputs) to be interpreted under a framework of MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a ____. P 162

A

Gaussian prior on the weights
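
A minimal sketch of this equivalence, assuming a one-parameter linear model with Gaussian noise and a Gaussian prior on the weight (all values made up): minimizing the negative log posterior is the same as minimizing squared error plus an L2 penalty:

    # MAP objective for a linear model: squared error (from the Gaussian likelihood)
    # plus an L2 penalty (from the Gaussian prior on the weight), i.e. ridge regression.
    def neg_log_posterior(w, xs, ys, noise_var=1.0, prior_var=10.0):
        sq_err = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_var)  # -log P(D|w)
        l2_penalty = w ** 2 / (2 * prior_var)                                     # -log P(w)
        return sq_err + l2_penalty

    xs, ys = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
    grid = [i / 1000 for i in range(-2000, 2001)]
    w_map = min(grid, key=lambda w: neg_log_posterior(w, xs, ys))
    print(w_map)  # close to the least-squares slope, shrunk slightly toward 0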

9
Q

What is the difference between analytical and numerical solutions? External

A

Analytical is exact; numerical is approximate. For example, some differential equations cannot be solved exactly (they have no analytic or closed-form solution), and we must rely on numerical techniques to solve them.
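
A small illustration (not from the book): solving x^2 = 2 exactly versus approximating the root numerically by bisection:

    import math

    # Analytical solution: the exact closed form x = sqrt(2).
    x_exact = math.sqrt(2)

    # Numerical solution: bisection, approximate to a chosen tolerance.
    lo, hi = 0.0, 2.0
    while hi - lo > 1e-10:
        mid = (lo + hi) / 2
        if mid * mid < 2:
            lo = mid
        else:
            hi = mid

    print(x_exact)        # 1.4142135623730951
    print((lo + hi) / 2)  # approximately the same, accurate only to the tolerance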

10
Q

What’s Bayes classifier? P 163

A

Bayes Classifier: Probabilistic model that makes the most probable prediction for new examples.

Specifically, the Bayes optimal classifier answers the question: What is the most probable classification of the new instance given the training data?

11
Q

Are Bayes classifier and MAP (Maximum A Posteriori) the same? P 163

A

The Bayes classifier is different from the MAP framework, which seeks the most probable hypothesis (model). With the Bayes classifier, we are interested in making a specific prediction for a new instance.

12
Q

What is the formula that Bayes optimal classifier uses to produce class probabilities for a new instance? P 163

A

max Σ hi∈H P(vj|hi) × P(hi|D)
Where vj is a new instance to be classified, H is the set of hypotheses for classifying the instance, hi is a given hypothesis, P(vj|hi) is the posterior probability of vj given hypothesis hi, and P(hi|D) is the posterior probability of the hypothesis hi given the data D. Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.

Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique on average. Hence the name optimal classifier.
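
A toy sketch of this rule with made-up probabilities for three hypotheses and two classes: each hypothesis votes with its class probabilities, weighted by its posterior given the data:

    # P(hi|D): posterior probability of each hypothesis (made-up values)
    p_h_given_d = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    # P(vj|hi): class probabilities for the new instance under each hypothesis (made up)
    p_v_given_h = {
        "h1": {"yes": 0.9, "no": 0.1},
        "h2": {"yes": 0.2, "no": 0.8},
        "h3": {"yes": 0.3, "no": 0.7},
    }

    scores = {v: sum(p_v_given_h[h][v] * p_h_given_d[h] for h in p_h_given_d)
              for v in ["yes", "no"]}
    print(scores)                       # {'yes': 0.51, 'no': 0.49}
    print(max(scores, key=scores.get))  # 'yes' -> the Bayes optimal classification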

13
Q

The Bayes classifier produces the lowest possible test error rate, called the ____. P 164

A

Bayes error rate

14
Q

Because of the computational cost of the Bayes optimal classification strategy, we can instead work with direct simplifications of the approach. Two of the most commonly used approaches are using a sampling algorithm for hypotheses, such as ____, or using the simplifying assumptions of the ____ classifier. P 164

A

Gibbs sampling, Naive Bayes

15
Q

Definitions: P 164
• Gibbs Algorithm.
• Naive Bayes.

A
  • Randomly sample hypotheses based on their posterior probability.
  • Assume that variables in the input data are conditionally independent.
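
A minimal sketch of both ideas in Python, with made-up probabilities (all names and numbers are illustrative only):

    import random

    # Gibbs algorithm (sketch): rather than averaging over all hypotheses,
    # draw one hypothesis at random in proportion to its posterior probability.
    posteriors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}  # P(hi|D), made up
    sampled_h = random.choices(list(posteriors), weights=posteriors.values())[0]

    # Naive Bayes (sketch): assume the input variables are conditionally
    # independent given the class, so the joint likelihood factorizes.
    p_x1_given_class = 0.8  # P(x1|class), made up
    p_x2_given_class = 0.6  # P(x2|class), made up
    p_class = 0.4           # P(class), made up
    score = p_class * p_x1_given_class * p_x2_given_class  # proportional to P(class|x1, x2)
    print(sampled_h, score)
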
16
Q

Many nonlinear machine learning algorithms are able to make predictions that are close approximations of the Bayes classifier in practice. Despite the fact that it is a very simple approach, ____ can often produce classifiers that are surprisingly close to the optimal Bayes classifier. P 164

A

KNN (k-nearest neighbors)