Chapter 17 Bayes Theorem and Machine Learning Flashcards
How can we frame a hypothesis and data as a Bayes Theorem problem? (formula) P 159
P(h|D) = P(D|h) × P(h) / P(D)
Breaking this down: the probability of a hypothesis being true given some observed data can be calculated as the probability of observing the data given the hypothesis, multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.
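A minimal numeric sketch (made-up probabilities, not from the book) showing how the three terms combine:

```python
# Hypothetical numbers: a hypothesis h with prior 0.3, and data that is
# twice as likely under h as under not-h.
p_h = 0.3                      # P(h): prior probability of the hypothesis
p_d_given_h = 0.8              # P(D|h): likelihood of the data under h
p_d_given_not_h = 0.4          # P(D|not h)

# P(D): total probability of the data, marginalised over h and not-h
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes Theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # 0.24 / 0.52 ≈ 0.462
```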
Under the Bayes framework, each piece of the calculation has a specific name; for example: P 159
P(h|D):____
P(h): ____
Posterior probability of the hypothesis (the thing we want to calculate), Prior probability of the hypothesis.
What do we want to maximize when using Bayes Theorem in applied machine learning? P 159
The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, ... ∈ H) being true given the observed data. Under this framework, the probability of the data (D) is constant as it is used in the assessment of each hypothesis. Therefore, it can be removed from the calculation (of the Bayes Theorem formula) to give the simplified unnormalized estimate as follows:
max h ∈ H P(h|D) = P(D|h) × P(h)
If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform probability, and this term too becomes a constant and can be removed from the calculation to give the following:
max h ∈ H P(h|D) = P(D|h)
Under these simplifications, maximizing P(h|D) reduces to maximizing P(D|h) (a density estimation problem), so the goal is to locate the hypothesis that best explains the observed data.
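A small sketch of this idea using a hypothetical set of coin-bias hypotheses and a binomial likelihood; with P(D) constant and a uniform prior, picking the hypothesis with the largest P(D|h) is maximum likelihood estimation:

```python
# Hypothetical example: three candidate coin-bias hypotheses and observed
# data of 7 heads out of 10 flips. With a uniform prior and constant P(D),
# maximizing P(D|h) selects the same hypothesis as maximizing P(h|D).
from math import comb

heads, flips = 7, 10
hypotheses = {"h1": 0.3, "h2": 0.5, "h3": 0.7}  # assumed coin biases

def likelihood(p):
    # P(D|h): binomial probability of the observed heads given bias p
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

best = max(hypotheses, key=lambda h: likelihood(hypotheses[h]))
print(best)  # 'h3' best explains the observed data
```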
A popular replacement for maximizing the likelihood is maximizing the ____of the parameters instead. P 160
Bayesian posterior probability density
The maximum likelihood hypothesis might not be the maximum a posteriori (MAP), but if one assumes ____over the hypotheses then it is. P 162
uniform prior probabilities
Which is better, MLE (Maximum Likelihood Estimation) or MAP (Maximum A Posteriori)? P 162
One framework is not better than the other; as mentioned, in many cases both frame the same optimization problem from different perspectives. MAP is appropriate for problems where there is some prior information, e.g. where a meaningful prior can be set to weight the choice of different distributions and their parameters (or model parameters). MLE is more appropriate where there is no such prior.
For MAP the optimization problem is: max h ∈ H P(h|D) = P(D|h) × P(h). If the hypothesis prior P(h) is not uniform, then it plays a role in the optimization.
In fact, if we assume that all values of θ (all the different hypotheses) are equally likely because we don't have any prior information (i.e. a uniform prior), then both calculations are equivalent. Because of this equivalence, MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the MLE and MAP optimization problems differ, the MLE and MAP solutions found for an algorithm may also differ. P 162
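A toy illustration (made-up numbers) of that equivalence, and of how a non-uniform prior can change the MAP choice:

```python
# Sketch: with a uniform prior, MLE and MAP pick the same hypothesis;
# a non-uniform prior can shift the MAP choice.
likelihoods = {"h1": 0.20, "h2": 0.35, "h3": 0.30}       # P(D|h)
uniform_prior = {h: 1 / 3 for h in likelihoods}          # P(h) uniform
informative_prior = {"h1": 0.1, "h2": 0.2, "h3": 0.7}    # assumed prior

def map_choice(prior):
    # max over h of P(D|h) * P(h)
    return max(likelihoods, key=lambda h: likelihoods[h] * prior[h])

mle_choice = max(likelihoods, key=likelihoods.get)
print(mle_choice)                      # 'h2'
print(map_choice(uniform_prior))       # 'h2' -- same as MLE
print(map_choice(informative_prior))   # 'h3' -- the prior changes the answer
```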
In MAP we calculate a point estimate such as a moment of the distribution, like the mode, the most common value (which is the same as the mean for the normal distribution). True/False, why? P 162
True. We are not calculating the full posterior probability distribution; instead, we calculate a point estimate such as a moment of the distribution.
MAP: max P(X|θ) × P(θ), which is equivalent to max h ∈ H P(h|D) = P(D|h) × P(h)
Note: often estimating the density is too challenging, so we’re happy with a point estimate from the target distribution, such as the mean P 160
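A rough sketch of a MAP point estimate: a grid search for the posterior mode of a coin's bias, assuming a made-up Beta(2, 2)-style prior:

```python
# Sketch: grid search for the MAP point estimate (the posterior mode) of a
# coin's bias theta, given 7 heads in 10 flips and an assumed Beta(2, 2)-like
# prior proportional to theta * (1 - theta).
thetas = [i / 100 for i in range(1, 100)]

def unnormalized_posterior(theta, heads=7, tails=3):
    likelihood = theta**heads * (1 - theta)**tails   # P(D|theta) up to a constant
    prior = theta * (1 - theta)                      # P(theta) up to a constant
    return likelihood * prior

theta_map = max(thetas, key=unnormalized_posterior)
print(theta_map)  # about 0.67 -- a single point estimate, not the full posterior
```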
The addition of the prior to the MLE can be thought of as a type of regularization of the MLE calculation. This insight allows other regularization methods (e.g. L2 norm in models that use a weighted sum of inputs) to be interpreted under a framework of MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a ____. P 162
Gaussian prior on the weights
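A quick numeric check of this connection (a sketch, not the book's derivation): the negative log of a zero-mean Gaussian prior on the weights equals an L2 penalty with λ = 1/(2σ²), plus a constant:

```python
# The negative log of a zero-mean Gaussian prior on the weights is an
# L2 penalty plus a constant, which is why L2 regularization can be read
# as MAP Bayesian inference with a Gaussian prior (lambda = 1 / (2*sigma**2)).
import math

def neg_log_gaussian_prior(weights, sigma=1.0):
    return sum(0.5 * math.log(2 * math.pi * sigma**2) + w**2 / (2 * sigma**2)
               for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w**2 for w in weights)

w = [0.5, -1.2, 2.0]                                      # made-up weights
sigma = 1.0
constant = len(w) * 0.5 * math.log(2 * math.pi * sigma**2)
print(neg_log_gaussian_prior(w, sigma) - constant)        # 2.845
print(l2_penalty(w, lam=1 / (2 * sigma**2)))              # 2.845 -- identical
```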
What is the difference between analytical and numerical solutions? External
Analytical solutions are exact; numerical solutions are approximate. For example, some differential equations cannot be solved exactly (no analytic or closed-form solution), so we must rely on numerical techniques to approximate them.
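A toy contrast (my own example, not from the book): the exact closed-form root of x² = 2 versus a numerical approximation with Newton's method:

```python
# Analytical: the exact closed-form answer. Numerical: an iterative approximation.
import math

analytic = math.sqrt(2)          # closed-form solution of x**2 = 2

x = 1.0                          # Newton's method: x <- x - f(x) / f'(x)
for _ in range(5):
    x = x - (x**2 - 2) / (2 * x)

print(analytic, x)               # the numerical value converges to the exact one
```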
What is the Bayes classifier? P 163
Bayes Classifier: Probabilistic model that makes the most probable prediction for new examples.
Specifically, the Bayes optimal classifier answers the question: What is the most probable classification of the new instance given the training data?
Are Bayes classifier and MAP (Maximum A Posteriori) the same? P 163
The Bayes classifier is different from the MAP framework, which seeks the most probable hypothesis (model). With the Bayes classifier, we are interested in making a specific prediction for a new instance.
What is the formula that Bayes optimal classifier uses to produce class probabilities for a new instance? P 163
max vj Σ hi∈H ( P(vj|hi) × P(hi|D) )
Where vj is a new instance to be classified, H is the set of hypotheses for classifying the instance, hi is a given hypothesis, P(vj|hi) is the posterior probability for vj given hypothesis hi, and P(hi|D) is the posterior probability of the hypothesis hi given the data D. Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.
Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique on average. Hence the name optimal classifier.
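A minimal sketch of the calculation with made-up posteriors over three hypotheses and two classes:

```python
# Bayes optimal classification over three hypotheses (made-up numbers).
# Each hypothesis gives P(class | new instance); hypotheses are weighted by
# their posterior P(h|D), and the class with the largest weighted sum wins.
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(hi|D)
p_class_given_h = {                                       # P(vj|hi) for vj in {pos, neg}
    "h1": {"pos": 0.9, "neg": 0.1},
    "h2": {"pos": 0.2, "neg": 0.8},
    "h3": {"pos": 0.3, "neg": 0.7},
}

def class_score(v):
    return sum(p_class_given_h[h][v] * posterior_h[h] for h in posterior_h)

prediction = max(["pos", "neg"], key=class_score)
print(class_score("pos"), class_score("neg"), prediction)  # 0.51 0.49 pos
```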
The Bayes classifier produces the lowest possible test error rate, called the ____. P 164
Bayes error rate
Because of the computational cost of the Bayes optimal classification strategy, we can instead work with direct simplifications of the approach. Two of the most commonly used approaches are sampling hypotheses with an algorithm such as ____, or using the simplifying assumptions of the ____ classifier. P 164
Gibbs sampling, Naive Bayes
Definitions: P 164
Gibbs Algorithm.
Naive Bayes.
- Randomly sample hypotheses based on their posterior probability.
- Assume that the variables in the input data are conditionally independent of each other given the class (see the sketch below).
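A minimal Naive Bayes sketch on a tiny made-up categorical dataset, illustrating the conditional-independence assumption (an illustration, not the book's code):

```python
# Naive Bayes on a toy dataset:
# P(class | x1, x2) is proportional to P(class) * P(x1 | class) * P(x2 | class).
from collections import Counter, defaultdict

data = [  # (feature1, feature2, class) -- hypothetical toy rows
    ("sunny", "hot", "no"), ("sunny", "mild", "no"),
    ("rainy", "mild", "yes"), ("rainy", "hot", "yes"), ("sunny", "mild", "yes"),
]

class_counts = Counter(row[2] for row in data)
feature_counts = defaultdict(Counter)
for x1, x2, label in data:
    feature_counts[(0, label)][x1] += 1
    feature_counts[(1, label)][x2] += 1

def predict(x1, x2):
    scores = {}
    for label, count in class_counts.items():
        prior = count / len(data)                        # P(class)
        lik1 = feature_counts[(0, label)][x1] / count    # P(x1 | class)
        lik2 = feature_counts[(1, label)][x2] / count    # P(x2 | class)
        scores[label] = prior * lik1 * lik2
    return max(scores, key=scores.get)

print(predict("sunny", "hot"))  # 'no' for this toy data
```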