Course session 2+3 (Bayesian decision theory) + (Parametric and nonparametric methods) Flashcards
Explain Bayes’ theorem and the formula for it
Bayes’ theorem determines the conditional probability of an event based on prior knowledge of conditions that might be related to that event. It converts a prior probability into a posterior probability by incorporating the observed data.
p(C_i | x) = [p(x | C_i) * p(C_i)] / p(x)
- p(C_i | x) is the posterior probability of class C_i given the observation x.
- p(x | C_i) is the likelihood: the probability of observing x as the input, given that it belongs to class C_i, i = 1, …, K.
- p(C_i) is the prior probability of class C_i.
- p(x) is the evidence, a normalization constant ensuring that the posterior probabilities p(C_i | x) sum to one over all classes.
What does the Bayes’ classifier do?
To minimize error, the Bayes’ classifier chooses the class with the highest posterior probability, i.e.,
choose C_i if p(C_i | x) = max_k [p(C_k | x)].
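A minimal sketch of this decision rule in Python; the likelihood and prior values below are hypothetical numbers chosen purely for illustration:

```python
import numpy as np

# Hypothetical likelihoods p(x | C_i) for one observation x and three classes,
# together with class priors p(C_i); the numbers are made up for illustration.
likelihoods = np.array([0.20, 0.05, 0.10])   # p(x | C_1), p(x | C_2), p(x | C_3)
priors      = np.array([0.50, 0.30, 0.20])   # p(C_1), p(C_2), p(C_3)

# Bayes' theorem: posterior is likelihood * prior, normalized by the evidence p(x).
evidence   = np.sum(likelihoods * priors)     # p(x)
posteriors = likelihoods * priors / evidence  # p(C_i | x)

# Bayes' classifier: choose the class with the highest posterior.
chosen = np.argmax(posteriors)
print(posteriors, "-> choose class", chosen + 1)
```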
What are the different models for density estimation?
Parametric: Assume a single model for p(x | C_i), e.g., a Gaussian density.
Semiparametric: p(x | C_i) is a mixture of densities, e.g., a Gaussian mixture model (GMM).
* E.g., different phonemes in speech.
* This is also a clustering problem.
Nonparametric: No model; the data speaks for itself, e.g., the histogram estimator.
How does parametric estimation work?
Assume an independent and identically distributed (iid) sample X = {x^t}_{t=1}^N, where x^t ~ p(x).
Assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X, e.g., N(μ, σ²) where θ = {μ, σ²}. This is useful for models with a small number of parameters.
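A small sketch in Python, assuming a Gaussian form N(μ, σ²): θ = {μ, σ²} is estimated by the sample mean and variance (the maximum likelihood estimates); the data is randomly generated here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)   # iid sample x^t ~ p(x), here Gaussian

# Assume p(x | theta) = N(mu, sigma^2) and estimate theta = {mu, sigma^2}
# from the sufficient statistics of the sample (maximum likelihood estimates).
mu_hat     = X.mean()
sigma2_hat = X.var()        # ML estimate divides by N (not N - 1)

print("mu_hat =", mu_hat, "sigma2_hat =", sigma2_hat)
```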
How do nonparametric methods work?
They make fewer assumptions about the data and are therefore more flexible. Nonparametric models still contain parameters, but these control the model complexity rather than the form of the distribution. The rationale is that similar inputs have similar outputs; the data is left to speak for itself.
What type of density estimation is a histogram?
It is a nonparametric method.
How are histograms defined and what are the two possible ways of making a histogram?
For N the number of observations, K the number of points that lie inside a region R around x, and V the volume of R,
p(x) = K / (N V)
The kernel approach fixes V and determines K from the data.
The K-nearest neighbour method fixes K and determines the value of V from the data.
How does the kernel approach for making histograms work?
Given a training set X = {x^t}_{t=1}^N drawn iid from p(x), divide the data into bins of size h; then
p(x) = #{x^t in the same bin as x} / (N h)
where N is the total number of observations.
With an origin x_0, the fixed bins are [x_0 + m h, x_0 + (m + 1) h) for integer m.
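A minimal sketch of this estimator in Python; the function name, the bin origin x0 = 0, the bin width h = 0.5 and the generated data are all illustrative choices:

```python
import numpy as np

def histogram_estimate(x, X, x0=0.0, h=0.5):
    """Histogram density estimate p_hat(x) = #{x^t in the same bin as x} / (N h).

    x0 is the bin origin and h the bin width; bins are [x0 + m*h, x0 + (m+1)*h).
    """
    N = len(X)
    m = np.floor((x - x0) / h)                 # index of the bin containing x
    in_same_bin = np.floor((X - x0) / h) == m  # training points sharing that bin
    return np.sum(in_same_bin) / (N * h)

rng = np.random.default_rng(1)
X = rng.normal(size=500)                       # illustrative training sample
print(histogram_estimate(0.2, X), histogram_estimate(3.0, X))
```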
How does the K-nearest neighbour method for making histograms work?
Fix the number of nearest instances (neighbours) k and let the bin width adapt to the data:
p(x) = k / (2 N d_k(x))
where d_k(x) is the distance from x to its k-th closest instance.
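A minimal one-dimensional sketch in Python; the function name, k = 10 and the generated data are illustrative choices:

```python
import numpy as np

def knn_density_estimate(x, X, k=10):
    """k-NN density estimate p_hat(x) = k / (2 * N * d_k(x)) in one dimension,
    where d_k(x) is the distance from x to its k-th closest training instance."""
    N = len(X)
    d_k = np.sort(np.abs(X - x))[k - 1]   # distance to the k-th nearest neighbour
    return k / (2 * N * d_k)

rng = np.random.default_rng(2)
X = rng.normal(size=500)                  # illustrative training sample
print(knn_density_estimate(0.0, X, k=10), knn_density_estimate(2.5, X, k=10))
```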
What is the K-nearest neighbour estimator?
It is a form of nonparametric classification.
Find the k training examples (x_1, y_1), …, (x_k, y_k) that are closest to the test example x, then predict the most frequent class among those y_i's.
The class-conditional density estimate is
p(x | C_i) = k_i / (N_i V^k(x))
where V^k(x) is the volume of the region around x containing the k nearest neighbours, k_i of which belong to class C_i, and N_i is the number of training examples in class C_i. The posterior is then
p(C_i | x) = [p(x | C_i) * p(C_i)] / p(x) = k_i / k
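A minimal sketch of k-NN classification in Python using Euclidean distance; the function name, k and the toy data set are illustrative choices:

```python
import numpy as np

def knn_classify(x, X, y, k=5):
    """Predict the most frequent class among the k training examples closest to x,
    which amounts to maximizing the k-NN posterior estimate k_i / k."""
    distances = np.linalg.norm(X - x, axis=1)  # distance to every training point
    nearest   = np.argsort(distances)[:k]      # indices of the k nearest neighbours
    counts    = np.bincount(y[nearest])        # k_i: neighbours per class
    return np.argmax(counts), counts / k       # predicted class, posterior estimates

# Tiny two-class toy set in 2D, made up for illustration.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.2, 0.2]), X, y, k=3))
```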
What is a Voronoi diagram?
It partitions the input space into cells, one per training point, where each cell contains all points that are closer to that training point than to any other. Colouring each cell by the class of its training point makes the diagram very useful for visualizing 1-nearest-neighbour classification.
How do you choose k or h in the k-nearest neighbor or kernel approach?
When k or h is small, single instances matter; bias is small, variance is large (undersmoothing): High complexity
As k or h increases, we average over more instances; variance decreases but bias increases (oversmoothing): Low complexity
The value of k or h is fine-tuned using cross-validation.
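A rough sketch of choosing k for a k-NN classifier by cross-validation; the fold count, the candidate values of k, the helper function and the toy data are all illustrative assumptions:

```python
import numpy as np

def cv_error_for_k(X, y, k, n_folds=5, seed=0):
    """Estimate the misclassification rate of a k-NN classifier by n-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    errors = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)          # everything outside the held-out fold
        wrong = 0
        for i in fold:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            nearest = train[np.argsort(d)[:k]]
            pred = np.argmax(np.bincount(y[nearest]))
            wrong += pred != y[i]
        errors.append(wrong / len(fold))
    return np.mean(errors)

# Pick the k with the lowest cross-validated error (toy data, for illustration only).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)
best_k = min([1, 3, 5, 7, 9], key=lambda k: cv_error_for_k(X, y, k))
print("best k:", best_k)
```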
What is Bayesian estimation and what is it used for?
Bayesian estimation can provide powerful insights into over-fitting and practical techniques for addressing the model complexity issue. This is done by
- treating θ as a random variable with prior density p(θ),
- using the sample data to define the likelihood density p(X | θ),
- computing the posterior density of θ: p(θ | X) = p(X | θ) p(θ) / p(X)
Parameter estimation
* Maximum likelihood (ML): θ_{ML} = argmax_θ p(X | θ)
* Maximum a posteriori (MAP): θ_{MAP} = argmax_θ p(θ | X)
The ML estimator is the MAP estimator under a uniform prior.
- Bayesian estimate (the posterior expectation):
θ_{Bayes} = E[θ | X] = ∫ θ p(θ | X) dθ
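A small worked sketch for a Gaussian likelihood with known variance and a Gaussian (conjugate) prior on the mean; the prior parameters and the data are illustrative. Because the posterior is again Gaussian, its mode and mean coincide, so the MAP and Bayesian estimates agree here:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                                   # known variance of p(x | mu)
X = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=20)
N, xbar = len(X), X.mean()

# Prior p(mu) = N(mu0, sigma0^2); values chosen only for illustration.
mu0, sigma0_2 = 0.0, 0.5

# ML estimate: maximizes p(X | mu), i.e. the sample mean.
mu_ml = xbar

# Posterior p(mu | X) is Gaussian (conjugate prior), with mean
mu_post = (N / sigma2 * xbar + mu0 / sigma0_2) / (N / sigma2 + 1 / sigma0_2)

# Because the posterior is Gaussian, its mode equals its mean:
# MAP estimate = Bayesian estimate E[mu | X] = mu_post.
print("ML:", mu_ml, " MAP = Bayes:", mu_post)
```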
Explain the differences and similarities between maximum likelihood and maximum a posteriori.
Maximum likelihood (ML) is an estimator, used when no prior distribution is available, in which θ is set to the value that maximizes the likelihood function l(θ | X) = p(X | θ):
* θ_{ML} = argmax_θ p(X | θ)
Maximum a posteriori (MAP) is an estimator in which θ is determined by maximizing the posterior distribution p(θ | X):
* θ_{MAP} = argmax_θ p(θ | X)
The maximum likelihood estimator is the maximum a posteriori estimator under a uniform prior, since a constant prior does not change where the posterior attains its maximum.
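A tiny numerical check of this last statement, assuming a Bernoulli likelihood with a Beta(a, b) prior on the success probability (a = b = 1 gives a uniform prior); the data is made up:

```python
import numpy as np

# Bernoulli likelihood with a Beta(a, b) prior on the success probability theta.
# With a = b = 1 the prior is uniform, and the MAP estimate reduces to the ML estimate.
X = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # made-up coin-flip data
N, s = len(X), X.sum()

theta_ml = s / N                              # argmax_theta p(X | theta)
a, b = 1, 1                                   # uniform prior
theta_map = (s + a - 1) / (N + a + b - 2)     # argmax_theta p(theta | X)
print(theta_ml == theta_map)                  # True: ML equals MAP under a uniform prior
```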
What are the advantages and disadvantages of Bayesian estimation?
Advantage: The inclusion of prior knowledge (a reasonable prior) leads to much less extreme conclusions, especially when the amount of training data is small.
Disadvantage: The need to marginalize (sum or integrate) over the whole parameter space, e.g., in
θ_{Bayes} = E[θ | X] = ∫ θ p(θ | X) dθ