Course session 2+3 (Bayesian decision theory) + (Parametric and nonparametric methods) Flashcards
Explain Bayes’ theorem and the formula for it
Bayes’ theorem determines the conditional probability of an event based on prior knowledge of conditions that might be related to that event. It converts a prior probability into a posterior probability by incorporating the observed data.
p(C_i | x) = [p(x | C_i) * p(C_i)] / p(x)
- p(C_i | x) is the posterior probability of class C_i given the observation x.
- p(x | C_i) is the likelihood: the probability of observing x as the input, given that it belongs to class C_i, i = 1, …, K.
- p(C_i) is the prior probability of class C_i.
- p(x) is the evidence, a normalization constant ensuring that the posterior probabilities p(C_i | x) sum to one over all classes.
What does the Bayes’ classifier do?
To minimize error, the Bayes’ classifier chooses the class with the highest posterior probability, i.e.,
choose C_i if p(C_i | x) = max_k [p(C_k | x)].
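A minimal sketch of this decision rule in Python; the likelihood and prior values below are hypothetical numbers chosen purely for illustration:

```python
import numpy as np

# Hypothetical likelihoods p(x | C_i) for one observation x and three classes,
# together with class priors p(C_i); the numbers are made up for illustration.
likelihoods = np.array([0.20, 0.05, 0.10])   # p(x | C_1), p(x | C_2), p(x | C_3)
priors      = np.array([0.50, 0.30, 0.20])   # p(C_1), p(C_2), p(C_3)

# Bayes' theorem: posterior is likelihood * prior, normalized by the evidence p(x).
evidence   = np.sum(likelihoods * priors)     # p(x)
posteriors = likelihoods * priors / evidence  # p(C_i | x)

# Bayes' classifier: choose the class with the highest posterior.
chosen = np.argmax(posteriors)
print(posteriors, "-> choose class", chosen + 1)
```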
What are the different models for density estimation?
Parametric: Assume a single model for p(x | C_i), e.g., a Gaussian density.
Semiparametric: p(x | C_i) is a mixture of densities, e.g., a Gaussian mixture model (GMM).
* E.g., different phonemes in speech.
* This is also a clustering problem.
Nonparametric: No model; the data speaks for itself, e.g., the histogram estimator.
How does parametric estimation work?
Assume an independent and identically distributed (iid) sample X = {x^t}_{t=1}^N, where x^t ~ p(x).
Assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X, e.g., N(μ, σ²) where θ = {μ, σ²}. This is useful for models with a small number of parameters.
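A small sketch in Python, assuming a Gaussian form N(μ, σ²): θ = {μ, σ²} is estimated by the sample mean and variance (the maximum likelihood estimates); the data is randomly generated here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)   # iid sample x^t ~ p(x), here Gaussian

# Assume p(x | theta) = N(mu, sigma^2) and estimate theta = {mu, sigma^2}
# from the sufficient statistics of the sample (maximum likelihood estimates).
mu_hat     = X.mean()
sigma2_hat = X.var()        # ML estimate divides by N (not N - 1)

print("mu_hat =", mu_hat, "sigma2_hat =", sigma2_hat)
```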
How do nonparametric methods work?
They make fewer assumptions about the data and are therefore more flexible. Nonparametric models still contain parameters, but these control the model complexity rather than the form of the distribution. The rationale is that similar inputs have similar outputs; the data is left to speak for itself.
What type of density estimation is a histogram?
It is a nonparametric method.
How are histograms defined and what are the two possible ways of making a histogram?
For N the number of observations, K the number of points that lie inside a region R around x, and V the volume of R,
p(x) = K / (N V)
The kernel approach fixes V and determines K from the data.
The K-nearest neighbour method fixes K and determines the value of V from the data.
How does the kernel approach for making histograms work?
Given a training set X = {x^t}_{t=1}^N drawn iid from p(x), divide the data into bins of size h; then
p(x) = #{x^t in the same bin as x} / (N h)
where N is the total number of observations.
With an origin x_0, the fixed bins are [x_0 + m h, x_0 + (m + 1) h) for integer m.
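A minimal sketch of this estimator in Python; the function name, the bin origin x0 = 0, the bin width h = 0.5 and the generated data are all illustrative choices:

```python
import numpy as np

def histogram_estimate(x, X, x0=0.0, h=0.5):
    """Histogram density estimate p_hat(x) = #{x^t in the same bin as x} / (N h).

    x0 is the bin origin and h the bin width; bins are [x0 + m*h, x0 + (m+1)*h).
    """
    N = len(X)
    m = np.floor((x - x0) / h)                 # index of the bin containing x
    in_same_bin = np.floor((X - x0) / h) == m  # training points sharing that bin
    return np.sum(in_same_bin) / (N * h)

rng = np.random.default_rng(1)
X = rng.normal(size=500)                       # illustrative training sample
print(histogram_estimate(0.2, X), histogram_estimate(3.0, X))
```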
How does the K-nearest neighbour method for making histograms work?
Fix the number of nearest instances (neighbours) k and let the bin width adapt to the data:
p(x) = k / (2 N d_k(x))
where d_k(x) is the distance from x to its k-th closest instance.
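A minimal one-dimensional sketch in Python; the function name, k = 10 and the generated data are illustrative choices:

```python
import numpy as np

def knn_density_estimate(x, X, k=10):
    """k-NN density estimate p_hat(x) = k / (2 * N * d_k(x)) in one dimension,
    where d_k(x) is the distance from x to its k-th closest training instance."""
    N = len(X)
    d_k = np.sort(np.abs(X - x))[k - 1]   # distance to the k-th nearest neighbour
    return k / (2 * N * d_k)

rng = np.random.default_rng(2)
X = rng.normal(size=500)                  # illustrative training sample
print(knn_density_estimate(0.0, X, k=10), knn_density_estimate(2.5, X, k=10))
```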
What is the K-nearest neighbour estimator?
It is a form of nonparametric classification.
Find the k training examples (x_1, y_1), …, (x_k, y_k) that are closest to the test example x, then predict the most frequent class among those y_i's.
The class-conditional density estimate is
p(x | C_i) = k_i / (N_i V^k(x))
where V^k(x) is the volume of the region around x containing the k nearest neighbours, k_i of which belong to class C_i, and N_i is the number of training examples in class C_i. The posterior is then
p(C_i | x) = [p(x | C_i) * p(C_i)] / p(x) = k_i / k
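A minimal sketch of k-NN classification in Python using Euclidean distance; the function name, k and the toy data set are illustrative choices:

```python
import numpy as np

def knn_classify(x, X, y, k=5):
    """Predict the most frequent class among the k training examples closest to x,
    which amounts to maximizing the k-NN posterior estimate k_i / k."""
    distances = np.linalg.norm(X - x, axis=1)  # distance to every training point
    nearest   = np.argsort(distances)[:k]      # indices of the k nearest neighbours
    counts    = np.bincount(y[nearest])        # k_i: neighbours per class
    return np.argmax(counts), counts / k       # predicted class, posterior estimates

# Tiny two-class toy set in 2D, made up for illustration.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.2, 0.2]), X, y, k=3))
```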
What is a Voronoi diagram?
It partitions the input space into cells, one per training point, where each cell contains all points that are closer to that training point than to any other. Colouring each cell by the class of its training point makes the diagram very useful for visualizing 1-nearest-neighbour classification.
How do you choose k or h in the k-nearest neighbor or kernel approach?
When k or h is small, single instances matter; bias is small, variance is large (undersmoothing): High complexity
As k or h increases, we average over more instances; variance decreases but bias increases (oversmoothing): Low complexity
The value of k or h is fine-tuned using cross-validation.
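A rough sketch of choosing k for a k-NN classifier by cross-validation; the fold count, the candidate values of k, the helper function and the toy data are all illustrative assumptions:

```python
import numpy as np

def cv_error_for_k(X, y, k, n_folds=5, seed=0):
    """Estimate the misclassification rate of a k-NN classifier by n-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    errors = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)          # everything outside the held-out fold
        wrong = 0
        for i in fold:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            nearest = train[np.argsort(d)[:k]]
            pred = np.argmax(np.bincount(y[nearest]))
            wrong += pred != y[i]
        errors.append(wrong / len(fold))
    return np.mean(errors)

# Pick the k with the lowest cross-validated error (toy data, for illustration only).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)
best_k = min([1, 3, 5, 7, 9], key=lambda k: cv_error_for_k(X, y, k))
print("best k:", best_k)
```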
What is Bayesian estimation and what is it used for?
Bayesian estimation can provide powerful insights into over-fitting and practical techniques for addressing the model complexity issue. This is done by
- treating θ as a random variable with prior density p(θ),
- using the sample data to define the likelihood density p(X | θ),
- computing the posterior density of θ: p(θ | X) = p(X | θ) p(θ) / p(X)
Parameter estimation
* Maximum likelihood (ML): θ_{ML} = argmax_θ p(X | θ)
* Maximum a posteriori (MAP): θ_{MAP} = argmax_θ p(θ | X)
The ML estimator is the MAP estimator under a uniform prior.
- Bayesian estimate (the posterior expectation):
θ_{Bayes} = E[θ | X] = ∫ θ p(θ | X) dθ
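A small worked sketch for a Gaussian likelihood with known variance and a Gaussian (conjugate) prior on the mean; the prior parameters and the data are illustrative. Because the posterior is again Gaussian, its mode and mean coincide, so the MAP and Bayesian estimates agree here:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                                   # known variance of p(x | mu)
X = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=20)
N, xbar = len(X), X.mean()

# Prior p(mu) = N(mu0, sigma0^2); values chosen only for illustration.
mu0, sigma0_2 = 0.0, 0.5

# ML estimate: maximizes p(X | mu), i.e. the sample mean.
mu_ml = xbar

# Posterior p(mu | X) is Gaussian (conjugate prior), with mean
mu_post = (N / sigma2 * xbar + mu0 / sigma0_2) / (N / sigma2 + 1 / sigma0_2)

# Because the posterior is Gaussian, its mode equals its mean:
# MAP estimate = Bayesian estimate E[mu | X] = mu_post.
print("ML:", mu_ml, " MAP = Bayes:", mu_post)
```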
Explain the differences and similarities between maximum likelihood and maximum a posteriori.
Maximum likelihood (ML) is an estimator, used when no prior distribution is available, in which θ is set to the value that maximizes the likelihood function l(θ | X) = p(X | θ):
* θ_{ML} = argmax_θ p(X | θ)
Maximum a posteriori (MAP) is an estimator in which θ is determined by maximizing the posterior distribution p(θ | X):
* θ_{MAP} = argmax_θ p(θ | X)
The maximum likelihood estimator is the maximum a posteriori estimator under a uniform prior, since a constant prior does not change where the posterior attains its maximum.
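A tiny numerical check of this last statement, assuming a Bernoulli likelihood with a Beta(a, b) prior on the success probability (a = b = 1 gives a uniform prior); the data is made up:

```python
import numpy as np

# Bernoulli likelihood with a Beta(a, b) prior on the success probability theta.
# With a = b = 1 the prior is uniform, and the MAP estimate reduces to the ML estimate.
X = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # made-up coin-flip data
N, s = len(X), X.sum()

theta_ml = s / N                              # argmax_theta p(X | theta)
a, b = 1, 1                                   # uniform prior
theta_map = (s + a - 1) / (N + a + b - 2)     # argmax_theta p(theta | X)
print(theta_ml == theta_map)                  # True: ML equals MAP under a uniform prior
```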
What are the advantages and disadvantages of Bayesian estimation?
Advantage: The inclusion of prior knowledge (a reasonable prior) leads to much less extreme conclusions, especially when the amount of training data is small.
Disadvantage: The need to marginalize (sum or integrate) over the whole parameter space, e.g., in
θ_{Bayes} = E[θ | X] = ∫ θ p(θ | X) dθ