Kursusgang 2+3 (Bayesian decision theory) + (Parametric and nonparametric methods) Flashcards

1
Q

Explain Bayes’ theorem and the formula for it

A

Bayes’ theorem gives the conditional probability of an event based on prior knowledge of conditions that might be related to that event: it converts a prior probability into a posterior probability by incorporating the observed data.

p(C_i | x) = [p(x | C_i) * p(C_i)] / p(x)

  • p(x | C_i) is the likelihood: the probability of observing input x, given that it belongs to class C_i, i = 1, …, K.
  • p(C_i) is the prior probability of class C_i.
  • p(x) is the evidence (normalization constant), ensuring that the posteriors p(C_i | x) sum to one over all classes.
  • p(C_i | x) is the posterior probability of class C_i after observing x.
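A minimal numeric sketch (Python, with made-up priors and likelihoods) of turning a prior into a posterior; the last line also shows the argmax rule used by the Bayes' classifier in the next card:

```python
import numpy as np

# Hypothetical two-class example: priors and class-conditional likelihoods
# for one observed input x (all numbers are made up for illustration).
priors = np.array([0.6, 0.4])        # p(C_1), p(C_2)
likelihoods = np.array([0.2, 0.5])   # p(x | C_1), p(x | C_2)

evidence = np.sum(likelihoods * priors)        # p(x) = sum_i p(x | C_i) p(C_i)
posteriors = likelihoods * priors / evidence   # p(C_i | x) by Bayes' theorem

print(posteriors, posteriors.sum())                # posteriors sum to 1
print("choose class", np.argmax(posteriors) + 1)   # pick the class with the largest posterior
```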
2
Q

What does the Bayes’ classifier do?

A

To minimize error, the Bayes’ classifier chooses the class with the highest posterior probability, i.e.,

choose C_i if p(C_i | x) = max_k [p(C_k | x)].

3
Q

What are the different models for density estimation?

A

Parametric: Assume a single model for p(x | C_i), e.g., a Gaussian density.

Semiparametric: p(x | C_i) is a mixture of densities, e.g., a Gaussian mixture model (GMM).
* Example: different phonemes in speech.
* This is also a clustering problem.

Nonparametric: No model; the data speaks for itself, e.g., the histogram estimator.
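A small sketch of the semiparametric case, assuming scikit-learn is available: fitting a two-component GMM to hypothetical 1-D data (the components also act as clusters):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 1-D data drawn from two Gaussian clusters (a semiparametric setting).
X = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal(3.0, 0.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(X)   # p(x) modelled as a mixture of 2 Gaussians
print(gmm.means_.ravel(), gmm.weights_)        # estimated component means and mixing weights
print(np.exp(gmm.score_samples([[0.0]])))      # estimated density p(x = 0)
```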

4
Q

How does parametric estimation work?

A

Assume an independent and identically distributed (iid) sample X = {x^t}_{t=1}^N, where x^t ~ p(x).
Assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X, e.g., N(μ, σ^2), where θ = {μ, σ^2}. Useful for models with a small number of parameters.
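A minimal sketch (Python/NumPy, hypothetical data) of parametric estimation: assume p(x | θ) = N(μ, σ^2) and estimate θ = {μ, σ^2} by maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical iid sample x^t ~ N(mu, sigma^2) with true mu = 1.5, sigma = 2.0.
x = rng.normal(1.5, 2.0, size=1000)

# Maximum likelihood estimates of the sufficient statistics theta = {mu, sigma^2}:
mu_hat = x.mean()                     # sample mean
var_hat = ((x - mu_hat) ** 2).mean()  # ML variance (divides by N, not N - 1)

print(mu_hat, var_hat)
```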

5
Q

How do nonparametric methods work?

A

They make fewer assumptions about the data and are therefore more flexible. Nonparametric models still contain parameters, but these control the model complexity rather than the form of the distribution. Their underlying rationale is that similar inputs have similar outputs: the data is allowed to speak for itself.

6
Q

What type of density estimation is a histogram?

A

It is a nonparametric method.

7
Q

How are histograms defined and what are the two possible ways of making a histogram?

A

For a small region R around x, with N the total number of observations, K the number of points that lie inside R, and V the volume of R,
p(x) = K / (NV)

The kernel approach fixes V and determines K from the data.
The K-nearest neighbour method fixes K and determines the value of V from the data.

8
Q

How does the kernel approach for making histograms work?

A

Given a training set X = {x^t}_{t=1}^N drawn iid from p(x), divide the data into bins of width h. Then

p(x) = #{x^t in the same bin as x} / (Nh)

where N is the total number of observations.
Fixed bins are of the form [x_0 + mh, x_0 + mh + h).
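A minimal sketch of this fixed-bin estimator, with an assumed bin origin x0, bin width h and hypothetical data:

```python
import numpy as np

def hist_density(x, data, x0=0.0, h=0.5):
    """Naive histogram estimate p(x) = #{x^t in the same bin as x} / (N h).

    Bins are [x0 + m*h, x0 + (m+1)*h); x0 and h are assumptions, not from the slides."""
    m = np.floor((x - x0) / h)               # index of the bin containing x
    in_bin = np.floor((data - x0) / h) == m  # which training points fall into that bin
    return in_bin.sum() / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=2000)       # hypothetical training sample
print(hist_density(0.25, data), hist_density(3.0, data))
```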

9
Q

How does the K-nearest neighbour method for making histograms work?

A

Fix the number of nearest instances (neighbours) k and let the data determine the bin width:

p(x) = k / (2N d_k(x))

where d_k(x) is the distance to the k’th closest instance to x.
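A minimal 1-D sketch of this estimator (NumPy, hypothetical data, k assumed):

```python
import numpy as np

def knn_density(x, data, k=25):
    """k-NN density estimate p(x) = k / (2 N d_k(x)), where d_k(x) is the
    distance from x to its k-th closest training point."""
    d_k = np.sort(np.abs(data - x))[k - 1]
    return k / (2 * len(data) * d_k)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=2000)   # hypothetical 1-D training sample
print(knn_density(0.0, data), knn_density(3.0, data))
```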

10
Q

What is the K-nearest neighbour estimator?

A

It is a form of nonparametric classification.
Find the k training examples (x_1, y_1), …, (x_k, y_k) that are closest to the test example x, then predict the most frequent class among those y_i's.

As a density estimator, with k_i of the k neighbours belonging to class C_i, N_i training examples in class C_i, and V^k(x) the volume of the region around x containing the k nearest neighbours:

p(x | C_i) = k_i / (N_i V^k(x))

p(C_i | x) = [p(x | C_i) * p(C_i)] / p(x) = k_i / k
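A minimal k-NN classifier sketch in NumPy on hypothetical 2-D data (the value of k is an assumption):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Predict the most frequent class among the k nearest training examples,
    i.e. choose the class with the largest k_i / k."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest points
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]             # majority vote

rng = np.random.default_rng(0)
# Hypothetical 2-D data: class 0 around (0, 0), class 1 around (3, 3).
X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y_train = np.array([0] * 100 + [1] * 100)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=5))   # most likely 1
```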

11
Q

What is a Voronoi diagram?

A

It is a coloured diagram in which all points closest to one training point share one colour, while points closest to another training point get another colour. It is very useful for visualizing 1-nearest-neighbour classification, since the cells correspond to its decision regions.

12
Q

How do you choose k or h in the k-nearest neighbor or kernel approach?

A

When k or h is small, single instances matter; bias is small, variance is large (undersmoothing): High complexity

As k or h increases, we average over more instances and variance decreases but bias increases (oversmoothing): Low complexity

This is fine-tuned by using cross-validation
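A hedged sketch of that fine-tuning, assuming scikit-learn and made-up data: 5-fold cross-validation accuracy for a handful of candidate k values, keeping the k with the best validation score:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical 2-D, two-class data set.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# Candidate k values are assumptions; small k = high complexity, large k = low complexity.
for k in (1, 3, 5, 15, 51):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))
```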

13
Q

What is Bayesian estimation and what is it used for?

A

Bayesian estimation can provide powerful insights into over-fitting and practical techniques for addressing the model complexity issue. This is done by

  • Treating θ as a random variable with prior p(θ).
  • The sample data gives the likelihood density p(X | θ).
  • Posterior density of θ: p(θ | X) = p(X | θ) p(θ) / p(X)

Parameter estimation
* Maximum likelihood (ML): θ_{ML} = argmax_θ p(X | θ)
* Maximum a posteriori (MAP): θ_{MAP} = argmax_θ p(θ | X)
The ML estimator is a MAP estimator under a uniform prior.

  • Bayesian estimate:
    θ_{Bayes}=E[θ|X]=∫θ p(θ | X)dθ, i.e., expectation
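A closed-form worked example (Python; the Beta prior and the data are assumptions, not from the slides) showing how the three estimators differ for a Bernoulli parameter θ:

```python
# ML, MAP and Bayes estimates for a Bernoulli parameter theta
# with a conjugate Beta(alpha, beta) prior.
alpha, beta = 2.0, 2.0   # assumed prior p(theta) = Beta(2, 2)
n, s = 10, 7             # assumed data: 7 successes in 10 trials

theta_ml    = s / n                                     # argmax_theta p(X | theta)
theta_map   = (s + alpha - 1) / (n + alpha + beta - 2)  # argmax_theta p(theta | X)
theta_bayes = (s + alpha) / (n + alpha + beta)          # E[theta | X], posterior mean

print(theta_ml, theta_map, theta_bayes)   # 0.7, 0.666..., 0.642...
```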
14
Q

Explain the differences and similarities between maximum likelihood and maximum a posteriori.

A

Maximum likelihood is an estimator used when no prior distribution is available; θ is set to the value that maximizes the likelihood function l(θ | X) = p(X | θ):
* θ_{ML} = argmax_θ p(X | θ).

Maximum a posteriori is an estimator in which θ is determined by maximizing the posterior distribution p(θ | X):
* θ_{MAP} = argmax_θ p(θ | X).

The maximum likelihood estimator is a maximum a posteriori estimator under a uniform prior, since a uniform prior does not change where the maximum lies.

15
Q

What are the advantages and disadvantages of Bayesian estimation?

A

Advantage: The inclusion of prior knowledge (a reasonable prior) leads to much less extreme conclusions, especially when the amount of training data is small.

Disadvantage: The need to marginalize (sum or integrate) over the whole parameter space:

θ_{Bayes}=E[θ | X]=∫ θ p(θ | X) dθ

16
Q

What is Bayesian parametric classification?

A

The model learns the prior probabilities of each class and the likelihood of the data given each class. This involves estimating the parameters of the probability distributions for each class based on the training data.

Then it chooses the class with maximal posterior probability p(C_k | x), or equivalently maximal p(x | C_k) p(C_k), since p(x) is the same for every class.
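A minimal sketch, assuming 1-D inputs and Gaussian class-conditional densities (the data is made up): fit class priors and per-class (μ, σ) from training data, then classify by maximal p(x | C_k) p(C_k):

```python
import numpy as np
from scipy.stats import norm

def fit(X, y):
    """Per class: prior p(C_k) and ML Gaussian likelihood parameters (mu_k, sigma_k)."""
    return {c: (np.mean(y == c), X[y == c].mean(), X[y == c].std())
            for c in np.unique(y)}

def predict(x, params):
    """Choose the class maximizing p(x | C_k) p(C_k)."""
    scores = {c: p * norm.pdf(x, mu, sd) for c, (p, mu, sd) in params.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
# Hypothetical 1-D data: class 0 ~ N(0, 1), class 1 ~ N(3, 1).
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 100)])
y = np.array([0] * 300 + [1] * 100)
params = fit(X, y)
print(predict(1.0, params), predict(2.5, params))
```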

17
Q

What is classification accuracy and error rate?

A

Accuracy = number of correct predictions / total number of predictions

Error rate = 1 - accuracy
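A tiny sketch with made-up labels and predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # hypothetical actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1])   # hypothetical predictions

accuracy = np.mean(y_pred == y_true)    # correct predictions / total predictions
print(accuracy, 1 - accuracy)           # accuracy 0.833..., error rate 0.166...
```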

18
Q

What is a confusion matrix?

A

If used for binary classification, it counts the number of true positive, true negative, false positive and false negative classifications in a 2×2 grid.

If used for non-binary classification, it is instead a grid with the predicted class along one axis and the actual class along the other.
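A hedged sketch assuming scikit-learn, with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions

# Rows are actual class, columns are predicted class; for binary labels {0, 1}:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```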

19
Q

What is parametric regression?

A

Output = function of input + random noise:
r = f(x) + 𝜀 where 𝜀~N(0, 𝜎^2)

Let g(x | θ) be the estimator of f(x). Under this model,
p(r | x) ~ N(g(x | θ), 𝜎^2)

Maximizing the log likelihood L(θ | X) is equivalent to minimizing the error function

E(θ | X) = 1/2 \sum_{t=1}^N [r^t - g(x^t | θ)]^2,

where the θ that minimizes the error function is called the least squares estimate and is equivalent to the maximum likelihood solution under the assumed Gaussian noise model.
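A minimal sketch (NumPy, hypothetical data and an assumed linear g) of this least squares / maximum likelihood equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data from r = f(x) + eps with f(x) = 2x + 1 and eps ~ N(0, 0.5^2).
x = rng.uniform(0, 5, 100)
r = 2 * x + 1 + rng.normal(0, 0.5, 100)

# Linear model g(x | theta) = theta_1 * x + theta_0; the least squares solution
# minimizes E(theta | X) = 1/2 * sum_t (r^t - g(x^t | theta))^2, which is the ML
# solution under the Gaussian noise assumption.
A = np.column_stack([x, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(A, r, rcond=None)
print(theta)   # approximately [2, 1]
```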