Logistic Regression Flashcards

1
Q

Which classification algorithms were mentioned in the syllabus?

A

Bayes classifier, Logistic Regression, K-Nearest Neighbors, and Support Vector Machines.

2
Q

How many of these classification algorithms are probabilistic?

A

Two are probabilistic: the Bayes classifier and logistic regression.

3
Q

How many of these classification algorithms are non-probabilistic?

A

Two are non-probabilistic: K-nearest neighbors and Support Vector Machines.

4
Q

What is the key difference between the Bayes classifier and logistic regression in modeling?

A

The Bayes classifier models each class separately (via class-conditional densities and class priors) and applies Bayes' rule, whereas logistic regression models the probability P(tnew = k | xnew) directly.

5
Q

Why can’t we simply use a linear function like w^T x as a probability?

A

Because w^T x is unbounded and can produce values outside the range [0, 1].

6
Q

What is the ‘squashing’ function used in logistic regression for binary classification?

A

It is the sigmoid function h(w^T x) = 1 / (1 + exp(-w^T x)).
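A minimal NumPy sketch of this squashing function (the clipping constant is only a guard against overflow in exp, not something from the notes):

```python
import numpy as np

def sigmoid(z):
    # h(z) = 1 / (1 + exp(-z)); clip z so exp(-z) cannot overflow for large |z|
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))
```

For example, sigmoid(np.array([-2.0, 0.0, 2.0])) returns roughly [0.12, 0.5, 0.88], always inside (0, 1).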

7
Q

What is the probability output for the negative class (t=0) in logistic regression?

A

It is 1 - h(w^T x) = exp(-w^T x) / (1 + exp(-w^T x)).

8
Q

Why do we use the likelihood p(t|X, w) in logistic regression?

A

To measure how well the parameters w predict the observed binary labels t given the training data X.

9
Q

What is the form of the likelihood p(t|X, w) for logistic regression?

A

The product over all observations: p(t|X, w) = ∏_n [ h(w^T x_n)^(t_n) × (1 - h(w^T x_n))^(1 - t_n) ], i.e., h(w^T x_n) for points with t_n = 1 and 1 - h(w^T x_n) for points with t_n = 0.

10
Q

What is the Cross Entropy (negative log-likelihood) in logistic regression?

A

J(w) = -Σ[t_n log(h(w^T x_n)) + (1 - t_n) log(1 - h(w^T x_n))].
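As a hedged illustration (reusing the sigmoid sketch above; X is an N x D design matrix and t a vector of 0/1 labels):

```python
import numpy as np

def cross_entropy(w, X, t):
    # J(w) = -sum_n [ t_n log h(w^T x_n) + (1 - t_n) log(1 - h(w^T x_n)) ]
    p = sigmoid(X @ w)          # predicted P(t_n = 1 | x_n)
    eps = 1e-12                 # guard against log(0)
    return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
```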

11
Q

How do we find the parameters w that minimize the Cross Entropy?

A

By setting the gradient of J(w) to zero or using an iterative method like Gradient Descent.
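A minimal Gradient Descent sketch for this minimization (the learning rate and iteration count are illustrative and would need tuning in practice):

```python
import numpy as np

def fit_logistic_gd(X, t, lr=0.1, n_iters=1000):
    # For the sigmoid/cross-entropy model the gradient is dJ/dw = X^T (h(Xw) - t)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - t)
        w -= lr * grad
    return w
```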

12
Q

Why is the Cross Entropy in logistic regression convex in w?

A

Because each term -log p(t_n|x_n, w) = log(1 + exp(w^T x_n)) - t_n w^T x_n is convex in w (a log-sum-exp of a linear function plus a linear term), and a sum of convex functions is convex. Hence the Cross Entropy has a single global minimum in w.

13
Q

What is the main idea behind multiclass classification in logistic regression?

A

Use the softmax function to model P(tn=k|xn) across K classes.

14
Q

How are labels represented in multiclass logistic regression?

A

Using a one-hot encoding vector, where each class corresponds to a 1 in one dimension and 0 in others.
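A small sketch of this encoding, assuming integer class labels 0..K-1:

```python
import numpy as np

def one_hot(labels, K):
    # e.g. label 2 with K = 4 becomes [0, 0, 1, 0]
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T
```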

15
Q

What is the softmax function for class k in multiclass logistic regression?

A

P(tn=k|xn) = exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) for ℓ = 1..K, which reduces to the sigmoid when K = 2.
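A NumPy sketch of the softmax (row n of Z holds the scores w^(k)T x_n for all classes; subtracting the row maximum is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)            # stability shift
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)   # rows sum to 1
```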

16
Q

What is the Cross Entropy loss for multiclass logistic regression?

A

J = -Σ_n Σ_k [ t_n,k log( exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) ) ].
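In code, with T the one-hot label matrix and W the D x K matrix whose columns are the w^(k) (a sketch reusing the softmax above):

```python
import numpy as np

def multiclass_cross_entropy(W, X, T):
    # J = -sum_n sum_k T_nk log softmax_k(x_n)
    P = softmax(X @ W)
    return -np.sum(T * np.log(P + 1e-12))
```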

17
Q

How do we compute the gradient of the multiclass Cross Entropy loss w.r.t w^(k)?

A

∂J/∂w^(k)_j = -Σ_n [ t_n,k - exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) ] x_n,j, i.e., the mismatch between the one-hot label and the softmax probability for class k, weighted by feature j.
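The same gradient in matrix form (a sketch; stacking the per-class gradients into a D x K matrix):

```python
import numpy as np

def multiclass_grad(W, X, T):
    # dJ/dW = -X^T (T - P), with P_nk the softmax probability of class k for x_n;
    # column k collects the per-entry formula above over all features j
    P = softmax(X @ W)
    return -X.T @ (T - P)
```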

18
Q

What is Bayesian logistic regression trying to achieve?

A

It places a prior on w, defines a likelihood, and seeks the posterior p(w|X,t) to make predictive distributions.

19
Q

Why is there no closed-form solution for the posterior in Bayesian logistic regression?

A

Because the sigmoid-based likelihood is not conjugate to the Gaussian prior, so the normalizing integral ∫ p(t|X, w) p(w) dw has no closed form and the posterior cannot be computed exactly.

20
Q

What is the MAP (Maximum A Posteriori) estimate in Bayesian logistic regression?

A

It is the w that maximizes p(w|X,t), which is equivalent to maximizing the product of the likelihood and the prior.

21
Q

Why do we often use numerical optimization for MAP in logistic regression?

A

Because we cannot solve ∂J/∂w = 0 analytically for logistic regression with a prior, so we rely on iterative methods like Gradient Ascent or Newton-Raphson.
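A sketch of this for a Gaussian N(0, sigma2 I) prior, where MAP amounts to Gradient Descent on the Cross Entropy plus an L2 penalty (the prior variance, learning rate, and iteration count are illustrative):

```python
import numpy as np

def fit_map(X, t, sigma2=1.0, lr=0.1, n_iters=1000):
    # Minimise the negative log posterior J(w) + w^T w / (2 sigma2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - t) + w / sigma2
        w -= lr * grad
    return w
```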

22
Q

What is the geometric interpretation of the decision boundary in logistic regression?

A

It is the set of x where w^T x = 0, which corresponds to P(t=1|x)=0.5.

23
Q

What does the Laplace approximation do in Bayesian logistic regression?

A

It approximates the posterior p(w|X,t) with a Gaussian N(µ,Σ) centered at the mode of the posterior.

24
Q

How do we choose µ and Σ in the Laplace approximation?

A

µ is the MAP estimate, and Σ^(-1) is the negative Hessian of the log posterior evaluated at the MAP.
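A sketch under the same N(0, sigma2 I) prior assumption; for the sigmoid likelihood the required Hessian has the standard closed form used below:

```python
import numpy as np

def laplace_approx(X, t, sigma2=1.0):
    # mu = MAP estimate; Sigma^{-1} = negative Hessian of the log posterior at mu,
    # which here is X^T diag(p(1-p)) X + I / sigma2, with p = h(X mu)
    mu = fit_map(X, t, sigma2=sigma2)
    p = sigmoid(X @ mu)
    H = (X * (p * (1 - p))[:, None]).T @ X + np.eye(X.shape[1]) / sigma2
    return mu, np.linalg.inv(H)
```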

25
Q

Why is the Laplace approximation sometimes inadequate?

A

Because it is centered on the posterior mode and assumes local Gaussian shape, it can be poor when the true posterior is skewed or multi-modal.

26
Q

How can we use the Laplace approximation to make predictions?

A

By drawing samples from the approximate N(µ,Σ) and averaging the predicted probabilities from each sample.
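A sketch of this sampling-based prediction (the sample count is illustrative):

```python
import numpy as np

def predict_laplace(x_new, mu, Sigma, n_samples=1000, rng=None):
    # Average sigmoid(w^T x_new) over draws w ~ N(mu, Sigma)
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(mu, Sigma, size=n_samples)   # n_samples x D
    return sigmoid(W @ x_new).mean()
```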

27
Q

What is MCMC sampling in Bayesian logistic regression?

A

A set of algorithms (e.g., Metropolis-Hastings) that generate samples directly from the true posterior p(w|X,t).

28
Q

Why is it possible to sample from a distribution we cannot compute explicitly?

A

Because methods like Metropolis-Hastings use acceptance-rejection steps that rely only on ratios of unnormalized probabilities.

29
Q

What is the role of the proposal distribution in Metropolis-Hastings?

A

It proposes a new sample w’ based on the current sample w. Often a Gaussian centered at w is used.

30
Q

What is the acceptance ratio r in Metropolis-Hastings?

A

r = [g(w’) / g(w)] × [p(w|w’) / p(w’|w)], where g(w) is the unnormalized posterior and p(w’|w) is the proposal distribution.

31
Q

What happens if the acceptance ratio r >= 1 in Metropolis-Hastings?

A

We always accept the new sample w’ and set w_s = w’.

32
Q

What if r < 1 in Metropolis-Hastings?

A

We accept the new sample with probability r, or keep the old sample otherwise.
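A sketch of the Metropolis-Hastings loop from the last few cards, assuming a symmetric Gaussian random-walk proposal (so the proposal ratio cancels from r) and a N(0, sigma2 I) prior; everything is done on the log scale for numerical stability, and the step size is illustrative:

```python
import numpy as np

def log_g(w, X, t, sigma2=1.0):
    # log of the unnormalised posterior: log likelihood + log prior (up to constants)
    z = X @ w
    return np.sum(t * z - np.logaddexp(0, z)) - w @ w / (2 * sigma2)

def metropolis_hastings(X, t, n_samples=5000, step=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])
    samples = []
    for _ in range(n_samples):
        w_prop = w + step * rng.standard_normal(w.shape)    # propose w'
        log_r = log_g(w_prop, X, t) - log_g(w, X, t)        # log of the acceptance ratio
        if np.log(rng.uniform()) < log_r:                   # accept with probability min(1, r)
            w = w_prop
        samples.append(w.copy())
    return np.array(samples)
```

In practice one would discard an initial burn-in portion of the samples and tune the step size so that a reasonable fraction of proposals is accepted.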

33
Q

How do MCMC samples help in prediction?

A

We approximate P(tnew=1|xnew) by averaging the predictions across all sampled w values.
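For instance, with the samples returned by the sketch above:

```python
def predict_mcmc(x_new, samples):
    # Average P(tnew = 1 | xnew, w_s) over the stored samples w_s
    return sigmoid(samples @ x_new).mean()
```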

34
Q

What is the difference between the Laplace approximation and MCMC in logistic regression?

A

Laplace approximates the posterior with a single Gaussian around the mode, while MCMC samples directly from the true posterior without that Gaussian assumption.

35
Q

Why can the Laplace approximation admit unlikely boundary shapes compared to MCMC?

A

Because it forces a symmetric Gaussian shape around the mode, it can place posterior mass on parameter values (and hence decision boundaries) that the true posterior considers unlikely.

36
Q

What is the main trade-off among MAP, Laplace approximation, and MCMC?

A

MAP is simplest but ignores posterior spread, Laplace is more faithful but still approximate, and MCMC is most accurate yet more computationally expensive.

37
Q

How do Bayesian methods handle predictive uncertainty compared to point estimates?

A

They average over the posterior distribution of parameters, capturing uncertainty in the predictions.

38
Q

What is a key limitation if the posterior distribution is multi-modal?

A

Laplace approximation centered on a single mode can be very inaccurate, while MCMC can explore multiple modes if run properly.

39
Q

What is the high-level benefit of logistic regression over simple linear classification?

A

Logistic regression provides a probabilistic interpretation of class membership and a smoothly varying decision boundary.

40
Q

Which loss function is typically minimized during logistic regression training?

A

The Cross Entropy loss.

41
Q

Which function in binary logistic regression outputs a probability between 0 and 1?

A

The sigmoid (logistic) function, 1 / (1 + exp(-z)).

42
Q

What are the typical numerical methods to fit logistic regression parameters?

A

Gradient-based approaches such as Gradient Descent, and second-order methods like Newton-Raphson.

43
Q

Why does regularization (e.g., a prior) help in logistic regression?

A

It penalizes large parameter values, helping prevent overfitting and improving generalization.

44
Q

What is the ‘posterior predictive distribution’ in Bayesian logistic regression?

A

P(tnew=1|xnew,X,t) obtained by integrating (or averaging) over all posterior values of w.

45
Q

What are some examples of alternative sampling methods besides Metropolis-Hastings?

A

Gibbs sampling, Hamiltonian Monte Carlo, or slice sampling.

46
Q

What is the main advantage of a fully Bayesian approach in logistic regression?

A

It provides a full posterior over parameters, offering more robust uncertainty estimates and predictions compared to single-point estimates.