Logistic Regression Flashcards
Which classification algorithms were mentioned in the syllabus?
Bayes classifier, Logistic Regression, K-Nearest Neighbors, and Support Vector Machines.
How many of these classification algorithms are probabilistic?
Two are probabilistic: the Bayes classifier and logistic regression.
How many of these classification algorithms are non-probabilistic?
Two are non-probabilistic: K-nearest neighbors and Support Vector Machines.
What is the key difference between the Bayes classifier and logistic regression in modeling?
The Bayes classifier models each class separately and uses Bayes rule, whereas logistic regression directly models the probability P(tnew = k|xnew).
Why can’t we simply use a linear function like w^T x as a probability?
Because w^T x is unbounded and can produce values outside the range [0, 1].
What is the ‘squashing’ function used in logistic regression for binary classification?
It is the sigmoid function h(w^T x) = 1 / (1 + exp(-w^T x)).
What is the probability output for the negative class (t=0) in logistic regression?
It is 1 - h(w^T x) = exp(-w^T x) / (1 + exp(-w^T x)).
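A minimal NumPy sketch of the sigmoid and the two class probabilities (the names sigmoid, w, and x, and the example values, are illustrative, not from the course material):

```python
import numpy as np

def sigmoid(z):
    # Logistic "squashing" function h(z) = 1 / (1 + exp(-z)), mapping R to (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0])   # illustrative parameters
x = np.array([2.0, 1.0])    # illustrative input
p_pos = sigmoid(w @ x)      # P(t = 1 | x, w)
p_neg = 1.0 - p_pos         # P(t = 0 | x, w) = exp(-w^T x) / (1 + exp(-w^T x))
print(p_pos, p_neg)
```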
Why do we use the likelihood p(t|X, w) in logistic regression?
To measure how well the parameters w predict the observed binary labels t given the training data X.
What is the form of the likelihood p(t|X, w) for logistic regression?
The product over all observations: p(t|X, w) = ∏_n h(w^T x_n)^(t_n) (1 - h(w^T x_n))^(1 - t_n), i.e., a factor h(w^T x_n) for each observation with t_n = 1 and 1 - h(w^T x_n) for each with t_n = 0.
What is the Cross Entropy (negative log-likelihood) in logistic regression?
J(w) = -Σ[t_n log(h(w^T x_n)) + (1 - t_n) log(1 - h(w^T x_n))].
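A direct sketch of this loss; cross_entropy, X, t, and w are assumed names, and a small eps guards the logs:

```python
import numpy as np

def cross_entropy(w, X, t, eps=1e-12):
    # J(w) = -sum_n [ t_n log h(w^T x_n) + (1 - t_n) log(1 - h(w^T x_n)) ]
    p = 1.0 / (1.0 + np.exp(-X @ w))   # h(w^T x_n) for every row x_n of X
    p = np.clip(p, eps, 1.0 - eps)     # avoid log(0)
    return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
```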
How do we find the parameters w that minimize the Cross Entropy?
By setting the gradient of J(w) to zero or using an iterative method like Gradient Descent.
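A minimal Gradient Descent sketch: the gradient of J(w) is Σ_n (h(w^T x_n) - t_n) x_n, and we step downhill repeatedly. The learning rate, iteration count, and the toy data are arbitrary choices for illustration, not values from the source:

```python
import numpy as np

def fit_logistic_gd(X, t, lr=0.01, n_iters=5000):
    # Gradient of J(w) is X^T (h(Xw) - t); take repeated downhill steps.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - t)
        w -= lr * grad
    return w

# Toy usage: two 1-D clusters, with a bias column appended to X.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), np.r_[rng.normal(-2, 1, 20), rng.normal(2, 1, 20)]])
t = np.r_[np.zeros(20), np.ones(20)]
w_hat = fit_logistic_gd(X, t)
print(w_hat)
```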
Why is the Cross Entropy in logistic regression convex in w?
Because log h(w^T x_n) and log(1 - h(w^T x_n)) are concave functions of w, so their negated weighted sum, the Cross Entropy, is convex; equivalently, its Hessian Σ_n h(w^T x_n)(1 - h(w^T x_n)) x_n x_n^T is positive semi-definite for every w.
What is the main idea behind multiclass classification in logistic regression?
Use the softmax function to model P(tn=k|xn) across K classes.
How are labels represented in multiclass logistic regression?
Using a one-hot encoding vector, where each class corresponds to a 1 in one dimension and 0 in others.
What is the softmax function for class k in multiclass logistic regression?
P(tn=k|xn) = exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) for ℓ = 1..K (note the positive exponents; the binary sigmoid is the K=2 special case).
What is the Cross Entropy loss for multiclass logistic regression?
J = -Σ_n Σ_k [ t_n,k log( exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) ) ].
How do we compute the gradient of the multiclass Cross Entropy loss w.r.t w^(k)?
∂J/∂w^(k)_j = -Σ_n [ t_n,k - exp(w^(k)T x_n)/Σ_ℓ exp(w^(ℓ)T x_n) ] x_n,j = -Σ_n [ t_n,k - P(tn=k|xn) ] x_n,j.
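A compact sketch of the softmax probabilities and this gradient, under assumed names: W stacks one weight vector w^(k) per column and T is the one-hot label matrix.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with a max-shift for numerical stability.
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_grad(W, X, T):
    # dJ/dW = -X^T (T - P), where P[n, k] = P(t_n = k | x_n).
    P = softmax(X @ W)        # shape (N, K)
    return -X.T @ (T - P)     # shape (D, K), one column per w^(k)
```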
What is Bayesian logistic regression trying to achieve?
It places a prior on w, defines a likelihood, and seeks the posterior p(w|X,t) to make predictive distributions.
Why is there no closed-form solution for the posterior in Bayesian logistic regression?
Because the likelihood (sigmoid-based) is not conjugate to the Gaussian prior, making the integral intractable.
What is the MAP (Maximum A Posteriori) estimate in Bayesian logistic regression?
It is the w that maximizes p(w|X,t), which is equivalent to maximizing the product of the likelihood and the prior.
Why do we often use numerical optimization for MAP in logistic regression?
Because setting the gradient of the log posterior (log-likelihood plus log prior) to zero has no closed-form solution for logistic regression, we rely on iterative methods like Gradient Ascent or Newton-Raphson.
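A sketch of MAP fitting assuming a zero-mean Gaussian prior N(0, σ² I): the log prior contributes a -w/σ² term to the gradient, and we ascend the log posterior iteratively. Function names and step sizes are illustrative.

```python
import numpy as np

def fit_map(X, t, sigma2=1.0, lr=0.01, n_iters=5000):
    # Gradient ascent on log p(w | X, t) = log p(t | X, w) + log p(w) + const,
    # assuming a zero-mean Gaussian prior N(0, sigma2 * I) on w.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (t - p) - w / sigma2   # log-likelihood grad + log-prior grad
        w += lr * grad
    return w
```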
What is the geometric interpretation of the decision boundary in logistic regression?
It is the set of x where w^T x = 0, which corresponds to P(t=1|x)=0.5.
What does the Laplace approximation do in Bayesian logistic regression?
It approximates the posterior p(w|X,t) with a Gaussian N(µ,Σ) centered at the mode of the posterior.
How do we choose µ and Σ in the Laplace approximation?
µ is the MAP estimate, and Σ^(-1) is the negative Hessian of the log posterior evaluated at the MAP.
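A sketch under the same assumed Gaussian prior N(0, σ² I): µ is taken as the MAP estimate, and Σ is the inverse of the negative Hessian of the log posterior at that point.

```python
import numpy as np

def laplace_approximation(w_map, X, sigma2=1.0):
    # Negative Hessian of the log posterior at the MAP (Gaussian prior N(0, sigma2 * I)):
    #   sum_n h_n (1 - h_n) x_n x_n^T + I / sigma2
    p = 1.0 / (1.0 + np.exp(-X @ w_map))
    R = p * (1.0 - p)                                    # h_n (1 - h_n) for each n
    neg_hessian = (X * R[:, None]).T @ X + np.eye(X.shape[1]) / sigma2
    mu = w_map                                           # mean = posterior mode
    Sigma = np.linalg.inv(neg_hessian)                   # covariance = inverse negative Hessian
    return mu, Sigma
```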
Why is the Laplace approximation sometimes inadequate?
Because it is centered on the posterior mode and assumes a locally Gaussian shape, it can be poor when the true posterior is skewed or multi-modal.
How can we use the Laplace approximation to make predictions?
By drawing samples from the approximate N(µ,Σ) and averaging the predicted probabilities from each sample.
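A sketch of that Monte Carlo predictive average, using the µ and Σ from the Laplace step above; n_samples is an arbitrary choice.

```python
import numpy as np

def predict_laplace(x_new, mu, Sigma, n_samples=1000, seed=0):
    # Approximate P(t_new = 1 | x_new, X, t) by averaging sigmoid(w^T x_new)
    # over samples w ~ N(mu, Sigma).
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mu, Sigma, size=n_samples)   # (n_samples, D)
    return np.mean(1.0 / (1.0 + np.exp(-W @ x_new)))
```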
What is MCMC sampling in Bayesian logistic regression?
A set of algorithms (e.g., Metropolis-Hastings) that generate samples directly from the true posterior p(w|X,t).
Why is it possible to sample from a distribution we cannot compute explicitly?
Because methods like Metropolis-Hastings use accept/reject steps that rely only on ratios of unnormalized posterior values, so the intractable normalizing constant cancels.
What is the role of the proposal distribution in Metropolis-Hastings?
It proposes a new sample w’ based on the current sample w. Often a Gaussian centered at w is used.
What is the acceptance ratio r in Metropolis-Hastings?
r = [g(w’) / g(w)] × [p(w|w’) / p(w’|w)], where g(w) is the unnormalized posterior and p(w’|w) is the proposal density; for a symmetric proposal such as a Gaussian centered at the current sample, the proposal ratio cancels and r reduces to g(w’)/g(w).
What happens if the acceptance ratio r >= 1 in Metropolis-Hastings?
We always accept the new sample w’ and set w_s = w’.
What if r < 1 in Metropolis-Hastings?
We accept the new sample with probability r, or keep the old sample otherwise.
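A sketch of random-walk Metropolis-Hastings for this posterior: with a symmetric Gaussian proposal the acceptance ratio reduces to g(w’)/g(w), compared here in log space for numerical stability. The prior variance, step size, and sample count are assumed values, not from the source.

```python
import numpy as np

def log_unnorm_posterior(w, X, t, sigma2=1.0):
    # log g(w) = log p(t | X, w) + log p(w), up to an additive constant,
    # assuming a zero-mean Gaussian prior N(0, sigma2 * I).
    z = X @ w
    log_lik = np.sum(t * z - np.logaddexp(0.0, z))   # sum_n [t_n w^T x_n - log(1 + e^{w^T x_n})]
    log_prior = -0.5 * np.sum(w ** 2) / sigma2
    return log_lik + log_prior

def metropolis_hastings(X, t, n_samples=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    log_g = log_unnorm_posterior(w, X, t)
    samples = []
    for _ in range(n_samples):
        w_prop = w + step * rng.standard_normal(w.shape)   # symmetric Gaussian proposal
        log_g_prop = log_unnorm_posterior(w_prop, X, t)
        if np.log(rng.uniform()) < log_g_prop - log_g:     # accept with probability min(1, r)
            w, log_g = w_prop, log_g_prop
        samples.append(w.copy())
    return np.array(samples)
```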
How do MCMC samples help in prediction?
We approximate P(tnew=1|xnew) by averaging the predictions across all sampled w values.
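Given samples from a sampler like the sketch above, the predictive probability is just the average over sampled parameter vectors (names assumed as before):

```python
import numpy as np

def predict_mcmc(x_new, samples):
    # P(t_new = 1 | x_new) ≈ (1/S) sum_s sigmoid(w_s^T x_new)
    return np.mean(1.0 / (1.0 + np.exp(-samples @ x_new)))
```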
What is the difference between the Laplace approximation and MCMC in logistic regression?
Laplace approximates the posterior with a single Gaussian around the mode, while MCMC samples directly from the true posterior without that Gaussian assumption.
Why can the Laplace approximation admit unlikely boundary shapes compared to MCMC?
Because it imposes a symmetric Gaussian shape around the mode, it can place probability mass on parameter values, and hence on decision boundaries, that the true posterior considers unlikely.
What is the main trade-off among MAP, Laplace approximation, and MCMC?
MAP is simplest but ignores posterior spread, Laplace is more faithful but still approximate, and MCMC is most accurate yet more computationally expensive.
How do Bayesian methods handle predictive uncertainty compared to point estimates?
They average over the posterior distribution of parameters, capturing uncertainty in the predictions.
What is a key limitation if the posterior distribution is multi-modal?
Laplace approximation centered on a single mode can be very inaccurate, while MCMC can explore multiple modes if run properly.
What is the high-level benefit of logistic regression over simple linear classification?
Logistic regression provides a probabilistic interpretation of class membership and a smoothly varying decision boundary.
Which loss function is typically minimized during logistic regression training?
The Cross Entropy loss.
Which function in binary logistic regression outputs a probability between 0 and 1?
The sigmoid (logistic) function, 1 / (1 + exp(-z)).
What are the typical numerical methods to fit logistic regression parameters?
Gradient-based approaches such as Gradient Descent, and second-order methods like Newton-Raphson.
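A sketch of a single Newton-Raphson step for the (unregularized) Cross Entropy, using the standard gradient X^T (p - t) and Hessian X^T R X with R = diag(h_n (1 - h_n)); the function name is illustrative.

```python
import numpy as np

def newton_step(w, X, t):
    # One Newton-Raphson update: w <- w - H^{-1} grad, with
    # grad = X^T (p - t) and H = X^T diag(p (1 - p)) X.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - t)
    H = (X * (p * (1.0 - p))[:, None]).T @ X
    return w - np.linalg.solve(H, grad)
```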
Why does regularization (e.g., a prior) help in logistic regression?
It penalizes large parameter values, helping prevent overfitting and improving generalization.
What is the ‘posterior predictive distribution’ in Bayesian logistic regression?
P(tnew=1|xnew,X,t) obtained by integrating (or averaging) over all posterior values of w.
What are some examples of alternative sampling methods besides Metropolis-Hastings?
Gibbs sampling, Hamiltonian Monte Carlo, or slice sampling.
What is the main advantage of a fully Bayesian approach in logistic regression?
It provides a full posterior over parameters, offering more robust uncertainty estimates and predictions compared to single-point estimates.