Week 7 Flashcards
What is the goal of logistic regression?
To model the probability of a binary (0 or 1) response variable as a function of covariates.
What probability distribution is assumed for the binary response variable Y_i in logistic regression?
Bernoulli distribution with success probability π_i, P(Y_i = y_i) = π_i^{y_i} (1-π_i)^{1-y_i}.
What is the standard ‘link function’ used in logistic regression to connect the probability π_i to a linear combination of covariates x_i?
The logit function: logit(π_i) = log(π_i / (1-π_i)) = x_i^T θ.
What function maps the linear predictor x_i^T θ back to the probability π_i?
The logistic (or sigmoid) function: π_i = σ(x_i^T θ) = 1 / (1 + exp(-x_i^T θ)).
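A minimal Python sketch of these two maps (the names sigmoid and logit are chosen for illustration; scipy.special.expit is an equivalent, numerically robust sigmoid):
import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # log-odds log(p / (1 - p)), the inverse of sigmoid on (0, 1)
    return np.log(p) - np.log1p(-p)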
Write the log-likelihood function l(θ; X, y) for logistic regression.
l(θ; X, y) = Σ_{i=1}^n [y_i log(π_i) + (1-y_i) log(1-π_i)], where π_i = σ(x_i^T θ).
Write the simplified log-likelihood function using u_i = exp(x_i^T θ).
l(θ; X, y) = Σ_{i=1}^n [y_i log(u_i) - log(1+u_i)].
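A hedged Python sketch of this log-likelihood (the name log_likelihood is illustrative; np.logaddexp(0, eta) computes log(1 + exp(eta)) stably):
import numpy as np

def log_likelihood(theta, X, y):
    # l(theta; X, y) = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ] with eta_i = x_i^T theta,
    # i.e. the simplified form above with u_i = exp(eta_i)
    eta = X @ theta
    return np.sum(y * eta - np.logaddexp(0.0, eta))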
What is the score function S(θ) (gradient of the log-likelihood) for logistic regression?
S(θ) = ∇_θ l(θ; X, y) = Σ_{i=1}^n (y_i - π_i) x_i.
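A small sketch of the score in vectorised form, S(θ) = X^T (y - π), reusing numpy as above (names assumed):
def score(theta, X, y):
    # S(theta) = sum_i (y_i - pi_i) x_i
    pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (y - pi)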
How is the Maximum Likelihood Estimate (MLE) of θ typically found in logistic regression?
There is no closed-form solution, so the score equations S(θ) = Σ_{i=1}^n (y_i - π_i) x_i = 0 are solved numerically, typically with Newton-Raphson (equivalently, iteratively reweighted least squares).
What is the observed information matrix i(θ) (negative Hessian of the log-likelihood) for logistic regression?
i(θ) = -∇²l(θ; X, y) = Σ_{i=1}^n π_i(1 - π_i) x_i x_i^T.
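A sketch of Newton-Raphson for the MLE using the score and observed information above (assumes X has full column rank and no perfect separation; names are illustrative):
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    # Iterate theta <- theta + i(theta)^{-1} S(theta) until the step is tiny
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
        S = X.T @ (y - pi)                   # score S(theta)
        W = pi * (1.0 - pi)                  # weights pi_i (1 - pi_i)
        info = (X * W[:, None]).T @ X        # observed information i(theta) = X^T diag(W) X
        step = np.linalg.solve(info, S)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta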
What objective function is maximized in L2 regularized logistic regression?
l(θ; X, y) - λ θ^T θ (equivalently with penalty (λ/2) θ^T θ; only the scaling of λ changes).
What is the function g(θ) being minimized if we define θ̃ = arg max[-g(θ)] for L2 regularization?
g(θ) = -l(θ; X, y) + λ θ^T θ.
The L2 regularized logistic regression estimate θ̃ corresponds to what Bayesian point estimate?
The Maximum a Posteriori (MAP) estimate.
What prior distribution on θ corresponds to L2 regularization with penalty λ θ^T θ?
A zero-mean Gaussian prior θ ~ N(0, σ²I); if the penalty is exactly λθ^Tθ, then σ² = 1/(2λ), so larger λ corresponds to a tighter prior.
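A one-line check of this correspondence: the prior N(0, σ²I) has negative log-density (1/(2σ²)) θ^T θ + const, so matching (1/(2σ²)) θ^T θ with the penalty λ θ^T θ gives σ² = 1/(2λ).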
Let g(θ) = -log(π(θ|X, y)_kernel) for the posterior associated with L2 regularization. What is its gradient ∇_θ g(θ)?
∇_θ g(θ) = -Σ_{i=1}^n (y_i - π_i) x_i + λθ (assuming regularization (λ/2)θ^Tθ in the negative log posterior).
Let g(θ) = -log(π(θ|X, y)_kernel). What is its Hessian H(θ) = ∇²g(θ)?
H(θ) = Σ_{i=1}^n π_i(1 - π_i) x_i x_i^T + λI (assuming regularization (λ/2)θ^Tθ).
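A Python sketch of g(θ) and its derivatives under the (λ/2)θ^Tθ convention (the names neg_log_posterior, grad_g, hess_g are illustrative):
import numpy as np

def neg_log_posterior(theta, X, y, lam):
    # g(theta) = -l(theta; X, y) + (lam/2) theta^T theta  (up to an additive constant)
    eta = X @ theta
    return -np.sum(y * eta - np.logaddexp(0.0, eta)) + 0.5 * lam * theta @ theta

def grad_g(theta, X, y, lam):
    # -sum_i (y_i - pi_i) x_i + lam * theta
    pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -X.T @ (y - pi) + lam * theta

def hess_g(theta, X, y, lam):
    # sum_i pi_i (1 - pi_i) x_i x_i^T + lam * I
    pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
    W = pi * (1.0 - pi)
    return (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])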
What quantities derived from g(θ) are needed to apply Laplace’s approximation to the posterior?
The mode θ̃ (which minimizes g(θ)) and the Hessian evaluated at the mode, H(θ̃).
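A sketch that computes both quantities with Newton steps on g(θ), reusing grad_g and hess_g from the previous sketch (names assumed):
def laplace_approximation(X, y, lam, n_iter=50, tol=1e-8):
    # Returns the mode theta_tilde and H(theta_tilde) for the approximation N(theta_tilde, H(theta_tilde)^{-1})
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        step = np.linalg.solve(hess_g(theta, X, y, lam), grad_g(theta, X, y, lam))
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta, hess_g(theta, X, y, lam)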
In frequentist prediction for logistic regression, how is the probability π* for a new observation x* estimated?
Using a plug-in estimate: π* ≈ σ(θ̃^T x*), where θ̃ is the MLE or regularized MLE.
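A minimal plug-in prediction sketch (X_new holds one new covariate vector x* per row; names assumed):
def predict_plugin(theta_hat, X_new):
    # pi* = sigma(theta_hat^T x*) for each row x* of X_new
    return 1.0 / (1.0 + np.exp(-(X_new @ theta_hat)))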
What shape do the contours of predicted probability have in the covariate space for frequentist logistic regression?
Linear (they form parallel hyperplanes).
What is the main drawback of using a single point estimate (like MLE or MAP) for prediction uncertainty?
It doesn’t fully capture uncertainty about θ; predictions might be overconfident, especially far from the training data.
How is Laplace approximation used to approximate the posterior distribution in Bayesian logistic regression?
The posterior π(θ|X, y) is approximated by a multivariate Gaussian distribution N(θ̃, H(θ̃)⁻¹), where θ̃ is the posterior mode and H(θ̃) is the Hessian of the negative log posterior at the mode.
How can we obtain samples from the approximate posterior using Laplace approximation?
By drawing samples θ_i ~ N(θ̃, H(θ̃)⁻¹).
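An illustrative sketch, assuming X, y, lam and the laplace_approximation sketch above are available:
rng = np.random.default_rng(0)
theta_tilde, H = laplace_approximation(X, y, lam)
Sigma = np.linalg.inv(H)                                          # approximate posterior covariance
samples = rng.multivariate_normal(theta_tilde, Sigma, size=1000)  # draws theta_i ~ N(theta_tilde, H^{-1})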
How does visualizing decision boundaries σ(θ^T x*) = 0.5 differ between frequentist and Bayesian (using posterior samples) approaches?
Frequentist shows one boundary based on θ̃. Bayesian shows multiple boundaries, one for each posterior sample θ_i, indicating uncertainty.
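A plotting sketch for two covariates with no intercept, so each boundary θ^T x = 0 is a line through the origin (assumes samples and theta_tilde from the previous sketch):
import matplotlib.pyplot as plt

x1 = np.linspace(-3, 3, 100)
for th in samples[:50]:                                    # one grey boundary per posterior sample
    if abs(th[1]) > 1e-12:
        plt.plot(x1, -(th[0] / th[1]) * x1, color="grey", alpha=0.3)
plt.plot(x1, -(theta_tilde[0] / theta_tilde[1]) * x1, color="black")   # single MAP boundary
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()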
How is the posterior predictive probability π* = P(y=1 | x, X, y) calculated in Bayesian logistic regression?
By integrating over the posterior: π* = ∫ σ(θ^T x*) π(θ | X, y) dθ.
How is the posterior predictive probability π* approximated using M posterior samples {θ_i}?
Using Monte Carlo integration: π* ≈ (1/M) Σ_{i=1}^M σ(θ_i^T x*).
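A Monte Carlo sketch using the posterior samples above (samples has shape (M, p), X_new has one row per x*; names assumed):
def posterior_predictive(samples, X_new):
    # pi* ≈ (1/M) sum_i sigma(theta_i^T x*)
    probs = 1.0 / (1.0 + np.exp(-(X_new @ samples.T)))     # shape (n_new, M)
    return probs.mean(axis=1)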