Bayesian linear regression Flashcards

1
Q
  In Bayesian linear regression, how does Bayes’ rule allow us to move from a likelihood p(t|w,X) to a posterior p(w|X,t)?
A

Bayes’ rule states p(w|X,t) = p(t|w,X)p(w) / p(t|X), turning the likelihood (how probable the data is given the parameters) and the prior (our belief about parameters before seeing data) into a posterior (updated belief after seeing the data). The denominator p(t|X) normalizes so that the posterior integrates to 1.
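
As a minimal numeric sketch of this update (the 1-D data and noise level below are made up, not from the course), the posterior can be tabulated on a grid of w values: multiply prior by likelihood pointwise, then renormalize, which plays the role of dividing by p(t|X):

    import numpy as np

    # Made-up 1-D data for t = w*x + noise, with an assumed known noise level
    x = np.array([1.0, 2.0, 3.0])
    t = np.array([1.1, 1.9, 3.2])
    sigma = 0.3

    w_grid = np.linspace(-2.0, 4.0, 601)                     # candidate values of w
    prior = np.exp(-0.5 * w_grid**2)                         # N(0, 1) prior, unnormalized
    lik = np.array([np.exp(-0.5 * np.sum(((t - w * x) / sigma)**2)) for w in w_grid])
    post = prior * lik                                       # numerator of Bayes' rule
    post /= post.sum()                                       # renormalize (the role of p(t|X))
    print("approximate posterior mean:", np.sum(w_grid * post))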

2
Q
  Suppose we have the model t = wᵀx + ε, with ε ~ N(0,σ²). If we place a Gaussian prior on w, why is the resulting posterior for w also Gaussian (conjugacy)?
A

A Gaussian prior is conjugate to a Gaussian likelihood. Mathematically, multiplying a Gaussian prior by a Gaussian likelihood yields another Gaussian form in w. Hence the posterior distribution for w remains Gaussian with updated mean and covariance.
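
A sketch of the completing-the-square step (written in LaTeX, with a general prior N(0, S) so that it matches the µ and Σ formulas on the later cards):

    \begin{aligned}
    p(\mathbf{w}\mid\mathbf{X},\mathbf{t})
     &\propto \exp\!\Big(-\tfrac{1}{2\sigma^{2}}(\mathbf{t}-\mathbf{X}\mathbf{w})^{\top}(\mathbf{t}-\mathbf{X}\mathbf{w})\Big)\,
              \exp\!\Big(-\tfrac{1}{2}\mathbf{w}^{\top}\mathbf{S}^{-1}\mathbf{w}\Big) \\
     &\propto \exp\!\Big(-\tfrac{1}{2}\Big[\mathbf{w}^{\top}\big(\tfrac{1}{\sigma^{2}}\mathbf{X}^{\top}\mathbf{X}+\mathbf{S}^{-1}\big)\mathbf{w}
              -\tfrac{2}{\sigma^{2}}\mathbf{w}^{\top}\mathbf{X}^{\top}\mathbf{t}\Big]\Big) \\
     &\propto \exp\!\Big(-\tfrac{1}{2}(\mathbf{w}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})\Big),
     \qquad
     \boldsymbol{\Sigma}=\Big(\tfrac{1}{\sigma^{2}}\mathbf{X}^{\top}\mathbf{X}+\mathbf{S}^{-1}\Big)^{-1},\;
     \boldsymbol{\mu}=\tfrac{1}{\sigma^{2}}\boldsymbol{\Sigma}\mathbf{X}^{\top}\mathbf{t}.
    \end{aligned}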

3
Q
  Give an example problem: We want to predict the 2012 Olympic 100m winning time using Bayesian linear regression with a prior on w. What are the key steps?
A

(1) Choose a Gaussian prior p(w) = N(0,S). (2) Specify the Gaussian likelihood p(t|w,X,σ²)=N(Xw,σ²I). (3) Apply Bayes’ rule to get the posterior p(w|X,t) ~ N(µ,Σ). (4) Use the posterior to form a predictive distribution for t_new at x_new, which is also Gaussian, N(x_newᵀµ, σ² + x_newᵀ Σ x_new).
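
A minimal numpy sketch of these four steps (the years, times, noise variance, and prior scale below are illustrative placeholders, not the actual Olympic data):

    import numpy as np

    # Illustrative rescaled years vs. made-up winning times in seconds
    years = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    times = np.array([10.1, 10.0, 9.95, 9.9, 9.85])
    X = np.column_stack([np.ones_like(years), years])   # design matrix with intercept
    sigma2 = 0.05**2                                     # assumed noise variance
    S = np.eye(2) * 10.0                                 # prior covariance, p(w) = N(0, S)

    # Posterior p(w|X,t) = N(mu, Sigma)
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(S))
    mu = Sigma @ X.T @ times / sigma2

    # Predictive distribution at a new input (e.g. the 2012 slot on the rescaled axis)
    x_new = np.array([1.0, 4.2])
    pred_mean = x_new @ mu
    pred_var = sigma2 + x_new @ Sigma @ x_new
    print(pred_mean, np.sqrt(pred_var))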

4
Q
  What do µ and Σ represent in the posterior p(w|X,t) = N(µ, Σ)?
A

µ is the posterior mean of w, the ‘most central’ or expected parameter vector after seeing the data. Σ is the posterior covariance matrix, showing how uncertain we still are about each parameter and how parameters co-vary after observing the data.

5
Q
  Show a short problem: We have a prior p(w)=N(0,I) and data xₙ with outputs tₙ. The likelihood is p(t|w)=N(Xw,σ²I). How do we find µ and Σ for the posterior?
A

The posterior is N(µ, Σ) with:
- Σ = [ (1/σ²) XᵀX + I ]⁻¹ (because the prior covariance is I)
- µ = (1/σ²) Σ Xᵀ t.
These follow from matching terms in the Gaussian exponent for the prior × likelihood.
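
A quick numerical check of these formulas on a toy data set (X, t, and σ² below are made up):

    import numpy as np

    X = np.array([[1.0, 0.5],
                  [1.0, 1.5],
                  [1.0, 2.5]])                           # toy design matrix
    t = np.array([1.0, 2.1, 2.9])                        # toy targets
    sigma2 = 0.1                                         # assumed noise variance

    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(2))  # prior covariance is the identity
    mu = Sigma @ X.T @ t / sigma2
    print(mu, Sigma)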

6
Q
  Why might a practitioner prefer a Bayesian linear regression approach over a point-estimate method like ordinary least squares for very limited data?
A

Because in Bayesian regression, the prior can guide the model when data is scarce, preventing overfitting. The posterior also expresses the remaining uncertainty, which can be crucial for making cautious predictions when data is limited.

7
Q
  How do we form a predictive distribution for a new input x_new in Bayesian linear regression?
A

First, note that p(w|X,t) is Gaussian with mean µ and covariance Σ. Then the predictive distribution p(t_new|x_new,X,t) is obtained by integrating out w: it’s also Gaussian with mean x_newᵀ µ and variance σ² + x_newᵀ Σ x_new (the extra x_newᵀ Σ x_new term accounts for parameter uncertainty).
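
Given µ and Σ from the posterior (computed as on the earlier cards), the predictive quantities are one line each; the numbers below are placeholders:

    import numpy as np

    mu = np.array([9.9, -0.05])                  # placeholder posterior mean
    Sigma = np.array([[0.02, -0.005],
                      [-0.005, 0.003]])          # placeholder posterior covariance
    sigma2 = 0.0025                              # assumed noise variance
    x_new = np.array([1.0, 5.0])

    pred_mean = x_new @ mu                       # mean of p(t_new | x_new, X, t)
    pred_var = sigma2 + x_new @ Sigma @ x_new    # noise variance + parameter-uncertainty term
    print(pred_mean, pred_var)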

8
Q
  Provide a short challenge: Suppose you have a 2D prior covariance S for w and a noise level σ². Show how to compute Σ in the posterior if we have design matrix X and target t.
A

We use Σ = [ (1/σ²) XᵀX + S⁻¹ ]⁻¹. So you:
(1) Invert S to get S⁻¹.
(2) Form (1/σ²) XᵀX.
(3) Sum them to get (1/σ²) XᵀX + S⁻¹.
(4) Invert this sum to get Σ.
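
The same four steps in numpy, with illustrative values for S, σ², and X (t only enters the posterior mean, not Σ):

    import numpy as np

    S = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                 # illustrative 2x2 prior covariance
    sigma2 = 0.25                              # illustrative noise variance
    X = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])                 # illustrative design matrix
    t = np.array([1.2, 1.9, 3.1])              # illustrative targets (only needed for the mean)

    S_inv = np.linalg.inv(S)                   # (1) invert S
    A = X.T @ X / sigma2                       # (2) form (1/sigma^2) X^T X
    Sigma = np.linalg.inv(A + S_inv)           # (3) sum and (4) invert
    mu = Sigma @ X.T @ t / sigma2              # posterior mean, for completeness
    print(Sigma)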

9
Q
  Why is p(t_new|X,t,x_new) a Gaussian centered at x_newᵀ µ with variance σ² + x_newᵀ Σ x_new?
A

Because t_new = wᵀx_new + ε. Once w has distribution N(µ,Σ), wᵀx_new is a Gaussian random variable with mean x_newᵀµ and variance x_newᵀΣx_new. Adding Gaussian noise ε ~ N(0,σ²) results in an overall Gaussian with mean x_newᵀµ and variance σ² + x_newᵀΣx_new.
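
A Monte Carlo check of this argument, with placeholder values for µ, Σ, and σ²: draw w from N(µ,Σ), add noise, and compare the empirical variance of t_new with σ² + x_newᵀΣx_new:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, 0.5])                        # placeholder posterior mean
    Sigma = np.array([[0.04, 0.01],
                      [0.01, 0.02]])                 # placeholder posterior covariance
    sigma2 = 0.09                                    # placeholder noise variance
    x_new = np.array([1.0, 3.0])

    w = rng.multivariate_normal(mu, Sigma, size=100_000)              # draws from the posterior
    t_new = w @ x_new + rng.normal(0.0, np.sqrt(sigma2), 100_000)     # w^T x_new + noise
    print(t_new.var(), sigma2 + x_new @ Sigma @ x_new)                # the two should be close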

10
Q
  Explain with an example why the variance σ² + x_newᵀ Σ x_new typically grows as x_new goes further from the region of observed data.
A

When x_new is far from the training region, the matrix expression x_newᵀ Σ x_new tends to be larger (because Σ describes how uncertain w is, and out-of-sample extrapolations amplify that uncertainty). Hence predictive variance grows, reflecting less confidence in the prediction.
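
A small illustration with toy inputs clustered around x ≈ 1–3, an assumed noise variance, and an N(0, I) prior; the quadratic form x_newᵀΣx_new (and hence the predictive variance) grows as x_new moves away from the training region:

    import numpy as np

    x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])              # training inputs clustered around 1-3
    X = np.column_stack([np.ones_like(x), x])
    sigma2 = 0.01                                         # assumed noise variance
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(2))   # posterior covariance under an N(0, I) prior

    for x_star in [2.0, 5.0, 10.0, 20.0]:
        x_new = np.array([1.0, x_star])
        print(x_star, sigma2 + x_new @ Sigma @ x_new)     # predictive variance grows with distance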

11
Q
  In the Bayesian formula p(w|X,t) ∝ p(t|X,w) p(w), why is the term p(t|X) omitted when finding the posterior?
A

p(t|X) is the marginal likelihood (or evidence) and does not depend on w. It is just a normalizing constant ensuring the posterior integrates to 1. For finding the functional form of p(w|X,t), we can ignore any factor that does not involve w.

12
Q
  Suppose we adopt a Gaussian prior p(w)=N(0, S) with a huge covariance S. What effect does this have on the posterior?
A

A large prior covariance means we impose very little constraint on w (i.e., ‘weak’ or ‘vague’ prior). The data then dominates in determining the posterior, so the posterior should be close to what you’d get by maximum likelihood if enough data is available.
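
A quick check of this claim on toy data: as the prior covariance S = scale·I becomes broader, the posterior mean approaches the maximum-likelihood (least-squares) solution:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 5.0, 20)
    X = np.column_stack([np.ones_like(x), x])
    t = 1.0 + 0.8 * x + rng.normal(0.0, 0.2, x.size)
    sigma2 = 0.04

    w_ml = np.linalg.lstsq(X, t, rcond=None)[0]           # maximum-likelihood / least-squares fit
    for scale in [0.1, 10.0, 1e6]:                        # prior covariance S = scale * I
        Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / scale)
        mu = Sigma @ X.T @ t / sigma2
        print(scale, mu, w_ml)                            # mu approaches w_ml as scale grows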

13
Q
  Give an example problem: You have only 5 data points but strongly believe the slope parameter should be near 0. How would you encode this in a prior, and what is its effect?
A

You could use a Gaussian prior p(w) with a small variance on the slope term, centered at 0, indicating strong belief that w₁ is near 0. This effectively shrinks slope estimates toward 0 in the posterior when data is limited, reducing overfitting by favoring smaller slopes.
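
A sketch of such a prior on made-up data: a diagonal prior covariance that is broad on the intercept but tight on the slope, which pulls the posterior slope toward 0 relative to ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])                # only 5 data points
    X = np.column_stack([np.ones_like(x), x])
    t = 2.0 + 0.3 * x + rng.normal(0.0, 0.3, x.size)
    sigma2 = 0.09

    S = np.diag([10.0, 0.01])                              # broad prior on intercept, tight prior on slope
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(S))
    mu = Sigma @ X.T @ t / sigma2
    w_ols = np.linalg.lstsq(X, t, rcond=None)[0]
    print(mu[1], w_ols[1])                                 # posterior slope is shrunk toward 0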

14
Q
  How does the posterior covariance Σ help diagnose whether your model parameters are identifiable from the data?
A

Large diagonal entries in Σ indicate high uncertainty for certain parameters, suggesting the data doesn’t strongly constrain them. Off-diagonal terms show correlations between parameters. If Σ has large entries, or a direction in parameter space with very large variance (which happens when XᵀX is nearly singular), many different parameter combinations explain the data almost equally well, i.e. those parameters are poorly identified.

15
Q
  Pose a mini-challenge: If someone tried to use a uniform prior on w (an improper prior), how would the Bayesian regression formulas adapt or simplify?
A

A uniform prior is effectively p(w) ∝ constant, so p(w|X,t) ∝ p(t|X,w). The posterior mean then coincides with the maximum-likelihood solution, i.e. the same ŵ as ordinary least squares, and the regularizing influence of the prior is lost. The posterior is still Gaussian, with Σ = σ² (XᵀX)⁻¹ (provided XᵀX is invertible).
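
As a quick numpy check on toy data, dropping the S⁻¹ term (the improper-uniform limit) reproduces the ordinary least-squares estimate:

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(0.0, 3.0, 10)
    X = np.column_stack([np.ones_like(x), x])
    t = 0.5 + 1.2 * x + rng.normal(0.0, 0.1, x.size)
    sigma2 = 0.01

    Sigma = sigma2 * np.linalg.inv(X.T @ X)            # no S^-1 term left in the posterior covariance
    mu = Sigma @ X.T @ t / sigma2                      # simplifies to (X^T X)^-1 X^T t
    print(mu, np.linalg.lstsq(X, t, rcond=None)[0])    # identical to ordinary least squares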

16
Q
  Why is Bayesian averaging (i.e., integrating over the posterior) often more robust than using just the single MAP or MLE estimate of w?
A

Because different w values can provide similarly good fits. Averaging predictions over all w weighted by their posterior probability helps avoid the risk of an unlucky single parameter choice, improving generalization and better capturing true uncertainty.

17
Q
  Show a short example problem: We have two candidate parameters w₁, w₂ with posterior weights 0.4 and 0.6. w₁ predicts t_new=10s, w₂ predicts t_new=9.5s. What’s the Bayesian prediction?
A

We weight each prediction by its posterior probability: t_Bayes = 0.4 × 10 + 0.6 × 9.5 = 9.7 seconds. This is a discrete version of the integral over the whole posterior distribution for w.

18
Q
  If your posterior distribution on w is p(w|X,t)=N(µ,Σ), how do you draw random parameter samples from that posterior, and why might you want to?
A

We can sample w ~ N(µ,Σ) using methods like the Cholesky decomposition or standard library routines. Doing this yields plausible parameter draws consistent with the data and prior. We might do it to create an ensemble of linear models, check posterior predictions, or visualize parameter variability.
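
A minimal sampling sketch with placeholder values for µ and Σ: take the Cholesky factor L of Σ (so LLᵀ = Σ) and form w = µ + Lz with z ~ N(0, I):

    import numpy as np

    rng = np.random.default_rng(5)
    mu = np.array([9.9, -0.05])                        # placeholder posterior mean
    Sigma = np.array([[0.02, -0.005],
                      [-0.005, 0.003]])                # placeholder posterior covariance

    L = np.linalg.cholesky(Sigma)                      # lower-triangular L with L @ L.T == Sigma
    z = rng.standard_normal((1000, 2))                 # standard normal draws
    samples = mu + z @ L.T                             # 1000 draws of w ~ N(mu, Sigma)
    print(samples.mean(axis=0), np.cov(samples.T))     # close to mu and Sigma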

19
Q
  How does Bayesian linear regression help with the Olympic data example when extrapolating far beyond the last training year?
A

It provides a predictive distribution rather than a point estimate. That distribution typically broadens when x_new is far beyond the data range (because x_newᵀΣ x_new grows). We get not just a single predicted winning time but also a credible interval reflecting uncertainty in w.

20
Q
  Summarize the main advantages of Bayesian linear regression over classical regression, tying it to the course material.
A

(1) It integrates prior knowledge about parameters through p(w). (2) The posterior p(w|X,t) captures parameter uncertainty, yielding more reliable predictions. (3) The predictive distribution includes both data noise and parameter uncertainty. (4) It naturally extends to model selection and hierarchical models by comparing or updating priors. Overall, Bayesian methods provide a more complete uncertainty picture.