Probabilistic Linear Regression Flashcards
Chapter 2
- In the probabilistic approach to linear regression, we say the true model is t = wᵀx + ε. How does assuming a Gaussian distribution for ε help us quantify predictions and parameter uncertainty?
By modeling the noise ε as Gaussian with mean 0 and variance σ², we can derive likelihood expressions and use them to find parameter estimates that maximize the probability of the observed data. This also yields formulas for variances and covariances of those estimates, enabling confidence intervals and uncertainty quantification for predictions.
- Given a dataset of pairs (xₙ, tₙ) and a linear model tₙ = wᵀxₙ + εₙ with Gaussian noise N(0, σ²), how do we write down the likelihood for the entire dataset?
We assume the data points are independent. The total likelihood is the product of individual Gaussian likelihoods for each point: L = ∏ₙ p(tₙ|w, σ²) = ∏ₙ N(tₙ | wᵀxₙ, σ²). Typically, we maximize the log-likelihood, log L = ∑ₙ log N(tₙ | wᵀxₙ, σ²).
- Show an example problem: Suppose we have 5 data points {(xₙ, tₙ)} in one dimension (x ∈ ℝ) with t = w₀ + w₁x + ε. How would you set up and solve for w₀, w₁, and σ² using maximum likelihood?
(1) Form the design matrix X with a column of 1s and a column of x-values. (2) Write down the log-likelihood assuming p(t|X,w,σ²) = N(Xw, σ²I). (3) Differentiate wrt w to get ŵ = (XᵀX)⁻¹ Xᵀ t. (4) Plug ŵ back into the expression for σ² to find σ̂² = (1/N) (t - Xŵ)ᵀ (t - Xŵ).
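A minimal numpy sketch of these four steps, using made-up one-dimensional data (the x and t values below are purely illustrative):

```python
import numpy as np

# Made-up 1-D data for illustration: t = w0 + w1*x + noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
t = np.array([1.9, 4.1, 6.2, 7.8, 10.1])

# (1) Design matrix: a column of 1s and a column of x-values
X = np.column_stack([np.ones_like(x), x])

# (3) Maximum-likelihood weights: w_hat = (X^T X)^(-1) X^T t
w_hat = np.linalg.solve(X.T @ X, X.T @ t)

# (4) Maximum-likelihood noise variance: mean squared residual
resid = t - X @ w_hat
sigma2_hat = resid @ resid / len(t)

print(w_hat, sigma2_hat)
```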
- How does the probabilistic view differ from the classical least-squares view in linear regression, and why do both end up yielding the same formula for ŵ?
The probabilistic view interprets the noise as Gaussian and maximizes the likelihood of observing the data; the classical least-squares view minimizes the sum of squared residuals. Both yield the same normal equations because maximizing exp(-residual²/(2σ²)) is equivalent to minimizing residual², leading to ŵ = (XᵀX)⁻¹ Xᵀ t.
- Why is modeling the errors εₙ explicitly so important if all we end up doing is the same normal equation solution?
Because it lets us quantify uncertainty. Merely minimizing sums of squares finds ŵ, but the probabilistic view also gives us estimates for σ², confidence intervals for ŵ, and predictive distributions for new t-values. This is crucial for risk assessment and understanding the reliability of our predictions.
- Consider predicting the 2012 men’s 100m Olympic winning time with a linear model. The data has random noise around a trend. How does including a Gaussian noise term ε with variance σ² inform our confidence in that 2012 prediction? (What is var{t_new} = )
By estimating σ² from historical data (giving σ̂²), we can compute var{t_new} = σ̂² x_newᵀ (XᵀX)⁻¹ x_new. This formula shows how uncertainty in the parameters propagates to uncertainty in the new prediction. When x_new is far from the bulk of the training data, the uncertainty grows.
- What does it mean that the maximum-likelihood estimate ŵ is ‘unbiased,’ and how do we formally show it in the probabilistic regression model?
‘Unbiased’ means E[ŵ] = w, i.e., on average across many datasets generated from the true w, the estimate ŵ recovers that true w. Formally, we show E[ŵ] = E[(XᵀX)⁻¹Xᵀ t] = (XᵀX)⁻¹Xᵀ E[t], and E[t] = Xw. Hence E[ŵ] = w.
- Provide an example problem showing why σ̂² = (1/N)(t - Xŵ)ᵀ(t - Xŵ) is biased for small samples. What is the exact expectation of σ̂² under the true model?
If we simulate tₙ = wᵀxₙ + Gaussian(0, σ²) for a small dataset, we typically find σ̂² < σ². Formally, E[σ̂²] = σ² (1 - D/N), so σ̂² is systematically lower than σ² when D < N. That is because ŵ itself is fit to the same data, artificially reducing the residual sums.
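A short simulation sketch of this bias (the true w, σ², and x-grid below are made-up values); the averaged σ̂² should come out near σ²(1 − D/N):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 10, 2, 0.25                     # small sample, true noise variance
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])
w_true = np.array([1.0, 2.0])

sigma2_hats = []
for _ in range(20000):
    t = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), N)
    w_hat = np.linalg.solve(X.T @ X, X.T @ t)
    r = t - X @ w_hat
    sigma2_hats.append(r @ r / N)              # biased ML estimator of sigma^2

# Average over simulated datasets: close to sigma2 * (1 - D/N) = 0.25 * 0.8 = 0.2
print(np.mean(sigma2_hats))
```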
- If our dataset has N points and D parameters, what is the intuitive explanation for why σ̂² underestimates the true variance σ² by the factor (1 - D/N)?
When we solve for ŵ, the regression line fits part of the data ‘too well,’ because ŵ is chosen to minimize residuals. We use the same data to estimate σ², so the apparent residual variance is reduced. We lose D degrees of freedom in matching w, thus we correct by the factor (1 - D/N).
- Show a short practical problem: If we have D=2 parameters and N=10 data points, how would you adjust the biased estimator σ̂² to get an unbiased estimate of σ²?
Since E[σ̂²] = σ²(1 - D/N) = σ²(1 - 2/10) = σ²(0.8), you can multiply σ̂² by 1/(1 - 2/10) = 1/0.8 = 1.25 to get an unbiased estimate of σ². In other words, use σ̃² = (N/(N - D)) σ̂².
- How do we calculate the covariance of the estimated parameters, cov{ŵ}, and what does it tell us about ŵ?
Under the Gaussian noise assumption, cov{ŵ} = σ² (XᵀX)⁻¹. The diagonal elements show how much each parameter can vary; large diagonal entries mean low precision (high uncertainty). Off-diagonal elements indicate correlation between parameters (how they move together to maintain a good fit).
- Provide an example problem where x-values are very close together, and show how it affects (XᵀX) and thus cov{ŵ}.
If all x-values in a dataset are nearly the same, XᵀX becomes nearly singular, making (XᵀX)⁻¹ huge. Numerically, suppose x = [1.0, 1.1, 1.05,…], then the design matrix columns are almost linearly dependent. The result is a very large cov{ŵ}, meaning the parameters are not well identifiable from the data.
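A quick numerical illustration, with made-up x-values, of how clustered inputs blow up (XᵀX)⁻¹ and hence cov{ŵ}:

```python
import numpy as np

def cov_w(x, sigma2=1.0):
    """cov{w_hat} = sigma^2 (X^T X)^(-1) for a straight-line model with intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return sigma2 * np.linalg.inv(X.T @ X)

x_spread    = np.array([0.0, 2.5, 5.0, 7.5, 10.0])      # well spread out
x_clustered = np.array([1.00, 1.05, 1.10, 1.02, 1.08])  # nearly identical

print(np.diag(cov_w(x_spread)))      # modest parameter variances
print(np.diag(cov_w(x_clustered)))   # huge variances: parameters poorly identified
```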
- How do we form a predictive distribution for a new input x_new based on ŵ and σ̂², and what is the variance of that prediction? (What is the variance of a new datapoint prediction)
We have t_new = wᵀx_new + ε, and w is estimated by ŵ. The predictive variance is var{t_new} = σ̂² + x_newᵀ cov{ŵ} x_new = σ̂² + σ̂² x_newᵀ (XᵀX)⁻¹ x_new = σ̂²[1 + x_newᵀ (XᵀX)⁻¹ x_new]. The first term reflects noise in the outcome; the second term is uncertainty in ŵ.
- Give a mini-challenge: Suppose you train a polynomial model of degree 3 and a polynomial model of degree 8 on the same data. Both produce a best-fit line, but the degree-8 model has huge cov{ŵ}. Why might the simpler model sometimes yield tighter predictions far from the training set?
High-degree polynomials can ‘bend’ to fit noise, leading to poor identifiability of coefficients (large cov{ŵ}). When extrapolating far from the data, small changes in high-degree coefficients can swing predictions wildly. By contrast, the simpler degree-3 model might maintain stable coefficients and hence smaller predictive variance far from training points.
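A rough sketch of this effect, using synthetic training data and an arbitrary extrapolation point (both invented for illustration), comparing the predictive variance of the two polynomial fits:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
t = np.sin(2.0 * x) + rng.normal(0.0, 0.1, x.size)   # synthetic training data
x_new, sigma2 = 2.5, 0.1 ** 2                        # extrapolation point, noise var

for degree in (3, 8):
    X = np.vander(x, degree + 1, increasing=True)               # columns 1, x, x^2, ...
    x_vec = np.vander([x_new], degree + 1, increasing=True)[0]
    cov_w = sigma2 * np.linalg.inv(X.T @ X)
    var_tnew = sigma2 + x_vec @ cov_w @ x_vec                   # predictive variance
    print(f"degree {degree}: var(t_new) = {var_tnew:.3g}")
```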
- How does cross-validation still remain a reliable method for model selection even when we switch from a purely least-squares perspective to a probabilistic (Gaussian) perspective?
Cross-validation directly tests predictive performance on held-out data. While a probabilistic approach gives parameter uncertainties and likelihoods, over-complex models can inflate the training likelihood. CV bypasses that by empirically assessing out-of-sample error, helping you choose a balance between complexity and generalization.
- Show a short real-world scenario: You want to forecast next year’s sales based on advertising spend, price, and competitor data. Explain how the probabilistic approach with linear regression would inform your confidence in the forecast.
1) Collect (xₙ, tₙ). 2) Fit ŵ and σ̂² by maximizing Gaussian likelihood. 3) Use cov{ŵ} = σ̂² (XᵀX)⁻¹ to measure parameter uncertainty. 4) Predict next year’s sales with t_new = ŵᵀ x_new, but also compute var{t_new} = x_newᵀ cov{ŵ} x_new + σ̂². This gives a predictive range and expresses how certain or uncertain the model is about next year’s sales.
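A sketch of those four steps with entirely hypothetical sales data (the feature names and numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical yearly records: columns are [1, ad_spend, price, competitor_index]
X = np.array([
    [1.0, 10.0, 5.0, 1.2],
    [1.0, 12.0, 5.5, 1.1],
    [1.0,  9.0, 4.8, 1.4],
    [1.0, 15.0, 5.2, 1.0],
    [1.0, 11.0, 5.0, 1.3],
    [1.0, 14.0, 5.6, 0.9],
])
t = np.array([102.0, 118.0, 95.0, 140.0, 108.0, 131.0])   # invented sales figures

w_hat = np.linalg.solve(X.T @ X, X.T @ t)                  # step 2: fit
resid = t - X @ w_hat
sigma2_hat = resid @ resid / len(t)
cov_w = sigma2_hat * np.linalg.inv(X.T @ X)                # step 3: parameter uncertainty

x_new = np.array([1.0, 13.0, 5.4, 1.1])                    # step 4: next year's scenario
t_new = w_hat @ x_new
var_tnew = sigma2_hat + x_new @ cov_w @ x_new
print(t_new, np.sqrt(var_tnew))                            # forecast and its std. dev.
```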
- Why does the predictive variance var{t_new} typically increase the further x_new is from the bulk of the training data?
Because x_newᵀ (XᵀX)⁻¹ x_new grows when x_new lies farther from where the design matrix X provides strong coverage. This term inflates the total predictive variance. With fewer nearby data to anchor the fit, small changes in ŵ are magnified, increasing overall uncertainty.
- Suppose you have a linear model t = wᵀx + ε, and you sample many parameter vectors q from N(ŵ, cov{ŵ}). How does examining the distribution of t_new = qᵀ x_new help?
By taking many q samples, you see how t_new varies across plausible parameter values consistent with your data. This produces a distribution over t_new, illustrating all likely outcomes rather than a single point estimate. It’s a form of approximate Bayesian model averaging within the maximum-likelihood framework.
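A small sampling sketch (the ŵ, cov{ŵ}, and x_new values below are made up; in practice they would come from a fit like the ones above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative values; in practice w_hat and cov_w come from the fitted model
w_hat = np.array([1.0, 2.0])
cov_w = np.array([[0.04, -0.01],
                  [-0.01, 0.02]])
x_new = np.array([1.0, 3.0])

# Draw many plausible parameter vectors q ~ N(w_hat, cov{w_hat})
q = rng.multivariate_normal(w_hat, cov_w, size=10000)

# Each sample gives one prediction t_new = q^T x_new
t_samples = q @ x_new
print(t_samples.mean())   # close to w_hat^T x_new
print(t_samples.var())    # close to x_new^T cov{w_hat} x_new
```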
- How would you apply the probabilistic linear regression framework to the men’s 100m sprint times if you suspected multiple features (year, track condition, temperature)? Provide an example approach.
(1) Collect historical data: each instance xₙ = (1, yearₙ, track_conditionₙ, temperatureₙ), and tₙ = winning_timeₙ. (2) Construct X and solve ŵ = (XᵀX)⁻¹ Xᵀ t. (3) Estimate σ̂² from residuals. (4) For any new scenario (year, track condition, temperature), predict time as ŵᵀ x_new and compute the predictive variance. This handles multiple input dimensions in the same Gaussian-likelihood framework.
- Summarize in your own words the key benefits of taking a probabilistic approach to linear regression and how it improves on plain least squares when dealing with real data with uncertainty.
It provides (1) a natural way to estimate and interpret noise variance, (2) a principled derivation of parameter estimates as maximum likelihood, (3) formulas for parameter uncertainty (cov{ŵ}), and (4) predictive distributions (with means and variances) for new inputs. These enhancements are crucial for risk management, confidence intervals, and any scenario where understanding uncertainty is as important as the prediction itself.
What is the formula for a linear model?
tn = w0 + w1xn,1 + w2xn,2 + w3xn,3 + … + wDxn,D
Represents the prediction of response variable tn based on weighted inputs.
What does the vector xn represent in a linear model?
xn = [1, xn,1, xn,2, …, xn,D]
It includes a constant term and the input features.
What is the matrix X in the context of a linear regression model?
X = [[1, x1,1, x1,2, …, x1,D], [1, x2,1, x2,2, …, x2,D], …, [1, xN,1, xN,2, …, xN,D]]
It is the design matrix containing all input features for each observation.
What does the vector t represent in linear regression?
t = [t1, t2, …, tN]
It contains the actual response values for each observation.
What is the purpose of modeling errors in linear regression?
Errors tell us how confident our predictions should be
They indicate the variability of predictions from the model.
What is the assumption about the noise term ϵn in the model?
It is different for each n
This means each observation may have a unique error term.
What type of noise is assumed in the linear regression model?
Additive noise
The model assumes that noise is added to the predictions.
What is the Gaussian noise model equation?
p(ϵ|µ, σ²) = (1/(σ√(2π))) exp(−(ϵ − µ)²/(2σ²))
This describes the distribution of the noise term in the model.
What parameters are involved in the Gaussian noise model?
Mean µ and Variance σ²
These parameters define the shape of the Gaussian distribution.
What is likelihood in the context of linear regression?
Likelihood is the value obtained when evaluating the density function at t = tn
It indicates how probable the observed data is under the model.
How is likelihood used in evaluating models?
The higher the likelihood value, the better the model fits the data
It helps assess the quality of different models.
What is the joint likelihood expression for multiple observations?
p(t1, …, tN|w, σ², x1, …, xN)
This combines the likelihoods of individual observations.
What do we seek to maximize when optimizing likelihood in linear regression?
We want to maximize the likelihood values for the parameters w and σ²
This is similar to minimizing the loss function.
True or False: The likelihood for continuous random variables is a probability.
False
Likelihood is not a probability; it is a measure of fit.
Fill in the blank: The model assumes that the noise is an _______ term in the model.
additive
This indicates how the model incorporates variability.
What is the Gaussian likelihood for each input-response pair?
p(tn|w, xn, σ²) = N(w^T xn, σ²)
What is the joint likelihood for multiple input-response pairs?
p(t₁, …, t_N | w, σ², x₁, …, x_N) = ∏ₙ₌₁ᴺ p(tₙ | w, xₙ, σ²)
What assumption is made about the tn’s in likelihood optimization?
The tn’s are assumed to be independent.
How is the likelihood maximization mathematically expressed?
ŵ, σ̂² = argmax_{w, σ²} ∏ₙ₌₁ᴺ p(tₙ | w, xₙ, σ²)
Why do we optimize the natural log likelihood instead of the likelihood itself?
It is easier to optimize the log likelihood.
What happens to log(z) when z increases?
log(z) increases.
What is the log likelihood expression after some rearranging?
log L = −N log(σ√(2π)) − (1/(2σ²)) ∑ₙ₌₁ᴺ (tₙ − wᵀxₙ)²
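A short helper, as a sketch, that evaluates this log-likelihood for given w and σ² (the data values are made up):

```python
import numpy as np

def log_likelihood(w, sigma2, X, t):
    """log L = -N log(sigma sqrt(2 pi)) - (1/(2 sigma^2)) sum_n (t_n - w^T x_n)^2"""
    resid = t - X @ w
    return -len(t) * np.log(np.sqrt(2.0 * np.pi * sigma2)) - resid @ resid / (2.0 * sigma2)

# Made-up data: the maximum-likelihood solution scores higher than a perturbed one
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
t = np.array([0.1, 1.2, 1.9, 3.1])
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
sigma2_hat = np.sum((t - X @ w_hat) ** 2) / len(t)

print(log_likelihood(w_hat, sigma2_hat, X, t))
print(log_likelihood(w_hat + 0.5, sigma2_hat, X, t))   # worse weights -> lower log L
```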
What is the expression for the multi-variate Gaussian?
p(y|µ, Σ) = N(µ, Σ) = (1/((2π)^(K/2) |Σ|^(1/2))) exp(−(1/2)(y − µ)ᵀ Σ⁻¹ (y − µ))
In multi-variate Gaussian, what does K represent?
K is the number of variables.
What is the determinant of the covariance matrix denoted as?
|Σ|
What is the expression for estimating optimum parameters wb?
ŵ = (XᵀX)⁻¹ Xᵀ t
How do you compute the optimum σ²?
σ̂² = (1/N) (t − Xŵ)ᵀ (t − Xŵ)
What is a key question regarding confidence in parameter estimates?
How good are our estimates ŵ and σ̂²?
What does the average value of a random variable X represent?
It is the expected value of X, denoted x̃.
What is the formula for averaging S samples?
x̃ ≈ (1/S) ∑ₛ xₛ (an average over S samples)
What happens to the sample-based approximation of x̃ as more samples are taken?
It gets better.
What is the formula for the expected value in discrete cases?
x̃ = E_{p(x)}{x} = ∑ₓ x P(X = x)
What is an example of computing the expected value?
If X is the outcome of rolling a die, then x̃ = ∑ₓ x P(X = x) = 3.5.
What is the expectation of a discrete random variable X?
E_p(x) {x} = Σ x P(X = x)
This represents the average value of X weighted by its probability distribution.
What is the expectation of a continuous random variable X?
E_p(x) {x} = ∫ x p(x) dx
This integral computes the average value of X over a continuous range.
What is the formula for mean, µ, in terms of expectation?
µ = E_p(x) {x}
The mean is the expected value of the random variable.
What is the formula for variance, σ²?
σ² = E_p(x) {(x - µ)²} = E_p(x) {x²} - (E_p(x) {x})²
Variance measures the spread of the random variable around its mean.
What does the covariance of parameters tell us?
cov{ŵ} tells us how well defined the parameters are by the data
It indicates how much parameters can vary while still providing a good model.
What is the relationship between expectation and a linear transformation of a function?
E_p(x) {kf(x)} = kE_p(x) {f(x)}
This property shows that expectation is linear.
True or False: E_p(x) {f(x)} = f(E_p(x) {x}) for all functions f.
False
Expectation does not generally commute with functions of the random variable.
What is the formula for the covariance of the estimated parameters ŵ?
cov{ŵ} = E_p(t|X,w,σ²) {ŵŵᵀ} − E_p(t|X,w,σ²) {ŵ} E_p(t|X,w,σ²) {ŵ}ᵀ
This formula captures the relationship between parameter estimates.
What is the expected value of the parameter estimate ŵ in terms of the true value w?
E_p(t|X,w,σ²) {ŵ} = w
This indicates that ŵ is an unbiased estimator of the true parameter w.
What is the formula for the expected value of a function of a random variable?
E_p(x) {f(x)} = ∫ f(x)p(x) dx
This integrates the function f weighted by the probability density p(x).
What does the term p(t|X, w, σ²) represent in the model?
p(t|X, w, σ²) = N(Xw, σ²I)
This indicates that the outputs t are normally distributed around the linear model Xw with variance σ².
What does the notation N(µ, σ²) signify?
N(µ, σ²) indicates a normal distribution with mean µ and variance σ²
This is a standard notation for describing Gaussian distributions.
Fill in the blank: The mean of a uniform distribution between a and b is _______.
(b + a) / 2
This represents the average of the endpoints in a uniform distribution.
What is the formula for the expected value of the square of a random variable?
E_p(x) {x²}
This is used in the calculation of variance.
What is the covariance matrix for parameters in a linear model?
cov{ŵ} = σ² (XᵀX)⁻¹
This indicates how the variance of the estimates of parameters is related to the design matrix X.
What does the term σ̂² represent in the context of parameter estimates?
σ̂² is the estimate of the variance of the errors in the model
This is calculated based on the residuals of the model.
What is the identity used for the trace of a product of matrices?
Tr(AB) = Tr(BA)
This identity is useful in various mathematical proofs and derivations involving matrices.
What does E_p(t|X,w,σ²) {σ̂²} equal?
σ²(1 - D/N)
D is the number of columns in X, and N is the number of observations.
In general, what is the relationship between D and N?
D < N
This implies that 1 - D/N < 1.
What is the implication of σ̂² being less than σ²?
σ̂² is biased and will generally be too low.
This is because it is based on ŵ, which is usually closer to the data than w.
As N increases, what happens to E_p(t|X,w,σ²) {σ̂²}?
It approaches σ².
This indicates that with more data, the bias in σ̂² diminishes.
What is the computed expectation of ŵ?
E_p(t|X,w,σ²) {ŵ} = w
This shows that ŵ is an unbiased estimator.
What is the covariance of the parameter estimate ŵ?
cov{ŵ} = σ² (XᵀX)⁻¹
This expression helps us understand the variability of the estimated parameters.
What is the model equation used for predictions?
t = w^T x + ϵ
Here, t is the target variable, w is the weight vector, x is the input feature vector, and ϵ is the error term.
What is the formula for predicting a new value t_new?
t_new = ŵᵀ x_new
This formula uses the estimated weights and a new input to make predictions.
What is the formula for the variance of the prediction var{t_new}?
var{t_new} = σ² x_newᵀ (XᵀX)⁻¹ x_new
This indicates how uncertainty in the parameters affects the predictions.
What happens to the predictive variance if the model is too simple?
σ̂² is inflated because the residuals stay large, leading to high var{t_new}.
The model cannot capture the structure in the data, so systematic misfit ends up in the noise estimate.
What happens to the predictive variance if the model is too complex?
Parameters are not well defined, leading to high cov{ŵ}.
This results in increased uncertainty in predictions.
What is the role of σ̂² in the variance of predictions?
It substitutes for σ² when true variance is unknown.
This substitution can lead to overestimation or underestimation of prediction variance.
What is the prediction for the 2012 Olympic model?
tnew = 9.5947, var{tnew} = 0.0073
This represents a specific result from the linear model prediction.
What is the prediction for the 2016 Olympic model?
tnew = 9.5414, var{tnew} = 0.0080
This shows a slight change in predictions with associated variance.
What happens to predictive variance as we get further from training data?
Predictive variance increases.
This indicates a loss of confidence in predictions as inputs diverge from the training set.
What alternative to training loss can be used for model choice?
Cross-validation.
This method helps evaluate model performance on unseen data.
Is training loss a good measure for model choice?
No
Can likelihood L or log L be used to choose models?
No
What is an important consideration when using more complex models?
More complex models can always get closer to the data
What did the decision to model the noise lead to?
Introduced the likelihood and maximised it to find ŵ and σ̂²
What can we now quantify after modeling the noise?
Uncertainty in our parameters and predictions
Why is quantifying uncertainty important?
Because decisions based on predictions (forecasts, risk assessments) depend on knowing how reliable those predictions are.
What does going Bayesian require us to forget?
Single (point) values for the parameters.
How should we treat the parameters to capture model uncertainty?
As random variables.
What form does the density p(q) over parameter values take?
p(q) = N(ŵ, cov{ŵ})
What do samples of the random variable q represent?
Models
What can be computed from each sample of q?
A prediction
What is the purpose of looking at the distribution of predictions?
To analyze the variability of predictions
What is the expectation of t_new under the distribution over q?
E_p(q) {t_new} = ∫ qᵀx_new N(q | ŵ, cov{ŵ}) dq = ŵᵀ x_new
What is the significance of σ̂² in the context of modeling?
It is assumed to be fixed
What is a key takeaway from the summary of modeling noise?
We can quantify uncertainty in parameters and predictions