By modeling the noise ε as Gaussian with mean 0 and variance σ², we can derive likelihood expressions and use them to find parameter estimates that maximize the probability of the observed data. This also yields formulas for variances and covariances of those estimates, enabling confidence intervals and uncertainty quantification for predictions.
We assume the data points are independent. The total likelihood is the product of individual Gaussian likelihoods for each point: L = ∏ₙ p(tₙ|w, σ²) = ∏ₙ N(tₙ | wᵀxₙ, σ²). Typically, we maximize the log-likelihood, log L = ∑ₙ log N(tₙ | wᵀxₙ, σ²).
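As a minimal sketch (assuming NumPy and made-up variable names), the log-likelihood can be evaluated directly; the closed form below follows from summing log N(tₙ | wᵀxₙ, σ²) over the data points:

```python
import numpy as np

def gaussian_log_likelihood(t, X, w, sigma2):
    """log L = sum_n log N(t_n | w^T x_n, sigma^2), written in closed form."""
    residuals = t - X @ w        # t_n - w^T x_n for every data point
    N = len(t)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * (residuals @ residuals) / sigma2
```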
(1) Form the design matrix X with a column of 1s and a column of x-values. (2) Write down the log-likelihood assuming p(t|X, w, σ²) = N(Xw, σ²I). (3) Differentiate with respect to w, set the derivative to zero, and solve to get ŵ = (XᵀX)⁻¹ Xᵀ t. (4) Substitute ŵ back into the log-likelihood and maximize with respect to σ² to find σ̂² = (1/N) (t − Xŵ)ᵀ (t − Xŵ).
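A minimal sketch of steps (1)–(4) in NumPy, with made-up x and t values; np.linalg.solve is used rather than forming (XᵀX)⁻¹ explicitly:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # example inputs (invented)
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # example targets (invented)

# (1) Design matrix: a column of 1s and a column of x-values.
X = np.column_stack([np.ones_like(x), x])

# (3) Maximum-likelihood weights: solve (X^T X) w = X^T t.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)

# (4) Maximum-likelihood noise variance from the residuals.
residuals = t - X @ w_hat
sigma2_hat = (residuals @ residuals) / len(t)

print(w_hat, sigma2_hat)
```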
The probabilistic view interprets the noise as Gaussian and maximizes the likelihood of observing the data; the classical least-squares view minimizes the sum of squared residuals. Both yield the same normal equations because maximizing exp(-residual²/(2σ²)) is equivalent to minimizing residual², leading to ŵ = (XᵀX)⁻¹ Xᵀ t.
Because it lets us quantify uncertainty. Merely minimizing sums of squares finds ŵ, but the probabilistic view also gives us estimates for σ², confidence intervals for ŵ, and predictive distributions for new t-values. This is crucial for risk assessment and understanding the reliability of our predictions.
By estimating σ² from historical data, we can compute the contribution of parameter uncertainty to the prediction, var{ŵᵀx_new} = σ̂² x_newᵀ (XᵀX)⁻¹ x_new (the full predictive variance adds the noise term σ̂²; see below). This formula shows how uncertainty in the parameters propagates to uncertainty in the new prediction. When x_new is far from the bulk of the training data, the uncertainty grows.
‘Unbiased’ means E[ŵ] = w, i.e., on average across many datasets generated from the true w, the estimate ŵ recovers that true w. Formally, E[ŵ] = E[(XᵀX)⁻¹Xᵀ t] = (XᵀX)⁻¹Xᵀ E[t], and E[t] = Xw, so E[ŵ] = (XᵀX)⁻¹XᵀXw = w.
If we simulate tₙ = wᵀxₙ + Gaussian(0, σ²) for a small dataset, we typically find σ̂² < σ². Formally, E[σ̂²] = σ² (1 − D/N), so the maximum-likelihood estimate is systematically lower than σ² for any D ≥ 1, and markedly so when N is small. That is because ŵ itself is fit to the same data, artificially reducing the residual sum of squares.
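A small simulation along these lines (NumPy, with an arbitrary true w and σ² chosen for illustration) reproduces the bias and also shows the N/(N − D) correction discussed below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small dataset: N = 10 points, D = 2 parameters (intercept + slope).
N, D = 10, 2
true_w = np.array([1.0, 2.0])     # arbitrary true parameters
true_sigma2 = 0.5                 # arbitrary true noise variance
x = np.linspace(0.0, 1.0, N)
X = np.column_stack([np.ones(N), x])

estimates = []
for _ in range(20000):
    t = X @ true_w + rng.normal(0.0, np.sqrt(true_sigma2), size=N)
    w_hat = np.linalg.solve(X.T @ X, X.T @ t)
    r = t - X @ w_hat
    estimates.append((r @ r) / N)             # ML estimate of sigma^2

print(np.mean(estimates))                      # ~ true_sigma2 * (1 - D/N) = 0.4
print(np.mean(estimates) * N / (N - D))        # ~ true_sigma2 after correction
```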
When we solve for ŵ, the regression line fits part of the data ‘too well,’ because ŵ is chosen to minimize residuals. We use the same data to estimate σ², so the apparent residual variance is reduced. We lose D degrees of freedom in fitting w, which shrinks E[σ̂²] by the factor (1 − D/N); correcting the bias means dividing by that factor.
Since E[σ̂²] = σ²(1 - D/N) = σ²(1 - 2/10) = σ²(0.8), you can multiply σ̂² by 1/(1 - 2/10) = 1/0.8 = 1.25 to get an unbiased estimate of σ². In other words, use σ̃² = (N/(N - D)) σ̂².
Under the Gaussian noise assumption, cov{ŵ} = σ² (XᵀX)⁻¹. The diagonal elements show how much each parameter can vary; large diagonal entries mean low precision (high uncertainty). Off-diagonal elements indicate correlation between parameters (how they move together to maintain a good fit).
If all x-values in a dataset are nearly the same, XᵀX becomes nearly singular, making (XᵀX)⁻¹ huge. Numerically, suppose x = [1.0, 1.1, 1.05, …]; then the column of 1s and the column of x-values in the design matrix are almost linearly dependent. The result is a very large cov{ŵ}, meaning the parameters are not well identifiable from the data.
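A quick numerical check of this, using made-up, nearly constant x-values:

```python
import numpy as np

# Inputs clustered around a single value, so the column of 1s and the
# column of x-values are nearly linearly dependent.
x = np.array([1.00, 1.10, 1.05, 1.02, 1.08])
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X
print(np.linalg.cond(XtX))    # very large condition number
print(np.linalg.inv(XtX))     # huge entries -> huge cov{w_hat} = sigma^2 (X^T X)^(-1)
```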
We have t_new = wᵀx_new + ε, and w is estimated by ŵ. The predictive variance is var{t_new} = σ̂² + x_newᵀ cov{ŵ} x_new = σ̂² + σ̂² x_newᵀ (XᵀX)⁻¹ x_new = σ̂²[1 + x_newᵀ (XᵀX)⁻¹ x_new]. The first term reflects noise in the outcome; the second term is uncertainty in ŵ.
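A sketch of this computation on synthetic data, evaluating the predictive variance both inside and far outside the training range (which also illustrates the point about distance from the data made below):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 20)                      # made-up training inputs
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)
sigma2_hat = np.mean((t - X @ w_hat) ** 2)
XtX_inv = np.linalg.inv(X.T @ X)

def predictive_variance(x_new_value):
    """sigma^2_hat * [1 + x_new^T (X^T X)^(-1) x_new] for a scalar input."""
    x_new = np.array([1.0, x_new_value])
    return sigma2_hat * (1.0 + x_new @ XtX_inv @ x_new)

print(predictive_variance(2.5))    # inside the training range: close to sigma2_hat
print(predictive_variance(20.0))   # far outside: noticeably larger
```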
High-degree polynomials can ‘bend’ to fit noise, leading to poor identifiability of coefficients (large cov{ŵ}). When extrapolating far from the data, small changes in high-degree coefficients can swing predictions wildly. By contrast, the simpler degree-3 model might maintain stable coefficients and hence smaller predictive variance far from training points.
Cross-validation directly tests predictive performance on held-out data. While a probabilistic approach gives parameter uncertainties and likelihoods, over-complex models can inflate the training likelihood. CV bypasses that by empirically assessing out-of-sample error, helping you choose a balance between complexity and generalization.
1) Collect (xₙ, tₙ). 2) Fit ŵ and σ̂² by maximizing Gaussian likelihood. 3) Use cov{ŵ} = σ̂² (XᵀX)⁻¹ to measure parameter uncertainty. 4) Predict next year’s sales with t_new = ŵᵀ x_new, but also compute var{t_new} = x_newᵀ cov{ŵ} x_new + σ̂². This gives a predictive range and expresses how certain or uncertain the model is about next year’s sales.
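A compact sketch of this workflow with invented ‘sales’ numbers, reporting the point prediction together with a rough ±2 standard deviation range:

```python
import numpy as np

# Hypothetical history: x_n = (1, year index), t_n = sales; all numbers invented.
years = np.arange(8.0)
sales = np.array([12.1, 13.0, 14.2, 15.1, 15.9, 17.2, 18.0, 19.1])

X = np.column_stack([np.ones_like(years), years])
w_hat = np.linalg.solve(X.T @ X, X.T @ sales)        # step 2: fit w_hat
sigma2_hat = np.mean((sales - X @ w_hat) ** 2)       # step 2: fit sigma^2_hat
cov_w = sigma2_hat * np.linalg.inv(X.T @ X)          # step 3: parameter uncertainty

x_new = np.array([1.0, 8.0])                         # next year
pred = w_hat @ x_new                                 # step 4: point prediction
var = x_new @ cov_w @ x_new + sigma2_hat             # step 4: predictive variance
sd = np.sqrt(var)
print(f"predicted sales: {pred:.1f} (roughly {pred - 2*sd:.1f} to {pred + 2*sd:.1f})")
```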
Because x_newᵀ (XᵀX)⁻¹ x_new grows when x_new lies farther from where the design matrix X provides strong coverage. This term inflates the total predictive variance. With fewer nearby data to anchor the fit, small changes in ŵ are magnified, increasing overall uncertainty.
By drawing many samples of w from its estimated distribution N(ŵ, cov{ŵ}) and computing t_new = wᵀx_new for each, you see how t_new varies across plausible parameter values consistent with your data. This produces a distribution over t_new, illustrating all likely outcomes rather than a single point estimate. It’s a form of approximate Bayesian model averaging within the maximum-likelihood framework.
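One way to do this in code (a sketch, assuming the sampling distribution N(ŵ, cov{ŵ}) and synthetic training data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 20)                       # made-up training data
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)
sigma2_hat = np.mean((t - X @ w_hat) ** 2)
cov_w = sigma2_hat * np.linalg.inv(X.T @ X)

# Draw plausible parameter vectors and look at the spread of the predictions.
x_new = np.array([1.0, 6.0])
w_samples = rng.multivariate_normal(w_hat, cov_w, size=5000)
t_new_samples = w_samples @ x_new                   # one prediction per sampled w

print(t_new_samples.mean(), t_new_samples.std())
```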
(1) Collect historical data: each instance xₙ = (1, yearₙ, track_conditionₙ, temperatureₙ), and tₙ = winning_timeₙ. (2) Construct X and solve ŵ = (XᵀX)⁻¹ Xᵀ t. (3) Estimate σ̂² from residuals. (4) For any new scenario (year, track condition, temperature), predict time as ŵᵀ x_new and compute the predictive variance. This handles multiple input dimensions in the same Gaussian-likelihood framework.
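A sketch of the multi-dimensional case with invented feature values (years are coded as years since 2000 to keep XᵀX well conditioned; all numbers are made up):

```python
import numpy as np

# Rows of X: [1, years since 2000, track_condition, temperature]; t: winning times.
X = np.array([
    [1.0,  0.0, 0.80, 22.0],
    [1.0,  4.0, 0.90, 25.0],
    [1.0,  8.0, 0.70, 20.0],
    [1.0, 12.0, 0.85, 23.0],
    [1.0, 16.0, 0.90, 27.0],
])
t = np.array([10.05, 9.98, 9.93, 9.81, 9.86])

w_hat = np.linalg.solve(X.T @ X, X.T @ t)            # step 2
residuals = t - X @ w_hat
sigma2_hat = (residuals @ residuals) / len(t)        # step 3

x_new = np.array([1.0, 24.0, 0.85, 24.0])            # hypothetical new scenario
pred = w_hat @ x_new                                 # step 4: predicted time
pred_var = sigma2_hat * (1.0 + x_new @ np.linalg.inv(X.T @ X) @ x_new)
print(pred, pred_var)
```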
It provides (1) a natural way to estimate and interpret noise variance, (2) a principled derivation of parameter estimates as maximum likelihood, (3) formulas for parameter uncertainty (cov{ŵ}), and (4) predictive distributions (with means and variances) for new inputs. These enhancements are crucial for risk management, confidence intervals, and any scenario where understanding uncertainty is as important as the prediction itself.
What is the formula for a linear model?
tₙ = w₀ + w₁xₙ,₁ + w₂xₙ,₂ + w₃xₙ,₃ + … + w_D xₙ,D
Represents the model’s prediction of the response tₙ as a weighted sum of the inputs (compactly, tₙ = wᵀxₙ).
What does the vector xn represent in a linear model?
xₙ = [1, xₙ,₁, xₙ,₂, …, xₙ,D]ᵀ
It includes a constant term and the input features.
What is the matrix X in the context of a linear regression model?
X = [x₁ᵀ; x₂ᵀ; …; x_Nᵀ], i.e., row n is [1, xₙ,₁, xₙ,₂, …, xₙ,D] for n = 1, …, N
It is the design matrix containing all input features for each observation.
What does the vector t represent in linear regression?
t = [t₁, t₂, …, t_N]ᵀ
It contains the actual response values for each observation.