Topic 7: Sparse modelling and regularisation Flashcards
Describe shrinkage estimation
Maximum likelihood estimation (MLE) provides nearly unbiased estimates of nearly minimum variance, and does so in an automatic way. However, unbiasedness becomes a liability when there are hundreds or thousands of parameters to estimate at the same time. MLE is well suited to low-dimensional problems.
In shrinkage estimation we deliberately introduce bias in order to reduce the variance, and therefore improve the overall performance. Shrinkage estimation is helpful, and often necessary, in high-dimensional problems.
Shrinkage estimators trade bias for variance: they accept a small amount of bias in exchange for a larger reduction in variance, which helps when many parameters must be estimated from limited data.
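This trade-off can be written down explicitly using the standard bias-variance decomposition of mean squared error (added here as a reference formula; it is not part of the original card):

```latex
% Bias-variance decomposition of the mean squared error of an estimator \hat{\theta} of \theta
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\!\left[(\hat{\theta}-\theta)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{\theta}]-\theta\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}(\hat{\theta})}_{\text{variance}}
```

A shrinkage estimator accepts a small increase in the bias term in return for a larger decrease in the variance term.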
Describe the James-Stein estimator
James and Stein proved in 1961 that the total squared error risk of hat μ^{MLE} exceeds that of hat μ^{JS}, no matter what μ may be. There are still some good reasons for sticking with hat μ^{MLE} in low-dimensional problems.
EQUATION
https://docs.google.com/document/d/1lZkRkzYM16vkQL2gsuDF4Tj429b0uqVO3No0nP_0IKA/edit?tab=t.0
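The equation itself lives in the linked document. As a hedged reconstruction, the version of the James-Stein estimator consistent with the N ≥ 4 condition given below (shrinking each coordinate towards the grand mean, with σ² = 1 assumed known) is:

```latex
% James-Stein estimator shrinking towards the grand mean \bar{x}; assumes x_i ~ N(\mu_i, 1)
\hat{\mu}^{JS}_i
  = \bar{x} + \left(1 - \frac{N-3}{\sum_{j=1}^{N}(x_j-\bar{x})^2}\right)(x_i - \bar{x}),
\qquad
\bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j
```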
For N ≥ 4, the James-Stein estimator has lower total squared error risk than the ML estimate, no matter what the true value of μ is. Here N is the number of parameters (means) we want to estimate.
EQUATION
It works better because it makes a trade-off between bias and variance that ends up reducing the overall estimation error.
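A small simulation can illustrate this. The sketch below is not from the notes; it assumes x_i ~ N(μ_i, 1) with σ² = 1 known and uses the grand-mean form of the estimator reconstructed above.

```python
# Sketch: compare total squared error of the MLE (x itself) with a
# James-Stein-style estimate that shrinks towards the grand mean.
import numpy as np

rng = np.random.default_rng(0)
N = 50                                   # number of means to estimate (N >= 4)
mu = rng.normal(0.0, 2.0, size=N)        # arbitrary "true" means
reps = 1000

mle_err = js_err = 0.0
for _ in range(reps):
    x = rng.normal(mu, 1.0)              # one observation per mean, sigma = 1
    xbar = x.mean()
    S = np.sum((x - xbar) ** 2)
    mu_js = xbar + (1.0 - (N - 3) / S) * (x - xbar)   # shrink towards xbar
    mle_err += np.sum((x - mu) ** 2)
    js_err += np.sum((mu_js - mu) ** 2)

print("average total squared error, MLE:", mle_err / reps)
print("average total squared error, JS :", js_err / reps)
```

Averaged over many replications, the James-Stein error should come out smaller than the MLE error.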
Describe the least squares estimator
We have a collection of observations, but we don’t know the values of the coefficients β_0, β_1, …, β_k. These need to be estimated from the data.
The least squares principle provides a way of choosing the coefficients: we pick the values of β_0, β_1, …, β_k that minimise the sum of squared errors:
EQUATION
https://docs.google.com/document/d/1clmGFausdiGoGmyd5VURL_qDY-o5p9-9UFRqNrWUzQc/edit?tab=t.0
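For reference, a standard way of writing this objective (the linked document may use different notation), with n observations and k predictors:

```latex
% Sum of squared errors minimised by the least squares estimates
S(\beta_0,\dots,\beta_k)
  = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_{1,i} - \dots - \beta_k x_{k,i}\right)^2
```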
This is called least squares estimation because it gives the least possible value for the sum of squared errors.
Finding the best estimates of the coefficients in this way is often described as “training” (or fitting) the model to the data.
When referring to the estimated coefficients, we’ll use this notation: hat β_0, …, hat β_k.
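A minimal sketch of least squares fitting in NumPy (illustrative only; the data and variable names are made up):

```python
# Sketch: estimate beta_0, ..., beta_k by minimising the sum of squared errors.
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3
X = rng.normal(size=(n, k))                               # predictors
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.3, size=n)   # true intercept beta_0 = 1

X1 = np.column_stack([np.ones(n), X])        # add a column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("estimated coefficients (hat beta_0 first):", beta_hat)
```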
Describe ridge regression
Key idea of regularisation: limit the capacity of the model, so that it only learns patterns for which it has good evidence.
Effects of ridge regression:
- Biased parameter estimates:
Estimated coefficients are shrunk towards zero.
- The model requires “stronger evidence” to grow the weights.
- Avoids overfitting.
- Potentially better performance on new datasets.
EQUATION
https://docs.google.com/document/d/15bJWnQNBRoqTnUqNfddBTFMRL1nd1NaTTc3XYJxuKZE/edit?tab=t.0
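For reference, a standard penalised form of the ridge criterion (the linked equation may instead use the equivalent constrained form); λ ≥ 0 controls the amount of shrinkage, and the intercept β_0 is typically not penalised:

```latex
% Ridge regression: least squares plus an L2 penalty on the coefficients
\hat{\beta}^{\,\mathrm{ridge}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{j,i}\Big)^{2}
    + \lambda \sum_{j=1}^{k}\beta_j^{2}
```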
Describe standardisation
It’s a pre-processing step where we transform the predictors (independent variables) so that each has mean 0 and standard deviation 1. The predictors are then on a common scale and can be compared.
In ridge regression, for example, it’s important to standardise the predictors: the penalty term is the sum of squared coefficients, and that sum depends on the scale on which each predictor is measured.
If we don’t standardise the predictors in ridge regression, variables on different scales are penalised disproportionately.
Ridge regression is a form of shrinkage estimation, and standardisation makes sure the penalty term shrinks all the coefficients on a comparable scale.
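A minimal sketch of standardising before ridge regression, using scikit-learn (illustrative only; the data are synthetic and a pipeline is just one convenient way to do it):

```python
# Sketch: standardise predictors (mean 0, sd 1), then fit ridge regression.
# Without standardisation, the predictor measured on the larger scale would be
# penalised very differently from the one on the unit scale.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([
    rng.normal(0, 1, n),          # predictor on a "unit" scale
    rng.normal(0, 1000, n),       # predictor on a much larger scale
])
y = 3.0 * X[:, 0] + 0.004 * X[:, 1] + rng.normal(size=n)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print("coefficients on the standardised scale:", model.named_steps["ridge"].coef_)
```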
Describe best-subset regression
Best subsets regression is an efficient way to identify models that adequately fit your data with as few predictors as possible. Models that contain a subset of predictors may estimate the regression coefficients and predict future responses with smaller variance than the model that includes all predictors.
The idea is to build a model using a subset of the variables. In actuality, what we’re after is to use the smallest subset that can adequately explain the variation in the response, both for inference and for prediction purposes.
We will have a loss function for fitting our linear model (e.g. sum of squares, negative log-likelihood).
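A rough brute-force sketch of best-subset selection (not from the notes; it uses the training sum of squared errors as the loss and ignores efficiency tricks such as branch-and-bound):

```python
# Sketch: for each subset size m, fit a linear model on every combination of m
# predictors and keep the combination with the smallest training SSE.
from itertools import combinations

import numpy as np

def best_subset(X, y, max_size):
    n, p = X.shape
    results = {}                                  # m -> (best SSE, best subset)
    for m in range(1, max_size + 1):
        best = (np.inf, None)
        for subset in combinations(range(p), m):
            Xs = np.column_stack([np.ones(n), X[:, subset]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ beta) ** 2)
            if sse < best[0]:
                best = (sse, subset)
        results[m] = best
    return results

# The best model of each size would then be compared on held-out data (or via a
# criterion such as Cp, AIC or BIC) to choose the final subset size.
```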
Describe forward stepwise regression
Stepwise procedures have been around for a very long time. They were originally devised in times when data sets were quite modest in size, in particular in terms of the number of variables.
Best-subset (16.1):
- At each step m, it looks at ALL possible combinations of m variables
- For m=2, it examines every possible pair of variables
- For m=3, it examines every possible trio of variables
- And so on until M variables
Forward Stepwise (16.2):
- More “greedy” approach
- At each step m, it only adds ONE new variable to the previous set
- It keeps the previously selected variables and just asks “what’s the best next variable to add?”
- Much more computationally efficient
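A rough sketch of forward stepwise selection, matching the greedy description above (illustrative only; the training sum of squared errors is used as the loss):

```python
# Sketch: start with no variables and, at each step, add the single variable
# that most reduces the training SSE, keeping everything selected so far.
import numpy as np

def forward_stepwise(X, y, max_steps):
    n, p = X.shape
    selected, path = [], []
    for _ in range(max_steps):
        best_sse, best_j = np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            Xs = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ beta) ** 2)
            if sse < best_sse:
                best_sse, best_j = sse, j
        selected.append(best_j)
        path.append((list(selected), best_sse))
    return path   # nested models of size 1, 2, ..., max_steps
```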
Describe lasso regression
The stepwise model-selection methods from before are useful if we anticipate a model using a relatively small number of variables, even if the pool of available variables is very large. But if we expect a larger number of variables to play a role, these methods become less suitable.
Lasso is a convex optimisation problem, because the loss and constraint are convex.
At the lasso solution, most of the parameters will be exactly 0, which gives a quick way to select the best variables (those that are not 0).
Because the constraint in the lasso treats all the coefficients equally, it usually makes sense for all the elements of x to be in the same units. If they are not, we typically standardise the predictors beforehand, so that each has variance one.
EQUATION
https://docs.google.com/document/d/13jp8go_KgyA9RsahYx0UvAEOkpW2oF9D5oDZlKS-2jU/edit?tab=t.0
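For reference, a standard constrained formulation of the lasso (the linked equation may use the equivalent penalised form with a multiplier λ); t ≥ 0 is the budget on the coefficients:

```latex
% Lasso: least squares subject to an L1 budget t on the coefficients
\hat{\beta}^{\,\mathrm{lasso}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{j,i}\Big)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{k}\lvert\beta_j\rvert \le t
```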
Describe regularisation path
Here is an example:
https://docs.google.com/document/d/1ROa5_VGiXcT9gG2Jr1EAovpKf4tcG–ly3hPH50PIDE/edit?tab=t.0
- Each colored line represents a different feature/variable from the spam dataset
- Y-axis: Coefficient values (how much each feature contributes to the prediction)
- X-axis: R² on training data (shows how well the model fits the training data)
- Top numbers (0.00 to 3.47): These are the values of t (the regularization parameter)
This illustrates a key feature of the lasso: as t grows, it gradually “activates” variables, letting more and more coefficients become non-zero.
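A minimal sketch of computing such a path with scikit-learn’s lasso_path (illustrative; the data are synthetic, and the path is indexed by the penalty strength α rather than by the budget t or training R²):

```python
# Sketch: trace a lasso regularisation path and watch variables "activate"
# (become non-zero) as the penalty weakens.
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                 # only three truly active variables
y = X @ beta + rng.normal(size=n)

alphas, coefs, _ = lasso_path(X, y)         # coefs has shape (p, n_alphas)
for a, c in zip(alphas[::20], coefs.T[::20]):
    print(f"alpha = {a:8.4f}   non-zero coefficients: {int(np.sum(c != 0))}")
```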
Describe l1 and l2 norms and regularisation
The norm of a vector is a way of measuring its length.
- L2 norm: measures the Euclidean length of β
- L1 norm: measures the sum of absolute values of β
EQUATION
https://docs.google.com/document/d/1y8l85xydf7G39qBnArQs3DvXb99TCruV4cobSJ1gN1w/edit?tab=t.0
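For reference, the standard definitions (the linked equation should be equivalent up to notation):

```latex
% Norms of a coefficient vector \beta = (\beta_1, \dots, \beta_k)
\lVert\beta\rVert_2 = \Big(\sum_{j=1}^{k}\beta_j^{2}\Big)^{1/2}
\quad\text{(L2, Euclidean norm)},
\qquad
\lVert\beta\rVert_1 = \sum_{j=1}^{k}\lvert\beta_j\rvert
\quad\text{(L1 norm)}
```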
Ridge regression (ℓ2 regularisation) constrains the weights to lie in a ball centred at the origin. The solution is the point satisfying the constraint that is closest to the unconstrained solution. The weights will be smaller, but generally non-zero:
PICTURE
LASSO regression (ℓ1 regularisation) imposes a constraint with sharp corners.
There’s a high probability that many parameters will be exactly zero. This produces a sparse solution:
PICTURE
Differences in table:
TABLE