Topic 7: Sparse modelling and regularisation Flashcards

1
Q

Describe shrinkage estimation

A

Maximum Likelihood Estimation provides nearly unbiased estimates of nearly minimum variance, and does so in an automatic way. But this becomes a problem when there are hundreds or thousands of parameters to estimate at the same time: insisting on unbiasedness then costs too much in variance. MLE is well suited to low-dimensional problems.

In shrinkage estimation we deliberately introduce bias in order to reduce the variance and therefore improve overall performance. Shrinkage estimation is valuable, and often necessary, in high-dimensional problems.

Shrinkage estimators trade a small increase in bias for a reduction in variance, addressing the shortcoming of having too little data for too many parameters.
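
A minimal simulation sketch (not from the notes, using numpy and an arbitrary shrinkage factor of 0.7) of how accepting some bias can lower the total squared error when many means are estimated at once:

import numpy as np

rng = np.random.default_rng(0)
N = 1000                              # many parameters estimated simultaneously
mu = rng.normal(0.0, 1.0, N)          # true means
x = mu + rng.normal(0.0, 1.0, N)      # one noisy observation per mean

mle = x                               # MLE: unbiased, but high variance
shrunk = 0.7 * x                      # shrunk towards zero: biased, lower variance

print("MLE total squared error:      ", np.sum((mle - mu) ** 2))
print("Shrinkage total squared error:", np.sum((shrunk - mu) ** 2))

On a run like this the shrinkage estimator's total squared error is noticeably smaller, even though each individual estimate is biased towards zero.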

2
Q

Describe the James-Stein estimator

A

James and Stein proved in 1961 that the total squared error risk of hat μ^{MLE} exceeds that of hat μ^{JS}, no matter what μ may be. There are still some good reasons for sticking with hat μ^{MLE} in low-dimensional problems.

EQUATION
https://docs.google.com/document/d/1lZkRkzYM16vkQL2gsuDF4Tj429b0uqVO3No0nP_0IKA/edit?tab=t.0

For N ≥ 4, the James-Stein estimator has lower squared error risk than the ML estimate, no matter what the true mean vector μ is. Here N is the number of parameters we want to estimate.

EQUATION

It works better because it makes a trade-off between bias and variance that ends up reducing the overall estimation error.
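
The EQUATION links are not reproduced here; one standard form of the estimator (the version that shrinks towards the grand mean, which is the one for which the N ≥ 4 condition applies, assuming x_i ~ N(μ_i, σ²) with known σ²) is:

\hat{\mu}_i^{JS} = \bar{x} + \left(1 - \frac{(N-3)\,\sigma^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}\right)(x_i - \bar{x})

The factor in brackets is less than one, so each observation is pulled part of the way towards the overall mean; the noisier the data relative to the spread of the x_i, the stronger the pull.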

3
Q

Describe least squares estimator

A

We have a collection of observations, but we don’t know the values of the coefficients β_0, β_1, …, β_k. These need to be estimated from the data.

The least squares principle provides a way of choosing the coefficients effectively: we choose the values of β_0, β_1, …, β_k that minimise the sum of squared errors:

EQUATION
https://docs.google.com/document/d/1clmGFausdiGoGmyd5VURL_qDY-o5p9-9UFRqNrWUzQc/edit?tab=t.0

This is called least squares estimation, because it gives the least possible value for the sum of squared errors.
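
The linked EQUATION is not reproduced here; with n observations y_i and predictor values x_{1,i}, …, x_{k,i} (notation assumed), the criterion being minimised is:

SSE = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{1,i} - \dots - \beta_k x_{k,i} \right)^2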

Finding the best estimates of the coefficients is often referred to as “training” (or fitting) the model on the data.

When referring to the estimated coefficients, we’ll use this notation: hat β_0, …, hat β_k.

4
Q

Describe ridge regression

A

Key idea of regularisation:

Limit the capacity of the model, so that it only learns patterns for which it has good evidence.

Effects of ridge regression:
- Biased parameter estimates:
Estimated coefficients are shrunk towards zero.
- The model requires “stronger evidence” to grow the weights.
- Avoids overfitting.
- Potentially better performance on new datasets.

EQUATION
https://docs.google.com/document/d/15bJWnQNBRoqTnUqNfddBTFMRL1nd1NaTTc3XYJxuKZE/edit?tab=t.0
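
The linked EQUATION is not reproduced here; the usual penalised form of the ridge criterion (λ ≥ 0 controls the penalty strength; notation assumed to match the least squares card) is:

\hat{\beta}^{ridge} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{j,i} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2

Larger λ shrinks the coefficients more strongly towards zero; λ = 0 recovers ordinary least squares. The intercept β_0 is conventionally left out of the penalty.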

5
Q

Describe standardisation

A

It’s a pre-processing step where we transform the predictors (independent variables) so that they are on a common scale, typically with mean 0 and standard deviation 1, and can be compared directly.

In ridge regression, for example, it is important to standardise the predictors, because the penalty term is the sum of squared coefficients and therefore treats every coefficient on the same scale.

If we don’t standardise the predictors in ridge regression, predictors measured in different units are penalised disproportionately.

Ridge regression is a form of shrinkage estimation, and we use standardisation to make sure its penalty term shrinks all coefficients fairly.
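
A minimal numpy sketch of the usual z-score standardisation (the function name standardise is illustrative):

import numpy as np

def standardise(X):
    """Centre each column to mean 0 and scale it to standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two predictors on very different scales (e.g. metres vs. grams)
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

Z = standardise(X)
print(Z.mean(axis=0))  # approximately [0, 0]
print(Z.std(axis=0))   # [1, 1]

After this transformation the ridge (or lasso) penalty treats each predictor's coefficient on the same footing.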

6
Q

Describe best-subset regression

A

Best subsets regression is an efficient way to identify models that adequately fit your data with as few predictors as possible. Models that contain a subset of predictors may estimate the regression coefficients and predict future responses with smaller variance than the model that includes all predictors.

The idea is to build a model using a subset of the variables. In actuality, what we’re after is to use the smallest subset that can adequately explain the variation in the response, both for inference and for prediction purposes.

We will have a loss function for fitting our linear model (e.g. the sum of squares or the negative log-likelihood).
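
A minimal numpy sketch (illustrative, not from the notes) of an exhaustive best-subset search for a fixed subset size m, using the residual sum of squares as the loss function:

from itertools import combinations

import numpy as np

def best_subset(X, y, m):
    """Fit ordinary least squares on every subset of exactly m predictors
    and return the subset with the smallest residual sum of squares."""
    n, p = X.shape
    best_rss, best_cols = np.inf, None
    for cols in combinations(range(p), m):
        Xs = np.column_stack([np.ones(n), X[:, list(cols)]])  # intercept + chosen predictors
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss

Because it tries all C(p, m) subsets, the cost grows very quickly with the number of available predictors, which is what motivates the greedy alternative on the next card.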

7
Q

Describe forward stepwise regression

A

Stepwise procedures have been around for a very long time. They were originally devised in times when data sets were quite modest in size, in particular in terms of the number of variables.

Best-subset (16.1):

  • At each step m, it looks at ALL possible combinations of m variables
  • For m=2, it examines every possible pair of variables
  • For m=3, it examines every possible trio of variables
  • And so on until M variables

Forward Stepwise (16.2):

  • More “greedy” approach
  • At each step m, it only adds ONE new variable to the previous set
  • It keeps the previously selected variables and just asks “what’s the best next variable to add?”
  • Much more computationally efficient (see the sketch after this list)
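
A minimal numpy sketch (illustrative) of the greedy forward-stepwise loop, to contrast with the exhaustive search on the best-subset card:

import numpy as np

def rss(cols, X, y):
    """Residual sum of squares of an OLS fit on the given columns plus an intercept."""
    Xs = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def forward_stepwise(X, y, M):
    """At each step, keep the variables already chosen and add the single
    remaining variable that most reduces the residual sum of squares."""
    selected = []
    for _ in range(M):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best_j = min(remaining, key=lambda j: rss(selected + [j], X, y))
        selected.append(best_j)
    return selected

This requires roughly M · p model fits in total, instead of examining every possible subset.
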
8
Q

Describe lasso regression

A

The stepwise model-selection methods from before are useful if we anticipate a model using a relatively small number of variables, even if the pool of available variables is very large. But if we expect a larger number of variables to play a role, these methods are less suitable.

Lasso is a convex optimisation problem, because the loss and constraint are convex.

Most of the estimated parameters will be exactly 0, which gives a quick way to select the most relevant variables (those whose coefficients are not 0).

Because the constraint in the lasso treats all the coefficients equally, it usually makes sense for all the elements of x to be in the same units. If not, we typically standardise the predictors beforehand, so that each has variance one.

EQUATION
https://docs.google.com/document/d/13jp8go_KgyA9RsahYx0UvAEOkpW2oF9D5oDZlKS-2jU/edit?tab=t.0
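
The linked EQUATION is not reproduced here; the usual constrained (“budget”) form of the lasso, with t the ℓ1 budget (the same t that appears on the regularisation-path card), is:

\hat{\beta}^{lasso} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{j,i} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k} |\beta_j| \le t

An equivalent penalised form replaces the constraint with an added term λ Σ_j |β_j| in the objective.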

9
Q

Describe regularisation path

A

Here is an example:
https://docs.google.com/document/d/1ROa5_VGiXcT9gG2Jr1EAovpKf4tcG–ly3hPH50PIDE/edit?tab=t.0

  • Each colored line represents a different feature/variable from the spam dataset
  • Y-axis: Coefficient values (how much each feature contributes to the prediction)
  • X-axis: R² on training data (shows how well the model fits the training data)
  • Top numbers (0.00 to 3.47): These are the values of t (the regularization parameter)

This illustrates a key feature of the lasso: as the constraint t is relaxed, it gradually “activates” variables, allowing more coefficients to become non-zero.
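
A minimal scikit-learn sketch (illustrative, on simulated data) of how such a path is traced: refit the lasso over a grid of penalty strengths and watch variables activate. Note that scikit-learn parameterises the penalty by a strength alpha rather than the budget t of the plot; a small alpha corresponds to a large budget t.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                          # 8 candidate features
beta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0])  # only 3 truly matter
y = X @ beta_true + rng.normal(size=100)

# As the penalty strength decreases, more coefficients become non-zero.
for alpha in [1.0, 0.5, 0.1, 0.01]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:<5} active={np.sum(coef != 0)}  coef={np.round(coef, 2)}")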

10
Q

Describe l1 and l2 norms and regularisation

A

The norm of a vector is a way of measuring its length.

  • L2 norm: measures the Euclidean length of β
  • L1 norm: measures the sum of absolute values of β

EQUATION
https://docs.google.com/document/d/1y8l85xydf7G39qBnArQs3DvXb99TCruV4cobSJ1gN1w/edit?tab=t.0
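
The linked EQUATION is not reproduced here; the standard definitions (for β with components β_1, …, β_k) are:

\|\beta\|_2 = \sqrt{\sum_{j=1}^{k} \beta_j^2}, \qquad \|\beta\|_1 = \sum_{j=1}^{k} |\beta_j|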

Ridge regression (ℓ2 regularisation) constrains the weights to lie in a ball centred at the origin. The solution is the point closest to the unconstrained solution that satisfies the constraint. The weights will be smaller, but generally non-zero:
PICTURE

LASSO regression (ℓ1 regularisation) imposes a constraint with sharp corners.
There’s a high probability that many parameters will be exactly zero. This produces a sparse solution:

PICTURE

Differences summarised:

Property           | Ridge (ℓ2 regularisation)      | Lasso (ℓ1 regularisation)
Penalty            | Sum of squared coefficients    | Sum of absolute coefficient values
Constraint region  | Ball centred at the origin     | Region with sharp corners
Coefficients       | Shrunk, generally non-zero     | Many exactly zero (sparse)
