Topic 7: Sparse modelling and regularisation Flashcards
Describe shrinkage estimation
Maximum likelihood estimation (MLE) provides nearly unbiased estimates of nearly minimum variance, and does so in an automatic way. However, unbiasedness becomes a liability when there are hundreds or thousands of parameters to estimate at the same time. MLE is well suited to low-dimensional problems.
In shrinkage estimation we deliberately introduce bias in order to reduce the variance, and therefore improve the overall performance. Shrinkage estimation is helpful, and often necessary, in high-dimensional problems.
Shrinkage estimators trade bias for variance: they accept a small amount of bias in exchange for a larger reduction in variance, which helps when many parameters must be estimated from limited data.
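This trade-off can be written down explicitly using the standard bias-variance decomposition of mean squared error (added here as a reference formula; it is not part of the original card):

```latex
% Bias-variance decomposition of the mean squared error of an estimator \hat{\theta} of \theta
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\!\left[(\hat{\theta}-\theta)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{\theta}]-\theta\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}(\hat{\theta})}_{\text{variance}}
```

A shrinkage estimator accepts a small increase in the bias term in return for a larger decrease in the variance term.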
Describe the James-Stein estimator
James and Stein proved in 1961 that the total squared error risk of hat μ^{MLE} exceeds that of hat μ^{JS}, no matter what μ may be. There are still some good reasons for sticking with hat μ^{MLE} in low-dimensional problems.
EQUATION
https://docs.google.com/document/d/1lZkRkzYM16vkQL2gsuDF4Tj429b0uqVO3No0nP_0IKA/edit?tab=t.0
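The equation itself lives in the linked document. As a hedged reconstruction, the version of the James-Stein estimator consistent with the N ≥ 4 condition given below (shrinking each coordinate towards the grand mean, with σ² = 1 assumed known) is:

```latex
% James-Stein estimator shrinking towards the grand mean \bar{x}; assumes x_i ~ N(\mu_i, 1)
\hat{\mu}^{JS}_i
  = \bar{x} + \left(1 - \frac{N-3}{\sum_{j=1}^{N}(x_j-\bar{x})^2}\right)(x_i - \bar{x}),
\qquad
\bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j
```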
For N ≥ 4, the James-Stein estimator has lower total squared error risk than the ML estimate, no matter what the true value of μ is. Here N is the number of parameters (means) we want to estimate.
EQUATION
It works better because it makes a trade-off between bias and variance that ends up reducing the overall estimation error.
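A small simulation can illustrate this. The sketch below is not from the notes; it assumes x_i ~ N(μ_i, 1) with σ² = 1 known and uses the grand-mean form of the estimator reconstructed above.

```python
# Sketch: compare total squared error of the MLE (x itself) with a
# James-Stein-style estimate that shrinks towards the grand mean.
import numpy as np

rng = np.random.default_rng(0)
N = 50                                   # number of means to estimate (N >= 4)
mu = rng.normal(0.0, 2.0, size=N)        # arbitrary "true" means
reps = 1000

mle_err = js_err = 0.0
for _ in range(reps):
    x = rng.normal(mu, 1.0)              # one observation per mean, sigma = 1
    xbar = x.mean()
    S = np.sum((x - xbar) ** 2)
    mu_js = xbar + (1.0 - (N - 3) / S) * (x - xbar)   # shrink towards xbar
    mle_err += np.sum((x - mu) ** 2)
    js_err += np.sum((mu_js - mu) ** 2)

print("average total squared error, MLE:", mle_err / reps)
print("average total squared error, JS :", js_err / reps)
```

Averaged over many replications, the James-Stein error should come out smaller than the MLE error.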
Describe the least squares estimator
We have a collection of observations, but we don’t know the values of the coefficients β_0, β_1, …, β_k. These need to be estimated from the data.
The least squares principle provides a way of choosing the coefficients: we pick the values of β_0, β_1, …, β_k that minimise the sum of squared errors:
EQUATION
https://docs.google.com/document/d/1clmGFausdiGoGmyd5VURL_qDY-o5p9-9UFRqNrWUzQc/edit?tab=t.0
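For reference, a standard way of writing this objective (the linked document may use different notation), with n observations and k predictors:

```latex
% Sum of squared errors minimised by the least squares estimates
S(\beta_0,\dots,\beta_k)
  = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_{1,i} - \dots - \beta_k x_{k,i}\right)^2
```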
This is called least squares estimation because it gives the least possible value for the sum of squared errors.
Finding the best estimates of the coefficients in this way is often described as “training” (or fitting) the model to the data.
When referring to the estimated coefficients, we’ll use this notation: hat β_0, …, hat β_k.
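A minimal sketch of least squares fitting in NumPy (illustrative only; the data and variable names are made up):

```python
# Sketch: estimate beta_0, ..., beta_k by minimising the sum of squared errors.
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3
X = rng.normal(size=(n, k))                               # predictors
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.3, size=n)   # true intercept beta_0 = 1

X1 = np.column_stack([np.ones(n), X])        # add a column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("estimated coefficients (hat beta_0 first):", beta_hat)
```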
Describe ridge regression
Key idea of regularisation: limit the capacity of the model, so that it only learns patterns for which it has good evidence.
Effects of ridge regression:
- Biased parameter estimates:
Estimated coefficients are shrunk towards zero.
- The model requires “stronger evidence” to grow the weights.
- Avoids overfitting.
- Potentially better performance on new datasets.
EQUATION
https://docs.google.com/document/d/15bJWnQNBRoqTnUqNfddBTFMRL1nd1NaTTc3XYJxuKZE/edit?tab=t.0
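For reference, a standard penalised form of the ridge criterion (the linked equation may instead use the equivalent constrained form); λ ≥ 0 controls the amount of shrinkage, and the intercept β_0 is typically not penalised:

```latex
% Ridge regression: least squares plus an L2 penalty on the coefficients
\hat{\beta}^{\,\mathrm{ridge}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{j,i}\Big)^{2}
    + \lambda \sum_{j=1}^{k}\beta_j^{2}
```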
Describe standardisation
It’s a pre-processing step where we transform the predictors (independent variables) so that each has mean 0 and standard deviation 1. The predictors are then on a common scale and can be compared.
In ridge regression, for example, it’s important to standardise the predictors: the penalty term is the sum of squared coefficients, and that sum depends on the scale on which each predictor is measured.
If we don’t standardise the predictors in ridge regression, variables on different scales are penalised disproportionately.
Ridge regression is a form of shrinkage estimation, and standardisation makes sure the penalty term shrinks all the coefficients on a comparable scale.
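A minimal sketch of standardising before ridge regression, using scikit-learn (illustrative only; the data are synthetic and a pipeline is just one convenient way to do it):

```python
# Sketch: standardise predictors (mean 0, sd 1), then fit ridge regression.
# Without standardisation, the predictor measured on the larger scale would be
# penalised very differently from the one on the unit scale.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([
    rng.normal(0, 1, n),          # predictor on a "unit" scale
    rng.normal(0, 1000, n),       # predictor on a much larger scale
])
y = 3.0 * X[:, 0] + 0.004 * X[:, 1] + rng.normal(size=n)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print("coefficients on the standardised scale:", model.named_steps["ridge"].coef_)
```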
Describe best-subset regression
Best subsets regression is an efficient way to identify models that adequately fit your data with as few predictors as possible. Models that contain a subset of predictors may estimate the regression coefficients and predict future responses with smaller variance than the model that includes all predictors.
The idea is to build a model using a subset of the variables. In actuality, what we’re after is to use the smallest subset that can adequately explain the variation in the response, both for inference and for prediction purposes.
We will have a loss function for fitting our linear model (e.g. sum of squares, negative log-likelihood).
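A rough brute-force sketch of best-subset selection (not from the notes; it uses the training sum of squared errors as the loss and ignores efficiency tricks such as branch-and-bound):

```python
# Sketch: for each subset size m, fit a linear model on every combination of m
# predictors and keep the combination with the smallest training SSE.
from itertools import combinations

import numpy as np

def best_subset(X, y, max_size):
    n, p = X.shape
    results = {}                                  # m -> (best SSE, best subset)
    for m in range(1, max_size + 1):
        best = (np.inf, None)
        for subset in combinations(range(p), m):
            Xs = np.column_stack([np.ones(n), X[:, subset]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ beta) ** 2)
            if sse < best[0]:
                best = (sse, subset)
        results[m] = best
    return results

# The best model of each size would then be compared on held-out data (or via a
# criterion such as Cp, AIC or BIC) to choose the final subset size.
```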
Describe forward stepwise regression
Stepwise procedures have been around for a very long time. They were originally devised in times when data sets were quite modest in size, in particular in terms of the number of variables.
Best-subset (16.1):
- At each step m, it looks at ALL possible combinations of m variables
- For m=2, it examines every possible pair of variables
- For m=3, it examines every possible trio of variables
- And so on until M variables
Forward Stepwise (16.2):
- More “greedy” approach
- At each step m, it only adds ONE new variable to the previous set
- It keeps the previously selected variables and just asks “what’s the best next variable to add?”
- Much more computationally efficient
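A rough sketch of forward stepwise selection, matching the greedy description above (illustrative only; the training sum of squared errors is used as the loss):

```python
# Sketch: start with no variables and, at each step, add the single variable
# that most reduces the training SSE, keeping everything selected so far.
import numpy as np

def forward_stepwise(X, y, max_steps):
    n, p = X.shape
    selected, path = [], []
    for _ in range(max_steps):
        best_sse, best_j = np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            Xs = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ beta) ** 2)
            if sse < best_sse:
                best_sse, best_j = sse, j
        selected.append(best_j)
        path.append((list(selected), best_sse))
    return path   # nested models of size 1, 2, ..., max_steps
```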
Describe lasso regression
The stepwise model-selection methods from before are useful if we anticipate a model using a relatively small number of variables, even if the pool of available variables is very large. But if we expect a larger number of variables to play a role, these methods become less suitable.
Lasso is a convex optimisation problem, because the loss and constraint are convex.
At the lasso solution, most of the parameters will be exactly 0, which gives a quick way to select the best variables (those that are not 0).
Because the constraint in the lasso treats all the coefficients equally, it usually makes sense for all the elements of x to be in the same units. If they are not, we typically standardise the predictors beforehand, so that each has variance one.
EQUATION
https://docs.google.com/document/d/13jp8go_KgyA9RsahYx0UvAEOkpW2oF9D5oDZlKS-2jU/edit?tab=t.0
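For reference, a standard constrained formulation of the lasso (the linked equation may use the equivalent penalised form with a multiplier λ); t ≥ 0 is the budget on the coefficients:

```latex
% Lasso: least squares subject to an L1 budget t on the coefficients
\hat{\beta}^{\,\mathrm{lasso}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{j,i}\Big)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{k}\lvert\beta_j\rvert \le t
```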
Describe regularisation path
Here is an example:
https://docs.google.com/document/d/1ROa5_VGiXcT9gG2Jr1EAovpKf4tcG–ly3hPH50PIDE/edit?tab=t.0
- Each colored line represents a different feature/variable from the spam dataset
- Y-axis: Coefficient values (how much each feature contributes to the prediction)
- X-axis: R² on training data (shows how well the model fits the training data)
- Top numbers (0.00 to 3.47): These are the values of t (the regularization parameter)
This illustrates a key feature of the lasso: as t grows, it gradually “activates” variables, letting more and more coefficients become non-zero.
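A minimal sketch of computing such a path with scikit-learn’s lasso_path (illustrative; the data are synthetic, and the path is indexed by the penalty strength α rather than by the budget t or training R²):

```python
# Sketch: trace a lasso regularisation path and watch variables "activate"
# (become non-zero) as the penalty weakens.
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                 # only three truly active variables
y = X @ beta + rng.normal(size=n)

alphas, coefs, _ = lasso_path(X, y)         # coefs has shape (p, n_alphas)
for a, c in zip(alphas[::20], coefs.T[::20]):
    print(f"alpha = {a:8.4f}   non-zero coefficients: {int(np.sum(c != 0))}")
```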
Describe l1 and l2 norms and regularisation
The norm of a vector is a way of measuring its length.
- L2 norm: measures the Euclidean length of β
- L1 norm: measures the sum of absolute values of β
EQUATION
https://docs.google.com/document/d/1y8l85xydf7G39qBnArQs3DvXb99TCruV4cobSJ1gN1w/edit?tab=t.0
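For reference, the standard definitions (the linked equation should be equivalent up to notation):

```latex
% Norms of a coefficient vector \beta = (\beta_1, \dots, \beta_k)
\lVert\beta\rVert_2 = \Big(\sum_{j=1}^{k}\beta_j^{2}\Big)^{1/2}
\quad\text{(L2, Euclidean norm)},
\qquad
\lVert\beta\rVert_1 = \sum_{j=1}^{k}\lvert\beta_j\rvert
\quad\text{(L1 norm)}
```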
Ridge regression (ℓ2 regularisation) constrains the weights to lie in a ball centred at the origin. The solution is the point satisfying the constraint that is closest to the unconstrained solution. The weights will be smaller, but generally non-zero:
PICTURE
LASSO regression (ℓ1 regularisation) imposes a constraint with sharp corners.
There’s a high probability that many parameters will be exactly zero. This produces a sparse solution:
PICTURE
Differences in table:
TABLE