Regularization Flashcards

1
Q

What is the key difference between Ridge regression and Lasso regression in terms of coefficient shrinkage and variable selection?

A

Ridge regression shrinks coefficients but does not set them to zero, so it does not perform variable selection. Lasso regression, on the other hand, can shrink some coefficients exactly to zero, performing variable selection. This makes Lasso more interpretable but less stable under multicollinearity.
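
A quick way to see the contrast is to fit both on the same data and count exact zeros. A minimal sketch, assuming scikit-learn and made-up synthetic data (the alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 20 predictors, only 5 of which carry signal.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically > 0
```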

2
Q

How does Elastic Net regression combine the advantages of Ridge and Lasso regression? Provide a mathematical formulation.

A

Elastic Net regression adds both L1 and L2 penalties to the loss function, combining the strengths of Ridge and Lasso. The mathematical formulation is: β̂_enet = argmin_β { ½ ∑_i (y_i − x_iᵀβ)² + λ [ α ∑_j β_j² + (1 − α) ∑_j |β_j| ] }, where α ∈ [0, 1] controls the balance between the L2 (Ridge) and L1 (Lasso) penalties.
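
A hedged sketch of fitting this objective with scikit-learn's ElasticNet on synthetic data. Note the convention flip: sklearn's l1_ratio weights the L1 term, so it plays the role of (1 − α) in the formulation above, and sklearn's alpha is the overall strength λ (internally scaled by 1/n):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alpha_doc = 0.7  # α in the formulation above (weight on the L2 term)
# sklearn's l1_ratio is the L1 weight, i.e. (1 − α) in this deck's notation.
enet = ElasticNet(alpha=1.0, l1_ratio=1 - alpha_doc, max_iter=10000).fit(X, y)
print("Nonzero coefficients:", np.sum(enet.coef_ != 0))
```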

3
Q

Explain the role of the penalty parameter (λ) in Ridge regression. How does it control the bias-variance tradeoff?

A

In Ridge regression, the penalty parameter λ controls the amount of shrinkage applied to the coefficients. A larger λ increases shrinkage, reducing variance at the cost of added bias. This helps prevent overfitting, especially when there are many correlated predictors or few observations relative to the number of predictors.
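
The shrinkage side of the tradeoff is easy to see by sweeping λ and watching the coefficient norm fall; a small sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=80, n_features=10, noise=15.0, random_state=1)

# Larger lambda => stronger shrinkage => smaller coefficient norm
# (more bias, less variance).
for lam in [0.01, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>8}: ||beta||_2 = {np.linalg.norm(coef):.2f}")
```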

4
Q

What is the advantage of using Lasso regression over Ridge regression when dealing with a large number of predictors? Illustrate using an example.

A

Lasso regression can perform variable selection by shrinking the coefficients of irrelevant predictors exactly to zero. For example, in a dataset with thousands of predictors (say, gene-expression measurements) but only a few hundred observations, Lasso will typically retain a small subset of variables, yielding a simpler and more interpretable model.
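
A sketch of the p > n case with scikit-learn (synthetic data; the alpha value is illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 200 predictors but only 50 observations; just 5 predictors carry signal.
X, y = make_regression(n_samples=50, n_features=200, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"Lasso kept {selected.size} of {X.shape[1]} predictors:", selected)
```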

5
Q

Describe the Least Angle Regression (LAR) algorithm and explain how it is related to Lasso regression.

A

LAR is a sequential algorithm that adds to the active set the predictor most correlated with the current residual, then moves the coefficients in the equiangular direction until another predictor becomes equally correlated. It is closely related to Lasso: a simple modification of LAR (dropping a variable from the active set when its coefficient crosses zero) computes the entire Lasso solution path. Plain LAR, however, does not solve a penalized objective, while Lasso does.
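
scikit-learn exposes both variants through lars_path; a minimal sketch on synthetic data, where method='lasso' applies the modification described above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# method='lar' runs plain Least Angle Regression; method='lasso' yields the
# full Lasso path via the zero-crossing modification.
alphas, active, coefs = lars_path(X, y, method='lasso')
print("Order in which variables entered the model:", active)
print("Number of breakpoints along the path:", coefs.shape[1])
```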

6
Q

What are the computational challenges associated with Elastic Net regression, and how are they typically addressed?

A

Elastic Net requires tuning two hyperparameters, λ and α, so the search space is two-dimensional and more expensive to explore than Ridge's or Lasso's single parameter. This is typically addressed by cross-validating over a path of λ values for a small grid of α values, often with warm starts along the λ path to speed up the fits.
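
A sketch of the usual remedy, assuming scikit-learn's ElasticNetCV (which cross-validates an alpha path for each candidate mixing weight; the data and grids are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=100, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)

# Cross-validate a path of alpha (lambda) values for each mixing weight.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("Chosen alpha (lambda):", model.alpha_)
print("Chosen l1_ratio:", model.l1_ratio_)
```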

7
Q

In the context of regularized logistic regression, explain the effect of adding L1 and L2 penalties. What is the role of the pathwise coordinate descent algorithm?

A

In regularized logistic regression, adding an L1 (Lasso) or L2 (Ridge) penalty to the log-likelihood controls the complexity of the model: L1 produces sparse coefficient vectors, while L2 shrinks coefficients without zeroing them. The pathwise coordinate descent algorithm computes solutions efficiently over a decreasing grid of λ values, cyclically updating one coefficient at a time and using each fit as a warm start for the next λ.
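
An illustrative sketch of the two penalties in scikit-learn (its 'saga' solver is not pathwise coordinate descent, which is glmnet's approach, but it shows the effect of each penalty; C is the inverse of λ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# C is the inverse regularization strength (small C = large lambda).
l1 = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=5000).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='saga', C=0.1, max_iter=5000).fit(X, y)
print("Zero coefficients with L1:", np.sum(l1.coef_ == 0))  # typically several
print("Zero coefficients with L2:", np.sum(l2.coef_ == 0))  # typically none
```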

8
Q

Why is standardization of predictors important in regularization techniques like Ridge, Lasso, and Elastic Net?

A

Standardization ensures that each predictor has a mean of 0 and variance of 1, which matters because the L1 and L2 penalties are not scale-invariant. Without standardization, the amount of shrinkage each coefficient receives would depend on the arbitrary measurement units of its predictor rather than on the predictor's importance.
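
In practice this is usually done inside a pipeline so the scaling is learned on training folds only; a minimal scikit-learn sketch with one deliberately mis-scaled predictor:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
X[:, 0] *= 1000.0  # put one predictor on a wildly different scale

# StandardScaler puts all predictors on a common scale before the penalty
# is applied, so no variable is shrunk more merely because of its units.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)
print(model.named_steps['lasso'].coef_)
```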

9
Q

Discuss the limitations of Lasso regression when predictors are highly correlated. How does Elastic Net address this issue?

A

When predictors are highly correlated, Lasso may arbitrarily select one variable from a correlated group and drop the others, leading to unstable selections across samples. Elastic Net, by combining the L1 and L2 penalties, tends to shrink correlated variables together (the "grouping effect"), offering more stable estimates.
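
A sketch of the instability with two nearly identical predictors (synthetic data; alpha values illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x,                                # signal
                     x + 0.01 * rng.normal(size=200),  # near-duplicate
                     rng.normal(size=200)])            # noise predictor
y = 3 * x + rng.normal(size=200)

# Lasso tends to load on one of the duplicates; Elastic Net tends to
# spread the weight across both (the grouping effect).
print("Lasso:      ", Lasso(alpha=0.1).fit(X, y).coef_)
print("Elastic Net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```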

10
Q

Explain how the bias-variance tradeoff manifests in Ridge regression and why it is important for model performance.

A

Ridge regression reduces variance by shrinking coefficients, which mitigates overfitting. The shrinkage introduces bias, since even important variables are penalized, but when λ is well chosen the variance reduction outweighs the added bias, lowering overall prediction error.
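
The tradeoff can be made visible by comparing training and held-out error across λ; a hedged sketch on synthetic data where p is close to n:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=60, n_features=40, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training error rises with lambda (bias), while test error typically falls
# first (variance reduction) before rising again.
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    m = Ridge(alpha=lam).fit(X_tr, y_tr)
    print(f"lambda={lam:>7}: train MSE={mean_squared_error(y_tr, m.predict(X_tr)):9.1f}"
          f"  test MSE={mean_squared_error(y_te, m.predict(X_te)):9.1f}")
```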

11
Q

What is the effect of the mixing parameter (α) in Elastic Net regression? How does it determine the balance between L1 and L2 regularization?

A

The mixing parameter α in Elastic Net controls the balance between the Ridge (L2) and Lasso (L1) penalties. In the formulation above, α = 1 recovers Ridge regression and α = 0 recovers Lasso; values between 0 and 1 combine both. (Note that some software, such as glmnet and scikit-learn, uses the reverse convention, applying the mixing weight to the L1 term.)
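
A sketch of the sweep in scikit-learn (remember the convention flip: sklearn's l1_ratio weights the L1 term, i.e. it plays the role of 1 − α here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# More L1 weight => sparser solutions.
for l1_ratio in [0.1, 0.5, 0.9, 1.0]:
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
    print(f"l1_ratio={l1_ratio}: {np.sum(enet.coef_ != 0)} nonzero coefficients")
```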

12
Q

How does Ridge regression handle multicollinearity? Why is it more stable compared to Lasso in this context?

A

Ridge regression handles multicollinearity well because it shrinks the coefficients of correlated variables toward one another without setting any of them to zero. This reduces the variance of the estimates and stabilizes the model, unlike Lasso, which tends to arbitrarily keep one variable from a correlated group and drop the rest.
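
A sketch with a perfectly collinear pair (synthetic data): Ridge splits the weight roughly evenly, while Lasso tends to load on a single copy:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([x, x])             # two identical copies of the signal
y = 2 * x + 0.5 * rng.normal(size=300)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # roughly [1, 1]
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)  # weight concentrated on one copy
```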

13
Q

What is the geometric interpretation of Least Angle Regression (LAR) in terms of selecting predictors?

A

In LAR, the predictors and the response are vectors in n-dimensional space. At each step, the fitted vector moves in the direction that makes equal angles with all predictors in the active set (the "least angle" direction), advancing until an inactive predictor becomes equally correlated with the residual and joins the set.

14
Q

How does the selection of the penalty parameter (λ) in Lasso regression affect the resulting model’s interpretability and performance?

A

In Lasso regression, a larger λ increases the number of coefficients that are shrunk to zero, leading to simpler, more interpretable models. However, too large a λ may result in underfitting, where important variables are excluded.
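
scikit-learn's lasso_path makes the sparsity-versus-λ relationship easy to inspect; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Larger lambda => fewer active variables (simpler, more interpretable model),
# at the risk of dropping real signal when lambda is too large.
alphas, coefs, _ = lasso_path(X, y, alphas=[0.01, 0.1, 1.0, 10.0])
for a, c in zip(alphas, coefs.T):
    print(f"lambda={a:>5}: {np.sum(c != 0)} nonzero coefficients")
```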

15
Q

Explain how cross-validation is used in the selection of the penalty parameter (λ) in Ridge and Lasso regression.

A

Cross-validation selects λ by repeatedly splitting the data into training and validation folds (for example, k-fold CV). The model is fit on the training folds for a grid of λ values, and the λ that minimizes the average validation error is chosen.
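
A minimal sketch of CV-based selection, assuming scikit-learn's LassoCV and RidgeCV (the data and alpha grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=100, n_features=25, n_informative=6,
                       noise=10.0, random_state=0)

# LassoCV builds its own alpha path; RidgeCV is given an explicit grid.
lasso = LassoCV(cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("Lasso chose lambda =", lasso.alpha_)
print("Ridge chose lambda =", ridge.alpha_)
```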

16
Q

In regularized logistic regression, how does the log-likelihood function change when a penalty is added? Provide the mathematical expression.

A

The regularized log-likelihood in logistic regression, for binary y_i ∈ {0, 1}, is: (β̂₀, β̂) = argmax_{β₀, β} { ∑_i [ y_i (β₀ + x_iᵀβ) − log(1 + e^(β₀ + x_iᵀβ)) ] − λ · penalty(β) }, where the penalty term can be L1, L2, or a combination (Elastic Net).
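
A direct numpy translation of this expression, as a worked sketch (binary y_i, with the L1 penalty chosen for illustration; the data and coefficients are made up):

```python
import numpy as np

def penalized_log_likelihood(beta0, beta, X, y, lam):
    eta = beta0 + X @ beta                      # linear predictor beta_0 + x_i'beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - lam * np.sum(np.abs(beta))  # subtract the L1 penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)        # made-up binary labels
print(penalized_log_likelihood(0.0, np.array([0.5, -0.2, 0.0]), X, y, lam=1.0))
```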

17
Q

What are the advantages of using Elastic Net over both Ridge and Lasso when the dataset contains a mix of relevant and irrelevant variables?

A

Elastic Net is advantageous when the dataset has a mix of relevant and irrelevant variables. It combines the variable selection ability of Lasso with the stability of Ridge, making it ideal for handling multicollinearity and high-dimensional data.