Regularization Flashcards
What is the key difference between Ridge regression and Lasso regression in terms of coefficient shrinkage and variable selection?
Ridge regression shrinks coefficients toward zero but never sets them exactly to zero, so it performs no variable selection. Lasso regression, by contrast, can shrink some coefficients exactly to zero, performing automatic variable selection. This makes Lasso models more interpretable, though less stable when predictors are highly correlated.
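A minimal sketch in scikit-learn (synthetic data; the penalty strengths are illustrative, not tuned) that makes the contrast concrete:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, Lasso

    # Ten predictors, only three of which actually drive the response.
    X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X, y)   # scikit-learn calls lambda `alpha`
    lasso = Lasso(alpha=1.0).fit(X, y)

    # Ridge leaves every coefficient nonzero; Lasso zeroes out the
    # uninformative predictors.
    print("Ridge nonzero coefs:", np.sum(ridge.coef_ != 0))  # all 10
    print("Lasso nonzero coefs:", np.sum(lasso.coef_ != 0))  # typically ~3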
How does Elastic Net regression combine the advantages of Ridge and Lasso regression? Provide a mathematical formulation.
Elastic Net regression adds both L1 and L2 penalties to the loss function, combining the strengths of Ridge and Lasso. The mathematical formulation is: β̂_enet = argmin_β { ½ ∑_i (y_i − x_iᵀβ)² + λ ∑_j [α β_j² + (1 − α)|β_j|] }, where λ sets the overall strength of regularization and α controls the mix: under this convention, α = 1 gives the pure L2 (Ridge) penalty and α = 0 the pure L1 (Lasso) penalty.
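As a sanity check, the objective can be written directly in NumPy. This is a sketch of the formula above only, not a fitting routine (solvers such as coordinate descent are used to minimize it):

    import numpy as np

    def enet_objective(beta, X, y, lam, alpha):
        # Convention as above: alpha weights the L2 (Ridge) term,
        # (1 - alpha) weights the L1 (Lasso) term.
        rss = 0.5 * np.sum((y - X @ beta) ** 2)
        penalty = lam * np.sum(alpha * beta ** 2 + (1 - alpha) * np.abs(beta))
        return rss + penalty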
Explain the role of the penalty parameter (λ) in Ridge regression. How does it control the bias-variance tradeoff?
In Ridge regression, the penalty parameter λ controls how strongly the coefficients are shrunk toward zero. At λ = 0 the estimate equals ordinary least squares (low bias, high variance); as λ → ∞ all coefficients approach zero (high bias, low variance). An intermediate λ trades a small increase in bias for a larger reduction in variance, which helps prevent overfitting, especially when the model is complex.
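The shrinkage is visible in the closed-form Ridge solution β̂ = (XᵀX + λI)⁻¹Xᵀy; a small NumPy sketch (simulated data) showing the coefficient norm falling as λ grows:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0])
    y = X @ beta_true + rng.standard_normal(100)

    # Closed-form Ridge estimate: (X'X + lam*I)^(-1) X'y.
    for lam in [0.0, 1.0, 10.0, 100.0]:
        beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
        print(lam, round(float(np.linalg.norm(beta_hat)), 3))  # norm shrinks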
What is the advantage of using Lasso regression over Ridge regression when dealing with a large number of predictors? Illustrate using an example.
Lasso regression can perform variable selection by shrinking the coefficients of irrelevant predictors to zero. For example, in a dataset with hundreds of predictors of which only a handful matter, Lasso tends to retain just those relevant variables, yielding a simpler, more interpretable model.
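An illustration with scikit-learn (synthetic data with 200 predictors, 5 informative; the exact selection varies with the data and penalty strength):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=200, n_informative=5,
                           noise=1.0, random_state=0)

    lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
    kept = np.flatnonzero(lasso.coef_)
    # Usually a short list close to the 5 truly informative predictors.
    print("predictors kept:", kept)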
Describe the Least Angle Regression (LAR) algorithm and explain how it is related to Lasso regression.
LAR is a sequential algorithm that adds variables to the model in order of their correlation with the current residual. It is closely related to Lasso: a simple modification of LAR (dropping a variable from the active set when its coefficient crosses zero) produces the entire Lasso solution path. Plain LAR never removes a variable once it has entered, whereas Lasso's L1 penalty can.
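scikit-learn exposes both variants through lars_path; a small sketch (synthetic data) comparing them:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import lars_path

    X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                           noise=1.0, random_state=0)

    # method='lar' runs plain Least Angle Regression; method='lasso' applies
    # the zero-crossing modification that yields the exact Lasso path.
    alphas_lar, active_lar, coefs_lar = lars_path(X, y, method='lar')
    alphas_lasso, active_lasso, coefs_lasso = lars_path(X, y, method='lasso')
    print("LAR active set:  ", active_lar)
    print("Lasso active set:", active_lasso)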
What are the computational challenges associated with Elastic Net regression, and how are they typically addressed?
Elastic Net requires tuning two hyperparameters, λ and α, so the search space is two-dimensional and model selection is more expensive than for Ridge or Lasso. This is typically addressed by cross-validating over a small grid of α values and, for each α, computing the entire solution path over λ, where warm starts make the path cheap to traverse.
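scikit-learn's ElasticNetCV follows this recipe; a minimal sketch (synthetic data; note that scikit-learn calls λ `alpha` and weights the L1 term with `l1_ratio`, the reverse of the α convention above):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNetCV

    X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                           noise=1.0, random_state=0)

    # Cross-validates a lambda path for each candidate mixing weight.
    model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
    print("chosen lambda:  ", model.alpha_)
    print("chosen l1_ratio:", model.l1_ratio_)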
In the context of regularized logistic regression, explain the effect of adding L1 and L2 penalties. What is the role of the pathwise coordinate descent algorithm?
In regularized logistic regression, adding an L1 (Lasso) or L2 (Ridge) penalty to the negative log-likelihood controls model complexity; the L1 penalty additionally sets some coefficients exactly to zero. Pathwise coordinate descent computes solutions efficiently over a decreasing grid of λ values by updating one coefficient at a time and using each solution as a warm start for the next λ.
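A minimal scikit-learn sketch of L1-penalized logistic regression (synthetic data; note that scikit-learn parameterizes the penalty as C = 1/λ, and its 'saga' solver is a stochastic-gradient method rather than the pathwise coordinate descent used by glmnet):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                               random_state=0)

    # Smaller C means stronger regularization (larger lambda).
    clf = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=5000)
    clf.fit(X, y)
    print("nonzero coefficients:", int((clf.coef_ != 0).sum()))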
Why is standardization of predictors important in regularization techniques like Ridge, Lasso, and Elastic Net?
Standardization gives each predictor mean 0 and variance 1, which matters because the penalty terms are not scale-invariant. Without it, a predictor measured on a large scale receives a small coefficient and is therefore penalized less than an equivalent predictor on a small scale, so the amount of shrinkage each variable gets would depend arbitrarily on its units.
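In practice the scaler is fit inside a pipeline so it only sees training data during cross-validation; a short sketch (synthetic data, illustrative penalty):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=100, n_features=10, noise=1.0,
                           random_state=0)

    # Standardize, then fit; the penalty now treats all predictors equally.
    model = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)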
Discuss the limitations of Lasso regression when predictors are highly correlated. How does Elastic Net address this issue?
When predictors are highly correlated, Lasso may arbitrarily select one variable from the group and zero out the others, and which one it keeps can change with small perturbations of the data. Elastic Net's L2 component induces a grouping effect: correlated predictors receive similar coefficients and tend to enter or leave the model together, giving more stable selection.
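A sketch of the grouping effect on two nearly identical predictors (synthetic data; the exact split of weight varies, but the pattern below is typical):

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    rng = np.random.default_rng(0)
    z = rng.standard_normal(200)
    # Two near-duplicate predictors plus three pure-noise columns.
    X = np.column_stack([z + 0.01 * rng.standard_normal(200),
                         z + 0.01 * rng.standard_normal(200),
                         rng.standard_normal((200, 3))])
    y = 2 * z + 0.1 * rng.standard_normal(200)

    # Lasso tends to load one twin; Elastic Net tends to share the weight.
    print("Lasso:      ", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
    print("Elastic Net:", np.round(
        ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))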
Explain how the bias-variance tradeoff manifests in Ridge regression and why it is important for model performance.
Ridge regression reduces variance by shrinking coefficients, which mitigates overfitting, but the shrinkage introduces bias because even important variables are penalized. When λ is chosen well, the reduction in variance outweighs the added bias, so expected prediction error falls; this tradeoff is exactly why an intermediate amount of regularization can outperform both the unpenalized fit and an over-penalized one.
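A small simulation (assumed setup: more predictors than the sample comfortably supports) in which test error typically traces the familiar U-shape, high variance at small λ and high bias at large λ:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.standard_normal((80, 40))
    beta = rng.standard_normal(40)
    y = X @ beta + 3.0 * rng.standard_normal(80)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Expect high test error at the extremes, lowest somewhere in between.
    for lam in [1e-3, 1e-1, 1e1, 1e3]:
        pred = Ridge(alpha=lam).fit(X_tr, y_tr).predict(X_te)
        print(f"lambda={lam:g}  test MSE={np.mean((y_te - pred) ** 2):.1f}")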
What is the effect of the mixing parameter (α) in Elastic Net regression? How does it determine the balance between L1 and L2 regularization?
The mixing parameter α in Elastic Net controls the balance between the L2 (Ridge) and L1 (Lasso) penalties. Under the convention used here, α = 1 makes the model behave like Ridge regression and α = 0 like Lasso; values in between blend the two. (Beware that some software, such as glmnet and scikit-learn's l1_ratio, uses the reverse convention, with 1 corresponding to pure Lasso.)
How does Ridge regression handle multicollinearity? Why is it more stable compared to Lasso in this context?
Ridge regression is effective in handling multicollinearity because it shrinks the coefficients of correlated variables toward each other without setting any of them to zero, which reduces the variance of the estimates and stabilizes the model. Lasso, by contrast, tends to keep one variable from a correlated group and drop the rest, and small changes in the data can change which one survives.
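A sketch contrasting OLS and Ridge on two nearly collinear predictors (synthetic data; the OLS coefficients typically blow up in opposite directions):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    z = rng.standard_normal(100)
    X = np.column_stack([z, z + 1e-3 * rng.standard_normal(100)])
    y = z + 0.1 * rng.standard_normal(100)

    # OLS splits wildly between the near-duplicates; Ridge keeps both small.
    print("OLS:  ", np.round(LinearRegression().fit(X, y).coef_, 1))
    print("Ridge:", np.round(Ridge(alpha=1.0).fit(X, y).coef_, 1))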
What is the geometric interpretation of Least Angle Regression (LAR) in terms of selecting predictors?
Geometrically, LAR treats the predictors as vectors in the n-dimensional sample space. It starts with all coefficients at zero, finds the predictor most correlated with the residual, and moves the fitted vector along that predictor's direction until a second predictor is equally correlated with the residual; from then on it moves along the equiangular direction that makes equal angles with all active predictors, adding one predictor at each step.
How does the selection of the penalty parameter (λ) in Lasso regression affect the resulting model’s interpretability and performance?
In Lasso regression, a larger λ increases the number of coefficients that are shrunk to zero, leading to simpler, more interpretable models. However, too large a λ may result in underfitting, where important variables are excluded.
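A sketch of the sparsity-interpretability tradeoff, sweeping λ on synthetic data (the λ values are illustrative; note scikit-learn calls λ `alpha`):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                           noise=1.0, random_state=0)

    # As lambda grows the model gets sparser; past some point it also drops
    # genuinely relevant predictors and begins to underfit.
    for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
        n_kept = int(np.sum(
            Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_ != 0))
        print(f"lambda={lam:g}  nonzero coefficients={n_kept}")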
Explain how cross-validation is used in the selection of the penalty parameter (λ) in Ridge and Lasso regression.
Cross-validation selects the penalty parameter λ by splitting the data into K folds: for each candidate λ, the model is fit on K − 1 folds and its prediction error is measured on the held-out fold. The λ with the smallest average validation error is chosen (or, by the common one-standard-error rule, the largest λ whose error is within one standard error of the minimum).
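scikit-learn wraps this procedure in LassoCV and RidgeCV; a minimal sketch with LassoCV (synthetic data):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV

    X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                           noise=1.0, random_state=0)

    # Fits a lambda path on each training split and averages validation error.
    model = LassoCV(cv=5).fit(X, y)
    print("chosen lambda:", model.alpha_)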