3.3 Stepwise Selection and Regularization Flashcards
Explain the difference between AIC and BIC in the context of stepwise selection. Why does BIC tend to favor models with fewer predictors compared to AIC?
- AIC = SSE* + 2p
- BIC = SSE* + ln(n)·p
Here SSE* denotes the goodness-of-fit term (identical in both criteria), p is the number of estimated parameters, and n is the number of observations.
BIC tends to favor models with fewer predictors than AIC because whenever n ≥ 8, ln(n) > 2, which makes the penalty for each additional predictor larger in BIC than in AIC.
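As a quick numeric check of the two penalties, here is a minimal sketch using statsmodels on made-up data (the data and variable names are illustrative assumptions, not from the flashcards):

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: n = 50 observations, 3 candidate predictors
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))
y = 2 + 1.5 * X[:, 0] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
# statsmodels computes AIC = -2*loglik + 2p and BIC = -2*loglik + ln(n)*p,
# i.e. the same goodness-of-fit term with different penalties.
print(fit.aic, fit.bic)
# With n = 50, ln(50) is about 3.9 > 2, so BIC penalizes each extra parameter more.
```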
Compare and contrast forward selection and backward selection procedures. How does each method handle the hierarchical principle when dealing with interaction terms?
Forward selection starts with no predictors and adds them one by one, while backward selection starts with all predictors and removes them one by one.
Both methods respect the hierarchical principle: forward selection will not consider adding an interaction term until both main effects are in the model, while backward selection will not remove a main effect if its interaction term is still in the model.
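A minimal sketch of forward selection by AIC, assuming a pandas DataFrame of numeric predictors and statsmodels (interaction terms and the hierarchical principle are omitted for brevity; backward selection would run the same loop in reverse, dropping the term whose removal lowers AIC most):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(data, target):
    """Greedy forward selection: repeatedly add the predictor that lowers AIC most."""
    remaining = [c for c in data.columns if c != target]
    selected = []
    # Start from the intercept-only model
    current_aic = sm.OLS(data[target], np.ones(len(data))).fit().aic
    while remaining:
        scores = []
        for cand in remaining:
            X = sm.add_constant(data[selected + [cand]])
            scores.append((sm.OLS(data[target], X).fit().aic, cand))
        best_aic, best_cand = min(scores)
        if best_aic >= current_aic:
            break  # no remaining predictor improves AIC, so stop
        selected.append(best_cand)
        remaining.remove(best_cand)
        current_aic = best_aic
    return selected
```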
What is binarization in the context of stepwise selection? How does it affect the handling of categorical variables, and what are its potential consequences for model interpretation?
Binarization is the process of creating dummy variables for categorical predictors manually, rather than letting the modeling function do it automatically. This allows stepwise selection to treat each level of a categorical variable independently.
However, because stepwise selection can then drop individual dummy variables, factor levels may effectively be merged with the baseline level based solely on statistical criteria, potentially resulting in models that are difficult to interpret.
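A minimal sketch of manual binarization with pandas (column names and data are made up for illustration):

```python
import pandas as pd

# Hypothetical data with one categorical predictor
df = pd.DataFrame({"region": ["urban", "suburban", "rural", "urban"],
                   "claims": [3, 1, 0, 2]})

# Manual binarization: one dummy column per non-baseline level, so stepwise
# selection can add or drop each level separately instead of treating
# "region" as a single unit.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
df_bin = pd.concat([df.drop(columns="region"), dummies], axis=1)
print(df_bin.columns.tolist())  # ['claims', 'region_suburban', 'region_urban']
```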
Describe the main differences between ridge regression, lasso regression, and elastic net regression in terms of:
- Their penalty terms
- Their ability to perform variable selection
- Their handling of the hierarchical principle
Penalty terms:
- Ridge: λ Σ bj^2
- Lasso: λ Σ |bj|
- Elastic Net: λ(α Σ |bj| + (1 - α) Σ bj^2)
Variable selection (illustrated in the code sketch below):
- Ridge: Does not perform variable selection
- Lasso: Can perform variable selection
- Elastic Net: Can perform variable selection
Hierarchical principle:
- Ridge: Always respects it
- Lasso: Can violate it
- Elastic Net: Can violate it
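A minimal scikit-learn sketch contrasting the three methods on made-up data. Note the naming difference: scikit-learn's alpha parameter plays the role of λ here, and (for elastic net) l1_ratio plays the role of α; the specific values below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 of the 10 predictors matter

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(type(model).__name__, "coefficients set exactly to zero:", n_zero)
# Typically: Ridge reports 0 (no variable selection), while Lasso and ElasticNet
# zero out several of the unimportant coefficients.
```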
Why is it crucial to scale predictors before applying regularization methods?
Scaling predictors is crucial because the penalty treats all coefficients alike, while the size of a coefficient depends on its predictor's units: a predictor measured on a large scale has a small coefficient that contributes little to the penalty and is shrunk relatively little, whereas a predictor on a small scale is penalized heavily. Standardizing the predictors puts them on a common scale so that all coefficients are shrunk fairly.
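In scikit-learn this is commonly handled by putting a standardizer in front of the penalized model, for example (a sketch; X_train and y_train are assumed to exist elsewhere):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Standardize each predictor (mean 0, sd 1) before the penalty is applied,
# so no coefficient is shrunk more or less just because of its units.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
# model.fit(X_train, y_train)  # X_train, y_train assumed to be defined elsewhere
```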
Explain the concept of cross-validation in the context of regularization. How does it help in selecting the optimal λ value?
Cross-validation (CV) in regularization involves dividing the training set into k folds, fitting the model on k-1 folds, and calculating the error on the left-out fold. This process is repeated k times, and the average error is used as the cross-validation error. By performing this process for different λ values, we can select the λ that minimizes the cross-validation error.
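A minimal sketch with scikit-learn's LassoCV, which runs this k-fold loop over a grid of λ values (scikit-learn calls λ "alpha"; the data are made up):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

# 5-fold cross-validation over an automatically chosen grid of lambda values;
# alpha_ is the lambda with the smallest average CV error across the folds.
cv_fit = LassoCV(cv=5, n_alphas=100).fit(X, y)
print(cv_fit.alpha_)
```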
Interpret a cross-validation plot for lasso regression. Explain the significance of:
- The left dashed line
- The right dashed line
- The numbers at the top of the plot
- The left dashed line: the λ value with the minimum CV error.
- The right dashed line: the largest λ whose CV error is within one standard error of the minimum; this is the λ chosen by the one-standard-error rule.
- The numbers at the top of the plot: the number of predictors (nonzero coefficients) in the model at each λ value. (A sketch for reproducing this kind of plot follows below.)
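The plot described above is the one produced by glmnet's cv.glmnet in R. A rough, hedged reproduction of the CV-error curve with scikit-learn on made-up data (it shows the error curve and the minimum-error λ, but not the predictor counts along the top):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

cv_fit = LassoCV(cv=5).fit(X, y)
mean_err = cv_fit.mse_path_.mean(axis=1)             # average CV error at each lambda
plt.plot(np.log(cv_fit.alphas_), mean_err)
plt.axvline(np.log(cv_fit.alpha_), linestyle="--")    # lambda with the minimum CV error
plt.xlabel("log(lambda)")
plt.ylabel("Mean CV error")
plt.show()
```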
What is the one-standard-error rule in cross-validation? How does it balance model simplicity and performance?
The one-standard-error rule selects the largest λ (the most parsimonious model) whose CV error is within one standard error of the minimum CV error. It balances simplicity and performance by choosing a less flexible, more interpretable model whose error is roughly comparable to that of the best-performing model.
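A minimal sketch of applying the rule by hand to LassoCV output (scikit-learn does not apply the one-standard-error rule automatically; mse_path_ holds one row per candidate λ and one column per fold, and the data are made up):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)
cv_fit = LassoCV(cv=5).fit(X, y)

mean_err = cv_fit.mse_path_.mean(axis=1)
se_err = cv_fit.mse_path_.std(axis=1) / np.sqrt(cv_fit.mse_path_.shape[1])

i_min = np.argmin(mean_err)                  # index of the minimum-CV-error lambda
threshold = mean_err[i_min] + se_err[i_min]  # one standard error above the minimum
# One-standard-error rule: the largest lambda whose CV error stays under the threshold
lambda_1se = cv_fit.alphas_[mean_err <= threshold].max()
print(cv_fit.alpha_, lambda_1se)  # lambda_1se >= alpha_, i.e. a simpler model
```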
How do ridge regression, lasso regression, and elastic net regression handle high-dimensional data? What advantage do they have over traditional stepwise selection in these scenarios?
Ridge, lasso, and elastic net regressions handle high-dimensional data well by shrinking coefficients towards zero, reducing the impact of less important predictors. These methods are useful when the number of predictors is large relative to the number of observations. They have an advantage over stepwise selection in these scenarios because they work with all predictors simultaneously in a single fit, whereas stepwise selection must fit a long sequence of separate models and relies on unpenalized least squares, which becomes unstable when the number of predictors approaches or exceeds the number of observations.
Explain how elastic net regression combines features of both ridge and lasso regression. What is the role of alpha, and how does it affect variable selection?
Elastic net regression combines ridge and lasso by using a weighted average of their penalty terms, controlled by α. When α = 0, it is equivalent to ridge regression; when α = 1, it is equivalent to lasso. As α increases, the elastic net places more weight on the absolute-value (lasso) penalty, so it performs more variable selection and behaves more like the lasso.
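In scikit-learn the mixing weight is exposed as l1_ratio (playing the role of α here, with alpha playing the role of λ); a small sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# l1_ratio = 0 is pure ridge and l1_ratio = 1 is pure lasso; larger values put
# more weight on the absolute-value penalty and tend to zero out more coefficients.
for l1_ratio in (0.0, 0.5, 1.0):
    fit = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
    print(l1_ratio, "zero coefficients:", int(np.sum(fit.coef_ == 0)))
```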
Compare the coefficient plots for ridge regression and lasso regression as lambda increases. How do the methods differ in their treatment of less important predictors?
As λ increases, ridge regression causes the coefficients to shrink towards zero, although they do not become exactly zero. In contrast, lasso regression allows the coefficients to become exactly zero, effectively removing predictors from the model.
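A small sketch of this contrast over an increasing grid of λ values (made-up data; scikit-learn's alpha is λ):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

for lam in (0.01, 0.1, 1.0, 10.0):
    ridge_coef = Ridge(alpha=lam).fit(X, y).coef_
    lasso_coef = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam}: ridge zero coefs={int(np.sum(ridge_coef == 0))}, "
          f"lasso zero coefs={int(np.sum(lasso_coef == 0))}")
# Ridge coefficients shrink toward zero but stay nonzero; the lasso zeros out
# more and more coefficients as lambda grows.
```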
Explain how lasso regression can violate the hierarchical principle. Why does this issue not occur with ridge regression?
Lasso regression can violate the hierarchical principle because it can set individual coefficients to zero independently. It might retain an interaction term while setting one of its main effects to zero. Ridge regression does not have this issue because it practically never sets coefficients to exactly zero for finite λ.
Discuss the difference in how stepwise selection and regularization methods approach model flexibility. How does each method aim to balance model fit and complexity?
Stepwise selection stays within the MLR framework, using an algorithm to decide which predictors to keep and which to remove, so model flexibility changes in discrete steps as terms enter or leave the model. Regularization instead keeps the predictors but shrinks their coefficients, placing more emphasis on the more important ones, with flexibility tuned continuously through λ. Both approaches aim to balance model fit and complexity, but regularization adjusts flexibility in a smoother, more continuous way.