Using a Holdout Set and LASSO Flashcards
Define a holdout set.
A subset of the original data that is set aside and not used for estimation.
Why do we use a holdout set?
To assess external validity, that is, how well the model will perform on live data, rather than just finding the model that best fits the original data.
What are the steps for evaluating a prediction using a holdout set?
- Split the original data into a larger work set and a smaller holdout set
- Further split the work set into k folds for cross-validation
- Build models and select the best model using k-fold cross-validation
- Re-estimate the best model using all observations in the work set
- Take the estimated best model and apply it to the holdout set
- Evaluate the prediction using the holdout set (see the sketch below)
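A minimal sketch of this workflow, assuming scikit-learn and synthetic data; the dataset and candidate models are illustrative only, not from the cards:

```python
# Minimal sketch of the holdout workflow (synthetic data, illustrative models).
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# 1. Split into a larger work set and a smaller holdout set
X_work, X_hold, y_work, y_hold = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

# 2-3. k-fold cross-validation on the work set to select the best model
candidates = {"ols": LinearRegression(), "lasso": Lasso(alpha=1.0)}
cv_mse = {name: -cross_val_score(m, X_work, y_work, cv=5,
                                 scoring="neg_mean_squared_error").mean()
          for name, m in candidates.items()}
best_name = min(cv_mse, key=cv_mse.get)

# 4. Re-estimate the best model on all observations in the work set
best_model = candidates[best_name].fit(X_work, y_work)

# 5-6. Apply it to the holdout set and evaluate the prediction
print(best_name, mean_squared_error(y_hold, best_model.predict(X_hold)))
```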
T/F: The model selection problem can be solved by trying out every potential combination of variables.
False: we cannot try out every single combination; there would be far too many.
What are the two methods to build models?
- By hand - specifying variables and model
- Using smart algorithms
What are the pros and cons of using LASSO?
Pros: no need to use outside information to specify the model
Cons: may be sensitive to overfitting and hard to interpret
What is the LASSO method?
LASSO - least absolute shrinkage and selection operator
- A method to select variables to include in a linear regression to produce good predictions and avoid overfitting
What are the two accomplishments that LASSO does at the same time?
- It selects a subset of the right-hand-side variables, dropping the other variables.
- It shrinks coefficients for some variables that it keeps in the regression
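A minimal sketch of both effects at once, using scikit-learn on synthetic data (the alpha value is illustrative):

```python
# Sketch: one LASSO fit both drops variables (coefficients set exactly
# to zero) and shrinks the coefficients it keeps, relative to OLS.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

kept = lasso.coef_ != 0
print("variables dropped:", int((~kept).sum()))        # selection
print("mean |coef| on kept variables, OLS vs LASSO:",  # shrinkage
      np.abs(ols.coef_[kept]).mean(), np.abs(lasso.coef_[kept]).mean())
```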
Define the Tuning Parameter.
The weight placed on the penalty term relative to the OLS fit; a larger weight enforces stronger variable selection.
Note: A lambda of 0 means the regression is OLS.
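For reference, one standard way to write the LASSO objective (the notation here is assumed, not from the cards): the OLS sum of squared residuals plus a penalty weighted by lambda,

$$\hat{\beta}^{\text{LASSO}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Setting lambda = 0 removes the penalty, which is why the estimates then coincide with OLS.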
What happens when we have an aggressive threshold? A lenient one?
Aggressive: higher threshold, so fewer variables are left in the regression
Lenient: lower threshold, so more variables are left in the regression
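A quick sketch of this on synthetic data (alpha values are illustrative; alpha is scikit-learn's name for the penalty weight):

```python
# Sketch: raising the penalty weight from lenient to aggressive
# leaves fewer variables in the regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [0.1, 1.0, 10.0]:  # lenient -> aggressive
    kept = (Lasso(alpha=alpha).fit(X, y).coef_ != 0).sum()
    print(f"alpha={alpha}: {kept} variables kept")
```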
T/F: LASSO modifies the way regression coefficients are estimated by adding a penalty term on the absolute size of the coefficients
True
T/F: Big data leads to larger estimation error
False: bigger data leads to smaller estimation error.
T/F: Lasso creates biased estimates.
True. This is because it shrinks the coefficients, which creates slight bias in the estimates.
Why is it okay for LASSO to create bias in its estimates? Does this bias mean LASSO is inferior to OLS?
Remember that lambda is chosen to minimize total loss, so it balances bias and variance. Although OLS does not produce biased estimates, its higher variance may increase the total loss. So this bias does not make LASSO inferior to OLS.
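A small simulation sketch of this point (synthetic data; the alpha value and sample sizes are illustrative assumptions):

```python
# Sketch: with many irrelevant regressors and few observations, OLS is
# unbiased but high-variance, while LASSO's shrinkage bias buys lower
# variance and a smaller holdout error.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=120, n_features=80, n_informative=5,
                       noise=20.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for name, model in [("OLS", LinearRegression()),
                    ("LASSO", Lasso(alpha=2.0, max_iter=10_000))]:
    model.fit(X_tr, y_tr)
    print(name, round(mean_squared_error(y_te, model.predict(X_te)), 1))
```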
What steps of model building does LASSO substitute? What do we decide?
We decide the initial right-hand-side variables.
LASSO decides the lambda and which variables to include in the final regression.
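A sketch of this division of labor using scikit-learn's LassoCV (the data are synthetic and illustrative): we supply the candidate right-hand-side variables; the algorithm picks lambda by cross-validation and zeroes out the variables it drops.

```python
# Sketch: we choose the initial right-hand-side variables (the columns
# of X); LassoCV then picks lambda (called alpha in scikit-learn) by
# cross-validation and decides which variables survive.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=25, n_informative=6,
                       noise=10.0, random_state=0)

model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen lambda (alpha):", model.alpha_)
print("variables kept in the final regression:", np.flatnonzero(model.coef_))
```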