Part II: Resampling + bias/variance Flashcards
What is variance?
How much the model's fit changes when it is trained on different datasets (the sensitivity of the fit to the particular training set).
What is bias?
The error introduced by approximating the true relationship with a simpler model, e.g. fitting a straight line by least squares when the true relationship is curved.
Does overfitting have high variance or bias?
High variance. The model also fits the noise in the training data, so its fit changes a lot from one dataset to another.
Does underfitting have high variance or bias?
High bias. The model is too simple to capture the relationship between the predictors and the response.
What is the minimum number of observations?
At the very least, as many observations as dimensions (predictors).
If we go up in dimensions, the number of observations needed to keep the same flexibility grows roughly exponentially (the curse of dimensionality), so we prefer fewer variables.
Which resampling methods can be used?
Leave 1 out cross validation
k-fold cross validation
bootstrapping
What is Leave 1 out cross validation?
We fit the model n times: each time we hold out 1 observation, train on the remaining n-1, and test on the held-out observation. The n test errors are then averaged. It is accurate but computationally expensive, since the model is fit n times.
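A minimal sketch of leave-one-out CV in Python with scikit-learn (the data below is made-up toy data, just for illustration):

```python
# Leave-one-out cross validation sketch (assumes scikit-learn; the toy data is invented).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                              # 50 observations, 3 predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

loo = LeaveOneOut()                                       # n splits: each observation is the test set once
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")
print("LOOCV estimate of test MSE:", -scores.mean())
```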
What is k-fold cross validation?
We split the data into k folds (groups). Each fold acts as the test set once and as part of the training set the remaining times, and the k test errors are averaged. It is less computationally expensive than leave one out, and the results are usually about as good. Often K=5 or K=10 is used.
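The same idea with K = 5 folds, again a sketch with scikit-learn and invented toy data:

```python
# 5-fold cross validation sketch (assumes scikit-learn; the toy data is invented).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

kf = KFold(n_splits=5, shuffle=True, random_state=0)      # K = 5 folds
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring="neg_mean_squared_error")
print("5-fold estimate of test MSE:", -scores.mean())
```

Instead of n model fits (as in leave one out), only k fits are needed.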
What is stepwise selection?
Forward: start from an empty model and add one predictor at a time, each time the one that improves the model the most (i.e. test all remaining predictors one at a time and keep the one that improves the model most).
Backward: start from the full model with all predictors and drop them one at a time, each time removing the least useful one.
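A sketch of greedy forward/backward selection using scikit-learn's SequentialFeatureSelector (note: it scores candidate subsets by cross-validation rather than by RSS/AIC as in the textbook procedure, so this is only an approximation; the toy data is invented):

```python
# Forward/backward greedy selection sketch (assumes scikit-learn >= 0.24; toy data is invented).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=100)   # only predictors 0 and 3 matter

for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                    direction=direction, cv=5)
    sfs.fit(X, y)
    print(direction, "selected columns:", np.flatnonzero(sfs.get_support()))
```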
What are the drawbacks of forward and backward selection?
Forward: we never see whether some variables only work well together, because candidate predictors are always evaluated one at a time on top of the current model.
Backward: because it starts from the full model it does consider predictors jointly, but it is still greedy: once a predictor is dropped it is never reconsidered, and fitting the full model requires more observations than predictors.
What are shrinkage/regularization methods?
Methods that add a penalty based on the size of the coefficients (the complexity of the model). We fit a model containing all p predictors, but the penalty constrains the coefficient estimates towards 0. Shrinking the coefficient estimates can significantly reduce their variance, so we keep all predictors while lowering the impact of the less important ones.
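As a formula (standard notation, not taken from these cards), the shrinkage fit minimizes the usual residual sum of squares plus lambda times a penalty on the coefficient sizes:

$$
\hat{\beta} = \arg\min_{\beta}\left[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} P(\beta_j)\right]
$$

where P(\beta_j) = |\beta_j| gives the lasso and P(\beta_j) = \beta_j^2 gives ridge (both defined below).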
What is Lasso?
A shrinkage/regularization method (L1 penalty) that can shrink some coefficients exactly to 0, effectively removing those variables and so performing automatic variable selection.
What is ridge?
A shrinkage/regularization method (L2 penalty) that shrinks coefficients towards, but never exactly to, 0.
It makes variables less important but never removes them entirely.
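A small sketch showing the difference (assumes scikit-learn, where lambda is called alpha; the toy data is invented):

```python
# Lasso vs ridge shrinkage sketch (assumes scikit-learn; toy data is invented).
# Lasso can set coefficients exactly to 0; ridge only shrinks them toward 0.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + rng.normal(scale=0.5, size=200)         # only the first predictor matters

lasso = Lasso(alpha=0.5).fit(X, y)                        # alpha plays the role of lambda
ridge = Ridge(alpha=0.5).fit(X, y)
print("lasso coefficients:", np.round(lasso.coef_, 3))    # irrelevant ones become exactly 0
print("ridge coefficients:", np.round(ridge.coef_, 3))    # irrelevant ones are small but nonzero
```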
When to use lasso over ridge?
Lasso produces simpler and more interpretable models. Predictive performance depends on the data: if many variables have no (independent) association with the response, lasso will work better than ridge; otherwise ridge tends to work better.
What does lambda refer to in shrinkage methods?
The tuning parameter that scales the penalty: the larger lambda, the stronger the shrinkage; lambda = 0 gives the ordinary least squares fit.
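In practice lambda is usually chosen by cross-validation; a sketch with scikit-learn's LassoCV (which calls lambda alpha; the toy data is invented):

```python
# Choosing lambda (called alpha in scikit-learn) by cross-validation; toy data is invented.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = LassoCV(cv=5).fit(X, y)                           # tries a grid of penalty strengths
print("lambda (alpha) chosen by 5-fold CV:", model.alpha_)
```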