Module 12 Flashcards
Example of overfitting
A nonlinear model has been trained so closely to the training dataset that it performs noticeably worse on the test dataset
Some ways to overfit a linear regression model
Including irrelevant explanatory variables
Including collinear explanatory variables
Some ways to underfit a linear regression model
Leaving out an important explanatory variable
A parsimonious model
Aims to strike a balance between overfitting and underfitting the model to the training dataset
It does this by having a low enough number of explanatory variables to avoid overfitting, while
having a high enough model fit to avoid underfitting
adjusted R^2
Used to measure model parsimoniousness
Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of observations and p is the number of explanatory variables
Interpreting the adjusted R^2
The higher the adjusted R^2 of a model, the more parsimonious we say that the model is, and therefore, the less likely the model is to be overfit to the training dataset
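A minimal sketch of this calculation in Python; the function name and the example values (R^2 = 0.85, n = 100, p = 10) are illustrations, not values from the cards:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Example: R^2 = 0.85 with n = 100 observations and p = 10 explanatory variables
print(adjusted_r_squared(0.85, 100, 10))  # roughly 0.833
```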
Number of possible models
2^p possible models
where p is the number of candidate explanatory variables
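A quick sketch of why the count is 2^p: each explanatory variable is either included or excluded. The three variable names below are hypothetical:

```python
from itertools import combinations

variables = ["x1", "x2", "x3"]  # p = 3 hypothetical explanatory variables
models = [combo for r in range(len(variables) + 1)
          for combo in combinations(variables, r)]
print(len(models))  # 2**3 = 8 possible models (the empty set is the intercept-only model)
```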
Heuristic techniques
Backward Elimination Algorithm
Forward Selection Algorithm
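The cards do not specify a selection criterion, so as one common variant, here is a rough backward-elimination sketch that drops the variable with the highest p-value at each step; the synthetic data, the column names, and the 0.05 threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    """Repeatedly drop the explanatory variable with the highest p-value
    until every remaining variable is significant at the threshold."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] > threshold:
            cols.remove(worst)
        else:
            break
    return cols

# Tiny synthetic example: x3 is pure noise and should usually get eliminated
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - X["x2"] + rng.normal(size=100)
print(backward_elimination(X, y))  # typically ['x1', 'x2']
```

Forward selection works the same way in reverse: start with no explanatory variables and repeatedly add the one that most improves the chosen criterion.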
Regularized linear regression model
Takes the objective function from our basic linear regression model and adds a penalty term
Penalty term
penalizes models that have too many explanatory variables that don’t bring enough predictive power to the model
Goals of the penalty term
Goal 1: provide a clearer interpretation of which explanatory variables can be left out
Goal 2: leave out variables that lead to overfitting
LASSO Regression (L1 Penalty Term)
(stands for least absolute shrinkage and selection operator)
The L1 penalty term is the sum of the absolute values of all slope coefficients
Clearer slope interpretation with LASSO regression
If a slope is set to 0, the LASSO regression model is suggesting that the corresponding explanatory variable can be left out of the model.
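A minimal scikit-learn sketch of this behavior, assuming synthetic data and an arbitrary alpha = 1.0 penalty strength (not values from the cards):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 candidate explanatory variables, only 3 are truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Standardize first so the L1 penalty treats all slopes comparably
model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
model.fit(X, y)

# Slopes shrunk exactly to 0 flag variables the LASSO model suggests leaving out
print(model.named_steps["lasso"].coef_)
```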
Ridge Regression (L2 Penalty Term)
The L2 penalty term is the sum of the squared values of all slope coefficients
Less clear slope interpretation with ridge regression
The resulting slopes found with ridge regression provide much less of a clear indication as to which explanatory variables should be left out of the model
Benefits of ridge regression
In the presence of multicollinearity, ridge regression slopes can be more trusted than those that would have been returned by a nonregularized linear regression model
The predicted impact of collinear explanatory variables is more likely to be distributed evenly across their slopes
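A small sketch of this weight-balancing effect with scikit-learn on deliberately collinear synthetic data (the data and alpha = 1.0 are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(ols.coef_)    # under collinearity these slopes are often large and offsetting
print(ridge.coef_)  # the impact tends to be spread more evenly across the collinear slopes
```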
Elastic Net Regression (L1 and L2 Penalty Terms)
Combines the strengths of LASSO and ridge regression by balancing feature selection and coefficient shrinkage.
The impact of the alpha parameter
When alpha is set high, the L1 term dominates and the resulting slopes tend to look more like those from LASSO regression
When alpha is set low, the L2 term dominates and the resulting slopes tend to look more like those from ridge regression, with more focus on balancing the weight across collinear slopes
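A hedged scikit-learn sketch of elastic net; note that scikit-learn calls the L1/L2 mixing parameter l1_ratio (its own alpha argument is the overall penalty strength), so l1_ratio plays the role of the alpha described on these cards. The data and parameter values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

# l1_ratio near 1 -> behaves more like LASSO (more slopes set exactly to 0)
# l1_ratio near 0 -> behaves more like ridge (weight balanced across collinear slopes)
lasso_like = ElasticNet(alpha=1.0, l1_ratio=0.9).fit(X, y)
ridge_like = ElasticNet(alpha=1.0, l1_ratio=0.1).fit(X, y)

print(lasso_like.coef_)
print(ridge_like.coef_)
```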
Cross validation techniques
Techniques involve creating multiple sets of training and test datasets from the full dataset
Leave one out cross validation
K fold cross validation
Leave one out cross validation
Has every observation appear in a test dataset exactly once
Every observation will be in a training dataset n-1 times
Benefits of LOOCV
Accurate test data performance
Because each training dataset contains n - 1 observations, each fitted model is close to the model trained on the full dataset, so the test dataset predictions reflect the accuracy the full model would likely achieve
No randomness
Low model variability
Drawbacks of LOOCV
Computationally expensive
More variable test data predictions
Inflation of model performance
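A minimal LOOCV sketch with scikit-learn, using synthetic data and mean squared error as an assumed scoring metric:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

# One model fit per observation: each observation is the test set exactly once
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(len(scores))     # n fits, which is the computational expense noted above
print(-scores.mean())  # average test MSE across all n held-out observations
```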
K-fold cross-validation
Every observation appears in a test dataset exactly once
Every observation will appear in a training dataset k-1 times
Benefits of k-fold cross-validation
Less computationally complex than LOOCV
More accurate test data performance than a single train-test split
Less inflation of model performance than LOOCV
Drawbacks of k-fold cross-validation
More computationally complex than train-test-split method
Less accurate test data performance than LOOCV
Randomness
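The same idea with k = 5 folds, again on assumed synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# shuffle=True introduces the randomness noted as a drawback; fixing random_state makes it reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kfold, scoring="neg_mean_squared_error")
print(-scores.mean())  # average test MSE across the k = 5 folds
```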
AIC
Akaike information criterion (AIC)
-2 * LLF + 2k
k = the number of slopes in the model
LLF = the optimal log likelihood function value of the model
AIC Interpretation
the lower the AIC score of a model is, the more parsimonious the model is considered to be
BIC
Bayes information criterion (BIC)
-2 * LLF + ln(n) * k
BIC Interpretation
the lower the BIC score of a model is, the more parsimonious the model is considered to be
AIC vs BIC
The only difference is the penalty on the number of slopes: 2k for AIC versus ln(n) * k for BIC
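A sketch comparing the two formulas against statsmodels on synthetic data; note that statsmodels counts the intercept in k, which may differ slightly from a slopes-only convention:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

result = sm.OLS(y, sm.add_constant(X)).fit()

# Manual versions of the formulas above, with k counted as statsmodels does
llf = result.llf
k = result.df_model + 1        # 3 slopes + the intercept
n = result.nobs
print(-2 * llf + 2 * k, result.aic)           # AIC
print(-2 * llf + np.log(n) * k, result.bic)   # BIC
```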
Downsides of BIC
Encourages a smaller number of slopes, which may come at the expense of training dataset fit, causing you to select a model with a worse fit to the training data
Downsides of AIC
AIC does not penalize a high number of slopes as much as the BIC score does, so the AIC score can be less helpful for the purpose of selecting a parsimonious model.