Module 12 Flashcards

1
Q

example of overfitting

A

A nonlinear model has been fit too closely to the training dataset, so its performance on the test dataset is worse

2
Q

Some ways to overfit a linear regression model

A

Irrelevant explanatory variables
Collinear explanatory variables

3
Q

Some ways to underfit a linear regression model

A

Leaving out an important explanatory variable

4
Q

A parsimonious model

A

Aims to strike a balance between overfitting and underfitting the model to the training dataset
It does this by:
Having a low enough number of explanatory variables to avoid overfitting, while
Having a high enough model fit to avoid underfitting

5
Q

adjusted R^2

A

Used to measure model parsimoniousness
Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n = number of observations, p = number of explanatory variables
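A minimal Python sketch of this formula, with made-up numbers for illustration:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n = number of observations and p = number of explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical example: a model with R^2 = 0.80, n = 50 observations, p = 6 slopes.
print(adjusted_r2(0.80, n=50, p=6))  # ~0.772: the extra-slope penalty pulls the score below 0.80
```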

6
Q

Interpreting the adjusted R^2

A

The higher the adjusted R^2 of a model, the more parsimonious we say that the model is, and therefore the less likely the model is to be overfit to the training dataset

7
Q

Number of possible models

A

2^p possible models
p = number of candidate explanatory variables (e.g., p = 3 gives 2^3 = 8 possible models)

8
Q

Heuristic techniques

A

Backwards Elimination Algorithm
Forward Selection Algorithm
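A minimal forward-selection sketch, assuming a pandas DataFrame X of candidate explanatory variables, a response y, and adjusted R^2 as the selection criterion (a different criterion such as AIC or BIC could be used instead):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y):
    """Greedily add the candidate variable that most improves adjusted R^2;
    stop when no remaining candidate improves it."""
    remaining, selected = list(X.columns), []
    best_adj_r2 = -np.inf
    while remaining:
        adj_r2, var = max(
            (sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().rsquared_adj, v)
            for v in remaining
        )
        if adj_r2 <= best_adj_r2:
            break
        best_adj_r2 = adj_r2
        selected.append(var)
        remaining.remove(var)
    return selected, best_adj_r2

# Hypothetical usage with synthetic data: x3 and x4 are irrelevant and should not be selected.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 * X["x1"] - 3 * X["x2"] + rng.normal(size=100)
print(forward_selection(X, y))
```

Backward elimination works in reverse: start with all candidate variables and repeatedly drop the one whose removal most improves the criterion.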

9
Q

Regularized linear regression model

A

Take the objective function from our basic linear regression model (the sum of squared errors) and add a penalty term to it

10
Q

Penalty term

A

Penalizes models that include explanatory variables that don’t bring enough predictive power to the model

11
Q

Goal of the penalty term

A

Goal 1: give a clear interpretation of which explanatory variables can be left out of the model
Goal 2: leave out variables that lead to overfitting

12
Q

LASSO Regression L1

A

Stands for least absolute shrinkage and selection operator
The L1 penalty term is the sum of the absolute values of all slope coefficients

13
Q

Clearer slope interpretation with LASSO regression

A

If a slope is set to 0, the LASSO regression model is suggesting that the corresponding explanatory variable can be left out of the model.
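A minimal scikit-learn sketch of this zeroing-out behaviour on synthetic data (note that scikit-learn's alpha argument here is the overall penalty strength, not the L1/L2 mixing parameter discussed in the elastic net cards below):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # columns 2-4 are irrelevant

X_scaled = StandardScaler().fit_transform(X)  # scale features so the penalty treats the slopes comparably
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
print(lasso.coef_)  # slopes on the irrelevant columns are typically shrunk to exactly 0
```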

14
Q

Ridge Regression (L2 Penalty Term)

A

The L2 penalty term is the sum of the squared values of all slope coefficients (the squared L2 norm)

15
Q

Less clear slope interpretation with ridge regression

A

The resulting slopes found with ridge regression provide much less of a clear indication as to which explanatory variables should be left out of the model, since ridge shrinks slopes toward 0 but rarely sets them exactly to 0

16
Q

Benefits of ridge regression

A

In the presence of multicollinearity, ridge regression slopes can be more trusted than those that would have been returned by a nonregularized linear regression model:
the predicted impact of collinear variables is more likely to be distributed evenly across their slopes
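A minimal scikit-learn sketch of this behaviour, assuming synthetic data in which two explanatory variables are nearly identical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # x2 is nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 2 * x2 + rng.normal(size=200)

print(LinearRegression().fit(X, y).coef_)  # collinearity can leave these two slopes unstable and offsetting
print(Ridge(alpha=1.0).fit(X, y).coef_)    # ridge tends to spread the shared effect more evenly across x1 and x2
```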

17
Q

Elastic net regression (L1 and L2 Penalty Term)

A

Combines the strengths of Lasso and Ridge by balancing between feature selection and coefficient shrinkage.

18
Q

The impact of the alpha parameter

A

When alpha is set high, the L1 term is weighted more heavily, and the resulting slopes will tend to look more like those from LASSO regression.

When alpha is set low, the L2 term is weighted more heavily, and the resulting slopes will tend to look more like those from ridge regression, with more focus on balancing the weight across collinear slopes.
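A minimal scikit-learn sketch; note that scikit-learn names the L1/L2 mixing parameter l1_ratio (its alpha argument is the overall penalty strength), so l1_ratio plays the role of the alpha parameter described in this card:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # columns 2-3 are irrelevant

# l1_ratio near 1 -> mostly L1 penalty, slopes behave more like LASSO (some set to exactly 0);
# l1_ratio near 0 -> mostly L2 penalty, slopes behave more like ridge (shrunk and more balanced).
for l1_ratio in (0.9, 0.1):
    model = ElasticNet(alpha=0.5, l1_ratio=l1_ratio).fit(X, y)
    print(l1_ratio, model.coef_)
```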

19
Q

Cross validation techniques

A

Techniques that involve creating multiple pairs of training and test datasets from the full dataset:
Leave-one-out cross-validation
K-fold cross-validation

20
Q

Leave one out cross validation

A

Every observation appears in a test dataset exactly once (as the single held-out observation)
Every observation appears in a training dataset n-1 times

21
Q

Benefits of LOOCV

A

Accurate test data performance: each training dataset contains n-1 observations, so each fitted model closely resembles the model trained on the full dataset, and the test predictions reflect the accuracy that full model might have achieved
No randomness in how the splits are formed
Low model variability across the n fitted models

22
Q

Drawbacks of LOOCV

A

Computationally expensive
More variable test data predictions
Inflation of model performance

23
Q

K-fold cross-validation

A

Every observation appears in a test dataset exactly once
Every observation appears in a training dataset k-1 times
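A minimal scikit-learn sketch of both techniques on synthetic data: LeaveOneOut fits n models (one per held-out observation), while KFold with k = 5 fits only 5:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
model = LinearRegression()

# LOOCV: n = 100 fits, each test dataset is a single observation.
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()

# 5-fold CV: 5 fits, each observation appears in a test fold exactly once.
kfold_mse = -cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0),
                             scoring="neg_mean_squared_error").mean()

print(loo_mse, kfold_mse)
```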

24
Q

Benefits of k-fold cross-validation

A

Less computationally complex than LOOCV
More accurate test data performance
Less inflation of model performance than LOOCV

25
Q

Drawbacks of k-fold cross-validation

A

More computationally complex than train-test-split method
Less accurate test data performance than LOOCV
Randomness

26
Q

AIC

A

Akaike information criterion (AIC)
AIC = -2 * LLF + 2 * k
k = number of slopes in the model
LLF = the optimal log-likelihood function value of the model

27
Q

AIC Interpretation

A

the lower the AIC score of a model is, the more parsimonious the model is considered to be

28
Q

BIC

A

Bayesian information criterion (BIC)
BIC = -2 * LLF + ln(n) * k, where n is the number of observations

29
Q

BIC Interpretation

A

the lower the BIC score of a model is, the more parsimonious the model is considered to be

30
Q

AIC vs BIC

A

The only difference is the multiplier on k in the penalty term: 2 for AIC versus ln(n) for BIC
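A minimal statsmodels sketch checking both formulas against a fitted OLS model (statsmodels counts the intercept in k, so k = df_model + 1 below):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))  # intercept column plus 3 explanatory variables
y = 2 * X[:, 1] + rng.normal(size=100)

results = sm.OLS(y, X).fit()
llf, k, n = results.llf, results.df_model + 1, results.nobs

print(results.aic, -2 * llf + 2 * k)          # AIC = -2 * LLF + 2 * k
print(results.bic, -2 * llf + np.log(n) * k)  # BIC = -2 * LLF + ln(n) * k
```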

31
Q

Downsides of BIC

A

Encourages a smaller number of slopes, which may come at the expense of training dataset fit, causing you to select a model with a worse fit to the training data

32
Q

Downsides of AIC

A

AIC doesn’t penalize a high number of slopes as much as the BIC score does, so the AIC score can be unhelpful for the purpose of model selection.