Weeks 5-8 Flashcards
What is the formula for AIC?
-2 * Log-likelihood(theta-hat_MLE) + 2d, where d is the number of fitted parameters.
It is the training error plus a model complexity penalty.
What is the formula for BIC?
-2 * Log-likelihood(theta-hat_MLE) + log(N) * d, where N is the sample size.
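A minimal numerical sketch of both formulas for a Gaussian linear model (the simulated data, and the convention that d counts the regression coefficients plus the noise variance, are assumptions of this sketch rather than part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)              # MLE of the coefficients
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / n                                          # MLE of the noise variance
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)     # maximised log-likelihood

d = X.shape[1] + 1            # assumed convention: coefficients + noise variance
aic = -2 * log_lik + 2 * d
bic = -2 * log_lik + np.log(n) * d
print(aic, bic)
```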
What are the differences between AIC and BIC?
- BIC puts a heavier penalty on model complexity (its penalty log(N)*d exceeds AIC's 2d once N >= 8), so it tends to select simpler models.
- AIC tends to work better when n is small, while BIC takes the sample size into account through its penalty.
What is Holdout Validation? And what error does it approximate?
Holdout Validation reserves a portion of the data for validation; the model is trained on the remaining data and then evaluated on the held-out portion.
It approximates Prediction Error.
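A minimal sketch of holdout validation with scikit-learn (the simulated data, the linear model, and the 25% split are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Reserve 25% of the data; it is never seen during fitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
val_error = mean_squared_error(y_val, model.predict(X_val))   # approximates prediction error
```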
How does the size of the training set vs validation set impact holdout validation?
Smaller training set: tends to produce simpler models.
Smaller validation set: tends to produce more complex models (since more data is available for training) and gives a poorer approximation of the error.
What is Cross-validation? And what error does it approximate?
It divides the data into K folds (K >= 2); for each fold k, fit the model on the other K-1 folds and then predict on fold k. Repeat this for all K folds.
It estimates Prediction Error but is a poor estimate of Expected Prediction Error (because the K folds are highly dependent on each other).
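A sketch of K-fold cross-validation that mirrors the description above (K = 5 and the simulated data/model are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=150)

K = 5
fold_errors = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on the other K-1 folds
    pred = model.predict(X[test_idx])                            # predict on the held-out fold
    fold_errors.append(mean_squared_error(y[test_idx], pred))

cv_error = np.mean(fold_errors)   # cross-validation estimate of the error
```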
What should you consider when choosing K in cross-validation?
K = N (known as leave-one-out cross-validation) is the best approximation of the expected prediction error but can be computationally costly for large N. Smaller K (e.g., 5 or 10) is cheaper, but each model is trained on a smaller subset, so the error estimate can be more biased.
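For K = N, a small sketch using scikit-learn's LeaveOneOut, which is just K-fold with one observation per fold (the data and model are again illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X[:, 0] + rng.normal(scale=0.2, size=40)

# N model fits (one per observation), so this can be slow for large N.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
loocv_error = -scores.mean()
```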
What are the assumptions made when using any validation methods?
The data is i.i.d.: independent and identically distributed. Also (in cross-validation), the K splits are random.
What are the limitations of the validation methods?
When choosing between many models, there is a tendency to overfit to the validation set (i.e., low bias^2 but high variance).
What is regularisation?
Any modification we make to a learning algorithm that is intended to reduce the generalisation error, but not its training error.
What is Ridge Regression? And what is its formula?
It shrinks the coefficients B by restricting their squared magnitude; equivalently, it adds a penalty term weighted by a pre-specified control parameter lambda > 0. The formula optimises an error term plus a model complexity term.
B_hat = argmin_B { sum_i ( yi - B0 - sum_j Bj*xij )^2 + lambda * sum_j Bj^2 }
The second term penalises model complexity, with strength set by the pre-chosen lambda.
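A sketch of the closed-form ridge solution implied by this formula (the value of lambda, the simulated data, and centring the data so that B0 is not penalised are assumptions of the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)

lam = 1.0
Xc = X - X.mean(axis=0)            # centre so the intercept is not penalised
yc = y - y.mean()

# Ridge solution: (X'X + lambda*I)^(-1) X'y on the centred data
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta0 = y.mean() - X.mean(axis=0) @ beta_ridge   # recover the intercept
```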
How does ridge regression compare to linear regression?
- Leads to a larger bias but smaller variance (i.e., generalises better to new data).
- It can be shown there is a nonzero lambda that makes the expected prediction error smaller than linear regression.
What is LASSO Regression? And what is its formula?
It adds a penalisation term on the sum of absolute values of the coefficients to normal least squares.
B_hat = argmin_B { sum_i ( yi - B0 - sum_j Bj*xij )^2 + lambda * sum_j |Bj| }
It can also be used for variable selection, since for large enough values of lambda it sets some coefficients exactly to 0.
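A sketch of the variable-selection effect with scikit-learn's Lasso (note scikit-learn calls the penalty parameter alpha rather than lambda; the data and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)   # only 2 'true' nonzero coefficients

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of the coefficients LASSO kept nonzero
```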
Compare LASSO and Ridge Regression.
- Usually Ridge gets a better prediction error than LASSO.
- LASSO can be used for variable selection.
- LASSO can outperform Ridge when many of the ‘true’ coefficients are zero.
How do Ridge and LASSO Regression incorporate prior knowledge on the true coefficients?
They incorporate prior knowledge of the coefficient distribution (Ridge = Gaussian prior; LASSO = Laplacian prior) via Bayes' Theorem.
P(B|D) ∝ P(D|B) * P(B)
Posterior ∝ Likelihood * Prior
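A sketch of that correspondence, assuming a Gaussian likelihood with noise variance sigma^2 and prior scales tau (Gaussian) and b (Laplace); taking the negative log of Posterior ∝ Likelihood * Prior gives the penalised objectives above:

```latex
\begin{align*}
-\log P(\beta \mid D) &= -\log P(D \mid \beta) - \log P(\beta) + \text{const} \\[4pt]
\text{Gaussian prior (Ridge):}\quad
  & \frac{1}{2\sigma^2}\sum_{i}\Bigl(y_i - \beta_0 - \sum_j \beta_j x_{ij}\Bigr)^2
    + \frac{1}{2\tau^2}\sum_j \beta_j^2 + \text{const} \\[4pt]
\text{Laplace prior (LASSO):}\quad
  & \frac{1}{2\sigma^2}\sum_{i}\Bigl(y_i - \beta_0 - \sum_j \beta_j x_{ij}\Bigr)^2
    + \frac{1}{b}\sum_j \lvert\beta_j\rvert + \text{const}
\end{align*}
```

Minimising these gives the Ridge objective with lambda = sigma^2 / tau^2 and the LASSO objective with lambda = 2*sigma^2 / b, so the prior scale plays the role of the control parameter.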