Week 5 Flashcards
Cp, AIC, BIC, and Adjusted R²
• These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
Mallow’s Cp
• Cp = (1/n)(RSS + 2dσ̂²),
where d is the total # of parameters used and σ̂² is an estimate of the variance of the error associated with each response measurement.
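A minimal sketch of computing Cp for a least-squares fit; the helper name mallows_cp and the idea that sigma2_hat is supplied externally (e.g. estimated from the full model's residuals) are illustrative assumptions, not from the source.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mallows_cp(X, y, sigma2_hat):
    """Cp = (1/n) * (RSS + 2 * d * sigma2_hat)."""
    n, p = X.shape
    fit = LinearRegression().fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    d = p + 1  # total # of parameters: p slopes plus the intercept
    return (rss + 2 * d * sigma2_hat) / n
```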
AIC
• The AIC criterion is defined for a large class of models fit by maximum likelihood:
AIC = −2 log L + 2 · d,
where L is the maximized value of the likelihood function for the estimated model.
Gaussian errors
• In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and Cp and AIC are equivalent.
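A sketch of AIC for the Gaussian linear model, using the common convention of dropping additive constants so that −2 log L reduces to n log(RSS/n); the function name and the choice of d are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gaussian_aic(X, y):
    """AIC = -2 log L + 2 * d; with Gaussian errors and additive
    constants dropped, -2 log L reduces to n * log(RSS / n)."""
    n, p = X.shape
    fit = LinearRegression().fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    d = p + 1  # slopes plus the intercept
    return n * np.log(rss / n) + 2 * d
```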
BIC
• BIC = (1/n)(RSS + log(n)dσ̂²)
• Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Since log n > 2 for n > 7, BIC places a heavier penalty on models with many variables than Cp does.
Adjusted R²
• Adjusted R² = 1 − [RSS/(n − d − 1)]/[TSS/(n − 1)]
• Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R² indicates a model with a small test error.
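A minimal sketch computing BIC and adjusted R² for one fitted model. The d conventions here are assumptions: p slopes plus the intercept for the BIC penalty, while the adjusted-R² denominator uses n − p − 1 with d taken as the number of predictors p.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bic_and_adjusted_r2(X, y, sigma2_hat):
    n, p = X.shape
    fit = LinearRegression().fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    d = p + 1  # total # of parameters for the BIC penalty
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))  # d = p predictors here
    return bic, adj_r2
```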
Validation and Cross-Validation
• Each of the procedures returns a sequence of models M_k indexed by model size k = 0, 1, 2, .... Our job here is to select k̂. Once selected, we will return the model M_k̂.
• We compute the validation set error or the cross-validation error for each model M_k under consideration, and then select the k for which the resulting estimated test error is smallest.
• This procedure provides a direct estimate of the test error and doesn’t require an estimate of the error variance σ².
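A sketch of selecting k̂ by 10-fold cross-validation. The models mapping from size k to chosen predictor columns is hypothetical; it would come from, e.g., forward stepwise selection.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def select_k_by_cv(X, y, models, cv=10):
    """models: hypothetical dict mapping size k (k >= 1) to the
    predictor column indices of model M_k."""
    cv_error = {}
    for k, cols in models.items():
        scores = cross_val_score(LinearRegression(), X[:, cols], y,
                                 scoring="neg_mean_squared_error", cv=cv)
        cv_error[k] = -scores.mean()  # estimated test MSE for M_k
    return min(cv_error, key=cv_error.get)  # k-hat: smallest estimated test error
```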
ridge regression
• The ridge regression coefficient estimates β̂^R are the values that minimize RSS + λΣβ²_j, where λ ≥ 0 is a tuning parameter.
shrinkage penalty
• The second term, λΣβ²_j, is small when β₁, ..., β_p are close to zero, and so it has the effect of shrinking the estimates of β_j towards zero.
||β||₂
• denotes the ℓ₂ norm (pronounced “ell 2”) of a vector, defined as ||β||₂ = √(Σβ²_j).
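A minimal sketch of the ridge criterion via its closed-form minimizer (XᵀX + λI)⁻¹Xᵀy; this assumes standardized predictors and a centered response so that no intercept is penalized. In practice scikit-learn's Ridge does the same job.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Minimize RSS + lam * sum_j beta_j^2.  The closed-form solution is
    beta_hat_R = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```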
scaling of predictors
• The standard least squares coefficient estimates are scale equivariant: regardless of how the jth predictor is scaled, X_j β̂_j will remain the same. The ridge coefficient estimates, in contrast, can change substantially when a given predictor is multiplied by a constant. Therefore, it is best to apply ridge regression after standardizing the predictors.
standardizing the predictors
• x̃_ij = x_ij / √((1/n)Σ_i(x_ij − x̄_j)²)
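A one-liner sketch of that standardization; np.std with its default 1/n convention matches the formula above.

```python
import numpy as np

def standardize(X):
    """Divide each predictor by its standard deviation (the 1/n convention,
    which is np.std's default) so every column has standard deviation one."""
    return X / X.std(axis=0)
```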