Chapter 3 Flashcards
What may cause there to be a non-unique solution to β̂?(3)
Reminder: β̂ = (X^TX)^(-1)X^Ty
- p ≥ n
- Multicollinearity problem
Why? Because in both cases X^TX is likely not to be invertible.
- Also the situation where n is only slightly larger than p: X^TX is then close to singular (nearly non-invertible), and although a unique solution exists, the estimates can have large variance, so predictive power is lowered.
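A minimal NumPy sketch (my own toy example, not from the notes) showing why X^TX cannot be inverted when p ≥ n: the rank of X^TX equals the rank of X, which is at most n, so the p × p matrix is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                          # fewer observations than predictors
X = rng.normal(size=(n, p))
gram = X.T @ X                       # p x p, but rank(X^T X) = rank(X) <= n < p

print(np.linalg.matrix_rank(gram))   # prints 5, not 8 -> singular
# np.linalg.inv(gram) would raise LinAlgError: Singular matrix
```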
3 methods that help solve the non-invertible X^TX problem.(3)
- Subset selection: identify a “good” subset of p* < p explanatory variables, then fit a model using least squares on these p* predictors.
- Regularisation: modify the least squares loss function L(β) so it prescribes a cost to large values of β and hence shrinks the estimates for the regression coefficients towards zero. This can improve predictive performance.
- Dimension reduction: reduce the dimension of the set of explanatory variables by constructing m < p “good” linear combinations of the p predictors, then fit a model using least squares on these m predictors.
What are two types of subset selection? When would you employ one over the other?(3)
- Best subset selection
- Automated stepwise selection (back and forward)
- Best subset selection is harder for larger p as the number of models under consideration grows exponentially with the number of explanatory variables, i.e. 2^p, so 8 predictors leads to 256 possible models.
- Automated stepwise selection is therefore a good alternative as it performs a guided search through the 2^p possibilities so that only specific models are investigated. However, this comes at the cost of potentially not finding the best model, and forward and backward selection can arrive at different conclusions.
It is suggested that both are conducted so that at least an appreciation of this can be gained.
Algorithm for best subset selection.(3)
- Fit the null model, M0, which contains no explanatory variables, and is simply ˆy = ybar.
- For k = 1, 2, . . . , p:
(a) Fit all (p choose k) models that contain exactly k explanatory variables.
(b) Select the model amongst these which has the smallest residual sum of squares SSE, or equivalently, the largest coefficient of determination R^2. Call this model Mk.
- Select a single “best” model from amongst M0, . . . , Mp.
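A minimal sketch of this algorithm in Python/NumPy (my own illustration, assuming y is an (n,) array, X is an (n, p) array, and an intercept is added inside the fit; fit_sse is a hypothetical helper, not from the notes):

```python
import itertools
import numpy as np

def fit_sse(y, X_sub):
    """Least squares fit with an intercept; returns the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

def best_subset(y, X):
    p = X.shape[1]
    models = {0: ((), fit_sse(y, X[:, []]))}                 # M0: intercept only, SSE = SST
    for k in range(1, p + 1):
        fits = ((cols, fit_sse(y, X[:, list(cols)]))         # all C(p, k) size-k models
                for cols in itertools.combinations(range(p), k))
        models[k] = min(fits, key=lambda pair: pair[1])      # Mk: smallest SSE
    return models                                            # pick among M0..Mp with adj R^2, Cp or BIC
```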
Why may R^2 not be a useful comparison between models? What are 3 alternatives which help with this problem?(3)
R^2 is hard to compare across models with different numbers of predictors: since R^2 measures the amount of variability explained, it can never decrease when predictors are added, so the model with all p predictors always has the largest R^2, with no account taken of model complexity.
Hence 3 proposals are:
Adjusted R^2: R^2_adj = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)],
which adjusts R^2 to penalise model complexity (i.e. large k). We would choose the model for which R^2_adj is largest.
Mallows' Cp statistic: Cp = (1/n)(SSE + 2k ˆσ^2),
where the estimate of the error variance ˆσ^2 was defined as {1/(n − q)}(y − ˆy)^T(y − ˆy) in (3.4). We would choose the model for which Cp is smallest.
Bayesian information criterion: BIC = (1/n){SSE + log(n) k ˆσ^2},
up to an irrelevant additive constant. We would choose the model for which BIC is smallest.
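A small sketch (hypothetical helper, not from the notes) computing the three criteria for a candidate model with k predictors, given its SSE, the total sum of squares SST, and ˆσ^2 estimated from the full model as in (3.4):

```python
import numpy as np

def selection_criteria(sse, sst, n, k, sigma2_hat):
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # larger is better
    cp = (sse + 2 * k * sigma2_hat) / n                  # smaller is better
    bic = (sse + np.log(n) * k * sigma2_hat) / n         # smaller is better
    return adj_r2, cp, bic
```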
Which penalises model complexity the most, BIC or Mallows' Cp?(1)
BIC, because its penalty log(n) k ˆσ^2 exceeds Cp's penalty 2k ˆσ^2 whenever log(n) > 2, i.e. for n > e^2 ≈ 7.4, which holds for virtually any data set.
Algorithm for forward stepwise model selection.(3)
- Fit the null model, M0, which contains no explanatory variables.
- For k = 0, 1, . . . , p − 1:
(a) Fit the p − k models that augment the explanatory variables in Mk with one additional predictor.
(b) Select the model amongst these which has the smallest SSE or the largest R^2. Call this model Mk+1.
- Select a single “best” model from amongst M0, . . . , Mp.
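A minimal sketch of this guided search (again my own illustration; it reuses the hypothetical fit_sse helper from the best subset sketch above, with y and X as before):

```python
def forward_stepwise(y, X):
    p = X.shape[1]
    selected = []
    models = {0: ((), fit_sse(y, X[:, []]))}                     # M0: null model
    for k in range(p):
        candidates = [c for c in range(p) if c not in selected]
        sses = {c: fit_sse(y, X[:, selected + [c]]) for c in candidates}
        add = min(sses, key=sses.get)                            # augmentation with smallest SSE
        selected.append(add)
        models[k + 1] = (tuple(selected), sses[add])             # Mk+1
    return models
```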
Algorithm for backward elimination.(3)
- Fit the full model, Mp, which contains all p explanatory variables.
- For k = p, p − 1, . . . , 1:
(a) Fit the k models that contain all but one of the explanatory variables in Mk.
(b) Select the model amongst these which has the smallest SSE or the largest R^2. Call this model Mk−1.
- Select a single “best” model from amongst M0, . . . , Mp.
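And a matching sketch of backward elimination, under the same assumptions and reusing the hypothetical fit_sse helper:

```python
def backward_elimination(y, X):
    p = X.shape[1]
    selected = list(range(p))
    models = {p: (tuple(selected), fit_sse(y, X[:, selected]))}       # Mp: full model
    for k in range(p, 0, -1):
        sses = {c: fit_sse(y, X[:, [j for j in selected if j != c]])  # drop one variable at a time
                for c in selected}
        drop = min(sses, key=sses.get)                                # removal giving smallest SSE
        selected.remove(drop)
        models[k - 1] = (tuple(selected), sses[drop])                 # Mk-1
    return models
```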
One additional benefit forward selection offers over backward elimination.(1)
An additional advantage of forward stepwise selection is that it can be applied in the high-dimensional setting when p ≥ n. However, in this case, only the models M0, . . . , Mn−1 are considered, since least squares would fail to yield a unique solution to the normal equations for models with n or more predictors.
What are two kinds of error? Which is more interesting? Which is more common?(4)
•Test error: the average error that results from predicting the response for an observation that was not used in model-fitting. This is called out-of-sample validation.
The data used in model-fitting are called training data. The data used to assess the predictive performance are called validation data or test data.
•Training error: the average error that results from predicting the response for an observation that was used in model-fitting. This is called in-sample validation and
uses only training data.
- Test error is more interesting as it is computed on unseen data and is thus more indicative of the model's predictive performance. Training error usually underestimates test error, because the same data are used both to fit the model and to assess it.
- However, training error is straightforward to calculate, so it is more commonly used.
Most common measure of training error in regression?(1)
MSE = (1/n) ∑_{i=1}^{n} (yi − ˆyi)^2,
where ˆyi = Xi,· ˆβ and Xi,· is the ith row of the design matrix X.
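A small sketch of the training MSE (my own illustration, assuming y is an (n,) NumPy array and the design matrix X already includes an intercept column):

```python
import numpy as np

def training_mse(y, X):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit by least squares on all the data
    y_hat = X @ beta_hat                               # fitted values yhat_i = X_i,. beta_hat
    return np.mean((y - y_hat) ** 2)                   # (1/n) * sum of squared residuals
```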
What are two examples of out-of-sample validation?(2)
-validation set approach
-cross-validation
- Cross-validation addresses two issues with the validation set approach: the estimated test error can differ considerably from one random split to another (limiting its predictive validity), and performance can appear poor because only a subset of the data is used to fit the model.
What is a validation set approach?(3)
-An example of an out-of-sample validation method
-The idea is to (randomly) split the data into two: the training data and the test data. The training data is used to estimate the model parameters and give the fitted model. Then the test data is used to compute the test error and measure overall performance.
- For ease of explanation, suppose our training data comprise (y1, x1), . . . , (yn1, xn1) where n1 < n and that our test data comprise (yn1+1, xn1+1), . . . , (yn, xn). Then we would estimate the test error with
MSE = {1/(n − n1)} ∑_{i=n1+1}^{n} (yi − ˆyi)^2,
in which the regression coefficients in the fitted value ˆyi = Xi,· ˆβ were estimated using the training data.
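A minimal sketch of the validation set approach (my own illustration, assuming NumPy arrays y and X with an intercept column, and a random half/half split):

```python
import numpy as np

def validation_set_mse(y, X, train_frac=0.5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n1 = int(train_frac * n)
    train, test = idx[:n1], idx[n1:]                                  # random split into two parts
    beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)    # fit on the training data only
    y_hat = X[test] @ beta_hat
    return np.mean((y[test] - y_hat) ** 2)                            # test MSE on the n - n1 held-out points
```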
What is cross validation and how does this differ from validation set approach?(2)
k-fold cross validation is similar to the validation set approach except we randomly divide the data into k, rather than two, groups or folds, this time of approximately equal size. The last k −1 folds are used as the training data then the first fold is used as a test set. We compute the test error rate based on this first fold. The process is then repeated using the second fold as the test set, then the third fold, and so on.
Eventually this procedure gives us k estimates of the test error rate which are averaged to give the overall test error.
k=5 or k=10 are typical choices.
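A minimal sketch of k-fold cross-validation under the same assumptions (NumPy arrays y and X with an intercept column); k = 5 or k = 10 as noted above:

```python
import numpy as np

def kfold_cv_mse(y, X, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                          # k folds of roughly equal size
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test] - X[test] @ beta_hat) ** 2))
    return np.mean(errors)                                  # average of the k fold-wise test errors
```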
What is leave-one-out cross-validation?(1)
A special case of cross-validation where k = n. It can be computationally prohibitive because it requires fitting the model n times.
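In terms of the k-fold sketch above, LOOCV is simply the call with k = n (hypothetical usage):

```python
loocv_mse = kfold_cv_mse(y, X, k=len(y))   # every fold holds a single observation; n refits
```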