Chapter 3 Flashcards

1
Q

What may cause there to be a non-unique solution to β̂?(3)

A

Reminder: β̂ = (X^T X)^{−1} X^T y.
- p ≥ n
- Multicollinearity
Why? In both cases X^T X will (or is likely to) fail to be invertible, so the normal equations have no unique solution.
- Also the situation where n is only slightly larger than p: X^T X is then close to singular (nearly non-invertible), and although a unique solution exists it can have large variance, so predictive power is lowered.
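A minimal numpy sketch (illustrative, not from the notes) showing that X^T X is rank-deficient, and hence non-invertible, when p ≥ n:

```python
import numpy as np

rng = np.random.default_rng(0)

# p >= n: X^T X is (p x p) but has rank at most n < p, so it is singular.
n, p = 5, 8
X = rng.normal(size=(n, p))
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # at most 5, yet XtX is 8 x 8
print(np.linalg.cond(XtX))         # enormous: XtX is (numerically) singular
```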

2
Q

3 methods that help solve the non-invertible X^TX problem.(3)

A
  • Subset selection: identify a “good” subset of p∗ < p explanatory variables, then fit a model using least squares on these p∗ predictors.
  • Regularisation: modify the least squares loss function L(β) so it prescribes a cost to large values of β and hence shrinks the estimates of the regression coefficients towards zero. This can improve predictive performance.
  • Dimension reduction: reduce the dimension of the set of explanatory variables by constructing m < p “good” linear combinations of the p predictors, then fit a model using least squares on these m predictors.
3
Q

What are two types of subset selection? When would you employ one over the other?(3)

A
  • Best subset selection
  • Automated stepwise selection (forward and backward)

- Best subset selection becomes harder for larger p, as the number of models under consideration grows as 2^p with the number of explanatory variables; e.g. 8 predictors leads to 2^8 = 256 possible models.
- Automated stepwise selection is therefore a good alternative, as it performs a guided search through the 2^p possibilities so that only certain models are investigated. However, this comes at the cost of potentially not finding the best model, and forward and backward selection can often arrive at different conclusions.
It is suggested that both are conducted, so that at least an appreciation of any disagreement between them can be had.

4
Q

Algorithm for best subset selection.(3)

A
  1. Fit the null model, M0, which contains no explanatory variables and is simply ŷ = ȳ.
  2. For k = 1, 2, . . . , p:
    (a) Fit all (p choose k) models that contain exactly k explanatory variables.
    (b) Select the model amongst these which has the smallest residual sum of squares SSE , or equivalently, the largest coefficient of determination R^2. Call this model Mk.
  3. Select a single “best” model from amongst M0, . . . , Mp.
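A sketch of best subset selection in Python (a hypothetical helper of my own naming, assuming a numeric predictor matrix X and response y), selecting each Mk by smallest SSE:

```python
import itertools
import numpy as np

def best_subset(X, y):
    """For each k = 0, ..., p, return the predictor subset M_k with the
    smallest SSE among all (p choose k) models of size k."""
    n, p = X.shape
    best = {0: ((), np.sum((y - y.mean()) ** 2))}  # M_0: yhat = ybar
    for k in range(1, p + 1):
        best_sse, best_cols = np.inf, None
        for cols in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, cols]])  # intercept + subset
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            sse = np.sum((y - Xk @ beta) ** 2)
            if sse < best_sse:
                best_sse, best_cols = sse, cols
        best[k] = (best_cols, best_sse)
    return best  # step 3 (choosing among M_0, ..., M_p) is done separately
```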
5
Q

Why may R^2 not be a useful comparison between models? What are 3 alternatives which help this problem?(3)

A

R^2 is hard to use to compare models with different numbers of predictors: since it measures the proportion of variability explained, it never decreases as predictors are added, so the model with all p predictors always attains the maximal R^2, regardless of the trade-off against model complexity.
Hence 3 proposals are:
Adjusted R^2:
R^2_adj = 1 − {SSE/(n − k − 1)} / {SST/(n − 1)},
which adjusts R^2 to penalise model complexity (i.e. large k). We would choose the model for which R^2_adj is largest.
Mallows' Cp statistic:
Cp = (1/n)(SSE + 2k σ̂^2),
where the estimate of the error variance σ̂^2 was defined as {1/(n − q)}(y − ŷ)^T (y − ŷ) (3.4). We would choose the model for which Cp is smallest.
Bayes information criterion (BIC):
BIC = (1/n){SSE + log(n) k σ̂^2},
up to an irrelevant additive constant. We would choose the model for which BIC is smallest.
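A small Python helper (an illustrative sketch with names of my own choosing) computing the three criteria as defined above, assuming sigma2_hat is the error-variance estimate σ̂^2 from (3.4):

```python
import numpy as np

def selection_criteria(y, yhat, k, sigma2_hat):
    """Adjusted R^2, Mallows' Cp and BIC for a model with k predictors."""
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))  # choose largest
    cp = (sse + 2 * k * sigma2_hat) / n                 # choose smallest
    bic = (sse + np.log(n) * k * sigma2_hat) / n        # choose smallest
    return r2_adj, cp, bic
```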

6
Q

Which penalises model complexity more: BIC or Mallows' Cp?(1)

A

BIC, since its complexity penalty is log(n) k σ̂^2 compared with 2k σ̂^2 for Cp, and log(n) > 2 whenever n > e^2 ≈ 7.4. So for any reasonable sample size, BIC penalises complexity more heavily.

7
Q

Algorithm for forward stepwise model selection.(3)

A
  1. Fit the null model, M0, which contains no explanatory variables.
  2. For k = 0, 1, . . . , p − 1:
    (a) Fit the p − k models that augment the explanatory variables in Mk with one additional predictor.
    (b) Select the model amongst these which has the smallest SSE or the largest R^2. Call this model Mk+1.
  3. Select a single “best” model from amongst M0, . . . ,Mp.
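A sketch of the forward stepwise search in Python (hypothetical helper names, same assumptions as the best subset sketch above), greedily adding the predictor that most reduces SSE:

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedily grow M_0 -> M_1 -> ... -> M_p, at each step adding the
    predictor whose inclusion gives the smallest SSE."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    path = [list(selected)]  # path[k] holds the columns of M_k
    for _ in range(p):
        best_sse, best_j = np.inf, None
        for j in remaining:
            Xk = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            sse = np.sum((y - Xk @ beta) ** 2)
            if sse < best_sse:
                best_sse, best_j = sse, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
    return path
```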
8
Q

Algorithm for backward elimination.(3)

A
  1. Fit the full model, Mp, which contains all p explanatory
    variables.
  2. For k = p, p − 1, . . . , 1:
    (a) Fit the k models that contain all but one of the explanatory variables in Mk.
    (b) Select the model amongst these which has the smallest SSE or the largest R^2. Call this model Mk−1.
  3. Select a single “best” model from amongst M0, . . . ,Mp.
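The mirror-image backward elimination sketch (same assumptions as the forward version above):

```python
import numpy as np

def backward_elimination(X, y):
    """Greedily shrink M_p -> M_{p-1} -> ... -> M_0, at each step dropping
    the predictor whose removal gives the smallest SSE."""
    n, p = X.shape
    selected = list(range(p))
    path = [list(selected)]
    while selected:
        best_sse, worst_j = np.inf, None
        for j in selected:
            cols = [c for c in selected if c != j]
            Xk = np.column_stack([np.ones(n)] + ([X[:, cols]] if cols else []))
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            sse = np.sum((y - Xk @ beta) ** 2)
            if sse < best_sse:
                best_sse, worst_j = sse, j
        selected.remove(worst_j)
        path.append(list(selected))
    return path
```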
9
Q

One additional benefit forward selection offers over backward elimination.(1)

A

An additional advantage of forward stepwise selection is that it can be applied in the high-dimensional setting where p ≥ n. However, in this case only the models M0, . . . , Mn−1 are considered, since least squares would fail to yield a unique solution to the normal equations for models with n or more predictors.

10
Q

What are two kinds of error? Which is more interesting? Which is more common?(4)

A

• Test error: the average error that results from predicting the response for an observation that was not used in model-fitting. This is called out-of-sample validation. The data used in model-fitting are called training data; the data used to assess predictive performance are called validation data or test data.
• Training error: the average error that results from predicting the response for an observation that was used in model-fitting. This is called in-sample validation and uses only training data.
- Test error is more interesting, as it is computed on unseen data and is thus more indicative of the model's predictive performance. Training error usually underestimates test error, because the same data are used both to fit and to assess the model.
- However, training error is straightforward to calculate, so it is more commonly used.

11
Q

Most common measure of training error in regression?(1)

A

The mean squared error (MSE):
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2,
where ŷ_i = X_{i,·} β̂, with X_{i,·} the ith row of the design matrix X.
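As a sketch, the training MSE in numpy, assuming beta_hat was fitted on the same (X, y):

```python
import numpy as np

def training_mse(X, y, beta_hat):
    """Training MSE: yhat_i = X[i, :] @ beta_hat, averaged squared error."""
    return np.mean((y - X @ beta_hat) ** 2)
```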

12
Q

What are two examples of out-of-sample validation?(2)

A

-validation set approach
-cross-validation
Cross-validation addresses two issues with the validation set approach: the estimated test error can vary considerably depending on which observations happen to fall in each split (limiting its reliability), and performance can appear poorer than it is because only a subset of the data is used to fit the model.

13
Q

What is the validation set approach?(3)

A

-An example of an out-of-sample validation method
-The idea is to (randomly) split the data into two: the training data and the test data. The training data is used to estimate the model parameters and give the fitted model. Then the test data is used to compute the test error and measure overall performance.
-For ease of explanation, suppose our training data comprise (y_1, x_1), . . . , (y_{n1}, x_{n1}), where n1 < n, and our test data comprise (y_{n1+1}, x_{n1+1}), . . . , (y_n, x_n). Then we estimate the test error with
MSE = {1/(n − n1)} Σ_{i=n1+1}^{n} (y_i − ŷ_i)^2,
in which the regression coefficients in the fitted value ŷ_i = X_{i,·} β̂ were estimated using the training data.
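A minimal sketch of the validation set approach using scikit-learn's train_test_split (simulated data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Randomly split into training and test data, fit on the former only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
test_mse = np.mean((y_te - model.predict(X_te)) ** 2)  # estimate of test error
print(test_mse)
```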

14
Q

What is cross validation and how does this differ from validation set approach?(2)

A

k-fold cross-validation is similar to the validation set approach, except we randomly divide the data into k groups or folds (rather than two), this time of approximately equal size. The last k − 1 folds are used as the training data and the first fold is used as a test set; we compute the test error rate based on this first fold. The process is then repeated using the second fold as the test set, then the third fold, and so on.
Eventually this procedure gives us k estimates of the test error rate which are averaged to give the overall test error.
k=5 or k=10 are typical choices.
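A sketch of 5-fold cross-validation with scikit-learn's KFold (simulated data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mses = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[test_idx] - model.predict(X[test_idx])
    fold_mses.append(np.mean(resid ** 2))
print(np.mean(fold_mses))  # average the k fold-wise test errors
```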

15
Q

What is leave-one-out cross-validation?(1)

A

A special case of cross-validation where k = n. It can be computationally prohibitive because it requires fitting the model n times.

16
Q

What is another name for regularisation methods? Give 3 examples of regularisation methods. (3)

A

Shrinkage methods (they shrink the least squares coefficient estimates towards zero, hence the name)

  • ridge regression
  • the LASSO
  • elastic net (hybrid)
17
Q

What is the least squares estimate for β̂0 after mean-centering?(1)

A

ȳ, i.e. the sample mean of the response.

18
Q

What assumption is made for regularisation methods? Why?(2)

A

The explanatory variables in X1 are assumed to have been standardised (mean-centred and put on a common scale).
With these methods we typically only want to shrink the coefficients of the explanatory variables in β1, not β0, which is simply a measure of the mean response when x_{i1} = x_{i2} = . . . = x_{ip} = 0. Moreover, shrinkage methods typically lead to non-trivial relationships between the scale of the regression coefficients and the scale of the explanatory variables, so the assumption is made in order to mean-centre the explanatory variables and put them on a common scale.

19
Q

What is the least squares loss function for β1?(1)

A

L(β1) = (y − X1β1)^T (y − X1β1)

20
Q

Why is ridge regression classified as a constrained optimisation problem?(2)

A

Rather than simply looking to minimise L(β1), we look to do this under the constraint that the l2-norm of β1,
‖β1‖ = √(β1^T β1) = (Σ_{k=1}^{p} β_k^2)^{1/2},
does not exceed some threshold size, say τ. The idea is that this rules out choices of β1 that might be regarded as unrealistic.

21
Q

Define the ridge regression estimate.(2)

A

LEARN PROOF!!!
The β1 which minimises L^r(β1) is given by the solution to the equations
(X1^T X1 + λI_p)β1 = X1^T y.
For λ > 0 these equations are guaranteed to have a unique solution, denoted β̂1^r, and this is the ridge regression estimate:
β̂1^r = (X1^T X1 + λI_p)^{−1} X1^T y.
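A one-function numpy sketch of the ridge estimate; solving the linear system directly is preferable to forming the inverse explicitly:

```python
import numpy as np

def ridge_estimate(X1, y, lam):
    """Solve (X1^T X1 + lam * I_p) beta = X1^T y for the ridge estimate.
    Assumes X1 is the mean-centred, standardised predictor matrix."""
    p = X1.shape[1]
    return np.linalg.solve(X1.T @ X1 + lam * np.eye(p), X1.T @ y)
```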

22
Q

Need to know estimations of least squares result 3.1.1!!!

A

Memorise day before!!!

23
Q

What are the mean and variance of the ridge regression estimator?(2)

A

LEARN PROOF FOR BOTH!!! (the proof for Var is not included in the notes, so derive it or look it up)
E(β̂1^r) = (X1^T X1 + λI_p)^{−1} X1^T X1 β1
Var(β̂1^r) = σ^2 (X1^T X1 + λI_p)^{−1} X1^T X1 (X1^T X1 + λI_p)^{−1}.

24
Q

Is the ridge regression estimator an unbiased estimator for β1 when λ > 0?

A

No. However, its usefulness is that it can still be used irrespective of the sizes of n and p, even when X1^T X1 cannot be inverted.

25
Q

What is lambda in ridge regression?(1)

A

λ is a tuning parameter; cross-validation techniques are used to pick an appropriate value.
λ arises as the Lagrange multiplier when the constrained optimisation problem is converted to a penalised one.

26
Q

Standard deviation formula.(1)

A

s = √{Σ_{i=1}^{n} (x_i − x̄)^2 / (n − 1)}

27
Q

Difference between LASSO and ridge regression and why one may use LASSO over ridge?(2)

A

Shrinkage in ridge regression is good, but the coefficients are never exactly 0, so all p predictors remain in the model. The LASSO combines shrinkage with subset selection by setting some coefficients exactly to 0, which can make the fitted model easier to interpret.
Note also that the LASSO uses an l1 penalty whereas ridge uses an l2 penalty.

28
Q

What is the constraint for LASSO?(1)

A

We impose the constraint that the l1-norm of β1,
‖β1‖_1 = Σ_{j=1}^{p} |β_j|,
does not exceed some threshold size, say τ.

29
Q

What is the LASSO estimator?(1)

A

L^l(β1) = L(β1) + λ‖β1‖_1,
where λ ≥ 0 is chosen later to impose the constraint that ‖β1‖_1 ≤ τ. The value of β1 which minimises L^l(β1), denoted β̂1^l, is the LASSO estimator.
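A sketch using scikit-learn's Lasso on simulated, standardised data; note that sklearn's alpha parametrises the penalty slightly differently from the λ above (its loss is scaled by 1/(2n)):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 predictors matter

X1 = StandardScaler().fit_transform(X)  # mean-centre and standardise
# sklearn minimises (1/(2n))||y - X b||^2 + alpha * ||b||_1, so its alpha
# corresponds to the lambda above only up to a factor of 2n.
lasso = Lasso(alpha=0.1).fit(X1, y)
print(lasso.coef_)  # several coefficients are exactly zero
```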

30
Q

Comment on small lambda in the LASSO.(1)

A

For small positive values of λ, the method behaves very like least squares; indeed taking λ = 0 corresponds precisely to least squares. However, as the value of λ > 0 gets large, explanatory variables begin to drop from the fitted model. Again, a common strategy for choosing a value for λ is to use cross-validation.

31
Q

What is the idea behind the elastic net? When is it useful?(2)

A

-Ridge regression is generally better for shrinkage and regularisation, whereas the LASSO is the only one of the two able to perform variable selection. Hence an approach is to consider a loss function which incorporates both penalties:
L^e(β1) = L(β1) + λ2‖β1‖^2 + λ1‖β1‖_1.
This is the idea behind the elastic net (above). Clearly we recover ridge regression in the special case λ1 = 0 and the LASSO when λ2 = 0, so the elastic net is a generalisation of both approaches and combines their positive attributes.
-This is particularly helpful when we have many more variables than observations, p>n, which often occurs in big data settings.
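A sketch of the elastic net via scikit-learn's ElasticNet in the p > n setting; sklearn parametrises the two penalties through alpha (overall strength) and l1_ratio (the l1/l2 mix) rather than separate λ1, λ2:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 200))  # p > n, a typical big-data setting
y = X[:, 0] - X[:, 1] + rng.normal(size=50)

# alpha sets the overall penalty strength; l1_ratio mixes l1 and l2 parts
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(np.sum(enet.coef_ != 0), "of", X.shape[1], "coefficients are non-zero")
```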

32
Q

What is the motivation behind dimension reduction techniques? What are two examples?(3)

A

We are able to learn the effects of the (smaller number of) transformed variables more precisely, reducing the variance of the parameter estimates and thereby improving predictive performance.

  • Principal component regression (PCR)
  • Partial least squares
33
Q

What is β1^dr?(1)

A

β1^dr = Cθ1.
Denote by C the (p × m) matrix with columns c1, . . . , cm, where the cj define the linear combinations of the original explanatory variables; then Z1 = X1C is the transformed design matrix, and θ1 holds the coefficients from regressing y on Z1. In PCR, each cj is chosen to maximise the variance of the corresponding zj, i.e. the cj are the principal component loadings.

34
Q

Describe rationale behind PCR.(3)

A
  • We apply PCA to reduce the dimension of the matrix X1 of explanatory variables. We know that the first few principal components give the directions in Rp in which the explanatory variables vary the most. We might therefore think it reasonable to assume that these directions are linearly associated with the response variable. This is the rationale underpinning PCR. Whilst it is not guaranteed to be true, it is often a reasonable approximation.
  • We often standardise the explanatory variables first, putting them on a common scale so that variables with large variance do not dominate the principal components and hence the PCR fit.
  • Cross-validation used for m selection.
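A PCR sketch as a scikit-learn pipeline (simulated data, illustrative only), using cross-validation error to guide the choice of m:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))
y = X[:, :3].sum(axis=1) + rng.normal(size=100)

# PCR: standardise, keep the first m principal components, then regress
for m in (1, 3, 5, 10):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    cv_mse = -cross_val_score(pcr, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(m, cv_mse)  # pick the m with the smallest CV error
```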
35
Q

In what way is partial least squares superior to PCR? What is the difference?(2)

A

No guarantee that the directions in which X1 varies
the most are effective in explaining variation in the response. This problem is addressed by partial least squares (PLS) which, roughly speaking, looks for directions that explain variation in both the explanatory variables X1 and the response y.
Roughly, for the jth transformed variable z(j), instead of maximising only the variance of z(j), PLS maximises the product of the variance of z(j) and the squared correlation between y and z(j). This has the effect of assigning more weight to variables that have a strong linear association with the response.
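A matching PLS sketch using scikit-learn's PLSRegression (simulated data, illustrative only):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
y = X[:, :3].sum(axis=1) + rng.normal(size=100)

pls = PLSRegression(n_components=3).fit(X, y)
yhat = pls.predict(X).ravel()  # predict returns an (n, 1) array
print(np.mean((y - yhat) ** 2))  # in-sample MSE of the 3-component fit
```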

36
Q

How would you be able to compare cross-validation, say for PCR or PLS analysis?(1)

A

Use the same subsets (or folds) of the data during cross-validation for each method; otherwise the comparison is not like-for-like. We can achieve this by making use of the segments argument in the plsr and pcr functions to select the same subsets in each case.

37
Q

What is the hat matrix?(1)

A

H = X(X^T X)^{−1} X^T
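A numpy sketch (illustrative) confirming that the hat matrix maps y to the least squares fitted values, ŷ = Hy:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# H projects y onto the column space of X: yhat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(H @ y, X @ beta)  # H y equals the least squares fit
```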