Block 1: Regression and variable selection Flashcards
Define the least squares estimator, and state the Gauss-Markov theorem
βˆ = argmin_β (Y − Xβ)^T (Y − Xβ) = (X^T X)^{−1} X^T Y
Gauss-Markov: βˆ is the best linear unbiased estimator (BLUE), i.e. it has the smallest variance among all linear unbiased estimators, but biased estimators can have a smaller MSE.
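A minimal R sketch (simulated data, illustrative names) checking the closed form against lm():

set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with an intercept column
Y <- drop(X %*% c(1, 2, -1) + rnorm(n))
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)             # (X^T X)^{-1} X^T Y
cbind(closed_form = drop(beta_hat), lm = coef(lm(Y ~ X - 1)))   # identical up to rounding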
What is the purpose of Ridge regression?
- get a good fit while keeping control of the overall size of the elements of β
- reduce the variance of the coefficients by shrinking large parameter values, via the constraint sum βj^2 <= t
- make X^T X + λI better conditioned, since the condition number measures "invertibility": K(X^T X + λI) = max eigenvalue / min eigenvalue <= K(X^T X) (checked numerically in the sketch below)
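A quick R check of the conditioning claim (simulated X with two nearly collinear columns; the lambda value is arbitrary):

cond <- function(A) { ev <- eigen(A, symmetric = TRUE)$values; max(ev) / min(ev) }
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.01 * rnorm(n)    # near collinearity makes X^T X ill-conditioned
cond(t(X) %*% X)                      # large condition number
cond(t(X) %*% X + 10 * diag(p))       # much smaller: ridge improves the conditioning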
State βˆridge and its expectation
βˆridge = (X^T X + λIp)^{−1} X^T Y
with E[βˆridge] = (X^T X + λIp)^{−1} X^T E[Y] = (X^T X + λIp)^{−1} X^T Xβ, so βˆridge is biased for λ > 0
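A minimal R sketch of the closed form (simulated data; the lambda value is arbitrary):

set.seed(1)
n <- 100; p <- 4; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% c(2, -1, 0, 0) + rnorm(n))
beta_ols   <- drop(solve(t(X) %*% X, t(X) %*% Y))
beta_ridge <- drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y))   # (X^T X + lambda I_p)^{-1} X^T Y
cbind(ols = beta_ols, ridge = beta_ridge)   # the ridge coefficients are smaller overall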
What are the differences between Lasso and Ridge?
- the penalty constraints: sum βj^2 <= t (ridge, L2 norm) vs sum |βj| <= t (lasso, L1 norm)
- Lasso can both shrink coefficients and set some exactly to 0 (variable selection), whereas ridge regression only does shrinkage
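A sketch using the glmnet package (assumed installed; alpha = 0 gives ridge, alpha = 1 gives lasso; the lambda value is illustrative):

library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:2] %*% c(3, -2) + rnorm(n))
coef(glmnet(X, Y, alpha = 1, lambda = 0.5))   # lasso: several coefficients exactly 0
coef(glmnet(X, Y, alpha = 0, lambda = 0.5))   # ridge: all coefficients shrunk, none exactly 0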
State the residual sum of squares of a multivariate linear model
RSS(β) = e^T e = (Y − Xβ)^T (Y − Xβ), where e = Y − Xβ is the residual vector
and ∂^2RSS(β) / ∂β∂β^T = 2X^TX
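A worked step (using the differentiation formulas recalled on a later card): expanding RSS(β) = Y^T Y − 2 β^T X^T Y + β^T X^T X β gives
∂RSS(β) / ∂β = −2 X^T (Y − Xβ),
and setting this to 0 yields the normal equations X^T X β = X^T Y, whose solution is the least squares βˆ above.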
What is the z-score?
z_j = βˆ_j / (σˆ sqrt(v_j))
where v_j is the jth diagonal entry of (X^T X)^{−1} and
σˆ^2 = (1 / (n−p)) sum_{i=1}^n (Y_i − Yˆ_i)^2
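A minimal R check (simulated data) that these z-scores match the t values reported by summary(lm()):

set.seed(1)
n <- 60; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- drop(X %*% c(1, 0.5, 0) + rnorm(n))
fit <- lm(Y ~ X - 1)
v <- diag(solve(t(X) %*% X))
sigma2_hat <- sum(residuals(fit)^2) / (n - p)
coef(fit) / sqrt(sigma2_hat * v)           # z_j computed by hand
summary(fit)$coefficients[, "t value"]     # same values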
What is the p-value and its interpretation?
Probability of getting a value more extreme than z_j: p_{z_j} = 2 (1 − Φ(|z_j|))
where Φ is the CDF of the standard normal distribution, Z ~ N(0,1)
Under H_0 : β_j = 0 we have z_j ~ t_{n−p}, which is ≈ N(0,1) for large n
If p < 0.01 there is strong evidence to reject H_0; for p < 0.05 there is some evidence to reject H_0
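In R, with z, n and p as above (illustrative numbers):

n <- 60; p <- 3
z <- 2.1                              # an illustrative z-score
2 * (1 - pt(abs(z), df = n - p))      # two-sided p-value under t_{n-p}
2 * (1 - pnorm(abs(z)))               # normal approximation, close for large n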
What is the F score and its interpretation?
A way to compare one model M1 with p1+1 parameters to another model M0 with p0+1 parameters (contained in M1).
F = [ (RSS0 - RSS1) / (p1 - p0) ] / [ RSS1 / (n-p1-1) ]
q99 = qf(0.99, df1 = p1 - p0, df2 = n-p1-1)
- Reject H0 (the p1 − p0 extra coefficients are all 0) if F > q99, the 99% quantile of the F_{p1−p0, n−p1−1} distribution
(can start with M1 containing all variables and M0 containing no variables, then add variables one by one and test at each step).
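In R, anova() on two nested lm() fits reports this F statistic (simulated data, names illustrative):

set.seed(1)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)
m0 <- lm(y ~ x1)               # smaller model M0
m1 <- lm(y ~ x1 + x2 + x3)     # larger model M1 containing M0
anova(m0, m1)                  # F statistic and p-value for H0: the extra coefficients are 0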
How is ridge regression a shrinkage method?
Using the SVD X = UDV^T (U, V orthogonal):
Xβˆridge = UD (D^2 + λIp)^{−1} D U^T Y = sum_{j=1}^p u_j [d_j^2 / (d_j^2 + λ)] u_j^T Y
which shrinks most the coordinates along the u_j with small d_j^2 (jth diagonal element of D), i.e. the low-variance, less stable directions
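A numerical check in R that the SVD form matches the direct ridge fit (simulated data; lambda arbitrary):

set.seed(1)
n <- 50; p <- 4; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% c(1, -1, 0, 0) + rnorm(n))
s <- svd(X)                                                      # X = U D V^T
fit_svd    <- s$u %*% diag(s$d^2 / (s$d^2 + lambda)) %*% t(s$u) %*% Y
fit_direct <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
max(abs(fit_svd - fit_direct))                                   # ~ 0 up to rounding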
What is the link between SVD and Principal Components?
The sample covariance matrix of the centred data X is S = X^T X / n
Since X^T X = V D^2 V^T, the variance of X in direction v_j is d_j^2 / n, where v_j is the jth eigenvector of X^T X (the jth column of V), i.e. the direction of the jth principal component
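A small R check (centred simulated data) that the d_j^2 / n equal the eigenvalues of S:

set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)   # centred data
svd(X)$d^2 / n                    # variances along the principal directions v_j
eigen(t(X) %*% X / n)$values      # eigenvalues of S = X^T X / n: the same numbers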
What is the hat matrix and its degrees of freedom?
Hλ = X (X^T X + λIp)^{−1} X^T, so that Yˆ = Hλ Y = Xβˆridge (fitted values)
and the effective degrees of freedom is df (λ) = tr{Hλ} = sum (j=1 to p) dj^2 / (dj^2 + λ)
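A quick R check of df(λ) = tr(Hλ) (simulated X; lambda arbitrary):

set.seed(1)
n <- 50; p <- 4; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
H <- X %*% solve(t(X) %*% X + lambda * diag(p)) %*% t(X)   # H_lambda
d <- svd(X)$d
sum(diag(H))                 # tr(H_lambda)
sum(d^2 / (d^2 + lambda))    # same value; equals p when lambda = 0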
Recall vector and matrix differentiation formulas
- if y = x^T a then ∂y / ∂x = a
- if y = x^T A x then ∂y / ∂x = (A + A^T) x, which equals 2Ax when A is symmetric
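A quick numerical check in R of the quadratic-form gradient (symmetric A chosen for illustration):

A <- matrix(c(2, 1, 1, 3), 2, 2)    # symmetric A
x <- c(0.5, -1)
drop(2 * A %*% x)                   # analytic gradient of x^T A x
h <- 1e-6
sapply(1:2, function(j) {           # forward-difference approximation
  e <- replace(numeric(2), j, h)
  drop((t(x + e) %*% A %*% (x + e) - t(x) %*% A %*% x) / h)
})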
What are the expectation and variance of the least squares estimator, and its asymptotic distribution?
E(βˆ) = β and var(βˆ) = σ^2 (X^T X)^{−1}
βˆ ≈ N(β, σ^2 (X^T X)^{−1}) (exact under Gaussian errors, asymptotic otherwise)
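A small simulation sketch in R (arbitrary design and sigma) comparing the empirical covariance of βˆ with σ^2 (X^T X)^{−1}:

set.seed(1)
n <- 40; sigma <- 2
X <- cbind(1, rnorm(n))
beta <- c(1, 3)
betahats <- replicate(2000, {
  Y <- drop(X %*% beta + sigma * rnorm(n))
  drop(solve(t(X) %*% X, t(X) %*% Y))
})
var(t(betahats))              # empirical covariance over the simulations
sigma^2 * solve(t(X) %*% X)   # theoretical sigma^2 (X^T X)^{-1}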
What are the lasso coefficients when X is orthonormal (X^T X = Ip)?
βjˆlasso = sgn(βjˆls) (|βjˆls| − λ)+
i.e. βjˆlasso = 0 if |βjˆls| <= λ (soft thresholding)
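A small R sketch of the soft-thresholding rule (illustrative least squares values):

soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)   # sgn(b) (|b| - lambda)_+
beta_ls <- c(3, -0.4, 1.2, 0.1)
soft(beta_ls, lambda = 0.5)   # small coefficients set exactly to 0, larger ones shrunk by lambda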