Block 1: Regression and variable selection Flashcards

1
Q

Define the least squares estimator and state the Gauss-Markov theorem

A

βˆ = (X^T X)^{−1} X^T Y

The Gauss-Markov theorem states that βˆ is the best linear unbiased estimator (BLUE): it has the smallest variance among all linear unbiased estimators of β. Biased estimators (e.g. ridge) can nevertheless have a smaller MSE.
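
A minimal R sketch (simulated data, illustrative variable names) checking the closed form against lm():

  set.seed(1)
  n <- 100
  X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with an intercept column
  y <- drop(X %*% c(1, 2, -1) + rnorm(n))

  beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # (X^T X)^{-1} X^T Y
  cbind(beta_hat, coef(lm(y ~ X - 1)))         # should match lm()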

2
Q

What is the purpose of Ridge regression?

A
  • get a good fit while keeping control of the overall size of the elements of β
  • reduce the variance of the coefficients by shrinking large parameter values, via the constraint sum βj^2 <= t
  • make (X^T X + λI) better conditioned ("more invertible"): the condition number K(A) = max evalue / min evalue measures invertibility, and adding λ to every eigenvalue gives K(X^T X + λI) <= K(X^T X) (see the sketch after this list)
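
A small R sketch (simulated, deliberately near-collinear columns) illustrating how the λI term improves the condition number:

  set.seed(1)
  n <- 50
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.01)             # nearly collinear with x1
  X <- cbind(x1, x2)
  XtX <- crossprod(X)                        # X^T X

  kappa(XtX, exact = TRUE)                   # huge condition number
  kappa(XtX + 0.1 * diag(2), exact = TRUE)   # much smaller after adding lambda * I
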
3
Q

State βˆridge and its expectation

A

βˆridge = (X^T X + λIp)^{−1} X^T Y

with E[βˆridge] = (X^T X + λIp)^{−1} X^T E[Y] = (X^T X + λIp)^{−1} X^T Xβ ≠ β in general, so the ridge estimator is biased whenever λ > 0.
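
A short R sketch (simulated data) computing βˆridge directly from the formula and comparing it with least squares:

  set.seed(1)
  n <- 100; p <- 3; lambda <- 1
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% c(1, 2, -1) + rnorm(n))

  beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  beta_ls    <- solve(crossprod(X), crossprod(X, y))
  cbind(beta_ridge, beta_ls)           # ridge coefficients are pulled towards 0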

4
Q

What are the differences between Lasso and Ridge?

A
  • the penalty constraints: sum βj^2 <= t for ridge (L2 norm) versus sum |βj| <= t for lasso (L1 norm)
  • Lasso can shrink coefficients and set them exactly to 0 (variable selection), whereas ridge regression only does shrinkage (see the sketch below)
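
A hedged R sketch assuming the glmnet package is available (alpha = 1 gives the lasso, alpha = 0 gives ridge); the lambda value is arbitrary:

  library(glmnet)
  set.seed(1)
  n <- 100; p <- 10
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% c(1, -2, rep(0, p - 2)) + rnorm(n))   # only the first two variables matter

  coef(glmnet(X, y, alpha = 1, lambda = 0.2))   # lasso: several coefficients exactly 0
  coef(glmnet(X, y, alpha = 0, lambda = 0.2))   # ridge: all shrunk, none exactly 0
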
5
Q

State the residual sum of squares of a multivariate linear model

A

RSS(β) = e^T e, where e = Y − Xβ

with ∂RSS(β) / ∂β = −2X^T (Y − Xβ) and ∂^2RSS(β) / ∂β∂β^T = 2X^T X

6
Q

What is the z-score?

A

z_j = βˆ_j / ( σˆ sqrt(v_j) )
where v_j is the jth diagonal entry of (X^T X)^{-1} and
σˆ^2 = (1 / (n - p)) sum_{i=1}^n (Y_i - Yˆ_i)^2
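
A sketch in R (simulated data) computing the z-scores by hand; they coincide with the t values reported by summary(lm()):

  set.seed(1)
  n <- 100
  X <- cbind(1, rnorm(n), rnorm(n))    # p = 3 columns including the intercept
  y <- drop(X %*% c(1, 2, 0) + rnorm(n))

  fit <- lm(y ~ X - 1)
  sigma2_hat <- sum(residuals(fit)^2) / (n - ncol(X))
  v <- diag(solve(crossprod(X)))
  z <- coef(fit) / sqrt(sigma2_hat * v)

  cbind(z, summary(fit)$coefficients[, "t value"])   # should agree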

7
Q

What is the p-value and its interpretation?

A

The probability of getting a value more extreme than z_j is p_{z_j} = 2(1 − Φ(|z_j|)),
where Φ is the CDF of the standard normal distribution, Z ~ N(0,1).

Under H_0 : β_j = 0 we have z_j ~ t_{n-p} (approximately standard normal for large n).
If p < 0.01 there is strong evidence to reject H_0; if p < 0.05 there is some evidence to reject H_0 (see the sketch below).
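
In R, with hypothetical values for z_j and the residual degrees of freedom:

  z_j <- 2.4; df <- 97                 # illustrative numbers only
  2 * pt(-abs(z_j), df = df)           # exact two-sided p-value from the t distribution
  2 * pnorm(-abs(z_j))                 # large-sample normal approximation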

8
Q

What is the F score and its interpretation?

A

A way to compare a model M1 with p1+1 parameters to a smaller model M0 with p0+1 parameters (nested in M1).
F = [ (RSS0 - RSS1) / (p1 - p0) ] / [ RSS1 / (n - p1 - 1) ]
Under H0 (the smaller model M0 is adequate), F ~ F_{p1-p0, n-p1-1}, so compute e.g.
q99 = qf(0.99, df1 = p1 - p0, df2 = n - p1 - 1)
and reject H0 if F > q99, the 99% quantile of the F distribution.
(One can start with M1 containing all variables and M0 containing none, adding variables one by one; see the sketch below.)
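
A short R sketch (simulated data) computing the F statistic by hand; anova() on the two nested fits reports the same value:

  set.seed(1)
  n <- 100
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- 1 + 2 * x1 + rnorm(n)

  m0 <- lm(y ~ x1)                     # smaller model, p0 = 1 predictor
  m1 <- lm(y ~ x1 + x2 + x3)           # larger model,  p1 = 3 predictors

  rss0 <- sum(residuals(m0)^2); rss1 <- sum(residuals(m1)^2)
  f_stat <- ((rss0 - rss1) / (3 - 1)) / (rss1 / (n - 3 - 1))
  f_stat > qf(0.99, df1 = 3 - 1, df2 = n - 3 - 1)   # reject H0 only if TRUE

  anova(m0, m1)                        # same F statistic, with its p-value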

9
Q

How is ridge regression a shrinkage method?

A

Using the SVD decomposition X = UDV^T (U, V orthogonal):

Xβˆridge = UD (D^2 + λIp)^{-1} D U^T Y

which shrinks each coordinate of Y along u_j by the factor dj^2 / (dj^2 + λ), so the directions with small dj^2 (jth diagonal element of D) are shrunk the most.

10
Q

How is ridge regression a shrinkage method?

A

Using the SVD decomposition X=UDV^T (U, V orthogonal):

Xβˆridge = UD (D^2 + λIp)^{-1} D U^T Y = sum (j=1 to p) uj [ dj^2 / (dj^2 + λ) ] uj^T Y

which shrinks the coordinate along uj by the factor dj^2 / (dj^2 + λ): directions with small dj^2 (jth diagonal element of D), i.e. small variance (and hence less stable), are shrunk the most.
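
An R sketch (simulated data) verifying the SVD form of the ridge fitted values against the direct formula:

  set.seed(1)
  n <- 60; p <- 4; lambda <- 2
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% c(3, 0, -1, 0) + rnorm(n))

  fit_direct <- X %*% solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

  s <- svd(X)
  shrink <- s$d^2 / (s$d^2 + lambda)              # shrinkage factor per direction u_j
  fit_svd <- s$u %*% (shrink * crossprod(s$u, y)) # U D (D^2 + lambda I)^{-1} D U^T y

  max(abs(fit_direct - fit_svd))                  # ~ 0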

11
Q

What is the link between SVD and Principal Components?

A

the sample covariance matrix of centred data X is S = X^T X / n.
Therefore the variance of X in the direction vj is dj^2 / n, where vj is the jth eigenvector from the eigendecomposition of X^T X (equivalently the jth right singular vector of X), i.e. the jth principal component direction.
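
A quick R check (simulated, centred data) that dj^2 / n are the principal component variances:

  set.seed(1)
  n <- 200
  X <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)   # centred data

  svd(X)$d^2 / n                       # d_j^2 / n
  eigen(crossprod(X) / n)$values       # eigenvalues of S = X^T X / n, same numbers
  prcomp(X)$sdev^2                     # PC variances (prcomp divides by n - 1 instead of n)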

12
Q

What is the hat matrix and its degrees of freedom?

A

Hλ = X (X^T X + λIp)^{-1} X^T, so that the fitted values are Yˆ = Xβˆridge = Hλ Y,
and the effective degrees of freedom is df(λ) = tr{Hλ} = sum (j=1 to p) dj^2 / (dj^2 + λ) (see the sketch below)
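
An R sketch (simulated data) checking the two expressions for the effective degrees of freedom:

  set.seed(1)
  n <- 60; p <- 5; lambda <- 3
  X <- matrix(rnorm(n * p), n, p)

  H <- X %*% solve(crossprod(X) + lambda * diag(p), t(X))   # hat matrix H_lambda
  sum(diag(H))                                              # tr{H_lambda}

  d <- svd(X)$d
  sum(d^2 / (d^2 + lambda))                                 # same value from the singular values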

13
Q

Recall vector and matrix differentiation formulas

A
  • if y = x^T a then ∂y / ∂x = a
  • if y = x^T A x then ∂y / ∂x = (A + A^T) x, which equals 2Ax when A is symmetric

14
Q

What are the least squares expectation and variance, and the asymptotic distribution?

A
E(βˆ) = β
var(βˆ) = σ^2 (X^T X)^{-1}

βˆ ≈ N(β, σ^2 (X^T X)^{-1})
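
A small Monte Carlo sketch in R (simulated data) illustrating these moments:

  set.seed(1)
  n <- 50; sigma <- 2
  X <- cbind(1, rnorm(n))
  beta <- c(1, 3)

  est <- replicate(5000, {
    y <- drop(X %*% beta + rnorm(n, sd = sigma))
    drop(solve(crossprod(X), crossprod(X, y)))
  })

  rowMeans(est)                        # close to beta
  cov(t(est))                          # close to sigma^2 (X^T X)^{-1}
  sigma^2 * solve(crossprod(X))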

15
Q

What are the lasso coefficients for orthonormal X?

A

βjˆlasso = sgn(βjˆls) ( |βjˆls| − λ )+

i.e. βjˆlasso = 0 if |βjˆls| <= λ
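
This is soft-thresholding; a one-line R version (the vector of least squares coefficients is illustrative):

  soft_threshold <- function(beta_ls, lambda) sign(beta_ls) * pmax(abs(beta_ls) - lambda, 0)
  soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # gives -2, 0, 0, 1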

16
Q

What is principal components regression?

A

Similar in spirit to ridge regression, but it discards the low-variance principal component directions entirely (sets their coefficients to 0) instead of shrinking them.

17
Q

What is least angle regression (LAR)?

A

A method for variable selection whose algorithm is similar to forward stepwise regression (see the sketch after this list):

  • start with all coefficients βj = 0 and residual r = Y
  • variables are entered only partially: find the variable Xj most correlated with the residual r and move βj from 0 towards its least squares coefficient βjˆls
  • at step k: update the residual rk = Y - XAk βk, where XAk is the matrix containing only the variables selected so far and βk their current coefficients
  • travel in the direction δk = (XAk^T XAk)^(-1) XAk^T rk as follows: βk(α) = βk + α δk, so the fitted values evolve as fˆk(α) = fˆk + α uk, where uk = XAk δk
  • add Xi to the active set as soon as its correlation with the updated residual rk is as high as the correlation of the active variables XAk with rk
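
A hedged R sketch assuming the lars package is installed (type = "lar" selects least angle regression):

  library(lars)
  set.seed(1)
  n <- 100; p <- 8
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% c(3, -2, rep(0, p - 2)) + rnorm(n))

  fit <- lars(X, y, type = "lar")      # variables enter the active set one at a time
  coef(fit)                            # coefficient path, one row per step
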
18
Q

How to carry out stepwise regression (backward and forward)?

A
  • compute the association of each variable xi with the response y (e.g. p-values from an F or t test)
  • backward: delete the xi with the weakest association (i.e. the highest p-value / P(>|t|)) one at a time; stop when all remaining variables are statistically significant at the 5% level (p-value / P(>|t|) less than 0.05)
  • forward: select the xi that improves the fit the most, one at a time, until the fit stops improving or the newly introduced variable is not statistically significant at the 5% level (see the sketch after this list)
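
A hedged R sketch: drop1() gives the F tests / p-values behind a backward step, and step() automates the search (note that step() ranks models by AIC rather than p-values):

  set.seed(1)
  n <- 100
  dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  dat$y <- 1 + 2 * dat$x1 + rnorm(n)

  full <- lm(y ~ x1 + x2 + x3, data = dat)
  drop1(full, test = "F")              # F test / p-value for removing each variable
  step(full, direction = "backward")   # automated backward selection (AIC-based)
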
19
Q

Compare Lasso and LAR

A

The purpose is the same, but LAR is more computationally efficient (a small modification of the LAR algorithm computes the entire lasso path).

20
Q

What is the Principal Component Regression?

A
  • carry out the SVD to get the eigenvalues and eigenvectors of the sample covariance X^T X / n
  • change of basis: zm = X vm, where {vm} is the basis of eigenvectors; keep the first M components with the largest eigenvalues (largest variance)
  • carry out a least squares linear regression on the zm to get the fitted values (see the sketch after this list):
    yˆpcr = ȳ1 + sum (m=1 to M) θˆm zm = ȳ1 + X sum (m=1 to M) θˆm vm, with θˆm = <zm, y> / <zm, zm>
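
An R sketch (simulated, centred data) of PCR keeping the first M components:

  set.seed(1)
  n <- 100; p <- 6; M <- 2
  X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
  y <- drop(X %*% c(2, -1, rep(0, p - 2)) + rnorm(n))

  V <- svd(X)$v                        # eigenvectors of X^T X (principal directions)
  Z <- X %*% V[, 1:M]                  # scores z_m = X v_m for the first M components
  theta <- drop(crossprod(Z, y) / colSums(Z^2))   # theta_m = <z_m, y> / <z_m, z_m>
  y_pcr <- mean(y) + Z %*% theta       # fitted values ybar + sum_m theta_m z_m
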
21
Q

Compare Ridge regression and PCR

A

Both rely on the same idea of identifying directions (principal components) with small variance, but PCR removes those components entirely (coefficients set to 0) whereas ridge regression shrinks their coefficients.

22
Q

Mention diagnostic plots to evaluate fit

A
  • residuals vs fitted values (spread should be constant around 0)
  • scale-location plot: sqrt(|standardised residuals|) vs fitted values (or log(fitted values) if the variance is larger on the right-hand side); should be roughly constant
  • QQ plot: standardised residuals vs theoretical quantiles (should lie on the line y = x); see the sketch below
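
In R these are the standard diagnostic plots for an lm fit (built-in cars data used as an example):

  fit <- lm(dist ~ speed, data = cars)
  plot(fit, which = 1)   # residuals vs fitted values
  plot(fit, which = 3)   # scale-location: sqrt(|standardised residuals|) vs fitted values
  plot(fit, which = 2)   # normal Q-Q plot of the standardised residuals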