Block 1: Regression and variable selection Flashcards
Define the least squares estimator, and state the Gauss-Markov theorem
βˆ = argmin_β (Y − Xβ)^T (Y − Xβ) = (X^T X)^{−1} X^T Y
Gauss-Markov: βˆ is the best linear unbiased estimator (BLUE), i.e. it has the smallest variance among all linear unbiased estimators, but biased estimators can have a smaller MSE.
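A minimal R sketch (simulated data, illustrative names) checking the closed form against lm():

set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with an intercept column
Y <- drop(X %*% c(1, 2, -1) + rnorm(n))
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)             # (X^T X)^{-1} X^T Y
cbind(closed_form = drop(beta_hat), lm = coef(lm(Y ~ X - 1)))   # identical up to rounding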
What is the purpose of Ridge regression?
- get a good fit while keeping control of the overall size of the elements of β
- reduce the variance of the coefficients by shrinking large parameter values, via the constraint sum βj^2 <= t
- make X^T X + λI better conditioned, since the condition number measures "invertibility": K(X^T X + λI) = max eigenvalue / min eigenvalue <= K(X^T X) (checked numerically in the sketch below)
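A quick R check of the conditioning claim (simulated X with two nearly collinear columns; the lambda value is arbitrary):

cond <- function(A) { ev <- eigen(A, symmetric = TRUE)$values; max(ev) / min(ev) }
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.01 * rnorm(n)    # near collinearity makes X^T X ill-conditioned
cond(t(X) %*% X)                      # large condition number
cond(t(X) %*% X + 10 * diag(p))       # much smaller: ridge improves the conditioning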
State βˆridge and its expectation
βˆridge = (X^T X + λIp)^{−1} X^T Y
with E[βˆridge] = (X^T X + λIp)^{−1} X^T E[Y] = (X^T X + λIp)^{−1} X^T Xβ, so βˆridge is biased for λ > 0
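A minimal R sketch of the closed form (simulated data; the lambda value is arbitrary):

set.seed(1)
n <- 100; p <- 4; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% c(2, -1, 0, 0) + rnorm(n))
beta_ols   <- drop(solve(t(X) %*% X, t(X) %*% Y))
beta_ridge <- drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y))   # (X^T X + lambda I_p)^{-1} X^T Y
cbind(ols = beta_ols, ridge = beta_ridge)   # the ridge coefficients are smaller overall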
What are the differences between Lasso and Ridge?
- the penalty constraints: sum βj^2 <= t (ridge, L2 norm) vs sum |βj| <= t (lasso, L1 norm)
- Lasso can both shrink coefficients and set some exactly to 0 (variable selection), whereas ridge regression only does shrinkage
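A sketch using the glmnet package (assumed installed; alpha = 0 gives ridge, alpha = 1 gives lasso; the lambda value is illustrative):

library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:2] %*% c(3, -2) + rnorm(n))
coef(glmnet(X, Y, alpha = 1, lambda = 0.5))   # lasso: several coefficients exactly 0
coef(glmnet(X, Y, alpha = 0, lambda = 0.5))   # ridge: all coefficients shrunk, none exactly 0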
State the residual sum of squares of a multivariate linear model
RSS(β) = e^T e = (Y − Xβ)^T (Y − Xβ), where e = Y − Xβ is the residual vector
and ∂^2RSS(β) / ∂β∂β^T = 2X^TX
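A worked step (using the differentiation formulas recalled on a later card): expanding RSS(β) = Y^T Y − 2 β^T X^T Y + β^T X^T X β gives
∂RSS(β) / ∂β = −2 X^T (Y − Xβ),
and setting this to 0 yields the normal equations X^T X β = X^T Y, whose solution is the least squares βˆ above.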
What is the z-score?
z_j = βˆ_j / (σˆ sqrt(v_j))
where v_j is the jth diagonal entry of (X^T X)^{−1} and
σˆ^2 = (1 / (n−p)) sum_{i=1}^n (Y_i − Yˆ_i)^2
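A minimal R check (simulated data) that these z-scores match the t values reported by summary(lm()):

set.seed(1)
n <- 60; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- drop(X %*% c(1, 0.5, 0) + rnorm(n))
fit <- lm(Y ~ X - 1)
v <- diag(solve(t(X) %*% X))
sigma2_hat <- sum(residuals(fit)^2) / (n - p)
coef(fit) / sqrt(sigma2_hat * v)           # z_j computed by hand
summary(fit)$coefficients[, "t value"]     # same values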
What is the p-value and its interpretation?
Probability of getting a value more extreme than z_j: p_{z_j} = 2 (1 − Φ(|z_j|))
where Φ is the CDF of the standard normal distribution, Z ~ N(0,1)
Under H_0 : β_j = 0 we have z_j ~ t_{n−p}, which is ≈ N(0,1) for large n
If p < 0.01 there is strong evidence to reject H_0; for p < 0.05 there is some evidence to reject H_0
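In R, with z, n and p as above (illustrative numbers):

n <- 60; p <- 3
z <- 2.1                              # an illustrative z-score
2 * (1 - pt(abs(z), df = n - p))      # two-sided p-value under t_{n-p}
2 * (1 - pnorm(abs(z)))               # normal approximation, close for large n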
What is the F score and its interpretation?
A way to compare one model M1 with p1+1 parameters to another model M0 with p0+1 parameters (contained in M1).
F = [ (RSS0 - RSS1) / (p1 - p0) ] / [ RSS1 / (n-p1-1) ]
q99 = qf(0.99, df1 = p1 - p0, df2 = n-p1-1)
- Reject H0 (the p1 − p0 extra coefficients are all 0) if F > q99, the 99% quantile of the F_{p1−p0, n−p1−1} distribution
(can start with M1 containing all variables and M0 containing no variables, then add variables one by one and test at each step).
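In R, anova() on two nested lm() fits reports this F statistic (simulated data, names illustrative):

set.seed(1)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)
m0 <- lm(y ~ x1)               # smaller model M0
m1 <- lm(y ~ x1 + x2 + x3)     # larger model M1 containing M0
anova(m0, m1)                  # F statistic and p-value for H0: the extra coefficients are 0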
How is ridge regression a shrinkage method?
Using the SVD X = UDV^T (U, V orthogonal):
Xβˆridge = UD (D^2 + λIp)^{−1} D U^T Y = sum_{j=1}^p u_j [d_j^2 / (d_j^2 + λ)] u_j^T Y
which shrinks most the coordinates along the u_j with small d_j^2 (jth diagonal element of D), i.e. the low-variance, less stable directions
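A numerical check in R that the SVD form matches the direct ridge fit (simulated data; lambda arbitrary):

set.seed(1)
n <- 50; p <- 4; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% c(1, -1, 0, 0) + rnorm(n))
s <- svd(X)                                                      # X = U D V^T
fit_svd    <- s$u %*% diag(s$d^2 / (s$d^2 + lambda)) %*% t(s$u) %*% Y
fit_direct <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
max(abs(fit_svd - fit_direct))                                   # ~ 0 up to rounding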
What is the link between SVD and Principal Components?
The sample covariance matrix of the centred data X is S = X^T X / n
Since X^T X = V D^2 V^T, the variance of X in direction v_j is d_j^2 / n, where v_j is the jth eigenvector of X^T X (the jth column of V), i.e. the direction of the jth principal component
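A small R check (centred simulated data) that the d_j^2 / n equal the eigenvalues of S:

set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)   # centred data
svd(X)$d^2 / n                    # variances along the principal directions v_j
eigen(t(X) %*% X / n)$values      # eigenvalues of S = X^T X / n: the same numbers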
What is the hat matrix and its degrees of freedom?
Hλ = X (X^T X + λIp)^{−1} X^T, so that Yˆ = Hλ Y = Xβˆridge (fitted values)
and the effective degrees of freedom is df (λ) = tr{Hλ} = sum (j=1 to p) dj^2 / (dj^2 + λ)
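A quick R check of df(λ) = tr(Hλ) (simulated X; lambda arbitrary):

set.seed(1)
n <- 50; p <- 4; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
H <- X %*% solve(t(X) %*% X + lambda * diag(p)) %*% t(X)   # H_lambda
d <- svd(X)$d
sum(diag(H))                 # tr(H_lambda)
sum(d^2 / (d^2 + lambda))    # same value; equals p when lambda = 0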
Recall vector and matrix differentiation formulas
- if y = x^T a then ∂y / ∂x = a
- if y = x^T A x then ∂y / ∂x = (A + A^T) x, which equals 2Ax when A is symmetric
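A quick numerical check in R of the quadratic-form gradient (symmetric A chosen for illustration):

A <- matrix(c(2, 1, 1, 3), 2, 2)    # symmetric A
x <- c(0.5, -1)
drop(2 * A %*% x)                   # analytic gradient of x^T A x
h <- 1e-6
sapply(1:2, function(j) {           # forward-difference approximation
  e <- replace(numeric(2), j, h)
  drop((t(x + e) %*% A %*% (x + e) - t(x) %*% A %*% x) / h)
})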
What are the expectation and variance of the least squares estimator, and its asymptotic distribution?
E(βˆ) = β and var(βˆ) = σ^2 (X^T X)^{−1}
βˆ ≈ N(β, σ^2 (X^T X)^{−1}) (exact under Gaussian errors, asymptotic otherwise)
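A small simulation sketch in R (arbitrary design and sigma) comparing the empirical covariance of βˆ with σ^2 (X^T X)^{−1}:

set.seed(1)
n <- 40; sigma <- 2
X <- cbind(1, rnorm(n))
beta <- c(1, 3)
betahats <- replicate(2000, {
  Y <- drop(X %*% beta + sigma * rnorm(n))
  drop(solve(t(X) %*% X, t(X) %*% Y))
})
var(t(betahats))              # empirical covariance over the simulations
sigma^2 * solve(t(X) %*% X)   # theoretical sigma^2 (X^T X)^{-1}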
What are the lasso coefficients when X is orthonormal (X^T X = Ip)?
βjˆlasso = sgn(βjˆls) (|βjˆls| − λ)+
i.e. βjˆlasso = 0 if |βjˆls| <= λ (soft thresholding)
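A small R sketch of the soft-thresholding rule (illustrative least squares values):

soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)   # sgn(b) (|b| - lambda)_+
beta_ls <- c(3, -0.4, 1.2, 0.1)
soft(beta_ls, lambda = 0.5)   # small coefficients set exactly to 0, larger ones shrunk by lambda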