Block 3: Modern Regression Flashcards
Formula for a linear basis expansion in X with a restricted number Mj of transformations of each coordinate
f(X) = sum(j=1 to p) sum(m=1 to Mj) βj,m hj,m(Xj)
where p: dimension of the data
where Mj: number of flexible functions of Xj
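A minimal numpy sketch of this idea, assuming (as one hypothetical choice) monomial basis functions hj,m(Xj) = Xj^m:

```python
import numpy as np

def basis_expansion(X, M):
    """Expand each column X[:, j] into M[j] basis functions hj,m(Xj) = Xj^m (one illustrative choice)."""
    cols = [np.ones(len(X))]                       # intercept, added for convenience
    for j in range(X.shape[1]):                    # j = 1..p
        for m in range(1, M[j] + 1):               # m = 1..Mj
            cols.append(X[:, j] ** m)
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # p = 2
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

H = basis_expansion(X, M=[3, 3])                   # Mj = 3 for each coordinate
beta, *_ = np.linalg.lstsq(H, y, rcond=None)       # f(X) = sum_j sum_m beta_{j,m} hj,m(Xj)
```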
What are the degrees of freedom of a general piecewise polynomial? And of continuous piecewise cubic functions over 3 regions?
d = Number of parameters - number of constraints
example: continuous piecewise cubic functions over 3 regions (i.e. regions given by the 3 indicator functions hm, m = 1, 2, 3, with a cubic in each):
- 12 parameters (4 parameters per region)
- 6 constraints (3 per knot: continuity of the function, of its 1st derivative and of its 2nd derivative, at each of the 2 knots)
d= 12 - 6 = 6
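The same bookkeeping as a tiny helper (a sketch; the function and argument names are made up for illustration):

```python
def piecewise_df(degree, n_regions, smoothness):
    """Degrees of freedom = parameters - constraints for a piecewise polynomial.
    smoothness = number of matching conditions per knot (3 = continuity of f, f', f'')."""
    n_params = (degree + 1) * n_regions        # e.g. 4 parameters per cubic region
    n_knots = n_regions - 1
    return n_params - smoothness * n_knots

print(piecewise_df(degree=3, n_regions=3, smoothness=3))  # 12 - 6 = 6
```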
Define Natural Cubic Spline basis
Cubic splines with K knots ξk (the number of knots and their placement must be chosen), constrained to be linear in the two end (boundary) regions, defined by K basis functions:
N1(X) = 1
N2(X) = X
Nk+2(X) = dk(X) - dK-1(X) for k = 1, ..., K-2, with dk(X) = ((X-ξk)^3+ - (X-ξK)^3+) / (ξK - ξk) (watch out for the difference between capital K and lowercase k)
the linearity constraints beyond the boundary knots free up 4 degrees of freedom compared with an ordinary cubic spline, leaving K degrees of freedom (one per basis function)
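A direct numpy transcription of these basis functions (a sketch; the knot values below are arbitrary):

```python
import numpy as np

def natural_spline_basis(x, knots):
    """N1..NK natural cubic spline basis evaluated at x, for sorted knots xi_1..xi_K."""
    x = np.asarray(x, dtype=float)
    K = len(knots)
    def d(k):  # dk(X) = ((X - xi_k)^3+ - (X - xi_K)^3+) / (xi_K - xi_k), k given 0-based
        return (np.maximum(x - knots[k], 0) ** 3
                - np.maximum(x - knots[K - 1], 0) ** 3) / (knots[K - 1] - knots[k])
    cols = [np.ones_like(x), x]                  # N1 = 1, N2 = X
    for k in range(K - 2):                       # Nk+2 = dk - dK-1
        cols.append(d(k) - d(K - 2))
    return np.column_stack(cols)                 # n x K basis matrix

N = natural_spline_basis(np.linspace(0, 1, 50), knots=[0.1, 0.3, 0.5, 0.7, 0.9])
```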
Define smoothing splines solution (equivalent)
f(x) = sum(j=1 to n) Nj(x) θj, with θ the minimising solution of RSS(θ,λ) = (y-Nθ)^T (y-Nθ) + λ θ^T Ω θ, where Ωjk = ∫ Nj''(t) Nk''(t) dt
Ridge-regression-like solution: θ^ = (N^TN + λΩ)^(-1) N^Ty, giving f^(x) = sum(j=1 to n) Nj(x) θ^j; the vector of fitted values is N θ^ = Sλ y
where Sλ = N(N^TN + λΩ)^(-1) N^T is the smoother matrix, analogous to the hat matrix (symmetric and positive semidefinite), with effective degrees of freedom tr(Sλ)
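A sketch of the closed-form fit, assuming the basis matrix N (e.g. from the previous sketch) and the penalty matrix Ω have already been built:

```python
import numpy as np

def smoothing_spline_fit(N, Omega, y, lam):
    """theta_hat = (N^T N + lam*Omega)^{-1} N^T y, smoother matrix S_lam and its trace."""
    A = N.T @ N + lam * Omega
    theta_hat = np.linalg.solve(A, N.T @ y)
    S = N @ np.linalg.solve(A, N.T)          # S_lam = N (N^T N + lam*Omega)^{-1} N^T
    return theta_hat, S, np.trace(S)         # trace = effective degrees of freedom
```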
What is the purpose of smoothing splines
Aim is to find f minimising RSS(f,λ) = sum(i=1 to n) {yi-f(xi)}^2 + λ ∫ {f’‘(t)}^2 dt
- 1st term measures goodness of fit (λ = 0: any function interpolating the data, i.e. a perfect fit)
- 2nd term penalises curvature (λ = ∞: simple least-squares linear fit, since no curvature is tolerated)
λ controls the trade-off
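A small numeric illustration of the trade-off, using second differences of the fitted values as a discrete stand-in for the curvature penalty (an assumption; this is not the exact spline solution):

```python
import numpy as np

n = 50
x = np.linspace(0, 1, n)
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

D2 = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n second-difference matrix
for lam in [1e-6, 1e-2, 1e2, 1e6]:
    S = np.linalg.inv(np.eye(n) + lam * D2.T @ D2)
    f_hat = S @ y                             # penalised fit
    print(lam, np.trace(S))                   # df ~ n for tiny lambda, -> 2 (a line) for huge lambda
```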
Give two other forms of Sλ
- Reinsch form: Sλ = (I + λK)^(-1) where K does not depend on λ
- eigendecomposition: Sλ = sum(k=1 to n) ρk(λ) uk uk^T where ρk(λ) = 1 / (1+λdk), with dk and uk the eigenvalues and eigenvectors of K, which therefore do not depend on λ
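A numeric check that the two forms agree, reusing the second-difference matrix as a stand-in for K (an assumption, not the exact spline penalty):

```python
import numpy as np

n = 50
D2 = np.diff(np.eye(n), n=2, axis=0)
K = D2.T @ D2                                  # stand-in penalty matrix, lambda-free
d, U = np.linalg.eigh(K)                       # eigenvalues dk, eigenvectors uk

lam = 1.0
S_reinsch = np.linalg.inv(np.eye(n) + lam * K) # (I + lam*K)^{-1}
rho = 1.0 / (1.0 + lam * d)                    # shrinkage factors rho_k(lambda)
S_eig = (U * rho) @ U.T                        # sum_k rho_k(lam) uk uk^T
print(np.allclose(S_reinsch, S_eig))           # True
```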
How to choose λ from cross-validation?
λ* is the minimiser of CV(f^λ) = 1/n sum(i=1 to n) {(yi - f^λ(xi)) / (1-Sλ(i,i))}^2, i.e. leave-one-out CV computed from the diagonal of the smoother matrix
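The shortcut in code (a sketch; `smoother(lam)` stands for any function returning Sλ, e.g. the one from the earlier sketch):

```python
import numpy as np

def loocv_score(y, S):
    """CV(lambda) = mean of {(y_i - f_hat(x_i)) / (1 - S_ii)}^2."""
    f_hat = S @ y
    leverage = np.diag(S)
    return np.mean(((y - f_hat) / (1.0 - leverage)) ** 2)

# lam_star = min(lam_grid, key=lambda lam: loocv_score(y, smoother(lam)))  # smoother() is hypothetical
```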
Define Kernel smoothing
Aim (same as spline smoothing): density estimation and non-parametric regression
Define a kernel function K
- K(x) >= 0 for all x
- integrates to 1
- symmetric
Define the kernel density estimator
f^n,h,K(x) = 1/(nh) sum(i=1 to n) K((Xi - x)/h)
average of kernels centred at the data points Xi, evaluated at x, with bandwidth h (an important choice: too small => high variance, low bias / too large => low variance, high bias)
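A from-scratch estimator (a sketch; the Gaussian kernel is just one possible choice of K):

```python
import numpy as np

def kde(x_grid, data, h,
        kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):   # Gaussian K
    """f_hat(x) = 1/(n*h) * sum_i K((X_i - x)/h), evaluated at each point of x_grid."""
    u = (data[None, :] - x_grid[:, None]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(2)
data = rng.normal(size=200)
dens = kde(np.linspace(-4, 4, 100), data, h=0.4)
```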
How to choose h from bias and variance of rectangular kernel?
nh f^n,h(x) ~ Bin(n, p), the number of points in the interval, with p = P(x - h/2 < X <= x + h/2) = F(x + h/2) - F(x - h/2)
- E{f^n,h(x)} = p/h -> f(x) as h -> 0
- var{f^n,h(x)} = np(1-p) / (nh)^2 ≈ f(x) / (nh) for small h, which -> 0 as nh -> ∞
Best choice of h:
- minimise the MSE, using Taylor expansions of F(x+h/2) and F(x-h/2): MSE ≈ f(x)/(nh) + (1/24)^2 {f''(x)}^2 h^4 (variance + squared bias)
- differentiating w.r.t. h gives h* = (C*/n)^(1/5), with C* depending on f(x) and f''(x)
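A numeric sanity check of the rate, taking the standard normal density at x = 0 as a toy f (an assumed example); the constant C* = 144 f(x)/f''(x)^2 follows from setting the derivative of the MSE above to zero:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x, n = 0.0, 1000
f = norm.pdf(x)                                  # f(x)
f2 = (x**2 - 1) * norm.pdf(x)                    # f''(x) for the standard normal

mse = lambda h: f / (n * h) + (f2 / 24) ** 2 * h ** 4
h_numeric = minimize_scalar(mse, bounds=(1e-3, 2), method="bounded").x
h_formula = (144 * f / (f2 ** 2 * n)) ** 0.2     # h* = (C*/n)^(1/5)
print(h_numeric, h_formula)                      # agree up to optimiser tolerance
```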
What is the expectation and variance with a general kernel?
- express E{f^n,h(x)} = ∫ K(v) f(x + hv) dv and Taylor-expand f(x + hv) as before; the rates in h are the same, but the bias and variance constants now also depend on the kernel (through ∫ v^2 K(v) dv and ∫ K(v)^2 dv) as well as on f(x) and f''(x)
Mention types of kernels
- rectangular
- triangular
- Epanechnikov (parabolic)
- Gaussian / normal
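Their standard forms for reference (a sketch; all satisfy the three kernel properties, with support [-1, 1] except the Gaussian):

```python
import numpy as np

kernels = {
    "rectangular":  lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":   lambda u: np.maximum(1 - np.abs(u), 0),
    "epanechnikov": lambda u: 0.75 * np.maximum(1 - u**2, 0),
    "gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
}

u = np.linspace(-3, 3, 6001)
for name, K in kernels.items():
    print(name, (K(u) * (u[1] - u[0])).sum())    # each integrates to ~1 (Riemann sum)
```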
What is the Nadaraya-Watson kernel regression estimating E(Y|X) ?
E^(Y|X=x) = (sum(i=1 to n) Yi Kh(x-Xi)) / (sum(i=1 to n) Kh(x-Xi))
but still depends on h (which strongly affects the variability of the result)
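A direct implementation of the ratio (a sketch; a Gaussian Kh is assumed, and the 1/h factors cancel between numerator and denominator):

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """E_hat(Y | X = x) = sum_i Yi Kh(x - Xi) / sum_i Kh(x - Xi)."""
    u = (x_grid[:, None] - X[None, :]) / h
    W = np.exp(-0.5 * u**2)                       # Gaussian kernel; constants cancel in the ratio
    return (W @ Y) / W.sum(axis=1)

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 150))
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=150)
m_hat = nadaraya_watson(np.linspace(0, 1, 100), X, Y, h=0.05)
```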
Steps of Local Polynomial Regression
Aim: estimate polynomial mx0(x) = sum(j=0 to p) βj(x0) (x-x0)^j centered at x0
- construct the diagonal weight matrix Wi,i = Kh(Xi - x0) (weight of observation i at x0) and the design matrix Xi,j = (Xi - x0)^j, j = 0,...,p
- set β^(x0) = (X^T W X)^(-1) X^T W Y, the minimiser of the weighted least-squares criterion RSS(x0)
- same order of variance as NW but smaller bias (odd p is generally preferred: even-degree fits behave badly at the extremes, where the bias is quadratic)
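The three steps in code (a sketch; the Gaussian kernel and the grid of target points x0 are assumptions):

```python
import numpy as np

def local_poly_fit(x0, X, Y, h, p=1):
    """Weighted LS fit of a degree-p polynomial centred at x0; returns m_hat(x0) = beta_0(x0)."""
    Xd = np.column_stack([(X - x0) ** j for j in range(p + 1)])   # X_{i,j} = (Xi - x0)^j
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)                        # W_{i,i} = Kh(Xi - x0)
    beta_hat = np.linalg.solve(Xd.T @ (w[:, None] * Xd), Xd.T @ (w * Y))  # (X^T W X)^{-1} X^T W Y
    return beta_hat[0]                                            # intercept = fit at x0

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=200)
m_hat = [local_poly_fit(x0, X, Y, h=0.1, p=1) for x0 in np.linspace(0, 1, 50)]
```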