Block 3: Modern Regression Flashcards
(26 cards)
Formula for a linear basis expansion in X with a restricted number Mj of transformations per coordinate
f(X) = sum(j=1 to p) sum(m=1 to Mj) βj,m hj,m(Xj)
- p: dimension of the data
- Mj: number of flexible basis functions hj,m of Xj
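A minimal sketch of fitting such an expansion by least squares (the choice Mj = 2 with hj,1(x) = x, hj,2(x) = x^2 and the simulated data are illustrative assumptions, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # n = 100 observations, p = 3
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

# expanded design matrix: one column per h_{j,m}(X_j)
H = np.column_stack([h(X[:, j]) for j in range(X.shape[1])
                     for h in (lambda x: x, lambda x: x ** 2)])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)       # estimates of the beta_{j,m}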
What are the degrees of freedom of a general piecewise polynomial? And of a continuous piecewise cubic function over 3 regions?
d = number of parameters - number of constraints
Example: continuous piecewise cubic functions over 3 regions (i.e. 2 knots, one indicator function hm per region, m=1,2,3):
- 12 parameters (4 per cubic region)
- 6 constraints (continuity of the function, of the 1st derivative and of the 2nd derivative at each of the 2 knots: 3 x 2)
d = 12 - 6 = 6
Define Natural Cubic Spline basis
Cubic splines with K knots ξk (the number of knots and their placement must be chosen), constrained to be linear in the two boundary regions, defined by K basis functions:
N1(X) = 1
N2(X) = X
Nk+2(X) = dk(X) - dK-1(X), k = 1,...,K-2, with dk(X) = ((X-ξk)^3+ - (X-ξK)^3+) / (ξK - ξk) (watch out for the difference between K and k)
The linearity constraints in the two boundary regions free 4 degrees of freedom (2 per boundary), leaving K in total.
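A sketch of this basis in code (the grid and the four knot values are arbitrary illustrations):

import numpy as np

def natural_spline_basis(x, knots):
    x = np.asarray(x, dtype=float)
    K = len(knots)
    def d(k):  # d_k(X) = ((X - xi_k)^3_+ - (X - xi_K)^3_+) / (xi_K - xi_k)
        return (np.maximum(x - knots[k], 0) ** 3
                - np.maximum(x - knots[-1], 0) ** 3) / (knots[-1] - knots[k])
    cols = [np.ones_like(x), x]                       # N_1, N_2
    cols += [d(k) - d(K - 2) for k in range(K - 2)]   # N_{k+2}, k = 1,...,K-2 (0-indexed here)
    return np.column_stack(cols)                      # n x K basis matrix

B = natural_spline_basis(np.linspace(0, 1, 50), knots=np.array([0.2, 0.4, 0.6, 0.8]))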
Define the smoothing spline solution (equivalent finite-dimensional form)
f(x) = sum(j=1 to n) Nj(x) θj, with θ minimising RSS(θ,λ) = (y-Nθ)^T (y-Nθ) + λ θ^T Ω θ, where Ωjk = ∫ Nj''(t) Nk''(t) dt
Ridge regression solution: θ^ = (N^TN + λΩ)^(-1) N^Ty, giving f^(x) = sum(j=1 to n) Nj(x) θ^j, i.e. f^ = Sλ y
where Sλ = N(N^TN + λΩ)^(-1) N^T is the smoother matrix, similar to the hat matrix (symmetric and positive semidefinite), and the effective degrees of freedom are tr(Sλ)
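A minimal sketch of this solution, assuming the basis matrix N and the penalty matrix Omega (entries ∫ Nj''(t) Nk''(t) dt) have already been computed:

import numpy as np

def smoothing_spline_fit(N, Omega, y, lam):
    A = N.T @ N + lam * Omega
    theta_hat = np.linalg.solve(A, N.T @ y)    # (N^T N + lam*Omega)^(-1) N^T y
    S_lam = N @ np.linalg.solve(A, N.T)        # smoother matrix S_lambda
    return N @ theta_hat, np.trace(S_lam)      # fitted values and effective df = tr(S_lambda)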
What is the purpose of smoothing splines
Aim is to find the f minimising RSS(f,λ) = sum(i=1 to n) {yi - f(xi)}^2 + λ ∫ {f''(t)}^2 dt
- 1st term measures goodness of fit (λ = 0: f can interpolate the data, nearly perfect fit)
- 2nd term penalises curvature (λ = ∞: simple least-squares linear fit)
λ controls the trade-off
Give two other forms of Sλ
- Reinsch form: Sλ = (I + λK)^(-1) where K does not depend on λ
- eigendecomposition: Sλ = sum(k=1 to n) ρk(λ) uk uk^T with ρk(λ) = 1 / (1 + λ dk), where dk is the k-th eigenvalue of K; the uk and dk therefore do not depend on λ
How to choose λ from cross-validation?
λ* is the minimiser of CV(f^λ) = 1/n sum(i=1 to n) {(yi - f^λ(xi)) / (1 - Sλ(i,i))}^2
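A sketch of the leave-one-out score (no refitting needed, only the diagonal of Sλ); λ* is then the minimiser over a grid of λ values:

import numpy as np

def loo_cv_score(y, f_hat, S_lam):
    # f_hat = S_lam @ y are the fitted values for this lambda
    resid = (y - f_hat) / (1.0 - np.diag(S_lam))
    return np.mean(resid ** 2)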
Define Kernel smoothing
Aim (same as spline smoothing): density estimation and non-parametric regression
Define a kernel function K
- K(x) >= 0 for all x
- integrates to 1
- symmetric
Define the kernel density estimator
f^n,h,K(x) = 1/(nh) sum(i=1 to n) K((Xi - x)/h)
average of kernels centred at the observations Xi, evaluated at x, with bandwidth h (important choice: too small => high variance, low bias / too large => low variance, high bias)
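A minimal sketch of the estimator with a Gaussian kernel (the kernel choice and the bandwidth passed in are illustrative):

import numpy as np

def kde(x_grid, X, h):
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    u = (X[None, :] - x_grid[:, None]) / h                     # (X_i - x) / h
    return K(u).sum(axis=1) / (len(X) * h)                     # 1/(nh) sum_i K((X_i - x)/h)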
How to choose h from bias and variance of rectangular kernel?
nh f^n,h(x) ~ Bin(n, p), the number of points falling in the interval (x - h/2, x + h/2), with p = P(x - h/2 < X < x + h/2) = F(x + h/2) - F(x - h/2); then E{f^n,h(x)} = p/h -> f(x) as h -> 0, and var{f^n,h(x)} = np(1-p)/(nh)^2 ≈ f(x)/(nh) -> 0 as nh -> ∞
Best choice of h:
- minimise the MSE: a Taylor expansion of F(x + h/2) and F(x - h/2) gives MSE ≈ f(x)/(nh) + (1/24)^2 {f''(x)}^2 h^4
- differentiating wrt h gives h* = (C*/n)^(1/5), with C* depending on f(x) and f''(x)
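Filling in the differentiation step in the notes' notation (standard calculation): setting d/dh MSE = -f(x)/(nh^2) + 4 (1/24)^2 {f''(x)}^2 h^3 = 0 gives h^5 = 144 f(x) / (n {f''(x)}^2), i.e. h* = (C*/n)^(1/5) with C* = 144 f(x) / {f''(x)}^2.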
What is the expectation and variance with a general kernel?
- express the expectation and variance as integrals, substitute v = (y - x)/h and Taylor-expand f(x + hv): as in the rectangular case, the leading bias term depends on f''(x) and the leading variance term on f(x), now weighted by moments of K (∫ v^2 K(v) dv and ∫ K(v)^2 dv)
Mention types of kernels
- rectangular
- triangular
- Epanechnikov (parabolic)
- Gaussian / normal
What is the Nadaraya-Watson kernel regression estimating E(Y|X) ?
E^(Y|X=x) = (sum(i=1 to n) Yi Kh(x-Xi)) / (sum(i=1 to n) Kh(x-Xi))
but the result still depends on h (which has a large impact on the variability of the estimate)
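A minimal sketch with a Gaussian kernel (the normalising constant of Kh cancels in the ratio, so unnormalised weights suffice):

import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    u = (x_grid[:, None] - X[None, :]) / h
    W = np.exp(-0.5 * u ** 2)                            # Gaussian weights K_h(x - X_i), unnormalised
    return (W * Y[None, :]).sum(axis=1) / W.sum(axis=1)  # sum_i Y_i K_h(x - X_i) / sum_i K_h(x - X_i)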
Steps of Local Polynomial Regression
Aim: estimate polynomial mx0(x) = sum(j=0 to p) βj(x0) (x-x0)^j centered at x0
- construct the diagonal weight matrix W with Wi,i = Kh(Xi - x0) (the weight of Xi in the fit at x0) and the design matrix X with Xi,j = (Xi - x0)^j
- set β^(x0) = (X^T W X)^(-1) X^T W Y, the minimiser of the weighted least-squares problem RSS(x0)
- same order of variance as Nadaraya-Watson but smaller bias (odd p is generally preferred: even-degree fits behave poorly at the boundaries, where the bias becomes quadratic)
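A sketch of one local fit at x0 with Gaussian weights; the estimate of the regression function at x0 is the intercept β^0(x0):

import numpy as np

def local_poly_fit(x0, X, Y, h, p=1):
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)            # W_ii = K_h(X_i - x0)
    Xd = np.vander(X - x0, N=p + 1, increasing=True)  # columns (X_i - x0)^j, j = 0,...,p
    WX = Xd * w[:, None]
    beta_hat = np.linalg.solve(Xd.T @ WX, WX.T @ Y)   # (X^T W X)^(-1) X^T W Y
    return beta_hat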
Define orthogonal series
1D: f(x) = sum(ν) fν ρν(x), where fν = ∫ f(x) ρν(x) dx and ∫ ρν(x) ρμ(x) dx = δν,μ
2D: f(x,y) = sum(ν) sum(μ) fν,μ ρν(x) ρμ(y), where fν,μ = ∫∫ f(x,y) ρν(x) ρμ(y) dx dy
same information, but expressed in the orthogonal basis functions {ρν}
Steps of orthogonal series estimators
- compute estimator f^ν = 1/n sum(i=1 to n) ρν(Xi) (and f^ν,μ = 1/n sum(i=1 to n) ρν(Xi) ρμ(Yi) )
- substitute the f^ν into the expansion of f(x) to get f^(x): unbiased if the sum runs over all ν from -∞ to ∞, but truncating to |ν| <= m gives bias{f^(x)} = - sum(|ν| > m) fν ρν(x)
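A sketch of the 1D density estimator, assuming data on (0,1) and the cosine basis ρ0(x) = 1, ρν(x) = √2 cos(πνx) (the basis choice is an illustration, not prescribed by the notes):

import numpy as np

def series_density(x_grid, X, m):
    f_hat = np.ones_like(x_grid)               # nu = 0 term: rho_0 = 1, so f^_0 = 1
    for nu in range(1, m + 1):
        rho = lambda t: np.sqrt(2) * np.cos(np.pi * nu * t)
        f_nu_hat = np.mean(rho(X))             # f^_nu = 1/n sum_i rho_nu(X_i)
        f_hat = f_hat + f_nu_hat * rho(x_grid)
    return f_hat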
Define Haar wavelets
- Haar father wavelet: φ(x) = 1 if x ∈ (0,1), 0 otherwise
{φ(x-l)}l∈Z forms an orthonormal basis for V0, and the scaled and translated wavelets φj,k(x) = 2^(j/2) φ(2^j x - k) form an orthonormal basis for Vj
- Haar mother wavelet: ψ(x) = 1 if x ∈ (0,1/2), -1 if x ∈ (1/2,1), 0 otherwise
{ψ(x-l)}l∈Z forms an orthonormal basis for W0, and the scaled and translated wavelets ψj,k(x) = 2^(j/2) ψ(2^j x - k) form an orthonormal basis for Wj
where Vj+1 = Vj ⊕ Wj, since ψ(x) is orthogonal to φ(x)
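A sketch of the scaled and translated Haar functions defined above:

import numpy as np

def haar_phi(x, j=0, k=0):   # father: 2^(j/2) phi(2^j x - k)
    t = 2.0 ** j * np.asarray(x) - k
    return 2.0 ** (j / 2) * ((t >= 0) & (t < 1)).astype(float)

def haar_psi(x, j=0, k=0):   # mother: 2^(j/2) psi(2^j x - k)
    t = 2.0 ** j * np.asarray(x) - k
    return 2.0 ** (j / 2) * (((t >= 0) & (t < 0.5)).astype(float)
                             - ((t >= 0.5) & (t < 1)).astype(float))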
Define Multiresolution Analysis
MRA is a framework for examining functions at different scales / resolution j simultaneously. We form the space Vj containing functions with resolution <= j, with properties:
- … ⊂ V-2 ⊂ V-1 ⊂ V0 ⊂ V1 ⊂ V2 ⊂ …
- union: the closure of ∪ Vj is L2(R)
- intersection: ∩ Vj = {0}
- scaling adds resolution: f(x) ∈ Vj => f(2x) ∈ Vj+1 for all j
- translation keeps resolution in V0: f(x) ∈ V0 => f(x-k) ∈ V0 for all k
Give the Fourier Basis Example
Rk = ∫ f^(x) exp{-2πikx} dx = 1/n sum(j=1 to n) exp{-2πikXj} ∫ Kh(y) exp{2πiky} dy = K~h,k 1/n sum(j=1 to n) exp{-2πikXj}
where K~h,k = ∫ Kh(y) exp{2πiky} dy is the Fourier coefficient of the kernel Kh
Explain and express detail loss
From fine scale wavelet representation to coarser scale, we lose detail as:
f(x) = fj0(x) + (detail from scales j >= j0) = sum(k∈Z) cj0,k φj0,k(x) + sum(j=j0 to ∞) sum(k∈Z) dj,k ψj,k(x)
so fj1(x) - fj0(x) = sum(j=j0 to j1-1) sum(k∈Z) dj,k ψj,k(x)
this double sum is the detail lost when moving from the finer scale j1 down to the coarser scale j0
{cj,k} is the projection of f(x) onto Vj: cj,k = ∫f(x) φj,k(x) dx
{dj,k} are called the wavelet coefficients
Explain the plot of wavelet coefficients for a block signal
The wavelet coefficients are non-zero at the discontinuities of the signal, and this pattern is visible in the coefficients across the different scales j
Explain vanishing moments
A wavelet has m vanishing moments if
∫ x^l ψ(x) dx = 0 for l=0,…,m-1
- in that case, the wavelet coefficients of any polynomial of degree < m are exactly zero
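A quick numerical check that the Haar wavelet has exactly one vanishing moment (the grid size is arbitrary):

import numpy as np

x = np.linspace(0, 1, 100001)
psi = np.where(x < 0.5, 1.0, -1.0)   # Haar mother wavelet on (0,1)
print(np.trapz(psi, x))              # ~0     -> moment l = 0 vanishes
print(np.trapz(x * psi, x))          # ~-0.25 -> moment l = 1 does not vanish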
What is the relation between the coarser father wavelet coefficients cj-1,k and the associated wavelet coefficients dj-1,k?
cj-1,k = (cj,2k+1 + cj,2k) / √2
dj-1,k = (cj,2k+1 - cj,2k) / √2
applying this recursively gives the pyramid algorithm (from the data y down to c0 and the dj); equivalently, write d = Wy, where y is the vector of finest-scale coefficients (the data) and d collects the wavelet coefficients
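A sketch of one step of the pyramid algorithm with these relations (repeat on the coarse part until length 1 to obtain c0 and all the d's, i.e. d = Wy):

import numpy as np

def haar_step(c):
    c = np.asarray(c, dtype=float)                # finer-scale coefficients c_j (even length)
    c_coarse = (c[1::2] + c[0::2]) / np.sqrt(2)   # c_{j-1,k} = (c_{j,2k+1} + c_{j,2k}) / sqrt(2)
    d_coarse = (c[1::2] - c[0::2]) / np.sqrt(2)   # d_{j-1,k} = (c_{j,2k+1} - c_{j,2k}) / sqrt(2)
    return c_coarse, d_coarse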