Block 3: Modern Regression Flashcards

1
Q

Formula for a linear basis expansion in X with a restricted number of transformations Mj

A

f(X) = sum(j=1 to p) sum(m=1 to Mj) βj,m hj,m(Xj)
where p is the dimension of the data
and Mj is the number of flexible functions of Xj
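A minimal sketch of evaluating such an expansion; the particular basis functions (powers and a log) and coefficients below are only illustrative choices, not from the card:

```python
import numpy as np

# f(X) = sum_{j=1..p} sum_{m=1..M_j} beta[j][m] * h[j][m](X_j)
def basis_expansion(X, h, beta):
    """X: (n, p) data; h[j]: list of M_j basis functions of X[:, j]; beta[j]: matching coefficients."""
    n, p = X.shape
    f = np.zeros(n)
    for j in range(p):                      # over dimensions
        for m, h_jm in enumerate(h[j]):     # over the M_j transformations of X_j
            f += beta[j][m] * h_jm(X[:, j])
    return f

# Illustrative example with p = 2, M_1 = 2 (X_1, X_1^2) and M_2 = 1 (log X_2)
X = np.random.rand(5, 2) + 1.0
h = [[lambda x: x, lambda x: x**2], [np.log]]
beta = [[1.0, 0.5], [2.0]]
print(basis_expansion(X, h, beta))
```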

2
Q

What would be the degrees of freedom for a general piecewise polynomial? For continuous piecewise cubic functions over 3 regions?

A

d = Number of parameters - number of constraints

Example: continuous piecewise cubic functions over 3 regions (i.e. 3 regions defined by indicator functions hm, m = 1, 2, 3):
- 12 parameters (4 parameters per region)
- 6 constraints (3 per knot: continuity of the function, its 1st derivative and its 2nd derivative, at 2 knots)
d = 12 - 6 = 6

3
Q

Define Natural Cubic Spline basis

A

Cubic splines with K knots ξk (the number of knots and their placement must be chosen), constrained to be linear in the end regions, defined by K basis functions:
N1(X) = 1
N2(X) = X
Nk+2(X) = dk(X) - dK-1(X), k = 1, …, K-2, with dk(X) = ((X-ξk)^3+ - (X-ξK)^3+) / (ξK - ξk) (watch out for the difference between K and k)

The linearity constraints in the end regions free up 4 degrees of freedom compared with an ordinary cubic spline on the same knots.
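A minimal sketch of this basis as code; the knot values in the example are arbitrary, only the formulas above are assumed:

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Return the (len(x), K) matrix of natural cubic spline basis functions N_1, ..., N_K."""
    x = np.asarray(x, dtype=float)
    xi = np.sort(np.asarray(knots, dtype=float))
    K = len(xi)

    def d(k):  # d_k(X) = [(X - xi_k)_+^3 - (X - xi_K)_+^3] / (xi_K - xi_k); k is 0-based here
        return (np.maximum(x - xi[k], 0.0) ** 3 - np.maximum(x - xi[-1], 0.0) ** 3) / (xi[-1] - xi[k])

    N = np.empty((len(x), K))
    N[:, 0] = 1.0              # N_1(X) = 1
    N[:, 1] = x                # N_2(X) = X
    for k in range(K - 2):     # N_{k+2}(X) = d_k(X) - d_{K-1}(X)
        N[:, k + 2] = d(k) - d(K - 2)
    return N

print(natural_cubic_basis(np.linspace(0, 10, 7), knots=[2, 5, 8]).shape)  # (7, 3): K basis functions
```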

4
Q

Give the smoothing spline solution (equivalent finite-dimensional form)

A

f(x) = sum(j=1 to n) Nj(x) θj, with θ the minimiser of RSS(θ,λ) = (y - Nθ)^T (y - Nθ) + λ θ^T Ω θ, where Ωj,k = ∫ Nj''(t) Nk''(t) dt

Ridge-regression-type solution: θ^ = (N^T N + λΩ)^(-1) N^T y, giving f^(x) = sum(j=1 to n) Nj(x) θ^j, i.e. fitted values f^ = Sλ y
where Sλ = N (N^T N + λΩ)^(-1) N^T is the smoother matrix, similar to the hat matrix (symmetric, positive semi-definite); the effective degrees of freedom are tr(Sλ)
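A minimal sketch of this solution, assuming the basis matrix N and the penalty matrix Ω have already been computed:

```python
import numpy as np

def smoothing_spline_fit(N, Omega, y, lam):
    """Minimise (y - N theta)'(y - N theta) + lam * theta' Omega theta; return theta_hat, S_lambda, df."""
    A = N.T @ N + lam * Omega
    theta_hat = np.linalg.solve(A, N.T @ y)    # (N'N + lam*Omega)^{-1} N'y
    S_lam = N @ np.linalg.solve(A, N.T)        # smoother matrix S_lambda = N (N'N + lam*Omega)^{-1} N'
    df = np.trace(S_lam)                       # effective degrees of freedom tr(S_lambda)
    return theta_hat, S_lam, df
```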

5
Q

What is the purpose of smoothing splines?

A

Aim is to find f minimising RSS(f,λ) = sum(i=1 to n) {yi - f(xi)}^2 + λ ∫ {f''(t)}^2 dt
- 1st term measures goodness of fit (λ = 0: f can interpolate the data, a nearly perfect fit)
- 2nd term penalises curvature (λ = ∞: simple least-squares linear fit)
λ controls the trade-off

6
Q

Give two other forms of Sλ

A
  • Reinsch form: Sλ = (I + λK)^(-1), where K does not depend on λ
  • eigendecomposition: Sλ = sum(k=1 to n) ρk(λ) uk uk^T, where ρk(λ) = 1 / (1 + λ dk) and dk is the k-th eigenvalue of K, so the uk and dk do not depend on λ
7
Q

How to choose λ from cross-validation?

A

λ* is the minimiser of CV(f^λ) = 1/n sum(i=1 to n) {(yi - f^λ(xi)) / (1 - Sλ(i,i))}^2
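A minimal sketch of the CV score, assuming the fitted values f^λ(xi) and the diagonal of Sλ are available (e.g. from a fit like the one above):

```python
import numpy as np

def loocv_score(y, f_hat, S_diag):
    """CV(lambda) = (1/n) * sum_i [(y_i - f_hat(x_i)) / (1 - S_lambda[i, i])]^2."""
    return np.mean(((y - f_hat) / (1.0 - S_diag)) ** 2)

# lambda* is then the minimiser of loocv_score over a grid of lambda values.
```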

8
Q

Define Kernel smoothing

A

Aim (same as spline smoothing): density estimation and non-parametric regression

9
Q

Define a kernel function K

A
  • K(x) >= 0 for all x
  • integrates to 1
  • symmetric
10
Q

Define the kernel density estimator

A

f^n,h,K(x) = 1/(nh) sum(i=1 to n) K((Xi - x)/h)
an average of kernel values at x given the data Xi and bandwidth h (the important choice: too small => high variance and low bias / too large => low variance and high bias)
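A minimal sketch with a Gaussian kernel (just one possible choice of K); the data are simulated:

```python
import numpy as np

def kde(x_grid, X, h):
    """f_hat(x) = 1/(n h) * sum_i K((X_i - x) / h) with a Gaussian K."""
    u = (X[None, :] - x_grid[:, None]) / h               # (len(x_grid), n)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)       # Gaussian kernel values
    return K.sum(axis=1) / (len(X) * h)

X = np.random.randn(200)
print(kde(np.array([-1.0, 0.0, 1.0]), X, h=0.4))
```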

11
Q

How to choose h from bias and variance of rectangular kernel?

A

For the rectangular kernel, nh f^n,h(x) ~ Bin(n, p), the number of points in the interval, with p = P(x - h/2 < X <= x + h/2) = F(x + h/2) - F(x - h/2).
Then E{f^n,h(x)} = p/h -> f(x) as h -> 0, and var{f^n,h(x)} = np(1-p) / (nh)^2 ≈ f(x)/(nh) -> 0 as nh -> ∞.

Best choice of h:

  • minimise the MSE via a Taylor expansion of F(x + h/2) and F(x - h/2): MSE ≈ f(x)/(nh) + (1/24)^2 {f''(x)}^2 h^4
  • differentiating with respect to h gives h* = (C*/n)^(1/5), with C* depending on f(x) and f''(x) (worked step below)
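Making C* explicit by differentiating the MSE expression above:
d/dh MSE = -f(x)/(n h^2) + 4 (1/24)^2 {f''(x)}^2 h^3 = 0
=> h^5 = 144 f(x) / (n {f''(x)}^2), i.e. h* = (C*/n)^(1/5) with C* = 144 f(x) / {f''(x)}^2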
12
Q

What are the expectation and variance with a general kernel?

A
  • express the expectation and variance as integrals and use a Taylor series of f(x + hv): the results take the same form as for the rectangular kernel, again depending on f(x) and f''(x), with constants that depend on the kernel
13
Q

Mention types of kernels

A
  • rectangular
  • triangular
  • Epanechnikov (parabolic)
  • Gaussian / normal
14
Q

What is the Nadaraya-Watson kernel regression estimating E(Y|X) ?

A

E^(Y|X=x) = (sum(i=1 to n) Yi Kh(x-Xi)) / (sum(i=1 to n) Kh(x-Xi))
but still depends on h (which has a large impact on the variability of the result)
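A minimal sketch with a Gaussian kernel (any kernel Kh could be substituted):

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """E_hat(Y | X = x) = sum_i Y_i K_h(x - X_i) / sum_i K_h(x - X_i)."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u ** 2)      # the 1/(h sqrt(2 pi)) factor cancels in the ratio
    return (K * Y[None, :]).sum(axis=1) / K.sum(axis=1)
```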

15
Q

Steps of Local Polynomial Regression

A

Aim: estimate polynomial mx0(x) = sum(j=0 to p) βj(x0) (x-x0)^j centered at x0

  • construct the weight matrix Wi,i = Kh(Xi - x0), the contribution of observation i to the fit of mx0(x), and the design matrix Xi,j = (Xi - x0)^j
  • set β^(x0) = (X^T W X)^(-1) X^T W Y, the minimiser of the weighted least-squares problem RSS(x0)
  • same variance as NW but smaller bias (generally better for odd p, because even-degree polynomials behave badly at the extremes, where the bias is quadratic); see the sketch below
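A minimal sketch of a single local fit at x0; the Gaussian kernel and the toy data are illustrative choices:

```python
import numpy as np

def local_poly_fit(x0, X, Y, h, p=1):
    """Return beta_hat(x0) for degree p; beta_hat[0] is the local estimate m_{x0}(x0)."""
    W = np.diag(np.exp(-0.5 * ((X - x0) / h) ** 2))        # W_ii = K_h(X_i - x0)
    B = np.vander(X - x0, N=p + 1, increasing=True)        # columns (X_i - x0)^j, j = 0, ..., p
    return np.linalg.solve(B.T @ W @ B, B.T @ W @ Y)       # weighted least-squares solution

X = np.sort(np.random.rand(100) * 10)
Y = np.sin(X) + 0.1 * np.random.randn(100)
print(local_poly_fit(5.0, X, Y, h=0.8, p=1)[0])            # local linear estimate at x0 = 5
```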
16
Q

Define orthogonal series

A

1D: f(x) = sum(ν) fν ρν(x), where fν = ∫ f(x) ρν(x) dx and ∫ ρν(x) ρμ(x) dx = δν,μ
2D: f(x,y) = sum(ν) sum(μ) fν,μ ρν(x) ρμ(y), where fν,μ = ∫∫ f(x,y) ρν(x) ρμ(y) dx dy
same information, but expressed in the orthogonal basis functions {ρν}

17
Q

Steps of orthogonal series estimators

A
  • compute the estimators f^ν = 1/n sum(i=1 to n) ρν(Xi) (and f^ν,μ = 1/n sum(i=1 to n) ρν(Xi) ρμ(Yi))
  • substitute f^ν into f(x) to get f^(x): unbiased when ν runs from -∞ to ∞, but truncating to |ν| <= m gives bias{f^(x)} = - sum(|ν|>m) fν ρν(x)
18
Q

Define Haar wavelets

A
  • Haar father wavelet: φ(x) = 1 if x ∈ (0,1), 0 otherwise
    {φ(x-l)}l∈Z forms an orthonormal basis for V0, and the scaled and translated wavelets φj,k(x) = 2^(j/2) φ(2^j x - k) form an orthonormal basis for Vj
  • Haar mother wavelet: ψ(x) = 1 if x ∈ (0,1/2), -1 if x ∈ (1/2,1), 0 otherwise
    {ψ(x-l)}l∈Z forms an orthonormal basis for W0, and the scaled and translated wavelets ψj,k(x) = 2^(j/2) ψ(2^j x - k) form an orthonormal basis for Wj
    where Vj+1 = Vj ⊕ Wj, since ψ(x) is orthogonal to φ(x)
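A minimal numerical sketch of these wavelets (half-open intervals are used at the endpoints, which does not affect the L2 properties):

```python
import numpy as np

def phi(x):   # Haar father wavelet
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def psi(x):   # Haar mother wavelet
    return np.where((x >= 0) & (x < 0.5), 1.0, np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

def psi_jk(x, j, k):   # psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)
    return 2 ** (j / 2) * psi(2 ** j * x - k)

# Numerical orthonormality checks on a fine grid of [0, 1]
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
print((phi(x) * psi(x)).sum() * dx)                     # ~0: psi orthogonal to phi
print((psi_jk(x, 1, 0) * psi_jk(x, 1, 1)).sum() * dx)   # ~0: different translates are orthogonal
print((psi_jk(x, 1, 0) ** 2).sum() * dx)                # ~1: unit norm
```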
19
Q

Define Multiresolution Analysis

A

MRA is a framework for examining functions at different scales / resolution j simultaneously. We form the space Vj containing functions with resolution <= j, with properties:

  • … ⊂ V-2 ⊂ V-1 ⊂ V0 ⊂ V1 ⊂ V2 ⊂ …
  • union: ∪ Vj = L2(R)
  • intersection: ∩ Vj = {0}
  • scaling adds resolution: f(x) ∈ Vj => f(2x) ∈ Vj+1 for all j
  • translation keeps resolution in V0: f(x) ∈ V0 => f(x-k) ∈ V0 for all k
20
Q

Give the Fourier Basis Example

A

Rk = 1/n sum(j=1 to n) exp{-2πikXj} ∫ Kh(y) exp{2πiky} dy = K~h,k · 1/n sum(j=1 to n) exp{-2πikXj}
where K~h,k = ∫ Kh(y) exp{2πiky} dy is the Fourier coefficient of the kernel

21
Q

Explain and express detail loss

A

Going from a fine-scale wavelet representation to a coarser scale, we lose detail:
f(x) = fj0(x) + f∞(x) = sum(k∈Z) cj0,k φj0,k(x) + sum(j=j0 to ∞) sum(k∈Z) dj,k ψj,k(x)
so fj1(x) - fj0(x) = sum(j=j0 to j1-1) sum(k∈Z) dj,k ψj,k(x)
where the 2nd term in the decomposition of f(x) is the lost detail
{cj,k} is the projection of f(x) onto Vj: cj,k = ∫f(x) φj,k(x) dx
{dj,k} are called the wavelet coefficients

22
Q

Explain the plot of wavelet coefficients for a block signal

A

The wavelet coefficients are non-zero at the discontinuities, and this is even more visible when comparing coefficients across the different scales j

23
Q

Explain vanishing moments

A

A wavelet has m vanishing moments if
∫ x^l ψ(x) dx = 0 for l=0,…,m-1
- in that case, the wavelet coefficients of any polynomial of degree < m are exactly zero

24
Q

What is the relation between the coarser father wavelet coefficients cj-1,k, the associated wavelet coefficients dj-1,k, and the finer coefficients cj,·?

A

cj-1,k = (cj,2k+1 + cj,2k) / √2
dj-1,k = (cj,2k+1 - cj,2k) / √2
this can be run as the pyramid algorithm (working from the data y down to c0 and d0) or written as d = Wy, where y holds the finest-scale coefficients (the data) and d the wavelet coefficients (see the sketch below)
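A minimal sketch of one pyramid step and its repetition down to c0; the data vector is arbitrary:

```python
import numpy as np

def haar_step(c):
    """One pyramid step: finer coefficients c_j (even length) -> (c_{j-1}, d_{j-1})."""
    c = np.asarray(c, dtype=float)
    c_coarse = (c[1::2] + c[0::2]) / np.sqrt(2)   # c_{j-1,k} = (c_{j,2k+1} + c_{j,2k}) / sqrt(2)
    d_detail = (c[1::2] - c[0::2]) / np.sqrt(2)   # d_{j-1,k} = (c_{j,2k+1} - c_{j,2k}) / sqrt(2)
    return c_coarse, d_detail

# Full pyramid: start from the data y and repeat until a single coefficient remains
y = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
c, details = y, []
while len(c) > 1:
    c, d = haar_step(c)
    details.append(d)
print(c, details)   # c0 and the wavelet coefficients d at each level
```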

25
Q

Explain wavelet shrinkage

A

The wavelet transform (an orthogonal matrix W) preserves

  • zero mean
  • uncorrelated noise
  • the norm of the function, by Parseval's theorem (while the representation becomes more sparse)

advantages: it deals with noise very well, copes with discontinuities (no smoothness assumption) and recovers sharper peaks
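Shrinkage itself is usually implemented by thresholding the wavelet coefficients d and then inverting the transform; a minimal soft-thresholding sketch, where the universal threshold σ√(2 log n) is one common choice assumed here rather than taken from the card:

```python
import numpy as np

def soft_threshold(d, t):
    """Shrink wavelet coefficients towards zero; coefficients below t in magnitude become exactly zero."""
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

d = np.array([5.2, -0.1, 0.05, -3.4, 0.2])   # example wavelet coefficients
sigma, n = 0.3, 256                          # assumed noise level and sample size
print(soft_threshold(d, sigma * np.sqrt(2 * np.log(n))))
```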

26
Q

What is a good prior distribution for wavelet coefficients?

A

dj,· ~ γj N(0, τj^2) + (1 - γj) δ0, where δ0 is the Dirac delta (point mass) at 0 and γj ~ Bernoulli(pj)
=> the posterior distribution of dj,· under this prior can then be derived