Block 3: Modern Regression Flashcards

1
Q

Formula for a linear basis expansion in X with a restricted number of transformations Mj

A

f(X) = sum(j=1 to p) sum(m=1 to Mj) βj,m hj,m(Xj)
where p is the dimension of the data
and Mj is the number of flexible functions of Xj
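A minimal sketch of evaluating such an expansion; the particular basis functions (powers and a log) and coefficients below are only illustrative choices, not from the card:

```python
import numpy as np

# f(X) = sum_{j=1..p} sum_{m=1..M_j} beta[j][m] * h[j][m](X_j)
def basis_expansion(X, h, beta):
    """X: (n, p) data; h[j]: list of M_j basis functions of X[:, j]; beta[j]: matching coefficients."""
    n, p = X.shape
    f = np.zeros(n)
    for j in range(p):                      # over dimensions
        for m, h_jm in enumerate(h[j]):     # over the M_j transformations of X_j
            f += beta[j][m] * h_jm(X[:, j])
    return f

# Illustrative example with p = 2, M_1 = 2 (X_1, X_1^2) and M_2 = 1 (log X_2)
X = np.random.rand(5, 2) + 1.0
h = [[lambda x: x, lambda x: x**2], [np.log]]
beta = [[1.0, 0.5], [2.0]]
print(basis_expansion(X, h, beta))
```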

2
Q

What would be the degrees of freedom for a general piecewise polynomial? For continuous piecewise cubic functions over 3 regions?

A

d = Number of parameters - number of constraints

Example: continuous piecewise cubic functions over 3 regions (i.e. 3 regions defined by indicator functions hm, m = 1, 2, 3):
- 12 parameters (4 parameters per region)
- 6 constraints (3 per knot: continuity of the function, its 1st derivative and its 2nd derivative, at 2 knots)
d = 12 - 6 = 6

3
Q

Define Natural Cubic Spline basis

A

Cubic splines with K knots ξk (the number of knots and their placement must be chosen), constrained to be linear in the end regions, defined by K basis functions:
N1(X) = 1
N2(X) = X
Nk+2(X) = dk(X) - dK-1(X), k = 1, …, K-2, with dk(X) = ((X-ξk)^3+ - (X-ξK)^3+) / (ξK - ξk) (watch out for the difference between K and k)

The linearity constraints in the end regions free up 4 degrees of freedom compared with an ordinary cubic spline on the same knots.
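A minimal sketch of this basis as code; the knot values in the example are arbitrary, only the formulas above are assumed:

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Return the (len(x), K) matrix of natural cubic spline basis functions N_1, ..., N_K."""
    x = np.asarray(x, dtype=float)
    xi = np.sort(np.asarray(knots, dtype=float))
    K = len(xi)

    def d(k):  # d_k(X) = [(X - xi_k)_+^3 - (X - xi_K)_+^3] / (xi_K - xi_k); k is 0-based here
        return (np.maximum(x - xi[k], 0.0) ** 3 - np.maximum(x - xi[-1], 0.0) ** 3) / (xi[-1] - xi[k])

    N = np.empty((len(x), K))
    N[:, 0] = 1.0              # N_1(X) = 1
    N[:, 1] = x                # N_2(X) = X
    for k in range(K - 2):     # N_{k+2}(X) = d_k(X) - d_{K-1}(X)
        N[:, k + 2] = d(k) - d(K - 2)
    return N

print(natural_cubic_basis(np.linspace(0, 10, 7), knots=[2, 5, 8]).shape)  # (7, 3): K basis functions
```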

4
Q

Give the smoothing spline solution (equivalent finite-dimensional form)

A

f(x) = sum(j=1 to n) Nj(x) θj, with θ the minimiser of RSS(θ,λ) = (y - Nθ)^T (y - Nθ) + λ θ^T Ω θ, where Ωj,k = ∫ Nj''(t) Nk''(t) dt

Ridge-regression-type solution: θ^ = (N^T N + λΩ)^(-1) N^T y, giving f^(x) = sum(j=1 to n) Nj(x) θ^j, i.e. fitted values f^ = Sλ y
where Sλ = N (N^T N + λΩ)^(-1) N^T is the smoother matrix, similar to the hat matrix (symmetric, positive semi-definite); the effective degrees of freedom are tr(Sλ)
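A minimal sketch of this solution, assuming the basis matrix N and the penalty matrix Ω have already been computed:

```python
import numpy as np

def smoothing_spline_fit(N, Omega, y, lam):
    """Minimise (y - N theta)'(y - N theta) + lam * theta' Omega theta; return theta_hat, S_lambda, df."""
    A = N.T @ N + lam * Omega
    theta_hat = np.linalg.solve(A, N.T @ y)    # (N'N + lam*Omega)^{-1} N'y
    S_lam = N @ np.linalg.solve(A, N.T)        # smoother matrix S_lambda = N (N'N + lam*Omega)^{-1} N'
    df = np.trace(S_lam)                       # effective degrees of freedom tr(S_lambda)
    return theta_hat, S_lam, df
```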

5
Q

What is the purpose of smoothing splines?

A

Aim is to find f minimising RSS(f,λ) = sum(i=1 to n) {yi - f(xi)}^2 + λ ∫ {f''(t)}^2 dt
- 1st term measures goodness of fit (λ = 0: f can interpolate the data, a nearly perfect fit)
- 2nd term penalises curvature (λ = ∞: simple least-squares linear fit)
λ controls the trade-off

6
Q

Give two other forms of Sλ

A
  • Reinsch form: Sλ = (I + λK)^(-1), where K does not depend on λ
  • eigendecomposition: Sλ = sum(k=1 to n) ρk(λ) uk uk^T, where ρk(λ) = 1 / (1 + λ dk) and dk is the k-th eigenvalue of K, so the uk and dk do not depend on λ
7
Q

How to choose λ from cross-validation?

A

λ* is the minimiser of CV(f^λ) = 1/n sum(i=1 to n) {(yi - f^λ(xi)) / (1 - Sλ(i,i))}^2
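A minimal sketch of the CV score, assuming the fitted values f^λ(xi) and the diagonal of Sλ are available (e.g. from a fit like the one above):

```python
import numpy as np

def loocv_score(y, f_hat, S_diag):
    """CV(lambda) = (1/n) * sum_i [(y_i - f_hat(x_i)) / (1 - S_lambda[i, i])]^2."""
    return np.mean(((y - f_hat) / (1.0 - S_diag)) ** 2)

# lambda* is then the minimiser of loocv_score over a grid of lambda values.
```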

8
Q

Define Kernel smoothing

A

Aim (same as spline smoothing): density estimation and non-parametric regression

9
Q

Define a kernel function K

A
  • K(x) >= 0 for all x
  • integrates to 1
  • symmetric
10
Q

Define the kernel density estimator

A

f^n,h,K(x) = 1/(nh) sum(i=1 to n) K((Xi - x)/h)
an average of kernel values at x given the data Xi and bandwidth h (the important choice: too small => high variance and low bias / too large => low variance and high bias)
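A minimal sketch with a Gaussian kernel (just one possible choice of K); the data are simulated:

```python
import numpy as np

def kde(x_grid, X, h):
    """f_hat(x) = 1/(n h) * sum_i K((X_i - x) / h) with a Gaussian K."""
    u = (X[None, :] - x_grid[:, None]) / h               # (len(x_grid), n)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)       # Gaussian kernel values
    return K.sum(axis=1) / (len(X) * h)

X = np.random.randn(200)
print(kde(np.array([-1.0, 0.0, 1.0]), X, h=0.4))
```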

11
Q

How to choose h from bias and variance of rectangular kernel?

A

For the rectangular kernel, nh f^n,h(x) ~ Bin(n, p), the number of points in the interval, with p = P(x - h/2 < X <= x + h/2) = F(x + h/2) - F(x - h/2).
Then E{f^n,h(x)} = p/h -> f(x) as h -> 0, and var{f^n,h(x)} = np(1-p) / (nh)^2 ≈ f(x)/(nh) -> 0 as nh -> ∞.

Best choice of h:

  • minimise the MSE via a Taylor expansion of F(x + h/2) and F(x - h/2): MSE ≈ f(x)/(nh) + (1/24)^2 {f''(x)}^2 h^4
  • differentiating with respect to h gives h* = (C*/n)^(1/5), with C* depending on f(x) and f''(x) (worked step below)
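Making C* explicit by differentiating the MSE expression above:
d/dh MSE = -f(x)/(n h^2) + 4 (1/24)^2 {f''(x)}^2 h^3 = 0
=> h^5 = 144 f(x) / (n {f''(x)}^2), i.e. h* = (C*/n)^(1/5) with C* = 144 f(x) / {f''(x)}^2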
12
Q

What are the expectation and variance with a general kernel?

A
  • express the expectation and variance as integrals and use a Taylor series of f(x + hv): the results take the same form as for the rectangular kernel, again depending on f(x) and f''(x), with constants that depend on the kernel
13
Q

Mention types of kernels

A
  • rectangular
  • triangular
  • Epanechnikov (parabolic)
  • Gaussian / normal
14
Q

What is the Nadaraya-Watson kernel regression estimating E(Y|X) ?

A

E^(Y|X=x) = (sum(i=1 to n) Yi Kh(x-Xi)) / (sum(i=1 to n) Kh(x-Xi))
but still depends on h (which has a large impact on the variability of the result)
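A minimal sketch with a Gaussian kernel (any kernel Kh could be substituted):

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """E_hat(Y | X = x) = sum_i Y_i K_h(x - X_i) / sum_i K_h(x - X_i)."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u ** 2)      # the 1/(h sqrt(2 pi)) factor cancels in the ratio
    return (K * Y[None, :]).sum(axis=1) / K.sum(axis=1)
```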

15
Q

Steps of Local Polynomial Regression

A

Aim: estimate polynomial mx0(x) = sum(j=0 to p) βj(x0) (x-x0)^j centered at x0

  • construct the weight matrix Wi,i = Kh(Xi - x0), the contribution of observation i to the fit of mx0(x), and the design matrix Xi,j = (Xi - x0)^j
  • set β^(x0) = (X^T W X)^(-1) X^T W Y, the minimiser of the weighted least-squares problem RSS(x0)
  • same variance as NW but smaller bias (generally better for odd p, because even-degree polynomials behave badly at the extremes, where the bias is quadratic); see the sketch below
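A minimal sketch of a single local fit at x0; the Gaussian kernel and the toy data are illustrative choices:

```python
import numpy as np

def local_poly_fit(x0, X, Y, h, p=1):
    """Return beta_hat(x0) for degree p; beta_hat[0] is the local estimate m_{x0}(x0)."""
    W = np.diag(np.exp(-0.5 * ((X - x0) / h) ** 2))        # W_ii = K_h(X_i - x0)
    B = np.vander(X - x0, N=p + 1, increasing=True)        # columns (X_i - x0)^j, j = 0, ..., p
    return np.linalg.solve(B.T @ W @ B, B.T @ W @ Y)       # weighted least-squares solution

X = np.sort(np.random.rand(100) * 10)
Y = np.sin(X) + 0.1 * np.random.randn(100)
print(local_poly_fit(5.0, X, Y, h=0.8, p=1)[0])            # local linear estimate at x0 = 5
```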
16
Q

Define orthogonal series

A

1D: f(x) = sum(ν) fν ρν(x), where fν = ∫ f(x) ρν(x) dx and ∫ ρν(x) ρμ(x) dx = δν,μ
2D: f(x,y) = sum(ν) sum(μ) fν,μ ρν(x) ρμ(y), where fν,μ = ∫∫ f(x,y) ρν(x) ρμ(y) dx dy
same information, but expressed in the orthogonal basis functions {ρν}

17
Q

Steps of orthogonal series estimators

A
  • compute the estimators f^ν = 1/n sum(i=1 to n) ρν(Xi) (and f^ν,μ = 1/n sum(i=1 to n) ρν(Xi) ρμ(Yi))
  • substitute f^ν into f(x) to get f^(x): unbiased when ν runs from -∞ to ∞, but truncating to |ν| <= m gives bias{f^(x)} = - sum(|ν|>m) fν ρν(x)
18
Q

Define Haar wavelets

A
  • Haar father wavelet: φ(x) = 1 if x ∈ (0,1), 0 otherwise
    {φ(x-l)}l∈Z forms an orthonormal basis for V0, and the scaled and translated wavelets φj,k(x) = 2^(j/2) φ(2^j x - k) form an orthonormal basis for Vj
  • Haar mother wavelet: ψ(x) = 1 if x ∈ (0,1/2), -1 if x ∈ (1/2,1), 0 otherwise
    {ψ(x-l)}l∈Z forms an orthonormal basis for W0, and the scaled and translated wavelets ψj,k(x) = 2^(j/2) ψ(2^j x - k) form an orthonormal basis for Wj
    where Vj+1 = Vj ⊕ Wj, since ψ(x) is orthogonal to φ(x)
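A minimal numerical sketch of these wavelets (half-open intervals are used at the endpoints, which does not affect the L2 properties):

```python
import numpy as np

def phi(x):   # Haar father wavelet
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def psi(x):   # Haar mother wavelet
    return np.where((x >= 0) & (x < 0.5), 1.0, np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

def psi_jk(x, j, k):   # psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)
    return 2 ** (j / 2) * psi(2 ** j * x - k)

# Numerical orthonormality checks on a fine grid of [0, 1]
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
print((phi(x) * psi(x)).sum() * dx)                     # ~0: psi orthogonal to phi
print((psi_jk(x, 1, 0) * psi_jk(x, 1, 1)).sum() * dx)   # ~0: different translates are orthogonal
print((psi_jk(x, 1, 0) ** 2).sum() * dx)                # ~1: unit norm
```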
19
Q

Define Multiresolution Analysis

A

MRA is a framework for examining functions at different scales / resolution j simultaneously. We form the space Vj containing functions with resolution <= j, with properties:

  • … ⊂ V-2 ⊂ V-1 ⊂ V0 ⊂ V1 ⊂ V2 ⊂ …
  • union: ∪ Vj = L2(R)
  • intersection: ∩ Vj = {0}
  • scaling adds resolution: f(x) ∈ Vj => f(2x) ∈ Vj+1 for all j
  • translation keeps resolution in V0: f(x) ∈ V0 => f(x-k) ∈ V0 for all k
20
Q

Give the Fourier Basis Example

A

Rk = 1/n sum(j=1 to n) exp{-2πikXj} ∫ Kh(y) exp{2πiky} dy = K~h,k · 1/n sum(j=1 to n) exp{-2πikXj}
where K~h,k = ∫ Kh(y) exp{2πiky} dy is the Fourier coefficient of the kernel

21
Q

Explain and express detail loss

A

Going from a fine-scale wavelet representation to a coarser scale, we lose detail:
f(x) = fj0(x) + f∞(x) = sum(k∈Z) cj0,k φj0,k(x) + sum(j=j0 to ∞) sum(k∈Z) dj,k ψj,k(x)
so fj1(x) - fj0(x) = sum(j=j0 to j1-1) sum(k∈Z) dj,k ψj,k(x)
where the 2nd term in the decomposition of f(x) is the lost detail
{cj,k} is the projection of f(x) onto Vj: cj,k = ∫f(x) φj,k(x) dx
{dj,k} are called the wavelet coefficients

22
Q

Explain the plot of wavelet coefficients for a block signal

A

The wavelet coefficients are non-zero at the discontinuities, and this is even more visible when comparing coefficients across the different scales j

23
Q

Explain vanishing moments

A

A wavelet has m vanishing moments if
∫ x^l ψ(x) dx = 0 for l=0,…,m-1
- in that case, the wavelet coefficients of any polynomial of degree < m are exactly zero

24
Q

What is the relation between the coarser father wavelet coefficients cj-1,k, the associated wavelet coefficients dj-1,k, and the finer coefficients cj,·?

A

cj-1,k = (cj,2k+1 + cj,2k) / √2
dj-1,k = (cj,2k+1 - cj,2k) / √2
this can be run as the pyramid algorithm (working from the data y down to c0 and d0) or written as d = Wy, where y holds the finest-scale coefficients (the data) and d the wavelet coefficients (see the sketch below)
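A minimal sketch of one pyramid step and its repetition down to c0; the data vector is arbitrary:

```python
import numpy as np

def haar_step(c):
    """One pyramid step: finer coefficients c_j (even length) -> (c_{j-1}, d_{j-1})."""
    c = np.asarray(c, dtype=float)
    c_coarse = (c[1::2] + c[0::2]) / np.sqrt(2)   # c_{j-1,k} = (c_{j,2k+1} + c_{j,2k}) / sqrt(2)
    d_detail = (c[1::2] - c[0::2]) / np.sqrt(2)   # d_{j-1,k} = (c_{j,2k+1} - c_{j,2k}) / sqrt(2)
    return c_coarse, d_detail

# Full pyramid: start from the data y and repeat until a single coefficient remains
y = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
c, details = y, []
while len(c) > 1:
    c, d = haar_step(c)
    details.append(d)
print(c, details)   # c0 and the wavelet coefficients d at each level
```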

25
Q

Explain wavelet shrinkage

A

The wavelet transform (an orthogonal matrix W) preserves

  • zero mean
  • uncorrelated noise
  • the norm of the function, by Parseval's theorem (while the representation becomes more sparse)

advantages: it deals with noise very well, copes with discontinuities (no smoothness assumption) and recovers sharper peaks
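Shrinkage itself is usually implemented by thresholding the wavelet coefficients d and then inverting the transform; a minimal soft-thresholding sketch, where the universal threshold σ√(2 log n) is one common choice assumed here rather than taken from the card:

```python
import numpy as np

def soft_threshold(d, t):
    """Shrink wavelet coefficients towards zero; coefficients below t in magnitude become exactly zero."""
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

d = np.array([5.2, -0.1, 0.05, -3.4, 0.2])   # example wavelet coefficients
sigma, n = 0.3, 256                          # assumed noise level and sample size
print(soft_threshold(d, sigma * np.sqrt(2 * np.log(n))))
```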

26
Q

What is a good prior distribution for wavelet coefficients?

A

dj,· ~ γj N(0, τj^2) + (1 - γj) δ0, where δ0 is the Dirac delta (point mass) at 0 and γj ~ Bernoulli(pj)
=> the posterior distribution of dj,· under this prior can then be derived