Regularisation Flashcards

1
Q

Goal of regularisation in NNs

A

Take a model that includes the true data-generating process but also many other possible generating processes

And

Regularise it so that it better matches the true data-generating process

2
Q

Denote regularized objective function

A

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

Where:
α ∈ [0, ∞) is a hyperparameter weighting the contribution of the penalty relative to the standard objective J
Ω(θ) is the parameter norm penalty

3
Q

Normal practice when selecting a parameter norm penalty for NNs

A

Choose Ω to penalise only the weights of the affine transformation at each layer and leave the biases unregularised, as biases typically require less data than weights to fit accurately

Largely because each weight governs an interaction between two variables, whereas each bias controls only a single variable

Also, regularising the biases can introduce a significant amount of underfitting

4
Q

w

A

Refers to all the weights that should be affected by the norm penalty, while θ denotes all of the parameters, including both w and the unregularised parameters

5
Q

What else is the L2 norm penalty called?

A

The L2 parameter norm penalty is commonly known as weight decay; it is also referred to as ridge regression or Tikhonov regularisation
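
For reference, the penalty term itself (the standard definition, with the conventional factor of ½ that simplifies the gradient):

Ω(θ) = ½‖w‖₂² = ½ wᵀw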

6
Q

Show how L2 works on gradient of objective function (let θ = w for simplicity)

A
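
A sketch of the standard weight-decay analysis, assuming no bias parameter so that θ = w, and Ω(w) = ½ wᵀw:

Regularised objective:
J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)

Corresponding gradient:
∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)

A single gradient descent step with learning rate ε:
w ← w − ε(αw + ∇_w J(w; X, y))
i.e. w ← (1 − εα)w − ε∇_w J(w; X, y)

So on every step the weights are first shrunk by the constant multiplicative factor (1 − εα) before the usual gradient update, which is why the L2 penalty is called weight decay.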
7
Q

Further simplify by making a quadratic approximation of the objective function in the neighbourhood of the value of the weights that obtains minimal unregularised training cost

A

If the objective function is truly quadratic (as for linear regression with mean squared error), the approximation is perfect and is given by

Ĵ(θ) = J(w*) + ½ (w − w*)ᵀ H (w − w*)

where w* is the unregularised minimiser and H is the Hessian of J with respect to w evaluated at w*; there is no first-order term because the gradient vanishes at the minimum w*.

8
Q

L1 regularisation term

A
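
Presumably the standard definition:

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|

i.e. the sum of the absolute values of the individual weights.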
9
Q

Give regularized objective function and corresponding gradient for L1 (for SLRM)

A
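
A sketch of the standard forms, keeping the notation of the earlier cards (θ = w):

Regularised objective:
J̃(w; X, y) = α‖w‖₁ + J(w; X, y)

Corresponding gradient:
∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is applied element-wise.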
10
Q

Describe the difference in the gradient of the regularised objective function between L2 and L1

A

The regularisation contribution to the gradient no longer scales linearly with each wᵢ

Instead it is a constant factor α multiplied by sign(wᵢ)

Therefore we will not necessarily see the clean algebraic solutions that were available for quadratic approximations of the (unregularised) objective function in the L2 case
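
Side by side, per coordinate (a summary of the two gradient contributions derived above):

L2 (Ω = ½‖w‖₂²): contribution α·wᵢ, which shrinks as wᵢ approaches 0
L1 (Ω = ‖w‖₁): contribution α·sign(wᵢ), which keeps constant magnitude α for all wᵢ ≠ 0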

11
Q

Taylor series expansion of grad of cost function for SLRM L1

A

We are assuming the regularised cost function is a truncated Taylor series expansion of a more complicated model
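
The expansion itself, written with the quadratic approximation Ĵ from the earlier L2 analysis (about the unregularised minimum w*, with Hessian H evaluated there):

∇_w Ĵ(w) = H(w − w*)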

12
Q

Decompose quadratic approximation of L1 regularised objective function for SLRM

A

(For simplicity we assume the Hessian is diagonal, with each diagonal entry Hᵢ,ᵢ > 0; this holds, for example, if the input features have been preprocessed to remove all correlation, e.g. with PCA.)
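
A sketch of the decomposition under that diagonal assumption:

Ĵ(w; X, y) = J(w*; X, y) + Σᵢ [ ½ Hᵢ,ᵢ (wᵢ − wᵢ*)² + α|wᵢ| ]

i.e. the approximate regularised objective splits into a sum of independent one-dimensional problems, one per weight wᵢ.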

13
Q

Solution to

A
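
Assuming the question refers to the per-weight problems from the previous card, each one has the well-known analytical solution:

wᵢ = sign(wᵢ*) · max{ |wᵢ*| − α/Hᵢ,ᵢ, 0 }

So each weight is either shifted towards zero by α/Hᵢ,ᵢ, or set exactly to zero when |wᵢ*| ≤ α/Hᵢ,ᵢ.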
14
Q

Sparsity in context of regularization

A

Having parameters with optimal value of 0

15
Q

Solution to L2 regularisation for SLRM (with θ = w)

A

7.13
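
For reference (assuming "7.13" points at the standard textbook treatment of this analysis), the closed form under the quadratic approximation, with H the Hessian at the unregularised minimum w* and H = QΛQᵀ its eigendecomposition, is:

w̃ = (H + αI)⁻¹ H w*
  = Q(Λ + αI)⁻¹ Λ Qᵀ w*

i.e. the component of w* aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α), so directions in which H has small eigenvalues are shrunk the most.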

16
Q

Motivate feature selection using L1

A

The sparsity induced by L1 can therefore be used as a feature selection mechanism (as in the well-known LASSO, which combines an L1 penalty with a linear model and a least-squares cost): features whose weights are set exactly to zero may safely be discarded.
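
As a concrete sketch, using the diagonal-Hessian solution from the earlier card: whenever |wᵢ*| ≤ α/Hᵢ,ᵢ the optimal regularised value is wᵢ = 0, so increasing α switches off more and more features entirely rather than merely shrinking them as L2 does.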

17
Q

Log laplace

A
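
Assuming the card is about the probabilistic interpretation of L1: the L1 penalty is equivalent to MAP Bayesian inference with an isotropic Laplace prior over the weights, Laplace(wᵢ; 0, 1/α). The log prior is

log p(w) = Σᵢ log Laplace(wᵢ; 0, 1/α) = −α‖w‖₁ + n log α − n log 2

The terms that do not depend on w can be ignored during optimisation, leaving exactly the α‖w‖₁ penalty.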