Regularisation Flashcards
Goal of regularisation in NNs
Take a model family that includes the true data generating process but also many other possible generating processes
And
Help the learned model match the true data generating process
Denote regularized objective function
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
Where:
α ∈ [0, ∞) is a hyperparameter weighting the penalty (α = 0 means no regularisation)
Ω(θ) is the parameter norm penalty
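A minimal sketch of the definition above, assuming an L2 choice of Ω purely for illustration (the function name and arguments are my own):

```python
import numpy as np

def regularised_objective(J, w, alpha):
    # J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta);
    # Omega here is the L2 penalty (1/2)||w||^2, chosen only for illustration.
    omega = 0.5 * np.sum(w ** 2)
    return J + alpha * omega
```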
Normal practice when selecting a parameter norm penalty for NNs
Choose Ω to penalise only the weights of the affine transformation at each layer and leave the biases unregularised; biases typically require less data than weights to fit accurately
This is largely because each weight governs an interaction between two variables, while each bias controls only a single variable
Also, regularising the biases can introduce a significant amount of underfitting
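A sketch of this convention, assuming a hypothetical parameter dictionary whose "W"/"b" naming is my own:

```python
import numpy as np

def weight_only_penalty(params):
    # Sum the squared L2 penalty over weight matrices only, leaving the
    # bias vectors unregularised. The "W"/"b" key naming is an assumption.
    return 0.5 * sum(np.sum(p ** 2) for name, p in params.items()
                     if name.startswith("W"))
```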
w
Refers to all weights that should be affected by the norm penalty, while θ denotes all parameters, including both w and the unregularised parameters
L2 called?
Parameter norm penalty; AKA weight decay, ridge regression, or Tikhonov regularisation
Show how L2 works on gradient of objective function (let θ = w for simplicity)
With Ω(w) = (1/2)||w||₂², the objective is J̃(w; X, y) = (α/2)wᵀw + J(w; X, y)
Gradient: ∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)
A single gradient step becomes w ← (1 − εα)w − ε∇_w J(w; X, y), i.e. the weights are multiplicatively shrunk before the usual gradient update
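The weight-decay update can be sketched as follows (function name and arguments are illustrative):

```python
import numpy as np

def l2_gradient_step(w, grad_J, alpha, eps):
    # w <- w - eps * (alpha * w + grad_J) = (1 - eps*alpha) * w - eps * grad_J:
    # weight decay multiplicatively shrinks w before the usual gradient update.
    return (1.0 - eps * alpha) * w - eps * grad_J
```

With a zero gradient the step reduces to pure shrinkage of w by the factor (1 − εα).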
Further simplify by making a quadratic approximation of the objective function in the neighbourhood of w*, the value of the weights that obtains minimal unregularized training cost
If the objective function is truly quadratic, the approximation is perfect and given by
Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀH(w − w*)
where H is the Hessian of J with respect to w, evaluated at w*
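A numerical check of the resulting L2 minimiser, using an arbitrary positive definite Hessian H and arbitrary w* of my own choosing:

```python
import numpy as np

# For the quadratic approximation J_hat(w) = J(w*) + 0.5 (w - w*)^T H (w - w*),
# adding the L2 term (alpha/2)||w||^2 and setting the gradient
# alpha * w + H (w - w*) to zero gives w~ = (H + alpha I)^{-1} H w*.
# H, w*, and alpha below are illustrative values.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
H = A @ A.T + 3.0 * np.eye(3)   # symmetric positive definite Hessian
w_star = rng.normal(size=3)      # unregularised minimiser
alpha = 0.7

w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)
grad = alpha * w_tilde + H @ (w_tilde - w_star)  # gradient at w~, should be ~0
```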
L1 regularisation term
Ω(θ) = ||w||₁ = Σᵢ |wᵢ|
Give regularized objective function and corresponding gradient for L1 (for SLRM)
J̃(w; X, y) = α||w||₁ + J(w; X, y)
∇_w J̃(w; X, y) = α·sign(w) + ∇_w J(w; X, y), where sign(w) is applied element-wise
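A sketch of the L1 objective and its (sub)gradient (function name and arguments are my own):

```python
import numpy as np

def l1_objective_and_grad(J, grad_J, w, alpha):
    # J~ = alpha * ||w||_1 + J; each coordinate of the (sub)gradient gains
    # a constant-magnitude term alpha * sign(w_i) rather than alpha * w_i.
    obj = alpha * np.sum(np.abs(w)) + J
    grad = alpha * np.sign(w) + grad_J
    return obj, grad
```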
Describe difference in grad of objective function from L2 to L1
The regularization contribution to the gradient no longer scales linearly with each wᵢ
Instead it is a constant factor α multiplied by sign(wᵢ)
Therefore we won't necessarily see clean algebraic solutions to quadratic approximations of the (unregularized) objective function
Taylor series expansion of grad of cost function for SLRM L1
∇_w Ĵ(w) = H(w − w*)
Here we approximate the more complicated cost function by a truncated Taylor series expansion around w*, and analyse the regularised objective on this simpler quadratic model
Decompose quadratic approximation of L1 regularised objective function for SLRM
(We assume the Hessian is diagonal with each diagonal entry Hᵢᵢ > 0, for simplicity)
Ĵ(w; X, y) = J(w*; X, y) + Σᵢ [ (1/2)Hᵢᵢ(wᵢ − wᵢ*)² + α|wᵢ| ]
Solution to the decomposed L1 objective (per component)
wᵢ = sign(wᵢ*) · max{ |wᵢ*| − α/Hᵢᵢ , 0 }
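The per-component L1 solution (soft thresholding) can be sketched as follows; the function name and arguments are illustrative:

```python
import numpy as np

def l1_componentwise_solution(w_star, H_diag, alpha):
    # w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0): components whose
    # unregularised optimum is small enough are clamped exactly to zero,
    # which is the source of L1-induced sparsity.
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)
```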
Sparsity in context of regularization
Having parameters with optimal value of 0
Solution to L2 regularisation for SLRM (with θ = w)
(7.13) w = (XᵀX + αI)⁻¹Xᵀy
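A numerical sketch of the closed-form ridge solution, using illustrative noiseless data of my own (true_w and the data shapes are assumptions):

```python
import numpy as np

def ridge_solution(X, y, alpha):
    # Closed-form L2-regularised linear regression:
    # w = (X^T X + alpha I)^{-1} X^T y; alpha = 0 recovers ordinary least squares.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Illustrative noiseless data.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w_ols = ridge_solution(X, y, 0.0)    # recovers true_w on noiseless data
w_ridge = ridge_solution(X, y, 5.0)  # shrunk towards zero relative to OLS
```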