Regularization and Feature Selection Flashcards
Regularized Loss Minimization
Another learning paradigm: pick the hypothesis h defined by a vector w, using a regularization function R: R^d -> R
A(S) = argmin_w ( Ls(w) + R(w) )
R(w) is a measure of the complexity of the hypothesis h defined by w; the rule balances low empirical risk against a less complex hypothesis
L1 regularization, LASSO
Regularization function: R(w) = lambda ||w||1
l1 norm: ||w||1 = SUM_i |wi|
The learning rule is:
A(S) = argmin_w ( Ls(w) + lambda ||w||1 )
- lambda regulates the tradeoff between the empirical risk Ls(w) (i.e. the risk of overfitting) and the complexity ||w||1 of the model we pick
- ||w||1 measures the complexity of the hypothesis defined by w
LASSO: linear regression with squared loss + l1 regularization
w = argmin_w ( lambda ||w||1 + SUM_i (<w,xi> - yi)^2 )
The l1 norm and the squared loss are both convex, so the objective is convex -> the problem can be solved efficiently
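A minimal sketch of fitting LASSO in Python, assuming scikit-learn is available (scikit-learn's Lasso scales the squared loss by 1/(2m), so its alpha parameter plays the role of lambda only up to that constant; the toy data below is made up for illustration):

import numpy as np
from sklearn.linear_model import Lasso

# toy data: 100 samples, 10 features, only the first 3 are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)

# squared loss + l1 penalty; the tradeoff parameter is called "alpha" here
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # the l1 penalty drives many coefficients exactly to 0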
Tikhonov Regularization, RIDGE Regression, derivation of optimal solution for Ridge Regression
Regularization function: R(w) = lambda ||w||^2
squared l2 norm: ||w||^2 = SUM_i wi^2
The learning rule is:
A(S) = argmin_w ( Ls(w) + lambda ||w||^2 )
- lambda regulates the tradeoff between the empirical risk Ls(w) (i.e. the risk of overfitting) and the complexity ||w||^2 of the model we pick
- ||w||^2 measures the complexity of the hypothesis defined by w
RIDGE: linear regression with squared loss + Tikhonov regularization
w = argmin_w ( lambda ||w||^2 + SUM_i (<w,xi> - yi)^2 )
Derivation of the optimal solution: the objective is convex and differentiable, so setting the gradient 2 lambda w + 2 SUM_i (<w,xi> - yi) xi to zero gives (lambda I + X^T X) w = X^T y, i.e. w = (lambda I + X^T X)^(-1) X^T y, where the rows of X are the xi and the entries of y are the yi.
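A minimal numpy sketch of that closed-form solution (toy data and a hypothetical lambda value, only to exercise the formula):

import numpy as np

def ridge_solution(X, y, lam):
    # closed-form Ridge solution: w = (lambda I + X^T X)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# toy data: 50 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)
print(ridge_solution(X, y, lam=1.0))  # close to the true weights, shrunk toward 0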
Feature selection: subset selection, forward selection, backward selection, without and with validation data
Feature selection: the task of selecting a small number of features, out of a large pool, to be used by the predictor. This helps prevent overfitting and makes the predictor faster to evaluate.
The goal is to learn a predictor that uses only k << d features.
Select the k features that minimize the empirical risk
Subset selection
Subset selection: select the subset of k features that minimizes the training error, by exhaustive search over all subsets of size k (a runnable sketch follows the pseudocode below).
Pseudocode:
Let
- I = {1,…,d}
- given p = {i1,…,ik} subset of I : Hp = the class of hypotheses/models that use only the features indexed by p
P(k) <- { J subset of I : |J| = k }
foreach p in P(k) do
- hp <- argmin_{h in Hp} Ls(h)
return h(k) <- argmin_{p in P(k)} Ls(hp)
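A sketch of exhaustive subset selection for linear regression with squared loss (the loss is my choice here, since the card leaves Ls abstract); it enumerates every subset of size k and keeps the one with the lowest training error:

import numpy as np
from itertools import combinations

def train_loss(X, y, cols):
    # hp = least-squares fit using only the selected columns, then its squared training loss
    cols = list(cols)
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.sum((X[:, cols] @ w - y) ** 2)

def subset_selection(X, y, k):
    d = X.shape[1]
    # P(k): all subsets of size k; keep the one with minimal training error
    return min(combinations(range(d), k), key=lambda p: train_loss(X, y, p))

# toy usage: only features 0 and 3 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=60)
print(subset_selection(X, y, k=2))  # expected to recover (0, 3)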
Forward selection
Greedy algorithm: start from the empty solution and add one feature at a time, until the solution has cardinality k (a runnable sketch follows the pseudocode below).
sol <- empty
while |sol| < k do
- foreach i in I \ sol do
- - p <- sol union {i}
- - hp <- argmin_{h in Hp} Ls(h)
- sol <- sol union { argmin_{i in I \ sol} Ls(h_{sol union {i}}) }
return sol
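A greedy forward-selection sketch in the same toy least-squares setting (again my choice of loss; the helper is hypothetical, not from any library):

import numpy as np

def train_loss(X, y, cols):
    # fit on the selected columns only, return the squared training error
    cols = list(cols)
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.sum((X[:, cols] @ w - y) ** 2)

def forward_selection(X, y, k):
    d = X.shape[1]
    sol = set()
    while len(sol) < k:
        # try adding each remaining feature, keep the one with lowest training error
        best_i = min(set(range(d)) - sol, key=lambda i: train_loss(X, y, sol | {i}))
        sol.add(best_i)
    return sol

# toy usage: only features 1 and 5 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 8))
y = 3 * X[:, 1] - 2 * X[:, 5] + 0.1 * rng.normal(size=80)
print(forward_selection(X, y, k=2))  # expected to pick {1, 5}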
Backward selection
Greedy algorithm: start from the solution that includes all the features and remove one feature at a time, until the solution has cardinality k.
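Correspondingly, a greedy backward-elimination sketch under the same assumptions (least-squares training loss, hypothetical helper names):

import numpy as np

def train_loss(X, y, cols):
    # fit on the selected columns only, return the squared training error
    cols = list(cols)
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.sum((X[:, cols] @ w - y) ** 2)

def backward_selection(X, y, k):
    sol = set(range(X.shape[1]))  # start from the full feature set
    while len(sol) > k:
        # remove the feature whose removal increases the training error the least
        worst_i = min(sol, key=lambda i: train_loss(X, y, sol - {i}))
        sol.remove(worst_i)
    return sol

# usable on the same toy data as the forward-selection sketch above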
Subset selection with validation data (each hp is trained on S by minimizing Ls, and a validation set V with loss Lv chooses among subset sizes 0…k; sketch after the pseudocode):
for l <- 0 to k do
- P(l) <- { J subset of I : |J| = l }
- foreach p in P(l) do
- - hp <- argmin_{h in Hp} Ls(h)
- h(l) <- argmin_{p in P(l)} Lv(hp)
return argmin_{l in {0,…,k}} Lv(h(l))
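A sketch of this idea with an explicit train/validation split (the nested argmins collapse to one global argmin of Lv over all subsets of size at most k; the empty model l = 0 is skipped for simplicity, and the least-squares loss is again my choice):

import numpy as np
from itertools import combinations

def validation_loss(X_tr, y_tr, X_va, y_va, cols):
    # hp = argmin Ls(h): least-squares fit on the training set S,
    # then scored with the validation loss Lv on V
    cols = list(cols)
    w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return np.sum((X_va[:, cols] @ w - y_va) ** 2)

def subset_selection_with_validation(X_tr, y_tr, X_va, y_va, k):
    d = X_tr.shape[1]
    candidates = [p for l in range(1, k + 1) for p in combinations(range(d), l)]
    return min(candidates, key=lambda p: validation_loss(X_tr, y_tr, X_va, y_va, p))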
Forward selection with validation data (candidate models are fit on S, but the feature to add at each step is chosen by the validation loss Lv; sketch after the pseudocode):
sol <- empty
while |sol| < k do
- foreach i in I \ sol do
- - p <- sol union {i}
- - hp <- argmin_{h in Hp} Ls(h)
- sol <- sol union { argmin_{i in I \ sol} Lv(h_{sol union {i}}) }
return sol
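And the forward variant as a sketch: each candidate model is fit on the training set, but the feature to add is the one whose candidate model has the lowest validation loss (same hypothetical least-squares helper as above):

import numpy as np

def validation_loss(X_tr, y_tr, X_va, y_va, cols):
    # fit hp on the training set S, score it with Lv on the validation set V
    cols = list(cols)
    w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return np.sum((X_va[:, cols] @ w - y_va) ** 2)

def forward_selection_with_validation(X_tr, y_tr, X_va, y_va, k):
    d = X_tr.shape[1]
    sol = set()
    while len(sol) < k:
        # add the feature whose candidate model has the lowest validation error
        best_i = min(set(range(d)) - sol,
                     key=lambda i: validation_loss(X_tr, y_tr, X_va, y_va, sol | {i}))
        sol.add(best_i)
    return sol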