Overfitting Flashcards
Bias-variance decomposition of the out-of-sample error
ED[ Eout(h^D) ] = bias + variance
bias = Ex[ (h̄(x) - f(x))^2 ]
variance = Ex[ ED[ (h^D(x) - h̄(x))^2 ] ]
where h^D is the hypothesis learned from the data set D and h̄(x) = ED[ h^D(x) ] is the average hypothesis (a simulation sketch follows this card)
1) very small hypothesis set H:
-> high bias, low variance
2) very flexible model (high complexity):
-> low bias, high variance
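A minimal simulation sketch of this decomposition, assuming a sin target, noiseless samples, and polynomial hypothesis sets of increasing order (all illustrative choices, not part of the card): bias and variance are estimated by averaging over many data sets.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)      # assumed target function (noiseless samples)
x_grid = np.linspace(-1, 1, 200)     # points over which Ex[.] is approximated

def learn(order, N):
    """Fit a degree-`order` polynomial to N samples of f (one data set D)."""
    x = rng.uniform(-1, 1, N)
    return np.polyfit(x, f(x), order)

def bias_variance(order, N=5, n_datasets=2000):
    preds = np.array([np.polyval(learn(order, N), x_grid) for _ in range(n_datasets)])
    h_bar = preds.mean(axis=0)                    # average hypothesis h̄(x)
    bias = np.mean((h_bar - f(x_grid)) ** 2)      # Ex[(h̄(x) - f(x))^2]
    variance = np.mean(preds.var(axis=0))         # Ex[ED[(h^D(x) - h̄(x))^2]]
    return bias, variance

for order in (0, 1, 3):
    b, v = bias_variance(order)
    print(f"order {order}: bias={b:.3f}  variance={v:.3f}  bias+variance={b+v:.3f}")
```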
Bias-variance diagram as a function of model complexity
- as the model complexity increases, the in-sample error decreases and eventually reaches zero
- the out-of-sample error, instead, has a minimum point, which corresponds to the best model order (see the sketch after this card)
- before the minimum -> underfitting
- after the minimum -> overfitting
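A sketch of the behaviour behind the diagram, assuming a noisy quadratic target and polynomial models of increasing order (illustrative choices): Ein keeps decreasing with the order, while the estimate of Eout first drops and then rises.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x - 2 * x ** 2                      # assumed target function
x_tr = rng.uniform(-1, 1, 15)                     # small training set
y_tr = f(x_tr) + 0.2 * rng.standard_normal(15)
x_te = rng.uniform(-1, 1, 5000)                   # large test set ~ Eout
y_te = f(x_te) + 0.2 * rng.standard_normal(5000)

for order in range(0, 9):
    w = np.polyfit(x_tr, y_tr, order)             # fit a model of this complexity
    e_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"order {order}: Ein={e_in:.3f}  Eout={e_out:.3f}")
```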
Learning curves as a function of the size of the data set
1) simple model h
- high bias of ED[Eout(h)]
- low variance of ED[Eout(h)]
2) complex model h
- low bias of ED[Eout(h)]
- high variance of ED[Eout(h)]
In both cases:
- ED[Eout] decreases with the number of data points
- ED[Ein] increases with the number of data points
-> model complexity should be selected based on the size of the data set, not on the complexity of the target! (see the sketch after this card)
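A learning-curve sketch under assumed choices (sin target, Gaussian noise, polynomial models of order 1 and 6, all illustrative): ED[Ein] and ED[Eout] are estimated by averaging over many data sets of each size N.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)                    # assumed target
noise = 0.3
x_te = np.linspace(-1, 1, 1000)                    # test inputs for estimating Eout

def expected_errors(order, N, n_rep=200):
    """Average Ein and Eout over n_rep data sets D of size N."""
    e_in, e_out = [], []
    for _ in range(n_rep):
        x = rng.uniform(-1, 1, N)
        y = f(x) + noise * rng.standard_normal(N)
        w = np.polyfit(x, y, order)
        e_in.append(np.mean((np.polyval(w, x) - y) ** 2))
        y_te = f(x_te) + noise * rng.standard_normal(x_te.size)
        e_out.append(np.mean((np.polyval(w, x_te) - y_te) ** 2))
    return np.mean(e_in), np.mean(e_out)           # ED[Ein], ED[Eout]

for order in (1, 6):                               # simple vs complex model
    for N in (10, 20, 50, 100, 200):
        ein, eout = expected_errors(order, N)
        print(f"order={order} N={N:3d}: ED[Ein]={ein:.3f}  ED[Eout]={eout:.3f}")
```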
Possible tools to tackle the overfitting issue
1) regularization
2) (cross) validation
Regularization: key idea
- many different tools exist, for example L2-regularization
- the key idea is to add a constraint to the minimization of Ein, so as to limit the magnitude of the parameters (and hence the effective model complexity)
-> the new optimization problem becomes:
w^reg = argmin(w) Ein(w)
s.t. w' w <= C
where C is the budget on the weight magnitude
Formula for the minimizer of the regularized linear regression problem
w^reg = (Z' Z + λ I)^-1 Z' Y
where λ >= 0 is a design parameter associated with the budget C (larger λ corresponds to a smaller C); it can be tuned by minimizing the cross-validation error (see the sketch after this card)
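A direct sketch of the closed form above; the data set, the polynomial feature matrix Z, and the λ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data and polynomial feature matrix Z (assumed choices)
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + 0.2 * rng.standard_normal(x.size)
Z = np.vander(x, N=6, increasing=True)             # columns [1, x, ..., x^5]

def w_reg(Z, y, lam):
    """Closed form w^reg = (Z'Z + lam*I)^-1 Z'y of the regularized problem."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

for lam in (0.0, 0.01, 1.0):
    w = w_reg(Z, y, lam)
    print(f"lambda={lam:4.2f}  ||w|| = {np.linalg.norm(w):.3f}")   # shrinks as lambda grows
```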
Validation: working principle
1) Partition the data set D into:
- training set Dtrain of size N-K
- validation set Dval of size K
- the partition must not depend on the data!
2) run the learning algorithm on the training set Dtrain to obtain the hypothesis g^- that minimizes Ein
3) compute the validation error Eval(g^-) of the obtained hypothesis on the validation set Dval
the validation error is an unbiased estimate of Eout(g^-) (see the sketch after this card)
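A minimal sketch of steps 1-3, assuming a sin target, K = 25 out of N = 100 points, and a degree-3 polynomial model (all illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data set D of size N (target and noise level are assumptions)
N, K = 100, 25
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

# 1) random partition into Dtrain (N-K points) and Dval (K points)
idx = rng.permutation(N)
val, tr = idx[:K], idx[K:]

# 2) learn g^- on the training set only (here: a degree-3 polynomial fit)
w = np.polyfit(x[tr], y[tr], 3)

# 3) validation error of g^- on Dval: an unbiased estimate of Eout(g^-)
e_val = np.mean((np.polyval(w, x[val]) - y[val]) ** 2)
print(f"Eval(g^-) = {e_val:.3f}")
```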
On the choice of the size K of the validation set
This is a design choice subject to a tradeoff:
- high K gives a low variance of the validation error (it scales as O(1/sqrt(K)))
- low K leaves a larger training set, so g^- generalizes similarly to a hypothesis trained on the full data set
-> cross-validation is a good way to resolve this tradeoff
Leave-one-out cross-validation
- K = 1
- there are N ways to partition the data set, each leaving a single data point out as the validation set
- in this way, a good estimate of the out-of-sample error is obtained (if N is large) as
Ecv = 1/N * sum(n=1,N) Eval(g^-n)
where g^-n is the hypothesis trained on the data set with the n-th point removed and Eval(g^-n) is its error on that left-out point (see the sketch after this card)
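A sketch of the Ecv formula, assuming a small data set and a degree-3 polynomial model (illustrative choices); loocv_error is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data set (assumed target and noise)
N = 30
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

def loocv_error(x, y, order):
    """Ecv = 1/N * sum_n Eval(g^-n), where g^-n is trained with the n-th point left out."""
    errs = []
    for n in range(len(x)):
        keep = np.arange(len(x)) != n                   # all points except the n-th
        w = np.polyfit(x[keep], y[keep], order)         # g^-n
        errs.append((np.polyval(w, x[n]) - y[n]) ** 2)  # Eval(g^-n) on the left-out point
    return np.mean(errs)

print(f"Ecv (order 3) = {loocv_error(x, y, 3):.3f}")
```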
How to use cross-validation for complexity selection
- compute the cross-validation error Ecv for models of different complexity (e.g., different model orders or different values of λ)
- the best model is the one with the lowest cross-validation error (a selection sketch follows)
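A sketch of complexity selection by cross-validation, here used to tune the regularization parameter λ of the closed form from the earlier card; the data, the feature matrix Z, and the candidate λ grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative data (assumed) and a fixed high-order feature matrix Z
N = 30
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)
Z = np.vander(x, N=8, increasing=True)                 # columns [1, x, ..., x^7]

def w_reg(Z, y, lam):
    """Regularized least-squares solution (Z'Z + lam*I)^-1 Z'y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def loocv_error(Z, y, lam):
    """Leave-one-out cross-validation error of the regularized fit."""
    errs = []
    for n in range(len(y)):
        keep = np.arange(len(y)) != n
        w = w_reg(Z[keep], y[keep], lam)
        errs.append((Z[n] @ w - y[n]) ** 2)
    return np.mean(errs)

# Compute Ecv over a grid of candidate lambdas and pick the smallest
lambdas = (0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0)
ecv = {lam: loocv_error(Z, y, lam) for lam in lambdas}
for lam in lambdas:
    print(f"lambda={lam:6.3f}: Ecv={ecv[lam]:.3f}")
print(f"selected lambda: {min(ecv, key=ecv.get)}")
```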