Overfitting Flashcards
Bias-variance decomposition of the out-of-sample error
ED[ Eout(h^D) ] = bias + variance
bias = Ex[ (h̄(x) - f(x))^2 ]
variance = Ex[ ED[ (h^D(x) - h̄(x))^2 ] ]
where h^D is the hypothesis learned from the data set D and h̄(x) = ED[ h^D(x) ] is the average hypothesis (a simulation sketch follows this card)
1) very small hypothesis set H:
-> high bias, low variance
2) very flexible model (high complexity):
-> low bias, high variance
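A minimal simulation sketch of this decomposition, assuming a sin target, noiseless samples, and polynomial hypothesis sets of increasing order (all illustrative choices, not part of the card): bias and variance are estimated by averaging over many data sets.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)      # assumed target function (noiseless samples)
x_grid = np.linspace(-1, 1, 200)     # points over which Ex[.] is approximated

def learn(order, N):
    """Fit a degree-`order` polynomial to N samples of f (one data set D)."""
    x = rng.uniform(-1, 1, N)
    return np.polyfit(x, f(x), order)

def bias_variance(order, N=5, n_datasets=2000):
    preds = np.array([np.polyval(learn(order, N), x_grid) for _ in range(n_datasets)])
    h_bar = preds.mean(axis=0)                    # average hypothesis h̄(x)
    bias = np.mean((h_bar - f(x_grid)) ** 2)      # Ex[(h̄(x) - f(x))^2]
    variance = np.mean(preds.var(axis=0))         # Ex[ED[(h^D(x) - h̄(x))^2]]
    return bias, variance

for order in (0, 1, 3):
    b, v = bias_variance(order)
    print(f"order {order}: bias={b:.3f}  variance={v:.3f}  bias+variance={b+v:.3f}")
```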
Bias-variance diagram as a function of model complexity
- as the model complexity increases, the in-sample error decreases and eventually reaches zero
- the out-of-sample error, instead, has a minimum point, which corresponds to the best model order (see the sketch after this card)
- before the minimum -> underfitting
- after the minimum -> overfitting
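A sketch of the behaviour behind the diagram, assuming a noisy quadratic target and polynomial models of increasing order (illustrative choices): Ein keeps decreasing with the order, while the estimate of Eout first drops and then rises.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x - 2 * x ** 2                      # assumed target function
x_tr = rng.uniform(-1, 1, 15)                     # small training set
y_tr = f(x_tr) + 0.2 * rng.standard_normal(15)
x_te = rng.uniform(-1, 1, 5000)                   # large test set ~ Eout
y_te = f(x_te) + 0.2 * rng.standard_normal(5000)

for order in range(0, 9):
    w = np.polyfit(x_tr, y_tr, order)             # fit a model of this complexity
    e_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"order {order}: Ein={e_in:.3f}  Eout={e_out:.3f}")
```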
Learning curves as a function of the size of the data set
1) simple model h
- high bias of ED[Eout(h)]
- low variance of ED[Eout(h)]
2) complex model h
- low bias of ED[Eout(h)]
- high variance of ED[Eout(h)]
In both cases:
- ED[Eout] decreases with the number of data points
- ED[Ein] increases with the number of data points
-> model complexity should be selected based on the size of the data set, not on the complexity of the target! (see the sketch after this card)
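A learning-curve sketch under assumed choices (sin target, Gaussian noise, polynomial models of order 1 and 6, all illustrative): ED[Ein] and ED[Eout] are estimated by averaging over many data sets of each size N.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)                    # assumed target
noise = 0.3
x_te = np.linspace(-1, 1, 1000)                    # test inputs for estimating Eout

def expected_errors(order, N, n_rep=200):
    """Average Ein and Eout over n_rep data sets D of size N."""
    e_in, e_out = [], []
    for _ in range(n_rep):
        x = rng.uniform(-1, 1, N)
        y = f(x) + noise * rng.standard_normal(N)
        w = np.polyfit(x, y, order)
        e_in.append(np.mean((np.polyval(w, x) - y) ** 2))
        y_te = f(x_te) + noise * rng.standard_normal(x_te.size)
        e_out.append(np.mean((np.polyval(w, x_te) - y_te) ** 2))
    return np.mean(e_in), np.mean(e_out)           # ED[Ein], ED[Eout]

for order in (1, 6):                               # simple vs complex model
    for N in (10, 20, 50, 100, 200):
        ein, eout = expected_errors(order, N)
        print(f"order={order} N={N:3d}: ED[Ein]={ein:.3f}  ED[Eout]={eout:.3f}")
```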
Possible tools to tackle the overfitting issue
1) regularization
2) (cross) validation
Regularization: key idea
- many different tools exist, for example L2-regularization
- the key idea is to add a constraint to the minimization of Ein, so as to limit the magnitude of the parameters (and hence the effective model complexity)
-> the new optimization problem becomes:
w^reg = argmin(w) Ein(w)
s.t. w' w <= C
where C is the budget on the weight magnitude
Formula for the minimizer of the regularized linear regression problem
w^reg = (Z' Z + λ I)^-1 Z' Y
where λ >= 0 is a design parameter associated with the budget C (larger λ corresponds to a smaller C); it can be tuned by minimizing the cross-validation error (see the sketch after this card)
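A direct sketch of the closed form above; the data set, the polynomial feature matrix Z, and the λ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data and polynomial feature matrix Z (assumed choices)
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + 0.2 * rng.standard_normal(x.size)
Z = np.vander(x, N=6, increasing=True)             # columns [1, x, ..., x^5]

def w_reg(Z, y, lam):
    """Closed form w^reg = (Z'Z + lam*I)^-1 Z'y of the regularized problem."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

for lam in (0.0, 0.01, 1.0):
    w = w_reg(Z, y, lam)
    print(f"lambda={lam:4.2f}  ||w|| = {np.linalg.norm(w):.3f}")   # shrinks as lambda grows
```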
Validation: working principle
1) Partition the data set D into:
- training set Dtrain of size N-K
- validation set Dval of size K
- the partition must not depend on the data!
2) run the learning algorithm on the training set Dtrain to obtain the hypothesis g^- that minimizes Ein
3) compute the validation error Eval(g^-) of the obtained hypothesis on the validation set Dval
the validation error is an unbiased estimate of Eout(g^-) (see the sketch after this card)
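A minimal sketch of steps 1-3, assuming a sin target, K = 25 out of N = 100 points, and a degree-3 polynomial model (all illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data set D of size N (target and noise level are assumptions)
N, K = 100, 25
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

# 1) random partition into Dtrain (N-K points) and Dval (K points)
idx = rng.permutation(N)
val, tr = idx[:K], idx[K:]

# 2) learn g^- on the training set only (here: a degree-3 polynomial fit)
w = np.polyfit(x[tr], y[tr], 3)

# 3) validation error of g^- on Dval: an unbiased estimate of Eout(g^-)
e_val = np.mean((np.polyval(w, x[val]) - y[val]) ** 2)
print(f"Eval(g^-) = {e_val:.3f}")
```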
On the choice of the size K of the validation set
This is a design choice subject to a tradeoff:
- high K gives a low variance of the validation error (it scales as O(1/sqrt(K)))
- low K leaves a larger training set, so g^- generalizes similarly to a hypothesis trained on the full data set
-> cross-validation is a good way to resolve this tradeoff
Leave-one-out cross-validation
- K = 1
- there are N ways to partition the data set, each leaving a single data point out as the validation set
- in this way, a good estimate of the out-of-sample error is obtained (if N is large) as
Ecv = 1/N * sum(n=1,N) Eval(g^-n)
where g^-n is the hypothesis trained on the data set with the n-th point removed and Eval(g^-n) is its error on that left-out point (see the sketch after this card)
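A sketch of the Ecv formula, assuming a small data set and a degree-3 polynomial model (illustrative choices); loocv_error is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data set (assumed target and noise)
N = 30
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

def loocv_error(x, y, order):
    """Ecv = 1/N * sum_n Eval(g^-n), where g^-n is trained with the n-th point left out."""
    errs = []
    for n in range(len(x)):
        keep = np.arange(len(x)) != n                   # all points except the n-th
        w = np.polyfit(x[keep], y[keep], order)         # g^-n
        errs.append((np.polyval(w, x[n]) - y[n]) ** 2)  # Eval(g^-n) on the left-out point
    return np.mean(errs)

print(f"Ecv (order 3) = {loocv_error(x, y, 3):.3f}")
```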
How to use cross-validation for complexity selection
- compute the cross-validation error Ecv for models of different complexity (e.g., different model orders or different values of λ)
- the best model is the one with the lowest cross-validation error (a selection sketch follows)
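A sketch of complexity selection by cross-validation, here used to tune the regularization parameter λ of the closed form from the earlier card; the data, the feature matrix Z, and the candidate λ grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative data (assumed) and a fixed high-order feature matrix Z
N = 30
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)
Z = np.vander(x, N=8, increasing=True)                 # columns [1, x, ..., x^7]

def w_reg(Z, y, lam):
    """Regularized least-squares solution (Z'Z + lam*I)^-1 Z'y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def loocv_error(Z, y, lam):
    """Leave-one-out cross-validation error of the regularized fit."""
    errs = []
    for n in range(len(y)):
        keep = np.arange(len(y)) != n
        w = w_reg(Z[keep], y[keep], lam)
        errs.append((Z[n] @ w - y[n]) ** 2)
    return np.mean(errs)

# Compute Ecv over a grid of candidate lambdas and pick the smallest
lambdas = (0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0)
ecv = {lam: loocv_error(Z, y, lam) for lam in lambdas}
for lam in lambdas:
    print(f"lambda={lam:6.3f}: Ecv={ecv[lam]:.3f}")
print(f"selected lambda: {min(ecv, key=ecv.get)}")
```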