Session 4.1 Flashcards
Bias-Variance tradeoff
When searching for the optimal model, we are in fact trying to find…
the optimal tradeoff between bias and variance
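A reminder of the usual decomposition behind this statement (not on the card itself, but the standard formulation): for squared error, the expected error of a fitted model splits into bias, variance, and irreducible noise.

```latex
% Standard bias-variance decomposition of expected squared error at a point x,
% averaged over training sets; \sigma^2 is the irreducible noise term.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \sigma^2
```

Simple models tend to have high bias and low variance; complex models tend to have the reverse.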
Bias-Variance tradeoff
We can reduce variance by
putting many models together and aggregating their outcomes
Bagging (or bootstrap aggregation) creates
multiple data sets from the original training data by bootstrapping – re-sampling with replacement.
It then runs one model per data set and aggregates their outputs with a voting system
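A minimal sketch of bagging, assuming scikit-learn decision trees as the base model, numpy arrays for the data, and integer class labels (X_train, y_train, X_test are placeholder names):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_models=50, seed=0):
    """Fit one tree per bootstrap sample and aggregate by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = []
    for _ in range(n_models):
        # re-sample the training data with replacement (the bootstrap)
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(X_test))
    all_preds = np.array(all_preds)  # shape: (n_models, n_test_points)
    # majority vote across the models for each test point
    return np.apply_along_axis(
        lambda votes: np.bincount(votes).argmax(), 0, all_preds)
```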
Other ensemble methods
Random Forest
combines bagging with random selection of features (or predictors)
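The same idea with scikit-learn's built-in implementation (a sketch; X_train, y_train, X_test are placeholders). max_features controls the random subset of predictors considered at each split:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bagged trees; each split only considers a random subset of
# sqrt(n_features) predictors, which de-correlates the trees
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
```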
Other ensemble methods
Boosting
applies classifiers sequentially, assigning higher weights to observations that were misclassified by the previous classifiers
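AdaBoost is one common implementation of this scheme; scikit-learn's version uses shallow decision trees ("stumps") as the default weak learner (sketch only; data names are placeholders):

```python
from sklearn.ensemble import AdaBoostClassifier

# Each new weak learner is fit with higher sample weights on the observations
# the previous learners misclassified; the final prediction is a weighted
# vote over all of them.
boost = AdaBoostClassifier(n_estimators=200, random_state=0)
boost.fit(X_train, y_train)
y_pred = boost.predict(X_test)
```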
A table model
memorizes the training data and performs no generalization
Useless in practice! Previously unseen customers would all end up with
“0% likelihood of churning”
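A table model in a few lines of Python (illustrative only; the customer key and the 0.0 default are assumptions chosen to match the churn example):

```python
class TableModel:
    """Memorizes (customer -> outcome) pairs from training; no generalization."""

    def fit(self, customers, outcomes):
        self.table = dict(zip(customers, outcomes))
        return self

    def predict_one(self, customer):
        # Seen customers get their memorized outcome back; every previously
        # unseen customer falls through to the default: 0% likelihood of churning.
        return self.table.get(customer, 0.0)
```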
Generalization
is the property of a model or modeling process whereby
the model applies to data that were not used to build the model
If models do not generalize at all, they fit perfectly to the training data!
→ they overfit
Overfitting
is the tendency to tailor models to the training data, at the expense of generalization to previously unseen data points.
Holdout Validation
- Given only one data set, we split it into a training set used for fitting the model and a test set used for evaluating the model
- Performance is evaluated based on accuracy on the test data, a.k.a. “holdout accuracy”
- Holdout accuracy is an estimate of “generalization accuracy”
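One common way to do the split in practice, here with scikit-learn's train_test_split and a decision tree (X, y are placeholders; the 30% holdout fraction is just an illustrative choice):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# hold out 30% of the data for evaluation, fit on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# "holdout accuracy" – an estimate of generalization accuracy
holdout_acc = accuracy_score(y_test, model.predict(X_test))
```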
As a model gets more complex, it is allowed to pick up harmful spurious correlations
- These correlations do not represent characteristics of the population in general
- They may become harmful when they produce incorrect generalizations in the model
This phenomenon is not particular to decision trees
- It is also not because of atypical training data
- There is no general analytic way to avoid overfitting
Simplest method to limit tree size:
specify a minimum number of instances that must be present in a leaf
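In scikit-learn this corresponds to the min_samples_leaf parameter (the value 20 below is only an illustrative threshold):

```python
from sklearn.tree import DecisionTreeClassifier

# every leaf must contain at least 20 training instances, which caps how
# finely the tree can carve up the training data
tree = DecisionTreeClassifier(min_samples_leaf=20)
tree.fit(X_train, y_train)
```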
Just as with trees, as you increase the dimensionality,
you can perfectly fit larger and larger sets of arbitrary points
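A quick numerical illustration of this point (not from the card): with at least as many dimensions as points, even a linear model can fit purely random targets exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_dims = 20, 25               # more dimensions than points
X = rng.normal(size=(n_points, n_dims))
y = rng.normal(size=n_points)           # arbitrary targets – pure noise

# least-squares fit; with n_dims >= n_points the training error is ~0
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.abs(X @ coef - y).max())       # essentially zero (up to rounding)
```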
- Often, modelers manually prune the attributes in order to avoid overfitting
- There are ways to select attributes automatically
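One simple automatic option is univariate feature selection, e.g. scikit-learn's SelectKBest (the choice of k=10 is arbitrary; data names are placeholders):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# keep only the 10 attributes most strongly associated with the target
selector = SelectKBest(score_func=f_classif, k=10)
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)
```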
Why is overfitting bad?
A small imbalance in the training data can be ’learned’ by the tree and erroneously propagated
Why is the phenomenon of overfitting not particular to decision trees?
- All model types can be tailored too closely to the training data, so any of them can overfit
- It is not caused by atypical training data, and there is no general analytic way to avoid it