Ensemble methods Flashcards
In Ensemble Learning, how can we combine K different models?
P(t|x) = Σ_k π_k(x) * P(t|x, k)
where π_k(x) = P(k|x) are input-dependent mixing coefficients
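A minimal NumPy sketch of this combination rule (the number of models, classes and all probability values below are made-up illustrative values):

```python
import numpy as np

# Hypothetical example: K = 3 models, each giving P(t | x, k) for 2 classes.
p_t_given_xk = np.array([
    [0.9, 0.1],   # model 1
    [0.6, 0.4],   # model 2
    [0.2, 0.8],   # model 3
])

# Input-dependent mixing coefficients pi_k(x) = P(k | x); they sum to 1.
pi = np.array([0.5, 0.3, 0.2])

# P(t | x) = sum_k pi_k(x) * P(t | x, k)
p_t_given_x = pi @ p_t_given_xk
print(p_t_given_x)  # [0.67, 0.33]
```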
What is bagging (= bootstrap aggregating)?
It’s averaging the predictions of a set of models, each of which is trained on a different bootstrap sample (drawn with replacement) of the full dataset.
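A minimal bagging sketch with NumPy and scikit-learn decision trees (the dataset, the number of models and the choice of trees as base models are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
M = 25  # number of bagged models

models = []
for _ in range(M):
    # Bootstrap sample: draw n points with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine the individual predictions (here: majority vote on 0/1 labels).
votes = np.mean([m.predict(X) for m in models], axis=0)
y_pred = (votes >= 0.5).astype(int)
```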
What is the expected error after bagging?
E_com = E_av / M, where:
- E_av is the average error of the individual models
- M is the number of models
This 1/M reduction assumes the individual models’ errors are uncorrelated and have zero mean; in practice the errors are correlated and the reduction is smaller.
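A quick NumPy simulation of that idealized (uncorrelated, zero-mean error) case, with synthetic Gaussian errors and illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 10, 100_000

# Synthetic, zero-mean, uncorrelated per-model errors on n test points.
errors = rng.normal(loc=0.0, scale=1.0, size=(M, n))

E_av = np.mean(errors ** 2)                # average individual squared error (~1.0)
E_com = np.mean(errors.mean(axis=0) ** 2)  # squared error of the averaged committee
print(E_av, E_com, E_av / M)               # E_com comes out close to E_av / M
```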
What is a Random Forest?
- Training K different Decision Trees, each on a different bootstrap sample of the dataset
- For each node, randomly pick m < M different variables to consider
- No pruning is needed
By averaging the predictions of multiple high-variance trees, we reduce both bias and variance (compared with a single Decision Tree).
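A hedged sketch using scikit-learn’s RandomForestClassifier (the dataset and hyperparameter values are illustrative; max_features plays the role of m):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# K trees, each grown on a bootstrap sample; at each node only
# max_features randomly chosen variables are considered for the split.
rf = RandomForestClassifier(
    n_estimators=100,     # K trees
    max_features="sqrt",  # m < M variables per split
    random_state=0,
).fit(X, y)

print(rf.score(X, y))
```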
What are Extremely Randomized Trees (ERTS)?
It’s like a Random Forest but in addition to randomly picking m attributes, the attribute test is also chosen randomly (2-way split, multiway split, etc.).
ERTS also include a depth hyperparameter.
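scikit-learn’s ExtraTreesClassifier implements this idea; a minimal sketch (the dataset and parameter values are illustrative; max_depth is the depth hyperparameter):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, random_state=0)

ert = ExtraTreesClassifier(
    n_estimators=100,
    max_features="sqrt",  # m randomly picked attributes per node
    max_depth=10,         # depth hyperparameter
    random_state=0,
).fit(X, y)
```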
What is boosting?
It’s sequentially training multiple classifiers, re-weighting the training examples according to the errors of the previous classifiers (misclassified data points gain more weight). The global prediction is a weighted majority vote of all the classifiers.
What is the AdaBoost algorithm?
1) Initialize each weight to 1/N
2) For m = 1, 2, … M :
- fit a classifier y_m(x) by minimizing Jm = Σ_n w_n^[m] * Id( y_m(x(n)) ≠ t(n) )
- evaluate Em = Jm / Σ_n w_n^[m] and α_m = ln[ (1 - Em) / Em ]
- update the weights: w_n^[m+1] = w_n^[m] * exp( α_m * Id( y_m(x(n)) ≠ t(n) ) )
3) y_M(x) = Sign( Σ_m α_m * y_m(x) )
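A minimal NumPy implementation of these steps, using decision stumps as the base classifiers (the dataset, M, and the clipping of Em are illustrative assumptions, not part of the algorithm statement above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, t = make_classification(n_samples=300, random_state=0)
t = np.where(t == 1, 1, -1)                 # targets in {-1, +1}
N, M = len(X), 50

w = np.full(N, 1.0 / N)                     # 1) initialize each weight to 1/N
stumps, alphas = [], []

for m in range(M):                          # 2) for m = 1, ..., M
    # Fit a weak classifier (decision stump) on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
    miss = (stump.predict(X) != t).astype(float)   # Id( y_m(x(n)) != t(n) )

    Em = np.sum(w * miss) / np.sum(w)
    Em = np.clip(Em, 1e-10, 1 - 1e-10)      # guard against Em = 0 or 1
    alpha = np.log((1 - Em) / Em)

    w = w * np.exp(alpha * miss)            # up-weight misclassified points
    stumps.append(stump)
    alphas.append(alpha)

# 3) weighted majority vote: sign( sum_m alpha_m * y_m(x) )
y_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(y_pred == t))
```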
What is a decision stump?
It’s a 1-level decision tree (i.e. a decision tree with a single decision node, splitting on a single feature).
What are weak learners?
They are the base learners used for boosting: simple models that perform only slightly better than chance. Unlike bagging, boosting doesn’t require complex models as base learners.
What is the exponential error function formula?
E_M = Σ_n exp( - t(n) * f_M(x(n)) ), where f_M is a linear combination of the individual classifiers: f_M(x(n)) = (1/2) * Σ_m α_m * y_m(x(n))
Which function is boosting approximating?
The log-odds ratio y(x) = ln[ P(t=+1|x) / P(t=-1|x) ] / 2
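A short sketch of why, from pointwise minimization of the expected exponential error:
E[ exp(-t * f(x)) | x ] = P(t=+1|x) * exp(-f(x)) + P(t=-1|x) * exp(f(x));
setting the derivative with respect to f(x) to zero gives exp(2 * f(x)) = P(t=+1|x) / P(t=-1|x), i.e. f(x) = ln[ P(t=+1|x) / P(t=-1|x) ] / 2.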
Compare bagging vs boosting.
- faster vs slower (bagged models can be trained in parallel, boosting is sequential)
- smaller vs larger error reduction
- works well with reasonably complex vs weak base classifiers
- doesn’t vs does overfit to wrongly labelled data points
- mainly reduces variance vs mainly reduces bias
What is stacking?
It’s an ensemble method that uses different types of base models. Their outputs are then fed to another model, called the meta-classifier, which makes the final prediction. Generating the base-model outputs with K-fold cross-validation is a good way of avoiding overfitting when stacking.
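A sketch with scikit-learn’s StackingClassifier, where cv controls the K-fold predictions used to train the meta-classifier (the choice of base models and cv=5 are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-classifier
    cv=5,  # K-fold CV outputs help avoid overfitting the meta-classifier
).fit(X, y)
```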