Regression Trees Flashcards
Gini Index
Impurity measure of node t.
i(t) = 1 - sum_j (p_jt)^2
p_jt is the relative frequency of class j at node t.
The Gini index of a split is the sum of the i(t) values of the child nodes, each weighted by the relative number of cases in that node.
We choose the attribute that provides the smallest Gini split measure.
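A minimal sketch of these two computations (the helper names and toy labels below are illustrative, not from the source):

```python
import numpy as np

def gini(labels):
    """Gini impurity i(t) = 1 - sum_j (p_jt)^2 of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(groups):
    """Gini index of a split: i(t) of each child node weighted by its share of cases."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)

# A split that separates the classes well has a lower Gini index than the parent node.
left, right = ["a", "a", "a", "b"], ["b", "b", "b", "a"]
print(gini_split([left, right]))   # 0.375
print(gini(left + right))          # 0.5 before the split
```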
Information gain
Based on entropy, which measures the homogeneity of a node (a pure node has zero entropy).
Entropy:
i(t) = - sum_j (p_jt * log(p_jt))
The entropy gain of a split is the difference between the entropy before the split and the sum of the entropies of the nodes after the split, weighted by their relative frequencies.
We choose the split that achieves the greatest reduction in entropy, i.e. the one that maximises the gain.
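A matching sketch for entropy and the gain of a split (log base 2 is assumed here; the names and toy data are illustrative):

```python
import numpy as np

def entropy(labels):
    """Entropy i(t) = -sum_j p_jt * log2(p_jt) of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy of the nodes after it."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["a"] * 4 + ["b"] * 4
children = [["a", "a", "a", "b"], ["b", "b", "b", "a"]]
print(information_gain(parent, children))   # about 0.19 bits
```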
Split Info and Gain Ratio
Split info is minus the sum, over the child nodes, of the relative number of cases in each node times the log of that relative number: SplitInfo = - sum_i (n_i/n) * log(n_i/n).
Gain ratio is the entropy gain divided by the split info.
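Continuing the sketch from the information-gain card above, split info and gain ratio could look like this (same illustrative data):

```python
def split_info(children):
    """SplitInfo = -sum_i (n_i/n) * log2(n_i/n), where n_i is the size of child node i."""
    n = sum(len(c) for c in children)
    return -sum(len(c) / n * np.log2(len(c) / n) for c in children)

def gain_ratio(parent, children):
    """Gain ratio = entropy gain / split info."""
    return information_gain(parent, children) / split_info(children)

print(gain_ratio(parent, children))   # equals the gain here: a 50/50 split has split info = 1
```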
Stop criteria
- Minimum size of groups
- Minimum non-homogeneity of parent group
- Maximum number of iterations
- Minimum explanatory power
- Pruning
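As a rough illustration, these criteria map onto the parameters of scikit-learn's CART-style DecisionTreeClassifier (the numeric values below are placeholders, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=20,         # minimum size of the groups (leaf nodes)
    min_samples_split=40,        # a parent group must be large enough to be split at all
    max_depth=5,                 # caps the number of successive splits
    min_impurity_decrease=0.01,  # minimum explanatory power required from a split
    ccp_alpha=0.001,             # cost-complexity pruning strength
)
```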
CART
Characteristics:
- Variables: 1 dependent variable (quantitative or qualitative), mixed explanatory variables.
- Split type: Binary.
- Splitting rule: based on impurities. All possible binary subdivisions of the explanatory variables are considered, and the one that produces the maximum reduction of impurity is chosen (see the sketch after this list).
- Stop rule: based on the number of cases in the leaf nodes, with optimisation based on pruning.
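A small sketch of the splitting rule, assuming a single numeric explanatory variable (the function name and data are made up): every binary subdivision is tried and the one with the maximum reduction of impurity is kept.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_split(x, y):
    """Try every binary subdivision x <= threshold and keep the one with the
    largest reduction of impurity with respect to the parent node."""
    parent_impurity = gini(y)
    best_threshold, best_reduction = None, 0.0
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent_impurity - weighted > best_reduction:
            best_threshold, best_reduction = threshold, parent_impurity - weighted
    return best_threshold, best_reduction

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array(["a", "a", "a", "b", "b", "b"])
print(best_binary_split(x, y))   # threshold 3.0 gives the maximum impurity reduction (0.5)
```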
CHAID
- Variables: Qualitative dependent variable.
- Split type: can be on binary or multiple nodes.
- Splitting rule: based on the Chi-square test for the null hypothesis of statistical independence between the dependent variable and the explanatory variable.
- Stop rule: explicit, and must relate to the maximum dimension of the tree, the maximum number of levels, or the minimum number of elements in a node.
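A minimal sketch of this variable choice (the data frame and column names are made up): each qualitative predictor is tested for independence from the dependent variable, and the one with the smallest p-value is preferred.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "outcome": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "region":  ["N",   "N",   "S",  "S",  "N",   "S",  "N",   "S"],
    "segment": ["A",   "B",   "A",  "B",  "A",   "B",  "A",   "B"],
})

p_values = {}
for predictor in ["region", "segment"]:
    table = pd.crosstab(df[predictor], df["outcome"])  # contingency table with the target
    _, p, _, _ = chi2_contingency(table)
    p_values[predictor] = p

print(min(p_values, key=p_values.get))   # "region": most strongly associated with the outcome
```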
C4.5/C5.0
Similar to CART, but differs in the following respects:
- The segmentation of the nodes is not necessarily binary (multiway splits are allowed).
- Predictors and their values are selected based on information gain.
- The stop criterion is pruning, based on assigning an expected error to each leaf.
QUEST
Quick, Unbiased, Efficient, Statistical Tree
- Binary splits
- The choice of the explanatory variable for a split is made before the split point itself is searched (variable selection and split selection are separate steps).
- The association between each predictor variable and the target variable is calculated with the ANOVA F test or the Levene test (for continuous or ordinal predictors), or with the Pearson Chi-square test (for nominal predictors).
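A rough sketch of these association tests with scipy (the data are simulated for illustration only):

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway, levene

rng = np.random.default_rng(0)
target = np.repeat(["a", "b"], 50)
continuous = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 50)])
nominal = rng.choice(["x", "y"], size=100)

# Continuous (or ordinal) predictor: ANOVA F on the class means, Levene on the variances.
groups = [continuous[target == c] for c in ["a", "b"]]
print(f_oneway(*groups).pvalue)
print(levene(*groups).pvalue)

# Nominal predictor: Pearson Chi-square on the contingency table with the target.
table = np.array([[np.sum((nominal == v) & (target == c)) for c in ["a", "b"]]
                  for v in ["x", "y"]])
_, p, _, _ = chi2_contingency(table)
print(p)
```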
Random Forest
Instead of a single tree, the Random Forest method employs a set of decision trees.
Each tree is built on an appropriate resampling of the data (a bootstrap sample of N cases drawn with replacement) and on a random subset of the predictor variables.
X trees are estimated in this way, and the final classification is the one suggested by the majority of the trees (majority vote).
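A minimal sketch with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped trees
    max_features="sqrt",  # size of the random subset of predictors tried at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # predictions are the majority vote of the trees
```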
Pros and cons of Random Forest
Pros:
- High predictive performance/Generalisation
- Parameters easy to choose
Cons:
- Complexity/Processing time
- More difficult interpretation
XGBoost
The forest of trees is estimated sequentially, such that each new tree takes into account the prediction errors of the previous tree.
Pros:
- Computational speed
- Reduction of overfitting problems
- Ability to easily define custom objective functions.
Objective function of XGBoost: Loss + Regularisation
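A minimal sketch with the xgboost package (the dataset and parameter values are illustrative only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,   # trees added sequentially, each one correcting the previous errors
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,
    reg_lambda=1.0,     # L2 regularisation term in the objective
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```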
Ensemble, Bagging, Boosting
- Ensemble: collection of models that are combined (e.g. with some kind of mean) in order to improve the final accuracy.
Types of Ensembles:
- Bagging: each predictor is independent and their outputs are combined by averaging or voting.
- Boosting: each predictor is some kind of improvement over the previous iteration.
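A small sketch contrasting the two types with scikit-learn (the dataset and parameters are illustrative): bagging averages independently fitted trees, while boosting fits each tree on the shortcomings of the previous ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())   # independent trees combined by voting
print(cross_val_score(boosting, X, y, cv=5).mean())  # sequential trees, each improving on the last
```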