Regression Trees Flashcards
Gini Index
Impurity measure of node t.
i(t) = 1 - sum_j (p_jt)^2
p_jt is the relative frequency of class j at node t.
The Gini index of a split is the sum of the i(t) values of the child nodes, each weighted by the relative number of cases in that node.
We choose the attribute that provides the smallest Gini split measure.
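A minimal sketch of these two computations (the helper names and toy labels below are illustrative, not from the source):

```python
import numpy as np

def gini(labels):
    """Gini impurity i(t) = 1 - sum_j (p_jt)^2 of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(groups):
    """Gini index of a split: i(t) of each child node weighted by its share of cases."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)

# A split that separates the classes well has a lower Gini index than the parent node.
left, right = ["a", "a", "a", "b"], ["b", "b", "b", "a"]
print(gini_split([left, right]))   # 0.375
print(gini(left + right))          # 0.5 before the split
```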
Information gain
Based on entropy, which measures the homogeneity of a node (a pure node has zero entropy).
Entropy:
i(t) = - sum_j (p_jt * log(p_jt))
The entropy gain of a split is the difference between the entropy before the split and the sum of the entropies of the nodes after the split, weighted by their relative frequencies.
We choose the split that achieves the greatest reduction in entropy, i.e. the one that maximises the gain.
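A matching sketch for entropy and the gain of a split (log base 2 is assumed here; the names and toy data are illustrative):

```python
import numpy as np

def entropy(labels):
    """Entropy i(t) = -sum_j p_jt * log2(p_jt) of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy of the nodes after it."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["a"] * 4 + ["b"] * 4
children = [["a", "a", "a", "b"], ["b", "b", "b", "a"]]
print(information_gain(parent, children))   # about 0.19 bits
```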
Split Info and Gain Ratio
Split info is minus the sum, over the child nodes, of the relative number of cases in each node times the log of that relative number: SplitInfo = - sum_i (n_i/n) * log(n_i/n).
Gain ratio is the entropy gain divided by the split info.
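Continuing the sketch from the information-gain card above, split info and gain ratio could look like this (same illustrative data):

```python
def split_info(children):
    """SplitInfo = -sum_i (n_i/n) * log2(n_i/n), where n_i is the size of child node i."""
    n = sum(len(c) for c in children)
    return -sum(len(c) / n * np.log2(len(c) / n) for c in children)

def gain_ratio(parent, children):
    """Gain ratio = entropy gain / split info."""
    return information_gain(parent, children) / split_info(children)

print(gain_ratio(parent, children))   # equals the gain here: a 50/50 split has split info = 1
```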
Stop criteria
- Minimum size of groups
- Minimum non-homogeneity of parent group
- Maximum number of iterations
- Minimum explanatory power
- Pruning
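As a rough illustration, these criteria map onto the parameters of scikit-learn's CART-style DecisionTreeClassifier (the numeric values below are placeholders, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=20,         # minimum size of the groups (leaf nodes)
    min_samples_split=40,        # a parent group must be large enough to be split at all
    max_depth=5,                 # caps the number of successive splits
    min_impurity_decrease=0.01,  # minimum explanatory power required from a split
    ccp_alpha=0.001,             # cost-complexity pruning strength
)
```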
CART
Characteristics:
- Variables: 1 dependent variable (quantitative or qualitative), mixed explanatory variables.
- Split type: Binary.
- Splitting rule: based on impurities. All possible binary subdivisions of the explanatory variables are considered, and the one that produces the maximum reduction of impurity is chosen (see the sketch after this list).
- Stop rule: based on the number of cases in the leaf nodes, with optimisation based on pruning.
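A small sketch of the splitting rule, assuming a single numeric explanatory variable (the function name and data are made up): every binary subdivision is tried and the one with the maximum reduction of impurity is kept.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_split(x, y):
    """Try every binary subdivision x <= threshold and keep the one with the
    largest reduction of impurity with respect to the parent node."""
    parent_impurity = gini(y)
    best_threshold, best_reduction = None, 0.0
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent_impurity - weighted > best_reduction:
            best_threshold, best_reduction = threshold, parent_impurity - weighted
    return best_threshold, best_reduction

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array(["a", "a", "a", "b", "b", "b"])
print(best_binary_split(x, y))   # threshold 3.0 gives the maximum impurity reduction (0.5)
```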
CHAID
- Variables: Qualitative dependent variable.
- Split type: can be on binary or multiple nodes.
- Splitting rule: based on the Chi-square test for the null hypothesis of statistical independence between the dependent variable and the explanatory variable.
- Stop rule: explicit, and must relate to the maximum dimension of the tree, the maximum number of levels, or the minimum number of elements in a node.
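A minimal sketch of this variable choice (the data frame and column names are made up): each qualitative predictor is tested for independence from the dependent variable, and the one with the smallest p-value is preferred.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "outcome": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "region":  ["N",   "N",   "S",  "S",  "N",   "S",  "N",   "S"],
    "segment": ["A",   "B",   "A",  "B",  "A",   "B",  "A",   "B"],
})

p_values = {}
for predictor in ["region", "segment"]:
    table = pd.crosstab(df[predictor], df["outcome"])  # contingency table with the target
    _, p, _, _ = chi2_contingency(table)
    p_values[predictor] = p

print(min(p_values, key=p_values.get))   # "region": most strongly associated with the outcome
```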
C4.5/C5.0
Similar to CART, but differs in the following respects:
- The segmentation of the nodes is not necessarily binary (multiway splits are allowed).
- Predictors and their values are selected based on information gain.
- The stop criterion is pruning, based on assigning an expected error to each leaf.
QUEST
Quick, Unbiased, Efficient, Statistical Tree
- Binary splits
- The choice of the explanatory variable for a split is made before the split point itself is searched (variable selection and split selection are separate steps).
- The association between each predictor variable and the target variable is calculated with the ANOVA F test or the Levene test (for continuous or ordinal predictors), or with the Pearson Chi-square test (for nominal predictors).
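A rough sketch of these association tests with scipy (the data are simulated for illustration only):

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway, levene

rng = np.random.default_rng(0)
target = np.repeat(["a", "b"], 50)
continuous = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 50)])
nominal = rng.choice(["x", "y"], size=100)

# Continuous (or ordinal) predictor: ANOVA F on the class means, Levene on the variances.
groups = [continuous[target == c] for c in ["a", "b"]]
print(f_oneway(*groups).pvalue)
print(levene(*groups).pvalue)

# Nominal predictor: Pearson Chi-square on the contingency table with the target.
table = np.array([[np.sum((nominal == v) & (target == c)) for c in ["a", "b"]]
                  for v in ["x", "y"]])
_, p, _, _ = chi2_contingency(table)
print(p)
```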
Random Forest
Instead of a single tree, the Random Forest method employs a set of decision trees.
Each tree is built on an appropriate resampling of the data (a bootstrap sample of N cases drawn with replacement) and on a random subset of the predictor variables.
X trees are estimated in this way, and the final classification is the one suggested by the majority of the trees (majority vote).
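A minimal sketch with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped trees
    max_features="sqrt",  # size of the random subset of predictors tried at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # predictions are the majority vote of the trees
```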
Pros and cons of Random Forest
Pros:
- High predictive performance/Generalisation
- Parameters easy to choose
Cons:
- Complexity/Processing time
- More difficult interpretation
XGBoost
The forest of trees is estimated sequentially, such that each new tree takes into account the prediction errors of the previous tree.
Pros:
- Computational speed
- Reduction of overfitting problems
- Ability to easily define custom objective functions.
Objective function of XGBoost: Loss + Regularisation
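A minimal sketch with the xgboost package (the dataset and parameter values are illustrative only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,   # trees added sequentially, each one correcting the previous errors
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,
    reg_lambda=1.0,     # L2 regularisation term in the objective
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```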
Ensemble, Bagging, Boosting
- Ensemble: collection of models that are combined (e.g. with some kind of mean) in order to improve the final accuracy.
Types of Ensembles:
- Bagging: each predictor is independent and their outputs are combined by averaging or voting.
- Boosting: each predictor is some kind of improvement over the previous iteration.
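A small sketch contrasting the two types with scikit-learn (the dataset and parameters are illustrative): bagging averages independently fitted trees, while boosting fits each tree on the shortcomings of the previous ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())   # independent trees combined by voting
print(cross_val_score(boosting, X, y, cv=5).mean())  # sequential trees, each improving on the last
```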