Decision Trees Flashcards
Information Gain
Entropy of a distribution, in bits
entropy(p1, p2, ..., pn) = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)
gain(A) = info(D) - info_A(D)
info(D) = information (entropy) before the split
info_A(D) = information after splitting on A = |D1|/|D| * info(D1) + |D2|/|D| * info(D2) + ...
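A minimal Python sketch of these formulas (the function names and the rows/labels data layout are assumptions of the sketch, not from the source):

import math
from collections import Counter

def entropy(labels):
    # Entropy in bits: -sum(p * log2(p)) over the class frequencies p.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # gain(A) = info(D) - info_A(D): entropy before the split minus the
    # weighted entropy of the partitions induced by attribute A.
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a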
Highly Branching Attributes
Problematic: IDs, primary keys, ...
Each leaf node becomes pure,
which leads to overfitting.
Information Gain Ratio
Intrinsic information is large when the data is spread evenly over the branches
and small when all the data belongs to one branch.
IntrinsicInfo(S, A) = -sum(|Si|/|S| * log2(|Si|/|S|))
S = all instances; Si = the instances with the i-th value of attribute A
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
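A sketch of the gain ratio built on the helpers above (entropy and information_gain are reused from the earlier sketch; the guard for single-valued attributes is an assumption of this sketch):

def intrinsic_info(rows, attribute):
    # IntrinsicInfo(S, A) = -sum(|Si|/|S| * log2(|Si|/|S|)) over the values of A.
    total = len(rows)
    counts = Counter(row[attribute] for row in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows, labels, attribute):
    # GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A).
    ii = intrinsic_info(rows, attribute)
    if ii == 0:
        return 0.0  # attribute has a single value, so the ratio is undefined
    return information_gain(rows, labels, attribute) / ii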
C4.5
Numerical Attributes
Sort the values together with their class labels.
Evaluate candidate cut points and choose the one with the best information gain.
Sorting costs O(n log n), but a child node can inherit the sort order from its parent in O(n).
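One way to implement the threshold search, reusing the entropy helper from above; testing every midpoint between consecutive distinct values is a common convention and an assumption here, not necessarily the exact C4.5 procedure:

def best_numeric_split(values, labels):
    # Sort value/label pairs once (O(n log n)); each midpoint between two
    # consecutive distinct values is a candidate cut point.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point inside a run of identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain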
Gini Index
Binary splits
gini(D) = 1 - sum(pj^2), where pj is the relative frequency of class j in D.
If D is split on attribute A into D1 and D2:
gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
delta_gini(A) = gini(D) - gini_A(D)
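A small sketch of the Gini computations for a binary split (function names are assumptions of the sketch):

def gini(labels):
    # gini(D) = 1 - sum(pj^2) over the relative class frequencies pj.
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_reduction(labels, left_labels, right_labels):
    # delta_gini(A) = gini(D) - gini_A(D) for a binary split of D into D1 and D2.
    total = len(labels)
    gini_a = (len(left_labels) / total) * gini(left_labels) \
             + (len(right_labels) / total) * gini(right_labels)
    return gini(labels) - gini_a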
Overfitting in Trees
Prepruning.
Do not split if the goodness measure falls below a threshold.
Statistical significance test:
stop when there is no statistically significant association between any attribute and the class.
Postpruning.
Remove subtrees from the fully grown tree.
Use data different from the training data to decide on the best pruned tree.
Subtree raising, subtree replacement.
Error on the training data is not a useful estimator; use a holdout set instead.
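A sketch of postpruning by subtree replacement against a holdout set; the Node structure and helper names are assumptions of this sketch, not a specific library API:

from dataclasses import dataclass, field

@dataclass
class Node:
    # Internal node: `attribute` is the tested attribute index and `children`
    # maps attribute values to subtrees. A leaf has no children and returns
    # `prediction`. `majority` is the majority class of the training examples
    # that reached this node.
    attribute: int = -1
    children: dict = field(default_factory=dict)
    prediction: object = None
    majority: object = None

def predict(node, row):
    if not node.children:
        return node.prediction
    child = node.children.get(row[node.attribute])
    return predict(child, row) if child is not None else node.majority

def holdout_errors(node, rows, labels):
    return sum(predict(node, r) != y for r, y in zip(rows, labels))

def prune(node, rows, labels):
    # Subtree replacement, bottom-up: prune the children first, then replace
    # this subtree with a majority-class leaf if that does not increase the
    # error on the holdout (pruning) set.
    if not node.children:
        return node
    for value, child in list(node.children.items()):
        subset = [(r, y) for r, y in zip(rows, labels) if r[node.attribute] == value]
        if subset:
            sub_rows, sub_labels = map(list, zip(*subset))
            node.children[value] = prune(child, sub_rows, sub_labels)
    leaf = Node(prediction=node.majority, majority=node.majority)
    if holdout_errors(leaf, rows, labels) <= holdout_errors(node, rows, labels):
        return leaf
    return node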
Regression and Model Trees
Regression trees: the prediction is computed as the average of the numerical target variable in the leaf.
Model trees: leaves can contain a linear model to predict the target value.
Impurity measure? Standard deviation reduction: SDR = std(D) - sum(|Di|/|D| * std(Di))
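A sketch of standard deviation reduction, assuming the partitions produced by the split are already available (population standard deviation is used here; the exact variant may differ):

import statistics

def sdr(targets, partitions):
    # SDR = std(D) - sum(|Di|/|D| * std(Di)): how much a split reduces the
    # spread of the numerical target variable.
    total = len(targets)
    weighted = sum(len(p) / total * statistics.pstdev(p) for p in partitions)
    return statistics.pstdev(targets) - weighted

For example, sdr([1, 2, 3, 10, 11, 12], [[1, 2, 3], [10, 11, 12]]) is large, because each partition has far less spread than the whole set.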
Decision stumps
One-level decision trees
Categorical: one branch for each attribute value.
One branch for one value, one branch for all others.
Numerical:
Two leaves defined by a threshold.
Multiple splits
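A sketch of a numeric decision stump that reuses best_numeric_split from the C4.5 section; the majority-class prediction at each leaf, and the assumption that the attribute has at least two distinct values, are simplifications of this sketch:

def fit_numeric_stump(values, labels):
    # One-level tree over a numeric attribute: pick the best threshold, then
    # predict the majority class on each side of it.
    threshold, _ = best_numeric_split(values, labels)
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    left_class = Counter(left).most_common(1)[0][0]
    right_class = Counter(right).most_common(1)[0][0]
    return lambda x: left_class if x <= threshold else right_class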