Decision Trees Flashcards

1
Q

Information Gain

A

Entropy of a distribution, measured in bits:
entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
gain(A) = info(D) - info_A(D)
info(D) = information (entropy) before the split
info_A(D) = information after splitting on A = |D1|/|D| * info(D1) + |D2|/|D| * info(D2) + ...
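
A minimal Python sketch of these two formulas (the helper names entropy and info_gain are mine, not from the cards):

import math
from collections import Counter

def entropy(labels):
    # Entropy in bits of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_lists):
    # gain(A) = info(D) - info_A(D), where the children partition the parent.
    n = len(parent_labels)
    info_after = sum(len(c) / n * entropy(c) for c in child_label_lists if c)
    return entropy(parent_labels) - info_after

# Illustrative data: 9 "yes" / 5 "no", split into two branches.
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(info_gain(parent, children))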

2
Q

Highly Branching

A

Attributes with many distinct values (IDs, primary keys, ...) are problematic:
each leaf node becomes pure (often one record per leaf), so information gain is maximal,
but the split does not generalize: overfitting.
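
A toy demonstration, reusing the info_gain sketch from card 1: splitting on a unique ID drives info_A(D) to zero, so the gain equals the full entropy of the parent even though the split has no predictive value.

id_children = [[label] for label in parent]   # one pure, single-record branch per ID
print(info_gain(parent, id_children))         # == entropy(parent): maximal gain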

3
Q

Information Gain Ratio

A

Intrinsic information is large when data is spread evenly over the branches
and small when all data belong to one branch.
IntrinsicInfo(S, A) = -sum(|Si|/|S| * log2(|Si|/|S|))
S = all instances, Si = the instances taking the i-th value of attribute A
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
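
Extending the card 1 sketch (intrinsic_info and gain_ratio are my names), note how the gain ratio penalizes the highly branching ID split from card 2:

def intrinsic_info(child_label_lists):
    # IntrinsicInfo = -sum(|Si|/|S| * log2(|Si|/|S|)) over the branches.
    n = sum(len(c) for c in child_label_lists)
    return -sum((len(c) / n) * math.log2(len(c) / n) for c in child_label_lists if c)

def gain_ratio(parent_labels, child_label_lists):
    ii = intrinsic_info(child_label_lists)
    return info_gain(parent_labels, child_label_lists) / ii if ii else 0.0

# The ID split from card 2 has intrinsic info log2(14) ~ 3.81,
# so its gain ratio is far smaller than its raw gain.
print(gain_ratio(parent, id_children))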

4
Q

C4.5

A

Handling numerical attributes:
Sort the values together with their class labels.
Evaluate the candidate cut points and choose the one with the best information gain.
Sorting costs O(n log n), but only has to be done once: each child's sorted order
can be derived from the parent's in O(n).
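
A naive sketch of the threshold search, reusing info_gain from card 1. Rescanning every cut point like this is O(n^2) overall; a real C4.5 keeps running class counts so each cut is evaluated in O(1) after the single O(n log n) sort.

def best_numeric_split(values, labels):
    # Sort values with their labels, then scan midpoints between distinct values.
    pairs = sorted(zip(values, labels))
    all_labels = [l for _, l in pairs]
    best_gain, best_cut = float("-inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        g = info_gain(all_labels, [all_labels[:i], all_labels[i:]])
        if g > best_gain:
            best_gain, best_cut = g, cut
    return best_gain, best_cut

print(best_numeric_split([64, 65, 68, 69, 70, 71], ["yes", "no", "yes", "yes", "yes", "no"]))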

5
Q

Gini Index

A

Used for binary splits.

gini(T) = 1 - sum(pj^2)
pj is the relative frequency of class j in T
If D is split on attribute A into D1 and D2, then:
gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
dif_gini(A) = gini(D) - gini_A(D); the attribute with the largest reduction is chosen.
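
A self-contained sketch of these formulas (helper names are mine):

from collections import Counter

def gini(labels):
    # gini(T) = 1 - sum(pj^2), pj the relative frequency of class j.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(d1, d2):
    # dif_gini(A) = gini(D) - gini_A(D) for the binary split D -> D1, D2.
    d = d1 + d2
    gini_a = len(d1) / len(d) * gini(d1) + len(d2) / len(d) * gini(d2)
    return gini(d) - gini_a

print(gini_reduction(["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3))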
6
Q

Overfitting in Trees

A

Prepruning:
Do not split if the goodness measure falls below a threshold.
Statistical significance test: stop when there is no statistically significant
association between the attribute and the class.
Postpruning:
Remove subtrees from the fully grown tree.
Use data different from the training data to decide on the best pruned tree.
Subtree raising, subtree replacement.
Error on the training data is not a useful estimator; use a holdout set.
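
A sketch of the significance-test variant of prepruning. The cards do not name a specific test; a chi-square test on the branch-by-class contingency table is a common choice (assumes scipy; the counts below are illustrative):

from scipy.stats import chi2_contingency

def split_is_significant(contingency_table, alpha=0.05):
    # Rows = branches of the candidate split, columns = class counts.
    # Keep the split only if the attribute/class association is significant.
    _, p_value, _, _ = chi2_contingency(contingency_table)
    return p_value < alpha

print(split_is_significant([[6, 2], [3, 3]]))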

7
Q

Regression and Model Trees

A

Regression trees: the prediction at a leaf is the average of the numerical target variable.
Model trees: leaves can contain a linear model to predict the target value.
Impurity measure: standard deviation reduction,
SDR = std(D) - sum(|Di|/|D| * std(Di))
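
A small sketch of the SDR formula (the name sdr is mine; uses the population standard deviation):

import statistics

def sdr(child_value_lists):
    # SDR = std(D) - sum(|Di|/|D| * std(Di)), where D is the union of the children.
    d = [v for child in child_value_lists for v in child]
    after = sum(len(c) / len(d) * statistics.pstdev(c) for c in child_value_lists)
    return statistics.pstdev(d) - after

print(sdr([[4.5, 5.0, 5.5], [9.0, 10.0, 11.0]]))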

8
Q

Decision stumps

A

One-level decision trees.
Categorical attribute: one branch for each attribute value,
or one branch for one value and one branch for all others.
Numerical attribute:
two leaves defined by a threshold,
or multiple splits.
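
A sketch of the numerical two-leaf variant, reusing best_numeric_split from card 4 and Counter from card 5 (names mine): the stump is a threshold plus a majority-class prediction on each side.

def fit_numeric_stump(values, labels):
    _, threshold = best_numeric_split(values, labels)
    below = [l for v, l in zip(values, labels) if v <= threshold]
    above = [l for v, l in zip(values, labels) if v > threshold]
    def majority(side):
        return Counter(side).most_common(1)[0][0]
    return threshold, majority(below), majority(above)

# Predict: majority(below) if x <= threshold, else majority(above).
print(fit_numeric_stump([64, 65, 68, 69, 70, 71], ["yes", "no", "yes", "yes", "yes", "no"]))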
