L4 - Decision Tree Flashcards

1
Q

What are the learning objectives for classification using Decision Trees?

A

Describe the concept of a decision tree
Explain how to build a decision tree
Analyse strengths and weaknesses
Apply decision trees

2
Q

What are the lesson outlines?

A

What is a decision tree
Building/growing a tree (e.g. information gain and entropy)
Pruning the tree
Improving the tree (e.g. ensemble method: AdaBoost)

3
Q

Define a decision tree?

A

Uses a tree structure to model the relationships among the features and the outcome

4
Q

Why use a decision tree over other classifiers?

A

(1) Classification mechanism needs to be transparent (e.g. a credit scoring process)
We need to understand the classification rules, decision process and criteria that were used to reach that decision (e.g. to prevent bias and provide a transparent process)
(2) Results need to be shared with others for future business practice

5
Q

Key idea for DT?

A

Divide and conquer

6
Q

How to build a decision tree?

A

(1) Split the data into subsets (by feature)
(2) Split those subsets repeatedly into smaller subsets
(3) Repeat the process until the data within the subsets are sufficiently homogeneous (e.g. most samples at the node are of the same class, or a predefined size limit is reached)

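A minimal sketch of growing a tree in code, assuming scikit-learn is available (the dataset and parameter values are illustrative, not part of the card):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data set; any labelled feature table would do
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" chooses splits by information gain; max_depth acts as a predefined size limit
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)   # divide and conquer: split until nodes are homogeneous or the limit is hit
print(export_text(tree))     # the learned splitting rules, one branch per line
print("test accuracy:", tree.score(X_test, y_test))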
7
Q

What does sufficiently homogeneous mean for a DT?

A
Most samples at the node have the same class
There are no remaining features to distinguish among samples
The predefined size limit is reached
8
Q

What is entropy?

A

Measurement of uncertainty in a data set

9
Q

Give the equation for entropy? Explain the components?

A
Entropy(S) = Σ over the c classes of -p_i * log2(p_i)
S = the data set
c = the number of classes
p_i = the proportion of samples belonging to class i

Each class contributes one -p_i * log2(p_i) term, so with more classes the sum has more terms.

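A small helper that mirrors the formula, written as a sketch (the function name is mine):

import math

def entropy(proportions):
    # Entropy(S) = sum over the c classes of -p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: two equally likely classes, maximum uncertainty
print(entropy([0.9, 0.1]))   # about 0.47: one class dominates, so less uncertainty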
10
Q

Explain entropy with the following information: a set of data S has two classes, red (60%) and blue (40%). Use a graph to explain?

A
Entropy(S) = -0.6 * log2(0.6) - 0.4 * log2(0.4) = 0.97
0.6 is the proportion of the red class and 0.4 the proportion of the blue class
Steps - (a) split on the feature with the largest information gain (b) measure the entropy within each resulting subset
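A quick check of the arithmetic above:

import math

# 60% red, 40% blue
print(round(-0.6 * math.log2(0.6) - 0.4 * math.log2(0.4), 2))   # 0.97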
11
Q

Interpret an entropy value?

A

A measure of uncertainty between 0 and 1 (for two classes)
0 means no uncertainty (complete certainty)
1 means completely uncertain

12
Q

Explain entropy for 2 classes or n classes?

A

With two classes, entropy ranges from 0 to 1

With n classes, entropy ranges from 0 to log2(n)

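A quick numeric check that n equally likely classes give the maximum entropy log2(n) (n = 4 is an arbitrary choice):

import math

n = 4
uniform = [1 / n] * n                            # n equally likely classes
print(-sum(p * math.log2(p) for p in uniform))   # 2.0
print(math.log2(n))                              # 2.0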
13
Q

How does entropy link to homogeneity?

A

An entropy of 0 means the set is completely homogeneous

An entropy of 1 means the set is completely heterogeneous (completely diverse)

14
Q

Explain entropy and class X in an X Y sample graphically?

A

Entropy - the level of uncertainty (the opposite of homogeneity)
X - the proportion of class X in the sample
If X is 100% of the sample, entropy is 0 because there is complete certainty
If X is 0% of the sample, entropy is 0 because there is complete certainty
If X is 50% of the sample, entropy is 1 because this is the largest possible amount of uncertainty in the data set

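A sketch of the curve described above, assuming numpy and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)                # proportion of class X (avoiding log2 of 0)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # two-class entropy

plt.plot(p, H)
plt.xlabel("proportion of class X in the sample")
plt.ylabel("entropy")
plt.title("Entropy is 0 at 0% and 100%, and peaks at 1 when the split is 50/50")
plt.show()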
15
Q

How to measure homogeneity?

A

Use entropy as a measure of uncertainty
Uncertainty = heterogeneity
Certainty = homogeneity

16
Q

How to find the features to split into subsets?

A

Information gain - The feature with the highest information gain will be split first
Partition - which straight-line partition of the dataset will result in the largest information gain
Maximise - where can we put a line on X or Y that results in the largest IG

17
Q

Explain information gain steps (graphically)?

A

(1) Line - on a graph, where can we draw a straight line to split the data so as to maximise information gain (the line will be some X = or Y = value)
(2) Branch - the line creates subsets called branches; count the class membership of the samples inside each
(3) Entropy - calculate the entropy within each subset or branch

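A self-contained sketch of these steps on a one-dimensional toy example (the function names and data are mine): try candidate split lines, measure the entropy of each branch, and keep the line with the largest information gain.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def best_split(xs, labels):
    parent_entropy = entropy(labels)
    best = None
    for threshold in sorted(set(xs)):
        left = [label for x, label in zip(xs, labels) if x <= threshold]
        right = [label for x, label in zip(xs, labels) if x > threshold]
        if not left or not right:
            continue  # the line must actually partition the data
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - weighted
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

xs = [1, 2, 3, 4, 5, 6]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(best_split(xs, labels))   # (3, 1.0): the line x <= 3 removes all uncertainty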
18
Q

What is the information gain?

A

The difference in entropy before and after splitting or sub-setting the data

19
Q

What are the information gain formulas?

A

For each data set (node): Entropy(S) = Σ over the c classes of -p_i * log2(p_i)

Information Gain = Entropy(before split) - Entropy(after split)
                 = Entropy(before split) - weighted average of Entropy(children)

Entropy(after split) = weighted average of the children's entropies, each child weighted by its share of the observations

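The formulas as code, in a minimal sketch (the helper names are mine):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i * log2(p_i)
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    # Entropy(before split) minus the weighted average entropy of the children
    n = len(parent)
    entropy_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - entropy_after

parent = ["red"] * 6 + ["blue"] * 4        # entropy before the split is about 0.97
children = [["red"] * 6, ["blue"] * 4]     # a perfect split into two pure branches
print(information_gain(parent, children))  # about 0.97: all of the uncertainty is removed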
20
Q

What are children?

A

The subsets (branches) created by a split; their observation counts together make up 100% of the parent node's observations

21
Q

What is the weighted average entropy formula?

A

Weighted average entropy (after the split) = Σ over the children of (observations in the child / total observations) x Entropy(child)

22
Q

What is chopping in decision trees?

A

Chopping refers to the partitions (splits) of the data

The more chopping, the smaller the branches become

23
Q

What is pruning the tree?

A

Too large - prune the tree when the model is overfitted or the tree is too large
Prune - cut down the tree so it generalises better to unseen data
Methods - (a) early stopping or pre-pruning (b) post-pruning

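A sketch of both pruning styles using scikit-learn (the dataset and parameter values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (a) Early stopping / pre-pruning: limit the tree while it is being grown
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# (b) Post-pruning: grow the tree, then cut it back with cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

for name, model in [("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "test accuracy:", model.score(X_test, y_test))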
24
Q

What is overfitting?

A

When the model doesn’t generalise well from our training data to unseen data.
The model is too closely fit to a narrow sample (too specific to training data).
Learns training data in too much detail

25
Q

Use mock exam analogy for overfitting?

A

You just learn math formulas off by heart but do not try to understand what they mean
Perform really well answering familiar textbook questions
When it comes to applying the information in the exam you struggle because the questions are unfamiliar

26
Q

How does a decision tree make a classification?

A

Training (building the tree) - predicted values are stored in the leaf nodes
Testing - the predicted value stored in the leaf node a sample reaches is revealed as the final prediction

27
Q

Explain classification steps? (use two features)

A

(1) Ask questions to determine the scores on each axis (e.g. high/low budget or high/low celebrities)
(2) Each classification is made by the majority class of observations in each branch
(3) Check the new sample against the decision tree

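A toy hard-coded version of such a two-feature tree (the thresholds and class labels are hypothetical, just to show a new sample being checked against the decision rules):

def classify_film(budget_millions, number_of_celebrities):
    # Question 1: high or low budget?  Question 2: many or few celebrities?
    high_budget = budget_millions >= 50              # hypothetical cut-off
    many_celebrities = number_of_celebrities >= 3    # hypothetical cut-off
    if high_budget:
        # each leaf returns the majority class of its branch
        return "hit" if many_celebrities else "moderate success"
    return "flop" if many_celebrities else "niche film"

# Check a new sample against the decision tree
print(classify_film(budget_millions=80, number_of_celebrities=5))   # "hit"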
28
Q

Explain classification: (a) KNN (b) Naïve Bayes (c) Decision tree

A
KNN (lazy) - does not build a model; it assigns the class label of the nearest neighbours at classification time
Naïve Bayes - uses a likelihood table to calculate the probability of each class and picks the class with the highest probability
Decision tree - the data is partitioned and the predicted value stored in the leaf node is the final prediction
29
Q

Strengths of decision tree?

A

Results can be easily interpreted
More efficient than other complex models
Can be used on small and large datasets

30
Q

Weaknesses of decision tree?

A

Small changes in training data can result in large changes to the decision logic (the tree is limited by its history of past decisions)
Easy to overfit / not robust to noise (susceptible to noise due to the sequential nature of tree building)

31
Q

What is the ensemble method called AdaBoost?

A

Adaptive boosting
Makes prediction based on a number of different smaller models
Less sensitive to noise and bias - using smaller trees means noise has less of an effect on the final predictions

32
Q

How does AdaBoosting work?

A

Key idea - combines output of smaller trees rather than use one large tree which is susceptible to noise
Subset - rather than use the whole sample for one decision tree, it splits the data into smaller subsets and combines the results of these decision trees using a weighted sum to represent the final output
Weak learners - output of the weak learners is combined into a weighted sum that represents the final output of the boosted classifier

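A minimal sketch using scikit-learn's AdaBoostClassifier (the weak learner here is a depth-1 tree, i.e. a decision stump; data and parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many small trees combined into a weighted sum, rather than one large tree
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)   # each new stump focuses on the samples the previous ones got wrong
print("test accuracy:", boosted.score(X_test, y_test))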
33
Q

Difference between boosting and bagging?

A

Bagging - Training a bunch of individual models in a parallel way. Each model is trained on a random subset of the data
Boosting - Training a bunch of individual models in a sequential way. Each individual model learns from mistakes made by the previous model.

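The contrast in code, as a sketch with scikit-learn (the model choices and data are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent trees trained in parallel, each on a random subset of the data
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)

# Boosting: trees trained sequentially, each learning from the mistakes of the previous ones
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))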
34
Q

How is AdaBoosting similar to distribution of sample means?

A

Instead of one large tree which is susceptible to noise you build many smaller trees
These smaller trees will each see a mixture of clean and noisy data, but the errors tend to average out
Similar concept to the distribution of sample means, where extreme (noisy) observations are diluted by the more moderate results

35
Q

What is the only way decision trees can split data?

A

Divide and conquer through axis-parallel splits (e.g. cannot split diagonally)

36
Q

Explain a weakness of decision trees?

A

Because of axis-parallel splitting, decision trees have a tendency to become overly complex and to overfit the training data.

A tree can grow too large, dividing and conquering on every feature