L4 - Decision Tree Flashcards

1
Q

What are the learning objectives for classification using Decision Trees?

A

Describe the concept of a decision tree
Explain how to build a decision tree
Analyse strengths and weaknesses
Apply decision trees

2
Q

What are the lesson outlines?

A

What is a decision tree
Building/growing a tree (e.g. information gain and entropy)
Pruning the tree
Improving the tree (e.g. ensemble method: AdaBoost)

3
Q

Define a decision tree?

A

Uses a tree structure to model the relationships among the features and the outcome

4
Q

Why use a decision tree over other classifiers?

A

(1) Classification mechanism needs to be transparent (e.g. a credit scoring process)
We need to understand the classification rules, decision process and criteria that were used to reach that decision (e.g. to prevent bias and provide a transparent process)
(2) Results need to be shared with others for future business practice

5
Q

Key idea for DT?

A

Divide and conquer

6
Q

How to build a decision tree?

A

(1) Split the data into subsets (by feature)
(2) Split those subsets repeatedly into smaller subsets
(3) Repeat the process until the data within the subsets are sufficiently homogeneous (e.g. most samples at the node are of the same class, or a predefined size limit is reached)

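A minimal sketch of growing a tree in code, assuming scikit-learn is available (the dataset and parameter values are illustrative, not part of the card):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data set; any labelled feature table would do
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" chooses splits by information gain; max_depth acts as a predefined size limit
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)   # divide and conquer: split until nodes are homogeneous or the limit is hit
print(export_text(tree))     # the learned splitting rules, one branch per line
print("test accuracy:", tree.score(X_test, y_test))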
7
Q

What does sufficiently homogeneous mean for a DT?

A
Most samples at the node have the same class
There are no remaining features to distinguish among samples
The predefined size limit is reached
8
Q

What is entropy?

A

Measurement of uncertainty in a data set

9
Q

Give the equation for entropy? Explain the components?

A
Entropy(S) = Σ over the c classes of -p_i * log2(p_i)
S = the data set
c = the number of classes
p_i = the proportion of samples belonging to class i

Each class contributes one -p_i * log2(p_i) term, so with more classes the sum has more terms.

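A small helper that mirrors the formula, written as a sketch (the function name is mine):

import math

def entropy(proportions):
    # Entropy(S) = sum over the c classes of -p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: two equally likely classes, maximum uncertainty
print(entropy([0.9, 0.1]))   # about 0.47: one class dominates, so less uncertainty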
10
Q

Explain entropy with the following information: a set of data S has two classes, red (60%) and blue (40%). Use a graph to explain?

A
Entropy(S) = -0.6 * log2(0.6) - 0.4 * log2(0.4) = 0.97
0.6 is the proportion of the red class and 0.4 the proportion of the blue class
Steps - (a) split on the feature with the largest information gain (b) measure the entropy within each resulting subset
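A quick check of the arithmetic above:

import math

# 60% red, 40% blue
print(round(-0.6 * math.log2(0.6) - 0.4 * math.log2(0.4), 2))   # 0.97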
11
Q

Interpret an entropy value?

A

A measure of uncertainty between 0 and 1 (for two classes)
0 means no uncertainty (complete certainty)
1 means completely uncertain

12
Q

Explain entropy for 2 classes or n classes?

A

With two classes, entropy ranges from 0 to 1

With n classes, entropy ranges from 0 to log2(n)

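A quick numeric check that n equally likely classes give the maximum entropy log2(n) (n = 4 is an arbitrary choice):

import math

n = 4
uniform = [1 / n] * n                            # n equally likely classes
print(-sum(p * math.log2(p) for p in uniform))   # 2.0
print(math.log2(n))                              # 2.0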
13
Q

How does entropy link to homogeneity?

A

An entropy of 0 means the set is completely homogeneous

An entropy of 1 means the set is completely heterogeneous (completely diverse)

14
Q

Explain entropy and class X in an X Y sample graphically?

A

Entropy - the level of uncertainty (the opposite of homogeneity)
X - the proportion of class X in the sample
If X is 100% of the sample, entropy is 0 because there is complete certainty
If X is 0% of the sample, entropy is 0 because there is complete certainty
If X is 50% of the sample, entropy is 1 because this is the largest possible amount of uncertainty in the data set

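A sketch of the curve described above, assuming numpy and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)                # proportion of class X (avoiding log2 of 0)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # two-class entropy

plt.plot(p, H)
plt.xlabel("proportion of class X in the sample")
plt.ylabel("entropy")
plt.title("Entropy is 0 at 0% and 100%, and peaks at 1 when the split is 50/50")
plt.show()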
15
Q

How to measure homogeneity?

A

Use entropy as a measure of uncertainty
Uncertainty = heterogeneity
Certainty = homogeneity

16
Q

How to find the features to split into subsets?

A

Information gain - The feature with the highest information gain will be split first
Partition - which straight-line partition of the dataset will result in the largest information gain
Maximise - where can we put a line on X or Y that results in the largest IG

17
Q

Explain information gain steps (graphically)?

A

(1) Line - on a graph, where can we draw a straight line to split the data so as to maximise information gain (the line will be some X = or Y = value)
(2) Branch - the line creates subsets called branches; count the class membership of the samples inside each
(3) Entropy - calculate the entropy within each subset or branch

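A self-contained sketch of these steps on a one-dimensional toy example (the function names and data are mine): try candidate split lines, measure the entropy of each branch, and keep the line with the largest information gain.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def best_split(xs, labels):
    parent_entropy = entropy(labels)
    best = None
    for threshold in sorted(set(xs)):
        left = [label for x, label in zip(xs, labels) if x <= threshold]
        right = [label for x, label in zip(xs, labels) if x > threshold]
        if not left or not right:
            continue  # the line must actually partition the data
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - weighted
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

xs = [1, 2, 3, 4, 5, 6]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(best_split(xs, labels))   # (3, 1.0): the line x <= 3 removes all uncertainty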
18
Q

What is the information gain?

A

The difference in entropy before and after splitting or sub-setting the data

19
Q

What are the information gain formulas?

A

For each data set (node): Entropy(S) = Σ over the c classes of -p_i * log2(p_i)

Information Gain = Entropy(before split) - Entropy(after split)
                 = Entropy(before split) - weighted average of Entropy(children)

Entropy(after split) = weighted average of the children's entropies, each child weighted by its share of the observations

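The formulas as code, in a minimal sketch (the helper names are mine):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i * log2(p_i)
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    # Entropy(before split) minus the weighted average entropy of the children
    n = len(parent)
    entropy_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - entropy_after

parent = ["red"] * 6 + ["blue"] * 4        # entropy before the split is about 0.97
children = [["red"] * 6, ["blue"] * 4]     # a perfect split into two pure branches
print(information_gain(parent, children))  # about 0.97: all of the uncertainty is removed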
20
Q

What are children?

A

The subsets (branches) created by a split; their observation counts together make up 100% of the parent node's observations

21
Q

What is the weighted average entropy formula?

A

Weighted average entropy (after the split) = Σ over the children of (observations in the child / total observations) x Entropy(child)

22
Q

What is chopping in decision trees?

A

Chopping refers to the partitions (splits) of the data

The more chopping, the smaller the branches become

23
Q

What is pruning the tree?

A

Too large - prune the tree when the model is overfitted or the tree is too large
Prune - cut down the tree so it generalises better to unseen data
Methods - (a) early stopping or pre-pruning (b) post-pruning

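A sketch of both pruning styles using scikit-learn (the dataset and parameter values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (a) Early stopping / pre-pruning: limit the tree while it is being grown
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# (b) Post-pruning: grow the tree, then cut it back with cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

for name, model in [("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "test accuracy:", model.score(X_test, y_test))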
24
Q

What is overfitting?

A

When the model doesn’t generalise well from our training data to unseen data.
The model is too closely fit to a narrow sample (too specific to training data).
Learns training data in too much detail

25
Q

Use mock exam analogy for overfitting?

A

You just learn math formulas off by heart but do not try to understand what they mean
Perform really well answering familiar textbook questions
When it comes to applying the information in the exam you struggle because the questions are unfamiliar

26
Q

How does a decision tree make a classification?

A

Training (building the tree) - predicted values are stored in the leaf nodes
Testing - the predicted value stored in the leaf node a sample reaches is revealed as the final prediction

27
Q

Explain classification steps? (use two features)

A

(1) Ask questions to determine the scores on each axis (e.g. high/low budget or high/low celebrities)
(2) Each classification is made by the majority class of observations in each branch
(3) Check the new sample against the decision tree

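A toy hard-coded version of such a two-feature tree (the thresholds and class labels are hypothetical, just to show a new sample being checked against the decision rules):

def classify_film(budget_millions, number_of_celebrities):
    # Question 1: high or low budget?  Question 2: many or few celebrities?
    high_budget = budget_millions >= 50              # hypothetical cut-off
    many_celebrities = number_of_celebrities >= 3    # hypothetical cut-off
    if high_budget:
        # each leaf returns the majority class of its branch
        return "hit" if many_celebrities else "moderate success"
    return "flop" if many_celebrities else "niche film"

# Check a new sample against the decision tree
print(classify_film(budget_millions=80, number_of_celebrities=5))   # "hit"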
28
Q

Explain classification: (a) KNN (b) Naïve Bayes (c) Decision tree

A
KNN (lazy) - does not build a model; it assigns the class label of the nearest neighbours at classification time
Naïve Bayes - uses a likelihood table to calculate the probability of each class and picks the class with the highest probability
Decision tree - the data is partitioned and the predicted value stored in the leaf node is the final prediction
29
Q

Strengths of decision tree?

A

Results can be easily interpreted
More efficient than other complex models
Can be used on small and large datasets

30
Q

Weaknesses of decision tree?

A

Small changes in training data can result in large changes to the decision logic (the tree is limited by its history of past decisions)
Easy to overfit / not robust to noise (susceptible to noise due to the sequential nature of tree building)

31
Q

What is the ensemble method called AdaBoost?

A

Adaptive boosting
Makes prediction based on a number of different smaller models
Less sensitive to noise and bias - using smaller trees means noise has less of an effect on the final predictions

32
Q

How does AdaBoosting work?

A

Key idea - combines output of smaller trees rather than use one large tree which is susceptible to noise
Subset - rather than use the whole sample for one decision tree, it splits the data into smaller subsets and combines the results of these decision trees using a weighted sum to represent the final output
Weak learners - output of the weak learners is combined into a weighted sum that represents the final output of the boosted classifier

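A minimal sketch using scikit-learn's AdaBoostClassifier (the weak learner here is a depth-1 tree, i.e. a decision stump; data and parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many small trees combined into a weighted sum, rather than one large tree
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)   # each new stump focuses on the samples the previous ones got wrong
print("test accuracy:", boosted.score(X_test, y_test))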
33
Q

Difference between boosting and bagging?

A

Bagging - Training a bunch of individual models in a parallel way. Each model is trained on a random subset of the data
Boosting - Training a bunch of individual models in a sequential way. Each individual model learns from mistakes made by the previous model.

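The contrast in code, as a sketch with scikit-learn (the model choices and data are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent trees trained in parallel, each on a random subset of the data
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)

# Boosting: trees trained sequentially, each learning from the mistakes of the previous ones
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))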
34
Q

How is AdaBoosting similar to distribution of sample means?

A

Instead of one large tree which is susceptible to noise you build many smaller trees
These smaller trees will each see a mixture of clean and noisy data, but the errors tend to average out
Similar concept to the distribution of sample means, where extreme (noisy) observations are diluted by the more moderate results

35
Q

What is the only way decision trees can split data?

A

Divide and conquer through axis-parallel splits (e.g. cannot split diagonally)

36
Q

Explain a weakness of decision trees?

A

Because of axis-parallel splitting, decision trees have a tendency to become overly complex and to overfit the training data.

A tree can grow too large, dividing and conquering on every feature