5: Decision Trees Flashcards

1
Q

What are decision trees?

A

Decision trees represent a group of classification techniques based on the construction of a tree-like structure. This structure is a series of steps, where each step uses the given features one by one to help classify the input object.

2
Q

Where are decision trees used?

A

Image processing and character recognition, medicine, financial analysis, astronomy, manufacturing, production, and molecular biology.

3
Q

Are decision trees SL or UL?

A

SL (supervised learning), since they use labeled training instances to construct a classifier.

4
Q

How does a DT work?

A

TOP-DOWN PROCESS:
1. Select the highest-ranked feature and create a decision node for it.
2. From this node, create a branch for each distinct value (or value range):
− If all instances with this feature value (range) are of the same class, the child node of this branch is a leaf node.
− Else, repeat steps 1 and 2 on that subset.
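A minimal sketch of this top-down loop (not from the card), assuming the caller supplies some hypothetical feature-ranking function `rank_features`; any measure such as information gain could be plugged in:

```python
from collections import Counter

def build_tree(instances, features, rank_features):
    """instances: list of (feature_dict, class_label) pairs; features: feature names."""
    labels = [label for _, label in instances]

    # Leaf node: all instances share one class, or no features are left to split on.
    if len(set(labels)) == 1 or not features:
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    # Step 1: select the highest-ranked feature and make it the decision node.
    best = rank_features(instances, features)
    node = {"feature": best, "branches": {}}

    # Step 2: create one branch per distinct value of that feature and recurse.
    for value in {x[best] for x, _ in instances}:
        subset = [(x, y) for x, y in instances if x[best] == value]
        remaining = [f for f in features if f != best]
        node["branches"][value] = build_tree(subset, remaining, rank_features)
    return node
```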

5
Q

What is the structure of a DT?

A

Nodes (decisions based on features), Branches (IF conditional statements), Leaves (class labels)
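One way to mirror this structure in code (hypothetical names, for illustration only): an internal node stores the splitting feature and its branches, while a leaf stores a class label.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TreeNode:
    feature: Optional[str] = None     # internal node: decision based on this feature
    branches: Dict[str, "TreeNode"] = field(default_factory=dict)  # IF feature == value -> child
    label: Optional[str] = None       # leaf node: predicted class
```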

6
Q

How does a DT handle a dataset that contains more than one feature?

A

For a dataset that contains more than one feature, the decision tree classifier uses a ranking technique to determine each feature's degree of importance for the given classification problem. Accordingly, the classifier selects the most salient feature for the root node and then assigns the remaining features, in decreasing order of importance, to the rest of the tree nodes.

7
Q

How does the complexity of decision rules affect the interpretability and size of a decision tree?

A

A decision tree uses a tree structure to represent decision rules, which makes it easy for experts to understand the reasons behind classifications. However, as the tree adds more rules, it needs more training data, and if there are many features, the rules become more complex. This added complexity can make the tree harder to interpret, reducing its value as a visual tool.

8
Q

What is underfitting?

A

If the classification model is not trained enough, the induced decision tree will be too simple to classify instances accurately.

9
Q

When is a DT model successful?

A

When it is able to generalize.

10
Q

What are some challenges with DTs?

A

The tree may include branches that represent outliers or noise in the input dataset.

11
Q

What are the benefits of DTs?

A

− Easy to interpret, thanks to the natural tree representation (SVMs and neural networks are black-box classifiers whose decision logic is unknown)
− Independent of the statistical distribution of the input data
− The relationship between the features and the class labels can be nonlinear

12
Q

What is pruning?

A
Pruning handles overfitting by decreasing the size of the tree to make it less complex.
− Method: removing sub-trees in the decision tree that have low classification power
13
Q

What are the two types of pruning?

A

Pre-pruning: avoids building low-discriminating sub-trees while the decision tree is being constructed, replacing them with leaf nodes.
Post-pruning: removes spurious sub-trees from the fully constructed decision tree, replacing them with leaf nodes.

14
Q

What are the most popular DT methods?

A

ID3, C4.5, and CART (they differ in how features are selected and how the pruning mechanism is used).

15
Q

What feature-selection techniques do these methods use?

A

ID3 → information gain, C4.5 → gain ratio, CART → Gini index

16
Q

Explain how ID3 and Information Gain work.

A

Iterative Dichotomiser 3 (ID3) uses information gain (IG) to select the best splitting features. To do so, it measures the degree of homogeneity of the classes induced by a decision node.

IG is based on entropy, which measures the randomness (disorder) of the classes before and after splitting on a feature. If a split makes the resulting groups more homogeneous (less random), then IG is high, meaning it is a good split.

So, the lower the entropy after the split, the higher the IG. That is why IG is said to be “inversely proportional” to entropy: as entropy goes down, IG goes up, making the feature a better choice for a split.
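As formulas (the standard definitions): for a dataset $D$ with $c$ classes in proportions $p_i$, and a feature $f$ that splits $D$ into subsets $D_v$,

$$\mathrm{Entropy}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i, \qquad \mathrm{IG}(D,f) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{values}(f)} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v)$$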

17
Q

What types of features can ID3 deal with?

A

Discrete features only. However, ID3 can be applied to regression problems simply by using standard deviation reduction (SDR) instead of IG.
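A common formulation of standard deviation reduction (assumed here, not spelled out on the card): the split that most reduces the standard deviation $\sigma$ of the target values is preferred,

$$\mathrm{SDR}(D,f) = \sigma(D) - \sum_{v \in \mathrm{values}(f)} \frac{|D_v|}{|D|}\,\sigma(D_v)$$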

18
Q

What is the standard deviation?

A

A measure of the degree of variation in a set of numerical values. A feature vector of similar values is considered homogeneous. The standard deviation of a completely homogeneous feature vector is zero.
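For reference, for values $x_1,\dots,x_N$ with mean $\bar{x}$:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}$$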

19
Q

Explain how C4.5 and Gain Ratio work.

A

C4.5 handles the generalization problem that arises when IG is applied to datasets with very high homogeneity (e.g., ID-like features; see the next card).
− It can deal with both continuous and discrete features.
− Method: normalizing the information gain (the gain ratio, shown below).
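The standard C4.5 formulation of this normalization is the gain ratio:

$$\mathrm{GainRatio}(D,f) = \frac{\mathrm{IG}(D,f)}{\mathrm{SplitInfo}(D,f)}, \qquad \mathrm{SplitInfo}(D,f) = -\sum_{v \in \mathrm{values}(f)} \frac{|D_v|}{|D|}\log_2\frac{|D_v|}{|D|}$$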

20
Q

Explain the generalization problem with IG and why the gain ratio might be better.

A

Information Gain (IG) favors features with distinct values, as they often create purer groups. For example, if we use an identifier feature (like a unique ID), each ID is completely distinct, so each split will perfectly separate the data, making entropy zero. This would give a high IG score, but using an ID to split data isn’t useful for generalizing because it only separates based on unique labels without learning patterns.

The C4.5 algorithm fixes this by adjusting IG, normalizing it to prevent features like IDs from dominating splits. This adjusted ranking helps the decision tree focus on features that improve generalization rather than just creating pure splits.

21
Q

What are some advantages of the C4.5 algorithm in decision tree classification?

A

C4.5 can deal with both continuous and discrete features. It also handles missing values and applies tree pruning after tree induction.

22
Q

How do CART and the Gini index work?

A

CART uses the Gini index Gini(D) to measure the impurity of a dataset D. The feature f that maximizes the impurity reduction ΔGini(f) is selected as the splitting feature.
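The standard definitions (with CART's binary splits, partitioning $D$ into $D_1$ and $D_2$):

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{c} p_i^2, \qquad \Delta\mathrm{Gini}(f) = \mathrm{Gini}(D) - \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) - \frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$$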

23
Q

How does pre-pruning work?

A

Pre-pruning stops a decision tree from growing too complex by avoiding branches that don’t add much value. When a certain condition is met, the tree-building process stops adding new decision points and instead creates a “leaf” with the most common class label for that branch. The specific condition for stopping depends on a ranking measure, like information gain, gain ratio, or Gini index. If this measure is too low, meaning the split won’t be useful enough, then no further splits are made.
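An illustrative sketch (not from the card): in scikit-learn, pre-pruning corresponds to growth limits passed to the classifier before fitting; the threshold values below are arbitrary examples.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Growth limits act as pre-pruning: splits that violate them are never made.
clf = DecisionTreeClassifier(
    max_depth=3,                 # stop growing beyond this depth
    min_samples_leaf=5,          # every leaf must cover at least 5 instances
    min_impurity_decrease=0.01,  # skip splits whose impurity reduction is too small
)
clf.fit(X, y)
```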

24
Q

How does post-pruning work?

A

Post-pruning simplifies a fully built decision tree by cutting out unnecessary branches and replacing them with a single leaf showing the most common class. The CART method does this by calculating “cost complexity” for each branch, based on how many leaves it has and its error rate. If replacing a branch with a single leaf reduces complexity without hurting accuracy, that branch is removed.
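An illustrative sketch (not from the card): scikit-learn exposes CART-style cost-complexity post-pruning through the `ccp_alpha` parameter; the choice of alpha below is an arbitrary example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compute the cost-complexity pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick one alpha from the path (here simply the median) and refit a pruned tree;
# larger alphas remove more sub-trees.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
```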

25
Q

How are decision trees used in ensemble methods?

A

In ensemble methods, decision trees are combined to improve accuracy and robustness. Techniques like bagging (e.g., Random Forest) build multiple trees on different data samples and average their results to reduce variance and overfitting. Another method, boosting (e.g., AdaBoost), builds trees sequentially, with each new tree focusing on the errors of the previous ones, which reduces bias. Together, these ensemble approaches make the final model more accurate and generalizable than a single decision tree.
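An illustrative sketch (not from the card) of both ideas in scikit-learn: RandomForestClassifier bags many trees trained on bootstrap samples, while AdaBoostClassifier boosts shallow trees sequentially; the dataset and parameters are arbitrary examples.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: many trees on bootstrap samples, predictions averaged (reduces variance).
bagged = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: shallow trees fitted sequentially, each focusing on previous errors (reduces bias).
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

print("random forest:", cross_val_score(bagged, X, y, cv=5).mean())
print("adaboost:     ", cross_val_score(boosted, X, y, cv=5).mean())
```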