Decision Trees Flashcards

1
Q

What are decision trees? How do we build them? What is the general idea of it?

A

What Are Decision Trees?

A decision tree is a flowchart-like structure used in machine learning for classification and regression tasks. It models decisions and their possible consequences, breaking down data into smaller subsets while simultaneously building an associated tree structure. Each internal node represents a decision (based on a feature), each branch represents an outcome, and each leaf node represents a final prediction.

General Idea of Decision Trees

The primary goal of a decision tree is to split the data into subsets that are as pure as possible with respect to the target variable. The tree is built by repeatedly dividing the dataset based on feature values to maximize predictive power. This process continues until a stopping condition is met (e.g., a maximum tree depth or a minimum number of samples in a leaf).

How to Build a Decision Tree

1.	Start with the Entire Dataset:
•	Begin with the full training dataset at the root of the tree.
2.	Select the Best Feature to Split:
•	Choose a feature that best separates the data into subsets. Metrics like Gini Impurity, Entropy (used in Information Gain), or Variance Reduction (for regression) are commonly used to measure the “purity” of the split.
•	For example, if using Gini Impurity, the goal is to minimize impurity after splitting.
3.	Split the Dataset:
•	Divide the dataset into smaller groups based on the values of the selected feature.
4.	Repeat for Each Subset:
•	For each subset, choose the best feature to split further. Continue this process recursively.
5.	Stop When a Stopping Condition is Met:
•	The recursion stops if:
•	All data in a subset belongs to the same class (for classification).
•	A maximum tree depth is reached.
•	The number of samples in a subset is below a predefined threshold.
6.	Assign a Prediction to Leaf Nodes:
•	For classification: Assign the majority class of the samples in the leaf.
•	For regression: Assign the average value of the samples in the leaf.

Key Characteristics of Decision Trees

•	Intuitive and Interpretable: The tree structure makes it easy to understand and visualize how predictions are made.
•	Nonlinear Decision Boundaries: Trees can model complex relationships between features and the target variable.
•	Prone to Overfitting: Without constraints like maximum depth or minimum samples per leaf, trees can grow too complex and fit noise in the training data.

Practical Example (Classification):

Imagine you’re deciding whether to play tennis based on the weather:
• Root node: Is it sunny or cloudy/rainy?
• If sunny, split further: Is the humidity high or low?
• Continue until the tree classifies the conditions (e.g., “Play” or “Don’t Play”).

In practice, this decision-making process is automated and optimized using algorithms like CART (Classification and Regression Trees) or ID3.
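For illustration, here is a minimal sketch of this workflow in scikit-learn, using a tiny made-up version of the tennis example (the data and the feature encoding are invented for this card, not a standard dataset):

```python
# Minimal sketch: fitting a CART-style tree with scikit-learn.
# The tiny "play tennis" table below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0=sunny, 1=overcast, 2=rainy), humidity (0=low, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [0, 1], [1, 0], [2, 1]]
y = ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"]  # play tennis?

# Gini impurity is the scikit-learn default criterion; max_depth is one stopping condition.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["outlook", "humidity"]))  # readable tree structure
print(tree.predict([[0, 0]]))  # predict for sunny weather with low humidity
```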

2
Q

How do we decide which attributes to use to split the data in decision trees?

A

In decision trees, we decide which attribute to use for splitting the data by evaluating how well each attribute separates the data into subsets that are homogeneous with respect to the target variable. The goal is to maximize the “purity” of the subsets after the split. To do this, we use metrics like Information Gain, Gini Impurity, or Variance Reduction (for regression).

Key Metrics for Splitting

1.	Information Gain (Based on Entropy):
•	Entropy measures the impurity or randomness in a dataset.
•	A pure subset has low entropy (e.g., all samples belong to one class), while a mixed subset has high entropy.
•	Information Gain is the reduction in entropy after splitting the dataset on a given attribute.
•	Steps:
1.	Compute the entropy of the dataset before splitting.
2.	For each potential split, calculate the weighted average entropy of the resulting subsets.
3.	Information Gain = (Entropy before split) - (Weighted entropy after split).
•	The attribute with the highest Information Gain is selected.

2.	Gini Impurity:
•	Gini Impurity measures the likelihood that a randomly chosen sample from a subset would be incorrectly classified if labeled according to the majority class.
•	A pure subset has a Gini Impurity of 0, while a completely mixed subset has a higher Gini Impurity.
•	Steps:
1.	Compute Gini Impurity for the original dataset.
2.	For each potential split, calculate the weighted Gini Impurity of the resulting subsets.
3.	The attribute with the lowest Gini Impurity after splitting is selected.

3.	Variance Reduction (For Regression):
•	In regression trees, the goal is to minimize the variance in the target values within each subset.
•	Steps:
1.	Calculate the variance of the target values in the dataset before splitting.
2.	For each potential split, compute the weighted variance of the target values in the resulting subsets.
3.	Variance Reduction = (Variance before split) - (Weighted variance after split).
•	The attribute that maximizes Variance Reduction is selected.

Intuition Behind These Metrics

•	The essence of all these metrics is to evaluate how well an attribute splits the data into pure subsets where the target variable’s values are consistent.
•	For example:
•	In classification, if splitting by an attribute creates subsets where most samples belong to a single class, that attribute is effective.
•	In regression, if splitting reduces the variance of target values in subsets, the split is considered good.

Practical Steps

1.	Calculate the chosen metric (e.g., Information Gain, Gini Impurity) for every possible split using all attributes.
2.	Compare the metric values across attributes and select the one that maximizes purity (or reduces impurity/variance the most).
3.	Repeat this process recursively at each node of the tree.

By systematically evaluating attributes with these metrics, decision trees ensure that each split improves the tree’s ability to make accurate predictions.
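As a rough sketch of how these metrics are computed, the NumPy snippet below implements entropy, Gini impurity, and the information gain of a candidate split (helper names are my own, not from any particular library):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, subsets):
    """Entropy before the split minus the weighted entropy of the subsets."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Example: a binary split of 10 samples
y = np.array(["yes"] * 6 + ["no"] * 4)
left, right = y[:5], y[5:]                   # subsets produced by a candidate split
print(information_gain(y, [left, right]))
print(gini(y), gini(left), gini(right))
```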

3
Q

Based on the DM5 image, calculate the information gain for each attribute

A

Answer on DM6

4
Q

What are the stopping criteria for decision trees?

A

Stopping criteria in decision trees are conditions that determine when the tree-building process should stop growing further. These criteria help prevent overfitting and ensure that the tree does not become excessively complex or impractical. Here are the most common stopping criteria:

1.	Maximum Depth of the Tree:
•	The tree stops growing when it reaches a predefined depth.
•	This limits the number of splits and keeps the tree manageable.
•	Example: A maximum depth of 3 means the tree will have at most 3 levels of decisions.
2.	Minimum Number of Samples per Node:
•	The tree stops splitting a node if the number of samples in that node falls below a threshold.
•	This avoids creating nodes that represent very small, potentially insignificant subsets of data.
•	Example: With a minimum of 5 samples, any node with fewer than 5 samples will not be split.
3.	Purity of the Node:
•	If a node is “pure,” meaning all samples in the node belong to the same class (for classification), splitting stops.
•	Similarly, in regression, splitting may stop if the variance of the target variable in a node is very low.
4.	Minimum Improvement in the Splitting Criterion:
•	Splitting stops if the improvement in metrics like Information Gain, Gini Impurity, or Variance Reduction falls below a threshold.
•	This ensures that only meaningful splits are made, avoiding marginal improvements that do not significantly enhance the tree’s performance.
5.	Maximum Number of Leaf Nodes:
•	The tree stops growing once a specified number of leaf nodes is reached.
•	This limits the overall complexity of the tree.
6.	Early Stopping Based on Validation Performance:
•	Tree building stops if performance (e.g., accuracy) on a validation set no longer improves or starts to degrade.
•	This prevents overfitting to the training data.
7.	Practical Constraints:
•	Memory or computational limits may also act as stopping criteria in practice, for example:
•	A limit on the total number of splits.
•	A restriction on the time available for building the tree.

These stopping criteria are essential for balancing the tree’s complexity and its ability to generalize well to unseen data. Often, a combination of these criteria is used in practice.
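In scikit-learn, for instance, most of these criteria map directly onto constructor parameters of `DecisionTreeClassifier` (a sketch; the specific values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping criteria above.
clf = DecisionTreeClassifier(
    max_depth=3,                 # 1. maximum depth of the tree
    min_samples_split=5,         # 2. minimum samples required to split a node
    min_samples_leaf=2,          # 2. minimum samples allowed in a leaf
    min_impurity_decrease=0.01,  # 4. minimum improvement in the splitting criterion
    max_leaf_nodes=20,           # 5. maximum number of leaf nodes
)
# Criterion 3 (node purity) is implicit: a pure node is never split further.
```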

5
Q

What are highly-branching attributes? How do we deal with them? Why can they be problematic?

A

What Are Highly-Branching Attributes?

Highly-branching attributes are features in a dataset that can take on a large number of unique values. For example:
• In a dataset, a “user ID” or “product code” might have thousands or millions of unique values.
• If used as a splitting attribute in a decision tree, they create many branches—one for each unique value.

Why Are They Problematic?

1.	Overfitting:
•	Highly-branching attributes tend to create overly complex trees that perfectly fit the training data.
•	Each unique value forms its own branch, leading to splits that may capture noise instead of meaningful patterns.
2.	Data Fragmentation:
•	Splitting by such attributes can divide the dataset into many small subsets, each with very few samples.
•	This reduces the statistical power of the model and may lead to unreliable predictions.
3.	Increased Complexity:
•	Trees become deeper and harder to interpret, making the model less explainable.
•	Computational resources are also strained due to the large number of branches.
4.	Reduced Generalization:
•	Trees that split on highly-branching attributes often perform poorly on unseen data, as the splits are too specific to the training set.

How Do We Deal With Highly-Branching Attributes?

1.	Use Regularization Parameters:
•	Set constraints like minimum samples per split or maximum tree depth to prevent overly fine splits on these attributes.
2.	Ignore Such Attributes:
•	Manually or automatically exclude attributes with excessively high cardinality, especially if they are unlikely to carry meaningful patterns (e.g., user IDs).
3.	Use Information-Gain Normalization:
•	Some algorithms (like C4.5) normalize splitting criteria such as Information Gain to penalize attributes with many unique values. For example:
•	Divide Information Gain by the intrinsic information of the attribute to reduce the bias toward highly-branching attributes.
4.	Group or Bin Values:
•	Group similar values into broader categories. For example:
•	Instead of using individual product codes, group products by category.
•	Discretize numerical values into ranges or bins.
5.	Feature Engineering:
•	Transform highly-branching attributes into more meaningful ones, such as extracting key features or aggregating data (e.g., instead of user ID, use “total purchases by user”).
6.	Pruning:
•	Use post-pruning techniques to simplify the tree after it has been built, removing splits that don’t significantly improve predictive performance.

By handling highly-branching attributes carefully, we can avoid overfitting, ensure better generalization, and maintain a balance between model complexity and predictive accuracy.
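A small pandas sketch of the grouping and feature-engineering ideas above (the table, column names, and the “top 2 codes” cutoff are invented for illustration):

```python
import pandas as pd

# Hypothetical transaction table with high-cardinality columns.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2", "u1"],
    "product_code": ["A13", "B07", "A14", "A13", "C22", "B07"],
    "amount": [10.0, 25.0, 8.0, 10.0, 40.0, 25.0],
})

# Feature engineering: replace the raw user ID with an aggregate per user.
df["purchases_by_user"] = df.groupby("user_id")["user_id"].transform("count")

# Grouping: keep only the most frequent product codes, lump the rest into "other".
top_codes = df["product_code"].value_counts().nlargest(2).index
df["product_group"] = df["product_code"].where(df["product_code"].isin(top_codes), "other")

print(df[["purchases_by_user", "product_group", "amount"]])
```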

6
Q

What is information gain ratio? When and how do we use it?

A

What is Information Gain Ratio?

The Information Gain Ratio is a metric used in decision tree algorithms (like C4.5) to address a specific bias in the Information Gain metric. While Information Gain measures how much a feature improves the purity of subsets after a split, it tends to favor attributes with many unique values (i.e., highly-branching attributes). The Information Gain Ratio corrects this bias by normalizing Information Gain using the “intrinsic information” of the attribute.

Components of Information Gain Ratio

1.	Information Gain (IG):
•	Measures the reduction in entropy after splitting the data based on an attribute.
•	\( IG = \text{Entropy before split} - \text{Weighted entropy after split} \).
2.	Intrinsic Information (II):
•	Reflects the inherent information provided by the distribution of data across the attribute’s branches.
•	\( II = - \sum_i \frac{|S_i|}{|S|} \cdot \log_2 \frac{|S_i|}{|S|} \), where \( S_i \) is the subset of data sent down branch \( i \) and \( S \) is the full dataset at the node.
3.	Information Gain Ratio (IGR):
•	Normalizes Information Gain by dividing it by the Intrinsic Information:
•	\( IGR = \frac{IG}{II} \).

Why Do We Use Information Gain Ratio?

1.	To Avoid Bias Toward Highly-Branching Attributes:
•	Attributes with many unique values (like IDs or product codes) tend to have high Information Gain because they create many small, pure subsets.
•	However, these splits are often not meaningful and lead to overfitting.
•	The Information Gain Ratio penalizes such attributes by considering their intrinsic information.
2.	To Improve Decision Tree Performance:
•	By using the ratio, the decision tree chooses attributes that not only maximize Information Gain but also provide splits with balanced and meaningful branches.

How Do We Use Information Gain Ratio?

1.	Calculate Information Gain for Each Attribute:
•	For each candidate attribute, compute the reduction in entropy when splitting the dataset.
2.	Calculate Intrinsic Information for Each Attribute:
•	Measure the distribution of data across the branches formed by splitting on that attribute.
3.	Compute the Information Gain Ratio:
•	Divide the Information Gain by the Intrinsic Information for each attribute.
4.	Select the Attribute with the Highest Information Gain Ratio:
•	The attribute with the highest ratio is chosen for the split, provided it has sufficient Information Gain (to avoid meaningless splits with very low entropy reduction).

When Do We Use It?

1.	In Algorithms Like C4.5:
•	The Information Gain Ratio is specifically used in decision tree algorithms like C4.5 to improve upon the limitations of earlier methods like ID3.
2.	When Working with Highly-Branching Attributes:
•	If the dataset contains attributes with many unique values, using Information Gain Ratio helps avoid overfitting and ensures meaningful splits.
3.	When Building Balanced Decision Trees:
•	If a tree with more general and interpretable splits is desired, Information Gain Ratio helps prioritize attributes with better overall balance and significance.

By normalizing Information Gain, the ratio ensures that the decision tree grows in a way that reflects both the predictive value and meaningfulness of each attribute.
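To make the computation concrete, here is a small NumPy sketch of the gain ratio (the helper names are my own; `entropy` is the same helper as in the earlier information-gain card):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(parent_labels, subsets):
    """Information gain divided by the intrinsic (split) information."""
    n = len(parent_labels)
    weights = np.array([len(s) / n for s in subsets])
    info_gain = entropy(parent_labels) - np.sum(
        [w * entropy(s) for w, s in zip(weights, subsets)]
    )
    intrinsic_info = -np.sum(weights * np.log2(weights))   # II of the split itself
    return info_gain / intrinsic_info if intrinsic_info > 0 else 0.0

# Example: a 3-way split of 9 samples
y = np.array(["yes"] * 5 + ["no"] * 4)
subsets = [y[:3], y[3:6], y[6:]]
print(gain_ratio(y, subsets))
```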

7
Q

What is the Gini Index? How is it calculated? What is it used for? How do we decide the splits for numerical attributes?

A

What is the Gini Index?

The Gini Index (or Gini Impurity) is a metric used to evaluate splits in decision trees. It measures the likelihood of incorrectly classifying a randomly chosen sample if it is labeled according to the class distribution in a given subset. The goal is to minimize the Gini Index, which corresponds to creating subsets that are as pure as possible.

How Is It Calculated?

1.	Gini Impurity for a Single Node:
•	Formula: \( Gini = 1 - \sum_{i=1}^{C} p_i^2 \), where \( p_i \) is the proportion of samples belonging to class \( i \) in the node, and \( C \) is the total number of classes.
•	Intuition: If all samples in a node belong to a single class, \( p_i = 1 \) for that class and 0 for the others, giving \( Gini = 0 \) (a pure node). If the classes are evenly distributed, the \( p_i \) values are equal and the impurity is higher.
2.	Weighted Gini Impurity for a Split:
•	When a dataset is split into subsets, the Gini Index of the split is the weighted average \( Gini_{split} = \sum_{j=1}^{k} \frac{|S_j|}{|S|} \cdot Gini_j \), where \( k \) is the number of subsets, \( S_j \) is a subset, and \( |S_j| / |S| \) is its weight (proportion of the data).

What Is It Used For?

The Gini Index is commonly used in classification tasks within decision tree algorithms like CART (Classification and Regression Trees) to:
• Evaluate the quality of a split.
• Determine which attribute to split on at each node.
• Aim for subsets that are as pure as possible, ensuring better classification performance.

How Do We Decide Splits for Numerical Attributes?

Splitting on numerical attributes involves determining a threshold value (e.g., \( x < 5 \)) that best separates the data. Here’s how the process works:
1. Sort the Data by the Attribute:
• Arrange the data points in ascending order of the numerical attribute.
2. Evaluate Potential Splits:
• Consider all potential threshold values between consecutive values in the sorted list.
• For each threshold \( t \), divide the data into two subsets:
• Left: samples where the attribute value is \( \leq t \).
• Right: samples where the attribute value is \( > t \).
3. Compute the Gini Index for Each Split:
• Calculate the Gini Impurity for the left and right subsets.
• Use the weighted Gini Impurity formula to evaluate the overall impurity of the split.
4. Choose the Threshold with the Lowest Gini Index:
• The threshold that minimizes the Gini Index is selected as the splitting point.

Intuition Behind Splitting Numerical Attributes with Gini Index

•	The algorithm seeks a threshold that creates subsets with minimal mixing of classes. For example, if an attribute like “age” is being split, the goal is to find an age threshold that separates different classes (e.g., “buy” vs. “not buy”) as cleanly as possible.
•	By evaluating every potential split point and selecting the one that minimizes impurity, the tree grows in a way that maximizes its predictive power.
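A straightforward (deliberately unoptimized) sketch of this threshold search, with helper names of my own choosing:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values and return the
    threshold with the lowest weighted Gini impurity."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(values)
    best_t, best_impurity = None, np.inf
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue                                   # no boundary between equal values
        t = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity

ages   = [22, 25, 30, 35, 40, 52]
target = ["not buy", "not buy", "buy", "buy", "buy", "not buy"]
print(best_threshold(ages, target))   # (27.5, 0.25) for this toy data
```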
8
Q

Calculate the information gain ratio from the DM5 image

A

Answer in DM7

9
Q

Calculate the Gini index values for the —- attribute in DM5

A

Answer in DM8

10
Q

What is repeated sorting? How could we avoid it?

A

What Is Repeated Sorting?

Repeated sorting occurs during the decision tree-building process when evaluating splits for numerical attributes. Specifically:
1. For each numerical attribute, the algorithm must sort the data to identify the best split point.
2. This sorting is repeated at every node of the tree for all candidate attributes.

Repeated sorting can become computationally expensive, especially when working with large datasets or many numerical attributes. The computational cost increases as the depth of the tree grows because sorting is performed multiple times for subsets of the data at each node.

Why Is Repeated Sorting a Problem?

1.	High Computational Cost:
•	Sorting is an \( O(n \log n) \) operation, where \( n \) is the number of samples. Repeating this for multiple attributes and at each node leads to significant overhead.
2.	Decreased Scalability:
•	For large datasets with many numerical attributes, repeated sorting can make training a decision tree infeasible in terms of time.
3.	Inefficiency:
•	Sorting the same data multiple times for similar computations wastes resources that could be used more efficiently.

How Could We Avoid Repeated Sorting?

1.	Sort Once and Store Sorted Indices:
•	For each numerical attribute, sort the data once at the beginning of the tree-building process.
•	Store the sorted indices so that splits can be evaluated by iterating over the sorted data rather than re-sorting.
2.	Dynamic Programming for Split Evaluation:
•	Use a linear scan over the sorted data to calculate potential split points.
•	Maintain cumulative class counts while iterating, which allows efficient computation of metrics like Gini Impurity or Information Gain at every candidate threshold without re-sorting.
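A sketch of the second idea: sort the attribute once, then sweep the sorted order while maintaining running class counts, so every candidate threshold is evaluated in constant time (illustrative code with my own naming):

```python
import numpy as np

def best_gini_split_presorted(values, labels):
    """One O(n log n) sort, then a single linear scan with cumulative class counts."""
    order = np.argsort(values)                          # sort ONCE for this attribute
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    classes, encoded = np.unique(labels, return_inverse=True)
    n, k = len(values), len(classes)

    left = np.zeros(k)                                  # running class counts left of the cut
    total = np.bincount(encoded, minlength=k).astype(float)
    best_t, best_impurity = None, np.inf

    for i in range(1, n):
        left[encoded[i - 1]] += 1                       # move one sample to the left side
        if values[i] == values[i - 1]:
            continue                                    # no valid threshold between equal values
        right = total - left
        gini_left = 1 - np.sum((left / i) ** 2)
        gini_right = 1 - np.sum((right / (n - i)) ** 2)
        impurity = i / n * gini_left + (n - i) / n * gini_right
        if impurity < best_impurity:
            best_t, best_impurity = (values[i] + values[i - 1]) / 2, impurity
    return best_t, best_impurity

print(best_gini_split_presorted([22, 25, 30, 35, 40, 52],
                                ["no", "no", "yes", "yes", "yes", "no"]))
```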
11
Q

How could we avoid overfitting in a decision tree?

A

• The generated tree may overfit the training data

• Too many branches, some may reflect anomalies due to noise or outliers

• Results in poor accuracy for unseen samples

• Two approaches to avoid overfitting
– Prepruning
– Postpruning

• Prepruning
– Halt tree construction early
– Do not split a node if this would result in the goodness measure falling below a threshold
– Difficult to choose an appropriate threshold

• Postpruning
– Remove branches from a “fully grown” tree
– Get a sequence of progressively pruned trees
– Use a set of data different from the training data to decide which is the “best pruned tree”
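As a concrete sketch, scikit-learn exposes both approaches: prepruning through growth constraints and postpruning through cost-complexity pruning, with the pruning strength chosen on held-out data (the Iris dataset and the parameter values here are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: halt tree construction early via constraints.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Postpruning: grow a full tree, then prune it back with cost-complexity pruning.
# The candidate pruning strengths (ccp_alpha) yield progressively pruned trees;
# held-out data decides which pruned tree is "best".
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:                 # prefer the most pruned tree among ties
        best_alpha, best_score = alpha, score

print(f"prepruned validation accuracy:  {pre.score(X_val, y_val):.3f}")
print(f"best ccp_alpha: {best_alpha:.4f}, postpruned validation accuracy: {best_score:.3f}")
```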

12
Q

Detail the goal of the prepruning strategy for reducing overfitting in decision trees

A

• Based on statistical significance test

• Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node

• The most popular approach is the chi-squared test

• Quinlan’s classic tree learner ID3 used chi-squared test in addition to information gain

• Only statistically significant attributes were allowed to be selected
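To make the test concrete, here is a sketch using SciPy’s chi-squared test of independence on a node’s attribute-versus-class contingency table (the table and the 0.05 significance level are illustrative choices, not part of ID3 itself):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table at a node: rows = attribute values (sunny / overcast / rainy),
# columns = class counts (play / don't play).
table = np.array([
    [2, 3],   # sunny
    [4, 0],   # overcast
    [3, 2],   # rainy
])

chi2, p_value, dof, expected = chi2_contingency(table)

# Prepruning rule: only allow this attribute to be selected if the
# attribute-class association is statistically significant.
if p_value < 0.05:
    print(f"split allowed (chi2={chi2:.2f}, p={p_value:.3f})")
else:
    print(f"stop growing at this node (chi2={chi2:.2f}, p={p_value:.3f})")
```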

13
Q

Detail the goal of the postpruning strategy for reducing overfitting in decision trees

A

• First, build full tree, then prune it

• Fully-grown tree shows all attribute interactions

• Problem: some subtrees might be due to chance effects

• Two pruning operations
– Subtree raising
– Subtree replacement

• Possible strategies
– Error estimation
– Significance testing
– MDL principle

14
Q

How do we compute regression trees? What do we use them for?

A

• Decision trees can also be used to predict the value of a numerical target variable
• Regression trees work similarly to decision trees: they search for the best split that minimizes an impurity measure

• Regression Trees
– Prediction is computed as the average of the numerical target variable in the subspace (leaf)
• What impurity measure?
– It can be measured as the expected error reduction, or SDR (standard deviation reduction): \( SDR = \sigma(D) - \sum_i \frac{|D_i|}{|D|} \, \sigma(D_i) \)
– where \( D \) is the original data, the \( D_i \) are the partitions, and \( \sigma \) is the standard deviation of the target
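A brief sketch of the SDR computation described above (the helper name is my own):

```python
import numpy as np

def sdr(parent_targets, subsets):
    """Standard deviation reduction: sigma(D) minus the weighted sigma of the partitions."""
    n = len(parent_targets)
    weighted = sum(len(d) / n * np.std(d) for d in subsets)
    return np.std(parent_targets) - weighted

# Example: a split that separates two clearly different value regimes
y = np.array([10.0, 12.0, 11.0, 13.0, 30.0, 32.0, 31.0, 29.0])
left, right = y[:4], y[4:]
print(sdr(y, [left, right]))           # large reduction -> a good split
print(left.mean(), right.mean())       # leaf predictions: the mean target in each leaf
```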

15
Q

What are decision Stumps? What do we use them for?

A

• Decision and regression stumps are one level decision/regression trees
• They are the simplest trees possible and also the main building blocks for boosting methods
• Categorical Attributes
– One branch for each attribute value
– One branch for one value and one branch for all the others
– Missing values sometimes are treated as a special value
• Numerical Attributes
– Two leaves defined by a threshold value selected based on some criterion
– Multiple splits (rarely used)
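In scikit-learn, for example, a decision stump is simply a tree with max_depth=1, and a depth-1 tree is also AdaBoost’s default weak learner (a minimal sketch on a built-in dataset; the accuracies printed here are on the training data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single decision stump: one split on one feature, two leaves.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# Boosting many stumps: AdaBoost's default base learner is a depth-1 tree.
boosted = AdaBoostClassifier(n_estimators=100).fit(X, y)

print(f"single stump accuracy:   {stump.score(X, y):.3f}")
print(f"boosted stumps accuracy: {boosted.score(X, y):.3f}")
```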
