Decision Trees Flashcards

1
Q

What is a Decision Tree?

A

A Decision Tree is a tree-shaped diagram used to determine a course of action.

2
Q

What does each branch of the Decision Tree typically represent?

A

Each branch of the tree represents a possible decision,
occurrence, or reaction.

3
Q

What type of learning algorithm is a Decision Tree, and for what tasks is it commonly used?

A

A decision tree is a non-parametric
supervised learning algorithm, which is utilized for
both classification and regression tasks.
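
As a quick illustration of both uses, here is a minimal scikit-learn sketch (assuming scikit-learn is available; the toy data below is invented purely for demonstration):

# Minimal sketch: the same algorithm family handles classification and regression.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 0], [1, 1], [1, 0], [0, 1]]      # toy feature matrix (hypothetical)
y_class = [0, 1, 1, 0]                    # class labels      -> classification tree
y_value = [1.5, 3.2, 2.8, 1.9]            # continuous target -> regression tree

clf = DecisionTreeClassifier().fit(X, y_class)
reg = DecisionTreeRegressor().fit(X, y_value)

print(clf.predict([[1, 1]]))              # predicted class label
print(reg.predict([[1, 1]]))              # predicted continuous value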

4
Q

How is the training dataset used in building models with the Decision Tree Algorithm?

A

The training dataset is fed into a tree induction algorithm during the learning phase. The learned model is the outcome of the tree induction algorithm processing the training set.

5
Q

In terms of tasks, what are the two main applications of a decision tree?

A

classification and regression tasks.

6
Q

Why is a decision tree considered a popular data mining technique?

A

A decision tree visualization helps outline the decisions in a way that is easy to understand

7
Q

What is the primary goal of creating a model using the Decision Tree Algorithm?

A

The goal is to create a model that predicts the value
of a target variable based on several input variables.

8
Q

How is an internal node represented in a decision tree, and what does it signify?

A

an internal node represents a feature (or attribute)

9
Q

What does each branch in a decision tree represent?

A

represents a decision rule

10
Q

What role does a leaf node play in a decision tree, and what does it represent?

A

Each leaf node represents the outcome.

11
Q

What is the significance of the root node in a decision tree?

A

It learns to partition on the basis of the
attribute value

12
Q

How does the root node contribute to the partitioning of a decision tree?

A

It partitions the tree in a recursive manner called recursive partitioning

13
Q

What is the process known as when the decision tree partitions in a recursive manner?

A

recursive partitioning

14
Q

Why is a decision tree often compared to a flowchart diagram?

A

Its visualization resembles a flowchart diagram, which closely mimics human-level thinking.

15
Q

What are the three main components of a decision tree, and what do they represent?

A
  • Node
    test for the value of a certain attribute
  • Edges
    correspond to the outcome of a test
  • Leaves
    terminal nodes that predict the outcome
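
A minimal data-structure sketch of these three components (the field names are hypothetical, not taken from any particular library):

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class TreeNode:
    attribute: Optional[str] = None                                 # node: tests the value of a certain attribute
    children: Dict[Any, "TreeNode"] = field(default_factory=dict)   # edges: one child per test outcome
    prediction: Optional[str] = None                                # leaf: terminal node that predicts the outcome

def predict(node: TreeNode, record: Dict[str, Any]) -> str:
    # Follow the edge matching the record's attribute value until a leaf is reached.
    while node.prediction is None:
        node = node.children[record[node.attribute]]
    return node.prediction
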
16
Q

In a classification tree, what is the purpose of determining a set of logical if-then conditions?

A

To classify records (assign them to classes).

17
Q

When is a regression tree used, and how does it differ from a classification tree in terms of the target variable?

A

A regression tree is used when the target variable is numerical or continuous. We fit a regression model
to the target variable using each of the independent variables

18
Q

In the context of Decision trees:

Define Gain

A

Gain is a measure of the decrease in entropy after splitting the dataset on an attribute.

19
Q

Decision Trees

How to split the data?

A

We have to frame the conditions that split the data in such a way that the information gain is the highest.

20
Q

How does the decision tree algorithm work?

A
  1. Select the best attribute using an attribute selection measure to split the records
  2. Make that attribute a decision node
  3. Start tree building by recursively repeating this process for each child (see the sketch below)
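
A minimal runnable sketch of these three steps (records are plain dicts; for brevity the attribute selection measure here is a simple misclassification count rather than true information gain):

from collections import Counter

def build_tree(records, attributes, target):
    labels = [r[target] for r in records]
    # Stop when all records share one class or no attributes remain; leaf = majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # 1. Select the "best" attribute (lowest misclassification count across its subsets).
    def misclassified(attr):
        total = 0
        for v in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == v]
            total += len(subset) - Counter(subset).most_common(1)[0][1]
        return total
    best = min(attributes, key=misclassified)
    # 2. Make that attribute a decision node.
    node = {best: {}}
    # 3. Recursively repeat the process for each child (one per value of the chosen attribute).
    for v in set(r[best] for r in records):
        children = [r for r in records if r[best] == v]
        node[best][v] = build_tree(children, [a for a in attributes if a != best], target)
    return node
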
21
Q

Generally, when does the process of Tree building stop?

A
  • When there are no more attributes
  • There are no more instances
  • All the tuples belong to the same class
22
Q

List three advantages of Decision Trees in machine learning.

A

  • Simple to understand, interpret, and visualize
  • Little data preparation is needed (no scaling, no dummy variables)
  • Can handle both categorical and numerical variables

23
Q

What makes Decision Trees simple to understand and interpret for humans?

A

They look like simple if-else statements, and therefore can
be easily interpreted by humans

24
Q

What are the advantages of Decision Trees regarding data preparation?

A
  • No scaling needed
  • Can work without extensive handling of missing data
  • No need for dummy variables
25
Q

In terms of handling variables, what types of variables can Decision Trees manage?

A

Can handle both categorical and numerical
variables

26
Q

Do nonlinear parameters affect Decision Trees?

A

Nonlinear parameters don’t affect its performance

27
Q

What is a notable characteristic of Decision Trees regarding assumptions compared to statistical models?

A

Do not require the assumptions of statistical models

28
Q

Identify the major disadvantage of Decision Trees in machine learning.

A

overfitting

29
Q

What is overfitting, and how does it impact the performance of a decision tree?

A

Overfitting can lead to wrong decisions. A decision tree will keep generating new nodes to fit the data, which makes it lose its generalization capability.

30
Q

Why does a decision tree lose its generalization capabilities due to overfitting?

A

A decision tree will keep generating new nodes to fit the data

31
Q

What happens to the overall tree when new data points are added?

A

Adding new data points leads to the regeneration of the overall tree, meaning that nodes need to be recalculated.

32
Q

How is noise a factor that affects the stability of a decision tree model?

A

a little bit of noise can make a decision tree model unstable

33
Q

Why are Decision Trees considered unsuitable for large datasets, and what issue does it lead to?

A

A large dataset can cause the tree to grow too large and
complex, which will lead to overfitting.

34
Q

What is a low-biased Tree?

A

It is a highly complex tree with low bias, which makes it hard for the model to generalize to new data.

35
Q

What can a high variance do to a decision tree?

A

The model can get unstable

36
Q

What is the challenge associated with large decision trees

A

They become difficult to interpret

37
Q

How are classification rules extracted from a decision tree?

A

One rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically joined by the AND operator to form the “IF” part. The leaf node holds the class prediction, forming the rule’s “THEN” part.

IF age = youth AND student = no THEN buys_computer = no
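
A minimal sketch of this path-to-rule extraction, assuming the tree is stored as nested dicts of the form {attribute: {value: subtree_or_class}} (a hypothetical representation, not a specific library format):

def extract_rules(tree, conditions=()):
    # A leaf (a plain class label) ends the path: emit IF <conditions> THEN <class>.
    if not isinstance(tree, dict):
        yield "IF " + " AND ".join(conditions) + " THEN " + str(tree)
        return
    # An internal node: follow each outgoing edge, ANDing its splitting criterion.
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + (f"{attribute} = {value}",))

# Toy tree mirroring the flashcard example (values invented for illustration).
tree = {"age": {"youth": {"student": {"no": "buys_computer = no",
                                      "yes": "buys_computer = yes"}},
                "middle_aged": "buys_computer = yes"}}
for rule in extract_rules(tree):
    print(rule)
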

38
Q

What is the process of forming the “IF” part of a classification rule from a decision tree path?

A

Each splitting criterion along a given path is logically joined by the AND operator to form the “IF” part.

39
Q

What information does the leaf node hold in the context of forming a classification rule?

A

The leaf node holds the class prediction, forming the rule “THEN” part.

40
Q

What does the root node represent in a decision tree, and what kind of edges are associated with it?

A

It can be considered the starting point of the tree: it has no incoming edges and zero or more outgoing edges. The outgoing edges lead to either an internal node or a leaf node.

The root node is usually an attribute of the decision tree model.

41
Q

How is an internal node defined in a decision tree, and what is its relationship with outgoing edges?

A

Appears after a root node or an internal node and is
followed by either internal nodes or leaf nodes. It has
only one incoming edge and at least two outgoing
edges.

Internal nodes are always attributes of the decision tree model

42
Q

What characterizes leaf nodes in a decision tree, and what information do they typically represent?

A

These are the bottommost elements of the tree and
normally represent classes of the decision tree model.

Depending on whether the subset can be cleanly classified, each leaf node holds either a single class label or a class distribution.

43
Q

How is the class distribution handled in leaf nodes, and what determines the number of outgoing edges from a leaf node?

A

Depending on whether the subset can be cleanly classified, each leaf node holds either a single class label or a class distribution.
Leaf nodes have one incoming edge and no outgoing edges.

44
Q

Name the Following:

A decision tree is created in two phases

A
  1. Recursive partitioning
  2. Pruning the tree
45
Q

What is the idea of Recursive partitioning?

A

Repeatedly split the records into two or more branches, so as to achieve maximum homogeneity/purity within the new parts

46
Q

What is the idea of pruning the tree?

A

Simplify the tree by pruning peripheral branches to avoid overfitting

47
Q

How does the concept of purity relate to the subsets created by a good attribute split?

A

a good attribute splits the examples into subsets
that are (ideally) “all positive” or “all negative”

48
Q

When dealing with numerical variables, how is the splitting process performed in decision trees?

A
  1. Order records according to the numerical variable
  2. Find midpoints between successive non-duplicate values
  3. Divide records into those with x > midpoint and those with x < midpoint

E.g.,
for the three points 14, 14.8, 16, the midpoint between 14.0 and 14.8 is 14.4, and the midpoint between 14.8 and 16 is 15.4.
Records are divided into those with lot_size > 14.4 and those with lot_size < 14.4.
After evaluating that split, try the next split, which is 15.4.
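A small sketch of this midpoint procedure (lot_size follows the card's example; evaluating each candidate split is left as a placeholder):

lot_sizes = [14, 14.8, 16]                    # values of the numerical variable

# 1. Order the values (dropping duplicates).
ordered = sorted(set(lot_sizes))

# 2. Midpoints between successive non-duplicate values.
midpoints = [round((a + b) / 2, 4) for a, b in zip(ordered, ordered[1:])]
print(midpoints)                              # [14.4, 15.4]

# 3. Divide records at each candidate midpoint, then evaluate that split.
for m in midpoints:
    above = [v for v in lot_sizes if v > m]   # e.g. lot_size > 14.4
    below = [v for v in lot_sizes if v < m]   # e.g. lot_size < 14.4
    # ...compute the purity of this split here and keep the best midpoint...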

49
Q

Explain the process of finding midpoints between successive non-duplicate values.

A

taking the average of the two values. For example, for the three points 14, 14.8, 16, the midpoint between 14.0 and 14.8 is 14.4, and the midpoint between 14.8 and 16 is 15.4.

50
Q

How are records divided based on the midpoints and the numerical variable in decision trees?

A

Divide the records into two groups based on whether they are greater than or less than the midpoint. For example, records with lot_size > 14.4 and those with lot_size < 14.4.

51
Q

How do decision trees search for the best division of the input space?

A

Decision Trees greedily search for the best division of the
Input Space into exhaustive, mutually exclusive pure rectangles.

52
Q

When dealing with categorical variables, how are all possible ways of splitting the categories examined?

A

there are 2^(n−1) − 1 possible binary splits.

E.g., categories A, B, C can be split 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
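
A short sketch that enumerates these binary splits and confirms the 2^(n−1) − 1 count (for {A, B, C} it prints exactly the three splits listed above):

from itertools import combinations

def binary_splits(categories):
    cats = list(categories)
    splits = []
    # Only consider subsets containing the first category, so each split
    # such as {A} vs {B, C} is counted once rather than twice.
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {cats[0], *combo}
            right = set(cats) - left
            if right:                              # skip the split with an empty side
                splits.append((left, right))
    return splits

for left, right in binary_splits(["A", "B", "C"]):
    print(left, "and", right)
print(len(binary_splits(["A", "B", "C"])))         # 3, i.e. 2**(3-1) - 1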

53
Q

When does the number of possible splits become significant?

A

With many categories, number of splits becomes huge

54
Q

What is the formula for determining the number of possible splits when a variable has many categories?

A

2^(n−1) − 1

55
Q

In the recursive partitioning step of decision tree construction, what is the first step regarding predictor variables?

A

Pick one of the predictor variables, xᵢ.

56
Q

How is a value (si) selected for a chosen predictor variable in the recursive partitioning step?

A

A value sᵢ of xᵢ is selected that divides the training data into two (not necessarily equal) portions.

57
Q

Define the concept of “purity” in the context of the recursive partitioning step.

A

containing records of mostly one class

58
Q

What is the objective of the algorithm when trying different values of xi and si in the recursive partitioning step?

A

The algorithm tries different values of xᵢ and sᵢ to maximize purity in the initial split.

59
Q

After obtaining a maximum purity split, what is the next step in the recursive partitioning process?

A

repeat the process for a second split, and so on

60
Q

What are the conditions for stopping the partitioning process in decision tree construction?

A
  • There are no samples left
  • There are no remaining attributes for further partitioning
  • A stopping criterion is satisfied
61
Q

Why is a stopping criterion necessary in decision tree construction, especially with real-world data?

A

Many large sets of real-world data are noisy, making it difficult to obtain pure data sets at leaf nodes.

An example of such a stopping criterion is to require a measure of data purity to fall below a threshold value, e.g., entropy < 0.1.

62
Q

Describe the principle of the decision tree construction algorithm

A

The basic algorithm (adopted by ID3, C4.5, and CART) is a greedy algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner.

63
Q

What happens in each iteration of the decision tree construction algorithm, and how are test attributes selected?

A
  • At start, all the training tuples are at the root
  • Tuples are partitioned recursively based on selected attributes
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
64
Q

What are the stopping conditions for constructing a decision tree node, and how does the algorithm handle them?

A
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning (in this case, majority voting is employed to classify the leaf)
  • There are no samples left
65
Q

What is the purpose of rule extraction from a decision tree?

A

Rules are easier to understand than large
trees

66
Q

What forms a conjunction along a path from the root to a leaf in a decision tree?

A

Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction

67
Q

In the example provided, what is the class prediction in the rule “IF age = young AND student = no THEN buys_computer = no”?

A

buys_computer = no

68
Q

What is the significance of the leaf node in the rule extraction from a decision tree?

A

holds the class prediction

69
Q

What is the key consideration in building a decision tree regarding attribute selection?

A

The key to building a decision tree is deciding which attribute to choose in order to branch.

70
Q

What is the primary objective when choosing an attribute for branching in a decision tree?

A

The objective is to reduce impurity or uncertainty in
data as much as possible

71
Q

What does a measure of impurity prefer in terms of attributes?

A

It prefers attributes whose splits produce subsets with a high degree of purity.

72
Q

Define maximum purity and minimum purity in the context of impurity measures.

A
  • Maximum purity: All examples are of the same class
  • Minimum purity: All classes are equally likely
73
Q

Name three measures used for evaluating impurity in decision trees.

A

Measures of impurity:
  1. Entropy
  2. Information Gain
  3. Gini Index (and other measures)

74
Q

What does entropy measure in a dataset?

A

Entropy measures the degree of randomness or uncertainty in the dataset.

75
Q

How is entropy related to the distribution of class labels in classifications?

A

In classifications, entropy measures randomness based on the distribution of class labels in the dataset.

76
Q

Define the entropy (Hᵢ) for a subset of the dataset with K classes at the ith node.

A

Hᵢ = − Σₖ₌₁ᴷ pᵢ(k) · log₂(pᵢ(k)), where pᵢ(k) is the probability of class k in the subset.

77
Q

When does entropy reach its lowest value, and what does it indicate?

A

Entropy is 0 when the dataset is completely homogeneous, indicating that each instance belongs to the same class.

78
Q

When does entropy reach its maximum value, and what does it indicate?

A

Entropy is at its maximum when the dataset is equally divided between multiple classes, indicating maximum uncertainty in the dataset.

79
Q

How is entropy used to evaluate the quality of a split in a decision tree?

A

Entropy is used to select the attribute that minimizes the entropy of resulting subsets, aiming to create more homogeneous subsets with respect to class labels.

80
Q

What is the goal of entropy in decision tree construction?

A

The goal is to choose the attribute with the highest information gain, i.e., the attribute that minimizes entropy after splitting, and to build a decision tree recursively.

81
Q

Equation for entropy derivation

A

E(S) = − Σᵢ pᵢ · log₂(pᵢ)

82
Q

Equation for entropy derivation for multiple attributes

A

E(T, X) = Σ_c P(c) · E(c), summed over the values c of attribute X
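
A minimal sketch of this weighted entropy together with the single-set entropy E(S) from the previous card (the 9 "yes" / 5 "no" class counts mirror the 14-tuple example used in later cards):

from math import log2

def entropy(labels):
    # E(S) = - sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def split_entropy(records, attribute, target):
    # E(T, X) = sum over the values c of attribute X of P(c) * E(c)
    n = len(records)
    total = 0.0
    for v in set(r[attribute] for r in records):
        subset = [r[target] for r in records if r[attribute] == v]
        total += (len(subset) / n) * entropy(subset)
    return total

print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))   # ≈ 0.94 bits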

83
Q

What is Information Gain, and how is it used in decision tree algorithms?

A

Information Gain is a measure based on Claude Shannon’s information theory, assessing the reduction in entropy or variance resulting from splitting a dataset. In decision trees, it guides attribute selection by favoring the attribute that maximizes Information Gain, indicating its usefulness in creating homogeneous subsets with respect to class labels or target variables. Higher Information Gain signifies greater predictive value.

The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are growing through each outcome of age. The tuples are shown partitioned accordingly.

84
Q

How is Information Gain computed for the attribute “age” in a decision tree?

A

Information Gain for “age” is calculated by evaluating the expected information requirement. This involves examining the distribution of “yes” and “no” tuples for each age category. The formula includes the entropy calculation for each category and yields the Information Gain. In the provided example, the Information Gain for “age” is determined as 0.246 bits.

entropy calculation for each category:
\( Info_{age}(D) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.694 \text{ bits} \)
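
The figures above can be checked with a few lines of arithmetic; the entropy of the full set, Info(D) ≈ 0.940 bits for 9 "yes" and 5 "no" tuples, is implied by Gain(age) = 0.940 − 0.694 = 0.246:

from math import log2

def H(*probs):
    # entropy in bits; terms with probability 0 contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

info_D = H(9/14, 5/14)                                       # entropy of the full set
info_age = 5/14 * H(2/5, 3/5) + 4/14 * H(4/4) + 5/14 * H(3/5, 2/5)

print(round(info_D, 3))      # 0.94  -> Info(D) = 0.940 bits
print(round(info_age, 3))    # 0.694 -> Info_age(D) = 0.694 bits
# Gain(age) = Info(D) - Info_age(D) ≈ 0.940 - 0.694 = 0.246 bits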

85
Q

How are Information Gains computed for attributes “income,” “student,” and “credit rating” in a decision tree?

A

Information Gains for “income,” “student,” and “credit rating” are computed using the same process as for “age.” The gains are determined by evaluating the expected information requirement for each attribute. In this case, the computed gains are 0.029 bits for “income,” 0.151 bits for “student,” and 0.048 bits for “credit rating.” Because “age” has the highest gain among the attributes, it is selected as the splitting attribute for Node N in the decision tree.

86
Q

What is the formula for the Gini index, and how is it used in the context of a decision tree?

A

The Gini index is given by the formula \( Gini(D) = 1 - \sum_{i=1}^{m} p_i^2 \), where \( p_i \) is the probability that a tuple in \( D \) belongs to class \( C_i \). The index measures the impurity of a data partition or set of training tuples. When considering a binary split for each attribute, the Gini index for a partitioning, \( Gini_A(D) \), is calculated as a weighted sum of the impurity of each resulting partition. For a discrete-valued attribute, the subset that gives the minimum Gini index is selected as its splitting subset.
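
A one-function sketch of this formula (the 9 "yes" / 5 "no" counts match the running example; 0.459 is the Gini index of the full data set cited in a later card):

def gini(labels):
    # Gini(D) = 1 - sum_i p_i^2
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(round(gini(["yes"] * 9 + ["no"] * 5), 3))      # 0.459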

87
Q

How is the Gini index used to induce a decision tree, and what is the process of finding the splitting criterion for the tuples in \( D \)?

A

To induce a decision tree, the Gini index is computed for each attribute. The process involves considering each possible binary split for a discrete-valued attribute. The splitting criterion for the tuples in \( D \) is determined by selecting the subset that gives the minimum Gini index for that attribute. The weighted sum of the impurity of each resulting partition is used to evaluate the Gini index for the binary split on an attribute.

88
Q

In the given example (Example 8.3), how is the Gini index computed for the attribute “income,” considering the subset {low, medium}?

A

The Gini index for the subset {low, medium} is computed using the formula \( Gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\,Gini(D_1) + \frac{4}{14}\,Gini(D_2) \), where \( D_1 \) and \( D_2 \) are the partitions resulting from the binary split on the condition “income ∈ {low, medium}.” The Gini index values for \( D_1 \) and \( D_2 \) are calculated, and the weighted sum is used to determine the Gini index for the binary split on the “income” attribute. In this example, the resulting Gini index is 0.443.
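
The 0.443 figure can be reproduced with a few lines of arithmetic, assuming the class counts used in the referenced textbook example (not stated on this card): D1 holds the 10 tuples with income in {low, medium} (7 "yes", 3 "no") and D2 the 4 tuples with income = high (2 "yes", 2 "no"):

gini_D1 = 1 - (7/10) ** 2 - (3/10) ** 2       # Gini(D1) ≈ 0.42
gini_D2 = 1 - (2/4) ** 2 - (2/4) ** 2         # Gini(D2) = 0.50

gini_split = 10/14 * gini_D1 + 4/14 * gini_D2
print(round(gini_split, 3))                   # ≈ 0.443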

89
Q

What is the overall process of finding the splitting criterion and inducing a decision tree using the Gini index in the provided example?

A

The overall process involves computing the Gini index for each attribute, considering all possible binary splits. For each attribute, the subset that minimizes the Gini index is selected as the splitting subset. The Gini index values for the selected subsets are used to determine the best attribute for the root node. In the provided example, the process starts with the attribute “income,” and the Gini index is computed for subsets like (low, medium). The same process is then repeated for other attributes, and the attribute with the minimum Gini index becomes the splitting attribute for the root node. The decision tree is grown recursively based on these splitting criteria.

90
Q

What are the Gini index values for splits on the subsets {low, high}, {medium}, and {medium, high} for the attribute “income”?

A

The Gini index values for splits on the subsets are as follows:
  • {low, high} and {medium}: 0.458
  • {medium, high} and {low}: 0.450
Therefore, the best binary split for the “income” attribute is on {low, medium} (or {high}), as it minimizes the Gini index.

91
Q

What is the best binary split for the attribute “age,” and what is the corresponding Gini index?

A

The best binary split for the “age” attribute is on {youth, senior} (or {middle aged}), with a Gini index of 0.375.

92
Q

Are the attributes “student” and “credit rating” binary, and what are their respective Gini index values?

A

Yes, both “student” and “credit rating” are binary attributes. The Gini index values are 0.367 for “student” and 0.429 for “credit rating.”

93
Q

Which attribute and splitting subset give the minimum Gini index overall, and what is the reduction in impurity?

A

The attribute “age” and the splitting subset {youth, senior} give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.

94
Q

How is the final splitting criterion determined, and what is done with it in the context of building the decision tree?

A

The final splitting criterion is determined by selecting the attribute and its corresponding splitting subset that result in the minimum Gini index. In this example, the binary split “age ∈ {youth, senior}” yields the maximum reduction in impurity and is returned as the splitting criterion. Node N is labeled with this criterion, two branches are grown from it, and the tuples are partitioned accordingly during the construction of the decision tree.

95
Q

What is pruning in decision trees?

A

Pruning, by definition, is eliminating subtrees and replacing them with leaf nodes.

96
Q

How does pruning improve the performance of the tree?

A

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers

Tree pruning methods address this problem of
overfitting the data

97
Q

How does pruning improve the performance of the tree?

A

reducing its size and removing the parts of the tree that do not provide power to classify instances.

98
Q

What problem do tree pruning methods address?

A

They address the complexity of the tree: pruning reduces overfitting and increases the tree’s predictive power.

99
Q

Out of all of the machine learning algorithms, which are the most susceptible to overfitting?

A

Decision trees

100
Q

What are the two common approaches to tree pruning?

A

prepruning and postpruning
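
A hedged scikit-learn sketch of the two approaches (the parameter values are arbitrary examples): prepruning corresponds to stopping criteria such as max_depth or min_impurity_decrease, while postpruning can be done with cost-complexity pruning via ccp_alpha.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: halt the tree while it is growing, via stopping criteria.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01).fit(X, y)

# Postpruning: grow the full tree, then prune it back (cost-complexity pruning).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(random_state=0,
                              ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre.get_depth(), post.get_depth())   # both much shallower than an unpruned tree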

101
Q

What is prepruning?

A

In prepruning, a decision tree is halted while growing so that it won’t get too complex.

102
Q

What is postpruning?

A

The tree is grown to its fullest and then pruned following a bottom-up or top-down strategy.

103
Q

Pre-Pruning & Post-Pruning

Which method is considered more interesting and why?

A

Pre-pruning, because it saves time: no time is wasted growing subtrees that would later be eliminated.

104
Q

What is the main idea behind the prepruning approach in decision tree algorithms?

A

The main idea behind the prepruning approach is that trees are not pruned in prepruning algorithms; instead, the algorithms are halted based on some stopping criterion. This criterion is often related to the goodness of the split, which is determined by metrics such as Information Gain, Gini Index, Gain Ratio, etc. If the information measured at a test node falls below a predefined threshold, the branching on that path is halted.

105
Q

How is the decision to halt the branching determined in prepruning algorithms?

A

The decision to halt the branching is determined based on the goodness of the split. If the information measured at a test node is below a specified threshold, the branching is stopped on that path.

106
Q

What are some common stopping criteria used in prepruning algorithms?

A

Common stopping criteria include:
* Information Gain below a threshold
* Gini Index below a threshold
* Gain Ratio below a threshold
* Limiting tree size
* Limiting instances in an internal node
* Halt if class distribution of instances is independent of the available feature

107
Q

What is the role of threshold values in prepruning, and how are they used in determining when to stop branching?

A

Threshold values play a crucial role in prepruning, as they define the conditions for stopping the branching process. If certain metrics (e.g., Information Gain, Gini Index) fall below the specified thresholds, the tree-growing process is halted on that path. Similarly, tree size and instances in internal nodes can be limited by threshold values.

108
Q

What is the main goal of the prepruning approach in decision tree construction?

A

The main goal of the prepruning approach is to “prune” the tree by halting its construction early, thus preventing further splitting or partitioning of the subset of training tuples at a given node.

109
Q

How does a node in the prepruning approach become a leaf in the decision tree?

A

Upon halting the construction at a node in the prepruning approach, that node becomes a leaf. The leaf may hold either the most frequent class among the subset tuples or the probability distribution of those tuples.

110
Q

What measures are commonly used to assess the goodness of a split in the prepruning approach?

A

Measures such as statistical significance, information gain, Gini index, and similar metrics are commonly used to assess the goodness of a split in the prepruning approach.

111
Q

How is the decision to halt further partitioning determined in the prepruning approach?

A

If partitioning the tuples at a node would result in a split that falls below a pre-specified threshold (e.g., in terms of information gain, Gini index), further partitioning of the given subset is halted.

112
Q

What challenges or difficulties are associated with choosing an appropriate threshold in the prepruning approach?

A

Choosing an appropriate threshold in the prepruning approach is challenging. High thresholds may lead to oversimplified trees, while low thresholds could result in very little simplification. Striking the right balance is crucial for achieving an optimal level of simplification without sacrificing the tree’s predictive capabilities.

113
Q

How does postpruning differ from prepruning in terms of restrictions?

A

Postpruning is not restricted by predefined thresholds, unlike prepruning, which relies on stopping criteria based on specific thresholds.

114
Q

Describe the process of subtree pruning in postpruning.

A

In postpruning, a subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is then labeled with the most frequent class among the subtree being replaced.

115
Q

What is the potential impact on accuracy when performing subtree pruning in postpruning?

A

Pruning a subtree in postpruning might lower the accuracy in the training data; however, it is expected to increase the accuracy in the test data.

116
Q

In terms of efficiency and accuracy, how do prepruning and postpruning compare?

A

Prepruning is considered more efficient as it halts tree growth early, producing trees faster. On the other hand, postpruning tends to provide better accuracy overall, according to most studies, despite being potentially less efficient.

117
Q

What are the challenges associated with pruned decision trees?

A

Pruned decision trees, although more compact than their unpruned counterparts, may still be large and complex, leading to challenges in interpretation.

118
Q

Explain the concepts of repetition and replication in decision trees.

A

Repetition occurs when an attribute is repeatedly tested along a given branch of the decision tree. Replication refers to the existence of duplicate subtrees within the tree. Both repetition and replication can make decision trees overwhelming to interpret

119
Q

How can the issues of repetition and replication in decision trees be addressed?

A

The issues of repetition and replication can be addressed by using multivariate splits (splits based on a combination of attributes). Another approach is to use a different form of knowledge representation, such as rules, instead of decision trees.

120
Q

How does the goal of a regression tree differ from that of a classification tree?

A

The goal of a regression tree is regression, focusing on predicting continuous values instead of class labels. Unlike classification trees, which aim to assign class labels, regression trees predict continuous values for the resulting leaf nodes.

121
Q

What impurity measure is used in regression trees, and why?

A

In regression trees, mean squared error is used as the impurity measure instead of entropy or similar measures. Mean squared error is more suitable for regression tasks where the goal is to minimize the difference between predicted and actual continuous values.
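
A small sketch of an MSE-based split evaluation, mirroring entropy-based splitting but with mean squared error as the impurity (toy target values invented for illustration):

def mse(values):
    # Mean squared error around the leaf prediction (the mean of the values).
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_mse(xs, ys, threshold):
    # Weighted MSE of the two partitions x >= threshold and x < threshold.
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 7.8, 8.1]                 # continuous targets
print(round(split_mse(xs, ys, 2.5), 4))   # low weighted MSE -> a good split point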

122
Q

How are leaf nodes generated in a regression tree, and what information do they represent?

A

Leaf nodes in a regression tree are generated by taking an average over the distributed target values of the path that is taken after all the branching is done until that leaf node. These leaf nodes represent the predicted continuous values.

123
Q

Why is the resulting tree in a regression tree binary?

A

The resulting tree in a regression tree is binary because the nodes are always branched into two partitions: one with values greater than or equal to a specified value and another with values less than the specified value.

124
Q

What is the core principle of the Greedy method in decision-making?

A

The core principle of the Greedy method is to make locally-optimal choices at each step, hoping that these choices will lead to a globally-optimal solution. It focuses on making the best decision at the current moment without considering the long-term impact on future decisions.

125
Q

How does the Greedy method differ from considering the broader problem in decision-making?

A

The Greedy method makes decisions based on the information available at each phase without considering the broader problem. It focuses on local optimum choices at each stage, and there is a possibility that the greedy solution may not provide the best solution for every problem.

126
Q

How does the Greedy algorithm make decisions in the hope of finding the optimal solution?

A

The Greedy algorithm makes good local choices at each stage with the intention of finding the global optimum. It follows a strategy of making decisions based on the information available at each phase, aiming for the solution to be either feasible or optimal.

127
Q

What is the main aim of prepruning in Decision Tree learning?
(multiple answer)
A. To improve the training speed.
B. To improve the testing speed.
C. To reduce the memory requirement.
D. A and C

A

D

128
Q

True/False

In a decision tree, the more levels the final tree has,
the more accurate the prediction becomes. No
exception.

A

False

129
Q

Suppose we have a nominal attribute X with 4 values.
In decision tree learning algorithm, how many binary
split values are checked for that attribute?

A

In general, if we have a variable with n possible values, there are 2^(n−1) − 1 possible binary splits. In the above example, 4 values means 2^(4−1) − 1 = 7 possible splits.

130
Q

Learning a decision tree is a greedy algorithm. What does this mean, and is there any problem with such an algorithm?

A

It means that locally optimal decisions are made at
each node. Such algorithms cannot guarantee to return
the globally optimal decision tree.