Decision Trees Flashcards
Briefly explain how a decision tree works
Decision trees divide the feature space into a finite set of non-overlapping regions containing relatively homogeneous observations that are more amenable to analysis and prediction. A new observation is assigned the average target value (numeric target) or the most common class (categorical target) of the training observations in the region it falls into
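A minimal sketch with scikit-learn (the dataset, depth, and random_state are illustrative choices): each printed leaf corresponds to one region of the feature space, and every observation falling into that region receives the same prediction.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each leaf in the printed rules is one rectangular region of the feature space
print(export_text(tree, feature_names=list(data.feature_names)))
```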
Explain why the classification error rate is generally less sensitive to node impurity than the Gini index and entropy
This is because the classification error rate depends on the class proportions only through the largest proportion. If two nodes have the same maximum class proportion, they will always share the same classification error rate, regardless of how the observations are distributed among the other classes, whereas the Gini index and entropy do register those differences
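A small numeric illustration (the class proportions below are made up for the example): two nodes share the same maximum proportion, hence the same classification error rate, while the Gini index and entropy still differ.

```python
import numpy as np

def classification_error(p):
    return 1 - np.max(p)              # depends only on the largest proportion

def gini(p):
    return 1 - np.sum(np.asarray(p) ** 2)

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                      # drop zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

node_a = [0.5, 0.25, 0.25]            # same maximum proportion (0.5) ...
node_b = [0.5, 0.5, 0.0]              # ... but different remaining distribution

for name, p in [("A", node_a), ("B", node_b)]:
    print(name, classification_error(p), round(gini(p), 3), round(entropy(p), 3))
# Error rate is 0.5 for both nodes; Gini (0.625 vs 0.5) and entropy (1.5 vs 1.0)
# still distinguish them
```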
Recursive binary splitting is a top-down, greedy algorithm. Explain the meaning of the terms “top-down” and “greedy”
- “Top-down” means that we start from the top of the tree and go down, sequentially partitioning the feature space in a series of binary splits
- “Greedy” means that we adopt the binary split that leads to the greatest reduction in impurity at that point, rather than looking ahead and selecting a split that results in a better tree in a future step
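A minimal sketch of one greedy step, assuming a single numeric feature and the Gini index (the toy data and the helper `best_split` are made up for illustration): the split with the largest impurity reduction right now is chosen, with no lookahead.

```python
import numpy as np

def gini(y):
    # Gini impurity of a set of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(x, y):
    """Greedy search over candidate thresholds on one numeric feature."""
    parent = gini(y)
    best = (None, 0.0)                      # (threshold, impurity reduction)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        reduction = parent - (w_left * gini(left) + w_right * gini(right))
        if reduction > best[1]:
            best = (t, reduction)
    return best

# The split is chosen purely on the current impurity reduction,
# not on how later splits might turn out
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 1, 0, 1, 1])
print(best_split(x, y))
```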
Explain why it is not a good idea to grow a tree by specifying an impurity reduction threshold and making a split only when the reduction in impurity exceeds the threshold
This is because it is short-sighted: a seemingly poor split early in the tree-building process may be followed by a very good split later on. If the early split is rejected because its impurity reduction falls below the threshold, we never reach the good split that would have come after it
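A hedged sketch of the problem, using scikit-learn's min_impurity_decrease to mimic an impurity reduction threshold on XOR-style data (the simulated data and the 0.1 threshold are arbitrary choices): no single split reduces impurity much, but two splits classify the data almost perfectly, so the threshold-stopped tree never gets there.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # XOR-style pattern

full = DecisionTreeClassifier(random_state=0).fit(X, y)
thresholded = DecisionTreeClassifier(min_impurity_decrease=0.1, random_state=0).fit(X, y)

print("fully grown tree, training accuracy:", full.score(X, y))
print("threshold-stopped tree, node count:", thresholded.tree_.node_count)  # likely 1: no splits made
```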
Explain how cost-complexity pruning works
Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits whose improvement in fit does not justify the penalty set by the complexity parameter. We can use cross-validation to optimize the complexity parameter: for each candidate value, models are repeatedly trained on some folds of the data and validated on the held-out fold, and the value with the lowest cross-validation error is selected as the optimal choice. We then prune back the tree grown on the full training set using that optimal complexity parameter.
Cost-complexity pruning trades off the goodness of fit of the tree to the training data against the size/complexity of the tree
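A minimal sketch with scikit-learn, where cost-complexity pruning is controlled by ccp_alpha (analogous to, though scaled differently from, the cp parameter in R's rpart); the dataset and candidate grid are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate penalty values taken from the pruning path of a fully grown tree;
# clip at 0 in case floating-point error produces tiny negative alphas
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
grid = {"ccp_alpha": np.unique(np.clip(path.ccp_alphas, 0.0, None))}

# Cross-validation: repeatedly train on some folds and validate on the held-out fold
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

print("best penalty:", search.best_params_["ccp_alpha"])
print("pruned tree size:", search.best_estimator_.tree_.node_count)
```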
Explain how the complexity parameter affects the quality of a decision tree
The complexity parameter (cp) is a penalty term in place to prevent overfitting by penalizing tree size
* cp = 0: there is no penalty for a complex tree, and the fitted tree is identical to the large, overgrown tree obtained without pruning
* As cp increases from 0 to 1, the penalty grows and tree branches are progressively snipped off to form new, larger terminal nodes. The squared bias of the tree predictions increases, but the variance drops
* cp = 1: the tree is reduced to the root node only (the analogue of an intercept-only linear model). At this point the drop in relative training error can never compensate for the increase in the penalty, so no splits are made
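A short illustration of the same idea, again using scikit-learn's ccp_alpha as a stand-in for cp (dataset is illustrative): as the penalty grows along the pruning path, the fitted tree shrinks until only the root node remains.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in np.clip(path.ccp_alphas, 0.0, None):
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  nodes={tree.tree_.node_count}")
# The largest alpha yields a single-node tree: the drop in training error from
# any split can no longer offset the penalty, so no splits are made
```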
Explain how a monotone transformation on a numeric predictor and on the target variable affects a decision tree
Monotone transformation on a numeric predictor: produces the same decision tree (the same partition of the observations) because the splits depend only on the ranks of the feature values, not their actual values
Monotone transformation on the target variable: generally produces a different decision tree because the impurity/error reductions that drive the splits are computed on the transformed scale, so different split points become optimal
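A sketch of both cases with a regression tree (the simulated data and the log transform are illustrative choices).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 1))
y = np.exp(X[:, 0] / 25) * rng.uniform(0.8, 1.2, size=200)   # strictly positive target

def tree_rules(features, target):
    return export_text(DecisionTreeRegressor(max_depth=2, random_state=0).fit(features, target))

print(tree_rules(X, y))           # original tree
print(tree_rules(np.log(X), y))   # same partition of observations; thresholds shown on the log scale
print(tree_rules(X, np.log(y)))   # different splits: error reductions are computed on the new scale
```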
Explain why a decision tree tends to favor categorical predictors with many levels
This is because there are numerous ways to split its levels into two groups, making it easier to choose a seemingly good split based on a multi-level categorical predictor for the particular training data at hand, even though that predictor is not necessarily an important one. This will likely lead to overfitting.
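For reference, a categorical predictor with k levels admits 2^(k-1) - 1 distinct binary splits, which grows very quickly with k:

```python
# Number of ways to split k levels into two non-empty groups
for k in (2, 5, 10, 20):
    print(k, 2 ** (k - 1) - 1)
# 2 -> 1, 5 -> 15, 10 -> 511, 20 -> 524287
```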
Explain how a random forest works
- A random forest first uses the bootstrap to produce B training samples of the same size as the original training set (sampled independently and randomly with replacement)
- A separate decision tree is then fitted to each bootstrapped sample, producing B fitted trees; to reduce the correlation between the trees, only a random subset of the features (mtry of them) is considered at each split
- The results from the B fitted trees are combined to form an overall prediction
* Numeric target variables = average of the B predictions
* Categorical target variables = majority vote
Summary = make independent bootstrapped samples, fit a decision tree to each bootstrapped sample separately, and aggregate the predictions of these trees to reduce variance
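A minimal sketch of the bagging mechanics for a numeric target (the data are simulated; a real random forest would additionally restrict the features considered at each split, as discussed in the mtry card below).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
B = 100

trees = []
for _ in range(B):
    # Sample with replacement, same size as the original training set
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Averaging the B predictions reduces variance
ensemble_pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(ensemble_pred)
```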
Explain how a boosted tree works
Boosting builds a sequence of interdependent trees using information from previously grown trees. In each iteration, a tree is fit to the residuals of the preceding fit, and a scaled-down version of the current tree’s predictions is added to the previous predictions, focusing on the observations that the earlier trees predicted poorly.
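A minimal sketch of the idea for a numeric target with squared-error loss (the learning rate, tree depth, number of rounds, and simulated data are arbitrary choices).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
learning_rate = 0.1          # the shrinkage factor
n_rounds = 200

pred = np.zeros_like(y, dtype=float)
for _ in range(n_rounds):
    residuals = y - pred                      # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # add a scaled-down correction

print("training MSE:", np.mean((y - pred) ** 2))
```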
Explain the differences between a random forest and a boosted tree
- In parallel vs. in series: base trees in a random forest are fitted in parallel, independently of one another; base trees in a boosted model are fitted in series, each one depending on the trees grown before it
- Bias vs variance: random forests address model variance, and boosted trees focus on a reduction in model bias
Explain the considerations that go into the choice of the mtry parameter of a random forest
- The larger the value of mtry (the number of features considered at each split), the more correlated the base trees are and the smaller the variance reduction from averaging them
- If mtry is too small, each split has too little freedom in choosing the split variables, and important predictors may be left out of consideration
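A sketch comparing a few values, assuming scikit-learn's RandomForestRegressor, where mtry corresponds to the max_features parameter (the simulated data and candidate values are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for m in (1, 3, 10):     # 10 = all features, i.e., plain bagging
    rf = RandomForestRegressor(
        n_estimators=300, max_features=m, oob_score=True, random_state=0
    ).fit(X, y)
    # Out-of-bag R^2 gives a quick estimate of out-of-sample performance
    print(f"mtry={m:2d}  OOB R^2={rf.oob_score_:.3f}")
```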
Explain what a partial dependence plot is and in what way it improves over a variable importance plot
Partial dependence plot = attempts to visualize the association between a given predictor and the model prediction after averaging out the effects of the other predictors. These plots give insight into how the model’s predictions (and hence the target variable) depend on each individual predictor
Variable importance plot = tells us which predictors are most influential, but does not shed light on the relationship between the predictors and the target variable (e.g., whether the relationship is positive, negative, or more complex)
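A sketch contrasting the two views with scikit-learn (requires matplotlib; the dataset and the feature indices are arbitrary choices).

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_friedman1(n_samples=500, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importance scores rank predictors but carry no sign or shape information
print(rf.feature_importances_)

# Partial dependence: average prediction as one feature varies, with the other
# features held at their observed values
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 3])
```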
What is information gain?
Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly impacts information gain calculations.
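A small worked example using entropy as the impurity measure (the class labels are made up): information gain = parent impurity minus the size-weighted impurity of the child nodes.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(round(gain, 4))   # ~0.1887; using Gini instead would give a different number
```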
Compare random forests and gradient boosting machines
Random forests tend to do well in reducing variance while having similar bias to that of a basic tree model.
Gradient boosting machines use the same underlying training data at each step. This is very effective in reducing bias but is very sensitive to the training data (high variance)
What is a node? Explain the different types of tree nodes
A node is a point on a decision tree that corresponds to a subset of the training data
* Root node: the node at the top of a decision tree representing the full dataset
* Terminal nodes (leaves): the nodes at the bottom of a tree which are not split further and constitute an end point