Tree-Based Models (10-20%) Flashcards

1
Q

Compare decision trees, random forests, and gradient boosting machines (GBMs).

A

Decision trees:
1. intuitive and quick to run
2. good for recognizing natural break points in continuous variables
3. good for recognizing nonlinear interactions between variables
4. unstable and prone to overfitting

Random forests:
1. a natural extension of decision trees that combines many trees to solve a problem
2. "bagging" (training each tree on a bootstrapped sample of observations), combined with using a random subset of features at each split, is what turns individual decision trees into a random forest, with a focus on reducing model variance
3. a powerful tool for detecting nonlinear interactions between predictor variables
4. less prone to overfitting than gradient boosting machines

Gradient boosting machines (GBMs):
1. built by repeatedly fitting new models to the residuals of the previous model, with a focus on reducing model bias
2. an extension of decision trees that uses boosting
3. a powerful tool for detecting nonlinear interactions between predictor variables
4. more prone to overfitting than random forests and more sensitive to hyperparameter inputs

2
Q

Describe decision trees.

A

The idea is to split the feature space into exhaustive and mutually exclusive segments that minimize the variability of the target within each segment. Each split divides a parent segment on the variable (and split point) that makes the resulting child segments most different from each other.

For classification, the estimate of the target for observations in a leaf segment/node is set as the class that appears most in that segment. (E.g., are the majority of observations in that segment lapses or non-lapses?)

For regression, the target estimate is set as the mean of the target values of the observations within the leaf segment.

The structure is similar to a real (upside-down) tree. Each new segment is created by splitting an existing segment.
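
A minimal scikit-learn sketch of both cases (my own illustration, not from the source material; the toy data and settings are made up):

```python
# Hypothetical illustration with scikit-learn; the toy data is made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                          # two continuous features

# Classification: each leaf predicts the majority class of its observations
y_class = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)    # e.g., lapse vs non-lapse
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
print(clf.predict(X[:5]))        # majority class of each observation's leaf

# Regression: each leaf predicts the mean of the target in that leaf
y_reg = 3 * X[:, 0] + rng.normal(0, 1, 200)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
print(reg.predict(X[:5]))        # mean target value of each observation's leaf
```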

3
Q

Define binary tree.

A

A tree structure in which each node has at most two children

4
Q

Define balanced binary tree.

A

A binary tree in which the left and right subtrees of any node differ in depth by at most one.

5
Q

Define a node in decision trees.

A

A subset of the data. For a given node, any node below it must contain a subset of that node's data.

6
Q

Define a root in decision trees.

A

The top node in a tree.

7
Q

Define a child in decision trees.

A

A node directly connected to another node when moving away from the root node. In most tree depictions, the child nodes appear below their parent nodes.

8
Q

Define a parent in decision trees.

A

A node directly connected to another node when moving away from the root. In most tree depictions, the parent node appears above its child nodes.

9
Q

Define a leaf in decision trees.

A

A node with no children. These represent the final segments of data. The union of all data in the leaf nodes will be the full dataset.

10
Q

Define edge in decision trees.

A

The connection between one node and another. These are represented by the arrows in the tree.

11
Q

Define depth in decision trees.

A

The number of edges from the tree’s root node to the “furthest” leaf.

12
Q

Define a subtree in decision trees.

A

A branch of the tree that doesn’t start from the root.

13
Q

Define pruning in decision trees.

A

A technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little predictive power. Pruning reduces the complexity of the final model and, hence, improves predictive accuracy by reducing overfitting.

14
Q

Define unbalanced binary tree.

A

A binary tree in which the left and right subtrees of some nodes differ in depth by more than one.

15
Q

Define entropy.

A

Entropy is a measure of the impurity of each node in a decision tree (classification only, although there are regression equivalents). Knowing the impurity of a node helps the decision tree decide how to create the splits. We can calculate the impurity of the resulting nodes for a candidate split so that we can measure how good that split is. We choose the final split as the candidate split that results in child nodes with the lowest levels of impurity.

Entropy = -sum(p*log(p))
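
A small illustrative sketch (not from the source) of this calculation, assuming p is the vector of class proportions in a node and using log base 2:

```python
import numpy as np

def entropy(p):
    """Entropy of a node given its class proportions p (which sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

# A 50/50 node is maximally impure; a pure node has entropy 0.
print(entropy([0.5, 0.5]))   # 1.0 (with log base 2)
print(entropy([1.0, 0.0]))   # 0.0
```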

16
Q

Define information gain in decision trees.

A

We would like to be able to split the data into smaller segments that help us improve the overall impurity of the resulting segments. We call this improvement in the impurity the information gain resulting from a chosen split. In order to pick the best split, we need to calculate the information gain for all possible splits and then pick the best one.

Information gain = entropy(parent) - sum[(Nk/Np) * entropy(child_k)], where Nk is the number of observations in child node k and Np is the number of observations in the parent node.
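
An illustrative sketch (mine) of this calculation; the helper names and toy labels are made up:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def proportions(y):
    """Class proportions of a vector of labels."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def information_gain(y_parent, y_children):
    """entropy(parent) - sum over children of (N_k / N_p) * entropy(child_k)."""
    n_p = len(y_parent)
    weighted = sum(len(y_k) / n_p * entropy(proportions(y_k)) for y_k in y_children)
    return entropy(proportions(y_parent)) - weighted

# Toy example: a split that perfectly separates the classes has maximal gain.
y_parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y_parent, [np.array([0, 0, 0]), np.array([1, 1, 1])]))  # 1.0
```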

17
Q

Describe Gini.

A

Gini impurity is a measure of how often a randomly chosen element from the segment would be incorrectly classified if it were randomly classified according to the distribution of targets in the subset.

The goal is to make Gini as small as possible by picking splits that maximize its reduction.

Gini = 1-sum(p^2)
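
A quick illustrative sketch (not from the source), assuming p is the vector of class proportions in a node:

```python
import numpy as np

def gini(p):
    """Gini impurity of a node given its class proportions p: 1 - sum(p^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini([0.5, 0.5]))   # 0.5 -- maximally impure two-class node
print(gini([1.0, 0.0]))   # 0.0 -- pure node
```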

18
Q

Describe classification error formula.

A

Classification error = 1 - max(p)

19
Q

Methods to compare impurity in regression trees.

A

Variance (squared deviance), R-squared, or residual sum of squares error between target and predicted values.

The use of the R-squared impurity measure assumes the data is distributed according to a Gaussian (or at least a symmetric distribution).

Our goal is to maximize the difference between nodes and minimize the variability within each node.
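
A sketch (with made-up data) of measuring a candidate regression-tree split by the reduction in the residual sum of squares, one of the impurity measures listed above:

```python
import numpy as np

def sse(y):
    """Residual sum of squares around the node mean (the regression impurity)."""
    return np.sum((y - y.mean()) ** 2)

def sse_reduction(y_parent, mask):
    """Impurity reduction from splitting y_parent by a boolean mask."""
    return sse(y_parent) - (sse(y_parent[mask]) + sse(y_parent[~mask]))

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = np.where(x < 4, 2.0, 8.0) + rng.normal(0, 0.5, 100)

# The candidate split x < 4 lines up with the true break point, so it
# removes most of the variability within the resulting nodes.
print(sse_reduction(y, x < 4.0))
```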

20
Q

Define cost complexity pruning.

A

We can also use measures of impurity reduction to decide which branches to prune back after a decision tree has been built, removing those branches that don't achieve a threshold level of impurity reduction.

It begins with the full tree previously built and removes the least important splits, using the specified CP value. Note that when the example tree was built, CP was set to zero. This allowed the most complex tree (given the other control parameters) to be built. If the original CP was set higher, the initial tree would be simpler, and pruning would begin from that simpler tree.
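
The CP/xerror terminology above matches R's rpart output, which isn't reproduced here; as a rough illustration, here is a scikit-learn sketch in which ccp_alpha plays a role analogous to CP (the data and settings are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Grow the most complex tree first (analogous to building with CP = 0) ...
full_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.0).fit(X, y)

# ... then inspect the pruning sequence and refit with a larger penalty,
# which removes the least important splits.
path = full_tree.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)                                # candidate penalty values
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-3]).fit(X, y)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```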

21
Q

Describe the complexity parameter (CP).

A

Tells the algorithm the minimum impurity reduction required before making a split. A high complexity parameter means the algorithm will make fewer splits, resulting in a less complex tree, whereas a low complexity parameter means there is a very low requirement for impurity reduction, so more splits will be made, resulting in a more complex tree. We need to find the right complexity level that optimizes the bias-variance trade-off, or maximizes the generalization capability of the model. This is essentially what the internal cross validation does: it finds the level of complexity within the tree that results in the greatest generalization performance.
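
A sketch (mine, not the source's R workflow) of using cross-validation to pick the complexity penalty with the best generalization:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Candidate complexity penalties from the fully grown tree's pruning path.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
alphas = tree.cost_complexity_pruning_path(X, y).ccp_alphas

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,                      # internal cross-validation over complexity levels
)
search.fit(X, y)
print(search.best_params_)     # complexity level with the best generalization
```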

22
Q

Define “one standard error” approach.

A

Take the CP value with the smallest xerror and add its standard error (xstd) to that xerror to form a threshold. Then select the simplest tree (largest CP / fewest splits) whose xerror is at or below that threshold. If no simpler tree falls below the threshold, the CP value with the smallest xerror is selected.
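
A small worked sketch of the rule on a made-up complexity table (the column names mirror the CP/xerror/xstd output referenced above; the numbers are illustrative only):

```python
import numpy as np

cp     = np.array([0.100, 0.050, 0.020, 0.010, 0.001])
xerror = np.array([1.000, 0.820, 0.760, 0.750, 0.770])
xstd   = np.array([0.040, 0.035, 0.033, 0.032, 0.033])

best = np.argmin(xerror)                    # row with the smallest xerror
threshold = xerror[best] + xstd[best]       # smallest xerror + one standard error

# Choose the simplest tree (largest CP) whose xerror is within the threshold.
candidates = np.where(xerror <= threshold)[0]
chosen = candidates[np.argmax(cp[candidates])]
print(cp[chosen])   # 0.02 here: a simpler tree than the minimum-xerror tree
```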

23
Q

Accuracy formula.

A

Accuracy = (TP + TN)/N

24
Q

Error rate formula.

A

Error rate = (FP + FN)/N = 1-accuracy

25
Q

Precision formula.

A

Precision = TP/(TP + FP)

Precision is the proportion of positive predictions that turn out to be correct.

26
Q

Sensitivity formula.

A

Sensitivity = TP/(TP + FN)

Sensitivity is the proportion of actual positives that were predicted to be positive. Also known as the true positive rate (TPR).
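
A small sketch (with made-up confusion-matrix counts) computing the four metrics above:

```python
# Made-up confusion-matrix counts to illustrate the formulas above.
TP, TN, FP, FN = 40, 30, 10, 20
N = TP + TN + FP + FN

accuracy    = (TP + TN) / N          # 0.70
error_rate  = (FP + FN) / N          # 0.30 = 1 - accuracy
precision   = TP / (TP + FP)         # 0.80: correct among positive predictions
sensitivity = TP / (TP + FN)         # ~0.67: actual positives predicted positive (TPR)
print(accuracy, error_rate, precision, sensitivity)
```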

27
Q

Define Receiver Operator Characteristic (ROC) Curve.

A

The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) over all cutoff values rather than just using a single cutoff. Each cutoff produces an (FPR, TPR) pair, which is a point on the ROC curve.

FPR = FP/(FP + TN)
TPR = TP/(TP + FN)

note: a good result is a curve that is well above the line from (0,0) to (1,1).
note: the area under the curve (AUC) gives an estimate of the model's fit.
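
An illustrative scikit-learn sketch (mine; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # predicted probabilities

fpr, tpr, cutoffs = roc_curve(y_te, scores)       # (FPR, TPR) over all cutoffs
print(roc_auc_score(y_te, scores))                # area under the ROC curve
```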

28
Q

Describe the area under the curve (AUC).

A

An estimate of model fit is the area under the ROC curve

29
Q

What do GLMs and decision trees suffer from?

A
  1. Limited ability to extract complex relationships from the data (in the case of GLMs)
  2. Sensitivity to noise and a tendency to overfit (in the case of decision trees)

Note: these drawbacks relate to the bias-variance trade-off.
Note: ensemble methods are a good way to overcome these limitations.

30
Q

Describe ensemble methods.

A

Instead of relying on one single model (e.g., a single GLM or decision tree), we build many models on random subsets of the data and take the answer in aggregate. Not only does this increase our ability to reflect complex relationships in the data (with each component model potentially becoming responsible for different parts of the complex relationship), but it also reduces the variance of our model's output by taking the average (or a similar measure) over all of the component models' outputs. Thus, most of the noise introduced by fitting to a specific subset of data will be canceled out because the results of all models are considered.

We hinted at this when the bias-variance decomposition was introduced: when we trained 10 models on different subsets of the data to see how each model varied, taking the average of all of those models gave a much more stable result that ended up much closer to the true signal than any of the individual models. When fitting a single model such as a decision tree or a GLM, we had to be careful with how complex we let the model become; more complexity allowed much lower bias, but we ended up with models with many more variables. There are several types of ensemble methods, some of which reduce bias and others that reduce variance.

Advantage: better predictiveness and more robust (compared to GLMs and decision trees)
Disadvantage: more computationally expensive (fitting many models to the data)

31
Q

Define bagging.

A

Bagging (bootstrap aggregation) is an ensemble method that involves training multiple base models in parallel on bootstrapped samples of the training dataset. The outputs of these base models are then aggregated by averaging all of the base model predictions. We do this because we ultimately want a model with low variance and low bias.
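
A minimal scikit-learn sketch of bagging (illustrative data and settings; not from the source):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

bagged = BaggingRegressor(
    n_estimators=100,   # number of bootstrapped base models (decision trees by default)
    bootstrap=True,     # sample observations with replacement
    random_state=0,
).fit(X, y)
print(bagged.predict(X[:3]))   # average of the 100 base models' predictions
```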

32
Q

Define boosting.

A

Boosting involves training a single model, then training a subsequent model on the residuals obtained from predicting with the first one. The new model is obtained by adding a scaled-down version of the second model to the first one, and the process is repeated. The effect produced is that each additional model will focus on predicting those observations the previous model did poorly on until gradually the entire model (which is just the sum of all component models) predicts the entire dataset well. This algorithm is commonly called a gradient boosting machine.
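
A minimal hand-rolled sketch of this idea (mine, with made-up data); it illustrates the residual-fitting loop and is not a production GBM:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)

shrinkage = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                       # what the ensemble still misses
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += shrinkage * tree.predict(X)        # add a scaled-down correction
    trees.append(tree)

print(np.mean((y - prediction) ** 2))    # training error shrinks as trees are added
```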

33
Q

Define random forests.

A

Random forests are bagged models where the base model is a decision tree.

In addition, a modification is made regarding how the trees are constructed: at each split, only a subset (selected at random) of the predictor variables is considered. This allows early splits to use different predictors, which may lead to alternative models that fit well. On the downside, random forests are much less interpretable than an individual decision tree because they can consist of hundreds or even thousands of trees.

34
Q

List the steps in how to fit a random forest.

A
  1. Training
    - Take a random sample of observations (with replacement) from the training data
    - At each split, search among a random sample of features (selected without replacement) to determine that split
    - Train a decision tree on the above
    - Repeat
  2. Predicting
    - Predict the target using each tree previously trained
    - Average the predictions
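
A scikit-learn sketch of this procedure (illustrative data and settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
).fit(X, y)

# Predictions aggregate the individual trees (majority vote / averaged probabilities).
print(rf.predict(X[:5]))
print(rf.predict_proba(X[:5])[:, 1])
```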
35
Q

How to prevent random forests from overfitting?

A

Theoretically, random forests shouldn't overfit, but if the individual trees are overly complex, the random forest could still overfit. Parameters that can be controlled to prevent this (see the sketch after this list) are:
1) max depth of the tree
2) min improvement required to split
3) min observations required before splitting
4) min observations required after splitting (at each leaf)
5) the splitting method, e.g., Gini or squared deviance
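
A sketch (mine) of how these controls map onto scikit-learn's random forest arguments; the specific values are arbitrary:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    max_depth=6,                  # 1) maximum depth of each tree
    min_impurity_decrease=0.001,  # 2) minimum improvement required to split
    min_samples_split=20,         # 3) minimum observations required before splitting
    min_samples_leaf=10,          # 4) minimum observations required after splitting (per leaf)
    criterion="gini",             # 5) the splitting method (e.g., Gini or entropy)
    random_state=0,
)
```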

36
Q

What is the issue with unbalanced data?

A

When data is very unbalanced, there is a high proportion of one target value relative to the other. A model trained on such data can struggle to pick up the signal for the minority class and may end up mostly predicting the majority class.

37
Q

What are the approaches to combat unbalanced data?

A

Note: under-/oversampling is only performed on the training data.

1) Undersampling - also known as downsampling. Instead of using the full dataset with all of the majority observations, we undersample the majority observations and keep all of the minority observations. This means we end up with balanced classes when training our model, so it can better pick up the signal leading to the minority class. However, we are using less data and, consequently, have less information about the majority class.

2) Oversampling - also known as upsampling. Keep all of the data, and oversample (either generate new observations or duplicate/sample with replacement) the minority class observations. Again, this leaves us with roughly balanced target classes, which will improve model performance.
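
A small sketch (with made-up labels) of both approaches applied to the training indices:

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.choice([0, 1], size=1000, p=[0.95, 0.05])   # very unbalanced target
majority = np.where(y_train == 0)[0]
minority = np.where(y_train == 1)[0]

# Undersampling: keep all minority rows, sample the majority down to match.
down_idx = np.concatenate([
    minority,
    rng.choice(majority, size=len(minority), replace=False),
])

# Oversampling: keep all rows, duplicate minority rows (sampled with replacement).
up_idx = np.concatenate([
    majority,
    rng.choice(minority, size=len(majority), replace=True),
])
print(len(down_idx), len(up_idx))   # both index sets are roughly class-balanced
```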

38
Q

What are the 2 methods used in the random forest to help with interpretation?

A

1) Feature importance: a method that analyzes the structure of the model and ranks the contribution of each feature
2) Partial dependence plots: a method that allows us to get an understanding of the relationship between features and our target. Partial (or “average”) dependence plots calculate and show the “average” predicted value of the target variable by varying the value of one (or more) input features. This isn’t a perfect visualization because it only captures the average dependence, but we are limited by the human brain’s capacity to comprehend highly complex relationships between many variables beyond three dimensions. Partial dependence plots at least give us a way to check that the model is reacting to certain features in a sensible way, and give some insight into its behavior.
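
An illustrative scikit-learn sketch of both interpretation aids (synthetic data; not from the source):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# 1) Feature importance: ranks each feature's contribution within the fitted trees.
print(rf.feature_importances_)

# 2) Partial dependence: average predicted target as one feature is varied.
pd_result = partial_dependence(rf, X, features=[0])
print(pd_result["average"])   # average prediction over a grid of feature 0's values
```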

39
Q

Name the advantages and disadvantages of bagging.

A

Advantages:
1) bagging methods reduce the expected loss of models by reducing variance without affecting the bias. This means that bagged models are more robust than their individual components.
2) able to handle both categorical and numerical values, and missing values.

Disadvantages:
1) as the predictive power and robustness of the model increase, the interpretability of the model decreases
2) less flexible than boosting in their ability to reduce bias, since they are tied to simpler base models

40
Q

What are the differences between bagging and boosting?

A

Bagging: addresses model variance
- parallel training (the base models are trained independently)

Boosting: focuses on bias, i.e., capturing a complex signal in the data
- better predictive accuracy
- more susceptible to overfitting; regularization (shrinkage) or early stopping is used to prevent this
- requires performing the model-fitting procedure many times, which comes at a computational cost
- sequential training (not independent): multiple models are built one after the other, and at each step the training data is adjusted to place more emphasis on the data points that previous models predicted poorly

41
Q

Describe gradient boosting machines.

A

They approximate the solution by fitting each new model such that it moves in the direction of the negative gradient of the loss function. In the squared loss case for regression, the derivative is given by the residuals, so this is not new. What is new is that the new model that is added at each step is multiplied by a constant called the learning or shrinkage parameter. Essentially, it controls the rate at which you move toward the minimum value. The trade-off is that smaller steps take longer to converge, but can reduce overfitting.

Variable Importance - We can see an overall importance for each feature, measured as that feature's total weighted contribution to the model improvement across all trees in the model.
Partial Dependence Plots - As with the other ensemble methods, we rely on partial dependence plots to understand the model's behavior. Again, these plots give us a qualitative picture of what the model is doing.
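
An illustrative scikit-learn sketch of a GBM with a shrinkage parameter (synthetic data; not from the source):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=500,      # number of sequential trees
    learning_rate=0.05,    # shrinkage: smaller steps toward the loss minimum
    max_depth=2,           # keep each component tree simple
    random_state=0,        # default squared loss: negative gradient = residuals
).fit(X, y)

print(gbm.feature_importances_)   # overall weighted contribution of each feature
```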