Module 7 - Decision Trees Flashcards

1
Q

Decision Trees, advantages (5) and disadvantages (3)?

A

1) Advantages
- Intuitive and quick to run
- Good for recognizing natural break points in continuous variables
- Good for recognizing nonlinear interactions between variables
- Automatically handles categorical data -> no need to binarize or determine a base class
- Interactions automatically handled -> no need to identify potential interactions prior to fitting the tree

2) Disadvantages
- Unstable and prone to overfitting
- Does not predict as accurately as other models, due to the greedy nature of the tree-construction algorithm
- When the underlying data changes, the break points can change significantly, leading to low user confidence in the model

2
Q

Random Forest, advantages? (2)

A
  • Reduces model variance by combining the results of multiple decision trees
  • Detects nonlinear interactions between predictor variables when finding variable importance
3
Q

Gradient boosting machine (GBM), advantage (1), disadvantages (2)?

A

1) Advantage
Reduces model bias

2) Disadvantages
- More prone to overfitting than random forests
- More sensitive to hyperparameter inputs

4
Q

Pruning, description?

A

Reduces the size of decision trees by removing sections of the tree that provide little predictive power
= reduces complexity of final model

5
Q

Cost-complexity pruning, explained?

A

Use measures of impurity reduction to decide which branches to prune back

- This is done AFTER the decision tree has been built -> remove branches that don't achieve the threshold level of impurity reduction

6
Q

Complexity Parameter in decision tree, description?

A

A type of control parameter that indicates the MINIMUM amount of impurity reduction required for a split to be made

7
Q

Control Parameter: Minsplit, definition?

A

Minimum # of observations that must exist in a node in order for a split to be attempted

8
Q

Control Parameter: Minbucket, definition?

A

Minimum # of observations in any terminal node

9
Q

Control Parameter: Maxdepth definition?

A

Maximum depth of any node of the final tree

The root node is counted as depth 0

10
Q

Control Parameter: xval definition?

A

Number of cross-validations (CV) performed
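
A minimal R sketch (not from the original cards) showing where the control parameters above are set; the formula and the train_data data frame are hypothetical placeholders:

library(rpart)
# Hypothetical data/formula; method = "anova" fits a regression tree, "class" a classification tree
fit <- rpart(
  target ~ .,
  data = train_data,
  method = "anova",
  control = rpart.control(
    cp = 0.01,        # complexity parameter: minimum impurity reduction required for a split
    minsplit = 20,    # minimum observations in a node for a split to be attempted
    minbucket = 7,    # minimum observations allowed in any terminal node
    maxdepth = 5,     # maximum depth of any node (root = depth 0)
    xval = 10         # number of cross-validations
  )
)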

11
Q

Regression tree vs Classification tree: what’s the main difference?

A

A regression tree uses the RSS error between the target and predicted values as the impurity measure
- Instead of the Gini index / entropy / classification error used by classification trees

12
Q

Regression tree output: Variable importance

A

Shows the ordering of variables according to their contribution to the model

Variables higher up the list produced larger improvements in the splitting criterion when they were used as splits
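
As a quick sketch (assuming the rpart fit object from the earlier control-parameter example), the importance ordering can be read directly off the fitted tree:

# Named numeric vector, largest contribution first (only present if the tree has splits)
fit$variable.importance
# Optionally rescale to percentages for easier comparison
round(100 * fit$variable.importance / sum(fit$variable.importance), 1)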

13
Q

How to choose number of splits for a regression tree?

A

Choose the nsplit that MINIMIZES the cross-validation error (“xerror” in R)
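
A sketch of this selection in R (again assuming the rpart fit object from earlier): read the cp table, take the row with the smallest xerror, and prune back to that complexity:

printcp(fit)   # columns: CP, nsplit, rel error, xerror, xstd
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)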

14
Q

Describe how ensemble methods overcome the bias-variance tradeoff? (3)

A

1) Building many models on random subsets of data
= Taking an aggregate answer, instead of relying on 1 model

2) Each component model potentially becoming responsible for different parts of the complex relationship
- Thus cancelling out most of the noise that would have been caused by fitting the model on a specific subset of data

3) Reduces the variance of a model’s output
- By taking the average over all of the component models’ output

15
Q

Bagging, description? what is the goal?

A

1) Training multiple models independently in parallel on random subsets of data
- Then take the final result to be the average of the outputs of all models

2) Goal = model with low variance and low bias

16
Q

Boosting, definition? Effect?

A

1) Training 1 model, then training a subsequent model on the residuals obtained from predicting with the 1st one
- The new overall model is obtained by adding a scaled-down version of the 2nd model to the 1st, then the process is repeated
2) Effect: Each additional model will focus on predicting those observations the previous model did poorly on

17
Q

Random Forest, definition?

A

Bagged model where:

- Base model = decision tree
- Only a random subset of predictors is considered at each split -> decorrelates the trees
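
A minimal sketch with the randomForest package (hypothetical formula and train_data), where mtry controls the random subset of predictors tried at each split:

library(randomForest)
rf <- randomForest(
  target ~ .,
  data = train_data,
  ntree = 500,        # number of bagged trees
  mtry = 3,           # predictors sampled at each split (this decorrelates the trees)
  importance = TRUE   # track variable importance
)
varImpPlot(rf)        # plot variable importance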

18
Q

Oversampling/undersampling, which data do you need to use it on? Test, training, etc?

A

Only perform oversampling/undersampling on the TRAINING data
- Not on the full dataset before splitting into training/test sets

Oversampling before splitting into training/test sets increases the chance of the same record appearing in both the training and test data

19
Q

When can you use sampling techniques? (2)

A

Two valid orderings of the steps:

1) Split into training/test sets
2) Oversample/undersample the TRAINING data
3) Parameter tuning and selection
4) Train the model

or

1) Split into training/test sets
2) Parameter tuning and selection of the best parameters
3) Oversample/undersample the training data
4) Train the model
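
A sketch of the first ordering using caret's upSample, assuming a hypothetical train_data data frame whose target column is a factor (the test set is left untouched):

library(caret)
# Oversample the minority class in the TRAINING data only
train_bal <- upSample(x = train_data[, setdiff(names(train_data), "target")],
                      y = train_data$target,
                      yname = "target")
table(train_data$target)   # original (imbalanced) class counts
table(train_bal$target)    # balanced counts after oversampling
# ...then tune parameters and fit the model on train_bal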

20
Q

Partial dependence plot

A

Allows us to understand the relationship between features and the target

- Calculated by showing the average predicted value of the target while varying the value of 1 (or more) input features
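
A sketch with the pdp package, assuming a fitted model such as the rf object above and a hypothetical feature named age:

library(pdp)
pd <- partial(rf, pred.var = "age", train = train_data)  # average prediction as age varies
plotPartial(pd)                                          # plot the partial dependence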

21
Q

Bagging Methods, Advantage? Disadvantage? (2)

A

1) Advantage
- Reduces the expected loss of the model -> more robust than the individual component models

-Bagging methods do this by reducing variance without affecting the bias

2) Disadvantage
- Loss of model interpretability, in exchange for additional predictive power and robustness

22
Q

Boosting, focus

A

Focuses on building multiple models one after the other

  • At each step, adjust training data to place more emphasis on data points that previous models predicted poorly
  • Models NOT independently trained = main difference with bagging
23
Q

To simplify a decision tree in rpart, 3 ways?

A

1) Increase the CP
2) increase the minbucket parameter
3) decrease the maxdepth argument
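
A sketch of applying all three levers when refitting (hypothetical formula and data):

library(rpart)
simpler <- rpart(target ~ ., data = train_data, method = "anova",
                 control = rpart.control(cp = 0.05,       # 1) larger cp
                                         minbucket = 20,  # 2) larger minbucket
                                         maxdepth = 3))   # 3) smaller maxdepth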

24
Q

Why are random forests an improvement over bagging?

A

Random forests are an improvement over bagging because the trees are decorrelated

25
Q

Why can recursive binary splitting method lead to overfitting the data?

A

The method optimizes with respect to the training set, but may perform poorly on the test set

26
Q

True or false:

-A tree with more splits tends to have lower variance

A

FALSE

-Adding additional splits tends to INCREASE variance due to adding MORE complexity to the model

27
Q

Explain how AUC/ROC is used

Best/Worst Case value of AUC?

A
  • AUC/ROC curve tells how much a model is capable of distinguishing between classes -> measure of model performance
  • the higher the AUC, the better the model is at classifying
  • Best case: AUC = 1
  • Worst case: AUC = 0.5 -> incapable of classifying
  • AUC = 0 -> classifying the responses as the inverse (negative as positive and vice-versa)
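
A sketch with the pROC package, assuming hypothetical vectors of actual labels and predicted probabilities for the positive class:

library(pROC)
roc_obj <- roc(response = actual_labels, predictor = predicted_probs)
auc(roc_obj)    # area under the ROC curve (1 = perfect, 0.5 = no discrimination)
plot(roc_obj)   # ROC curve
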
28
Q

How are partial dependence plots used?

A

PDPs are used to show the marginal effect of a feature on the predicted outcome of the model

- Can show whether the relationship between the target and a feature is linear, monotonic, or more complex

29
Q

Bagging is trying to find what kind of model?

A

Model with low variance and low bias

30
Q

Advantage of using stratified sampling? (createDataPartition)

A

Stratified sampling is used to create balanced splits of the data, thus allowing us to preserve the overall class distribution of the data.

-Maintain the ratio / “balance” of the factor classes
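
A sketch with caret's createDataPartition, assuming a hypothetical full_data data frame with a factor target column:

library(caret)
idx <- createDataPartition(full_data$target, p = 0.7, list = FALSE)  # stratified 70/30 split
train_data <- full_data[idx, ]
test_data  <- full_data[-idx, ]
prop.table(table(train_data$target))  # class proportions should be
prop.table(table(test_data$target))   # roughly equal in both sets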

31
Q

Why are variables with a lot of dimensions bad?

A

Variables with many levels/dimensions can dilute predictive power, since the observations are spread thinly across the levels

32
Q

Interpret a decision tree bucket?

A

All observations in a given bucket have the same predicted value

33
Q

For DT, after finding the optimal value of the complexity parameter, 2 options? advantages and disadvantages

A
  1. Building a new tree using the optimal CP
    - However, in a new tree a good split might only occur after a bad split (greedy algorithm)
    - So if we pre-specify a CP that is too large, the tree might never reach those good splits
  2. Pruning the existing tree back to remove splits that don't satisfy the impurity reduction requirement
34
Q

in Xgboost, what is the shrinkage parameter used for?

A

Parameter that controls the rate at which the model converges towards a minimum value
- The higher the learning rate, the faster the model will find a minimum, but the model will likely be overfit

  • The smaller the learning rate, the slower the algorithm will converge (more iterations needed)
  • But the solution is more likely to be optimal
35
Q

Rule of thumb for the eta value in xgboost?

A

[0.001,0.2]
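
A sketch with the xgboost package, assuming a hypothetical numeric feature matrix train_x and target vector train_y, with eta chosen inside the rule-of-thumb range:

library(xgboost)
bst <- xgboost(
  data = as.matrix(train_x),
  label = train_y,
  eta = 0.05,                      # shrinkage / learning rate, within [0.001, 0.2]
  nrounds = 500,                   # a small eta needs more boosting iterations
  max_depth = 4,                   # depth of each boosted tree
  objective = "reg:squarederror"   # squared-error objective for a regression target
)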

36
Q

Describe what a decision tree does

A

DTs divide the feature space into a mutually exclusive, collectively exhaustive set of buckets, where all observations in a bucket are given the same predicted value