Module 7 - Decision Trees Flashcards
Decision Trees, advantages (5) and disadvantages? (3)
1) Advantages
- Intuitive and quick to run
- Good for recognizing natural break points in continuous variables
- Good for recognizing nonlinear interactions between variables
- Automatically handles categorical data -> no need to binarize or determine a base class
- Interactions automatically handled -> no need to identify potential interactions prior to fitting the tree
2) Disadvantages
- Unstable, prone to overfitting
- Does not predict as accurately as other models, due to the greedy nature of the tree construction algorithm
- When underlying data changes, break points for DTs can change significantly, leading to low user confidence in the model
Random Forest, advantages? (2)
- Reduces model variance by combining the results of multiple decision trees
- Detects nonlinear interactions between predictor variables when finding variable importance
Gradient boosting machine (GBM), advantage (1), disadvantages (2)?
1) Advantages
- Reduces model bias
2) Disadvantages
- More prone to overfitting than random forests
- More sensitive to hyperparameter inputs
Pruning, description?
Reduces the size of decision trees by removing sections of the tree that provide little predictive power
= reduces complexity of final model
Cost-complexity pruning, explained?
Use measures of impurity reduction to decide which branches to prune back
-This is AFTER a decision tree has been built -> remove branches that don’t achieve the threshold level of impurity reduction
Complexity Parameter in decision tree, description?
A type of control parameter that indicates the MINIMUM amount of impurity reduction required for a split to be made
Control Parameter: Minsplit, definition?
Minimum # of observations that must exist in a node in order for a split to be attempted
Control Parameter: Minbucket, definition?
Minimum # of observations in any terminal node
Control Parameter: Maxdepth definition?
Maximum depth of any node of the final tree
The Root node counted as depth = 0
Control Parameter: xval definition?
# of cross-validations (CV) performed
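A minimal rpart sketch tying these control parameters together (the data frame train_data and the variable target are illustrative placeholders):
library(rpart)
fit <- rpart(target ~ ., data = train_data,
             control = rpart.control(cp = 0.01,       # minimum impurity reduction required for a split
                                     minsplit = 20,   # minimum # of observations in a node to attempt a split
                                     minbucket = 7,   # minimum # of observations in any terminal node
                                     maxdepth = 5,    # maximum depth of any node (root = 0)
                                     xval = 10))      # # of cross-validations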
Regression tree vs Classification tree: what’s the main difference?
A regression tree uses the RSS (residual sum of squares) between target and predicted values as the impurity measure
- Instead of the Gini index / entropy / classification error used by classification trees
Regression tree output: Variable importance
Shows the ordering of variables according to their contribution to the model
Variables higher up in the list = larger improvements in the splitting criteria when they were used as splits
How to choose number of splits for a regression tree?
Choose the nsplit that MINIMIZES the cross-validation error (“xerror” in R)
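A sketch of this in rpart, assuming fit is a tree grown with a deliberately small cp:
cp_table <- fit$cptable                                      # one row per candidate nsplit
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]  # CP row with the lowest cross-validation error
pruned   <- prune(fit, cp = best_cp)                         # prune back splits that don't meet this CP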
Describe how ensemble methods overcome the bias-variance tradeoff? (3)
1) Building many models on random subsets of data
= Taking an aggregate answer, instead of relying on 1 model
2) Each component model potentially becoming responsible for different parts of the complex relationship
- Thus cancelling out most of the noise that would have been caused by fitting the model on a specific subset of data
3) Reduces the variance of a model’s output
- By taking the average over all of the component models’ output
Bagging, description? what is the goal?
1) Training multiple models independently in parallel on random subsets of data
- Then take the final result to be the average of the outputs of all models
2) Goal = model with low variance and low bias
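A minimal manual sketch of bagging with rpart trees (train_data, test_data and target are illustrative names):
library(rpart)
set.seed(1)
preds <- sapply(1:100, function(i) {
  idx  <- sample(nrow(train_data), replace = TRUE)    # bootstrap sample of the training data
  tree <- rpart(target ~ ., data = train_data[idx, ])
  predict(tree, newdata = test_data)                  # predictions from one component tree
})
bagged_pred <- rowMeans(preds)                         # average over all trees = bagged prediction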
Boosting, definition? Effect?
1) Training 1 model, then training a subsequent model on the residuals obtained from predicting with the 1st one
- New model is obtained by adding a scaled-down version of the 2nd model to the first one, then repeat process
2) Effect: Each additional model will focus on predicting those observations the previous model did poorly on
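A minimal sketch of boosting on residuals for a regression target (train_data and target are illustrative names; real GBMs add more machinery):
library(rpart)
pred <- rep(mean(train_data$target), nrow(train_data))  # start from a constant prediction
eta  <- 0.1                                             # shrinkage / learning rate
boost_data <- train_data
for (m in 1:100) {
  boost_data$res <- train_data$target - pred            # residuals of the current ensemble
  tree <- rpart(res ~ . - target, data = boost_data,
                control = rpart.control(maxdepth = 2))  # small tree fit to the residuals
  pred <- pred + eta * predict(tree, boost_data)        # add a scaled-down version of the new model
}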
Random Forest, definition?
Bagged model where:
- Base model = decision tree
- At each split, only a random subset of predictors is considered -> this decorrelates the trees
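A minimal sketch with the randomForest package (train_data and target are illustrative names):
library(randomForest)
rf <- randomForest(target ~ ., data = train_data, ntree = 500)  # each tree is fit on a bootstrap sample
varImpPlot(rf)   # variable importance, including effects picked up through nonlinear splits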
Oversampling/undersampling, which data do you need to use it on? Test, training, etc?
Only perform oversampling on TRAINING Data
-Not on full data before splitting into test/training
Oversampling before splitting into test/training sets increases the chance that copies of the same record end up in both the training and test data (data leakage)
When can you use sampling techniques? (2)
Option 1:
1) split into test/training
2) do the oversampling/undersampling on TRAIN data
3) Parameter tuning and selection
4) Training the model
Option 2:
1) split into test/training
2) Parameter tuning and select best parameters
3) do oversampling/undersampling
4) training
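A sketch of Option 1, assuming a binary factor target (full_data and target are illustrative names):
library(caret)
set.seed(1)
idx   <- createDataPartition(full_data$target, p = 0.7, list = FALSE)   # 1) stratified train/test split
train <- full_data[idx, ]
test  <- full_data[-idx, ]
train <- upSample(x = train[, names(train) != "target"],                # 2) oversample the TRAIN data only
                  y = train$target, yname = "target")
# 3)-4) tune parameters and train the model on 'train'; evaluate on the untouched 'test' set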
Partial dependence plot
Allow us to get an understanding of the relationship between features and our target
-Calculate and show the average predicted value of the target variable by varying the value of 1 (or more) input features
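A minimal manual sketch of a partial dependence calculation (model, train_data and the feature x1 are illustrative names):
grid <- seq(min(train_data$x1), max(train_data$x1), length.out = 20)
pd   <- sapply(grid, function(v) {
  tmp    <- train_data
  tmp$x1 <- v                              # fix the feature of interest at one value
  mean(predict(model, newdata = tmp))      # average prediction over all observations
})
plot(grid, pd, type = "l", xlab = "x1", ylab = "Average prediction")   # the partial dependence curve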
Bagging methods, advantage (1)? Disadvantage (1)?
1) Advantage
- Reduce expected loss of models -> more robust than individual components
-Bagging methods do this by reducing variance without affecting the bias
2) Disadvantage
- Loss of model interpretability, in exchange for additional predictive power and robustness
Boosting, focus
Focused on building multiple models one after the other
- At each step, adjust training data to place more emphasis on data points that previous models predicted poorly
- Models NOT independently trained = main difference with bagging
To simplify decision tree in rpart, 3 ways?
1) Increase the CP
2) increase the minbucket parameter
3) decrease the maxdepth argument
Why are random forests an improvement over bagging?
Random forests are an improvement over bagging because the trees are decorrelated
Why can recursive binary splitting method lead to overfitting the data?
The method optimizes with respect to the training set, but may perform poorly on the test set
True or false:
-A tree with more splits tends to have lower variance
FALSE
-Adding additional splits tends to INCREASE variance due to adding MORE complexity to the model
Explain how AUC/ROC is used
Best/Worst Case value of AUC?
- The ROC curve and its AUC measure how well a model can distinguish between classes -> a measure of model performance
- the higher the AUC, the better the model is at classifying
- Best case: AUC = 1
- Worst case: AUC = 0.5 -> incapable of classifying
- AUC = 0 -> classifying the responses as the inverse (negative as positive and vice-versa)
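A sketch of computing AUC with the pROC package (actual = true classes, prob = predicted probabilities; names are illustrative):
library(pROC)
roc_obj <- roc(actual, prob)   # builds the ROC curve from observed classes and predicted probabilities
auc(roc_obj)                   # area under the curve; closer to 1 = better classifier
plot(roc_obj)                  # sensitivity vs specificity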
How are partial dependence plots used?
PDPs are used to show the marginal effect of a feature on the predicted outcome of the model
-Can show whether the relationship between the target and a feature is linear, monotonic or more complex
Bagging is trying to find what kind of model?
Model with low variance and low bias
Advantage of using stratified sampling? (CreateDataPartition)
Stratified sampling is used to create balanced splits of the data, thus allowing us to preserve the overall class distribution of the data.
-Maintain the ratio / “balance” of the factor classes
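A quick sketch checking that createDataPartition preserves the class ratio (full_data and target are illustrative names):
library(caret)
idx <- createDataPartition(full_data$target, p = 0.75, list = FALSE)
prop.table(table(full_data$target))         # overall class distribution
prop.table(table(full_data$target[idx]))    # training split keeps roughly the same proportions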
Why are variables with many levels (high dimensionality) bad?
- They can dilute predictive power
Interpret a decision tree bucket?
All observations in a given bucket have the same predicted value
For DT, after finding the optimal value of the complexity parameter, what are the 2 options? Advantages and disadvantages?
1) Build a new tree using the optimal CP
- Disadvantage: because the algorithm is greedy, a good split might occur only after a bad split; if the pre-specified CP is too large, the tree might never reach those later splits
2) Prune the original tree back, removing splits that don't satisfy the impurity reduction requirement
- Advantage: the full tree is grown first, so good splits that appear after weak ones can still be found before pruning
in Xgboost, what is the shrinkage parameter used for?
A parameter (the learning rate) that controls the rate at which the model converges towards a minimum
- The higher the learning rate, the faster the model finds a minimum, but the model will likely be overfit
- The smaller the learning rate, the slower the algorithm converges (more iterations needed), but the solution is more likely to be optimal
Rule of thumb for the eta value in xgboost?
[0.001,0.2]
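A minimal xgboost sketch with eta in that range (X is a numeric feature matrix, y the target; names are illustrative):
library(xgboost)
bst <- xgboost(data = X, label = y, nrounds = 500,
               params = list(eta = 0.05,                       # shrinkage within the rule-of-thumb range
                             max_depth = 3,
                             objective = "reg:squarederror"))  # smaller eta -> more rounds needed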
Describe what a decision tree does
DTs divide the feature space into a mutually exclusive, collectively exhaustive set of buckets, where all observations in a bucket are given the same predicted value