Module 7 - Decision Trees Flashcards
Decision Trees, advantages (5) and disadvantages? (3)
1) Advantages
- Intuitive and quick to run
- Good for recognizing natural break points in continuous variables
- Good for recognizing nonlinear interactions between variables
- Automatically handles categorical data -> no need to binarize or determine a base class
- Interactions automatically handled -> no need to identify potential interactions prior to fitting the tree
2) Disadvantages
- Unstable, prone to overfitting
- Does not predict as accurately as other models, due to the greedy nature of the tree construction algorithm
- When underlying data changes, break points for DTs can change significantly, leading to low user confidence in the model
Random Forest, advantages? (2)
- Reduces model variance by combining the results of multiple decision trees
- Detects nonlinear interactions between predictor variables when finding variable importance
Gradient boosting machine (GBM), advantage (1), disadvantages (2)?
1) Advantages
- Reduces model bias
2) Disadvantages
- More prone to overfitting than random forests
- More sensitive to hyperparameter inputs
Pruning, description?
Reduces the size of decision trees by removing sections of the tree that provide little predictive power
= reduces complexity of final model
Cost-complexity pruning, explained?
Use measures of impurity reduction to decide which branches to prune back
-This is AFTER a decision tree has been built -> remove branches that don’t achieve the threshold level of impurity reduction
Complexity Parameter in decision tree, description?
A type of control parameter that indicates the MINIMUM amount of impurity reduction required for a split to be made
Control Parameter: Minsplit, definition?
Minimum # of observations that must exist in a node in order for a split to be attempted
Control Parameter: Minbucket, definition?
Minimum # of observations in any terminal node
Control Parameter: Maxdepth definition?
Maximum depth of any node of the final tree
The Root node counted as depth = 0
Control Parameter: xval definition?
# of cross-validations (CV) performed
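A minimal rpart sketch tying these control parameters together (the data frame train_data and the variable target are illustrative placeholders):
library(rpart)
fit <- rpart(target ~ ., data = train_data,
             control = rpart.control(cp = 0.01,       # minimum impurity reduction required for a split
                                     minsplit = 20,   # minimum # of observations in a node to attempt a split
                                     minbucket = 7,   # minimum # of observations in any terminal node
                                     maxdepth = 5,    # maximum depth of any node (root = 0)
                                     xval = 10))      # # of cross-validations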
Regression tree vs Classification tree: what’s the main difference?
A regression tree uses the RSS (residual sum of squares) between target and predicted values as the impurity measure
- Instead of the Gini index / entropy / classification error used by classification trees
Regression tree output: Variable importance
Shows the ordering of variables according to their contribution to the model
Variables higher up in the list = larger improvements in the splitting criteria when they were used as splits
How to choose number of splits for a regression tree?
Choose the nsplit that MINIMIZES the cross-validation error (“xerror” in R)
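A sketch of this in rpart, assuming fit is a tree grown with a deliberately small cp:
cp_table <- fit$cptable                                      # one row per candidate nsplit
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]  # CP row with the lowest cross-validation error
pruned   <- prune(fit, cp = best_cp)                         # prune back splits that don't meet this CP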
Describe how ensemble methods overcome the bias-variance tradeoff? (3)
1) Building many models on random subsets of data
= Taking an aggregate answer, instead of relying on 1 model
2) Each component model potentially becoming responsible for different parts of the complex relationship
- Thus cancelling out most of the noise that would have been caused by fitting the model on a specific subset of data
3) Reduces the variance of a model’s output
- By taking the average over all of the component models’ output
Bagging, description? what is the goal?
1) Training multiple models independently in parallel on random subsets of data
- Then take the final result to be the average of the outputs of all models
2) Goal = model with low variance and low bias
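A minimal manual sketch of bagging with rpart trees (train_data, test_data and target are illustrative names):
library(rpart)
set.seed(1)
preds <- sapply(1:100, function(i) {
  idx  <- sample(nrow(train_data), replace = TRUE)    # bootstrap sample of the training data
  tree <- rpart(target ~ ., data = train_data[idx, ])
  predict(tree, newdata = test_data)                  # predictions from one component tree
})
bagged_pred <- rowMeans(preds)                         # average over all trees = bagged prediction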
Boosting, definition? Effect?
1) Training 1 model, then training a subsequent model on the residuals obtained from predicting with the 1st one
- New model is obtained by adding a scaled-down version of the 2nd model to the first one, then repeat process
2) Effect: Each additional model will focus on predicting those observations the previous model did poorly on
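A minimal sketch of boosting on residuals for a regression target (train_data and target are illustrative names; real GBMs add more machinery):
library(rpart)
pred <- rep(mean(train_data$target), nrow(train_data))  # start from a constant prediction
eta  <- 0.1                                             # shrinkage / learning rate
boost_data <- train_data
for (m in 1:100) {
  boost_data$res <- train_data$target - pred            # residuals of the current ensemble
  tree <- rpart(res ~ . - target, data = boost_data,
                control = rpart.control(maxdepth = 2))  # small tree fit to the residuals
  pred <- pred + eta * predict(tree, boost_data)        # add a scaled-down version of the new model
}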
Random Forest, definition?
Bagged model where:
- Base model = decision tree
- At each split, only a random subset of predictors is considered -> this decorrelates the trees
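A minimal sketch with the randomForest package (train_data and target are illustrative names):
library(randomForest)
rf <- randomForest(target ~ ., data = train_data, ntree = 500)  # each tree is fit on a bootstrap sample
varImpPlot(rf)   # variable importance, including effects picked up through nonlinear splits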
Oversampling/undersampling, which data do you need to use it on? Test, training, etc?
Only perform oversampling on TRAINING Data
-Not on full data before splitting into test/training
Oversampling before splitting into test/training sets increases the chance that copies of the same record end up in both the training and test data (data leakage)
When can you use sampling techniques? (2)
Option 1:
1) split into test/training
2) do the oversampling/undersampling on TRAIN data
3) Parameter tuning and selection
4) Training the model
Option 2:
1) split into test/training
2) Parameter tuning and select best parameters
3) do oversampling/undersampling
4) training
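A sketch of Option 1, assuming a binary factor target (full_data and target are illustrative names):
library(caret)
set.seed(1)
idx   <- createDataPartition(full_data$target, p = 0.7, list = FALSE)   # 1) stratified train/test split
train <- full_data[idx, ]
test  <- full_data[-idx, ]
train <- upSample(x = train[, names(train) != "target"],                # 2) oversample the TRAIN data only
                  y = train$target, yname = "target")
# 3)-4) tune parameters and train the model on 'train'; evaluate on the untouched 'test' set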
Partial dependence plot
Allow us to get an understanding of the relationship between features and our target
-Calculate and show the average predicted value of the target variable by varying the value of 1 (or more) input features
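A minimal manual sketch of a partial dependence calculation (model, train_data and the feature x1 are illustrative names):
grid <- seq(min(train_data$x1), max(train_data$x1), length.out = 20)
pd   <- sapply(grid, function(v) {
  tmp    <- train_data
  tmp$x1 <- v                              # fix the feature of interest at one value
  mean(predict(model, newdata = tmp))      # average prediction over all observations
})
plot(grid, pd, type = "l", xlab = "x1", ylab = "Average prediction")   # the partial dependence curve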
Bagging methods, advantage (1)? Disadvantage (1)?
1) Advantage
- Reduce expected loss of models -> more robust than individual components
-Bagging methods do this by reducing variance without affecting the bias
2) Disadvantage
- Loss of model interpretability, in exchange for additional predictive power and robustness
Boosting, focus
Focused on building multiple models one after the other
- At each step, adjust training data to place more emphasis on data points that previous models predicted poorly
- Models NOT independently trained = main difference with bagging
To simplify decision tree in rpart, 3 ways?
1) Increase the CP
2) increase the minbucket parameter
3) decrease the maxdepth argument
Why are random forests an improvement over bagging?
Random forests are an improvement over bagging because the trees are decorrelated
Why can recursive binary splitting method lead to overfitting the data?
The method optimizes with respect to the training set, but may perform poorly on the test set
True or false:
-A tree with more splits tends to have lower variance
FALSE
-Adding additional splits tends to INCREASE variance due to adding MORE complexity to the model
Explain how AUC/ROC is used
Best/Worst Case value of AUC?
- The ROC curve and its AUC measure how well a model can distinguish between classes -> a measure of model performance
- the higher the AUC, the better the model is at classifying
- Best case: AUC = 1
- Worst case: AUC = 0.5 -> incapable of classifying
- AUC = 0 -> classifying the responses as the inverse (negative as positive and vice-versa)
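A sketch of computing AUC with the pROC package (actual = true classes, prob = predicted probabilities; names are illustrative):
library(pROC)
roc_obj <- roc(actual, prob)   # builds the ROC curve from observed classes and predicted probabilities
auc(roc_obj)                   # area under the curve; closer to 1 = better classifier
plot(roc_obj)                  # sensitivity vs specificity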
How are partial dependence plots used?
PDPs are used to show the marginal effect of a feature on the predicted outcome of the model
-Can show whether the relationship between the target and a feature is linear, monotonic or more complex
Bagging is trying to find what kind of model?
Model with low variance and low bias
Advantage of using stratified sampling? (CreateDataPartition)
Stratified sampling is used to create balanced splits of the data, thus allowing us to preserve the overall class distribution of the data.
-Maintain the ratio / “balance” of the factor classes
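A quick sketch checking that createDataPartition preserves the class ratio (full_data and target are illustrative names):
library(caret)
idx <- createDataPartition(full_data$target, p = 0.75, list = FALSE)
prop.table(table(full_data$target))         # overall class distribution
prop.table(table(full_data$target[idx]))    # training split keeps roughly the same proportions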
Why are variables with many levels (high dimensionality) bad?
- They can dilute predictive power
Interpret a decision tree bucket?
All observations in a given bucket have the same predicted value
For DT, after finding the optimal value of the complexity parameter, what are the 2 options? Advantages and disadvantages?
1) Build a new tree using the optimal CP
- Disadvantage: because the algorithm is greedy, a good split might occur only after a bad split; if the pre-specified CP is too large, the tree might never reach those later splits
2) Prune the original tree back, removing splits that don't satisfy the impurity reduction requirement
- Advantage: the full tree is grown first, so good splits that appear after weak ones can still be found before pruning
in Xgboost, what is the shrinkage parameter used for?
A parameter (the learning rate) that controls the rate at which the model converges towards a minimum
- The higher the learning rate, the faster the model finds a minimum, but the model will likely be overfit
- The smaller the learning rate, the slower the algorithm converges (more iterations needed), but the solution is more likely to be optimal
Rule of thumb for the eta value in xgboost?
[0.001,0.2]
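A minimal xgboost sketch with eta in that range (X is a numeric feature matrix, y the target; names are illustrative):
library(xgboost)
bst <- xgboost(data = X, label = y, nrounds = 500,
               params = list(eta = 0.05,                       # shrinkage within the rule-of-thumb range
                             max_depth = 3,
                             objective = "reg:squarederror"))  # smaller eta -> more rounds needed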
Describe what a decision tree does
DTs divide the feature space into a mutually exclusive, collectively exhaustive set of buckets, where all observations in a bucket are given the same predicted value