Ensemble Methods - Models Flashcards
Definition
Statistical techniques that combine multiple models to create one predictive model
Difficult to interpret compared to a single model
Focus is on predictive accuracy rather than interpretability
->less interpretable model output in R
Random Forests - Basics
Bootstrapping creates many bootstrap samples; to create one, observations from the training set are sampled with replacement until the sample has the same number of observations as the training set.
-some observations are left out of each bootstrap sample. These are called ‘out-of-bag’
A random forest model fits a decision tree to each bootstrap sample. To produce a prediction, the model aggregates the predictions from each tree.
->For regression trees, ‘aggregate’ means to take their average
->For classification trees, it means to select the most frequently predicted class. If exactly half the trees predict 0 and half predict 1, the prediction is chosen at random.
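A minimal R sketch of fitting a random forest and producing aggregated predictions, assuming the randomForest package and hypothetical data frames train and test with a target column y:

```r
# Sketch only; assumes package randomForest and data frames `train`/`test`
# with a target column `y` (object names are hypothetical)
library(randomForest)

set.seed(42)
rf <- randomForest(
  y ~ .,               # a tree is fit to each bootstrap sample of `train`
  data  = train,
  ntree = 500,         # number of bootstrap samples / trees
  mtry  = 3,           # predictors considered at each split
  importance = TRUE    # track variable importance
)

# Aggregated prediction: average of tree predictions (regression)
# or most frequent predicted class (classification)
pred <- predict(rf, newdata = test)
```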
Random Forests - Notes
The individual trees are typically large with no pruning, having high variance and low bias. This is permissible since aggregating the trees reduces the high variance. Even so, this does not mean a random forest is unable to overfit.
–Variance refers to the sensitivity of the model to changes in the training dataset. Bootstrapping reduces variance because each individual tree is trained on different data.
When all predictors are considered for every split, the model is referred to as bagging. This could be undesirable, such as if a dominant predictor appears as the first split in many of the trees, thus resulting in similar/correlated predictions across trees. To prevent this, a random subset of predictors should be considered for every split instead.
Random forests are typically “tuned” by changing the number of features considered at each split, though other hyperparameters, such as the number of trees, the minimum node size, and the maximum number of terminal nodes, can be adjusted as well. Tuning typically leads to a more accurate model, which is consistent with an additional drop in RMSE after tuning.
mtry: integer, >= 1 (number of predictors considered at each split); randomForest defaults:
-Target is a factor (classification) -> floor(sqrt(ncol(x))), i.e., sqrt(p)
-Target is not a factor (regression) -> max(floor(ncol(x)/3), 1), i.e., p/3
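A hedged sketch of how the randomForest defaults for mtry could be computed, along with a simple tuning loop that compares out-of-bag error across candidate mtry values (x and y are hypothetical predictor and target objects):

```r
# Sketch only; assumes randomForest is loaded, predictors in `x`, target in `y`
p <- ncol(x)
mtry_default <- if (is.factor(y)) floor(sqrt(p)) else max(floor(p / 3), 1)

# Compare out-of-bag error across candidate mtry values
oob_err <- sapply(1:p, function(m) {
  fit <- randomForest(x, y, mtry = m, ntree = 300)
  if (is.factor(y)) fit$err.rate[fit$ntree, "OOB"] else fit$mse[fit$ntree]
})
best_mtry <- which.min(oob_err)
```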
RF - interpretation
A random forest is difficult to interpret because, unlike a decision tree where the splits and the impact of those splits can be observed, a random forest aggregates the results of hundreds or thousands of decision trees. Inspecting the component decision trees directly is generally uninformative and in some cases not possible.
–Randomly restricting which predictors are available at each split further obscures the relationship between individual predictors and the model's predictions
Variable importance is a measure of how much a predictor contributes to the overall fit of the model and can be used to rank which predictors are most important in the model. It is calculated by aggregating, across all trees in the random forest, the reductions in error produced by every split on the selected variable. Variable importance cannot be used to draw inferences about what is causing the model results, but it can identify which variables cause the largest reduction in model error on the training data. A variable importance plot is one way to show how much impact a variable has on model predictions; variables with larger reductions in node impurity have higher importance.
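With the randomForest package, importance scores and the corresponding plot can be obtained as follows (a sketch, assuming the fitted object rf from the earlier example):

```r
# Sketch only; `rf` is the random forest fit from the earlier example
importance(rf)   # table of importance measures aggregated across all trees
varImpPlot(rf)   # variable importance plot
```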
Partial dependence plots show the predicted target as a function of a predictor taken in isolation.
1. Choose a predictor you wish to analyze.
2. Note all of the possible values of the predictor in the training set.
3. Modify the training set by setting all rows of that predictor to one of its possible values.
4. Use the (already trained) model to predict the target on the modified dataset.
5. Average the step 4 predictions and note the corresponding predictor value used (in step 3).
6. Repeat steps 3-5 for all the other possible values of the predictor.
7. Plot the prediction averages against the predictor values (as recorded during step 5).
If two predictors are correlated, the PDP will calculate predicted values for unrealistic combinations while the model itself was only fit to realistic combinations. For example, if height and weight are predictor variables, making everyone seven feet tall will force predictions for individuals that tall yet weighing only 100 pounds.
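The steps above can be implemented directly; the following is a hedged sketch for a regression forest, reusing the hypothetical rf and train objects and an illustrative predictor named age. (The pdp package's partial() function automates this.)

```r
# Manual partial dependence plot for one predictor (illustrative names; regression target)
pd_values <- sort(unique(train$age))                # step 2: possible values of the predictor
pd_avg <- sapply(pd_values, function(v) {
  modified <- train
  modified$age <- v                                 # step 3: set all rows to one value
  mean(predict(rf, newdata = modified))             # steps 4-5: predict, then average
})                                                  # step 6: sapply repeats for every value
plot(pd_values, pd_avg, type = "l",
     xlab = "age", ylab = "average prediction")     # step 7: plot averages vs. values
```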
Boosting - Basics
Boosting grows decision trees sequentially and then aggregates them. The first tree is fit regularly on the training set. Before fitting the second tree, the training set is updated by removing information explained by the first tree; information that remains unexplained is referred to as ‘residuals’. Before fitting the third tree, the training set is updated again, and so on, for as many trees as desired.
A GBM iteratively builds trees fit to the residuals of prior trees. Depending on the hyperparameters, this model can produce a very complex model which is susceptible to overfitting to patterns in the training data.
Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias but is very sensitive to the training data (high variance).
The boosting prediction is the sum, over all the trees, of the shrinkage parameter times each tree's prediction.
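A minimal gbm sketch of this, assuming hypothetical data frames train and test with a numeric target y; the final prediction sums, over the trees, the shrinkage parameter times each tree's prediction:

```r
# Sketch only; assumes package gbm and data frames `train`/`test` with numeric target `y`
library(gbm)

set.seed(42)
boost <- gbm(
  y ~ .,
  data              = train,
  distribution      = "gaussian",   # "bernoulli" for a 0/1 target
  n.trees           = 1000,         # number of sequential trees
  shrinkage         = 0.01,         # learning rate applied to each tree
  interaction.depth = 2,            # small trees: high bias, low variance
  n.minobsinnode    = 10            # minimum observations in a terminal node
)

pred <- predict(boost, newdata = test, n.trees = 1000)
```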
Boosting - Notes
The individual trees are typically small (usually from specifying a low depth), having high bias and low variance. Aggregating the trees reduces the high bias.
Tuning focuses on the shrinkage parameter and the number of trees. This should be done in tandem using cross-validation.
Stopping criteria and related hyperparameters:
Both gbm and xgboost:
-minimum observations in a terminal node
-shrinkage parameter (learning rate)
-proportion of observations used for each tree
-maximum depth
gbm: number of trees
xgboost:
-gamma = minimum reduction in the splitting measure required to make a split
-proportion of predictors used for each tree
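A hedged sketch of how these hyperparameters map to xgboost arguments, assuming predictors in a numeric matrix x_mat and a target vector y_vec (names are hypothetical):

```r
# Sketch only; assumes package xgboost, a numeric predictor matrix `x_mat`,
# and a target vector `y_vec` (names are hypothetical)
library(xgboost)

dtrain <- xgb.DMatrix(data = x_mat, label = y_vec)

params <- list(
  eta              = 0.05,   # shrinkage parameter (learning rate)
  max_depth        = 3,      # maximum depth
  min_child_weight = 10,     # minimum (weighted) observations in a terminal node
  subsample        = 0.75,   # proportion of observations used for each tree
  colsample_bytree = 0.75,   # proportion of predictors used for each tree
  gamma            = 0,      # minimum reduction in the splitting measure to allow a split
  objective        = "reg:squarederror"
)

xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 500)  # nrounds = number of trees
```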
The shrinkage parameter is a value between 0 and 1 that controls the speed of learning. The smaller the shrinkage parameter, the less information is explained by each tree. In turn, a large number of trees may be required to obtain a good model. Conversely, a large shrinkage parameter means few trees would be required, but the few trees in aggregate will tend to be similar to a single decision tree. The shrinkage parameter can reduce the extent to which a single tree is able to influence the model fitting process.
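One hedged way to see this trade-off: fit the same gbm with different shrinkage values and compare the cross-validated optimal number of trees (reusing the hypothetical train data frame from the earlier sketch):

```r
# Sketch only; smaller shrinkage -> each tree explains less -> more trees needed
for (s in c(0.1, 0.01)) {
  fit <- gbm(y ~ ., data = train, distribution = "gaussian",
             n.trees = 2000, shrinkage = s, interaction.depth = 2, cv.folds = 5)
  cat("shrinkage =", s, "-> optimal trees =",
      gbm.perf(fit, method = "cv", plot.it = FALSE), "\n")
}
```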
In the example model, AUC on the testing data increases until the number of trees reaches about 500. However, as more trees are added beyond 500, AUC on the testing data starts to drop, which indicates the model is overfitting to the training data.
–Early stopping: Early stopping criteria monitor the performance metric as each subsequent tree is added and stop training once the improvement becomes marginal. This helps avoid overfitting.
–Controlling the learning rate (shrinkage parameter): The learning rate controls the impact of each subsequent tree on the overall model outcome, reducing the extent to which a single tree can influence the model fitting process.
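A hedged sketch of early stopping with xgboost's cross-validation helper, reusing the hypothetical x_mat/y_vec objects and assuming a 0/1 target so AUC can be monitored:

```r
# Sketch only; stop adding trees once test AUC shows no improvement for 20 rounds
dtrain <- xgb.DMatrix(data = x_mat, label = y_vec)

cv <- xgb.cv(
  params  = list(eta = 0.05, max_depth = 3, objective = "binary:logistic"),
  data    = dtrain,
  nrounds = 2000,
  nfold   = 5,
  metrics = "auc",
  early_stopping_rounds = 20,   # marginal improvement -> stop training
  verbose = 0
)
cv$best_iteration   # number of trees selected by early stopping
```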