Tree-Based Models (10-20%) Flashcards

1
Q

Compare decision trees, random forests, and gradient boosting machines (GBMs).

A

Decision trees:
1. intuitive and quick to run
2. good for recognizing natural break points in continuous variables
3. good for recognizing nonlinear interactions between variables
4. unstable and prone to overfitting

Random forests:
1. a natural extension of decision trees that combines many trees to solve a problem
2. "bagging" (training each tree on a bootstrapped sample of observations), combined with using a random subset of features at each split, is what turns individual decision trees into a random forest, with a focus on reducing model variance
3. a powerful tool for detecting nonlinear interactions between predictor variables
4. less prone to overfitting than gradient boosting machines

Gradient boosting machines (GBMs):
1. built by repeatedly fitting new models to the residuals of the previous model, with a focus on reducing model bias
2. an extension of decision trees that uses boosting
3. a powerful tool for detecting nonlinear interactions between predictor variables
4. more prone to overfitting than random forests and more sensitive to hyperparameter inputs

2
Q

Describe decision trees.

A

The idea is to split the feature space into exhaustive and mutually exclusive segments that minimize the variability of the target within each segment. Each split divides a parent segment on the variable (and split point) that makes the resulting child segments most different from each other.

For classification, the estimate of the target for observations in a leaf segment/node is set as the class that appears most in that segment. (E.g., are the majority of observations in that segment lapses or non-lapses?)

For regression, the target estimate is set as the mean of the target values of the observations within the leaf segment.

The structure is similar to a real (upside-down) tree. Each new segment is created by splitting an existing segment.
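
A minimal scikit-learn sketch of both cases (my own illustration, not from the source material; the toy data and settings are made up):

```python
# Hypothetical illustration with scikit-learn; the toy data is made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                          # two continuous features

# Classification: each leaf predicts the majority class of its observations
y_class = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)    # e.g., lapse vs non-lapse
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
print(clf.predict(X[:5]))        # majority class of each observation's leaf

# Regression: each leaf predicts the mean of the target in that leaf
y_reg = 3 * X[:, 0] + rng.normal(0, 1, 200)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
print(reg.predict(X[:5]))        # mean target value of each observation's leaf
```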

3
Q

Define binary tree.

A

A tree structure in which each node has at most two children

4
Q

Define balanced binary tree.

A

A binary tree in which the left and right subtrees of any node differ in depth by at most one.

5
Q

Define a node in decision trees.

A

A subset of the data. For a given node, any node below it must contain a subset of that node's data.

6
Q

Define a root in decision trees.

A

The top node in a tree.

7
Q

Define a child in decision trees.

A

A node directly connected to another node when moving away from the root node. In most tree depictions, the child nodes appear below their parent nodes.

8
Q

Define a parent in decision trees.

A

A node directly connected to another node when moving away from the root. In most tree depictions, the parent node appears above its child nodes.

9
Q

Define a leaf in decision trees.

A

A node with no children. These represent the final segments of data. The union of all data in the leaf nodes will be the full dataset.

10
Q

Define edge in decision trees.

A

The connection between one node and another. These are represented by the arrows in the tree.

11
Q

Define depth in decision trees.

A

The number of edges from the tree’s root node to the “furthest” leaf.

12
Q

Define a subtree in decision trees.

A

A branch of the tree that doesn’t start from the root.

13
Q

Define pruning in decision trees.

A

A technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little predictive power. Pruning reduces the complexity of the final model and, hence, improves predictive accuracy by reducing overfitting.

14
Q

Define unbalanced binary tree.

A

A binary tree in which the left and right subtrees of some nodes differ in depth by more than one.

15
Q

Define entropy.

A

Entropy is a measure of the impurity of each node in a decision tree (classification only, although there are regression equivalents). Knowing the impurity of a node helps the decision tree decide how to create the splits. We can calculate the impurity of the resulting nodes for a candidate split so that we can measure how good that split is. We choose the final split as the candidate split that results in child nodes with the lowest levels of impurity.

Entropy = -sum(p*log(p))
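
A small illustrative sketch (not from the source) of this calculation, assuming p is the vector of class proportions in a node and using log base 2:

```python
import numpy as np

def entropy(p):
    """Entropy of a node given its class proportions p (which sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

# A 50/50 node is maximally impure; a pure node has entropy 0.
print(entropy([0.5, 0.5]))   # 1.0 (with log base 2)
print(entropy([1.0, 0.0]))   # 0.0
```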

16
Q

Define information gain in decision trees.

A

We would like to be able to split the data into smaller segments that help us improve the overall impurity of the resulting segments. We call this improvement in the impurity the information gain resulting from a chosen split. In order to pick the best split, we need to calculate the information gain for all possible splits and then pick the best one.

Information gain = entropy(parent) - sum[(Nk/Np) * entropy(child_k)], where Nk is the number of observations in child node k and Np is the number of observations in the parent node.
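
An illustrative sketch (mine) of this calculation; the helper names and toy labels are made up:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def proportions(y):
    """Class proportions of a vector of labels."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def information_gain(y_parent, y_children):
    """entropy(parent) - sum over children of (N_k / N_p) * entropy(child_k)."""
    n_p = len(y_parent)
    weighted = sum(len(y_k) / n_p * entropy(proportions(y_k)) for y_k in y_children)
    return entropy(proportions(y_parent)) - weighted

# Toy example: a split that perfectly separates the classes has maximal gain.
y_parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y_parent, [np.array([0, 0, 0]), np.array([1, 1, 1])]))  # 1.0
```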

17
Q

Describe Gini.

A

Gini impurity is a measure of how often a randomly chosen element from the segment would be incorrectly classified if it were randomly classified according to the distribution of targets in the subset.

The goal is to make Gini as small as possible by picking splits that maximize its reduction.

Gini = 1-sum(p^2)
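
A quick illustrative sketch (not from the source), assuming p is the vector of class proportions in a node:

```python
import numpy as np

def gini(p):
    """Gini impurity of a node given its class proportions p: 1 - sum(p^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini([0.5, 0.5]))   # 0.5 -- maximally impure two-class node
print(gini([1.0, 0.0]))   # 0.0 -- pure node
```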

18
Q

Describe classification error formula.

A

Classification error = 1 - max(p)

19
Q

Methods to compare impurity in regression trees.

A

Variance (squared deviance), R-squared, or residual sum of squares error between target and predicted values.

The use of the R-squared impurity measure assumes the data is distributed according to a Gaussian (or at least a symmetric distribution).

Our goal is to maximize the difference between nodes and minimize the variability within each node.
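
A sketch (with made-up data) of measuring a candidate regression-tree split by the reduction in the residual sum of squares, one of the impurity measures listed above:

```python
import numpy as np

def sse(y):
    """Residual sum of squares around the node mean (the regression impurity)."""
    return np.sum((y - y.mean()) ** 2)

def sse_reduction(y_parent, mask):
    """Impurity reduction from splitting y_parent by a boolean mask."""
    return sse(y_parent) - (sse(y_parent[mask]) + sse(y_parent[~mask]))

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = np.where(x < 4, 2.0, 8.0) + rng.normal(0, 0.5, 100)

# The candidate split x < 4 lines up with the true break point, so it
# removes most of the variability within the resulting nodes.
print(sse_reduction(y, x < 4.0))
```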

20
Q

Define cost complexity pruning.

A

We can also use measures of impurity reduction to decide which branches to prune back after a decision tree has been built, removing those branches that don't achieve a threshold level of impurity reduction.

It begins with the full tree previously built and removes the least important splits, using the specified CP value. Note that when the example tree was built, CP was set to zero. This allowed the most complex tree (given the other control parameters) to be built. If the original CP was set higher, the initial tree would be simpler, and pruning would begin from that simpler tree.
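
The CP/xerror terminology above matches R's rpart output, which isn't reproduced here; as a rough illustration, here is a scikit-learn sketch in which ccp_alpha plays a role analogous to CP (the data and settings are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Grow the most complex tree first (analogous to building with CP = 0) ...
full_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.0).fit(X, y)

# ... then inspect the pruning sequence and refit with a larger penalty,
# which removes the least important splits.
path = full_tree.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)                                # candidate penalty values
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-3]).fit(X, y)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```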

21
Q

Describe the complexity parameter (CP).

A

Tells the algorithm the minimum impurity reduction required before making a split. A high complexity parameter means the algorithm will make fewer splits, resulting in a less complex tree, whereas a low complexity parameter means there is a very low requirement for impurity reduction, so more splits will be made, resulting in a more complex tree. We need to find the right complexity level that optimizes the bias-variance trade-off, or maximizes the generalization capability of the model. This is essentially what the internal cross validation does: it finds the level of complexity within the tree that results in the greatest generalization performance.
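
A sketch (mine, not the source's R workflow) of using cross-validation to pick the complexity penalty with the best generalization:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Candidate complexity penalties from the fully grown tree's pruning path.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
alphas = tree.cost_complexity_pruning_path(X, y).ccp_alphas

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,                      # internal cross-validation over complexity levels
)
search.fit(X, y)
print(search.best_params_)     # complexity level with the best generalization
```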

22
Q

Define “one standard error” approach.

A

Take the CP value with the smallest xerror and add its standard error (xstd) to that xerror to form a threshold. Then select the simplest tree (largest CP / fewest splits) whose xerror is at or below that threshold. If no simpler tree falls below the threshold, the CP value with the smallest xerror is selected.
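
A small worked sketch of the rule on a made-up complexity table (the column names mirror the CP/xerror/xstd output referenced above; the numbers are illustrative only):

```python
import numpy as np

cp     = np.array([0.100, 0.050, 0.020, 0.010, 0.001])
xerror = np.array([1.000, 0.820, 0.760, 0.750, 0.770])
xstd   = np.array([0.040, 0.035, 0.033, 0.032, 0.033])

best = np.argmin(xerror)                    # row with the smallest xerror
threshold = xerror[best] + xstd[best]       # smallest xerror + one standard error

# Choose the simplest tree (largest CP) whose xerror is within the threshold.
candidates = np.where(xerror <= threshold)[0]
chosen = candidates[np.argmax(cp[candidates])]
print(cp[chosen])   # 0.02 here: a simpler tree than the minimum-xerror tree
```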

23
Q

Accuracy formula.

A

Accuracy = (TP + TN)/N

24
Q

Error rate formula.

A

Error rate = (FP + FN)/N = 1-accuracy

25
Q

Precision formula.

A

Precision = TP/(TP + FP)

Precision is the proportion of positive predictions that turn out to be correct.

26
Q

Sensitivity formula.

A

Sensitivity = TP/(TP + FN)

Sensitivity is the proportion of actual positives that were predicted to be positive. Also known as the true positive rate (TPR).
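
A small sketch (with made-up confusion-matrix counts) computing the four metrics above:

```python
# Made-up confusion-matrix counts to illustrate the formulas above.
TP, TN, FP, FN = 40, 30, 10, 20
N = TP + TN + FP + FN

accuracy    = (TP + TN) / N          # 0.70
error_rate  = (FP + FN) / N          # 0.30 = 1 - accuracy
precision   = TP / (TP + FP)         # 0.80: correct among positive predictions
sensitivity = TP / (TP + FN)         # ~0.67: actual positives predicted positive (TPR)
print(accuracy, error_rate, precision, sensitivity)
```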

27
Q

Define Receiver Operator Characteristic (ROC) Curve.

A

The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) over all cutoff values rather than just using a single cutoff. Each cutoff produces an (FPR, TPR) pair, which is a point on the ROC curve.

FPR = FP/(FP + TN)
TPR = TP/(TP + FN)

note: a good result is a curve that is well above the line from (0,0) to (1,1).
note: the area under the curve (AUC) gives an estimate of the model's fit.
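
An illustrative scikit-learn sketch (mine; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # predicted probabilities

fpr, tpr, cutoffs = roc_curve(y_te, scores)       # (FPR, TPR) over all cutoffs
print(roc_auc_score(y_te, scores))                # area under the ROC curve
```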

28
Q

Describe the area under the curve (AUC).

A

An estimate of model fit is the area under the ROC curve

29
Q

What do GLMs and decision trees suffer from?

A
  1. Limited ability to extract complex relationships from the data (in the case of GLMs)
  2. Sensitivity to noise and a tendency to overfit (in the case of decision trees)

Note: these drawbacks relate to the bias-variance trade-off.
Note: ensemble methods are a good way to overcome these limitations.

30
Q

Describe ensemble methods.

A

Instead of relying on one single model (e.g., a single GLM or decision tree), we build many models on random subsets of the data and take the answer in aggregate. Not only does this increase our ability to reflect complex relationships in the data (with each component model potentially becoming responsible for different parts of the complex relationship), but it also reduces the variance of our model's output by taking the average (or a similar measure) over all of the component models' outputs. Thus, most of the noise introduced by fitting to a specific subset of data will be canceled out because the results of all models are considered.

We hinted at this when the bias-variance decomposition was introduced: when we trained 10 models on different subsets of the data to see how each model varied, taking the average of all of those models gave a much more stable result that ended up much closer to the true signal than any of the individual models. When fitting a single model such as a decision tree or a GLM, we had to be careful with how complex we let the model become; more complexity allowed much lower bias, but we ended up with models with many more variables. There are several types of ensemble methods, some of which reduce bias and others that reduce variance.

Advantage: better predictiveness and more robust (compared to GLMs and decision trees)
Disadvantage: more computationally expensive (fitting many models to the data)

31
Q

Define bagging.

A

Bagging (bootstrap aggregation) is an ensemble method that involves training multiple base models in parallel on bootstrapped samples of the training dataset. The outputs of these base models are then aggregated by averaging all of the base model predictions. We do this because we ultimately want a model with low variance and low bias.
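
A minimal scikit-learn sketch of bagging (illustrative data and settings; not from the source):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

bagged = BaggingRegressor(
    n_estimators=100,   # number of bootstrapped base models (decision trees by default)
    bootstrap=True,     # sample observations with replacement
    random_state=0,
).fit(X, y)
print(bagged.predict(X[:3]))   # average of the 100 base models' predictions
```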

32
Q

Define boosting.

A

Boosting involves training a single model, then training a subsequent model on the residuals obtained from predicting with the first one. The new model is obtained by adding a scaled-down version of the second model to the first one, and the process is repeated. The effect produced is that each additional model will focus on predicting those observations the previous model did poorly on until gradually the entire model (which is just the sum of all component models) predicts the entire dataset well. This algorithm is commonly called a gradient boosting machine.
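
A minimal hand-rolled sketch of this idea (mine, with made-up data); it illustrates the residual-fitting loop and is not a production GBM:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)

shrinkage = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                       # what the ensemble still misses
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += shrinkage * tree.predict(X)        # add a scaled-down correction
    trees.append(tree)

print(np.mean((y - prediction) ** 2))    # training error shrinks as trees are added
```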

33
Q

Define random forests.

A

Random forests are bagged models where the base model is a decision tree.

In addition, a modification is made regarding how the trees are constructed: at each split, only a subset (selected at random) of the predictor variables is considered. This allows early splits to use different predictors, which may lead to alternative models that fit well. On the downside, random forests are much less interpretable than an individual decision tree because they can consist of hundreds or even thousands of trees.

34
Q

List the steps in how to fit a random forest.

A
  1. Training
    - Take a random sample of observations (with replacement) from the training data
    - At each split, search among a random sample of features (selected without replacement) to determine that split
    - Train a decision tree on the above
    - Repeat
  2. Predicting
    - Predict the target using each tree previously trained
    - Average the predictions
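
A scikit-learn sketch of this procedure (illustrative data and settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
).fit(X, y)

# Predictions aggregate the individual trees (majority vote / averaged probabilities).
print(rf.predict(X[:5]))
print(rf.predict_proba(X[:5])[:, 1])
```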
35
Q

How to prevent random forests from overfitting?

A

Theoretically, random forests shouldn't overfit, but if the individual trees are overly complex, the random forest could still overfit. Parameters that can be controlled to prevent this (see the sketch after this list) are:
1) max depth of the tree
2) min improvement required to split
3) min observations required before splitting
4) min observations required after splitting (at each leaf)
5) the splitting method, e.g., Gini or squared deviance
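
A sketch (mine) of how these controls map onto scikit-learn's random forest arguments; the specific values are arbitrary:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    max_depth=6,                  # 1) maximum depth of each tree
    min_impurity_decrease=0.001,  # 2) minimum improvement required to split
    min_samples_split=20,         # 3) minimum observations required before splitting
    min_samples_leaf=10,          # 4) minimum observations required after splitting (per leaf)
    criterion="gini",             # 5) the splitting method (e.g., Gini or entropy)
    random_state=0,
)
```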

36
Q

What is the issue with unbalanced data?

A

When data is very unbalanced, there is a high proportion of one target value relative to the other. A model trained on such data can struggle to pick up the signal for the minority class and may end up mostly predicting the majority class.

37
Q

What are the approaches to combat unbalanced data?

A

Note: under-/oversampling is only performed on the training data.

1) Undersampling - also known as downsampling. Instead of using the full dataset with all of the majority observations, we undersample the majority observations and keep all of the minority observations. This means we end up with balanced classes when training our model, so it can better pick up the signal leading to the minority class. However, we are using less data and, consequently, have less information about the majority class.

2) Oversampling - also known as upsampling. Keep all of the data, and oversample (either generate new observations or duplicate/sample with replacement) the minority class observations. Again, this leaves us with roughly balanced target classes, which will improve model performance.
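
A small sketch (with made-up labels) of both approaches applied to the training indices:

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.choice([0, 1], size=1000, p=[0.95, 0.05])   # very unbalanced target
majority = np.where(y_train == 0)[0]
minority = np.where(y_train == 1)[0]

# Undersampling: keep all minority rows, sample the majority down to match.
down_idx = np.concatenate([
    minority,
    rng.choice(majority, size=len(minority), replace=False),
])

# Oversampling: keep all rows, duplicate minority rows (sampled with replacement).
up_idx = np.concatenate([
    majority,
    rng.choice(minority, size=len(majority), replace=True),
])
print(len(down_idx), len(up_idx))   # both index sets are roughly class-balanced
```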

38
Q

What are the 2 methods used in the random forest to help with interpretation?

A

1) Feature importance: a method that analyzes the structure of the model and ranks the contribution of each feature
2) Partial dependence plots: a method that allows us to get an understanding of the relationship between features and our target. Partial (or “average”) dependence plots calculate and show the “average” predicted value of the target variable by varying the value of one (or more) input features. This isn’t a perfect visualization because it only captures the average dependence, but we are limited by the human brain’s capacity to comprehend highly complex relationships between many variables beyond three dimensions. Partial dependence plots at least give us a way to check that the model is reacting to certain features in a sensible way, and give some insight into its behavior.
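
An illustrative scikit-learn sketch of both interpretation aids (synthetic data; not from the source):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# 1) Feature importance: ranks each feature's contribution within the fitted trees.
print(rf.feature_importances_)

# 2) Partial dependence: average predicted target as one feature is varied.
pd_result = partial_dependence(rf, X, features=[0])
print(pd_result["average"])   # average prediction over a grid of feature 0's values
```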

39
Q

Name the advantages and disadvantages of bagging.

A

Advantages:
1) bagging methods reduce the expected loss of models by reducing variance without affecting the bias. This means that bagged models are more robust than their individual components.
2) able to handle both categorical and numerical values, and missing values.

Disadvantages:
1) as the predictive power and robustness of the model increase, the interpretability of the model decreases
2) less flexible than boosting in their ability to reduce bias, since they are tied to simpler base models

40
Q

What are the differences between bagging and boosting?

A

Bagging: addresses model variance
- parallel training (the base models are trained independently)

Boosting: focuses on bias, i.e., capturing a complex signal in the data
- better predictive accuracy
- more susceptible to overfitting; regularization (shrinkage) or early stopping is used to prevent this
- requires performing the model-fitting procedure many times, which comes at a computational cost
- sequential training (not independent): multiple models are built one after the other, and at each step the training data is adjusted to place more emphasis on the data points that previous models predicted poorly

41
Q

Describe gradient boosting machines.

A

They approximate the solution by fitting each new model such that it moves in the direction of the negative gradient of the loss function. In the squared loss case for regression, the derivative is given by the residuals, so this is not new. What is new is that the new model that is added at each step is multiplied by a constant called the learning or shrinkage parameter. Essentially, it controls the rate at which you move toward the minimum value. The trade-off is that smaller steps take longer to converge, but can reduce overfitting.

Variable Importance - We can see an overall importance for each feature, measured as that feature's total weighted contribution to the model improvement across all trees in the model.
Partial Dependence Plots - As with the other ensemble methods, we rely on partial dependence plots to understand the model's behavior. Again, these plots give us a qualitative picture of what the model is doing.
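
An illustrative scikit-learn sketch of a GBM with a shrinkage parameter (synthetic data; not from the source):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=500,      # number of sequential trees
    learning_rate=0.05,    # shrinkage: smaller steps toward the loss minimum
    max_depth=2,           # keep each component tree simple
    random_state=0,        # default squared loss: negative gradient = residuals
).fit(X, y)

print(gbm.feature_importances_)   # overall weighted contribution of each feature
```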