Decision Tree Flashcards

1
Q

What is a Decision Tree?

A

A model built as nested if-else conditions: each internal node tests a feature, and each leaf gives a prediction.

2
Q

How do decision trees work?

A

Decision trees split the feature space with hyperplanes that run parallel to one of the axes, cutting the coordinate system into hyper-cuboids (rectangular regions) that are used to classify the data points.
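
As a rough illustration (not part of the original card), a scikit-learn sketch that fits a shallow tree and prints its axis-parallel threshold rules; the iris dataset and parameter values are arbitrary choices:

    # Minimal sketch: fit a decision tree and inspect its axis-parallel splits.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)  # shallow tree for readability
    clf.fit(X, y)
    # Each printed rule is a threshold on a single feature, i.e. an axis-parallel cut.
    print(export_text(clf, feature_names=load_iris().feature_names))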

3
Q

Advantages of Decision Tree

A
  1. Minimal data preparation is required (no normalization or standardization needed).
  2. Prediction cost is logarithmic in the number of training points.
  3. Highly interpretable (the tree can be visualized as a set of rules).
4
Q

Disadvantage of Decision Tree?

A
  1. Overfitting.
  2. Prone to errors on imbalanced datasets (the tree becomes biased toward the majority class).
  3. Training is computationally expensive when a column is numerical, since every candidate threshold must be evaluated.
5
Q

CART

A

Classification and Regression Trees

6
Q

What is entropy?

A

Entropy is a measure of randomness; it measures the impurity (or, inversely, the purity) of the data.

E(S) = -Σ p_i * log(p_i), where the base of the log is 2 (or e).
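
A small sketch of this formula in Python (log base 2 assumed; the label list is made up):

    import math
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum(p_i * log2(p_i)); 0 for a pure node, 1 for a 50/50 two-class split."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["yes", "yes", "no", "no"]))  # 1.0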

7
Q

How to calculate entropy for numerical data?

A

We can plot the distribution of the numerical data points.

The flatter (less peaked) the distribution, the higher the entropy; a sharply peaked distribution has low entropy.

8
Q

Information Gain?

A

Measures the quality of a split.

Information gain is based on the decrease in entropy after the dataset is split on an attribute.

The goal is to construct a decision tree that gives the highest information gain.

IG = E(Parent) - weighted average of E(Children), where each child is weighted by its share of the parent's samples.
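
A self-contained sketch of this formula (the toy labels are made up; log base 2 assumed):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, children):
        """IG = E(parent) - sum over children of (n_child / n_parent) * E(child)."""
        n = len(parent)
        return entropy(parent) - sum((len(c) / n) * entropy(c) for c in children)

    # Splitting a 50/50 parent into two pure halves recovers the full 1 bit of entropy.
    print(information_gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]]))  # 1.0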

9
Q

Why is Decision Tree a greedy approach?

A

A decision tree applies a recursive greedy search in a top-down fashion: at every level it chooses the split with the best information gain at that step, without looking ahead.

10
Q

What is the entropy of a leaf node?

A

Zero, for a pure leaf (one that contains samples of a single class).

11
Q

What is Gini Impurity?

A

The probability of misclassifying a randomly chosen element in a set. It is used to decide the optimal split for a decision tree.

GI = 1 - Σ (p_i)^2
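
A small sketch of the formula in Python (class labels in a list are an assumed input format):

    from collections import Counter

    def gini_impurity(labels):
        """GI = 1 - sum(p_i^2)."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini_impurity(["yes", "yes", "no", "no"]))  # 0.5 (maximum for 2 classes)
    print(gini_impurity(["yes", "yes", "yes"]))       # 0.0 (pure)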

12
Q

Why is Gini impurity preferred over entropy?

A

Gini impurity is computationally cheaper than entropy because it avoids the logarithm.

However, for certain kinds of datasets, entropy yields better splits than Gini impurity.

13
Q

How is a split formed for a numerical column in a decision tree?

A
  1. Sort the values of the numerical column.
  2. For every candidate split point, divide the data into two parts and calculate the entropy of each part.
  3. Choose the split with the maximum information gain as the threshold for that node (see the sketch below).
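
An illustrative sketch of this search on made-up data, using midpoints between consecutive sorted values as candidate thresholds (a common implementation choice, assumed here):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_numeric_split(values, labels):
        """Return (threshold, information_gain) of the best split on a numeric column."""
        pairs = sorted(zip(values, labels))            # 1. sort by the numeric value
        parent_e, n = entropy(labels), len(labels)
        best = (None, -1.0)
        for i in range(1, n):                          # 2. try a split between each pair
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for v, lab in pairs[:i]]
            right = [lab for v, lab in pairs[i:]]
            gain = parent_e - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if gain > best[1]:                         # 3. keep the split with max gain
                best = (threshold, gain)
        return best

    print(best_numeric_split([2.0, 3.5, 7.0, 8.0], ["no", "no", "yes", "yes"]))  # (5.25, 1.0)
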
14
Q

What is the max_depth criterion in the decision tree?

A
  • If max_depth is “None”, the nodes are expanded until all leaves are pure.
  • If max_depth is “1”, the tree is expanded only one level (a single split).
15
Q

What is the splitter criterion in the decision tree? (Hyperparameter)

A
  • If the splitter is “best”, the node is split on the feature/threshold with the maximum information gain.
  • If the splitter is “random”, the split is chosen at random.
16
Q

What is the Min Samples Split criterion in the decision tree?

A

The minimum number of samples a node must contain in order to be split further.

17
Q

What is the Min Samples Leaf criterion in the decision tree?

A

The minimum number of samples that must remain in each leaf after a split.

18
Q

What is the Max Features criterion in the decision tree?

A

The maximum number of features considered at every split.

To prevent overfitting, we can limit the number of features; the subset of features is chosen randomly at each split.

19
Q

What is the Max Leaf Nodes criterion in the decision tree?

A

The maximum number of leaf nodes the tree is allowed to have.
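
A hedged sketch tying the hyperparameter cards above to scikit-learn's DecisionTreeClassifier parameters of the same names (values below are arbitrary examples, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    clf = DecisionTreeClassifier(
        max_depth=4,            # None would grow until leaves are pure
        splitter="best",        # "random" would pick random splits
        min_samples_split=10,   # a node needs >= 10 samples to be split further
        min_samples_leaf=5,     # every leaf must keep >= 5 samples
        max_features="sqrt",    # consider a random subset of features per split
        max_leaf_nodes=20,      # cap the total number of leaves
        random_state=0,
    )
    clf.fit(X, y)
    print(clf.get_depth(), clf.get_n_leaves())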

20
Q

When to use decision tree for regression?

A

When the relationship between the features and the target is non-linear (i.e., a linear model fits poorly), a decision tree can be used for regression.

21
Q

How to use decision tree regressor?

A
  1. Data points are sorted by the independent variable in ascending order.
  2. A split is evaluated at each data point; the best split is the one with the minimum error.

With multiple columns, every column is sorted separately and the best split is the one with the minimum error across all columns.

Error can be calculated using MSE or MAE.
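
A minimal sketch with scikit-learn's DecisionTreeRegressor on made-up non-linear data; in current scikit-learn the MSE criterion is named "squared_error" and MAE is "absolute_error":

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

    reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)  # or "absolute_error"
    reg.fit(X, y)
    print(reg.predict([[1.5], [4.5]]))  # piecewise-constant predictions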

22
Q

What are the types of ensemble learning?

A
  1. Voting
  2. Stacking
  3. Bagging
  4. Boosting
23
Q

What is voting ensemble?

A

Several models are trained on the same data and each casts a vote; for classification, the ensemble predicts the class with the maximum vote count.
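
A sketch with scikit-learn's VotingClassifier (base models chosen arbitrarily); voting="hard" counts class votes, while voting="soft" averages predicted probabilities (see the hard/soft voting cards below):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    vote = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("knn", KNeighborsClassifier()),
            ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ],
        voting="hard",   # "soft" would average class probabilities instead
    )
    vote.fit(X, y)
    print(vote.predict(X[:5]))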

24
Q

What is stacking ensemble?

A

Stacking has multiple layers.

In the first layer, several base models each generate an output. These outputs are passed to the second-layer (meta) model, which effectively learns a weight for each base model; the final result is a weighted combination of the first-layer models' outputs.

25
Q

What is Bagging?

A

Bootstrap Aggregation - a subset of the data points is selected (with replacement) for every model. The final result is based on the maximum vote count (classification) or the average (regression).

When the base model used in bagging is a decision tree, the ensemble is commonly called a Random Forest.
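
A sketch of bagging with decision-tree base estimators in scikit-learn, alongside a RandomForestClassifier for comparison (dataset and settings are arbitrary; recent scikit-learn names the base-model argument `estimator`, older versions `base_estimator`):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    bag = BaggingClassifier(
        estimator=DecisionTreeClassifier(),  # bootstrap samples of rows, full trees
        n_estimators=100,
        random_state=0,
    ).fit(X, y)

    # Random forest = bagged trees + random feature subsets at each split.
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(bag.score(X, y), rf.score(X, y))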

26
Q

Recursive binary splitting

A

Top-down greedy approach.

The best split is made at each particular step, rather than looking ahead and picking a split that may lead to a better tree in some future step.

27
Q

Tree Pruning

A

Removing nodes from the fully grown decision tree to reduce its size and prevent it from overfitting.

28
Q

What is the range of entropy?

A

For a 2-class problem: [0, 1] (with log base 2).
For a k-class problem (k > 2): [0, log2(k)], so it can exceed 1.

29
Q

What is the range of gini impurity?

A

For a 2-class problem: [0, 0.5]; in general, [0, 1 - 1/k] for k classes.

30
Q

What is Boosting?

A

A sequential series of models, each learning from the mistakes of the previous one.

31
Q

To reduce overfitting, why not grow the tree initially to a certain height, rather than growing it fully and then pruning it?

A

Because a seemingly weak split early on may be followed by a very good split later. Growing the tree fully and then pruning avoids discarding such splits prematurely.

32
Q

Advantage of Bagging

A

Decision trees have high variance; bagging reduces variance by averaging their outputs.

If each of n independent observations has variance σ², the variance of their mean is σ²/n.

33
Q

Out-of-Bag error estimation

A

Each tree is trained on roughly two-thirds of the observations, so the remaining one-third (the out-of-bag observations) can be used to estimate the model's error, since those observations were not used to train that particular tree.

The OOB approach is convenient because cross-validation can be computationally expensive for large datasets.
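
A minimal sketch of the OOB estimate in scikit-learn (setting oob_score=True; the fitted attribute oob_score_ then holds the OOB accuracy):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    # Accuracy estimated only from observations each tree did NOT see during training.
    print(rf.oob_score_)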

34
Q

Disadvantage of Bagging

A

Bagging improves prediction accuracy at the expense of interpretability.

If there is a very strong predictor in the dataset, that predictor will be the root node of almost all the trees. The trees will therefore be highly correlated, and averaging them will not reduce the variance as much.

35
Q

Variable importance by bagging

A

Variable importance can be measured using RSS (for regression) or Gini impurity (for classification).

For each predictor, the total decrease in RSS/Gini impurity due to splits on that predictor is averaged over all trees; this ranks the predictors by importance.

36
Q

Why does a voting ensemble work? What are the assumptions?

A

Assumptions for a voting ensemble to work:
1. The models should not be similar to (highly correlated with) each other.
2. Each model's accuracy should be more than 50%; otherwise the ensemble's performance will deteriorate.

37
Q

Hard voting ensemble

A

Each model outputs a class label; the ensemble output is the class with the highest vote count.

38
Q

Soft voting ensemble

A

Each model outputs class probabilities. For each class, these probabilities are averaged over all models, and the class with the highest average probability is the ensemble output.

39
Q

Why does bagging not work on a linear model?

A

Because the result is a linear model!

Bagging is an additive ensemble technique. When we add many linear models, the result is another linear model!

Fitting a linear model is convex and we can find the “best possible” solution easily. Since bagging produces a linear model it can’t beat the “best possible” solution.

40
Q

What is Pasting?

A

Row sampling without replacement

41
Q

What is Random subspaces?

A

Column sampling with (or without) replacement

42
Q

What is Random patches?

A

Both Row & Column sampling with (or without) replacement
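
The three sampling schemes above map onto BaggingClassifier's sampling flags; a sketch assuming recent scikit-learn (base-model argument named `estimator`):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    pasting = BaggingClassifier(          # rows sampled WITHOUT replacement
        estimator=DecisionTreeClassifier(), max_samples=0.7, bootstrap=False, random_state=0)

    subspaces = BaggingClassifier(        # columns sampled, all rows kept
        estimator=DecisionTreeClassifier(), max_features=0.5, bootstrap_features=False,
        max_samples=1.0, bootstrap=False, random_state=0)

    patches = BaggingClassifier(          # both rows and columns sampled
        estimator=DecisionTreeClassifier(), max_samples=0.7, max_features=0.5, random_state=0)

    for name, model in [("pasting", pasting), ("subspaces", subspaces), ("patches", patches)]:
        print(name, model.fit(X, y).score(X, y))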

43
Q

Bagging vs Random Forest

A

Bagging - Row Sampling

Random Forest - Row + Column Sampling (column sampling at node level)

Moreover, the feature selection process in RF prevents the trees from being highly correlated, which is a problem in bagging.

44
Q

What is the suggested number of predictors to be considered for random forest?

A

sqrt(p), where p is the total number of predictors.

45
Q

Can random forest be described as decorrelating the trees?

A

Yes.

In bagging, if there is a very strong predictor, it will appear near the top of almost every tree, so the trees will be highly correlated. Random forest reduces this correlation by considering only a random subset of predictors at each split.

46
Q

Weak learners

A

A model that performs only slightly better than random guessing (just over 50% accuracy for binary classification).

47
Q

Decision Stump

A

Decision tree with max_depth = 1, i.e., only one split.

48
Q

What is AdaBoost?
Also give the formulas for α, the error, and the new weights of the data points.

A

Stagewise additive ensemble - sequential addition of weak learners.

A weight (α) is given to each model based on its prediction error.

error = sum of the weights of the misclassified data points

α = (1/2) · ln((1 - error) / error)

new weight = old weight × e^(-α) (for correctly classified points)
new weight = old weight × e^(+α) (for misclassified points)

The weights are then normalized to sum to 1, so misclassified points carry more weight in the next round.
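
A tiny numeric sketch of these update rules (toy weights; the normalization step at the end is standard AdaBoost practice, not stated on the card):

    import math

    weights = [0.2, 0.2, 0.2, 0.2, 0.2]        # 5 data points, initially equal
    misclassified = [False, False, True, False, False]

    error = sum(w for w, m in zip(weights, misclassified) if m)   # 0.2
    alpha = 0.5 * math.log((1 - error) / error)                   # ~0.693

    new_w = [w * math.exp(alpha if m else -alpha)
             for w, m in zip(weights, misclassified)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]          # normalize so the weights sum to 1
    print(round(alpha, 3), [round(w, 3) for w in new_w])
    # The misclassified point's weight grows (0.2 -> 0.5); the others shrink.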

49
Q

How are data points up-sampled (boosted) in AdaBoost?

A

We first calculate the weight of each data point based on whether it was classified correctly or incorrectly. We then build cumulative ranges from these weights.

We then sample data points at random using these ranges. Since the ranges for misclassified points are wider (their weights are larger), they are sampled more often, so the next model focuses on them.

50
Q

How is a new point classified based on the different weak learners in AdaBoost?

A

P = α1·h1(x) + α2·h2(x) + ... + αn·hn(x); this is stagewise additive modelling,

where hn denotes the nth weak learner.

If P > 0, the point is classified as +1;
otherwise -1 (i.e., the prediction is sign(P)).

51
Q

Bagging vs Boosting

A

Both are ensemble techniques.

  1. Bagging trains models in parallel; boosting trains them sequentially.
  2. Equal weight is given to each model in bagging, whereas in boosting each model can have a different weight.
  3. Bagging is used with low-bias, high-variance models; boosting is used with high-bias, low-variance models.
52
Q

What is special about the first model of gradient boost?

A

The first model is a single leaf:

Regression - it predicts the mean of all outcomes.
Classification - it predicts based on the majority class (in practice the initial leaf predicts the log-odds of the class proportions).

53
Q

What is special about the second model of gradient boost?

A

The second model is a decision tree, but it is trained to predict the residuals of the first model instead of the response variable.
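
A compact sketch of this idea for regression, under the assumption of scikit-learn trees and made-up data: the first model is the mean, each later tree fits the residuals, and contributions are shrunk by a learning rate (see also the next card):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

    learning_rate, n_trees = 0.1, 100
    pred = np.full_like(y, y.mean())            # model 1: a single leaf = the mean
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                     # what the previous models got wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)                       # keep the trees for later prediction

    print(np.mean((y - pred) ** 2))              # training MSE after boosting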

54
Q

Why do we apply a learning rate to each model's prediction in gradient boosting?

A

To control overfitting: each tree's contribution is shrunk, so the ensemble takes many small steps toward the target instead of a few large ones.

55
Q

Gradient Boost vs Adaboost

A
  1. In AdaBoost the trees are stumps (max. 2 leaf nodes); in gradient boosting the trees are larger, typically 8-32 leaf nodes.
  2. In AdaBoost, different models are assigned different weights (α), whereas in gradient boosting all models are added with the same learning rate.
56
Q

Stacking vs Bagging/Boosting

A
  1. In bagging/boosting the base models are of the same type, whereas in stacking the base models can be different.
  2. In bagging the output is the average (regression) or the majority vote (classification); in boosting it is the weighted sum of the models' outputs; in stacking the outputs of the first-layer models are passed to a meta-model, which produces the final output.
57
Q

Problem with stacking

A

Since the base models' predictions on data they have already seen are passed to the meta-model, the meta-model may overfit; blending (hold-out) or k-fold predictions are used to avoid this.

58
Q

Blending - Hold out method

A

The data is divided into a training set and a test set. The training set is further divided into a (smaller) training part and a validation (hold-out) part.

The base models are trained on the smaller training part and make predictions on the validation part; those predictions form a new dataset on which the meta-model is trained. The meta-model is then evaluated by predicting on the initial test set.

59
Q

Stacking - K fold approach

A

The data is divided into a training set and a test set. The training set is further divided into k folds.

Each base model is trained on k-1 folds and predicts on the remaining fold, iterating over all folds; these out-of-fold predictions form a new dataset on which the meta-model is trained. The meta-model then predicts on the test set.
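
A sketch using scikit-learn's StackingClassifier, whose cv parameter implements this k-fold scheme for generating the meta-model's training data (base models and values chosen arbitrarily for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    stack = StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(),  # the meta-model
        cv=5,                                  # out-of-fold predictions feed the meta-model
    )
    stack.fit(X_train, y_train)
    print(stack.score(X_test, y_test))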

60
Q

Advantages of XGBoost

A
  1. Flexibility - any (even custom) loss function can be used in gradient boosting.
  2. Scalable (multiple platforms & languages).
  3. Handles all kinds of ML problems (classification, regression, ranking).
  4. Parallel processing (within the construction of each individual tree; the boosting itself is still sequential).
  5. Can be integrated with other libraries (see the sketch below).
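
A hedged sketch assuming the xgboost Python package and its scikit-learn-style wrapper (parameter values are illustrative, not tuned):

    from xgboost import XGBClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=4,
        n_jobs=-1,          # parallelism within each tree's construction
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
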
61
Q

How does changing the number of trees in a Random Forest affect overfitting?

A

As we add more trees to a random forest, the generalization error decreases and levels off, so more trees do not increase the risk of overfitting. However, the overall complexity (and computational cost) of the model increases.

62
Q

Give examples of cases where random forest might perform worse than other algorithms

A

It might underperform in cases with high-dimensional sparse data, small noisy datasets, extreme class imbalance, time series data, or when interpretability and computational efficiency are critical.