Decision Tree Flashcards

1
Q

What is a Decision Tree?

A

A model built as nested if-else conditions: each internal node tests a feature, and each leaf gives a prediction.

2
Q

How do decision trees work?

A

Decision trees split the feature space with hyperplanes that run parallel to one of the axes, cutting the coordinate system into hyper-cuboids (rectangular regions) that are used to classify the data points.
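
As a rough illustration (not part of the original card), a scikit-learn sketch that fits a shallow tree and prints its axis-parallel threshold rules; the iris dataset and parameter values are arbitrary choices:

    # Minimal sketch: fit a decision tree and inspect its axis-parallel splits.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)  # shallow tree for readability
    clf.fit(X, y)
    # Each printed rule is a threshold on a single feature, i.e. an axis-parallel cut.
    print(export_text(clf, feature_names=load_iris().feature_names))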

3
Q

Advantages of Decision Tree

A
  1. Minimal data preparation is required (no normalization or standardization needed).
  2. Prediction cost is logarithmic in the number of training points.
  3. Highly interpretable (the tree can be visualized as a set of rules).
4
Q

Disadvantage of Decision Tree?

A
  1. Overfitting.
  2. Prone to errors on imbalanced datasets (the tree becomes biased toward the majority class).
  3. Training is computationally expensive when a column is numerical, since every candidate threshold must be evaluated.
5
Q

CART

A

Classification and Regression Trees

6
Q

What is entropy?

A

Entropy is a measure of randomness; it measures the impurity (or, inversely, the purity) of the data.

E(S) = -Σ p_i * log(p_i), where the base of the log is 2 (or e).
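
A small sketch of this formula in Python (log base 2 assumed; the label list is made up):

    import math
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum(p_i * log2(p_i)); 0 for a pure node, 1 for a 50/50 two-class split."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["yes", "yes", "no", "no"]))  # 1.0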

7
Q

How to calculate entropy for numerical data?

A

We can plot the distribution of the numerical data points.

The flatter (less peaked) the distribution, the higher the entropy; a sharply peaked distribution has low entropy.

8
Q

Information Gain?

A

Measures the quality of a split.

Information gain is based on the decrease in entropy after the dataset is split on an attribute.

The goal is to construct a decision tree that gives the highest information gain.

IG = E(Parent) - weighted average of E(Children), where each child is weighted by its share of the parent's samples.
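
A self-contained sketch of this formula (the toy labels are made up; log base 2 assumed):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, children):
        """IG = E(parent) - sum over children of (n_child / n_parent) * E(child)."""
        n = len(parent)
        return entropy(parent) - sum((len(c) / n) * entropy(c) for c in children)

    # Splitting a 50/50 parent into two pure halves recovers the full 1 bit of entropy.
    print(information_gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]]))  # 1.0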

9
Q

Why is Decision Tree a greedy approach?

A

A decision tree applies a recursive greedy search in a top-down fashion: at every level it chooses the split with the best information gain at that step, without looking ahead.

10
Q

What is the entropy of a leaf node?

A

Zero, for a pure leaf (one that contains samples of a single class).

11
Q

What is Gini Impurity?

A

The probability of misclassifying a randomly chosen element in a set. It is used to decide the optimal split for a decision tree.

GI = 1 - Σ (p_i)^2
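
A small sketch of the formula in Python (class labels in a list are an assumed input format):

    from collections import Counter

    def gini_impurity(labels):
        """GI = 1 - sum(p_i^2)."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini_impurity(["yes", "yes", "no", "no"]))  # 0.5 (maximum for 2 classes)
    print(gini_impurity(["yes", "yes", "yes"]))       # 0.0 (pure)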

12
Q

Why is Gini impurity preferred over entropy?

A

Gini impurity is computationally cheaper than entropy because it avoids the logarithm.

However, for certain kinds of datasets, entropy yields better splits than Gini impurity.

13
Q

How is a split formed for a numerical column in a decision tree?

A
  1. Sort the values of the numerical column.
  2. For every candidate split point, divide the data into two parts and calculate the entropy of each part.
  3. Choose the split with the maximum information gain as the threshold for that node (see the sketch below).
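
An illustrative sketch of this search on made-up data, using midpoints between consecutive sorted values as candidate thresholds (a common implementation choice, assumed here):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_numeric_split(values, labels):
        """Return (threshold, information_gain) of the best split on a numeric column."""
        pairs = sorted(zip(values, labels))            # 1. sort by the numeric value
        parent_e, n = entropy(labels), len(labels)
        best = (None, -1.0)
        for i in range(1, n):                          # 2. try a split between each pair
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for v, lab in pairs[:i]]
            right = [lab for v, lab in pairs[i:]]
            gain = parent_e - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if gain > best[1]:                         # 3. keep the split with max gain
                best = (threshold, gain)
        return best

    print(best_numeric_split([2.0, 3.5, 7.0, 8.0], ["no", "no", "yes", "yes"]))  # (5.25, 1.0)
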
14
Q

What is the max_depth criterion in the decision tree?

A
  • If max_depth is “None”, the nodes are expanded until all leaves are pure.
  • If max_depth is “1”, the tree is expanded only one level (a single split).
15
Q

What is the splitter criterion in the decision tree? (Hyperparameter)

A
  • If the splitter is “best”, the node is split on the feature/threshold with the maximum information gain.
  • If the splitter is “random”, the split is chosen at random.
16
Q

What is the Min Samples Split criterion in the decision tree?

A

The minimum number of samples a node must contain in order to be split further.

17
Q

What is the Min Samples Leaf criterion in the decision tree?

A

The minimum number of samples that must remain in each leaf after a split.

18
Q

What is the Max Features criterion in the decision tree?

A

The maximum number of features considered at every split.

To prevent overfitting, we can limit the number of features; the subset of features is chosen randomly at each split.

19
Q

What is the Max Leaf Nodes criterion in the decision tree?

A

The maximum number of leaf nodes the tree is allowed to have.
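
A hedged sketch tying the hyperparameter cards above to scikit-learn's DecisionTreeClassifier parameters of the same names (values below are arbitrary examples, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    clf = DecisionTreeClassifier(
        max_depth=4,            # None would grow until leaves are pure
        splitter="best",        # "random" would pick random splits
        min_samples_split=10,   # a node needs >= 10 samples to be split further
        min_samples_leaf=5,     # every leaf must keep >= 5 samples
        max_features="sqrt",    # consider a random subset of features per split
        max_leaf_nodes=20,      # cap the total number of leaves
        random_state=0,
    )
    clf.fit(X, y)
    print(clf.get_depth(), clf.get_n_leaves())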

20
Q

When to use decision tree for regression?

A

When the relationship between the features and the target is non-linear (i.e., a linear model fits poorly), a decision tree can be used for regression.

21
Q

How to use decision tree regressor?

A
  1. Data points are sorted by the independent variable in ascending order.
  2. A split is evaluated at each data point; the best split is the one with the minimum error.

With multiple columns, every column is sorted separately and the best split is the one with the minimum error across all columns.

Error can be calculated using MSE or MAE.
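
A minimal sketch with scikit-learn's DecisionTreeRegressor on made-up non-linear data; in current scikit-learn the MSE criterion is named "squared_error" and MAE is "absolute_error":

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

    reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)  # or "absolute_error"
    reg.fit(X, y)
    print(reg.predict([[1.5], [4.5]]))  # piecewise-constant predictions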

22
Q

What are the types of ensemble learning?

A
  1. Voting
  2. Stacking
  3. Bagging
  4. Boosting
23
Q

What is voting ensemble?

A

Several models are trained on the same data and each casts a vote; for classification, the ensemble predicts the class with the maximum vote count.
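
A sketch with scikit-learn's VotingClassifier (base models chosen arbitrarily); voting="hard" counts class votes, while voting="soft" averages predicted probabilities (see the hard/soft voting cards below):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    vote = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("knn", KNeighborsClassifier()),
            ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ],
        voting="hard",   # "soft" would average class probabilities instead
    )
    vote.fit(X, y)
    print(vote.predict(X[:5]))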

24
Q

What is stacking ensemble?

A

Stacking has multiple layers.

In the first layer, several base models each generate an output. These outputs are passed to the second-layer (meta) model, which effectively learns a weight for each base model; the final result is a weighted combination of the first-layer models' outputs.

25
Q

What is Bagging?

A

Bootstrap Aggregation - a subset of the data points is selected (with replacement) for every model. The final result is based on the maximum vote count (classification) or the average (regression).

When the base model used in bagging is a decision tree, the ensemble is commonly called a Random Forest.
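
A sketch of bagging with decision-tree base estimators in scikit-learn, alongside a RandomForestClassifier for comparison (dataset and settings are arbitrary; recent scikit-learn names the base-model argument `estimator`, older versions `base_estimator`):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    bag = BaggingClassifier(
        estimator=DecisionTreeClassifier(),  # bootstrap samples of rows, full trees
        n_estimators=100,
        random_state=0,
    ).fit(X, y)

    # Random forest = bagged trees + random feature subsets at each split.
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(bag.score(X, y), rf.score(X, y))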

26
Q

Recursive binary splitting

A

Top-down greedy approach.

The best split is made at each particular step, rather than looking ahead and picking a split that may lead to a better tree in some future step.

27
Q

Tree Pruning

A

Removing nodes from the fully grown decision tree to reduce its size and prevent it from overfitting.

28
Q

What is the range of entropy?

A

For a 2-class problem: [0, 1] (with log base 2).
For a k-class problem (k > 2): [0, log2(k)], so it can exceed 1.

29
Q

What is the range of gini impurity?

A

For a 2-class problem: [0, 0.5]; in general, [0, 1 - 1/k] for k classes.

30
Q

What is Boosting?

A

A sequential series of models, each learning from the mistakes of the previous one.

31
Q

To reduce overfitting, why not grow the tree initially to a certain height, rather than growing it fully and then pruning it?

A

Because a seemingly weak split early on may be followed by a very good split later. Growing the tree fully and then pruning avoids discarding such splits prematurely.

32
Q

Advantage of Bagging

A

Decision trees have high variance; bagging reduces variance by averaging their outputs.

If each of n independent observations has variance σ², the variance of their mean is σ²/n.

33
Q

Out-of-Bag error estimation

A

Each tree is trained on roughly two-thirds of the observations, so the remaining one-third (the out-of-bag observations) can be used to estimate the model's error, since those observations were not used to train that particular tree.

The OOB approach is convenient because cross-validation can be computationally expensive for large datasets.
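
A minimal sketch of the OOB estimate in scikit-learn (setting oob_score=True; the fitted attribute oob_score_ then holds the OOB accuracy):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    # Accuracy estimated only from observations each tree did NOT see during training.
    print(rf.oob_score_)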

34
Q

Disadvantage of Bagging

A

Bagging improves prediction accuracy at the expense of interpretability.

If there is a very strong predictor in the dataset, that predictor will be the root node of almost all the trees. The trees will therefore be highly correlated, and averaging them will not reduce the variance as much.

35
Q

Variable importance by bagging

A

Variable importance can be measured using RSS (for regression) or Gini impurity (for classification).

For each predictor, the total decrease in RSS/Gini impurity due to splits on that predictor is averaged over all trees; this ranks the predictors by importance.

36
Q

Why does a voting ensemble work? What are the assumptions?

A

Assumptions for a voting ensemble to work:
1. The models should not be similar to (highly correlated with) each other.
2. Each model's accuracy should be more than 50%; otherwise the ensemble's performance will deteriorate.

37
Q

Hard voting ensemble

A

Each model outputs a class label; the ensemble output is the class with the highest vote count.

38
Q

Soft voting ensemble

A

Each model outputs class probabilities. For each class, these probabilities are averaged over all models, and the class with the highest average probability is the ensemble output.

39
Q

Why does bagging not work on a linear model?

A

Because the result is a linear model!

Bagging is an additive ensemble technique. When we add many linear models, the result is another linear model!

Fitting a linear model is convex and we can find the “best possible” solution easily. Since bagging produces a linear model it can’t beat the “best possible” solution.

40
Q

What is Pasting?

A

Row sampling without replacement

41
Q

What is Random subspaces?

A

Column sampling with (or without) replacement

42
Q

What is Random patches?

A

Both Row & Column sampling with (or without) replacement
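
The three sampling schemes above map onto BaggingClassifier's sampling flags; a sketch assuming recent scikit-learn (base-model argument named `estimator`):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    pasting = BaggingClassifier(          # rows sampled WITHOUT replacement
        estimator=DecisionTreeClassifier(), max_samples=0.7, bootstrap=False, random_state=0)

    subspaces = BaggingClassifier(        # columns sampled, all rows kept
        estimator=DecisionTreeClassifier(), max_features=0.5, bootstrap_features=False,
        max_samples=1.0, bootstrap=False, random_state=0)

    patches = BaggingClassifier(          # both rows and columns sampled
        estimator=DecisionTreeClassifier(), max_samples=0.7, max_features=0.5, random_state=0)

    for name, model in [("pasting", pasting), ("subspaces", subspaces), ("patches", patches)]:
        print(name, model.fit(X, y).score(X, y))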

43
Q

Bagging vs Random Forest

A

Bagging - Row Sampling

Random Forest - Row + Column Sampling (column sampling at node level)

Moreover, the feature selection process in RF prevents the trees from being highly correlated, which is a problem in bagging.

44
Q

What is the suggested number of predictors to be considered for random forest?

A

sqrt(p), where p is the total number of predictors.

45
Q

Can random forest be described as decorrelating the trees?

A

Yes.

In bagging, if there is a very strong predictor, it will appear near the top of almost every tree, so the trees will be highly correlated. Random forest reduces this correlation by considering only a random subset of predictors at each split.

46
Q

Weak learners

A

A model that performs only slightly better than random guessing (just over 50% accuracy for binary classification).

47
Q

Decision Stump

A

Decision tree with max_depth = 1, i.e., only one split.

48
Q

What is AdaBoost?
Also give the formulas for α, the error, and the new weights of the data points.

A

Stagewise additive ensemble - sequential addition of weak learners.

A weight (α) is given to each model based on its prediction error.

error = sum of the weights of the misclassified data points

α = (1/2) · ln((1 - error) / error)

new weight = old weight × e^(-α) (for correctly classified points)
new weight = old weight × e^(+α) (for misclassified points)

The weights are then normalized to sum to 1, so misclassified points carry more weight in the next round.
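
A tiny numeric sketch of these update rules (toy weights; the normalization step at the end is standard AdaBoost practice, not stated on the card):

    import math

    weights = [0.2, 0.2, 0.2, 0.2, 0.2]        # 5 data points, initially equal
    misclassified = [False, False, True, False, False]

    error = sum(w for w, m in zip(weights, misclassified) if m)   # 0.2
    alpha = 0.5 * math.log((1 - error) / error)                   # ~0.693

    new_w = [w * math.exp(alpha if m else -alpha)
             for w, m in zip(weights, misclassified)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]          # normalize so the weights sum to 1
    print(round(alpha, 3), [round(w, 3) for w in new_w])
    # The misclassified point's weight grows (0.2 -> 0.5); the others shrink.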

49
Q

How are data points up-sampled (boosted) in AdaBoost?

A

We first calculate the weight of each data point based on whether it was classified correctly or incorrectly. We then build cumulative ranges from these weights.

We then sample data points at random using these ranges. Since the ranges for misclassified points are wider (their weights are larger), they are sampled more often, so the next model focuses on them.

50
Q

How is a new point classified based on the different weak learners in AdaBoost?

A

P = α1·h1(x) + α2·h2(x) + ... + αn·hn(x); this is stagewise additive modelling,

where hn denotes the nth weak learner.

If P > 0, the point is classified as +1;
otherwise -1 (i.e., the prediction is sign(P)).

51
Q

Bagging vs Boosting

A

Both are ensemble techniques.

  1. Bagging trains models in parallel; boosting trains them sequentially.
  2. Equal weight is given to each model in bagging, whereas in boosting each model can have a different weight.
  3. Bagging is used with low-bias, high-variance models; boosting is used with high-bias, low-variance models.
52
Q

What is special about the first model of gradient boost?

A

The first model is a single leaf:

Regression - it predicts the mean of all outcomes.
Classification - it predicts based on the majority class (in practice the initial leaf predicts the log-odds of the class proportions).

53
Q

What is special about the second model of gradient boost?

A

The second model is a decision tree, but it is trained to predict the residuals of the first model instead of the response variable.
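
A compact sketch of this idea for regression, under the assumption of scikit-learn trees and made-up data: the first model is the mean, each later tree fits the residuals, and contributions are shrunk by a learning rate (see also the next card):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

    learning_rate, n_trees = 0.1, 100
    pred = np.full_like(y, y.mean())            # model 1: a single leaf = the mean
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                     # what the previous models got wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)                       # keep the trees for later prediction

    print(np.mean((y - pred) ** 2))              # training MSE after boosting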

54
Q

Why do we apply a learning rate to each model's prediction in gradient boosting?

A

To control overfitting: each tree's contribution is shrunk, so the ensemble takes many small steps toward the target instead of a few large ones.

55
Q

Gradient Boost vs Adaboost

A
  1. In AdaBoost the trees are stumps (max. 2 leaf nodes); in gradient boosting the trees are larger, typically 8-32 leaf nodes.
  2. In AdaBoost, different models are assigned different weights (α), whereas in gradient boosting all models are added with the same learning rate.
56
Q

Stacking vs Bagging/Boosting

A
  1. In bagging/boosting the base models are of the same type, whereas in stacking the base models can be different.
  2. In bagging the output is the average (regression) or the majority vote (classification); in boosting it is the weighted sum of the models' outputs; in stacking the outputs of the first-layer models are passed to a meta-model, which produces the final output.
57
Q

Problem with stacking

A

Since the base models' predictions on data they have already seen are passed to the meta-model, the meta-model may overfit; blending (hold-out) or k-fold predictions are used to avoid this.

58
Q

Blending - Hold out method

A

The data is divided into a training set and a test set. The training set is further divided into a (smaller) training part and a validation (hold-out) part.

The base models are trained on the smaller training part and make predictions on the validation part; those predictions form a new dataset on which the meta-model is trained. The meta-model is then evaluated by predicting on the initial test set.

59
Q

Stacking - K fold approach

A

The data is divided into a training set and a test set. The training set is further divided into k folds.

Each base model is trained on k-1 folds and predicts on the remaining fold, iterating over all folds; these out-of-fold predictions form a new dataset on which the meta-model is trained. The meta-model then predicts on the test set.
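
A sketch using scikit-learn's StackingClassifier, whose cv parameter implements this k-fold scheme for generating the meta-model's training data (base models and values chosen arbitrarily for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    stack = StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(),  # the meta-model
        cv=5,                                  # out-of-fold predictions feed the meta-model
    )
    stack.fit(X_train, y_train)
    print(stack.score(X_test, y_test))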

60
Q

Advantages of XGBoost

A
  1. Flexibility - any (even custom) loss function can be used in gradient boosting.
  2. Scalable (multiple platforms & languages).
  3. Handles all kinds of ML problems (classification, regression, ranking).
  4. Parallel processing (within the construction of each individual tree; the boosting itself is still sequential).
  5. Can be integrated with other libraries (see the sketch below).
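
A hedged sketch assuming the xgboost Python package and its scikit-learn-style wrapper (parameter values are illustrative, not tuned):

    from xgboost import XGBClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=4,
        n_jobs=-1,          # parallelism within each tree's construction
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
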
61
Q

How does changing the number of trees in a Random Forest affect overfitting?

A

As we add more trees to a random forest, the generalization error decreases and levels off, so more trees do not increase the risk of overfitting. However, the overall complexity (and computational cost) of the model increases.

62
Q

Give examples of cases where random forest might perform worse than other algorithms

A

It might underperform in cases with high-dimensional sparse data, small noisy datasets, extreme class imbalance, time series data, or when interpretability and computational efficiency are critical.