Random Forests and Bagging Flashcards

Learn the general concept of random forest and bagging.

1
Q

Difference between Information Gain and Gini Index?

A

Gini impurity (the Gini Index) is calculated by subtracting the sum of the squared probabilities of each class from one. It tends to favor larger partitions. Entropy sums, over all classes, the probability of the class times the log (base 2) of that probability and negates the result; Information Gain is the reduction in entropy achieved by a split. Information Gain tends to favor smaller partitions with many distinct values.
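
As a rough illustration of the two measures, the sketch below computes Gini impurity, entropy, and information gain from a list of class labels. The function names and the toy labels are made up for this example and are not taken from any particular library.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class among the labels in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: one minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" labels, split into two children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

print(gini(parent))
print(entropy(parent))
print(information_gain(parent, [left, right]))
```

For the balanced parent node the Gini impurity is 0.5 and the entropy is 1 bit; the split shown recovers roughly 0.28 bits of information gain.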

2
Q

Relationship between the homogeneity and impurity with respect to nodes?

A

As the tree is split into more nodes, the nodes become more homogeneous, which means their impurity decreases. Homogeneity and impurity are therefore inversely related.

3
Q

What is max_depth? Do we consider the root node to determine depth?

A

The maximum depth is the length of the longest path from the root node down to the farthest leaf node. It is a hyperparameter in Decision Trees. The root node is typically not included when determining the depth of the tree, so the depth equals the number of edges (splits) along that path.

4
Q

When does overfitting occur?

A

Overfitting shows up as low bias and high variance. Loosely, bias can be read from the training error and variance from the test error (more precisely, from the gap between test and training error). Hence for an overfit model the training error is low and the test error is high.

5
Q

How does model pruning work?

A

Pruning removes nodes whose removal does not hurt model performance. In practice it is done experimentally: you remove certain nodes (replacing a subtree with a leaf), observe the model performance, and keep the removal only if performance does not degrade.

6
Q

What are the reasons to prune a decision tree?

A

Pruning a decision tree helps to prevent overfitting the training data so that our model generalizes well to unseen data. Pruning a decision tree means removing a subtree that is redundant and not a useful split and replacing it with a leaf node.

7
Q

Is it possible to limit the depth to which the original tree can grow?

A

Yes, the max_depth of a Decision Tree is a hyperparameter we can set beforehand. It restricts the tree from growing too deep.

8
Q

What are some typical heuristics to decide whether pruning is beneficial?

A

A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information. Pruning should reduce the size of the learned tree without reducing predictive accuracy as measured on a cross-validation set.

9
Q

Does having more data increase accuracy?

A

Having more data generally increases the accuracy of your model, but there comes a stage where even adding infinite amounts of data cannot improve accuracy any further. This limit is set by what we call the natural noise of the data. It is not just big data, but good (quality) data that helps us build better-performing ML models.

10
Q

What are acceptable performance standards concerning the business?

A

Acceptable performance is usually a business consideration. Whatever minimum depth gives you acceptable business performance is usually chosen so as not to make the model unnecessarily computationally heavy.

11
Q

Do decision nodes grow back after pruning?

A

No, that is not typically done in pruning. Pruning only removes parts of the tree, working from the terminal nodes upward in some branches; the removed nodes are not regrown.

12
Q

What is a Random Variable?

A

A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes.

13
Q

Why is the size of the test data different?

A

The model uses the rules learned from the training data to predict the test data. So we take a large portion of the original data as the training data and a small portion (usually 20% or 30%) as the test data.

14
Q

What is bootstrapping in correlation?

A

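Bootstrapping works by resampling your sample data with replacement: a large number of resamples are drawn, each of the same size. A statistic such as a correlation coefficient is computed on every resample, and the spread of those values estimates its uncertainty.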

15
Q

Can bootstrap samples repeat?

A

Yes. The bootstrap method involves iteratively resampling a dataset with replacement, so the same observation can appear more than once in a sample. When using the bootstrap you must choose the size of the sample and the number of repeats. scikit-learn provides a function (resample in sklearn.utils) that you can use to resample a dataset for the bootstrap method.
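
A minimal sketch of resampling with replacement, using only the Python standard library rather than scikit-learn. The helper name bootstrap_sample, the seed, and the toy data are assumptions made for this illustration.

```python
import random


def bootstrap_sample(data, rng, size=None):
    """Draw `size` items from `data` with replacement (default: len(data))."""
    if size is None:
        size = len(data)
    return rng.choices(data, k=size)


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
data = list(range(10))
n_repeats = 3  # the number of repeats is a choice, as is the sample size
samples = [bootstrap_sample(data, rng) for _ in range(n_repeats)]

for sample in samples:
    # Fewer unique values than draws means some observations were repeated.
    print(sorted(sample), "unique:", len(set(sample)))
```

Because each draw is made from the full dataset, nothing prevents the same observation from being picked again within one sample.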

16
Q

What is test data creation?

A

Test Data Generation is the process of collecting and managing a large amount of data from various resources just to implement the test cases to ensure the functional soundness of the system under testing.

17
Q

What’s the benefit of including the same observation twice in a bootstrap sample?

A

Allowing repeats keeps every draw independent of the others, which makes the samples, and consequently the classifiers, as independent as they can be. Repeatedly taking bootstrap samples replaces otherwise difficult analytical calculations with the empirical distribution that is required to assess your initial statistic.

18
Q

What are out-of-bag observations?

A

The out-of-bag (OOB) observations of a base learner are the observations that were left out of its bootstrap sample. A prediction made for an observation in the original data set using only base learners not trained on this particular observation is called an out-of-bag prediction. These predictions are not prone to overfitting, as each prediction is only made by learners that did not use the observation for training.
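
The sketch below shows one way to find which observations are out-of-bag for a single base learner. The helper names, the seed, and the dataset size are hypothetical and chosen only for illustration.

```python
import random


def bootstrap_indices(n, rng):
    """Indices of a bootstrap sample: n draws with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def oob_indices(n, in_bag):
    """Indices that never appear in the bootstrap sample (out-of-bag)."""
    in_bag_set = set(in_bag)
    return [i for i in range(n) if i not in in_bag_set]


rng = random.Random(42)  # fixed seed, only to make the sketch repeatable
n = 1000
in_bag = bootstrap_indices(n, rng)
oob = oob_indices(n, in_bag)
oob_fraction = len(oob) / n

# This learner would be trained on in_bag rows and evaluated on oob rows.
print("out-of-bag fraction:", oob_fraction)
```

For a dataset of this size the out-of-bag fraction typically lands somewhat above one third.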

19
Q

How is voting carried out for regression and classification problems?

A

For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models. In classification, a hard voting ensemble involves summing the votes for crisp class labels from other models and predicting the class with the most votes.
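
Both aggregation rules fit in a few lines. This is a sketch under the assumption that each model's prediction is already available as a list entry; hard_vote and average_vote are names invented for the example.

```python
from collections import Counter
from statistics import mean


def hard_vote(predictions):
    """Classification: return the class label predicted by the most models."""
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(predictions).most_common(1)[0][0]


def average_vote(predictions):
    """Regression: return the mean of the individual model predictions."""
    return mean(predictions)


class_votes = ["x", "y", "x", "z", "x"]  # hypothetical labels from 5 models
value_votes = [2.0, 3.0, 4.0]            # hypothetical outputs from 3 models

print(hard_vote(class_votes))     # x
print(average_vote(value_votes))  # 3.0
```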

20
Q

Is there any way to speed up the Random Forest algorithm?

A

There are techniques for this on the algorithmic side, such as Quantization with a smaller number of bits, and Parallelized computing with a library called Dask.

21
Q

Is there no power analysis in bootstrap to determine the minimal sample size?

A

The percentage of samples from the original data is a hyperparameter for Random Forests which can be tuned for optimal value.

22
Q

How is sampling with replacement better than sampling without replacement?

A

Sampling with replacement is better in the following ways:
1. The size of the subsets would decrease if the sampling was done without replacement; without repetition, you would need to throw out a lot of data to get a reasonable diversity in samples.
2. What we have to work with is a sample of data, never the true distribution, so certain attributes may be slightly over- or under-represented. Sampling with repetition helps us reflect the population distribution more accurately with many repetitions and true randomness.
3. The effectiveness of bagging, among many other results, depends upon the randomness/independence of classifiers. Sampling without replacement might increase the bias and decrease the randomness of the samples, rendering such ensemble techniques less effective.

23
Q

What is the effect of outliers and their removal?

A

It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Outliers are entries that make up a very small percentage of the data, and since we are sampling with replacement, the probability of each data point being chosen in any single draw remains the same, i.e. 1/n.

24
Q

What is a good loss in machine learning?

A

Loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.

25
Q

Does a bootstrap sample have to be the same size as the original sample?

A

The bootstrap method is a resampling technique used to estimate statistics on a population by iteratively sampling a dataset with replacement. When using the bootstrap you must choose the size of the sample and the number of repeats, so the sample does not strictly have to be the same size as the original. However, the bootstrap principle says that choosing a random sample of size n from the population can be mimicked by choosing a bootstrap sample of size n from the original sample, which is why size n is the usual choice.

26
Q

How do you prove that an algorithm is correct?

A

The only way to prove the correctness of an algorithm over all possible inputs is by reasoning formally or mathematically about it. One form of reasoning is a “proof by induction”, a technique that’s also used by mathematicians to prove properties of numerical sequences.

27
Q

Would maximizing each model and weighing averages increase model performance?

A

Yes, the idea is to use simple classifiers that are not expensive computationally and get an ensemble model that is better than those individual weak learners.

28
Q

What are the advantages and disadvantages of ensemble models?

A

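An ensemble can achieve lower variance and lower bias than its individual models, and it can capture underlying data patterns that stay hidden from a single model. Ensembles should be used when more accuracy is needed. The main disadvantages are higher computational cost and reduced interpretability.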

29
Q

What type of error does bias cause in a model?

A

Bias error results from simplifying the assumptions used in a model so the target functions are easier to approximate.

30
Q

How do you interpret the results of a random forest?

A

One way of getting an insight into a random forest is to compute feature importances, either by permuting the values of each feature one by one and checking how it changes the model performance, or by computing how much each feature decreases “impurity” across the trees.

31
Q

Is random forest sequential? Does order matter?

A

Ensemble algorithms or methods can be divided into two groups: sequential ensemble methods, where the base learners are generated sequentially (e.g. AdaBoost), and parallel ensemble methods, where the base learners are generated in parallel (e.g. Random Forest). So random forest is not sequential, and the order does not matter because each classifier is considered with equal importance.

32
Q

When should we use random forests?

A

Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major concern. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.

33
Q

If the two classifiers have identical features then do we say both are highly correlated?

A

Not necessarily; it depends on the bootstrapped samples as well. With bootstrapping, there is a chance that different trees end up with nearly the same features, leading to the same feature interpretation, even though they were trained on different samples.

34
Q

What situation do you think where bootstrapping is not applicable?

A

There are several conditions when bootstrapping is not appropriate, for example when the population variance is infinite, or when the population values are discontinuous at the median. There are also various conditions where tweaks to the bootstrapping process are necessary to adjust for bias.

35
Q

How is the output value determined in the Random Forest Regression?

A

The algorithm operates by constructing multiple decision trees at training time and outputs by taking the mean of predictions of the individual trees.

36
Q

How does the classification work for multi-class problems in Random Forest?

A

Multi-class prediction works the same way as binary prediction. At the final stage, the prediction is the class that receives the majority of the votes among the different class labels.

37
Q

What value do the leaf nodes have in a Decision Tree Regressor?

A

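In a Decision Tree Regressor, each leaf node holds a continuous value, typically the mean of the training targets that reach that leaf. That value is the prediction for any observation that satisfies the conditions along the path to the leaf.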

38
Q

Does Bagging handle the class imbalance?

A

We need to handle the data class imbalance before it passes to the model. Bagging models do not have an implicit mechanism to handle the class imbalance, so it should be handled beforehand.

39
Q

In a random forest, what does “sampling the features at each node” mean?

A

It means that we are taking a subset of all features to decide the best split at each stage, unlike decision trees where we use all the features at each stage.
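
A minimal sketch of drawing a fresh feature subset for each node. The feature names are hypothetical, and defaulting the subset size to the square root of the number of features is an assumption borrowed from a common convention for classification, not something stated in the answer above.

```python
import math
import random


def sample_features(feature_names, rng, k=None):
    """Pick k distinct candidate features for one split (default: sqrt rule)."""
    if k is None:
        k = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)


rng = random.Random(7)  # fixed seed, only to make the sketch repeatable
features = ["age", "fare", "sex", "class", "cabin",
            "port", "siblings", "parents", "title"]

# Each node gets its own random subset; only these compete for the best split.
for node_id in range(3):
    print("node", node_id, "considers", sample_features(features, rng))
```

A plain decision tree would instead pass the full feature list to every node.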

40
Q

Are ensemble models computationally expensive?

A

Yes, they are more computationally expensive than traditional machine learning algorithms. Bagging usually creates n trees and fits each one on a different subset of the data. For larger datasets, training ensemble models takes a lot of time and RAM.

41
Q

Do we interpret the best feature to be the one with the most homogeneous split or the one that explains most of the target variable?

A

The level of homogeneity achieved in the split is what the tree uses to determine the most important feature.

42
Q

Is it correct to say that the feature with the least entropy value becomes the root node, and the rest of the terminal nodes are chosen by the feature that has a less entropy value from the root node and so forth?

A

Yes, the root node is the feature split that leads to the biggest reduction in entropy, and so on to the next biggest reduction in entropy for every node henceforth.

43
Q

What does it mean for pruning to be “Expensive”?

A

Pruning is expensive because while pruning, many sub-trees must be formed and compared. For example, for each subtree, we have to compare the misclassification error before and after removing that subtree. Then based on that, a decision can be made whether to keep the subtree or remove it, hence pruning requires more computation.

44
Q

What is level in a decision tree?

A

Level – The level of a node is defined by 1 + the number of connections between the node and the root. The important thing to remember is when talking about level, it starts from 1 and the level of the root is 1.
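
To make the counting concrete, the sketch below assigns a level to every node of a small hand-built tree. The nested-dictionary representation and the node names are assumptions made for this example.

```python
def node_levels(node, level=1, out=None):
    """Map each node name to its level; the root is at level 1."""
    if out is None:
        out = {}
    out[node["name"]] = level
    for child in node["children"]:
        node_levels(child, level + 1, out)
    return out


# Hypothetical tree: root -> (a, b), and a -> c.
tree = {
    "name": "root",
    "children": [
        {"name": "a", "children": [{"name": "c", "children": []}]},
        {"name": "b", "children": []},
    ],
}

levels = node_levels(tree)
depth = max(levels.values()) - 1  # depth counts connections, so it is one less

print(levels)
print("depth:", depth)
```

Here the deepest node sits at level 3, while the depth of the tree, counted in connections from the root, is 2.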

45
Q

How do we compute feature importance?

A

The decrease in node impurity is weighted by the likelihood of accessing that node to compute feature importance. The number of samples that reach the node divided by the total number of samples yields the node probability. The more significant the feature, the higher the value.
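
A small worked example of this calculation, with made-up sample counts and impurity values for two splits. Normalizing the results so they sum to one is an assumption added here to make the numbers easier to compare.

```python
def weighted_impurity_decrease(n_total, n_node, impurity,
                               n_left, impurity_left,
                               n_right, impurity_right):
    """Impurity decrease of one split, weighted by the node probability."""
    node_probability = n_node / n_total
    decrease = (impurity
                - (n_left / n_node) * impurity_left
                - (n_right / n_node) * impurity_right)
    return node_probability * decrease


# Hypothetical tree with 100 samples: the root splits on "age",
# and its 60-sample left child splits on "fare".
splits = [
    ("age", weighted_impurity_decrease(100, 100, 0.50, 60, 0.30, 40, 0.20)),
    ("fare", weighted_impurity_decrease(100, 60, 0.30, 30, 0.10, 30, 0.10)),
]

raw = {}
for feature, value in splits:
    raw[feature] = raw.get(feature, 0.0) + value  # sum over a feature's splits

total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}
print(importances)
```

With these numbers "age" contributes 0.24 and "fare" 0.12 before normalization, so "age" ends up with about two thirds of the total importance.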

46
Q

Can we know the cost of pruning analysis beforehand?

A

No. We cannot know before trying pruning.

47
Q

Can entropy be interpreted as a risk measure?

A

Yes, it can be called a Risk measure because entropy measures impurity, disorder, or uncertainty of the nodes. Based on this measure we can keep track of the homogeneity of the nodes at each level in decision trees.

48
Q

What is the rule of thumb for Training/ Test data split %?

A

The general rule of thumb is to partition the data set into the ratio of 3:1:1 (60:20:20) for training, validation, and testing respectively.
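
A sketch of such a 60:20:20 partition on a toy list of rows. The function name and the use of integer percentages (to sidestep floating-point rounding) are choices made for this example.

```python
import random


def train_val_test_split(rows, rng, train_pct=60, val_pct=20):
    """Shuffle the rows, then cut them into train / validation / test parts."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


rng = random.Random(1)  # fixed seed, only to make the sketch repeatable
rows = list(range(100))
train, val, test = train_val_test_split(rows, rng)
print(len(train), len(val), len(test))  # 60 20 20
```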

49
Q

Don’t we have to keep aside the test set and do the bootstrap sampling only from the training set?

A

Yes. The test data set is kept aside and we bootstrap only the training data. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement. Each subset has the same number of observations and can be used to train models in parallel. By sampling with replacement, some observations may be repeated in each new training dataset.

50
Q

Different classifiers = different trees based on different datasets… is this correctly interpreted?

A

Yes. In Random Forest, each tree is trained on a different bootstrapped sample, with a random subset of features considered at each split, which keeps the trees weakly correlated. The final result is taken by voting across the trees. Each individual tree can be called a classifier, and the different classifiers are combined to give more accurate results.

51
Q

Random forests are good for creating predictive models, and it’s much harder to interpret?

A

Yes, that is one tradeoff between performance and the interpretability of the model. It is hard to interpret results from ensemble methods and we cannot define clear decision rules like a single decision tree.

52
Q

Is there a formal way to measure the model complexity?

A

There is no formal way or formula to find the complexity of the model. Complexity is a notion. Model complexity often refers to the number of parameters to be estimated or terms included in a given predictive model. If the number of parameters is high then we can say that the model is more complex.

53
Q

So do you use training data to determine which way the classifier votes?

A

Yes, the model uses the rules learned from the training data to predict the test data.

54
Q

How does bootstrapping make independent samples given that all the data comes from the same original dataset?

A

Bootstrapping creates several subsets of the original dataset chosen randomly with replacement. Each subset has the same number of observations and can be used to train models in parallel. By sampling with replacement, some observations may be repeated in each new training dataset, so the subsets differ from one another even though they all come from the same data. Bootstrapping is done using only the training dataset; the test data set is kept separate and is not bootstrapped.

55
Q

If the base error is >0.5 then the ensemble error looks higher than the base error; does it mean that bagging works only when the individual classifiers produce <0.5 error?

A

A base error greater than 0.5 means the classifier does worse than random guessing on a binary problem, i.e. it tends to give the opposite of the correct prediction. Bagging works if we pick models whose base error is strictly lower than 0.5.
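
The effect can be checked numerically under a simplifying assumption that is not guaranteed in practice: a binary problem with an odd number of classifiers whose errors are independent. The ensemble is then wrong when more than half of the classifiers are wrong, which is a binomial tail probability. The function name and the choice of 25 classifiers are illustrative.

```python
from math import comb


def majority_vote_error(base_error, n_classifiers):
    """P(more than half of the classifiers are wrong), assuming independence."""
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes that wins
    return sum(
        comb(n_classifiers, k)
        * base_error ** k
        * (1 - base_error) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


n_classifiers = 25  # odd, so a binary vote cannot tie
for base_error in (0.35, 0.5, 0.6):
    print(base_error, round(majority_vote_error(base_error, n_classifiers), 3))
```

Below 0.5 the ensemble error drops well under the base error, at exactly 0.5 it stays at 0.5, and above 0.5 the vote makes things worse.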

56
Q

If the decision tree has multiple leaf nodes, how will you go about voting?

A

In a Random Forest classifier, voting happens across trees, not across the leaf nodes of one tree: each tree routes the observation to a single leaf and outputs one class, and the trees’ classifications are combined into a final classification through a “majority vote” mechanism. Suppose the random forest contains 100 decision trees. Each tree may predict a different target (outcome) for the same test observation. Suppose the 100 trees predict 3 unique targets x, y, and z; then the vote count for x is simply the number of trees out of 100 whose prediction is x, and likewise for the other 2 targets (y, z). If x gets the most votes, say 60 of the 100 trees predict x, then the random forest returns x as the predicted target.

57
Q

How does a Random Forest Algorithm give predictions on an unseen dataset?

A

For a classification problem, each tree makes a prediction and the forest selects the majority outcome. For a regression problem, each tree makes a prediction and the forest takes the mean of the predictions as the final output.

58
Q

How does random forest define the Proximity between observations and what is the use of the proximity matrix?

A

The term “proximity” means “closeness” or “nearness” between pairs of cases. Proximities are calculated for each pair of cases/observations/sample points. If two cases occupy the same terminal node in a tree, their proximity is increased by one. At the end of the run of all trees, the proximities are normalized by dividing by the number of trees. For example, if your random forest consists of 100 trees and a pair of observations ends up in the same leaf node in 80 of the 100 trees, then the proximity measure is 80/100 = 0.8.
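
The counting rule can be sketched as follows, assuming we already know which leaf each observation lands in for every tree. The leaf_ids table (5 trees, 3 observations) is invented for the example.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the leaf that observation i reaches in tree t."""
    n_trees = len(leaf_ids)
    n_obs = len(leaf_ids[0])
    counts = [[0] * n_obs for _ in range(n_obs)]
    for tree_leaves in leaf_ids:
        for i in range(n_obs):
            for j in range(n_obs):
                if tree_leaves[i] == tree_leaves[j]:
                    counts[i][j] += 1  # same terminal node in this tree
    # Normalize by the number of trees.
    return [[c / n_trees for c in row] for row in counts]


# Hypothetical leaf assignments: observations 0 and 1 share a leaf in 4 trees.
leaf_ids = [
    [0, 0, 1],
    [2, 2, 2],
    [1, 1, 0],
    [3, 3, 1],
    [0, 1, 1],
]

prox = proximity_matrix(leaf_ids)
for row in prox:
    print(row)
```

Observations 0 and 1 share a leaf in 4 of the 5 trees, so their proximity is 0.8; every observation has proximity 1.0 with itself.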

59
Q

Should we use a stratified sample to avoid the case where the bootstrapped dataset contained no examples of Italian or French restaurants?

A

Stratified sampling is a sampling technique where the samples are selected in the same proportion (by dividing the population into groups called ‘strata’ based on a characteristic) as they appear in the population. For example, if the population of interest has 30% male and 70% female subjects, then we divide the population into two (‘male’ and ‘female’) groups and choose 30% of the sample from the ‘male’ group and 70% of the sample from the ‘female’ group.
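
The male/female example can be sketched like this. The rows, the 10% sampling fraction, and the helper name stratified_sample are assumptions for the illustration; each stratum is sampled without replacement here.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every stratum defined by key(row)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample


rng = random.Random(3)  # fixed seed, only to make the sketch repeatable
rows = [("male", i) for i in range(30)] + [("female", i) for i in range(70)]

sample = stratified_sample(rows, key=lambda row: row[0], fraction=0.1, rng=rng)
counts = Counter(group for group, _ in sample)
print(counts)
```

The 10-row sample keeps the 30:70 proportion of the population: 3 rows from the ‘male’ group and 7 from the ‘female’ group.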

60
Q

How do you decide how many classifiers to use?

A

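The hyperparameter max_features in Random Forest controls how many features are considered at each split; we can try values such as “sqrt” and “log2” (older scikit-learn versions also accept “auto”) and tune to see which gives better performance. The number of classifiers (trees) is a separate hyperparameter, n_estimators in scikit-learn, and is tuned the same way.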

61
Q

Do we let the algorithm tell us what data wasn’t used in any of the bootstraps?

A

Bagging is essentially Bootstrap + Aggregation. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement, and the data not sampled can be used for validation. The probability that an observation is not sampled is approximately $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. So, it can provide a reasonable percentage for cross-validation to estimate error.
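
The 0.368 figure follows from a short calculation, assuming a bootstrap sample of n independent draws from a dataset of n observations, each draw picking any given observation with probability 1/n:

```latex
P(\text{observation } i \text{ is never drawn})
  = \left(1 - \frac{1}{n}\right)^{n}
  \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368
```

For n = 10 the value is already about 0.349, and for n = 1000 it is about 0.3677, so the limit is a good approximation for realistic dataset sizes.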

62
Q

How many features are in a subset of features? What proportion?

A

The number of features to be selected while building a Random forest is a Hyperparameter. The features are selected at random by the forest to reduce the overfitting. In each split the number of features Random forest takes into account will always be the same. To get the optimal number of features we should do hyperparameter tuning.

63
Q

I don’t see intuitively or rigorously how the sampling feature reduces dependency. Can you please give some intuition?

A

Random selection of features at each split further increases the diversity in the model and hence the independence of the weak classifiers from one another. The more classifiers there are, the more bootstrapped samples are drawn from the same dataset, so the samples overlap more; if we also used the best feature (from all the features) at each split, more and more classifiers would be similar and less random.

64
Q

So the output of a regression tree is still categorical?

A

No. In regression, each tree outputs a numeric value. We take the average of all the trees’ outputs to compute the final output value.

65
Q

Does having max depth mean all the trees in Random Forest have the same depth, or can they be shallower than that (max depth)?

A

No. max_depth is only an upper bound: each tree can grow up to the maximum depth, and the trees need not all reach the same depth. If we specify the maximum depth, for example in random forests, then all the trees share the same ‘maximum depth’ limit.

66
Q

Is it fair to say RandomForest is not great for real-time data?

A

Random forests are comparatively slow at generating predictions because they contain multiple decision trees. Whenever the forest makes a prediction, every tree has to make a prediction for the same input and the results are then combined by voting. This can make a large forest too slow for real-time use.

67
Q

Are the features also picked in the same way we did in bootstrapping (with replacement)?

A

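Random forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature among all features while splitting a node, it searches for the best feature among a random subset of features. Within a single split the candidate features are drawn without replacement (a feature cannot appear twice in the subset), but a fresh random subset is drawn at every node. This results in a wide diversity that generally results in a better model.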

68
Q

How significant is the difference between the error from the DT and RF? It seems they were very similar in the Titanic example. Is it worth using RF when the error doesn’t decrease significantly?

A

In the graph, we can see that they follow more or less the same path till the depth=4. There is no significant decrease in the error when compared with the decision tree. There can be cases where a Decision tree might outperform a Random forest. We should always compare multiple models’ performances before selecting. It should also be noted that random forest if tuned in a better way can give a much better performance.

69
Q

Is there a natural way to avoid overfitting and therefore, not needing to depend on pruning, bagging, random forest?

A

The Decision Trees and Random forest by default are built in such a way that they generally overfit the data. You should either use pruning or truncation to reduce overfitting.

70
Q

The Decision Trees and Random forests by default are built in such a way that they generally Overfit. If we let the DT build itself it will try to fit for each row of the dataset.

A

The condition is based on impurity, which in the case of classification problems is Gini impurity/information gain (entropy), while for regression trees it is variance. So when training a tree we can compute how much each feature contributes to decreasing the weighted impurity. feature_importances_ in Scikit-Learn is based on that logic, but in the case of Random Forest, we are talking about averaging the decrease in impurity over trees.

71
Q

Can we do dimensionality reduction upfront instead of pruning later?

A

If the dimensionality reduction is done before we may lose some important variables. This will give us principal components to be used as features that are not interpretable and we can’t find any important variables which help us to make any decisions.

72
Q

Which is more important model accuracy or model performance?

A

Model accuracy is a part of measuring the model performance. There are multiple metrics to measure the model performance such as Precision, Recall, F1 score, etc.

73
Q

Do we feature-engineer the numerical values in DT?

A

The Decision trees can take care of the numerical values automatically. While using Decision trees we can directly use the numerical columns without scaling or normalization.

74
Q

Is how many children nodes can branch out also a hyperparameter?

A

It is not a hyperparameter. The number of child nodes from a parent node is generally 2 in most cases.

75
Q

If we have linear relations, are there instances where decision tree is the best performing?

A

Decision trees are agnostic to whether or not the patterns are linear. For a dataset with mostly linear relationships, it may or may not have better predictive power than linear regression.

76
Q

What is meant by a train set and validation data?

A

Training data: The data on which the model is getting trained.
Validation data: The data which is used to validate the results of the trained model.

77
Q

Can you explain the greedy algorithm in the context of decision trees?

A

Greedy algorithms are a class of algorithms that aim to make the best locally optimal choice at each step of the solution, in order to approximate what the globally optimal solution to the problem could be. For example, in a decision tree, we make a greedy choice that we will make a split, where we get the maximum information gain.
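
A minimal sketch of one greedy step: for a single numeric feature, try every candidate threshold and keep the one with the highest information gain, without looking ahead to later splits. The toy data and function names are assumptions for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels in a node."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs)


def best_threshold(values, labels):
    """Greedy choice: the threshold whose split maximizes information gain."""
    n = len(labels)
    parent_entropy = entropy(labels)
    best_t, best_gain = None, 0.0
    for t in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = (parent_entropy
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


values = [1, 2, 3, 4, 5, 6]  # hypothetical feature values
labels = [0, 0, 0, 1, 1, 1]  # hypothetical class labels

threshold, gain = best_threshold(values, labels)
print(threshold, gain)  # 3 1.0
```

A full tree builder would repeat this step on each child node; the choice at every node is locally optimal, which is what makes the procedure greedy.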

78
Q

In slide 12, which data is used as training and testing, and what do we infer from this error chart?

A

The Titanic dataset is used for comparing train and test errors. Here the training and test datasets are chosen randomly in the ratio of 80:20. The misclassification error of the decision tree algorithm is plotted for train and test data for different values of tree depth. It is observed that the training error keeps decreasing as the maximum depth increases, but the test error eventually starts increasing after decreasing initially. A model with low train error and high test error implies low bias and high variance, i.e. the model has started to overfit the training data. The train and test errors are close for max_depth=1, but that is again not a good model, as the decision tree is making predictions on the basis of a single node.

79
Q

Should both train and test error be decreasing with the increase of depth in the tree? Or should they be close to each other irrespective of the amount of error?

A

Train error always decreases as we increase the depth of the tree, whereas the test error first decreases and then increases (due to overfitting). So, we need to find that sweet spot where our model does comparably well on both the training and the test datasets i.e. the model is neither overfitting nor underfitting.

80
Q

In slide Number 14, explain all terms, what is bias, variance, tradeoff, and what is the center of the data?

A

Bias is the difference between the prediction of our model and the correct value which we are trying to predict. A model with high bias pays little attention to the training data and oversimplifies, which leads to a high error on both training and test data. This results in underfitting the data.
Variance is the amount by which the model’s predictions would change if it were trained on a different training set. A model with high variance pays a lot of attention to training data and does not generalize on the test data. Therefore, such models perform very well on training data but have a high error on test data. This results in overfitting the data.
The Bias-variance tradeoff exists because an algorithm can’t be more complex and less complex at the same time.
The center represents the truth. Any hit close to it is considered as low bias data points. If each subsequent hit is close to the previous hit, it is considered a low variance case.

81
Q

Is bias equal to the recall?

A

No. Bias is the difference between the prediction of our model and the correct value which we are trying to predict. On the other hand, recall is one of the performance metrics to evaluate a classification model. Mathematically, we define recall as the number of true positives divided by the number of true positives plus the number of false negatives.
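
The recall formula can be written out directly. The labels below are made up, and the convention of returning 0.0 when there are no actual positives is an assumption of this sketch.

```python
def recall(y_true, y_pred, positive=1):
    """True positives divided by (true positives + false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no actual positives; convention chosen for this sketch
    return tp / (tp + fn)


y_true = [1, 1, 1, 0, 0, 1]  # hypothetical actual labels
y_pred = [1, 0, 1, 0, 1, 1]  # hypothetical predicted labels

print(recall(y_true, y_pred))  # 0.75
```

Three of the four actual positives are found and one is missed, giving 3 / (3 + 1) = 0.75.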

82
Q

What is entropy?

A

Entropy: It is a measure of uncertainty or diversity embedded in a random variable. Suppose Z is a random variable with the probability mass function P(Z); then the entropy of Z is given as $H(Z) = -\sum_{z} P(z) \log P(z) = -E[\log P(Z)]$, where E represents the expected value.

83
Q

Can the depth be the same even after performing pruning?

A

The depth of a tree is the total number of edges from the root node to a leaf node in the longest path. It is possible that even after removing some subtrees, the longest path from the root node to a leaf node remains the same. Hence, the depth can be the same even after pruning.

84
Q

Is there any possibility to get entropy value as zero at all leaf nodes?

A

Yes. During training, if we allow a decision tree to grow fully until every leaf contains only one class (i.e. every leaf is homogeneous), then the entropy of all leaf nodes is zero. However, this usually results in overfitting the data.

85
Q

How does one get an average of different estimates in bagging?

A

In bagging, a model is trained on each of the bootstrapped training sets independently and the results are aggregated for the final prediction. In a classification setting, the final prediction is the class with the most votes (the mode) across these models. In regression, the final prediction is the average of all the predictions.
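
The two aggregation rules can be sketched as follows. The function names are hypothetical, and the inputs stand for the per-model predictions for a single observation.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(predictions):
    """Majority vote: return the most frequent predicted class (the mode)."""
    # most_common(1) returns a list holding one (label, count) pair.
    return Counter(predictions).most_common(1)[0][0]


def aggregate_regression(predictions):
    """Return the average of the individual model predictions."""
    return mean(predictions)


vote = aggregate_classification(["cat", "dog", "cat"])
average = aggregate_regression([2.0, 4.0, 6.0])
```

In this sketch a tie goes to whichever tied label was seen first, which is just one of several reasonable tie-breaking choices.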

86
Q

How does bootstrapping vary from simply splitting the data N-times into test and validation sets?

A

The theory behind bagging, and random forest in particular, rests on the assumptions of randomness and independence of the classifiers. When we sample the data with replacement, we increase the randomness in the subsets, which in turn makes the classifiers more independent. Also, each new sample can be of the same size as the original data, which is very helpful if we have a small dataset, whereas repeated splitting leaves each model with only a fraction of the data.

87
Q

What is the effect of outliers in bagging?

A

Outlier points are generally a small percentage of the whole dataset, so when the data is sampled they tend to influence only a small number of the classifiers. If an outlier is in the test set, then when the outcomes of all classifiers are aggregated, it is likely that the majority of classifiers would predict that point correctly and the model would generalize well. In general practice, one should be careful about which points are considered outliers, as they might just be data points that correctly represent the population. One should treat or include an outlier only after analyzing the data.

88
Q

How is sampling with replacement better than sampling without replacement?

A

Sampling with replacement is better in the following ways:
1. The size of the subsets would decrease if the sampling was done without replacement, and without repetition you would need to throw out a lot of data in order to get a reasonable diversity in samples.
2. What we have to work with is a sample of data, never the true distribution, so certain attributes may be slightly over- or under-represented. Sampling with repetition helps us reflect the population distribution more accurately with many repetitions and true randomness.
3. The effectiveness of bagging, among many other results, depends upon the randomness/independence of classifiers. Sampling without replacement might increase the bias and decrease the randomness of the samples, rendering such ensemble techniques less effective.

89
Q

How can averaging reduce variance?

A

We can reduce the variance if we average a number of independent random variables. Let $X_1, X_2, \dots, X_N$ be independent random variables, each with variance $\operatorname{Var}(X_i) = \sigma^2$. Let $\bar{X}$ be their average, $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$, which is itself a random variable. The variance of $\bar{X}$ is smaller than the individual variances by a factor of N: $\operatorname{Var}(\bar{X}) = \sigma^2 / N$. This is the key idea of averaging techniques in ensemble learning. By aggregating the results from multiple classifiers, we can reduce the variance of the final prediction.
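
For completeness, the 1/N result follows in two steps from the standard rules for variance. The second equality relies on the independence assumption, which removes all covariance terms.

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\Big(\frac{1}{N}\sum_{i=1}^{N} X_i\Big)
  = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(X_i)
  = \frac{1}{N^2}\cdot N\sigma^2
  = \frac{\sigma^2}{N}
```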

90
Q

In bootstrapping, can we pick a data point (observation) twice in sampling?

A

Yes. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement. Each subset has the same number of observations and can be used to train models in parallel. Because sampling is done with replacement, some observations may appear more than once in a given training dataset.
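
A minimal sketch of drawing one bootstrap sample is shown below. The helper name `bootstrap_sample`, the toy dataset, and the fixed seed are all assumptions made so that the example is small and repeatable.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random from data, with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)        # fixed seed only so the sketch is repeatable
data = list(range(10))        # toy dataset of ten observations
sample = bootstrap_sample(data, rng)

# Observations that were never drawn are "out of bag" for this sample.
out_of_bag = [x for x in data if x not in sample]
```

The sample has the same size as the original data, so whenever some observation is left out, at least one other observation must appear more than once.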

91
Q

In bagging, first, we have to split the data into training and validation and we should then perform bootstrapping on the training set. Is that right?

A

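Bagging is essentially bootstrap + aggregation. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement, and the data not sampled (the out-of-bag observations) can be used for validation. For a bootstrap sample of size N drawn from N observations, the probability that a given observation is not sampled is (1 - 1/N)^N, which is approximately 1/e ≈ 0.368 for large N. So a separate validation split is not strictly required: the left-out data provides a reasonable percentage of observations for estimating the error.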
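
The 0.368 figure can be checked numerically. The sketch below assumes a bootstrap sample of size n drawn from n observations, and the function name is hypothetical.

```python
import math


def prob_never_sampled(n):
    """Probability that one fixed observation is missed in all n draws."""
    # Each draw misses the observation with probability 1 - 1/n.
    return (1 - 1 / n) ** n


limit = math.exp(-1)  # the large-n limit, 1/e
gaps = {n: abs(prob_never_sampled(n) - limit) for n in (10, 100, 1000)}
```

The gap to 1/e shrinks as n grows, which is why roughly 37% of the observations end up out of bag for datasets of realistic size.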

92
Q

Is k-fold different from bootstrapping?

A

Yes, they are different. In k-fold, the data is partitioned into folds “without” replacement (repeats not allowed), so every observation lands in exactly one fold and each iteration uses the same number of folds for training (k-1) and for testing (1). That is not the case in bootstrapping, which allows repeated observations while creating several subsets of the original dataset.
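
The "without replacement" property of k-fold can be sketched as a plain partition of row indices. This is a simplified illustration with a hypothetical helper name; it skips the shuffling that is usually done before splitting.

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint folds of nearly equal size."""
    # Fold i takes every k-th index starting at i, so no index repeats.
    return [list(range(i, n, k)) for i in range(k)]


folds = k_fold_indices(10, 3)

# In iteration i, fold i is the test set and the other folds form the training set.
test_fold = folds[0]
train_indices = [idx for fold in folds[1:] for idx in fold]
```

Every index appears in exactly one fold, whereas a bootstrap sample can contain the same index several times and omit others entirely.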

93
Q

For each dataset, we build a tree using the same entropy minimization idea. Is this right?

A

Yes, for each bootstrapped training dataset, we fit a decision tree that works on minimum entropy. Entropy is calculated for every candidate feature, and the one yielding the minimum value after the split is selected. With log base 2, entropy ranges from 0 to 1 for a two-class problem; with K classes the maximum is log2(K).
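
The selection rule can be sketched as follows, under simplifying assumptions: features are categorical, each row is a dictionary, and a split creates one child per feature value. All function names and the toy data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of the empirical class distribution."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_child_entropy(rows, labels, feature):
    """Size-weighted average entropy of the children created by a split."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def best_feature(rows, labels, features):
    """Pick the feature whose split leaves the lowest weighted entropy."""
    return min(features, key=lambda f: weighted_child_entropy(rows, labels, f))


rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [0, 0, 1, 1]
chosen = best_feature(rows, labels, ["a", "b"])
```

Here feature "a" separates the two classes perfectly, so its children have zero entropy and it is chosen over "b".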

94
Q

The slide "Analysis of Voting": What is meant by uniform error? Why is it less than 0.5?

A

Uniform error: assume every decision tree has the same misclassification error rate Ei. It should be less than 0.5, because a classifier that misclassifies more than half of the points is worse than random guessing, which would give a misclassification rate of 0.5 in the case of binary classification. Aggregating such classifiers leads to a very bad prediction. For example, the uniform error rate of each classifier is assumed to be 0.25 in that slide. We also assume that all classifiers are independent. If these two conditions hold, then the probability of the majority of classifiers making a mistake is much lower than the probability of one of them making a mistake.
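
Under the two stated assumptions (a uniform error rate and independent classifiers), the number of wrong classifiers follows a binomial distribution, so the majority-vote error can be computed directly. The sketch below is illustrative; the function name and the ensemble sizes 3 and 21 are assumptions, and only the 0.25 error rate comes from the slide.

```python
from math import comb


def majority_error(n_classifiers, error_rate):
    """Probability that more than half of n independent classifiers are wrong."""
    smallest_majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(smallest_majority, n_classifiers + 1)
    )


single = majority_error(1, 0.25)    # one classifier: just its own error rate
small = majority_error(3, 0.25)     # at least 2 of 3 must be wrong
larger = majority_error(21, 0.25)   # at least 11 of 21 must be wrong
```

The error of the vote drops as classifiers are added, but only because each one is better than chance and they are assumed to err independently.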

95
Q

In bootstrap do we provide a hyperparameter to leave a certain percentage of data points out?

A

No. Bootstrapping is a random selection with replacement, so a random number of points (about 37% on average) are left out of each bootstrap sample.

96
Q

Are we combining weak learners to create a strong one?

A

Yes. That is the idea behind ensemble techniques, where a set of weak learners are combined to create a strong learner that obtains better performance than a single one.

97
Q

Do all classifiers always have the same depth across classifiers?

A

No. Since the samples, and consequently the splits, are different for each classifier, the depth may differ. But if we specify the maximum depth, for example in random forests, then all the trees share the same ‘maximum depth’ limit, although individual trees may still stop short of it.

98
Q

Does the number of classifiers have to be odd to avoid ties during the voting process?

A

It is of little consequence, because it would be rare for exactly half the classifiers to predict one class with the other half predicting the other class. Moreover, a majority of 51 out of 100 is almost no better than flipping a coin to decide the class (which is what would happen in the case of a tie).

99
Q

What is the difference between bagging and random forest?

A

Bagging is a general technique: it can combine predictions from any type of base model, each trained on a different bootstrap sample. Random forest is a bagging algorithm where the base models are decision trees. A random forest considers only a randomly picked subset of the independent variables (features) at each node’s split, unlike plain bagging where all features are considered for splitting a node.
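
The difference at a single node can be sketched as below. The helper name is hypothetical, and the subset size of roughly the square root of the number of features is a common heuristic for classification, assumed here rather than taken from the source.

```python
import math
import random


def candidate_features(feature_names, rng, use_random_subset=True):
    """Return the features a node may consider when choosing its split."""
    if not use_random_subset:
        # Plain bagging: every feature is a candidate at every node.
        return list(feature_names)
    # Random forest: draw a fresh random subset for each node.
    k = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(list(feature_names), k)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
features = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9"]
forest_candidates = candidate_features(features, rng)
bagging_candidates = candidate_features(features, rng, use_random_subset=False)
```

Because the subset is redrawn at every node, a strong feature cannot dominate the top split of every tree.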

100
Q

How does the random selection of features in the random forest help in better prediction?

A

Random selection of features at each split further increases the diversity in the model and hence the independence of the weak classifiers from one another. The more classifiers we train, the more bootstrap samples we draw from the same dataset, and the more those samples overlap. If we always used the best feature (from all the features) at each split, then more and more classifiers would be similar and less random.

101
Q

Do we also use cross-validation in the random forest?

A

Yes, cross-validation is also used in random forests. In the random forest, each classifier is evaluated on the data not included in the random sample subset and this is different for each classifier, but with cross-validation, we can actually assess the performance of the model on the data which is not seen by any of the classifiers.

102
Q

Does the algorithm generate duplicate trees on different bootstrapped data?

A

Each tree is built on its own bootstrapped data sample. If two of the samples we get from bootstrapping are very similar or capture the same pattern, then the resulting trees may be the same.