Random Forests and Bagging Flashcards

Question

Does a bootstrap sample have to be the same size as the original sample?

Answer 1

The bootstrap method is a resampling technique used to estimate statistics on a population by iteratively sampling a dataset with replacement. When using the bootstrap you must choose the size of the sample and the number of repeats. The bootstrap principle says that choosing a random sample of size n from the population can be mimicked by choosing a bootstrap sample of size n from the original sample.
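
As a rough illustration of the two choices mentioned above (sample size and number of repeats), here is a minimal sketch using only the Python standard library. The function names, the toy data, and the default of 1000 repeats are assumptions made for this example, not part of the original card.

```python
import random
import statistics

def bootstrap_sample(data, size=None, rng=random):
    """Draw one bootstrap sample with replacement; defaults to len(data)."""
    n = len(data) if size is None else size
    return rng.choices(data, k=n)

def bootstrap_std_error(data, statistic=statistics.mean, n_repeats=1000, seed=0):
    """Estimate the standard error of `statistic` from repeated bootstrap samples."""
    rng = random.Random(seed)
    estimates = [statistic(bootstrap_sample(data, rng=rng)) for _ in range(n_repeats)]
    return statistics.stdev(estimates)

# Toy data, invented for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
sample = bootstrap_sample(data, rng=random.Random(42))
se_mean = bootstrap_std_error(data)
```

By default the sample has the same size as the original data, which matches the bootstrap principle; passing `size` shows that a different size is technically possible.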

Answer 2

The only way to prove the correctness of an algorithm over all possible inputs is by reasoning formally or mathematically about it. One form of reasoning is a "proof by induction", a technique that's also used by mathematicians to prove properties of numerical sequences.

Answer 3

Yes, the idea is to use simple classifiers that are not expensive computationally and get an ensemble model that is better than those individual weak learners.

Answer 4

An ensemble can create lower variance and lower bias. Also, an ensemble creates a deeper understanding of the data. Underlying data patterns are hidden. Ensembles should be used for more accuracy.

Answer 5

Bias error results from simplifying the assumptions used in a model so the target functions are easier to approximate.

Answer 6

One way of getting an insight into a random forest is to compute feature importances, either by permuting the values of each feature one by one and checking how it changes the model performance or computing the amount of impurity.

Answer 7

Ensemble algorithms or methods can be divided into two groups: Sequential ensemble methods, where the base learners are generated sequentially (e.g. AdaBoost). Parallel ensemble methods, where the base learners are generated in parallel (e.g. Random Forest). No, the order does not matter because each classifier is considered with equal importance.

Answer 8

Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major concern. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.

Answer 9

Not necessarily, it depends on the bootstrapped samples as well. While bootstrapping the samples, there is a chance that the features are nearly the same for different trees leading to the same feature interpretability.

Answer 10

There are several conditions under which bootstrapping is not appropriate, for example when the population variance is infinite, or when the population values are discontinuous at the median. There are also various conditions where tweaks to the bootstrapping process are necessary to adjust for bias.

Answer 11

The algorithm operates by constructing multiple decision trees at training time and outputs by taking the mean of predictions of the individual trees.

Answer 12

Similar to Binary class predictions, it works the same on the multi-class use cases. At the final stage, the prediction is made based on the majority of the class among different class labels.

Answer 13

In Random Forest Regression, leaf nodes contain continuous values, which are used as the outcome when the conditions along the path to that leaf are satisfied.

Answer 14

We need to handle the data class imbalance before it passes to the model. Bagging models do not have an implicit mechanism to handle the class imbalance, so it should be handled beforehand.

Answer 15

It means that we are taking a subset of all features to decide the best split at each stage, unlike decision trees where we use all the features at each stage.

Answer 16

Yes, it is more computationally expensive than the traditional machine learning algorithms. The bagging usually creates n number of trees and fits on different subsets of the data. For a larger chunk of data, training ensemble models takes a huge time and RAM.

Answer 17

The level of homogeneity achieved in the split is what the tree uses to determine the most important feature.

Answer 18

Yes, the root node is the feature split that leads to the biggest reduction in entropy, and so on to the next biggest reduction in entropy for every node henceforth.

Answer 19

Pruning is expensive because while pruning, many sub-trees must be formed and compared. For example, for each subtree, we have to compare the misclassification error before and after removing that subtree. Then based on that, a decision can be made whether to keep the subtree or remove it, hence pruning requires more computation.

Answer 20

The level of a node is defined as 1 + the number of connections between the node and the root. The important thing to remember is that levels start from 1, so the level of the root is 1.

Answer 21

The decrease in node impurity is weighted by the likelihood of accessing that node to compute feature importance. The number of samples that reach the node divided by the total number of samples yields the node probability. The more significant the feature, the higher the value.
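
The weighting described above can be sketched for a single split as follows. This is a minimal illustration that assumes Gini impurity and a binary split; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the node probability."""
    # Node probability: samples reaching this node / total samples.
    node_probability = len(parent) / n_total
    # Impurity of the children, weighted by their share of the parent's samples.
    children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    return node_probability * (gini(parent) - children)

# A perfect split at the root of a 4-sample dataset (toy labels).
root = ["a", "a", "b", "b"]
decrease = weighted_impurity_decrease(root, ["a", "a"], ["b", "b"], n_total=4)
```

Summing these weighted decreases over every node that splits on a given feature gives that feature's raw importance within one tree.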

Answer 22

No. We cannot know before trying pruning.

Answer 23

Yes, it can be called a Risk measure because entropy measures impurity, disorder, or uncertainty of the nodes. Based on this measure we can keep track of the homogeneity of the nodes at each level in decision trees.

Answer 24

The general rule of thumb is to partition the data set into the ratio of 3:1:1 (60:20:20) for training, validation, and testing respectively.
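
A minimal sketch of such a 60:20:20 partition, using only the standard library. The function name, the default seed, and the use of `range(100)` as stand-in data are assumptions for illustration.

```python
import random

def train_val_test_split(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(ratios[0] * len(shuffled))
    n_val = round(ratios[1] * len(shuffled))
    # Whatever remains after train and validation becomes the test set.
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

train_set, val_set, test_set = train_val_test_split(range(100))
```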

Answer 25

The test data set is kept aside and we bootstrap only the training data. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement. Each subset has the same number of observations and can be used to train models in parallel. Because sampling is done with replacement, some observations may be repeated in each new training dataset.

Answer 26

In Random Forest, each tree is trained on uncorrelated data and the final result can be taken as voting from the possible trees. Each individual can be called a Classifier. Different classifiers are combined to give more accurate results.

Answer 27

Yes, that is one tradeoff between performance and the interpretability of the model. It is hard to interpret results from ensemble methods and we cannot define clear decision rules like a single decision tree.

Answer 28

There is no formal way or formula to find the complexity of the model. Complexity is a notion. Model complexity often refers to the number of parameters to be estimated or terms included in a given predictive model. If the number of parameters is high then we can say that the model is more complex.

Answer 29

Yes, the model uses the rules learned from the training data to predict the test data.

Answer 30

Bootstrapping creates several subsets of the original dataset chosen randomly with replacement. Each subset has the same number of observations and can be used to train models in parallel. Because sampling is done with replacement, some observations may be repeated in each new training dataset. Bootstrapping is done only on the training dataset; the test data set is kept separate and is not bootstrapped.

Answer 31

If we get a base error greater than 0.5, it shows that the model is giving us the opposite prediction to what it is supposed to give. Bagging works if we pick models that give us a base error strictly lower than 0.5.

Answer 32

In a Random Forest classifier, each tree's classification is combined into a final classification through a "majority vote" mechanism. Suppose the random forest contains 100 decision trees. Each tree may predict a different target (outcome) for the same test observation, and the votes for each predicted target are counted. Suppose the 100 trees predict 3 unique targets x, y, and z. The number of votes for x is simply the number of trees, out of 100, whose prediction is x, and likewise for the other 2 targets (y, z). If x gets the most votes, say 60 of the 100 trees predict x, then the random forest returns x as the predicted target.
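
The vote counting in this example can be sketched as below. The 60 votes for x come from the card; the 25/15 split between y and z is an assumption added so that the votes sum to 100.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees, plus the vote counts."""
    votes = Counter(tree_predictions)
    winner, _ = votes.most_common(1)[0]
    return winner, votes

# 60 trees predict x (from the card); the y/z split is assumed.
tree_predictions = ["x"] * 60 + ["y"] * 25 + ["z"] * 15
predicted_class, votes = majority_vote(tree_predictions)
```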

Answer 33

For a classification problem, it predicts by each tree and then selects the majority outcome. While for a regression problem it predicts by each tree and then takes the mean of the prediction to give the final output.

Answer 34

The term "proximity" means "closeness" or "nearness" between pairs of cases. Proximities are calculated for each pair of cases/observations/sample points. If two cases occupy the same terminal node in a tree, their proximity is increased by one. At the end of the run of all trees, the proximities are normalized by dividing by the number of trees. For example, if your random forest consists of 100 trees and a pair of observations ends up in the same leaf node in 80 of the 100 trees, then the proximity measure is 80/100 = 0.8.
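
The 80-out-of-100 example can be reproduced with a short sketch. It assumes we already know, for each observation, the id of the leaf it falls into in every tree; the leaf ids below are made up.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two observations land in the same leaf."""
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both observations need one leaf id per tree")
    shared = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return shared / len(leaves_a)

# Hypothetical leaf ids for two observations across 100 trees:
# they share a leaf in the first 80 trees and differ in the last 20.
leaves_a = [7] * 100
leaves_b = [7] * 80 + [3] * 20
prox = proximity(leaves_a, leaves_b)
```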

Answer 35

Stratified sampling is a sampling technique where the samples are selected in the same proportion (by dividing the population into groups called strata based on a characteristic) as they appear in the population. For example, if the population of interest has 30% male and 70% female subjects, then we divide the population into two (male and female) groups and choose 30% of the sample from the male group and 70% of the sample from the female group.
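
A minimal sketch of the 30%/70% example, using only the standard library. The function name, the tuple layout of the population, and the sample size of 10 are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(population, key, sample_size, seed=0):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Each stratum contributes in proportion to its size.
        k = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Toy population: 30 male and 70 female subjects.
population = [("male", i) for i in range(30)] + [("female", i) for i in range(70)]
sample = stratified_sample(population, key=lambda person: person[0], sample_size=10)
```

Because each stratum's share is rounded, the total can differ slightly from `sample_size` when the proportions do not divide evenly.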

Answer 36

The hyperparameter max_features in Random forest can be used to find the ideal number of features to be selected. We can try auto, sqrt, and log2 and tune to see which gives better performance.

Answer 37

Bagging is essentially Bootstrap + Aggregation. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement, and the data not sampled can be used for validation. The probability that an observation is not sampled is approximately 1/e ≈ 0.368. So, it can provide a reasonable percentage for cross-validation to estimate error.
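
The 0.368 figure can be checked numerically. For a bootstrap sample of size n drawn from n observations, the chance that a given observation is never picked is (1 - 1/n)^n, which tends to 1/e as n grows. The sketch below compares that formula with one simulated bootstrap draw; the function names and the values of n are arbitrary choices.

```python
import math
import random

def prob_never_sampled(n):
    """Probability that one given observation is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

def simulated_oob_fraction(n, seed=0):
    """Fraction of observations left out of one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

exact = prob_never_sampled(1000)
simulated = simulated_oob_fraction(10_000)
limit = math.exp(-1)  # about 0.368
```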

Answer 38

The number of features to be selected while building a Random forest is a Hyperparameter. The features are selected at random by the forest to reduce the overfitting. In each split the number of features Random forest takes into account will always be the same. To get the optimal number of features we should do hyperparameter tuning.

Answer 39

Random selection of features at each split further increases the diversity in the model and hence the independence of the weak classifiers from one another. The higher the number of classifiers, the more the number of samples of the dataset. This would make them overlap more and if we use the best feature (from all the features) at each split then more and more classifiers would be similar and less random.

Answer 40

No. In regression we get a value from each tree. We take the average of all outputs to compute the final output value.

Answer 41

No. Each tree can attain the maximum depth and all the trees need not be of the same depth even after attaining the max depth. But if we specify the maximum depth, for example in random forests, then all the trees have the same 'maximum depth'.

Answer 42

Random forests are slow in generating predictions because they contain multiple decision trees. Whenever the forest makes a prediction, every tree has to predict for the same input and the results are then combined by voting. This makes it slow and often impractical for real-time predictions.

Answer 43

Random forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally results in a better model.

Answer 44

In the graph, we can see that they follow more or less the same path till the depth=4. There is no significant decrease in the error when compared with the decision tree. There can be cases where a Decision tree might outperform a Random forest. We should always compare multiple models' performances before selecting. It should also be noted that random forest if tuned in a better way can give a much better performance.

Answer 45

The Decision Trees and Random forest by default are built in such a way that they generally overfit the data. You should either use pruning or truncation to reduce overfitting.

Answer 46

The condition is based on impurity, which in the case of classification problems is Gini impurity or information gain (entropy), while for regression trees it is variance. So when training a tree we can compute how much each feature contributes to decreasing the weighted impurity. feature_importances_ in Scikit-Learn is based on that logic, but in the case of Random Forest, the decrease in impurity is averaged over the trees.

Answer 47

If the dimensionality reduction is done before we may lose some important variables. This will give us principal components to be used as features that are not interpretable and we can't find any important variables which help us to make any decisions.

Answer 48

Model accuracy is a part of measuring the model performance. There are multiple metrics to measure the model performance such as Precision, Recall, F1 score, etc.

Answer 49

The Decision trees can take care of the numerical values automatically. While using Decision trees we can directly use the numerical columns without scaling or normalization.

Answer 50

It is not a hyperparameter. The number of child nodes from a parent node is generally 2 in most cases.

Answer 51

Decision trees are agnostic to whether or not the patterns are linear. For a dataset with mostly linear relationships, it may or may not have better predictive power than linear regression.

Answer 52

Training data: The data on which the model is getting trained. Validation data: The data which is used to validate the results of the trained model.

Answer 53

Greedy algorithms are a class of algorithms that aim to make the best locally optimal choice at each step of the solution, in order to approximate what the globally optimal solution to the problem could be. For example, in a decision tree, we make a greedy choice that we will make a split, where we get the maximum information gain.

Answer 54

The Titanic dataset is used for comparing train and test errors. Here the training and test datasets are chosen randomly in the ratio of 80:20. The misclassification error of the decision tree algorithm is plotted for train and test data for different values of tree depth. It is observed that the training error keeps decreasing as the maximum depth increases, but the test error eventually starts increasing after decreasing initially. A model with low train error and high test error implies low bias and high variance, i.e. the model has started to overfit the training data. The train and test errors are close for max_depth=1, but that is again not a good model, as the decision tree is making predictions on the basis of a single node.

Answer 55

Train error always decreases as we increase the depth of the tree, whereas the test error first decreases and then increases (due to overfitting). So, we need to find that sweet spot where our model does comparably well on both the training and the test datasets i.e. the model is neither overfitting nor underfitting.

Answer 56

Bias is the difference between the prediction of our model and the correct value which we are trying to predict. A model with high bias pays less attention to the training data and overgeneralizes, which leads to a high error on both training and test data. This results in underfitting the data. Variance is the value that tells us the spread of our data. A model with high variance pays a lot of attention to training data and does not generalize to the test data. Therefore, such models perform very well on training data but have a high error on test data. This results in overfitting the data. The bias-variance tradeoff exists because an algorithm cannot be more complex and less complex at the same time. The center represents the truth. Any hit close to it is considered a low-bias data point. If each subsequent hit is close to the previous hit, it is considered a low-variance case.

Answer 57

No. Bias is the difference between the prediction of our model and the correct value which we are trying to predict. On the other hand, recall is one of the performance metrics to evaluate a classification model. Mathematically, we define recall as the number of true positives divided by the number of true positives plus the number of false negatives.
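
The recall formula above translates directly into code. This is a minimal sketch; the function name and the toy labels are assumptions for illustration.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)

# Toy labels: 5 actual positives, 4 of which are predicted positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
score = recall(y_true, y_pred)  # 4 / (4 + 1)
```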

Answer 58

Entropy: It is a measure of uncertainty or diversity embedded in a random variable. Suppose Z is a random variable with the probability mass function P(Z), then the entropy of Z is given as: $H(Z) = -\sum_{z} P(z) \log P(z) = -E[\log P(Z)]$, where E represents the expected value.
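
As a small illustration of this definition, the sketch below computes the entropy of the empirical distribution of a list of labels. Using log base 2 (so the result is in bits) is a choice made for this example; the card does not fix the base.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs)

pure = entropy(["a", "a", "a", "a"])       # one class only
balanced = entropy(["a", "b", "a", "b"])   # two equally likely classes
```

A node with a single class has zero entropy, while two equally likely classes give the maximum for a binary target.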

Answer 59

The depth of a tree is the total number of edges from the root node to a leaf node in the longest path. It is possible that even after removing some subtrees, the longest path from the root node to a leaf node remains the same. Hence, the depth can be the same even after pruning.

Answer 60

In a training phase, if we allow a decision tree to grow fully until every leaf contains only one type of class (i.e. leaf is homogeneous), then the entropy of all leaf nodes is zero. But it results in overfitting the data.

Answer 61

In bagging, models are trained on each of the bootstrapped training sets independently and the results are aggregated for the final prediction. The final prediction is decided from these models with the most votes (mode) in a classification setting. In Regression, the final prediction is an average of all the predictions.

Answer 62

The whole idea and theory behind bagging, or random forest to be particular, is based on the assumptions of randomness and independence of classifiers. So, when we sample the data with replacement, we are increasing the randomness in the subsets, which in turn, makes the classifiers more independent. Also, the new sample can be of the same size as the original data which is very helpful if we have a small dataset.

Answer 63

Outlier points are generally a small percentage of the whole dataset, therefore while sampling the data they would be included in just a small number of total classifiers. If the outlier is in the test set when the outcome of all classifiers is aggregated, it is likely that the majority of classifiers would predict that point correctly and the model would generalize well. In general practice, one should be careful about which points are considered outliers, as they might just be a data point that correctly represents the points in the population. One should treat or include the outlier after analyzing the data.

Answer 64

Sampling with replacement is better in the following ways: 1. The size of the subsets would decrease if the sampling was done without replacement, and without repetition you would need to throw out a lot of data in order to get a reasonable diversity in samples. 2. What we have to work with is a sample of data, never the true distribution, certain attributes may be slightly over or under represented. Sampling with repetition helps us reflect the population distribution more accurately with many repetitions and true randomness. 3. The effectiveness of bagging, among many other results, depends upon the randomness/independence of classifiers. Sampling without replacement might increase the bias and decrease the randomness of the samples, rendering such ensemble techniques less effective.

Answer 65

We can reduce the variance if we average a number of independent random variables. Let $X_1, X_2, \ldots, X_N$ be independent samples, each with variance $\mathrm{Var}(X_i) = \sigma^2$. Let $\bar{X}$ be the average of the random variables $X_i$, given by $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$, where $\bar{X}$ is itself another random variable. The variance of $\bar{X}$ is a lot less than the individual variances: it is reduced by a factor of N, so $\mathrm{Var}(\bar{X}) = \sigma^2 / N$. This is the key idea of averaging techniques in ensemble learning. By aggregating the results from multiple classifiers, we can reduce the variance of the final prediction.
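
The step from the individual variances to the variance of the average can be written out as a short derivation. It relies on the independence assumption stated above, which is what allows the variance of the sum to be split into a sum of variances.

```latex
\begin{aligned}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right)
   = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(X_i) \\
  &= \frac{N\sigma^2}{N^2}
   = \frac{\sigma^2}{N}
\end{aligned}
```

If the variables are positively correlated, as bagged trees usually are to some degree, the reduction is smaller than this idealized factor of N.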

Answer 66

Bootstrapping creates several subsets of the original dataset chosen randomly with replacement. Each subset has the same number of observations and can be used to train models in parallel. Because sampling is done with replacement, some observations may be repeated in each new training dataset.

Answer 67

Bagging is essentially Bootstrap + Aggregation. Bootstrapping creates several subsets of the original dataset chosen randomly with replacement, and the data not sampled can be used for validation. The probability that an observation is not sampled is approximately 1/e ≈ 0.368. So, it can provide a reasonable percentage for cross-validation to estimate error.

Answer 68

Yes, they are different. In K-fold the folds are selected "without" replacement (repeats not allowed), so we'll always have the same number of folds for train and test in each iteration. That is not the case in bootstrapping, which allows repeated observations when creating several subsets of the original dataset.

Answer 69

Yes, for each bootstrapped training dataset, we fit a decision tree that works on minimum entropy. Entropy is calculated for every feature, and the one yielding the minimum value is selected for the split. The mathematical range of entropy is from 0 to 1 for a two-class problem (using log base 2).

Answer 70

Uniform Error: Let each decision tree have a misclassification error rate Ei. It should be less than 0.5, because if a classifier misclassifies more than half of the time, it is even worse than random guessing, which would give a misclassification rate of 0.5 in the case of binary classification, and this leads to very bad predictions. For example, the uniform error rate of each classifier is assumed to be 0.25 in that slide. We also assume that all classifiers are independent. If these two conditions hold, then the probability of the majority of classifiers making a mistake is much lower than the probability of one of them making a mistake.
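
Under the two assumptions above (a uniform error rate and independent classifiers), the number of wrong classifiers follows a binomial distribution, so the chance that the majority is wrong can be computed directly. The sketch below is a minimal illustration for a binary problem; the ensemble size of 25 is an arbitrary choice, and an odd size is used to avoid ties.

```python
from math import comb

def majority_error(error_rate, n_classifiers):
    """P(more than half of n independent classifiers are wrong)."""
    first_losing_count = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(first_losing_count, n_classifiers + 1)
    )

single = majority_error(0.25, 1)     # one classifier: just its own error rate
ensemble = majority_error(0.25, 25)  # majority of 25 independent classifiers
worse = majority_error(0.6, 25)      # base error above 0.5 makes voting worse
```

With a base error of 0.25 the majority is wrong far less often than a single classifier, while a base error above 0.5 is amplified by voting rather than reduced.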

Answer 71

No, it is a random selection with replacement, so a random number of points (~37%) are left out during bootstrapping.

Answer 72

Yes. That is the idea behind ensemble techniques, where a set of weak learners are combined to create a strong learner that obtains better performance than a single one.

Answer 73

Since the samples, and consequently the splits, are different for each classifier, the depth may differ. But if we specify the maximum depth, for example in random forests, then all the trees have the same 'maximum depth'.

Answer 74

It is of no consequence because it would be rare for exactly half the classifiers to predict one class with the other half predicting the other class. Moreover, a majority of 51 out of 100 is almost no better than flipping a coin to decide the class (which is what would happen in the case of an even number of classifiers).

Answer 75

Bagging can combine multiple predictions generated by different algorithms. Random forest is also a bagging algorithm where the base models are decision trees. A Random forest uses only a subset of randomly picked independent variables (features) for each node's branching possibilities, unlike in bagging where all features are considered for splitting a node.

Answer 76

Random selection of features at each split further increases the diversity in the model and hence the independence of the weak classifiers from one another. The higher the number of classifiers, the more the number of samples of the dataset. This would make them overlap more and if we use the best feature (from all the features) at each split then more and more classifiers would be similar and less random.

Answer 77

Yes, cross-validation is also used in random forests. In the random forest, each classifier is evaluated on the data not included in the random sample subset and this is different for each classifier, but with cross-validation, we can actually assess the performance of the model on the data which is not seen by any of the classifiers.

Answer 78

Each tree is generated on each data sample. If the data samples we got from bootstrapping are close or capture the same pattern, then maybe the tree can be the same.

Learn the general concept of random forest and bagging. (102 cards)