Random Forests and Bagging Flashcards
Learn the general concept of random forest and bagging.
Difference between Information Gain and Gini Index?
The Gini Index is calculated by subtracting the sum of the squared class probabilities from one; it tends to favor larger partitions. Information Gain is based on entropy, which multiplies each class probability by the log (base 2) of that probability, sums the terms, and negates the result; a split's Information Gain is the reduction in entropy it achieves. Information Gain tends to favor smaller partitions with many distinct values.
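As a minimal sketch (not part of the original flashcards), the two impurity measures can be computed for a node's class distribution as follows; the class probabilities used are hypothetical.

```python
# Gini impurity and entropy for a node's class distribution.
import numpy as np

def gini(probs):
    # Gini = 1 - sum(p_i^2)
    return 1.0 - np.sum(np.square(probs))

def entropy(probs):
    # Entropy = -sum(p_i * log2(p_i)); Information Gain is the drop in
    # entropy from the parent node to the weighted child nodes.
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

p = np.array([0.7, 0.3])   # hypothetical class distribution at a node
print(gini(p))     # 0.42
print(entropy(p))  # ~0.881
```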
Relationship between the homogeneity and impurity with respect to nodes?
As the tree splits further and more nodes are added, the nodes become more homogeneous, which means their impurity decreases. Homogeneity is therefore inversely related to impurity.
What is max_depth? Do we consider the root node to determine depth?
The maximum depth is the number of nodes along the longest path from the root node down to the farthest leaf node. It is a hyperparameter in Decision Trees. The root node is typically not included when determining the depth of the tree.
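A minimal sketch (dataset and depth value are illustrative, not from the flashcards) of setting the max_depth hyperparameter on a scikit-learn decision tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # limit how deep the tree may grow
tree.fit(X, y)
print(tree.get_depth())  # actual depth reached, at most 3
```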
When does overfitting occur?
Low bias and high variance indicate overfitting. Bias can be roughly interpreted as the training error and variance as the test error; hence, for an overfit model, the training error is low while the test error is high.
How does model pruning work?
Pruning removes nodes or subtrees whose removal does not hurt model performance. In practice it works like an experiment: remove certain nodes, observe the model's performance on held-out data, and keep the removal if performance does not degrade.
What are the reasons to prune a decision tree?
Pruning a decision tree helps to prevent overfitting the training data so that our model generalizes well to unseen data. Pruning a decision tree means removing a subtree that is redundant and not a useful split and replacing it with a leaf node.
Is it possible to limit the depth to which the original tree can grow?
Yes, the max_depth of a Decision Tree is a hyperparameter we can set beforehand. We can restrict the tree from growing too deep.
What are some typical heuristics to decide whether pruning is beneficial?
A common strategy is to grow the tree until each node contains a small number of instances then use pruning to remove nodes that do not provide additional information. Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set.
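A hedged sketch of one such heuristic using scikit-learn's cost-complexity post-pruning: grow a full tree, then choose the ccp_alpha that maximizes cross-validated accuracy. The dataset and scoring setup are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate pruning strengths from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    # Cross-validation stands in for the "cross-validation set" mentioned above.
    score = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5
    ).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```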
Does having more data increase accuracy?
Having more data generally increases the accuracy of your model, but there comes a stage where adding even vast amounts of data no longer improves accuracy, because of the natural noise in the data. It is not just big data, but good (quality) data that helps us build better-performing ML models.
What are acceptable performance standards concerning the business?
Acceptable performance is usually a business consideration. Whatever minimum depth gives you acceptable business performance is usually chosen so as not to make the model unnecessarily computationally heavy.
Do decision nodes grow back after pruning?
No, that is not typically done in pruning. Pruning only removes parts of the tree from the terminal nodes onwards in some branches.
What is a Random Variable?
A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes.
Why is the size of the test data different?
The model uses the rules learned from the training data to predict the test data. So we take a large portion of the original data as the training data and a smaller portion (usually 20% or 30%) as the test data.
What is bootstrapping?
Bootstrapping works by resampling your data with replacement: a large number of bootstrap samples are drawn, each typically the same size as the original sample.
Can bootstrap samples repeat?
Yes. The bootstrap method involves repeatedly resampling a dataset with replacement, so the same observation can appear more than once in a bootstrap sample. When using the bootstrap you must choose the size of each sample and the number of repeats. scikit-learn provides a resample function you can use to draw bootstrap samples.
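A minimal sketch (with assumed toy data) showing that a bootstrap sample drawn with replacement can contain repeated observations, using scikit-learn's resample utility:

```python
from sklearn.utils import resample

data = [1, 2, 3, 4, 5, 6]
# One bootstrap sample of the same size as the original data.
boot = resample(data, replace=True, n_samples=len(data), random_state=42)
print(boot)                    # some values appear more than once
print(set(data) - set(boot))   # values not drawn this round (out-of-bag)
```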
What is test data creation?
Test Data Generation is the process of collecting and managing a large amount of data from various sources in order to implement test cases and ensure the functional soundness of the system under test.
What’s the benefit of including the same observation twice in a bootstrap sample?
It keeps the samples, and consequently the classifiers trained on them, as independent of one another as possible. Repeatedly drawing bootstrap samples approximates the empirical distribution of the statistic, replacing the analytical calculations that would otherwise be required to assess it.
What are out-of-bag observations?
A prediction made for an observation in the original data set using only base learners not trained on that particular observation is called an out-of-bag (OOB) prediction. These predictions are not prone to overfitting, as each prediction is made only by learners that did not use the observation for training.
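A hedged sketch of out-of-bag evaluation with a random forest (dataset and forest size are illustrative): the oob_score_ attribute is the accuracy estimated only from trees that did not see each observation during training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # OOB accuracy estimate
```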
How voting is carried out for regression and classification problems?
For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models. In classification, a hard voting ensemble involves summing the votes for crisp class labels from other models and predicting the class with the most votes.
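A minimal sketch (models and data are assumed for illustration) of hard voting for classification with scikit-learn; a VotingRegressor works analogously by averaging the member models' predictions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote on the predicted class labels
)
vote.fit(X, y)
print(vote.predict(X[:3]))
```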
Is there any way to speed up the Random Forest algorithm?
There are techniques for this on the algorithmic side, such as quantization (using a smaller number of bits) and parallelized computing, for example with the Dask library.
Is there no power analysis in bootstrap to determine the minimal sample size?
No explicit power analysis is typically performed. The percentage of the original data drawn for each bootstrap sample is a Random Forest hyperparameter that can be tuned to an optimal value.
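A hedged sketch (dataset and fraction are illustrative): in scikit-learn this fraction is exposed as the max_samples parameter of the forest.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,
    max_samples=0.5,   # each tree's bootstrap sample uses 50% of the rows
    random_state=0,
).fit(X, y)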
How is sampling with replacement better than sampling without replacement?
Sampling with replacement is better in the following ways:
1. If sampling were done without replacement, the subsets would shrink, and without repetition you would need to throw out a lot of data to get reasonable diversity among the samples.
2. What we have to work with is a sample of data, never the true distribution, so certain attributes may be slightly over- or under-represented. Sampling with replacement, repeated many times with true randomness, helps the samples reflect the population distribution more accurately.
3. The effectiveness of bagging, among many other results, depends on the randomness/independence of the classifiers. Sampling without replacement might increase bias and decrease the randomness of the samples, rendering such ensemble techniques less effective.
What is the effect of outliers and their removal?
It's essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area; the temptation to remove them inappropriately can be hard to resist. Outliers increase the variability in your data, which decreases statistical power. Outliers are entries that make up a very small fraction of the data, and since we are sampling with replacement, the probability of each data point being chosen remains the same, i.e., 1/n.
What is a good loss in machine learning?
Loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
Does a bootstrap sample have to be the same size as the original sample?
The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. When using the bootstrap you must choose the size of each sample and the number of repeats. The bootstrap principle says that choosing a random sample of size n from the population can be mimicked by choosing a bootstrap sample of size n from the original sample, so the bootstrap sample is usually the same size as the original sample.
How do you prove that an algorithm is correct?
The only way to prove the correctness of an algorithm over all possible inputs is by reasoning formally or mathematically about it. One form of reasoning is a “proof by induction”, a technique that’s also used by mathematicians to prove properties of numerical sequences.
Would maximizing each model and weighing averages increase model performance?
Yes, the idea is to use simple classifiers that are not expensive computationally and get an ensemble model that is better than those individual weak learners.
What are the advantages and disadvantages of ensemble models?
An ensemble can achieve lower variance and lower bias, and it can capture a deeper understanding of the data; ensembles should be used when more accuracy is needed. The main disadvantages are that the underlying data patterns become hidden, making the model harder to interpret, and that training is more computationally expensive.
What type of error does bias cause in a model?
Bias error results from simplifying the assumptions used in a model so the target functions are easier to approximate.
How do you interpret the results of a random forest?
One way of getting insight into a random forest is to compute feature importances, either by permuting the values of each feature one by one and checking how the model performance changes (permutation importance) or by computing the amount of impurity reduction each feature contributes across the trees (impurity-based importance).
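A hedged sketch of both interpretation routes mentioned above (dataset and forest size are assumed for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(forest.feature_importances_)  # mean decrease in impurity per feature

perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)        # performance drop when each feature is shuffled
```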
Is random forest sequential? Does order matter?
Ensemble algorithms can be divided into two groups: sequential ensemble methods, where the base learners are generated sequentially (e.g., AdaBoost), and parallel ensemble methods, where the base learners are generated in parallel (e.g., Random Forest). So no, the order does not matter in a random forest, because each classifier is given equal importance.
When should we use random forests?
Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major concern. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.
If the two classifiers have identical features then do we say both are highly correlated?
Not necessarily; it also depends on the bootstrapped samples. While bootstrapping, there is a chance that the samples and selected features end up nearly the same for different trees, leading to the same feature interpretability and highly correlated trees.
What situation do you think where bootstrapping is not applicable?
There are several conditions in which bootstrapping is not appropriate, for example when the population variance is infinite or when the population distribution is discontinuous at the median. There are also various conditions where tweaks to the bootstrapping process are necessary to adjust for bias.
How is the output value determined in the Random Forest Regression?
The algorithm operates by constructing multiple decision trees at training time and outputs the mean of the predictions of the individual trees.
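A minimal sketch (with synthetic data) showing that a random forest regressor's prediction equals the mean of its individual trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predict the first sample with every tree, then average.
tree_preds = np.array([tree.predict(X[:1]) for tree in forest.estimators_])
print(tree_preds.mean())        # mean of the individual trees
print(forest.predict(X[:1]))    # matches the forest's output
```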
How does the classification work for multi-class problems in Random Forest?
It works the same way as binary classification. At the final stage, the prediction is the class label that receives the majority of the votes among the trees.
What values do the leaf nodes have in a Decision Tree Regressor?
In a Decision Tree Regressor, the leaf nodes contain continuous values, which are used as the predicted outcome when the conditions along the path to the leaf are satisfied.
Does Bagging handle the class imbalance?
No. Bagging models do not have an implicit mechanism to handle class imbalance, so it should be handled before the data are passed to the model.
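As an illustration (not from the flashcards), one common way to handle the imbalance beforehand is to oversample the minority class with replacement before fitting a bagging model; the dataset and parameters below are synthetic assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.utils import resample

# Synthetic imbalanced data: ~90% class 0, ~10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Oversample the minority class until it matches the majority class size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y == 0).sum()), random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])

bag = BaggingClassifier(random_state=0).fit(X_bal, y_bal)
```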
In a random forest, what does “sampling the features at each node” mean?
It means that we are taking a subset of all features to decide the best split at each stage, unlike decision trees where we use all the features at each stage.
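A hedged sketch (dataset is illustrative): in scikit-learn this per-node feature subsampling is controlled by the max_features parameter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider sqrt(n_features) candidate features per split
    random_state=0,
).fit(X, y)
```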
Are ensemble models computationally expensive?
Yes, ensembles are more computationally expensive than traditional machine learning algorithms. Bagging usually creates n trees and fits each on a different subset of the data, so for large datasets, training ensemble models can take a great deal of time and RAM.