Random Forests and Bagging Flashcards
Learn the general concept of random forest and bagging.
Difference between Information Gain and Gini Index?
The Gini Index is calculated by subtracting the sum of the squared class probabilities from one; it favors larger partitions. Information Gain is based on entropy, which is the negative sum over classes of each class probability times the log (base 2) of that probability; Information Gain favors smaller partitions with many distinct values.
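As a concrete illustration, here is a minimal sketch (assuming NumPy is available) that computes both impurity measures for a node's class proportions; the function names are illustrative, not taken from any library.

```python
import numpy as np

def gini_index(p):
    """Gini impurity: one minus the sum of squared class probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy (the basis of Information Gain): -sum(p * log2(p))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # skip zero probabilities to avoid log2(0)
    return -np.sum(p * np.log2(p))

# A node whose class proportions are 80% / 20%
print(gini_index([0.8, 0.2]))  # 0.32
print(entropy([0.8, 0.2]))     # ~0.722
```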
Relationship between the homogeneity and impurity with respect to nodes?
As you add more nodes (i.e., split further down the tree), the homogeneity of the nodes increases, which means the impurity decreases. Homogeneity is therefore inversely related to impurity.
What is max_depth? Do we consider the root node to determine depth?
The maximum depth is the length of the longest path from the root node down to the farthest leaf node. It is a hyperparameter of Decision Trees. The root node is typically not counted when determining the depth, so a tree consisting only of a root node has depth 0.
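A minimal sketch with scikit-learn (dataset and parameter value are illustrative) showing max_depth as a hyperparameter; get_depth() reports the depth with the root counted as depth 0, consistent with the convention above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cap the longest root-to-leaf path at 3 levels below the root.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.get_depth())   # <= 3; the root itself counts as depth 0
```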
When does overfitting occur?
Overfitting occurs when a model has low bias and high variance. Bias can be interpreted as training error and variance as test error, so an overfit model has low training error and high test error.
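A minimal sketch (synthetic data, assuming scikit-learn) showing this gap: an unrestricted tree reaches near-zero training error while its test error stays higher.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print("train accuracy:", deep.score(X_tr, y_tr))  # ~1.0  -> low bias
print("test accuracy: ", deep.score(X_te, y_te))  # lower -> high variance
```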
How does model pruning work?
Pruning removes nodes that do not meaningfully affect model performance. In practice it works like an experiment during implementation: you remove certain nodes (or subtrees), observe how the model's performance changes, and keep the pruned tree if performance does not degrade.
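One concrete way to run this "remove and observe" experiment is cost-complexity pruning; the sketch below (assuming scikit-learn, with an illustrative dataset) fits trees at several ccp_alpha values and keeps the one with the best held-out score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate pruning strengths computed from the full (unpruned) tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

scores = {}
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    scores[alpha] = pruned.score(X_te, y_te)   # observe performance after pruning

best_alpha = max(scores, key=scores.get)
print("best ccp_alpha:", best_alpha, "held-out accuracy:", scores[best_alpha])
```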
What are the reasons to prune a decision tree?
Pruning a decision tree helps to prevent overfitting the training data so that our model generalizes well to unseen data. Pruning a decision tree means removing a subtree that is redundant and not a useful split and replacing it with a leaf node.
Is it possible to limit the depth to which the original tree can grow?
Yes, the max_depth of a Decision Tree is a hyperparameter we can set beforehand, so we can restrict the tree from growing too deep.
What are some typical heuristics to decide whether pruning is beneficial?
A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information. Pruning should reduce the size of the learned tree without reducing its predictive accuracy as measured on a cross-validation set.
Does having more data increase accuracy?
Having more data generally increases the accuracy of your model, but there comes a stage where adding even infinite amounts of data cannot improve accuracy any further; this limit is set by the natural noise of the data. It is not just big data, but good (quality) data that helps us build better-performing ML models.
What are acceptable performance standards concerning the business?
Acceptable performance is usually a business consideration. Whatever minimum depth gives you acceptable business performance is usually chosen so as not to make the model unnecessarily computationally heavy.
Do decision nodes grow back after pruning?
No, that is not typically done in pruning. Pruning only removes parts of the tree, working back from the terminal nodes of some branches.
What is a Random Variable?
A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes.
Why is the size of the test data different?
The model uses the rules learned from the training data to predict on the test data, so we take a large portion of the original data as training data and a smaller portion (usually 20% or 30%) as test data.
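A minimal sketch of such a split with scikit-learn (the 70/30 ratio is just one common choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Keep 70% of the rows for training and hold out 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```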
What is bootstrapping in correlation?
Bootstrapping works by resampling your sample data with replacement: a large number of new samples are drawn, each of the same size.
Can bootstrap samples repeat?
Yes. The bootstrap method involves iteratively resampling a dataset with replacement, so the same observation can appear more than once in a single bootstrap sample. When using the bootstrap you must choose the size of the sample and the number of repeats. scikit-learn provides a function that you can use to resample a dataset for the bootstrap method.
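A minimal sketch of one bootstrap repeat using scikit-learn's resample utility (the data values are illustrative); note that the same observation can indeed appear more than once.

```python
from sklearn.utils import resample

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# One bootstrap sample: draw 4 values with replacement.
boot = resample(data, replace=True, n_samples=4, random_state=1)
# Out-of-bag values are the ones that were never drawn.
oob = [x for x in data if x not in boot]

print("bootstrap sample:", boot)
print("out-of-bag:", oob)
```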
What is test data creation?
Test data generation is the process of collecting and managing a large amount of data from various sources in order to run the test cases and verify the functional soundness of the system under test.
What’s the benefit of including the same observation twice in a bootstrap sample?
It makes the samples, and consequently the classifiers trained on them, as independent of one another as possible. Repeatedly drawing bootstrap samples also replaces otherwise tedious analytical calculations with an empirical distribution from which the statistic of interest can be assessed.
What are out-of-bag observations?
A prediction made for an observation in the original data set using only the base learners that were not trained on that particular observation is called an out-of-bag (OOB) prediction. These predictions are not prone to overfitting, as each prediction is made only by learners that did not use the observation for training.
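A minimal sketch (assuming scikit-learn, with an illustrative dataset) of obtaining the aggregate accuracy of OOB predictions via the oob_score option of a random forest:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each observation using only the trees
# whose bootstrap samples did not contain it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)
```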
How voting is carried out for regression and classification problems?
For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models. In classification, a hard voting ensemble involves summing the votes for crisp class labels from other models and predicting the class with the most votes.
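A minimal sketch of both cases with scikit-learn (the member estimators are illustrative): hard voting over crisp class labels for classification, and averaging of member predictions for regression.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import VotingClassifier, VotingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: majority (hard) vote over the crisp class labels.
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="hard",
).fit(Xc, yc)

# Regression: the ensemble prediction is the average of the members' predictions.
Xr, yr = make_regression(n_samples=200, random_state=0)
reg = VotingRegressor(
    estimators=[("lr", LinearRegression()),
                ("dt", DecisionTreeRegressor(random_state=0))],
).fit(Xr, yr)

print(clf.predict(Xc[:3]), reg.predict(Xr[:3]))
```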
Is there any way to speed up the Random Forest algorithm?
There are techniques for this on the algorithmic side, such as quantization with a smaller number of bits, and parallelized computation, for example with a library called Dask.
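The Dask-based setup is not shown here; as a simpler illustration of the parallel-computation idea, scikit-learn's own n_jobs parameter already builds the trees of a forest in parallel across CPU cores (a sketch with synthetic data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# n_jobs=-1 uses all available cores to build the trees in parallel.
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
forest.fit(X, y)
```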
Is there no power analysis in bootstrap to determine the minimal sample size?
Rather than a power analysis, the percentage of samples drawn from the original data is treated as a hyperparameter of Random Forests which can be tuned for an optimal value.
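In scikit-learn this fraction is exposed (in recent versions) as the max_samples parameter of the forest; a minimal sketch with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each tree is fit on a bootstrap sample containing 50% of the rows;
# this fraction can be tuned like any other hyperparameter.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_samples=0.5, random_state=0)
forest.fit(X, y)
```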
How is sampling with replacement better than sampling without replacement?
Sampling with replacement is better in the following ways:
1. The size of the subsets would decrease if the sampling were done without replacement, and without repetition you would need to throw out a lot of data to get reasonable diversity in the samples (see the sketch after this list).
2. What we have to work with is a sample of the data, never the true distribution, so certain attributes may be slightly over- or under-represented. Sampling with repetition, given many repetitions and true randomness, helps the samples reflect the population distribution more accurately.
3. The effectiveness of bagging, among many other results, depends on the randomness/independence of the classifiers. Sampling without replacement might increase the bias and decrease the randomness of the samples, rendering such ensemble techniques less effective.
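A minimal sketch (pure NumPy) of the effect of sampling with replacement: each bootstrap sample keeps the original size but contains only about 63% unique rows, which is what gives the individual trees their diversity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data = np.arange(n)

# One bootstrap sample: same size as the data, drawn with replacement.
boot = rng.choice(data, size=n, replace=True)

unique_fraction = np.unique(boot).size / n
print(unique_fraction)   # ~0.632, i.e. roughly 1 - (1 - 1/n)**n
```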
What is the effect of outliers and their removal?
It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Outliers are entries that occur with very low frequency/probability in the data, and since we are sampling with replacement, the probability of each data point being chosen remains the same, i.e. 1/n.
What is a good loss in machine learning?
Loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
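A minimal sketch (pure Python) of the squared-error loss on single examples, showing zero loss for a perfect prediction and a larger loss as the prediction gets worse:

```python
def squared_error(y_true, y_pred):
    """Loss for a single example: zero when the prediction is perfect."""
    return (y_true - y_pred) ** 2

print(squared_error(3.0, 3.0))   # 0.0  -> perfect prediction
print(squared_error(3.0, 2.5))   # 0.25
print(squared_error(3.0, 1.0))   # 4.0  -> worse prediction, larger loss

# Training aims for low loss on average across all examples.
examples = [(3.0, 2.8), (1.0, 1.1), (4.0, 3.5)]
mean_loss = sum(squared_error(t, p) for t, p in examples) / len(examples)
print(mean_loss)
```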