Ensemble Learning: Bagging Flashcards
What is Ensemble Learning?
A machine learning technique that combines multiple individual models to make a final prediction. The idea is that by aggregating the predictions of multiple models, the overall performance of the system can be improved.
What is the framework behind Ensemble Learning?
Obtaining multiple classifiers and aggregating them properly.
What are estimators?
An estimator f* is an approximation of the true underlying function f. The goal is to learn this estimator from the training data.
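One common formalization (the notation beyond f* is assumed here, not from the card): given training pairs {(x_i, y_i)}, we choose f* from a set of candidate functions so that f*(x_i) ≈ y_i, in the hope that f* also approximates the unknown true function f on unseen data.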
What are Decision Trees?
Decision trees are used as base classifiers in bagging and random forests. A decision tree is built by splitting the data based on features; a branching criterion determines how the data are split. The ideal classifier produces pure (least-impurity) subsets.
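A minimal sketch, assuming scikit-learn and its toy iris dataset, of fitting a decision tree whose splits are chosen by an impurity-based branching criterion:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small labelled dataset
X, y = load_iris(return_X_y=True)

# Each split is chosen to reduce impurity (Gini by default)
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Print the learned splits as text
print(export_text(tree))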
What are two Impurity Measures?
Gini Impurity Index and Entropy.
What is Gini Impurity Index used for?
It is used to measure the impurity of a node. A node is pure (Gini = 0) if all its instances belong to the same class. The Gini score is computed as G = 1 - sum_i p_i^2, where p_i is the ratio of class-i instances in the node.
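For example, the Gini score of a node can be computed directly from the class proportions; a small illustrative helper (not from any particular library):

import numpy as np

def gini(labels):
    # G = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0 -> pure node
print(gini([0, 0, 1, 1]))  # 0.5 -> maximally impure for two classes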
What is Entropy used for?
It measures the average number of yes/no questions needed to identify the class of an instance: H = -sum_i p_i * log2(p_i). It is also a measure of uncertainty and information content; higher entropy implies greater uncertainty.
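As an illustration, a hypothetical helper computing entropy from class labels:

import numpy as np

def entropy(labels):
    # H = -sum(p_i * log2(p_i)), measured in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))  # 0.0 -> no uncertainty
print(entropy([0, 0, 1, 1]))  # 1.0 -> one bit of uncertainty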
What are the advantages of Decision Trees?
Easy to achieve a 0% error rate on training data if each example is given its own leaf. Little effort needed for data preparation (no normalization or scaling required). Good model interpretability, making them white-box models.
What are the disadvantages of Decision Trees?
Training can be slow and computationally expensive. High variance and a tendency to overfit. Instability due to sensitivity to small variations in the dataset.
What do Regularization methods do for Decision Trees?
They restrict their freedom to prevent overfitting. Examples include: max tree depth, min samples a node must have before a split and min samples a leaf node must have.
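In scikit-learn, for instance, these constraints map to hyperparameters of DecisionTreeClassifier (the values below are arbitrary examples):

from sklearn.tree import DecisionTreeClassifier

# Regularized tree: its freedom is restricted to reduce overfitting
tree = DecisionTreeClassifier(
    max_depth=5,           # max tree depth
    min_samples_split=10,  # min samples a node must have before a split
    min_samples_leaf=4,    # min samples a leaf node must have
)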
What is Bootstrapping?
A resampling method in which random samples are drawn with replacement from the training set, meaning that data points can be chosen more than once. Such a random subset is called a bootstrap sample. Multiple models are then trained independently, one on each bootstrap sample.
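A minimal sketch of drawing a bootstrap sample with NumPy (the array names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)  # stand-in for a training set of 10 examples

# Draw indices with replacement: some points repeat, others are left out
idx = rng.choice(len(X), size=len(X), replace=True)
bootstrap_sample = X[idx]
out_of_bag = np.setdiff1d(np.arange(len(X)), idx)

print(bootstrap_sample)
print(out_of_bag)  # points never selected for this sample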
What is Random Forest?
Trains a group of decision trees and combines their results through majority voting, averaging, etc. Introduces randomness by using random (bootstrapped) training sets and feature randomness at each split. The resulting trees are diverse, and evaluating only a subset of features at each split also decreases training time.
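A minimal sketch using scikit-learn's RandomForestClassifier (dataset and settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample;
# max_features="sqrt" adds feature randomness at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))  # class predictions by majority vote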
What is Bagging? (Bootstrap Aggregating)
Obtains a set of classifiers by sampling N’ examples with replacement from the N training examples, usually with N’ = N. A separate model is trained on each of these bootstrap training sets.
Predictions are aggregated through averaging for regression and voting for classification.
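For example, scikit-learn's BaggingClassifier wraps any base estimator this way (here a decision tree; the settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 50 trees, each fit on N' = N examples drawn with replacement
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0,   # N' = N
    bootstrap=True,    # sample with replacement
    random_state=0,
)
bagging.fit(X, y)
print(bagging.predict(X[:3]))  # aggregated by majority vote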
What are some Bagging advantages?
-Reduces overfitting (good for high variance).
-Handles missing values.
-Allows for parallel processing.
What are some Bagging disadvantages?
-Doesn’t address bias.
-Low interpretability.
Why is bagging effective at reducing variance?
The independent models trained on different bootstrapped datasets capture different aspects of the underlying data. By averaging (or voting) their predictions, the model reduces the impact of any single model’s overfitting. This is because errors made by individual models tend to cancel each other out when aggregated, which leads to a more stable and less variable final prediction.
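A standard way to quantify this, under the simplifying assumption that the B models' predictions have equal variance sigma^2 and pairwise correlation rho:

Var( (1/B) * sum_b f_b(x) ) = rho * sigma^2 + ((1 - rho) / B) * sigma^2

Increasing B shrinks the second term, and lowering the correlation between models (which is what bootstrapping aims to do) shrinks the first.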
What is Out-of-Bag evaluation?
In bagging, each model is trained on a different subset of the training data. The data points not included in the bootstrapped sample for training a particular model are called the ‘out-of-bag’ instances. The model’s performance can be evaluated using these ‘out-of-bag’ instances, which provides an estimate of how well the model is generalizing. This eliminates the need for a separate validation dataset.
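In scikit-learn, for example, this is exposed via oob_score=True (illustrative settings):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Accuracy estimated only on the out-of-bag instances of each tree
print(forest.oob_score_)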
What is the random subspace method?
This method further enhances diversity in bagging by randomly selecting a subset of features for each model. This is often used in conjunction with bootstrapping and is a core component of random forests, which also uses a random subset of features at each split point of each decision tree.
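As an illustration, scikit-learn's BaggingClassifier can sample features as well as examples, which is the random subspace idea (the settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each tree sees a bootstrap sample of rows AND a random 50% subset of features
model = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,           # resample training examples
    max_features=0.5,         # random subspace: subset of features per model
    bootstrap_features=False, # features drawn without replacement
    random_state=0,
)
model.fit(X, y)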