Ensemble Learning: Bagging Flashcards
What is Ensemble Learning?
A machine learning technique that combines multiple individual models to make a final prediction. The idea is that aggregating the predictions of several models can yield better overall performance than any single model.
What is the framework behind Ensemble Learning?
Obtaining multiple classifiers and aggregating them properly.
What are estimators?
An estimator (f*) is an approximation of the true underlying function. The goal of learning is to find, from the training data, an estimator that is as close to the true function as possible.
What are Decision Trees?
Decision Trees are used as base classifiers in bagging and random forests. A decision tree is built by splitting data based on features; a branching criterion determines how the data are split at each node. The ideal split produces pure subsets, i.e., the least impurity.
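A minimal sketch of training a decision tree, assuming scikit-learn (a library choice these cards do not specify):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini")  # split nodes by Gini impurity
tree.fit(X, y)
print(tree.predict(X[:3]))  # class predictions for the first three samples
```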
What are two Impurity Measures?
Gini Impurity Index and Entropy.
What is Gini Impurity Index used for?
It is used to measure the impurity of a node. A node is pure (Gini = 0) if all instances belong to the same class. The Gini score is computed as G = 1 − Σᵢ pᵢ², where pᵢ is the fraction of instances of class i in the node.
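For instance, a two-class node split 50/50 has G = 0.5. A hand-rolled sketch (NumPy assumed):

```python
import numpy as np

def gini(labels):
    """Gini impurity: G = 1 - sum_i p_i^2, over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0 -> pure node
print(gini([0, 0, 1, 1]))  # 0.5 -> maximally impure for two classes
```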
What is Entropy used for?
It measures the expected number of yes/no questions needed to identify an instance's class; equivalently, it is a measure of uncertainty and information. It is computed as H = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the fraction of instances of class i. Higher entropy implies greater uncertainty.
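A matching sketch for entropy (again assuming NumPy):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: H = -sum_i p_i * log2(p_i), over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))  # 0.0 -> no uncertainty (NumPy may print -0.0)
print(entropy([0, 0, 1, 1]))  # 1.0 -> one full bit of uncertainty
```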
What are the advantages of Decision Trees?
Easy to achieve a 0% error rate on the training data if each example has its own leaf. Less effort for data preparation (no normalization or scaling needed). Good model interpretability, making them white-box models.
What are the disadvantages of Decision Trees?
Training can be slow and computationally expensive. High variance and a tendency to overfit. Instability: small variations in the dataset can produce very different trees.
What do Regularization methods do for Decision Trees?
They restrict the tree's freedom to prevent overfitting. Examples include: maximum tree depth, the minimum number of samples a node must have before it can be split, and the minimum number of samples a leaf node must have (see the sketch below).
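A sketch of these constraints using scikit-learn's parameter names (an assumed library choice; the values are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument restricts the tree's freedom to grow, reducing overfitting:
regularized_tree = DecisionTreeClassifier(
    max_depth=5,           # cap the tree depth
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=4,    # every leaf must contain at least 4 samples
)
```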
What is Bootstrapping?
A resampling method in which random samples are drawn with replacement from the training set, meaning that a data point can be chosen more than once. Each such random subset is called a bootstrap sample. Multiple models are then trained independently, one per bootstrap sample.
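A minimal sketch of drawing one bootstrap sample (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(10)  # a toy "training set" of 10 examples

# Sample WITH replacement, same size as the original: some points appear
# multiple times, others not at all.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # e.g. [8 6 5 2 3 0 0 0 1 8]
```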
What is Random Forest?
Trains a group of decision trees and combines their results through majority voting, averaging, etc. It introduces randomness by training each tree on a random (bootstrap) sample and by considering only a random subset of features at each split. This makes the trees diverse, which reduces the variance of the ensemble; the restricted feature search also decreases training time.
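A sketch with scikit-learn's RandomForestClassifier (an assumed library; hyperparameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample and restricted to a random
# subset of features at each split; predictions are combined by majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```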
What is Bagging? (Bootstrap Aggregating)
Obtains a set of classifiers by sampling N’ examples with replacement from the N training examples (usually N’ = N) to form each training set, then training a separate model on each set.
Predictions are aggregated through averaging for regression and through voting for classification.
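A sketch using scikit-learn's BaggingClassifier (assuming scikit-learn ≥ 1.2, where the base model parameter is named estimator):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# N' = N: each of the 50 trees sees a bootstrap sample the size of the
# training set (max_samples=1.0, bootstrap=True); class predictions are
# aggregated by voting.
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0,
    bootstrap=True,
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict(X[:3]))
```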
What are some Bagging advantages?
- Reduces overfitting (good for high-variance models).
- Handles missing values.
- Allows for parallel processing (models are trained independently).
What are some Bagging disadvantages?
- Doesn’t address bias.
- Low interpretability.