Decision Trees, Boosting, SVMs Flashcards
Gini Impurity
The most common splitting criterion for classification trees, including those in random forests. Gini impurity measures the likelihood of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the node. The goal is to minimize Gini impurity at each split, leading to more homogeneous nodes.
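A minimal sketch of the calculation in NumPy (the example labels are illustrative):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 binary node has impurity 0.5.
print(gini_impurity([0, 0, 0, 0]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
```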
Hinge Loss
The most common loss function for SVM classification. Hinge loss penalizes predictions that are either wrong or correct but too close to the decision boundary (i.e., within the margin). This encourages the model to create a larger margin between classes.
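A quick sketch of the computation for labels in {-1, +1} (NumPy; the scores are illustrative):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss: max(0, 1 - y * score) for labels in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y = np.array([1, -1, 1])
scores = np.array([2.0, -0.5, 0.3])  # outside margin, inside margin, inside margin
print(hinge_loss(y, scores))  # 0.4: only the margin violations contribute
```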
Pruning
- Removing branches with weak predictive power in order to reduce the complexity of a decision tree model and improve its predictive accuracy on unseen data.
- Can happen bottom-up or top-down, with approaches such as reduced error pruning and cost complexity pruning.
- Reduced error pruning is perhaps the simplest approach and optimizes directly for accuracy on a validation set.
- Starting from the leaves, replace each node with its most common class; if predictive accuracy does not decrease, keep it pruned.
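A minimal sketch of cost complexity pruning with scikit-learn (assuming scikit-learn is installed; the dataset and selection loop are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (alphas), from no pruning up to a single-node tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep pruning harder as long as validation accuracy does not decrease
# (ties prefer the larger alpha, i.e. the more pruned tree).
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, pruned.get_n_leaves())
```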
What is a decision tree?
A tree structure where internal nodes represent feature tests, branches represent outcomes, and leaf nodes represent decisions or classifications.
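Printing a small fitted tree makes the structure concrete (scikit-learn assumed installed; dataset and depth are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Internal nodes show feature tests; leaves show the predicted class.
print(export_text(tree, feature_names=list(iris.feature_names)))
```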
What is information gain?
The reduction in entropy after a dataset is split on an attribute. It’s used to build decision trees.
Entropy can be thought of as a measure of how mixed the data is. For example, a dataset of only blues would have very low entropy, while a dataset of mixed blues, greens, and reds would have relatively high entropy. High entropy means more uncertainty, while low entropy means more predictability.
Information gain is a measure of how much information a feature provides about the class. It's calculated using entropy and is used to determine which feature should be used to split the data at each internal node of the decision tree. The greater the information gain, the greater the decrease in entropy or uncertainty.
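A small sketch of both quantities in NumPy (the split shown is illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over class proportions p."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])  # a perfect split
print(information_gain(parent, [left, right]))  # 1.0 bit
```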
What is a random forest?
An ensemble of decision trees where each tree is built on a random subset of the data and features. Predictions are made by averaging (regression) or majority voting (classification) over the trees.
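An illustrative fit with scikit-learn (assumed installed; hyperparameters are examples, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each tree sees a bootstrap sample of the rows and a random subset of features per split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))  # class decided by vote across the 200 trees
```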
What is bagging?
Bootstrap Aggregating or bagging is a method that involves training multiple models on different subsets of the training data and combining their predictions to improve accuracy.
A random forest is a bagging method (with added per-split feature subsampling). Bagging helps most with high-variance, low-bias learners such as deep decision trees.
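A minimal bagging sketch with scikit-learn (assumed installed; the estimator parameter name assumes a recent scikit-learn version, older versions call it base_estimator):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # a high-variance base learner
    n_estimators=100,
    bootstrap=True,                      # sample rows with replacement
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict(X[:3]))  # majority vote across the 100 trees
```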
What is boosting?
A technique that combines weak learners (usually decision trees) sequentially, with each learner correcting errors of its predecessors.
XGBoost is a well-known example; boosting works best for high-bias, low-variance learners such as shallow trees.
What is AdaBoost?
A boosting technique that adjusts the weights of incorrectly classified instances, so subsequent models focus on those harder cases.
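An illustrative AdaBoost fit with decision stumps as weak learners (scikit-learn assumed installed; the estimator parameter name assumes a recent version):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a "stump": one split per tree
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X, y)
print(ada.score(X, y))
```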
What is gradient boosting?
A boosting technique where each new model is trained to predict the residual errors (more generally, the negative gradient of the loss) of the current ensemble, so the ensemble improves in a gradient-descent-like manner.
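A hand-rolled sketch for squared-error regression, where the residuals are the negative gradient (illustrative only; real libraries add shrinkage schedules, subsampling, regularization, etc.):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean prediction
trees = []
for _ in range(100):
    residuals = y - prediction                     # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # take a small corrective step
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees are added
```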
What is XGBoost?
An optimized implementation of gradient boosting that is efficient and widely used in machine learning competitions.
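Illustrative usage via its scikit-learn-style wrapper (assumes the xgboost package is installed; hyperparameters are examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```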
What is LightGBM?
A gradient boosting framework that uses tree-based learning algorithms and is designed for speed and efficiency.
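Same idea with LightGBM's scikit-learn-style wrapper (assumes the lightgbm package is installed; hyperparameters are examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LGBMClassifier(n_estimators=300, num_leaves=31, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```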
What is a kernel in SVMs?
A function that allows an SVM to work in high-dimensional spaces by implicitly mapping the input space into a higher-dimensional feature space.
Hence the name "kernel trick": the kernel computes inner products in that higher-dimensional space without ever constructing the mapping explicitly.
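A quick sketch of the effect with scikit-learn (assumed installed; the data is a synthetic non-linearly-separable example):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any line in the original 2D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print(linear_svm.score(X, y))  # struggles in the original space
print(rbf_svm.score(X, y))     # near-perfect via the implicit higher-dimensional mapping
```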
What is the margin in SVM?
The distance between the separating hyperplane and the nearest data points from either class (the support vectors). SVMs aim to maximize this margin, i.e., the separation between classes.
What is soft margin in SVM?
A concept in SVMs where some misclassifications are allowed in order to balance the tradeoff between margin maximization and classification accuracy.
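In scikit-learn this tradeoff is controlled by the C parameter; a small sketch (scikit-learn assumed installed, values illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some misclassification is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # lenient: wide margin, tolerates violations
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # strict: heavily penalizes violations

# The softer margin typically relies on more support vectors.
print(len(soft.support_), len(hard.support_))
```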