Trees, Forests, and Ensemble Models Flashcards
What is a decision tree?
A model built by recursive binary partitioning of the feature space, with a constant prediction (e.g. the majority class) in each terminal region
Define greedy partitioning
At each step we find the splitting variable j and split point s that minimise the impurity of the two resulting regions (optimising one split at a time, since finding the globally optimal tree is infeasible), then repeat the process recursively on each region until a stopping criterion is met.
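A minimal sketch (assuming numpy) of one greedy step: an exhaustive search over every feature j and candidate threshold s for the split minimising the weighted Gini impurity of the two children.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the split (j, s) minimising the
    weighted Gini impurity of the two child regions."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)

    best = (None, None, np.inf)  # (feature j, threshold s, impurity)
    n = len(y)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):          # candidate thresholds
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, s, score)
    return best

# Toy data: a split at x <= 2.0 separates the classes perfectly.
X = np.array([[1.0], [2.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
j, s, score = best_split(X, y)
```

A real implementation recurses on each child region with the same search.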
Write the formula for cost complexity
C_α(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α|T|, where |T| is the number of terminal nodes, N_m the number of observations in node m, Q_m(T) the impurity of node m, and α ≥ 0 a tuning parameter trading off tree size against goodness of fit.
Write down the proportion of observations of class k
p̂_mk = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k): the fraction of training observations in node m (region R_m) that belong to class k.
Define misclassification error
Misclassification error: 1 − p̂_{mk(m)}, where k(m) = argmax_k p̂_mk is the majority class in node m, i.e. the fraction of observations in the node not belonging to the majority class.
Define Gini index
Gini index: Σ_k p̂_mk (1 − p̂_mk) = Σ_{k ≠ k'} p̂_mk p̂_mk'.
Define cross entropy
Cross entropy (deviance): −Σ_k p̂_mk log p̂_mk.
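The three node-impurity measures above can be computed side by side; a small sketch assuming numpy (the log base in the cross entropy is a convention choice, natural log here):

```python
import numpy as np

def node_impurities(labels):
    """Misclassification error, Gini index, and cross entropy
    for a node containing the given integer class labels."""
    p = np.bincount(labels) / len(labels)  # class proportions p_mk
    p = p[p > 0]                           # drop empty classes (log(0) guard)
    misclass = 1.0 - p.max()               # 1 - p_mk(m)
    gini = 1.0 - np.sum(p ** 2)            # sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p * np.log(p))       # -sum_k p_mk log p_mk
    return misclass, gini, entropy

# A maximally impure two-class node: proportions (0.5, 0.5).
m, g, e = node_impurities(np.array([0, 0, 1, 1]))
```

All three peak when the classes are evenly mixed and are zero for a pure node; Gini and cross entropy are differentiable, which is why they are preferred for growing the tree.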
How is feature importance measured in decision trees?
The importance of a feature is computed as the (normalized) total reduction of the impurity criterion brought by that feature. It is also known as the Gini importance (or mean decrease in impurity).
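A quick illustration, assuming scikit-learn is installed: on data where one feature fully separates the classes, the normalised Gini importance concentrates on that feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: feature 0 perfectly separates the classes, feature 1 is noise.
X = np.array([[0, 5], [1, 3], [0, 4], [1, 6], [0, 1], [1, 2]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_  # normalised: sums to 1 across features
```

Here the tree needs a single split on feature 0, so that feature receives all of the impurity reduction.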
What is the main problem with decision trees?
High variance and instability: small changes in the training data can produce very different trees.
Define bagging
Bootstrap AGGregation: generates variations of the training data by bootstrapping (sampling with replacement), trains a model on each bootstrap sample, and then averages predictions across models.
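A generic bagging sketch, assuming numpy; `fit` is a hypothetical caller-supplied base learner that returns a prediction function (here, a trivial mean predictor just to exercise the loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_predict(X_train, y_train, X_test, fit, n_models=25):
    """Bagging: fit one model per bootstrap resample, average predictions."""
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n rows with replacement
        model = fit(X_train[idx], y_train[idx])
        preds.append(model(X_test))
    return np.mean(preds, axis=0)          # average across the ensemble

# Base learner for illustration: always predicts the training mean.
fit_mean = lambda Xb, yb: (lambda Xq: np.full(len(Xq), yb.mean()))

X = np.arange(10.0).reshape(-1, 1)
y = X.ravel() * 2                          # mean of y is 9.0
pred = bagged_predict(X, y, X, fit_mean)
```

With a high-variance base learner (like a deep tree) the averaging step is what reduces variance; the constant predictor here only demonstrates the mechanics.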
What is boosting?
A method that produces a series of weak classifiers. The predictions are then combined through a weighted majority vote to produce the final prediction.
A popular implementation, AdaBoost, reweights the data at each iteration, increasing the weights of misclassified samples so that later classifiers focus on them.
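A compact AdaBoost sketch with one-feature threshold stumps as the weak classifiers, assuming numpy and labels in {−1, +1}; the weight update and the alpha formula follow the standard exponential-loss derivation:

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """AdaBoost with threshold stumps; y must be in {-1, +1}.
    Misclassified samples get larger weights each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # uniform initial sample weights
    learners = []                          # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(X.shape[1]):        # pick the stump with lowest
            for s in np.unique(X[:, j]):   # weighted error
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] <= s, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, s, pol, pred)
        err, j, s, pol, pred = best
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)  # upweight mistakes
        w = w / w.sum()
        learners.append((j, s, pol, alpha))
    return learners

def adaboost_predict(learners, X):
    """Weighted majority vote of the weak classifiers."""
    score = np.zeros(len(X))
    for j, s, pol, alpha in learners:
        score += alpha * pol * np.where(X[:, j] <= s, 1, -1)
    return np.sign(score)

# A pattern no single stump can fit, but a boosted ensemble can.
X = np.arange(6.0).reshape(-1, 1)
y = np.array([1, 1, -1, -1, 1, 1])
learners = adaboost(X, y, n_rounds=10)
```

Each stump alone misclassifies a third of the points; the weighted vote classifies all six correctly.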
What is the idea behind random forests?
To improve the variance reduction of bagging by reducing the correlation between trees
What is “random” about random forests?
Two sources of randomness: each tree is trained on a bootstrap sample of the data, and only a random subset of the features is considered as split candidates at each node.
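Both sources of randomness appear as parameters in scikit-learn's implementation (assuming it is installed): `bootstrap` resamples the data per tree, and `max_features` limits the features tried at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy two-class data: two well-separated Gaussian blobs in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

# max_features="sqrt": try sqrt(n_features) random features per split;
# bootstrap=True: each tree sees its own bootstrap sample.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
acc = rf.score(X, y)
```

Shrinking `max_features` decorrelates the trees further (at the cost of making each tree individually weaker), which is exactly the variance-reduction idea from the previous card.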
Define feature importance in random forests
The improvement in the split criterion attributable to each feature at every split, accumulated over all trees in the forest.
What is gradient boosting?
A boosting algorithm in which each new tree is fit to the residuals (the errors of the ensemble built so far), i.e. to the negative gradient of the loss function (residual fitting).
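A minimal residual-fitting sketch for squared loss, assuming numpy, with regression stumps as the base learners and a shrinkage (learning-rate) factor:

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares regression stump: best (feature, threshold,
    left mean, right mean) for the current residuals r."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:  # exclude max so both sides non-empty
            left = X[:, j] <= s
            pred = np.where(left, r[left].mean(), r[~left].mean())
            sse = np.sum((r - pred) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, s, r[left].mean(), r[~left].mean())
    return best[1:]

def gradient_boost(X, y, n_trees=50, lr=0.1):
    """Each stump is fit to the residuals of the current ensemble
    (the negative gradient of squared loss), then added with shrinkage."""
    f = np.full(len(y), y.mean())          # start from the constant model
    stumps = []
    for _ in range(n_trees):
        r = y - f                          # residuals: what we got wrong so far
        j, s, lval, rval = fit_stump(X, r)
        f += lr * np.where(X[:, j] <= s, lval, rval)  # shrunken correction
        stumps.append((j, s, lval, rval))
    return f, stumps

X = np.arange(10.0).reshape(-1, 1)
y = np.sin(X.ravel())
f, stumps = gradient_boost(X, y)
mse = np.mean((y - f) ** 2)
```

Each round provably reduces the training error a little; for other losses the stump is fit to the negative gradient of that loss instead of the plain residual.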