Decision trees / ensemble methods Flashcards
What are nested trees, and how do they extend the capabilities of decision trees? Provide an example scenario where nested trees might be beneficial.
Nested trees: a larger, more complex tree contains smaller subtrees nested within it, so the fully grown tree can capture more of the patterns in the data and fit it more closely. For example, a deep tree can model an interaction between two features that a single-split stump cannot.
To make such a tree less complex there is pruning: removing branches - or subtrees - from the tree, so that it is 1) easier to interpret and 2) cheaper to compute.
Pruning is necessary to prevent overfitting and to reduce computation time (see the sketch below).
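A minimal sketch of pruning via cost-complexity pruning, assuming scikit-learn is the library in use; the dataset and the `ccp_alpha` value are illustrative, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: many nested subtrees, fits the training data closely.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: ccp_alpha > 0 removes branches whose extra complexity is not
# justified by the impurity reduction they provide.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full  :", full_tree.get_n_leaves(), "leaves, test acc", full_tree.score(X_test, y_test))
print("pruned:", pruned_tree.get_n_leaves(), "leaves, test acc", pruned_tree.score(X_test, y_test))
```

With a reasonably chosen `ccp_alpha`, the pruned tree typically has far fewer leaves at little or no cost in test accuracy.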
Explain the basic concept of decision trees in machine learning. How do decision trees make decisions, and what is the significance of features, nodes, and leaves in a decision tree?
A decision tree uses a hierarchical structure to make decisions. It learns patterns and relationships between the input features and the output (target) variable; the input features act as the criteria for making decisions.
Internal nodes represent decision points: each node splits the data according to a criterion on one feature.
Leaves are the terminal nodes and hold the final decisions (the predicted class or value).
A splitting criterion turns a feature value into a yes/no answer, sending each observation down the left or right branch.
If a few features are very dominant, decision trees can overfit, and in an ensemble the individual trees end up looking very similar to each other.
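A small sketch of a fitted tree, assuming scikit-learn; the iris data and `max_depth=2` are only stand-ins to keep the printout short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# A shallow tree: each internal node tests one feature against a threshold,
# each leaf holds the final class decision.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree, feature_names=list(data.feature_names)))
print("prediction for first sample:", tree.predict(X[:1]))
```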
Discuss the concept of a Random Forest. How does it leverage multiple decision trees, and what is the role of randomness in the model? Explain the process of bagging in the context of Random Forest.
Random forest is an adaptation of bagging that uses decision trees as its base models - it is an ensemble of decision trees.
Bagging = bootstrap aggregation: bootstrap a number of training datasets from the original one, each of the same size. The sampling is done with replacement, meaning observations are drawn at random and can be drawn again (they are not removed from the original training set).
Random forest adds extra randomness on top of this: at each split only a random subset of the features is considered, usually of size sqrt (or log2) of the number of original features.
This additional randomness reduces the correlation among the trees and thereby the overall variance of the ensemble,
resulting in a more robust model that generalizes better (see the sketch below).
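A sketch contrasting plain bagging of trees with a random forest, assuming scikit-learn; the synthetic dataset and hyperparameter values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Bagging: each tree is trained on a bootstrap sample (drawn with replacement,
# same size as the original training set); every split may use all features.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random forest: same bootstrapping, plus each split only considers a random
# subset of the features (sqrt of the feature count) to decorrelate the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagging      :", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```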
Explore the idea of gradient boosting and AdaBoost. How does gradient boosting differ from Random Forest in terms of building decision trees sequentially? What is the role of weak learners in gradient boosting?
AdaBoost trains models sequentially, where each model corrects the errors of the previous ones - the models are dependent. It assigns weights to the instances in the training data, giving higher weights to misclassified instances so the next learner focuses on them. Commonly only stumps (single-level decision trees) are used as weak learners. The final model is a weighted sum of the learners, with more emphasis on those that perform well.
Gradient boosting is similar to AdaBoost in that it builds trees sequentially, each one addressing the errors of the combined ensemble so far.
Instead of adjusting instance weights, it fits each new tree to the residuals of the current predictions.
In doing so it directly minimizes a chosen loss function.
It builds a strong learner by combining the weak learners.
Vs Random Forest:
Random forest works to reduce the correlation among its trees. The two boosting methods train trees sequentially and weight them, while random forest trains full trees independently, and every tree has the same say in the final prediction (see the sketch below).
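A hedged sketch of the two boosting methods, assuming scikit-learn; the dataset and hyperparameters are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# AdaBoost: decision stumps (max_depth=1) trained one after another; misclassified
# samples get higher weights, and each stump's vote is weighted by its accuracy.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)

# Gradient boosting: each shallow tree is fit to the residuals of the current
# ensemble, taking a small step (learning_rate) toward minimizing the loss.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0)

print("AdaBoost         :", cross_val_score(ada, X, y, cv=5).mean())
print("gradient boosting:", cross_val_score(gbm, X, y, cv=5).mean())
```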
Discuss the advantages and disadvantages of using ensemble models, such as Random Forest and Gradient Boosting, compared to individual decision trees. Highlight scenarios where ensemble models excel and situations where they might not be the best choice.
Pros:
They can capture non-linear relationships.
We do not need to scale the features (e.g. standardization or normalization) - decisions are made by thresholding individual feature values, so monotone rescaling does not change the splits.
Can use both categorical and continuous variables
They do not make any assumptions about the distribution of the data.
Cons:
It is easy to overfit the data with trees - to handle this we need to prune the model, or set a stopping criterion so that the trees do not grow too deep.
They might be biased towards the dominant class when the data is imbalanced.
They are not good at problems that require extrapolation: predictions are built from values seen in training, so a tree cannot predict outside that range (see the sketch below).
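A small sketch of the extrapolation weakness, assuming scikit-learn; the linear toy data is only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_new = np.array([[20.0]])  # far outside the training range
print("tree   predicts:", tree.predict(X_new))    # stuck near the largest value seen in training (~20)
print("linear predicts:", linear.predict(X_new))  # follows the learned trend (~40)
```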