Decision Tree Modeling Flashcards
What is tree-based learning? What does it do and how?
Tree-based learning is a type of supervised machine learning that
- performs classification and regression tasks.
- uses a decision tree as a predictive model to go from observations about an item (represented by the branches) to conclusions about the item’s target value (represented by the leaves).
Ensemble Learning
Techniques that enable you to use multiple decision trees simultaneously in order to produce very powerful models
What’s the benefit of hyperparameter tuning?
Knowing how and when to tune a model can help increase its performance significantly
What is a Decision Tree?
- non-parametric supervised learning algorithm (not based on assumptions about distribution)
- for classification and regression tasks
- It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.
How do data professionals use decision trees?
to make predictions about future events based on the information that is currently available.
Decision Tree PROs
- require no assumptions on data’s distribution
- handle collinearity easily.
- requiring little preprocessing to prepare data for training
Decision Tree CONs
- susceptible to overfitting.
- sensitive to variations in the training data.
The model might get extremely good at predicting seen data, but as soon as new data is introduced, it may not work nearly as well.
What is made at each node?
Decisions are made at each node.
Edges
The edges connect together the nodes essentially directing from one node to the next along the tree.
What is a Root Node?
- It’s the first node in the tree
- all decisions needed to make the prediction will stem from it
- It’s a special type of decision node because it has no predecessors.
What is a Decision Node?
- All the nodes above the leaf nodes.
- The nodes where a decision is made
- always point to a leaf node or other decision nodes within the tree.
Leaf Node
- where a final prediction is made.
- The whole process ends here as they do not split anymore
What are Child Nodes?
- any node that results from a split.
- The nodes that are pointed to; they can be either leaf nodes or other decision nodes
What are Parent Nodes?
node that the child splits from
What prediction outcomes types can decision tree be used for?
- classification: where a specific class or outcome is predicted
- regression: where a continuous variable is predicted—like the price of a car.
What is the criterion for splitting a Decision node?
A decision node is split on the criterion that minimizes the impurity of the classes in its resulting children.
What is Impurity?
- the degree of mixture with respect to class.
- A perfect split would have no impurity in the resulting child nodes; it would partition the data with each child containing only a single class.
Name 4 metrics to determine impurity
- Gini impurity
- entropy
- information gain
- log loss
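As a rough illustration of two of these metrics, the sketch below computes entropy and the information gain of a split. The function names and the fruit labels are made up for this example, and it assumes entropy is measured in bits (log base 2).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node holding 4 apples and 4 grapes
parent = ["apple"] * 4 + ["grape"] * 4
perfect_split = [["apple"] * 4, ["grape"] * 4]
useless_split = [["apple", "apple", "grape", "grape"]] * 2

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that separates the classes completely recovers all of the parent's entropy; a split that leaves each child as mixed as the parent gains nothing.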
What’s the requirement for choosing split points?
- identify what type of variable it is—categorical or continuous
- the range of values that exist for that variable
Choosing split for categorical predictor variable
consider splitting based on the categories of the variable, e.g. color.
Choosing split for continuous predictor variable
splits can be made anywhere along the range of numbers that exist in the data
E.g. sorting the fruit based on diameter: 2.25, 2.75, 3.25, 3.75, 5, and 6.5 centimeters.
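One common convention (an assumption here, not something the card states) is to use the midpoints between consecutive sorted values as candidate split points. The observed diameters below are hypothetical, chosen so that their midpoints reproduce the values listed above.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in sorted order."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

# Hypothetical observed fruit diameters in centimeters
diameters_cm = [2, 2.5, 3, 3.5, 4, 6, 7]
print(candidate_thresholds(diameters_cm))
# [2.25, 2.75, 3.25, 3.75, 5.0, 6.5]
```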
Describe Gini impurity score
- most straightforward
- the best scores are those closest to 0
- The worst score for two classes is 0.5, which occurs when each child node contains an equal number of each class.
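A minimal sketch of the Gini calculation, assuming the usual definition of one minus the sum of squared class proportions. The helper names and labels are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(children):
    """Gini impurity of a split: child impurities weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)

pure_children = [["apple"] * 3, ["grape"] * 3]
mixed_children = [["apple", "grape"], ["apple", "grape"]]

print(weighted_gini(pure_children))   # 0.0
print(weighted_gini(mixed_children))  # 0.5
```

When comparing candidate splits, the one with the lowest weighted Gini score would be chosen.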
Classification trees PROs
- Require few pre-processing steps.
- Can work with all types of variables (continuous, categorical, discrete).
- No normalization or scaling required
- Decisions are transparent.
- Not affected by extreme univariate values
Name 2 disadvantages of classification trees
- Can be computationally expensive relative to other algorithms.
- sensitive to data changes. Small changes in data can result in significant changes in predictions
What are Hyperparameters?
- parameters that can be set before the model is trained
- affect how the model fits the data
- help balance the model so that it neither underfits nor overfits the data
What is Max Depth for decision trees?
- how deep the tree is allowed to grow
- The depth = number of levels between the root node and the farthest node
- the root node is level 0
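To make the depth definition concrete, here is a small sketch that measures the depth of a tree stored as nested dictionaries. The dictionary layout (a "children" key) is a hypothetical representation chosen for this example.

```python
def depth(node):
    """Levels between this node and its farthest descendant (a leaf is 0)."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)

# Root (level 0) -> decision node (level 1) -> leaves (level 2)
tree = {
    "name": "root",
    "children": [
        {"name": "leaf A"},
        {"name": "decision", "children": [{"name": "leaf B"}, {"name": "leaf C"}]},
    ],
}

print(depth(tree))  # 2
```

A max depth of 1 would have forced the node named "decision" to be a leaf.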
Max Depth PROs
- reduce overfitting problems by limiting how deep the tree will go
- it can reduce the computational complexity of training and using the model
Min Samples Leaf
- the minimum number of samples that must be in each child node after the parent splits.
- split only if there are enough samples in each of the result nodes to satisfy the required value.
Example
There’s a decision node that currently has 10 samples. However, the min samples leaf hyperparameter is set to six. There would be no way to split the data so that each leaf node has six samples, and therefore no further split can take place.
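The example can be checked by enumerating every way to divide the samples between two children. This is only a counting sketch with a made-up function name; it ignores which feature produces the split.

```python
def valid_splits(n_samples, min_samples_leaf):
    """All (left, right) child sizes that satisfy min_samples_leaf."""
    return [
        (left, n_samples - left)
        for left in range(1, n_samples)
        if left >= min_samples_leaf and n_samples - left >= min_samples_leaf
    ]

print(valid_splits(10, 6))  # []
print(valid_splits(10, 4))  # [(4, 6), (5, 5), (6, 4)]
```

With 10 samples and a minimum of six per leaf, no division works, so the node stays a leaf.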
What is GridSearch?
A tool to find the optimal values for a model’s hyperparameters
What does GridSearch do?
- helps confirm that a model achieves its intended goal
- by systematically checking every combination of hyperparameters
- to identify which set produces the best results based on the selected metric.
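The idea of checking every combination can be sketched in plain Python. The `toy_score` function is a hypothetical stand-in for a real metric (for example a cross-validated F1 score), and the grid values are arbitrary.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return the best one and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical metric that peaks at max_depth=4, min_samples_leaf=5
    return -abs(params["max_depth"] - 4) - abs(params["min_samples_leaf"] - 5)

grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)  # {'max_depth': 4, 'min_samples_leaf': 5} 0
```

This grid has 3 x 3 = 9 combinations; the count grows multiplicatively with every hyperparameter added.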
What is an Overfit model and how to identify it?
- model learns the training data so closely that it captures more than the intrinsic patterns of all such data distributions
- model that scores very well on the training data but considerably worse on unseen data because it cannot generalize well.
- identify it by comparing scores: training accuracy is very high (close to 1) while accuracy on unseen data is considerably lower
What is an under-fitted model and how can it be identified?
- model does not learn the patterns and characteristics of the training data well, and consequently fails to make accurate predictions on new data.
- easier to identify underfitting, because the model performs poorly on both training and test data
Name 3 hyperparameters of a Decision tree
- Max Depth
- min samples split
- Min Samples Leaf
CON of increasing Max Depth
- overfitting
As you increase the max depth parameter, the performance of the model on the training set will continue to increase. It’s possible for a tree to grow so deep that leaves contain just a single sample. However, this overfits the model to the training data, and the performance on the testing data would probably be much worse.
Min Samples Split
- minimum number of samples the parent node must have before splitting
if you set this to 10, then any node that contains nine or fewer samples will automatically become a leaf node. It will not continue splitting.
What is the max and min number that min samples split can have?
Min: 2 is the smallest number that can be divided into two separate child nodes.
Max: The greater the value you use the sooner the tree will stop growing.
What is regularization?
- the process of reducing model complexity to prevent overfitting.
- Regularization helps to make the model more generalizable to new data
- regularization trades a marginal decrease in training accuracy for an increase in generalizability.
How does regularization prevent overfitting in machine learning models?
Regularization introduces penalty terms to the model’s loss function
- discouraging overly complex solutions and
- promoting better generalization to unseen data.
What is cross-validation, and why is it important in model evaluation?
Cross-validation is a technique used to assess a model’s performance by dividing the data into multiple subsets (folds) for training and testing.
It helps to estimate how well a model will generalize to new data and mitigates the risk of overfitting by using different subsets for training and testing.
What is Model validation process?
the whole process of
- evaluating different models
- selecting best model and then
- continuing to analyze the performance of the selected model to better understand its strengths and limitations.
Explain Validation Dataset (Separation Validation)
- The simplest way to maintain the objectivity of the test data
- is to create another partition in the data—a validation set—and
- save the test data for after you select the final model.
The validation set is then used, instead of the test set, to compare different models.
What happens when splitting data with Validation (Separate Validation)?
With validation, the data is actually split into three sets
1. Train: used to train all models of interest
2. Validation: is used to evaluate the models leaving the test set untouched
3. Test: used after final model selected
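A minimal sketch of a three-way split, assuming a 60/20/20 division. The function name, fractions, and seed are illustrative choices, not values given in these cards.

```python
import random

def train_validation_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows, then carve off the test and validation partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

train, validation, test = train_validation_test_split(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```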
What happens when splitting data for Cross-validation?
- process that uses different folds of the data to test and train a model across several iterations.
- avoids having to split the data into three partitions (train / validate / test) in advance.
Cross-validation folds
- Instead of having one validation set to evaluate the model, the training data is split into multiple sections known as folds.
- Then the model is trained on different combinations of these folds.
- The training process occurs k times, each time using a different fold as the validation set.
- At the end, the final validation score is the average of all k scores.
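The fold rotation can be sketched as follows. `toy_fit_and_score` is a hypothetical placeholder: a real version would fit a model on the training indices and score it on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score over all k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

def toy_fit_and_score(train, validation):
    # Hypothetical stand-in: returns the share of the data used for training
    return len(train) / (len(train) + len(validation))

for train, validation in k_fold_indices(10, 5):
    print(validation)
print(cross_validate(10, 5, toy_fit_and_score))
```

Every sample lands in the validation fold exactly once, which is why the approach makes good use of small datasets.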
Cross-validation PROs
- useful when working with smaller datasets
- as it maximizes the utility of the data available. More so than standard validation.
Cross-validation CONs
- not necessary when working with very large datasets
What does it mean to split the data and what is done after the split?
- split a dataset into training and testing data.
- Then, you fit a model to the training data and
- evaluate its performance on the test data.
Name 2 Model Validations Methods
- Validation sets (Separation Validation)
- Cross-validation
When to use Separation Validation?
- very large dataset.
- The reason for this is that the more data you use for validation, the less you have for training and testing.
Model Validation Best Practice
once the final model is selected, best practice is to go back and fit the selected model to all the non-test data (i.e., the training data + validation data) before scoring this final model on the test data.
When is test data used in Model Validation process?
- should not be used to select a final model.
- The test data is used only for this final model.
- Your model’s score on this data is how you can expect the model to perform on completely new data.
What type of model is a Random Forest?
popular ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time.
It’s essentially a collection of decision trees, each trained on a random subset of the data and features. It is one methodology for building tree-based ensemble models.
Ensemble learning
involves building multiple models and then aggregating their outputs to make a final prediction
Ensemble learning PROs
- powerful because it combines the results of many models to help make more reliable final predictions,
- plus these predictions have less bias and lower variance than other standalone models.
- predictions using an ensemble of models are very accurate even when the individual models themselves are barely more accurate than a random guess.
Ensemble learning Best Practice
A best practice when building an ensemble is to use very different methodologies for each model it contains, such as a logistic regression, a Naive Bayes model, and a decision tree classifier. In this way, when the models make errors and they always will, the errors will be uncorrelated.
- The goal is for them to not all make the same errors for the same reasons.
Base Learner
is any individual model in an ensemble.
3 Methods of Ensemble Learning
Bagging, Boosting, Stacking
Bootstrap
Each base learner samples from the data with replacement. For bagging, this means the various base learners can all sample the same observation, and a single learner can sample that observation multiple times during training.
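Sampling with replacement can be sketched in a few lines. The observation names and seed are hypothetical; the output is not shown because it depends on the random draw.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
observations = ["obs_1", "obs_2", "obs_3", "obs_4", "obs_5"]

# Each base learner gets its own bootstrap sample of the same size
for learner in range(3):
    print(learner, bootstrap_sample(observations, rng))
```

Because draws are independent, an observation may appear several times in one sample and not at all in another.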
Aggregation
The aggregation part of bagging refers to the fact that the predictions of all the individual models are aggregated to produce a final prediction.
Name a popular aggregation method for Classification
it’s often whichever class receives the most predictions, which is the mode.
Aggregation for Regression
this is typically the average of all the predictions.
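Both aggregation rules fit in a short sketch. The function names and the example predictions are made up for illustration.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(predictions):
    """Majority vote: the most frequently predicted class (the mode)."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Average of the base learners' numeric predictions."""
    return mean(predictions)

print(aggregate_classification(["spam", "ham", "spam"]))  # spam
print(aggregate_regression([10.0, 12.0, 14.0]))           # 12.0
```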
Random Forest
Bagging + random feature sampling. random forest takes the randomization from Bagging one step further and randomizes the features used to train each base learner too
Bagging
A technique used by certain kinds of models that use ensembles of base learners to make predictions; refers to the combination of bootstrapping and aggregating
What does Bagging stand for?
Bootstrap aggregating
Difference between Bagging and Random Forest?
Bagging = base learners are trained on data that is randomized by observation.
Random forest takes the randomization from bagging one step further. It randomizes the data by features too.
- A regular decision tree model will seek the best feature to use to split a node.
- A random forest model will grow each of its trees by taking a random subset of the available features in the training data and then splitting each node at the best feature available to that tree.
- This means that each base learner in a random forest model has different combinations of features available to it, which helps to prevent the problem of correlated errors between learners in the ensemble. Each individual base learner is a decision tree. It may be fully grown, so each leaf is a single observation or it may be very shallow depending on how you choose to tune your model.
Bagging PROs
- Reduces variance: Standalone models can result in high variance. Aggregating base models’ predictions in an ensemble helps reduce it.
- Fast: Training can happen in parallel across CPU cores and even across different servers.
- Good for big data: Bagging doesn’t require an entire training dataset to be stored in memory during model training.
Random Forest PROs
leverage randomness to reduce the likelihood that a given base learner will make the same mistakes as other base learners.
When mistakes between learners are uncorrelated, it reduces both bias and variance.
In bagging, this randomization occurs by training each base learner on a sampling of the observations, with replacement.
Random forests can also offer better performance scores and faster execution times.
Example of Random Forest
For car data, a random forest model of 3 base learners, each trained on bootstrapped samples of 3 observations and 2 features ->
- Base learner 1: features = mile, price.
- Base learner 2: features = year, mile.
- Base learner 3: features = model, price.
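The sampling in this example can be sketched as below, following the description of each tree receiving its own bootstrapped observations and its own random pair of features. The row count of 5 and the function name are assumptions for illustration; the draw itself is random, so no output is shown.

```python
import random

def plan_base_learners(n_rows, features, n_learners, n_obs, n_features, seed=0):
    """Pick bootstrapped row indices and a feature subset for each learner."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_learners):
        rows = [rng.randrange(n_rows) for _ in range(n_obs)]  # with replacement
        chosen = rng.sample(features, n_features)             # without replacement
        plans.append({"rows": rows, "features": chosen})
    return plans

car_features = ["year", "mile", "model", "price"]
plans = plan_base_learners(
    n_rows=5, features=car_features, n_learners=3, n_obs=3, n_features=2
)
for plan in plans:
    print(plan)
```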
How does all this sampling affect predictions?
- it doesn’t have to hurt predictions.
- not only is it possible for model scores to improve with sampling, but they also require significantly less time to run since each tree is built from less data.
Random Forest hyperparameters
- max_depth
- min_samples_leaf
- min_samples_split
- max_features
- n_estimators (number of estimators)
RF: Max Features
- controls the randomness of the trees.
- specifies how many features are randomly selected from the training data as candidates when determining splits (in scikit-learn, a fresh random subset is drawn at each split).
number of estimators
controls how many decision trees your model will build for its ensemble.
For example, if you set your number of estimators to 300, your model will train 300 individual trees.
Boosting
is a supervised learning technique where you build an ensemble of weak learners. This is done sequentially with each consecutive base learner trying to correct the errors of the one before.
weak learner
A model that performs slightly better than randomly guessing
Boosting and Random Forest similarities
Like random forest, boosting is an
- ensembling technique, and it
- also builds many weak learners,
- then aggregates their predictions.
Boosting and Random Forest differences
Unlike random forest, which builds base learners in parallel, boosting builds learners sequentially. Also, the methodology you choose for the weak learner isn’t limited to tree-based methods.
Boosting methodologies
- Adaptive boosting (AdaBoost)
- Gradient Boosting
AdaBoost
is a tree-based boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner. This process repeats until either a tree makes a perfect prediction or the ensemble reaches the maximum number of trees, which is a hyperparameter that is specified by the data professional.
What types of problems can AdaBoost address?
both classification and regression problems; hence aggregation differs depending on problem type
AdaBoost aggregation for classification problems
the ensemble uses a voting process that places weight on each vote. Base learners that make more accurate predictions are weighted more heavily in the final aggregation.
AdaBoost aggregation for regression problems
the model calculates a weighted mean prediction for all the trees in the ensemble.
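A small sketch of both weighted aggregations, following the descriptions on these cards. The weights are hypothetical; in a trained AdaBoost model they would be derived from each learner's error.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Class whose supporting learners have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def weighted_mean(predictions, weights):
    """Weighted average of numeric predictions."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# One accurate learner (weight 0.9) outvotes two weaker ones (0.3 + 0.4)
print(weighted_vote(["yes", "no", "no"], [0.9, 0.3, 0.4]))  # yes
print(weighted_mean([10.0, 20.0], [3.0, 1.0]))              # 12.5
```

Note that the weighted vote can differ from a simple majority vote, which would have returned "no" here.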
Boosting disadvantages
You can’t train your model in parallel across many different servers, because each model in the ensemble is dependent on the one that preceded it.
- This means that in terms of computational efficiency, it doesn’t scale well to very large datasets when compared to bagging.
Boosting advantages
- accurate
- because it’s based on an ensemble of weak learners, the problem of high variance is reduced, since no single tree weighs too heavily in the ensemble
- reduces bias
- easy to understand and doesn’t require the data to be scaled or normalized
- can handle both numeric and categorical features
- can still function well even with multicollinearity among the features
- robust to outliers
Gradient boosting machines (GBM)
Model ensembles that use gradient boosting
Gradient boosting
boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it and therefore compensate for it. Its base learner trees are known as “weak learners” or “decision stumps.” They are generally very shallow.
Difference between Adaboost vs Gradient Boosting
Gradient Boosting is different from adaptive boosting because instead of assigning weights to incorrect predictions, each base learner in the sequence is built to predict the residual errors of the model that preceded it.
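The residual-fitting idea can be sketched for regression on a single feature. This is a toy illustration under stated assumptions (squared error, one-split stumps, a fixed learning rate, made-up data), not how any particular library implements it.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that minimizes squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then let each stump predict the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data with a jump between x=3 and x=4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
mean_y = sum(ys) / len(ys)

model = gradient_boost(xs, ys)
print("baseline MSE:", mse(lambda x: mean_y, xs, ys))
print("boosted MSE: ", mse(model, xs, ys))
```

Each stump only corrects a fraction (the learning rate) of the remaining error, so the training error shrinks gradually over the sequence.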
Gradient boosting advantages
- high accuracy
- scalable
- work well with missing data
- don’t require the data to be scaled and can handle outliers easily
XGBoost
Extreme Gradient Boosting: an optimized open-source implementation of gradient boosting machines (GBMs)
GBM Hyperparameters
- max_depth
- n_estimators
- learning_rate
- min_child_weight
XGBoost Max Depth
controls how deep each base learner tree will grow. The best way to find this value is through cross-validation. The model’s final max depth value is usually low.
XGBoost n_estimators
the number of estimators, or maximum number of base learners, that the ensemble will grow. This is best determined using grid search.
- For smaller data sets, more trees may be better than fewer.
- For very large data sets, the opposite could be true. Typical ranges are 50-500.
XGBoost learning_rate (shrinkage)
Values lie in the range (0, 1]. The learning rate indicates how much weight the model should give to each consecutive base learner’s prediction.
- Lower learning rates mean that each subsequent tree contributes less to the ensemble’s final prediction.
- This helps prevent over-correction and overfitting.
- Another common name for this concept is shrinkage, because less and less weight is given to each consecutive tree’s prediction in the final ensemble.
XGBoost min_child_weight
This is a regularization parameter. A tree will not split a node if the split results in any child node with less weight than what you specify in this hyperparameter. Instead, the node becomes a leaf.
What does higher min_child_weight value do?
Higher values stop trees from splitting further. If your model is overfitting, increase this value to stop your trees from getting too finely divided.
What does lower min_child_weight value do?
lower values will allow trees to continue to split further. If your model is underfitting, then you may want to lower it to allow for more complexity.
What’s the ideal approach to model selection?
- Split the data into training, validation, and test sets
- Tune hyperparameters using cross-validation on the training set
- Use all tuned models to predict on the validation set
- Select a champion model based on performance on the validation set
- Use champion model alone to predict on test data
What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?
Pros:
- The coding workload is reduced.
- The scripts for data splitting are shorter.
- It’s only necessary to evaluate test dataset performance once, instead of two evaluations (validate and test).
Cons:
- If a model is evaluated using samples that were also used to build or fine-tune that model, it likely will provide a biased evaluation.
- Selecting a model based on its test scores can overfit the selection to the test data, so the test score no longer reflects performance on completely new data.