Decision Tree Modeling Flashcards

1
Q

What is tree-based learning? What does it do and how?

A

Tree-based learning is a type of supervised machine learning that performs classification and regression tasks. It uses a decision tree as a predictive model to go from observations about an item (represented by the branches) to conclusions about the item's target value (represented by the leaves).

2
Q

Ensemble Learning

A

A technique that enables you to use multiple decision trees simultaneously in order to produce very powerful models.

3
Q

What’s the benefit of hyperparameter tuning?

A

Knowing how and when to tune a model can significantly increase its performance.

4
Q

What is a Decision Tree?

A
  • A non-parametric supervised learning algorithm (it makes no assumptions about the data's distribution)
  • Used for classification and regression tasks
  • Has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes
5
Q

How do data professionals use decision trees?

A

to make predictions about future events based on the information that is currently available.

6
Q

Decision Tree PROs

A
  1. Require no assumptions about the data's distribution
  2. Handle collinearity easily
  3. Require little preprocessing to prepare the data for training
7
Q

Decision Tree CONs

A
  1. Susceptible to overfitting
  2. Sensitive to variations in the training data

The model might get extremely good at predicting the data it has seen, but as soon as new data is introduced, it may not work nearly as well.

8
Q

What are made at each node?

A

Decisions are made at each node.

9
Q

Edges

A

The edges connect the nodes, essentially directing the path from one node to the next along the tree.

10
Q

What is a Root Node?

A
  • It's the first node in the tree
  • All decisions needed to make the prediction will stem from it
  • It's a special type of decision node because it has no predecessors
11
Q

What is a Decision Node?

A
  • All the nodes above the leaf nodes
  • The nodes where decisions are made
  • They always point to a leaf node or to other decision nodes within the tree
12
Q

Leaf Node

A
  • The node where a final prediction is made
  • The whole process ends here, as leaf nodes do not split any further
13
Q

What are Child Nodes?

A
  • Any node that results from a split
  • The nodes that are pointed to, whether leaf nodes or other decision nodes
14
Q

What are Parent Nodes?

A

The node that a child node splits from.

15
Q

What types of prediction outcomes can decision trees be used for?

A
  1. Classification: a specific class or outcome is predicted
  2. Regression: a continuous variable is predicted, like the price of a car
16
Q

What is the criterion for splitting a decision node?

A

A decision node is split on the criterion that minimizes the impurity of the classes in the resulting child nodes.

17
Q

What is Impurity?

A
  • the degree of mixture with respect to class.
  • A perfect split would have no impurity in the resulting child nodes; it would partition the data with each child containing only a single class.
18
Q

Name 4 metrics to determine impurity

A
  • Gini impurity
  • entropy
  • information gain
  • log loss
19
Q

What are the requirements for choosing split points?

A
  1. Identify what type of variable it is: categorical or continuous
  2. Identify the range of values that exist for that variable
20
Q

Choosing split for categorical predictor variable

A

Consider splitting based on the categories of the variable, e.g., color.

21
Q

Choosing split for continuous predictor variable

A

Splits can be made anywhere along the range of numbers that exist in the data.

E.g., sorting the fruit based on diameter: 2.25, 2.75, 3.25, 3.75, 5, and 6.5 centimeters.

22
Q

Describe Gini impurity score

A
  • The most straightforward impurity metric
  • The best scores are those closest to 0
  • The worst score is 0.5, which occurs when each child node contains an equal number of each class
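Not from the card itself, but a minimal sketch of the computation (Gini impurity of one node = 1 minus the sum of squared class proportions):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a single node: 1 - sum(p_i^2) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["apple"] * 6))                   # 0.0  (pure node: best)
print(gini_impurity(["apple"] * 3 + ["orange"] * 3))  # 0.5  (50/50 node: worst)
```
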
23
Q

Classification trees PROs

A
  • Require few preprocessing steps
  • Can work with all types of variables (continuous, categorical, discrete)
  • No normalization or scaling required
  • Decisions are transparent
  • Not affected by extreme univariate values (outliers)
24
Q

Name 2 disadvantages of classification trees

A
  • Can be computationally expensive relative to other algorithms
  • Sensitive to data changes: small changes in data can result in significant changes in predictions
25
Q

What are Hyperparameters?

A
  • Parameters that can be set before the model is trained
  • They affect how the model fits the data
  • They help balance the model so that it neither underfits nor overfits the data
26
Q

What is Max Depth for decision trees?

A
  • How deep the tree is allowed to grow
  • The depth is the number of levels between the root node and the farthest node
  • The root node is level 0
27
Q

Max Depth PROs

A
  1. Reduces overfitting by limiting how deep the tree will go
  2. Reduces the computational complexity of training and using the model
28
Q

Min Samples Leaf

A
  • The minimum number of samples that must be in each child node after the parent splits
  • A node splits only if there are enough samples in each of the resulting nodes to satisfy the required value

Example:
A decision node currently has 10 samples, but the min samples leaf hyperparameter is set to six. There is no way to split the data so that each leaf node has at least six samples, so no further split can take place.
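A hedged illustration using scikit-learn's DecisionTreeClassifier (the card doesn't name a library; the iris data is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Any split that would leave a child node with fewer than 6 samples is
# disallowed, so such nodes become leaves instead.
tree = DecisionTreeClassifier(min_samples_leaf=6, random_state=42)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```
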

29
Q

What is GridSearch?

A

A tool that finds the optimal values for a model's hyperparameters.

30
Q

What does GridSearch do?

A
  • Confirms that a model achieves its goal
  • Systematically checks every combination of hyperparameters
  • Identifies which set produces the best results based on the selected metric
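A minimal sketch assuming scikit-learn's GridSearchCV (the cards say only "GridSearch", so the exact tool is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 3 x 3 = 9 hyperparameter combinations, each scored with 5-fold CV.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```
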
31
Q

What is an Overfit model and how to identify it?

A
  • The model learns the training data so closely that it captures noise in addition to the intrinsic patterns of the data
  • An overfit model scores very well on the training data but considerably worse on unseen data because it cannot generalize well
  • Identify it when accuracy on the training data is very high (close to 1) but accuracy on unseen data is much lower
32
Q

What is an under-fitted model and how can it be identified?

A
  • The model does not learn the patterns and characteristics of the training data well, and consequently fails to make accurate predictions on new data
  • Underfitting is easier to identify, because the model performs poorly on both training and test data
33
Q

Name 3 hyperparameters of a Decision tree

A
  1. Max depth
  2. Min samples split
  3. Min samples leaf
34
Q

CON of increasing Max Depth

A
  • overfitting

As you increase the max depth parameter, the performance of the model on the training set will continue to increase. It’s possible for a tree to grow so deep that leaves contain just a single sample. However, this overfits the model to the training data, and the performance on the testing data would probably be much worse.

35
Q

Min Samples Split

A
  • The minimum number of samples a parent node must have before splitting

If you set this to 10, then any node that contains nine or fewer samples automatically becomes a leaf node; it will not continue splitting.

36
Q

What are the minimum and maximum values that min samples split can have?

A

Min: 2, since that is the smallest number of samples that can be divided into two separate child nodes.

Max: there is no fixed maximum; the greater the value you use, the sooner the tree will stop growing.

37
Q

What is regularization?

A
  • the process of reducing model complexity to prevent overfitting.
  • Regularization helps to make the model more generalizable to new data
  • regularization trades a marginal decrease in training accuracy for an increase in generalizability.
38
Q

How does regularization prevent overfitting in machine learning models?

A

Regularization introduces penalty terms into the model's loss function, discouraging overly complex solutions and promoting better generalization to unseen data.
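
The cards don't name a specific penalty, but a minimal sketch with an L2 (ridge) penalty, an assumed example, shows the idea: the penalty term grows with the size of the weights, so overly complex fits cost more.

```python
import numpy as np

def ridge_loss(w, X, y, alpha=1.0):
    """Squared-error loss plus an L2 penalty on the weights.

    alpha scales the penalty: larger alpha pushes the optimizer
    toward smaller (simpler) coefficients.
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2) + alpha * np.sum(w ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
print(ridge_loss(np.zeros(3), X, y))                 # no penalty paid yet
print(ridge_loss(np.array([5.0, 5.0, 5.0]), X, y))   # large weights pay the penalty
```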

39
Q

What is cross-validation, and why is it important in model evaluation?

A

Cross-validation is a technique used to assess a model’s performance by dividing the data into multiple subsets (folds) for training and testing.

It helps to estimate how well a model will generalize to new data and mitigates the risk of overfitting by using different subsets for training and testing.
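
A brief sketch assuming scikit-learn's cross_val_score:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 folds: train on 4, score on the held-out fold, rotate 5 times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())
```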

40
Q

What is Model validation process?

A

The whole process of
- evaluating different models,
- selecting the best model, and then
- continuing to analyze the performance of the selected model to better understand its strengths and limitations.

41
Q

Explain Validation Dataset (Separate Validation)

A
The simplest way to maintain the objectivity of the test data is to create another partition in the data (a validation set) and save the test data for after you select the final model.

The validation set is then used, instead of the test set, to compare different models.

42
Q

What happens when splitting data with Validation (Separate Validation)?

A

With validation, the data is actually split into three sets:
1. Train: used to train all models of interest
2. Validation: used to evaluate the models, leaving the test set untouched
3. Test: used only after the final model is selected
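
One hedged way to produce the three sets, assuming scikit-learn; the 60/20/20 proportions are illustrative, not from the card:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set (20%), then split the rest into
# train and validation: 0.25 of the remaining 80% = 20% overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30
```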

43
Q

What happens when splitting data for Cross-validation?

A
  • process that uses different folds of the data to test and train a model across several iterations.
  • avoids having to split the data into three partitions (train / validate / test) in advance.
44
Q

Cross-validation folds

A
  • Instead of having one validation set to evaluate the model, the training data is split into multiple sections known as folds.
  • Then the model is trained on different combinations of these folds.
  • The training process occurs k times, each time using a different fold as the validation set.
  • At the end, the final validation score is the average of all k scores.
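
A minimal sketch of the k-fold loop, assuming scikit-learn's KFold with k = 5:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = []

# k = 5: each iteration trains on 4 folds and validates on the 5th.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print(np.mean(scores))  # final validation score = average of the k fold scores
```
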
45
Q

Cross-validation PROs

A
  • Useful when working with smaller datasets
  • Maximizes the utility of the available data, more so than standard validation
46
Q

Cross-validation CONs

A
  • not necessary when working with very large datasets
47
Q

What does it mean to split the data and what is done after the split?

A

- Split a dataset into training and testing data.
- Then, fit a model to the training data and
- evaluate its performance on the test data.

48
Q

Name 2 Model Validations Methods

A
  1. Validation sets (separate validation)
  2. Cross-validation

49
Q

When to use separate validation?

A
  • With a very large dataset
  • The reason is that the more data you use for validation, the less you have for training and testing
50
Q

Model Validation Best Practice

A

Once the final model is selected, best practice is to go back and fit the selected model to all the non-test data (i.e., the training data + validation data) before scoring this final model on the test data.

51
Q

When is test data used in Model Validation process?

A
  • The test data should not be used to select a final model
  • It is used only to score the final model
  • Your model's score on this data is how you can expect the model to perform on completely new data
52
Q

What type of model is a Random Forest?

A

A popular ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time.

It's essentially a collection of decision trees, each trained on a random subset of the data and features. Random forest is one methodology for building tree-based ensemble models.

53
Q

Ensemble learning

A

involves building multiple models and then aggregating their outputs to make a final prediction

54
Q

Ensemble learning PROs

A
  • Powerful because it combines the results of many models to help make more reliable final predictions
  • These predictions have less bias and lower variance than those of standalone models
  • Predictions from an ensemble can be very accurate even when the individual models themselves are barely more accurate than a random guess
55
Q

Ensemble learning Best Practice

A

A best practice when building an ensemble is to use very different methodologies for each model it contains, such as a logistic regression, a Naive Bayes model, and a decision tree classifier. In this way, when the models make errors (and they always will), the errors will be uncorrelated.

  • The goal is for them not to all make the same errors for the same reasons.
56
Q

Base Learner

A

is any individual model in an ensemble.

57
Q

3 Methods of Ensemble Learning

A

Bagging, Boosting, Stacking

58
Q

Bootstrap

A

Each base learner samples from the data with replacement. For bagging, this means the various base learners can all sample the same observation, and a single learner can sample that observation multiple times during training.
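
A small sketch of sampling with replacement (NumPy is an assumption here; any random number generator works):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # ten observations, labeled 0-9

# With replacement: a single draw can repeat observations, and two
# different base learners can both draw the same observation.
for learner in range(3):
    print(rng.choice(data, size=len(data), replace=True))
```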

59
Q

Aggregation

A

The aggregation part of bagging refers to the fact that the predictions of all the individual models are aggregated to produce a final prediction.

60
Q

Name a popular aggregation method for Classification

A

It's often whichever class receives the most predictions, i.e., the mode.

61
Q

Aggregation for Regression

A

this is typically the average of all the predictions.

62
Q

Random Forest

A

Bagging + random feature sampling. Random forest takes the randomization from bagging one step further and also randomizes the features used to train each base learner.

63
Q

Bagging

A

A technique used by certain kinds of models that use ensembles of base learners to make predictions; refers to the combination of bootstrapping and aggregating

64
Q

What does Bagging stand for?

A

Bootstrap aggregating

65
Q

Difference between Bagging and Random Forest?

A

Bagging = base learners are trained on data that is randomized by observation.
Random forest takes the randomization from bagging one step further: it randomizes the data by features too.
- A regular decision tree model will seek the best feature to use to split a node.
- A random forest model will grow each of its trees by taking a random subset of the available features in the training data and then splitting each node at the best feature available to that tree.
- This means that each base learner in a random forest model has different combinations of features available to it, which helps to prevent the problem of correlated errors between learners in the ensemble. Each individual base learner is a decision tree. It may be fully grown, so that each leaf is a single observation, or it may be very shallow, depending on how you choose to tune your model.

66
Q

Bagging PROs

A
  • Reduces variance: Standalone models can result in high variance. Aggregating base models' predictions in an ensemble helps reduce it.
  • Fast: Training can happen in parallel across CPU cores and even across different servers.
  • Good for big data: Bagging doesn't require the entire training dataset to be stored in memory during model training.
67
Q

Random Forest PROs

A

Random forests leverage randomness to reduce the likelihood that a given base learner will make the same mistakes as other base learners.

When mistakes between learners are uncorrelated, both bias and variance are reduced.

In bagging, this randomization occurs by training each base learner on a sample of the observations, drawn with replacement.

The result is better performance scores and faster execution times.

68
Q

Example of Random Forest

A

For car data, a random forest model of 3 base learners, each trained on a bootstrapped sample of 3 observations and 2 features:

  • Base learner 1: features = mileage and price
  • Base learner 2: features = year and mileage
  • Base learner 3: features = model and price
69
Q

How does all this sampling affect predictions?

A
  • It doesn't hurt predictions.
  • Not only is it possible for model scores to improve with sampling, but models also require significantly less time to run, since each tree is built from less data.
70
Q

Random Forest hyperparameters

A
  • max_depth
  • min_samples_leaf
  • min_samples_split
  • max_features
  • n_estimators (number of estimators)
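
For reference, a hedged mapping onto scikit-learn's RandomForestClassifier arguments (the values are illustrative, not from the cards):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees in the ensemble
    max_depth=5,           # cap on how deep each tree may grow
    min_samples_split=10,  # a node needs >= 10 samples to split
    min_samples_leaf=4,    # every child must keep >= 4 samples
    max_features="sqrt",   # features sampled at random for each split
    random_state=0,
)
rf.fit(X, y)
```
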
71
Q

RF: Max Features

A
  • controls the randomness of the trees.
  • specifies the number of features that each tree selects randomly from the training data to determine its splits.
72
Q

number of estimators

A

controls how many decision trees your model will build for its ensemble.

For example, if you set your number of estimators to 300, your model will train 300 individual trees.

73
Q

Boosting

A

A supervised learning technique where you build an ensemble of weak learners. This is done sequentially, with each consecutive base learner trying to correct the errors of the one before it.

74
Q

weak learner

A

A model that performs slightly better than randomly guessing

75
Q

Boosting and Random Forest similarities

A

Like random forest, boosting is an
- ensembling technique, and it
- also builds many weak learners,
- then aggregates their predictions.

76
Q

Boosting and Random Forest differences

A

Unlike random forest, which builds base learners in parallel, boosting builds learners sequentially. Also, the methodology you choose for the weak learner isn't limited to tree-based methods.

77
Q

Boosting methodologies

A
  1. Adaptive boosting (AdaBoost)
  2. Gradient boosting
78
Q

AdaBoost

A

is a tree-based boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner. This process repeats until either a tree makes a perfect prediction or the ensemble reaches the maximum number of trees, which is a hyperparameter that is specified by the data professional.
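
A hedged sketch assuming scikit-learn's AdaBoostClassifier, whose default base learner is a shallow decision tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially, each reweighting the observations the
# previous tree got wrong, up to at most 100 trees (the hyperparameter
# the card mentions).
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```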

79
Q

What types of problems can AdaBoost address?

A

Both classification and regression problems; hence, aggregation differs depending on the problem type.

80
Q

AdaBoost aggregation for classification problems

A

the ensemble uses a voting process that places weight on each vote. Base learners that make more accurate predictions are weighted more heavily in the final aggregation.

81
Q

AdaBoost aggregation for regression problems

A

the model calculates a weighted mean prediction for all the trees in the ensemble.

82
Q

Boosting disadvantages

A

You can’t train your model in parallel across many different servers, because each model in the ensemble is dependent on the one that preceded it.
- This means that in terms of computational efficiency, it doesn’t scale well to very large datasets when compared to bagging.

83
Q

Boosting advantages

A
  1. Accurate
  2. Being based on an ensemble of weak learners reduces the problem of high variance
  3. This is because no single tree weighs too heavily in the ensemble
  4. Reduces bias
  5. Easy to understand, and doesn't require the data to be scaled or normalized
  6. Can handle both numeric and categorical features
  7. Can still function well even with multicollinearity among the features
  8. Robust to outliers
84
Q

Gradient boosting machines (GBM)

A

Model ensembles that use gradient boosting

85
Q

Gradient boosting

A

A boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it, and therefore compensate for them. Its base learner trees are known as "weak learners" or "decision stumps"; they are generally very shallow.

86
Q

Difference between Adaboost vs Gradient Boosting

A

Gradient Boosting is different from adaptive boosting because instead of assigning weights to incorrect predictions, each base learner in the sequence is built to predict the residual errors of the model that preceded it.

87
Q

Gradient boosting advantages

A
  1. High accuracy
  2. Scalable
  3. Works well with missing data
  4. GBMs don't require the data to be scaled, and they can handle outliers easily
88
Q

XGBoost

A

Extreme Gradient Boosting: an optimized gradient boosting implementation used to build and tune GBM models.

89
Q

GBM Hyperparameters

A

- max_depth
- n_estimators
- learning_rate
- min_child_weight
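
A minimal sketch assuming the xgboost package's scikit-learn wrapper; the values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The four hyperparameters named on this card, set to plausible values.
xgb = XGBClassifier(max_depth=3, n_estimators=300,
                    learning_rate=0.1, min_child_weight=5)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```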

90
Q

XGBoost Max Depth

A

controls how deep each base learner tree will grow. The best way to find this value is through cross-validation. The model’s final max depth value is usually low.

91
Q

XGBoost n_estimators

A

The number of estimators, i.e., the maximum number of base learners that the ensemble will grow. This is best determined using grid search.
- For smaller datasets, more trees may be better than fewer.
- For very large datasets, the opposite could be true. Typical ranges are 50-500.

92
Q

XGBoost learning_rate (shrinkage)

A

Values can range over (0, 1]. The learning rate indicates how much weight the model should give to each consecutive base learner's prediction.
- Lower learning rates mean that each subsequent tree contributes less to the ensemble's final prediction.
- This helps prevent over-correction and overfitting.
- Another common name for this concept is shrinkage, because less and less weight is given to each consecutive tree's prediction in the final ensemble.

93
Q

XGBoost min_child_weight

A

This is a regularization hyperparameter. A tree will not split a node if the split would result in any child node with less weight than the value you specify in this hyperparameter; instead, the node becomes a leaf.

94
Q

What does higher min_child_weight value do?

A

Higher values stop trees from splitting further. If the model is overfitting, increase this value to stop your trees from becoming too finely divided.

95
Q

What does lower min_child_weight value do?

A

Lower values allow trees to continue splitting further. If your model is underfitting, you may want to lower this value to allow for more complexity.

96
Q

What’s the ideal approach to model selection?

A
  1. Split the data into training, validation, and test sets
  2. Tune hyperparameters using cross-validation on the training set
  3. Use all tuned models to predict on the validation set
  4. Select a champion model based on performance on the validation set
  5. Use the champion model alone to predict on the test data (sketched below)
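
A compact sketch of the five steps, assuming scikit-learn and two illustrative candidate model families:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Split into train / validation / test.
X, y = load_breast_cancer(return_X_y=True)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# 2. Tune each candidate with cross-validation on the training set.
candidates = [
    GridSearchCV(DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}, cv=5),
    GridSearchCV(RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}, cv=5),
]
for search in candidates:
    search.fit(X_train, y_train)

# 3-4. Score all tuned models on the validation set; pick the champion.
champion = max(candidates, key=lambda s: s.score(X_val, y_val))

# 5. Only the champion ever touches the test data.
print(champion.score(X_test, y_test))
```
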
97
Q

What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

A

Pros:

  • The coding workload is reduced.
  • The scripts for data splitting are shorter.
  • It’s only necessary to evaluate test dataset performance once, instead of two evaluations (validate and test).

Cons:

  • If a model is evaluated using samples that were also used to build or fine-tune that model, it likely will provide a biased evaluation.
  • A potential overfitting issue could arise, since model selection would be based on scores from the test data.