Decision Tree Modeling Flashcards

Question 1

Q

What is tree-based learning? What does it do and how?

Answer

A

Tree-based learning is a type of supervised machine learning.
- It performs classification and regression tasks.
- It uses a decision tree as a predictive model to go from observations about an item (represented by the branches) to conclusions about the item’s target value (represented by the leaves).

Question 2

Q

Ensemble Learning

Answer

A

Techniques that enable you to use multiple decision trees simultaneously to produce very powerful models.

Question 3

Q

What’s the benefit of hyperparameter tuning?

Answer

A

Knowing how and when to tune a model can help increase its performance significantly

Question 4

Q

What is a Decision Tree?

Answer

A

non-parametric supervised learning algorithm (not based on assumptions about distribution)
for classification and regression tasks
It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

Question 5

Q

How do data professionals use decision trees?

Answer

A

to make predictions about future events based on the information that is currently available.

Question 6

Q

Decision Tree PROs

Answer

A

Require no assumptions about the data’s distribution.
Handle collinearity easily.
Require little preprocessing to prepare data for training.

Question 7

Q

Decision Tree CONs

Answer

A

susceptible to overfitting.
sensitive to variations in the training data.

The model might get extremely good at predicting seen (training) data, but as soon as new data is introduced, it may not work nearly as well.

Question 8

Q

What is made at each node?

Answer

A

Decisions are made at each node.

Question 9

Q

Edges

Answer

A

The edges connect the nodes, essentially directing from one node to the next along the tree.

Question 10

Q

What is a Root Node?

Answer

A

It’s the first node in the tree
all decisions needed to make the prediction will stem from it
It’s a special type of decision node because it has no predecessors.

Question 11

Q

What is a Decision Node?

Answer

A

All the nodes above the leaf nodes.
The nodes where a decision is made
always point to a leaf node or other decision nodes within the tree.

Question 12

Q

Leaf Node

Answer

A

where a final prediction is made.
The whole process ends here as they do not split anymore

Question 13

Q

What are Child Nodes?

Answer

A

Any node that results from a split.
The nodes that are pointed to, which are either leaf nodes or other decision nodes.

Question 14

Q

What are Parent Nodes?

Answer

A

The node that a child node splits from.

Question 15

Q

What types of prediction outcomes can decision trees be used for?

Answer

A

classification: where a specific class or outcome is predicted
regression: where a continuous variable is predicted—like the price of a car.

Question 16

Q

What is the criterion for splitting a decision node?

Answer

A

A decision node is split on the criterion that minimizes the impurity of the classes in its resulting children.

Question 17

Q

What is Impurity?

Answer

A

the degree of mixture with respect to class.
A perfect split would have no impurity in the resulting child nodes; it would partition the data with each child containing only a single class.

Question 18

Q

Name 4 metrics to determine impurity

Answer

A

Gini impurity
entropy
information gain
log loss

Question 19

Q

What’s the requirement for choosing split points?

Answer

A

identify what type of variable it is—categorical or continuous
the range of values that exist for that variable

Question 20

Q

Choosing split for categorical predictor variable

Answer

A

Consider splitting based on the categories of the variable, e.g., color.

Question 21

Q

Choosing split for continuous predictor variable

Answer

A

splits can be made anywhere along the range of numbers that exist in the data

E.g., sorting the fruit based on diameter: 2.25, 2.75, 3.25, 3.75, 5, and 6.5 centimeters.
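
A minimal sketch of how candidate split points could be enumerated for the diameter example. It assumes the common convention of testing the midpoint between each pair of adjacent sorted values; the card itself only says a split can fall anywhere in the range. The function name `candidate_thresholds` is hypothetical.

```python
def candidate_thresholds(values):
    """Return midpoints between adjacent distinct values (assumed convention)."""
    ordered = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(ordered, ordered[1:])]


# Fruit diameters in centimeters, taken from the card above.
diameters = [2.25, 2.75, 3.25, 3.75, 5, 6.5]
thresholds = candidate_thresholds(diameters)
print(thresholds)  # [2.5, 3.0, 3.5, 4.375, 5.75]
```

Each threshold defines one possible rule of the form "diameter <= threshold", so six distinct values yield five candidate splits.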

Question 22

Q

Describe Gini impurity score

Answer

A

The most straightforward impurity metric.
The best scores are those closest to 0.
For a two-class problem, the worst score is 0.5, which occurs when each child node contains an equal number of each class.
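
The scores described above can be checked with a short sketch. It uses the standard definition of Gini impurity (1 minus the sum of squared class proportions) and weights each child by its share of the samples; the function names and the fruit labels are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: 1 - sum(p_class ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


pure = gini_impurity(["apple"] * 4)                          # single class
mixed = gini_impurity(["apple", "apple", "grape", "grape"])  # 50/50 mix
split_score = weighted_gini(["apple"] * 3, ["grape", "grape", "apple"])
print(pure, mixed)  # 0.0 0.5
```

A tree would compare `weighted_gini` across candidate splits and keep the split with the lowest value.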

Question 23

Q

Classification trees PROs

Answer

A

Require few pre-processing steps.
Can work with all types of variables (continuous, categorical, discrete).
No normalization or scaling required
Decisions are transparent.
Not affected by extreme univariate values

Question 24

Q

Name 2 disadvantages of classification trees

Answer

A

Can be computationally expensive relative to other algorithms.
sensitive to data changes. Small changes in data can result in significant changes in predictions

Question 25

Q

What are Hyperparameters?

Answer

A

parameters that can be set before the model is trained
affect how the model fits the data
Help balance the model so that it neither underfits nor overfits the data.

Question 26

Q

What is Max Depth for decision trees?

Answer

A

how deep the tree is allowed to grow
The depth = number of levels between the root node and the farthest node
the root node is level 0

Question 27

Q

Max Depth PROs

Answer

A

reduce overfitting problems by limiting how deep the tree will go
it can reduce the computational complexity of training and using the model

Question 28

Q

Min Samples Leaf

Answer

A

the minimum number of samples that must be in each child node after the parent splits.
split only if there are enough samples in each of the result nodes to satisfy the required value.

Example
A decision node currently has 10 samples, but the min samples leaf hyperparameter is set to six. There is no way to split the data so that each child node has at least six samples, so no further split can take place.
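
The arithmetic behind this example, as a minimal sketch: a binary split is only possible if the node holds at least twice the minimum leaf size. The helper name `can_split` is hypothetical and ignores every other stopping rule.

```python
def can_split(n_samples, min_samples_leaf):
    """True if some binary split can leave min_samples_leaf in both children."""
    return n_samples >= 2 * min_samples_leaf


blocked = can_split(n_samples=10, min_samples_leaf=6)  # the example above
allowed = can_split(n_samples=12, min_samples_leaf=6)  # 6 + 6 is possible
print(blocked, allowed)  # False True
```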

Question 29

Q

What is GridSearch?

Answer

A

A tool to find the optimal values for a model’s hyperparameters.

Question 30

Q

What does GridSearch do?

Answer

A

It confirms that a model achieves its intended goal
by systematically checking every combination of hyperparameters
to identify which set produces the best results based on the selected metric.
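
A minimal sketch of the exhaustive search described above, written with the standard library only. The scoring function `toy_score` is a made-up stand-in for a real validation metric, and `grid_search` is a hypothetical helper, not any library's API.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical metric that peaks at max_depth=4, min_samples_leaf=5."""
    return -abs(params["max_depth"] - 4) - abs(params["min_samples_leaf"] - 5) / 10


grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)  # {'max_depth': 4, 'min_samples_leaf': 5}
```

With three values per hyperparameter the search scores 3 x 3 = 9 combinations, which is why grids grow expensive as more hyperparameters are added.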

Question 31

Q

What is an overfit model and how can it be identified?

Answer

A

A model that learns the training data so closely that it captures more than the intrinsic patterns of all such data distributions.
It scores very well on the training data but considerably worse on unseen data because it cannot generalize well.
Identify it when training accuracy is very high (close to 1) while accuracy on unseen data is considerably lower.

Question 32

Q

What is an under-fitted model and how can it be identified?

Answer

A

model does not learn the patterns and characteristics of the training data well, and consequently fails to make accurate predictions on new data.
easier to identify underfitting, because the model performs poorly on both training and test data

Question 33

Q

Name 3 hyperparameters of a Decision tree

Answer

A

Max Depth
min samples split
Min Samples Leaf

Question 34

Q

CON of increasing Max Depth

Answer

A

overfitting

As you increase the max depth parameter, the performance of the model on the training set will continue to increase. It’s possible for a tree to grow so deep that leaves contain just a single sample. However, this overfits the model to the training data, and the performance on the testing data would probably be much worse.

Question 35

Q

Min Samples Split

Answer

A

minimum number of samples the parent node must have before splitting

If you set this to 10, then any node that contains nine or fewer samples will automatically become a leaf node and will not continue splitting.

Question 36

Q

What is the max and min number that min samples split can have?

Answer

A

Min: 2 is the smallest number that can be divided into two separate child nodes.

Max: The greater the value you use the sooner the tree will stop growing.

Question 37

Q

What is regularization?

Answer

A

the process of reducing model complexity to prevent overfitting.
Regularization helps to make the model more generalizable to new data
regularization trades a marginal decrease in training accuracy for an increase in generalizability.

Question 38

Q

How does regularization prevent overfitting in machine learning models?

Answer

A

Regularization introduces penalty terms to the model’s loss function
- discouraging overly complex solutions and
- promoting better generalization to unseen data.

Question 39

Q

What is cross-validation, and why is it important in model evaluation?

Answer

A

Cross-validation is a technique used to assess a model’s performance by dividing the data into multiple subsets (folds) for training and testing.

It helps to estimate how well a model will generalize to new data and mitigates the risk of overfitting by using different subsets for training and testing.

Question 40

Q

What is Model validation process?

Answer

A

the whole process of
- evaluating different models
- selecting best model and then
- continuing to analyze the performance of the selected model to better understand its strengths and limitations.

Question 41

Q

Explain Validation Dataset (Separate Validation)

Answer

A

The simplest way to maintain the objectivity of the test data
is to create another partition in the data—a validation set—and
save the test data for after you select the final model.

The validation set is then used, instead of the test set, to compare different models.

Question 42

Q

What happens when splitting data with Validation (Separate Validation)?

Answer

A

With validation, the data is split into three sets:
1. Train: used to train all models of interest
2. Validation: used to evaluate the models, leaving the test set untouched
3. Test: used only after the final model is selected
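
A minimal sketch of the three-way split, using only the standard library. The 60/20/20 proportions, the seed, and the helper name `train_val_test_split` are assumptions for illustration; the card does not prescribe any of them.

```python
import random


def train_val_test_split(rows, val_frac, test_frac, seed=0):
    """Shuffle a copy of rows, then carve off test and validation partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 observations
train, val, test = train_val_test_split(rows, val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))  # 60 20 20
```

Every observation lands in exactly one partition, so the test set stays untouched while models are compared on the validation set.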

Question 43

Q

What happens when splitting data for Cross-validation?

Answer

A

process that uses different folds of the data to test and train a model across several iterations.
avoids having to split the data into three partitions (train / validate / test) in advance.

Question 44

Q

Cross-validation folds

Answer

A

Instead of having one validation set to evaluate the model, the training data is split into multiple sections known as folds.
Then the model is trained on different combinations of these folds.
The training process occurs k times, each time using a different fold as the validation set.
At the end, the final validation score is the average of all k scores.
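
The fold rotation described above can be sketched as follows. The helper `k_fold_indices` is hypothetical, and it splits the indices in order without shuffling, which is a simplifying assumption.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


splits = k_fold_indices(n_samples=10, k=5)
for train_idx, val_idx in splits:
    print(val_idx, "held out;", len(train_idx), "used for training")
```

In practice a model would be fit on each `train_idx`, scored on the matching `val_idx`, and the k scores averaged to give the final validation score.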

Question 45

Q

Cross-validation PROs

Answer

A

useful when working with smaller datasets
as it maximizes the utility of the data available. More so than standard validation.

Question 46

Q

Cross-validation CONs

Answer

A

not necessary when working with very large datasets

Question 47

Q

What does it mean to split the data and what is done after the split?

Answer

A

- Split a dataset into training and testing data.
- Then, you fit a model to the training data and
- evaluate its performance on the test data.

Question 48

Q

Name 2 Model Validations Methods

Answer

A

1. Validation sets (Separate Validation)
2. Cross-validation

Question 49

Q

When should you use Separate Validation?

Answer

A

When you have a very large dataset.
The reason for this is that the more data you use for validation, the less you have for training and testing.

Question 50

Q

Model Validation Best Practice

Answer

A

once the final model is selected, best practice is to go back and fit the selected model to all the non-test data (i.e., the training data + validation data) before scoring this final model on the test data.

Question 51

Q

When is test data used in Model Validation process?

Answer

A

Test data should not be used to select a final model.
The test data is used only to score the final model.
Your model’s score on this data is how you can expect the model to perform on completely new data.

Question 52

Q

What type of model is a Random Forest?

Answer

A

popular ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time.

It’s essentially a collection of decision trees, each trained on a random subset of the data and features. It is one methodology for building tree-based ensemble models.

Question 53

Q

Ensemble learning

Answer

A

involves building multiple models and then aggregating their outputs to make a final prediction

Question 54

Q

Ensemble learning PROs

Answer

A

powerful because it combines the results of many models to help make more reliable final predictions,
plus these predictions have less bias and lower variance than other standalone models.
predictions using an ensemble of models are very accurate even when the individual models themselves are barely more accurate than a random guess.

Question 55

Q

Ensemble learning Best Practice

Answer

A

A best practice when building an ensemble is to use very different methodologies for each model it contains, such as a logistic regression, a Naive Bayes model, and a decision tree classifier. In this way, when the models make errors, and they always will, the errors will be uncorrelated.

The goal is for them to not all make the same errors for the same reasons.

Question 56

Q

Base Learner

Answer

A

is any individual model in an ensemble.

Question 57

Q

3 Methods of Ensemble Learning

Answer

A

Bagging, Boosting, Stacking

Question 58

Q

Bootstrap

Answer

A

Each base learner samples from the data with replacement. For bagging, this means the various base learners can all sample the same observation, and a single learner can sample that observation multiple times during training.
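
A minimal sketch of sampling with replacement using the standard library. The five-item dataset, the seed, and the helper name `bootstrap_sample` are illustrative assumptions.

```python
import random


def bootstrap_sample(observations, rng):
    """Draw len(observations) items with replacement."""
    return rng.choices(observations, k=len(observations))


rng = random.Random(0)  # fixed seed so the draw is repeatable
data = ["a", "b", "c", "d", "e"]
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # one per base learner
for sample in samples:
    print(sample)
```

Because the draws are independent, one observation may appear several times in a sample while another is left out entirely.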

Question 59

Q

Aggregation

Answer

A

The aggregation part of bagging refers to the fact that the predictions of all the individual models are aggregated to produce a final prediction.

Question 60

Q

Name a popular aggregation method for Classification

Answer

A

it’s often whichever class receives the most predictions, which is the mode.

Question 61

Q

Aggregation for Regression

Answer

A

this is typically the average of all the predictions.
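
Both aggregation rules, the mode for classification and the average for regression, fit in a few lines. This is a sketch with hypothetical function names and made-up predictions; ties in the vote are resolved by whichever class was seen first, which is an arbitrary choice.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(predictions):
    """Return the most frequently predicted class (the mode)."""
    return Counter(predictions).most_common(1)[0][0]


def aggregate_regression(predictions):
    """Return the average of the base learners' predictions."""
    return mean(predictions)


vote = aggregate_classification(["spam", "ham", "spam"])
average = aggregate_regression([10.0, 12.0, 14.0])
print(vote, average)  # spam 12.0
```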

Question 62

Q

Random Forest

Answer

A

Bagging + random feature sampling. Random forest takes the randomization from bagging one step further and randomizes the features used to train each base learner too.

Question 63

Q

Bagging

Answer

A

A technique used by certain kinds of models that use ensembles of base learners to make predictions; refers to the combination of bootstrapping and aggregating

Question 64

Q

What does Bagging stand for?

Answer

A

Bootstrap aggregating

Answer 65

A

Bagging = base learners are trained on data that is randomized by observation.
Random forest takes the randomization from bagging one step further. It randomizes the data by features too.
- A regular decision tree model will seek the best feature to use to split a node.
- A random forest model will grow each of its trees by taking a random subset of the available features in the training data and then splitting each node at the best feature available to that tree.
- This means that each base learner in a random forest model has different combinations of features available to it, which helps to prevent the problem of correlated errors between learners in the ensemble. Each individual base learner is a decision tree. It may be fully grown, so each leaf is a single observation or it may be very shallow depending on how you choose to tune your model.

Answer 66

A

Reduces variance: Standalone models can result in high variance. Aggregating base models’ predictions in an ensemble helps reduce it.
Fast: Training can happen in parallel across CPU cores and even across different servers.
Good for big data: Bagging doesn’t require an entire training dataset to be stored in memory during model training.

Answer 67

A

leverage randomness to reduce the likelihood that a given base learner will make the same mistakes as other base learners.

When mistakes between learners are uncorrelated, it reduces both bias and variance.

In bagging, this randomization occurs by training each base learner on a sampling of the observations, with replacement.

performance scores and faster execution times

Answer 68

A

For car data, a random forest model of 3 base learners, each trained on bootstrapped samples of 3 observations and 2 features ->

Observation 1: Feature = mile and price.
Obs. 2: year, mile.
Obs 3: model, price
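
A rough sketch of how the sampling in this car example could be planned: each base learner gets 3 bootstrapped observations and 2 randomly chosen features. The dataset size of 5 rows, the seed, and the helper name `plan_base_learners` are assumptions; the feature names are the ones listed on the card.

```python
import random


def plan_base_learners(n_rows, features, n_learners, n_obs, n_features, seed=0):
    """Pick row indices (with replacement) and features (without) per learner."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_learners):
        rows = rng.choices(range(n_rows), k=n_obs)  # bootstrap: repeats allowed
        cols = rng.sample(features, n_features)     # distinct features
        plans.append({"rows": rows, "features": cols})
    return plans


car_features = ["year", "mile", "model", "price"]
plans = plan_base_learners(
    n_rows=5, features=car_features, n_learners=3, n_obs=3, n_features=2
)
for plan in plans:
    print(plan)
```

Each plan would then be used to train one decision tree, so the trees see different rows and different feature pairs.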

Answer 69

A

doesn’t affect prediction.
not only is it possible for model scores to improve with sampling, but they also require significantly less time to run since each tree is built from less data.

Answer 70

A

max_depth
min_samples_leaf
min_samples_split
max_features
n_estimators (number of estimators)

Answer 71

A

controls the randomness of the trees.
specifies the number of features that each tree selects randomly from the training data to determine its splits.

Answer 72

A

controls how many decision trees your model will build for its ensemble.

For example, if you set your number of estimators to 300, your model will train 300 individual trees.

Answer 73

A

is a supervised learning technique where you build an ensemble of weak learners. This is done sequentially with each consecutive base learner trying to correct the errors of the one before.

Answer 74

A

A model that performs slightly better than randomly guessing

Answer 75

A

Like random forest, boosting is an
- ensembling technique, and it
- also builds many weak learners,
- then aggregates their predictions.

Answer 76

A

Unlike random forest, which builds base learners in parallel, boosting builds learners sequentially. Also, the methodology you choose for the weak learner isn’t limited to tree-based methods.

Answer 77

A

Adaptive boosting (AdaBoost)
Gradient boosting

Answer 78

A

is a tree-based boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner. This process repeats until either a tree makes a perfect prediction or the ensemble reaches the maximum number of trees, which is a hyperparameter that is specified by the data professional.
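
One round of the reweighting described above, as a minimal sketch. It assumes the classic two-class AdaBoost update (learner weight alpha = 0.5 * ln((1 - error) / error)), which the card does not spell out; the helper name `adaboost_round` and the five-observation example are hypothetical.

```python
import math


def adaboost_round(weights, is_correct):
    """Return (alpha, new_weights) after one base learner's predictions."""
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if error == 0 or error >= 0.5:
        raise ValueError("sketch expects a weak learner with 0 < error < 0.5")
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified observations are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5  # five observations, equal starting weight
alpha, new_weights = adaboost_round(weights, [True, True, False, True, True])
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

Here the single misclassified observation ends up with weight 0.5 and the other four with 0.125 each, so the next learner concentrates on the earlier mistake.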

Answer 79

A

Both classification and regression problems; hence aggregation differs depending on the problem type.

Answer 80

A

the ensemble uses a voting process that places weight on each vote. Base learners that make more accurate predictions are weighted more heavily in the final aggregation.

Answer 81

A

the model calculates a weighted mean prediction for all the trees in the ensemble.

Answer 82

A

You can’t train your model in parallel across many different servers, because each model in the ensemble is dependent on the one that preceded it.
- This means that in terms of computational efficiency, it doesn’t scale well to very large datasets when compared to bagging.

Answer 83

A

1. Accurate.
2. Being based on an ensemble of weak learners means that the problem of high variance is reduced.
3. This is because no single tree weighs too heavily in the ensemble.
4. Reduces bias.
5. Easy to understand, and doesn’t require the data to be scaled or normalized.
6. Can handle both numeric and categorical features.
7. Can still function well even with multicollinearity among the features.
8. Robust to outliers.

Answer 84

A

Model ensembles that use gradient boosting

Answer 85

A

boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it and therefore compensate for it. Its base learner trees are known as “weak learners” or “decision stumps.” They are generally very shallow.

Answer 86

A

Gradient Boosting is different from adaptive boosting because instead of assigning weights to incorrect predictions, each base learner in the sequence is built to predict the residual errors of the model that preceded it.
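
A minimal sketch of fitting each learner to the residuals of the ensemble so far. It assumes one numeric feature with at least two distinct values, squared-error regression, depth-1 trees (stumps), and a `learning_rate` that scales each stump's contribution; all function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, by squared error."""
    best = None
    ordered = sorted(set(xs))
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators, learning_rate):
    """Start from the mean, then add stumps trained on current residuals."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


xs, ys = [1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0]
one_round = gradient_boost(xs, ys, n_estimators=1, learning_rate=0.5)
ten_rounds = gradient_boost(xs, ys, n_estimators=10, learning_rate=0.5)
print(one_round(1), ten_rounds(1))
```

With a learning rate of 0.5, each round removes half of the remaining error on this toy data, so more rounds move the prediction for x = 1 closer to the target of 1.0.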

Answer 87

A

1. High accuracy.
2. Scalable.
3. Work well with missing data.
4. Don’t require the data to be scaled, and can handle outliers easily.

Answer 88

A

Extreme Gradient Boosting (XGBoost), an optimized implementation of gradient boosting used to build and tune GBM models.

Answer 89

A

- max_depth
- n_estimators
- learning_rate
- min_child_weight

Answer 90

A

controls how deep each base learner tree will grow. The best way to find this value is through cross-validation. The model’s final max depth value is usually low.

Answer 91

A

The number of estimators, or maximum number of base learners that the ensemble will grow. This is best determined using grid search.
- For smaller datasets, more trees may be better than fewer.
- For very large datasets, the opposite could be true. Typical ranges are 50-500.

Answer 92

A

Values lie in the range (0, 1]. The learning rate indicates how much weight the model should give to each consecutive base learner’s prediction.
- Lower learning rates mean that each subsequent tree contributes less to the ensemble’s final prediction.
- This helps prevent over-correction and overfitting.
- Another common name for this concept is shrinkage, because less and less weight is given to each consecutive tree’s prediction in the final ensemble.

Answer 93

A

This is a regularization parameter. A tree will not split a node if the split results in any child node with less weight than what you specify in this hyperparameter; instead, the node becomes a leaf.

Answer 94

A

Higher values will stop trees from splitting further. If the model is overfitting, increase this value to stop your trees from getting too finely divided.

Answer 95

A

Lower values will allow trees to continue to split further. If your model is underfitting, then you may want to lower it to allow for more complexity.

Answer 96

A

Split the data into training, validation, and test sets
Tune hyperparameters using cross-validation on the training set
Use all tuned models to predict on the validation set
Select a champion model based on performance on the validation set
Use champion model alone to predict on test data

Answer 97

A

Pros:

The coding workload is reduced.
The scripts for data splitting are shorter.
It’s only necessary to evaluate test dataset performance once, instead of two evaluations (validate and test).

Cons:

If a model is evaluated using samples that were also used to build or fine-tune that model, it likely will provide a biased evaluation.
A potential overfitting issue could happen when fitting the model’s scores on the test data.