Machine Learning with SAS® Viya® 3.4 Lesson 3: Decision Trees and Ensembles of Trees Flashcards
What is a greedy algorithm?
one that makes locally optimal choices at each step
How does a decision tree predict cases?
decision trees use rules that involve the values or categories of the input variables
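A scored decision tree is just a cascade of such rules: each rule tests an input value, and the path ends in a leaf that supplies the prediction. A minimal sketch (the variables, thresholds, and class labels here are hypothetical, not from the course data):

```python
def predict(case):
    """Hypothetical scored decision tree: a cascade of rules on the
    input values; the leaf reached supplies the prediction."""
    if case["income"] < 40000:
        if case["age"] < 25:
            return "high risk"   # leaf node
        return "medium risk"     # leaf node
    return "low risk"            # leaf node

print(predict({"income": 35000, "age": 22}))  # high risk
```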
What is a decision tree referred to as when the target is categorical?
A classification tree
Name the first node at the base (top) of the tree
root node
What is a decision tree referred to as when the target is continuous?
A regression tree
What is a leaf node?
a terminal node with no successor nodes; its only connection is to its parent
Which component of a decision tree provides the predictions?
A tree’s leaf nodes provide the predictions.
How do decision trees address the curse of dimensionality?
The split search process reduces the number of inputs in the model by eliminating irrelevant inputs. Irrelevant inputs do not appear in any splitting rules in the decision tree.
How does a decision tree handle missing values?
The split search for decision trees treats missing values as a category of their own, assigning them to one side of the branch at the splitting node, so missing values can be used directly in the splitting rules.
The input variables have missing values. What should you do before running a Decision Tree node with these input variables?
Nothing. There is no need to impute any missing values because trees can handle them
What does Model Studio display in the Tree Diagram?
the final tree structure for this particular model, such as the depth of the tree and all end leaves
What is a reduction in node impurity?
the reduction of within-node variability induced by the split
What is a surrogate rule?
A surrogate splitting rule is a backup to the main splitting rule.
When surrogate rules are requested, if a new case has a missing value on the splitting variable, then the best surrogate is used to classify the case.
If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the observation.
If none can be applied, the main rule assigns the observation to the branch that is designated for missing values.
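The fallback sequence described above (main rule, then each surrogate in order, then the designated missing-value branch) can be sketched as follows. This is an illustrative sketch, not the Model Studio implementation; the rule representation and variable names are assumptions:

```python
def apply_rule(rule, case):
    """Apply one splitting rule (variable, threshold). Returns 'left' or
    'right', or None if the rule's variable is missing for this case."""
    var, threshold = rule
    value = case.get(var)
    if value is None:
        return None
    return "left" if value < threshold else "right"

def route_case(case, main_rule, surrogate_rules, missing_branch="left"):
    """Try the main rule first; on a missing value, try each surrogate
    in sequence; if none applies, use the branch that the main rule
    designates for missing values."""
    branch = apply_rule(main_rule, case)
    if branch is not None:
        return branch
    for surrogate in surrogate_rules:
        branch = apply_rule(surrogate, case)
        if branch is not None:
            return branch
    return missing_branch

# income is missing, so the first applicable surrogate (age) routes the case
print(route_case({"income": None, "age": 30},
                 main_rule=("income", 50000),
                 surrogate_rules=[("age", 40)]))  # left
```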
How do you interpret a Gini index?
The Gini index can be interpreted as the probability that any two elements of a group, chosen at random (with replacement), are different.
A pure node (with no diversity) has a Gini index of 0. As the number of evenly distributed classes increases, the Gini index approaches 1 (more diverse, less pure).
In other words, if we randomly select two observations from a group, the Gini index is the chance that the two observations differ from each other.
What is a Gini index?
a Gini index is a measure of variability for categorical data that can be used as a measure of node impurity
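The two cards above translate directly into a formula: the Gini index is 1 minus the sum of squared class proportions. A minimal sketch (assuming class labels in a plain Python list):

```python
from collections import Counter

def gini_index(labels):
    """Gini index: the probability that two elements drawn at random
    (with replacement) from the group belong to different classes.
    Equals 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["yes"] * 10))               # 0.0  (pure node)
print(gini_index(["yes"] * 5 + ["no"] * 5))   # 0.5  (50/50 split)
```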
Which plot shows a Decision Tree model’s performance based on the misclassification rate?
the pruning error plot
What does the Cumulative Lift chart in the Assessment tab show?
how much better the model is than no model / random events
the model’s performance ordered by the percentage of the population
How can you set the maximum number of generations of nodes (the tree depth) for a decision tree in Model Studio?
Expand the Splitting Options properties and set Maximum depth
Where would you evaluate model performance based on an assessment measure such as average squared error?
the fit statistics table
Where would you look to see the input variables that are most significant to the final model?
the Variable Importance table
What is the standard method used to fit decision trees?
Recursive partitioning
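Recursive partitioning ties the earlier cards together: a greedy split search picks the split with the largest reduction in node impurity, then the same procedure is applied to each branch until a stopping rule (such as maximum depth or purity) is reached. A toy sketch for one numeric input and a categorical target (an illustration of the idea, not the Model Studio algorithm):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(xs, ys):
    """Greedy split search: try each candidate threshold and keep the
    one with the largest reduction in node impurity (weighted Gini)."""
    best, parent, n = None, gini(ys), len(ys)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or parent - child > best[1]:
            best = (t, parent - child)
    return best

def grow(xs, ys, depth=0, max_depth=2):
    """Recursive partitioning: split, then recurse on each branch until
    the node is pure or the maximum depth is reached."""
    split = best_split(xs, ys) if depth < max_depth and gini(ys) > 0 else None
    if split is None:
        return Counter(ys).most_common(1)[0][0]  # leaf: most common class
    t, _ = split
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return {"threshold": t,
            "left": grow(*zip(*left), depth + 1, max_depth),
            "right": grow(*zip(*right), depth + 1, max_depth)}

print(grow([1, 2, 8, 9], ["a", "a", "b", "b"]))
```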
Allowing a larger tree to be grown by increasing the maximum depth could lead to what problem?
overfitting
What setting can you change to help prevent overfitting?
Increase the minimum leaf size
What is the response of the ensemble of simple decision trees for an interval target?
For an interval target, the response of the ensemble model is the average of the estimates of the individual decision trees.
What is the response of the ensemble of simple decision trees for a categorical target?
For a categorical target, the response of the ensemble of simple decision trees is the vote for the most popular class or the average of the posterior probabilities of the individual trees.
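All three ways of combining the individual trees' responses (averaging estimates for an interval target; voting or averaging posterior probabilities for a categorical target) are simple aggregations. A minimal sketch, assuming the individual trees' outputs are already collected in plain Python lists:

```python
from collections import Counter

def ensemble_interval(estimates):
    """Interval target: average the individual trees' estimates."""
    return sum(estimates) / len(estimates)

def ensemble_class_vote(predictions):
    """Categorical target: vote for the most popular class."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_class_prob(posteriors):
    """Categorical target (alternative): average the posterior
    probabilities of the individual trees for each class."""
    n = len(posteriors)
    return {cls: sum(p[cls] for p in posteriors) / n for cls in posteriors[0]}

print(ensemble_interval([1.0, 2.0, 3.0]))         # 2.0
print(ensemble_class_vote(["yes", "no", "yes"]))  # yes
print(ensemble_class_prob([{"yes": 0.8, "no": 0.2},
                           {"yes": 0.4, "no": 0.6}]))
```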