Machine Learning with Viya® 3.4 Lesson 3: Decision Trees and Ensembles of Trees Flashcards
What is a greedy algorithm?
one that makes locally optimal choices at each step
How does a decision tree predict cases?
decision trees use rules that involve the values or categories of the input variables
What is a decision tree referred to as when the target is categorical?
A classification tree
Name the first node at the base (top) of the tree
root node
What is a decision tree referred to as when the target is continuous?
A regression tree
What is a leaf node?
a terminal node: it has no child nodes, so its only connection is to its parent
Which component of a decision tree provides the predictions?
A tree’s leaf nodes provide the predictions.
How do decision trees address the curse of dimensionality?
The split search process reduces the number of inputs in the model by eliminating irrelevant inputs. Irrelevant inputs do not appear in any splitting rules in the decision tree.
How does a decision tree handle missing values?
During the split search, a decision tree treats missing values as a category of their own and assigns them to one branch of the splitting node.
The input variables have missing values. What should you do before running a Decision Tree node with these input variables?
Nothing. There is no need to impute any missing values because trees can handle them
What does Model Studio display in the Tree Diagram?
the final tree structure for this particular model, such as the depth of the tree and all end leaves
What is a reduction in node impurity?
the reduction of within-node variability induced by the split
What is a surrogate rule?
A surrogate splitting rule is a backup to the main splitting rule.
When surrogate rules are requested, if a new case has a missing value on the splitting variable, then the best surrogate is used to classify the case.
If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the observation.
If none can be applied, the main rule assigns the observation to the branch that is designated for missing values.
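The fallback order described above can be sketched in a few lines. This is only an illustration: the rule format (a variable name plus a numeric threshold), the function name route, and the branch labels are assumptions, not Model Studio's internal representation.

```python
def route(case, main_rule, surrogates, missing_branch):
    """Pick a branch for one case, trying the main rule, then each surrogate in order.

    case: dict of input name -> value (None means missing).
    main_rule, surrogates: (input_name, threshold) pairs (hypothetical format).
    missing_branch: branch the main rule designates for missing values.
    """
    for name, threshold in [main_rule] + list(surrogates):
        value = case.get(name)
        if value is not None:
            return "left" if value <= threshold else "right"
    # No rule could be applied: use the branch designated for missing values.
    return missing_branch

case = {"income": None, "age": 30}
print(route(case, ("income", 50000), [("age", 40)], missing_branch="right"))
```

Here income is missing, so the first surrogate (age) decides the branch.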
How do you interpret a Gini index?
The Gini index can be interpreted as the probability that any two elements of a group, chosen at random (with replacement), are different.
A pure node (with no diversity) has a Gini index of 0. As the number of evenly distributed classes increases, the Gini index approaches 1 (more diverse, less pure).
Equivalently, if two observations are drawn at random from a group, the Gini index is the probability that they belong to different classes.
What is a Gini index?
a Gini index is a measure of variability for categorical data that can be used as a measure of node impurity
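A minimal sketch of the Gini index as described above, computed as 1 minus the sum of squared class proportions. The function name gini_index is chosen here for illustration.

```python
from collections import Counter

def gini_index(labels):
    """Probability that two draws (with replacement) fall in different classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["a", "a", "a", "a"]))  # pure node
print(gini_index(["a", "b", "a", "b"]))  # two evenly distributed classes
print(gini_index(["a", "b", "c", "d"]))  # four evenly distributed classes
```

With k evenly distributed classes the value is 1 - 1/k, which is why it approaches 1 as k grows.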
Which plot shows a Decision Tree model’s performance based on the misclassification rate?
the pruning error plot
What does the Cumulative Lift chart in the Assessment tab show?
how much better the model is than no model / random events
the model’s performance ordered by the percentage of the population
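One common way to compute cumulative lift at a given depth is the event rate among the top-scored cases divided by the overall event rate. The sketch below assumes a binary target coded 0/1 and uses a hypothetical function name; Model Studio's exact binning may differ.

```python
def cumulative_lift(y_true, scores, depth):
    """Lift for the top `depth` fraction of cases, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(depth * len(order)))          # number of cases in the top slice
    top_rate = sum(y_true[i] for i in order[:k]) / k
    overall_rate = sum(y_true) / len(y_true)       # what "no model" would achieve
    return top_rate / overall_rate

y = [1, 1, 0, 0]
s = [0.9, 0.8, 0.2, 0.1]
print(cumulative_lift(y, s, 0.5))
print(cumulative_lift(y, s, 1.0))
```

A lift of 1 means the model does no better than selecting cases at random.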
How can you set the maximum number of generations of nodes (the tree depth) for a decision tree in Model Studio?
Expand the Splitting Options properties and set the Maximum Depth
Where would you evaluate model performance based on an assessment measure such as average squared error?
the fit statistics table
Where would you look to see the input variables that are most significant to the final model?
the Variable Importance table
What is the standard method used to fit decision trees?
Recursive partitioning
Allowing a larger tree to be grown by increasing the maximum depth could lead to what problem?
overfitting
What setting can you change to help prevent overfitting?
Increase the minimum leaf size
What is the response of the ensemble of simple decision trees for an interval target?
For an interval target, the response of the ensemble model is the average of the estimate of the individual decision trees.
What is the response of the ensemble of simple decision trees for a categorical target?
For a categorical target, the response of the ensemble of simple decision trees is the vote for the most popular class or the average of the posterior probabilities of the individual trees.
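The three ways of combining tree outputs described above can be sketched as follows. The function names and the dictionary format for posterior probabilities are assumptions made for this example.

```python
from collections import Counter

def ensemble_interval(tree_estimates):
    """Interval target: average the estimates of the individual trees."""
    return sum(tree_estimates) / len(tree_estimates)

def ensemble_vote(tree_classes):
    """Categorical target: the most popular predicted class wins."""
    return Counter(tree_classes).most_common(1)[0][0]

def ensemble_posterior(tree_posteriors):
    """Categorical target: average each class's posterior probability across trees."""
    k = len(tree_posteriors)
    return {c: sum(p[c] for p in tree_posteriors) / k for c in tree_posteriors[0]}

print(ensemble_interval([1.0, 2.0, 3.0]))
print(ensemble_vote(["yes", "no", "yes"]))
print(ensemble_posterior([{"yes": 0.5, "no": 0.5}, {"yes": 1.0, "no": 0.0}]))
```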
What is bagging?
Bagging takes bootstrap samples of the rows of training data. All columns are considered for splitting at every step.
What is a random forest?
A forest is an ensemble of simple (classification or regression) decision trees
How does training different trees with different training data improve predictions for a forest?
Training different trees with different training data reduces the correlation of the predictions of the trees
What is an out-of-bag sample?
the training data that are excluded during the construction of an individual tree
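A small sketch of how a bootstrap sample and its out-of-bag rows relate, assuming rows are sampled with replacement and the sample has the same size as the training data. The helper name bootstrap_sample is hypothetical.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement; unused rows are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the example is repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)
print("in-bag rows:    ", in_bag)
print("out-of-bag rows:", out_of_bag)
```

Each tree gets its own bootstrap sample, so each tree also has its own out-of-bag rows.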
What data is used to assess the fit of a forest model?
the out-of-bag sample
How does Model Studio calculate the maximum number of inputs per split in a Forest Model when using the default settings?
By default, the number of inputs considered per split is the square root of the number of inputs
How does the forest algorithm sample the data?
The forest algorithm samples the rows and the columns at each step (leading to more perturbed data than the bagging algorithm)
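The column sampling can be sketched as below. Rounding the square root down is an assumption made here; the card only states that the default is the square root of the number of inputs. Both function names are hypothetical.

```python
import math
import random

def inputs_per_split(n_inputs):
    """Default number of candidate inputs per split (rounding down is assumed)."""
    return max(1, math.isqrt(n_inputs))

def candidate_inputs(n_inputs, rng):
    """Randomly choose which input columns are considered at one split."""
    return rng.sample(range(n_inputs), inputs_per_split(n_inputs))

rng = random.Random(1)
print(inputs_per_split(16))
print(candidate_inputs(16, rng))
```

Bagging alone would consider all 16 inputs at every split; the forest considers a fresh random subset each time.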
What additional chart is available when the target is binary?
the ROC curve
What does the ROC curve show?
the model’s performance considering the true positive rate and the false positive rate
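Each point on an ROC curve is a (false positive rate, true positive rate) pair at one classification cutoff. The sketch below assumes a 0/1 target with both classes present and a hypothetical function name.

```python
def roc_point(y_true, scores, cutoff):
    """Return (false positive rate, true positive rate) at one cutoff."""
    predicted = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

y = [1, 1, 0, 0]
s = [0.9, 0.4, 0.6, 0.1]
for cutoff in (0.0, 0.5, 1.0):
    print(cutoff, roc_point(y, s, cutoff))
```

Sweeping the cutoff from high to low traces the curve from (0, 0) to (1, 1).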
How does a split-search strategy work?
- Identify candidate splits based on the splitting criterion
- Select a split that is expressed as an IF-THEN-ELSE rule
- Repeat process for each child node, continuing until a stopping rule prevents further growth
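The steps above can be sketched as a small greedy recursive-partitioning routine. This is an illustration under stated assumptions, not Model Studio's algorithm: it uses the Gini index as the splitting criterion, numeric inputs only, binary splits, and maximum depth and minimum leaf size as the stopping rules. All names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, min_leaf=1):
    """Greedy search: return (impurity_reduction, input_index, threshold) or None."""
    n, parent, best = len(rows), gini(labels), None
    for j in range(len(rows[0])):
        # Candidate thresholds: every observed value except the largest.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if min(len(left), len(right)) < min_leaf:
                continue
            reduction = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or reduction > best[0]:
                best = (reduction, j, t)
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Recursively partition until a stopping rule prevents further growth."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None or split[0] <= 0:
        return {"predict": Counter(labels).most_common(1)[0][0]}  # leaf node
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "input": j, "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the IF-THEN-ELSE rules down to a leaf."""
    while "predict" not in tree:
        tree = tree["left"] if row[tree["input"]] <= tree["threshold"] else tree["right"]
    return tree["predict"]

tree = grow([[1.0], [2.0], [3.0], [4.0]], ["no", "no", "yes", "yes"])
print(tree)
print(predict(tree, [3.5]))
```

The search is greedy: each node picks its locally best split without revisiting earlier choices.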
What is the goal of splitting?
to reduce the variability of the target distribution and thus increase purity in the child nodes
What is a split search?
an iterative process used by recursive partitioning to select the best split for the node
Which splitting criteria may be used for categorical targets?
- Information gain ratio (IGR) (default in Model Studio)
- CHAID
- Chi-Square
- Entropy
- GINI
Which splitting criteria are appropriate for interval targets?
- Variance (default in Model Studio)
- CHAID
- Ftest
What is the purpose of the Bonferroni correction during a decision tree split search?
To maintain the overall confidence level across the many candidate splits that are compared, by inflating (adjusting upward) the p-values.
Which split criteria can request a Bonferroni correction after the split has been determined?
Split criteria using the p-value (chi-square, CHAID, or F test)
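A minimal sketch of the standard Bonferroni adjustment: multiply the p-value by the number of comparisons (capped at 1), which lowers the logworth. How Model Studio counts the comparisons is not stated in these cards, so n_tests is simply a parameter here, and the function names are hypothetical.

```python
import math

def bonferroni_adjust(p_value, n_tests):
    """Inflate a p-value to account for n_tests comparisons (capped at 1)."""
    return min(1.0, p_value * n_tests)

def logworth(p_value):
    """Logworth is -log10 of the p-value; larger means a more significant split."""
    return -math.log10(p_value)

raw = 0.001
adjusted = bonferroni_adjust(raw, n_tests=20)
print(adjusted, logworth(raw), logworth(adjusted))
```

An input that offers many candidate splits gets a larger penalty, so it is not favored merely for having more chances to look significant.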
Which window shows the score code for a specific node that may be deployed in production?
the Node Score Code window
When does Model Studio generate node score code?
Model Studio generates node score code for every node in the Data Mining Preprocessing group and the Supervised Learning group that creates DATA step score code.
What is another name for the “flow score code?”
Path EP Score Code
What is included in the Path EP Score Code?
score code for all the nodes up to and including that modeling node, to be used in other SAS environments
What does the ‘EP’ refer to in the term Path EP Score Code?
Embedded Process
Which window contains the SAS training code that may be used to train the model based on different data sets or platforms?
The Training Code window
What do large values of the F statistic indicate?
departures from the null hypothesis that all the node means are equal
What does the between-node sum of squares (SSbetween) measure?
the distance between the node means and the overall mean
What does the within-node sum of squares (SSwithin) measure?
the variability within a node
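The two sums of squares combine into the usual one-way ANOVA F statistic, sketched below for a candidate split whose child nodes hold the listed target values. It assumes at least two nodes and some within-node variability; the function name is hypothetical.

```python
def f_statistic(nodes):
    """nodes: list of lists of interval target values, one list per child node."""
    n = sum(len(g) for g in nodes)
    b = len(nodes)
    overall_mean = sum(sum(g) for g in nodes) / n
    node_means = [sum(g) / len(g) for g in nodes]
    # SSbetween: distance of each node mean from the overall mean, weighted by node size.
    ss_between = sum(len(g) * (m - overall_mean) ** 2 for g, m in zip(nodes, node_means))
    # SSwithin: variability of the target around its own node mean.
    ss_within = sum((y - m) ** 2 for g, m in zip(nodes, node_means) for y in g)
    return (ss_between / (b - 1)) / (ss_within / (n - b))

print(f_statistic([[1, 2, 3], [7, 8, 9]]))  # well-separated node means
print(f_statistic([[1, 5, 9], [2, 5, 8]]))  # identical node means
```

Well-separated node means give a large F; identical node means give an F of 0.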
The FTEST splitting criterion is appropriate for what type of target?
interval
How does Model Studio use ENTROPY as a splitting criterion?
ENTROPY uses the gain in the information or the decrease in entropy to split each variable and then to determine the split
What do the letters in the acronym CHAID represent?
chi-squared automatic interaction detection
What value does CHAID use for a classification tree?
CHAID uses the value of a chi-square statistic for a classification tree
What value does the CHAID algorithm use as a splitting criterion for a regression tree?
CHAID uses the F statistic as a splitting criterion for a regression tree
Which grow criterion can be used for both interval and categorical target variables?
CHAID
How does the CHISQUARE splitting criterion work?
CHISQUARE uses a chi-square statistic (logworth) to split each variable, and then uses the p-values that correspond to the resulting splits to determine the splitting variable.
How does Model Studio use GINI as a splitting criterion in a Decision Tree node?
GINI uses the decrease in the Gini index to split each variable and then to determine the split
How does Model Studio use IGR as a splitting criterion in a Decision Tree node?
IGR uses the entropy metric to split each variable and then uses the information gain ratio to determine the split
Which splitting criterion is the default for a categorical target in Model Studio?
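A sketch of entropy and the information gain ratio using the textbook definitions (information gain divided by the split information). Model Studio's exact computation is not given in these cards, so treat the formulas and function names as assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, children):
    """Information gain of a split divided by the entropy of the branch sizes."""
    n = len(parent)
    weighted_child_entropy = sum(len(ch) / n * entropy(ch) for ch in children)
    gain = entropy(parent) - weighted_child_entropy
    split_info = -sum(len(ch) / n * math.log2(len(ch) / n) for ch in children if ch)
    return gain / split_info if split_info > 0 else 0.0

parent = ["yes", "yes", "no", "no"]
print(gain_ratio(parent, [["yes", "yes"], ["no", "no"]]))  # perfectly separating split
print(gain_ratio(parent, [["yes", "no"], ["yes", "no"]]))  # uninformative split
```

Dividing by the split information penalizes splits that scatter cases across many small branches.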
Information Gain Ratio (IGR)
The information gain ratio (IGR) splitting criterion is appropriate for what type of target?
categorical
Which splitting criterion is the default for an interval target in Model Studio?
VARIANCE
How does Model Studio use VARIANCE as a splitting criterion in a Decision Tree node?
VARIANCE uses the change in response variance to split each variable and then to determine the split
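The change in response variance can be sketched as the parent node's variance minus the size-weighted variance of the child nodes. Population variance and the function names are assumptions for this example.

```python
def variance(values):
    """Population variance of the target values in one node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two child nodes."""
    n = len(parent)
    child = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(parent) - child

parent = [1.0, 1.0, 5.0, 5.0]
print(variance_reduction(parent, [1.0, 1.0], [5.0, 5.0]))  # removes all variance
print(variance_reduction(parent, [1.0, 5.0], [1.0, 5.0]))  # removes none
```

The split with the largest reduction is the one chosen under this criterion.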
The FTEST splitting criterion is appropriate for what type of target?
interval