Machine Learning with SAS® Viya® 3.4, Lesson 3: Decision Trees and Ensembles of Trees Flashcards

1
Q

What is a greedy algorithm?

A

one that makes locally optimal choices at each step

2
Q

How does a decision tree predict cases?

A

decision trees use rules that involve the values or categories of the input variables

3
Q

What is a decision tree referred to as when the target is categorical?

A

A classification tree

4
Q

Name the first node at the base (top) of the tree

A

root node

5
Q

What is a decision tree referred to as when the target is continuous?

A

A regression tree

6
Q

What is a leaf node?

A

a terminal node with no child nodes (its only connection is to its parent)

7
Q

Which component of a decision tree provides the predictions?

A

A tree’s leaf nodes provide the predictions.

8
Q

How do decision trees address the curse of dimensionality?

A

The split search process reduces the number of inputs in the model by eliminating irrelevant inputs. Irrelevant inputs do not appear in any splitting rules in the decision tree.

9
Q

How does a decision tree handle missing values?

A

The split search for decision trees treats missing values as their own category, assigning them to one side of the branch at the splitting node.

10
Q

The input variables have missing values. What should you do before running a Decision Tree node with these input variables?

A

Nothing. There is no need to impute the missing values, because trees can handle them.

11
Q

What does Model Studio display in the Tree Diagram?

A

the final tree structure for the model, including the depth of the tree and all of its leaf nodes

12
Q

What is a reduction in node impurity?

A

the reduction of within-node variability induced by the split

13
Q

What is a surrogate rule?

A

A surrogate splitting rule is a backup to the main splitting rule.

When surrogate rules are requested, if a new case has a missing value on the splitting variable, then the best surrogate is used to classify the case.

If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the observation.

If none can be applied, the main rule assigns the observation to the branch that is designated for missing values.

14
Q

How do you interpret a Gini index?

A

The Gini index can be interpreted as the probability that any two elements of a group, chosen at random (with replacement), are different.

A pure node (with no diversity) has a Gini index of 0. As the number of evenly distributed classes increases, the Gini index approaches 1 (more diverse, less pure).

15
Q

What is a Gini index?

A

a Gini index is a measure of variability for categorical data that can be used as a measure of node impurity

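A minimal sketch of the definition above (illustrative Python, not SAS code): the Gini index of a node is 1 minus the sum of squared class proportions, which matches the two-random-draws interpretation.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a node: 1 minus the sum of squared class
    proportions -- the probability that two members drawn at random
    (with replacement) belong to different classes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["yes"] * 10))              # pure node: 0.0
print(gini_index(["yes"] * 5 + ["no"] * 5))  # two even classes: 0.5
```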
16
Q

Which plot shows a Decision Tree model’s performance based on the misclassification rate?

A

the pruning error plot

17
Q

What does the Cumulative Lift chart in the Assessment tab show?

A

how much better the model is than no model / random events

the model’s performance ordered by the percentage of the population

18
Q

How can you set the maximum number of generations of nodes (the tree depth) for a decision tree in Model Studio?

A

Expand the Splitting Options properties and set the Maximum Depth

19
Q

Where would you evaluate model performance based on an assessment measure such as average squared error?

A

the fit statistics table

20
Q

Where would you look to see the input variables that are most significant to the final model?

A

the Variable Importance table

21
Q

What is the standard method used to fit decision trees?

A

Recursive partitioning

22
Q

Allowing a larger tree to be grown by increasing the maximum depth could lead to what problem?

A

overfitting

23
Q

What setting can you change to help prevent overfitting?

A

Increase the minimum leaf size

24
Q

What is the response of the ensemble of simple decision trees for an interval target?

A

For an interval target, the response of the ensemble model is the average of the estimate of the individual decision trees.

25
Q

What is the response of the ensemble of simple decision trees for a categorical target?

A

For a categorical target, the response of the ensemble of simple decision trees is the vote for the most popular class or the average of the posterior probabilities of the individual trees.

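The two ensemble responses above can be sketched as follows (toy data; the helper names are illustrative, not SAS functions):

```python
from collections import Counter

def ensemble_interval(tree_estimates):
    """Interval target: average the individual trees' estimates."""
    return sum(tree_estimates) / len(tree_estimates)

def ensemble_vote(tree_classes):
    """Categorical target: vote for the most popular predicted class."""
    return Counter(tree_classes).most_common(1)[0][0]

def ensemble_posterior(tree_posteriors):
    """Categorical target: average the trees' posterior probabilities."""
    n = len(tree_posteriors)
    return {c: sum(p[c] for p in tree_posteriors) / n
            for c in tree_posteriors[0]}

print(ensemble_interval([102.0, 98.0, 100.0]))   # 100.0
print(ensemble_vote(["yes", "no", "yes"]))       # yes
print(ensemble_posterior([{"yes": 0.9, "no": 0.1},
                          {"yes": 0.5, "no": 0.5}]))  # averaged probabilities
```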
26
Q

What is bagging?

A

Bagging takes bootstrap samples of the rows of training data. All columns are considered for splitting at every step.

27
Q

What is a random forest?

A

A forest is an ensemble of simple (classification or regression) decision trees

28
Q

How does training different trees with different training data improve predictions for a forest?

A

Training different trees with different training data reduces the correlation of the predictions of the trees

29
Q

What is an out-of-bag sample?

A

the training data that are excluded during the construction of an individual tree

30
Q

What data is used to assess the fit of a forest model?

A

the out-of-bag sample
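A minimal sketch of how one tree's bootstrap sample and its out-of-bag sample relate (illustrative only; `random.Random` stands in for the real sampler):

```python
import random

def bootstrap_with_oob(rows, seed=0):
    """Draw a bootstrap sample (same size, with replacement) of the training
    rows; rows never drawn form the out-of-bag sample for this tree."""
    rng = random.Random(seed)
    n = len(rows)
    chosen = [rng.randrange(n) for _ in range(n)]
    in_bag = [rows[i] for i in chosen]
    out_of_bag = [rows[i] for i in range(n) if i not in set(chosen)]
    return in_bag, out_of_bag

rows = list(range(10))
in_bag, out_of_bag = bootstrap_with_oob(rows, seed=1)
# Every row is either in the bag (possibly repeated) or out of bag, never both.
print(sorted(set(in_bag) | set(out_of_bag)) == rows)   # True
```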

31
Q

How does Model Studio calculate the maximum number of inputs per split in a Forest Model when using the default settings?

A

By default, the number of inputs considered per split is the square root of the number of inputs

32
Q

How does the forest algorithm sample the data?

A

The forest algorithm samples the rows and the columns at each step (leading to more perturbed data than the bagging algorithm)

33
Q

What additional chart is available when the target is binary?

A

the ROC curve

34
Q

What does the ROC curve show?

A

the model’s performance considering the true positive rate and the false positive rate
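As an illustrative sketch (made-up labels and scores, not SAS code), each point on the ROC curve is the (true positive rate, false positive rate) pair produced by one classification threshold:

```python
def roc_point(labels, scores, threshold):
    """True-positive rate and false-positive rate when every case with a
    score at or above the threshold is classified as an event (1)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return tp / pos, fp / neg

labels = [1, 1, 1, 0, 0, 0]              # made-up binary target
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]  # made-up posterior probabilities
# Sweeping the threshold from high to low traces the ROC curve point by point.
for t in (0.85, 0.5, 0.25):
    print(roc_point(labels, scores, t))
```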

35
Q

How does a split-search strategy work?

A
  1. Identify candidate splits based on the splitting criterion
  2. Select a split that is expressed as an IF-THEN-ELSE rule
  3. Repeat process for each child node, continuing until a stopping rule prevents further growth
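The steps above can be sketched for a single interval input and a categorical target (a toy illustration, not Model Studio's algorithm; Gini reduction stands in for the splitting criterion):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try each candidate cut point on one interval input and keep the one
    with the largest reduction in impurity (parent Gini minus the
    size-weighted Gini of the two child nodes)."""
    n = len(labels)
    parent = gini(labels)
    best = None
    for cut in sorted(set(values))[:-1]:   # candidate rule: IF x <= cut THEN left
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        worth = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or worth > best[1]:
            best = (cut, worth)
    return best

x = [1, 2, 3, 10, 11, 12]
y = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(x, y))   # (3, 0.5): IF x <= 3 separates the classes perfectly
```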
36
Q

What is the goal of splitting?

A

to reduce the variability of the target distribution and thus increase purity in the child nodes

37
Q

What is a split search?

A

an iterative process used by recursive partitioning to select the best split for the node

38
Q

Which splitting criteria may be used for categorical targets?

A
  1. Information gain ratio (IGR) (default in Model Studio)
  2. CHAID
  3. Chi-Square
  4. Entropy
  5. GINI
39
Q

Which splitting criteria are appropriate for interval targets?

A
  1. Variance (default in Model Studio)
  2. CHAID
  3. F test
40
Q

What is the purpose of the Bonferroni correction during a decision tree split search?

A

To adjust for the multiple comparisons made during the split search: inflating the p-values (equivalently, lowering the logworth) maintains the overall confidence level.
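A minimal sketch of the adjustment (illustrative numbers; not the exact correction Model Studio applies):

```python
import math

def bonferroni_adjust(p_value, num_comparisons):
    """Inflate a split's p-value by the number of comparisons made while
    searching for the split, capped at 1.0."""
    return min(1.0, p_value * num_comparisons)

raw = 0.001
adjusted = bonferroni_adjust(raw, num_comparisons=20)
# The adjusted p-value is larger, so the split's logworth (-log10 of the
# p-value) is smaller and the split looks less significant.
print(raw, adjusted)
print(-math.log10(raw), -math.log10(adjusted))
```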

41
Q

Which split criteria can request a Bonferroni correction after the split has been determined?

A

Split criteria using the p-value (chi-square, CHAID, or F test)

42
Q

Which window shows the score code for a specific node that may be deployed in production?

A

the Node Score Code window

43
Q

When does Model Studio generate node score code?

A

Model Studio generates node score code for every node in the Data Mining Preprocessing group and the Supervised Learning group that creates DATA step score code.

44
Q

What is another name for the “flow score code?”

A

Path EP Score Code

45
Q

What is included in the Path EP Score Code?

A

the score code for all nodes up to and including that modeling node, which can be used in other SAS environments

46
Q

What does the ‘EP’ refer to in the term Path EP Score Code?

A

Embedded Process

47
Q

Which window contains the SAS training code that may be used to train the model based on different data sets or platforms?

A

The Training Code window

48
Q

What do large values of the F statistic indicate?

A

departures from the null hypothesis that all the node means are equal

49
Q

What does the between-node sum of squares (SSbetween) measure?

A

the distance between the node means and the overall mean

50
Q

What does the within-node sum of squares (SSwithin) measure?

A

the variability within a node

51
Q

The FTEST splitting criteria is appropriate for what type of target?

A

interval

52
Q

How does Model Studio use ENTROPY as a splitting criterion?

A

ENTROPY uses the gain in the information or the decrease in entropy to split each variable and then to determine the split

53
Q

What do the letters in the acronym CHAID represent?

A

chi-squared automatic interaction detection

54
Q

What value does CHAID use for a classification tree?

A

CHAID uses the value of a chi-square statistic for a classification tree

55
Q

What value does the CHAID algorithm use as a splitting criterion for a regression tree?

A

CHAID uses the F statistic as a splitting criterion for a regression tree

56
Q

Which grow criterion can be used for both interval and categorical target variables?

A

CHAID

57
Q

How does the CHISQUARE splitting criteria method work?

A

CHISQUARE uses a chi-square statistic (logworth) to split each variable, and then uses the p-values that correspond to the resulting splits to determine the splitting variable.

58
Q

How does Model Studio use GINI as a splitting criterion in a Decision Tree node?

A

GINI uses the decrease in the Gini index to split each variable and then to determine the split

59
Q

How does Model Studio use IGR as a splitting criterion in a Decision Tree node?

A

Uses the entropy metric to split each variable and then uses the information gain ratio to determine the split
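A toy sketch of entropy and the information gain ratio for one candidate split (illustrative Python, not Model Studio's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a node's class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(parent, children):
    """Information gain of the split divided by the entropy of the split
    itself, which penalizes splits into many small branches."""
    n = len(parent)
    gain = entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
    split_info = -sum((len(ch) / n) * math.log2(len(ch) / n) for ch in children)
    return gain / split_info

parent = ["yes"] * 4 + ["no"] * 4
print(info_gain_ratio(parent, [["yes"] * 4, ["no"] * 4]))  # perfect split: 1.0
```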

60
Q

Which splitting criteria is the default for a categorical target in Model Studio?

A

Information Gain Ratio (IGR)

61
Q

The Information gain ratio (IGR) splitting criteria is appropriate for what type of target?

A

categorical

62
Q

Which splitting criteria is the default for an interval target in Model Studio?

A

VARIANCE

63
Q

How does Model Studio use VARIANCE as a splitting criterion in a Decision Tree node?

A

VARIANCE uses the change in response variance to split each variable and then to determine the split
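The change in response variance can be sketched as the parent node's variance minus the size-weighted variance of the child nodes (toy responses; illustrative only):

```python
def variance(responses):
    """Population variance of a node's interval responses."""
    mean = sum(responses) / len(responses)
    return sum((y - mean) ** 2 for y in responses) / len(responses)

def variance_reduction(parent, children):
    """Change in response variance: the parent node's variance minus the
    size-weighted variance of the proposed child nodes."""
    n = len(parent)
    return variance(parent) - sum(len(ch) / n * variance(ch) for ch in children)

parent = [1.0, 2.0, 9.0, 10.0]      # made-up interval responses
split = [[1.0, 2.0], [9.0, 10.0]]   # separating low from high responses
print(variance_reduction(parent, split))   # 16.0
```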

64
Q

The FTEST splitting criteria is appropriate for what type of target?

A

interval