Trees Analysis Flashcards

1
Q

A benefit of a TREE analysis is its visual outcome.

A

TRUE. This is one of the main reasons TREES are so widely used.

2
Q

The benefit of TREES is much clearer when there is not a single causation model for the whole population of interest

A

TRUE. A TREE can handle complexity in causation in a flexible way.

3
Q

OVERFITTING is about the risk of a poor “generalization” of our results in new samples

A

TRUE. This is a way of saying that, if we let the algorithm “overfit”, we risk getting good results in the training sample but not so good results in new samples.

4
Q

If you increase the minimum number of cases required in a parent node (for it to be split) or in a child node, you take a higher risk of overfitting

A

FALSE. By increasing the minimum number of cases required, we avoid splitting small nodes into even smaller ones, so we limit the flexibility of the tree, looking for better generalization of results
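As an illustrative sketch of this control (using scikit-learn’s DecisionTreeClassifier and synthetic data as a stand-in for the Modeler trees the deck refers to; the parameter names are sklearn’s, not Modeler’s):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X, y = make_classification(n_samples=500, random_state=0)

# Small minimums: the tree may keep splitting tiny nodes (overfitting risk).
loose = DecisionTreeClassifier(
    min_samples_split=2, min_samples_leaf=1, random_state=0).fit(X, y)

# Larger minimums: small nodes are not split, limiting the tree's flexibility.
strict = DecisionTreeClassifier(
    min_samples_split=50, min_samples_leaf=25, random_state=0).fit(X, y)

print(loose.get_depth(), strict.get_depth())
```

Because the greedy splits are identical until a minimum-size rule stops them, the strict tree is a pruned-back version of the loose one: fewer nodes, shallower depth.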

5
Q

The GAIN of a node in a tree measures the % of “hits” in the node compared to % of “hits” in the whole sample

A

TRUE. This is the exact definition
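A minimal numeric sketch of that comparison (the data and the “hit” coding are invented for illustration; this is the gain index flavour of the measure, node % over sample %):

```python
import numpy as np

# Hypothetical data: target == 1 is a "hit"; in_node flags one node's cases.
target = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1])       # 5 hits in 10 cases
in_node = np.array([True, True, False, True, True,
                    False, False, False, True, False])   # a node of 5 cases

node_hit_rate = target[in_node].mean()          # 4 hits / 5 cases  = 0.8
overall_hit_rate = target.mean()                # 5 hits / 10 cases = 0.5
gain_index = node_hit_rate / overall_hit_rate   # 1.6: 160% of the baseline
```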

6
Q

CHAID, CRT, TWO-STEP and PCA are all different types of TREES algorithms

A

FALSE. TWO-STEP is a clustering algorithm and PCA is about factor analysis.

7
Q

In a CREDIT DEFAULT RISK exercise, a false positive is much more expensive than a false negative

A

FALSE. A false positive means rejecting credit for a safe customer, whereas a false negative means giving credit to a risky customer

8
Q

If we predict a YES/NO target using a TREE, the main output we will get in terms of prediction is directly a YES/NO

A

FALSE. When we use a TREE to predict a YES (vs NO) target, we get a “propensity” to “YES”, a kind of numerical score that we later transform into a YES/NO according to a given threshold. Modeler can generate this YES/NO automatically (alongside the score), but we have to control the cutoff value as analysts.
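A sketch of the score-then-threshold step (scikit-learn here as an illustrative stand-in for Modeler; the 0.3 cutoff is an arbitrary analyst choice, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

score = tree.predict_proba(X)[:, 1]  # propensity to "YES" for each case
cutoff = 0.3                         # analyst-controlled threshold (assumed)
pred_yes = score >= cutoff           # the score transformed into YES/NO
```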

9
Q

A Chi-squared test is used by the CHAID tree algorithm to select the best predictors

A

TRUE. The name itself stands for Chi-squared Automatic Interaction Detection. In effect, it uses a Chi-squared TEST to select the best predictor at each split
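The test in question is the ordinary chi-squared test of independence on a predictor-by-target crosstab. A sketch with invented counts (scipy used for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical predictor-by-target crosstab (counts are invented).
#                     YES  NO
crosstab = np.array([[30, 70],    # predictor category A
                     [60, 40]])   # predictor category B

chi2, p_value, dof, expected = chi2_contingency(crosstab)
# A small p-value is evidence of association between predictor and target.
```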

10
Q

TREES can be used to predict a CATEGORICAL variable with more than two categories

A

TRUE. There is no restriction on the number of categories

11
Q

We normally have to hold out a part of our dataset / sample to validate our TREES

A

TRUE. We try to prevent the algorithm from “memorizing” the analysis sample (OVERFITTING), so we test the accuracy of our TREE on a holdout sample (not used to train the model)
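A sketch of the train/holdout split (scikit-learn on synthetic data, standing in for Modeler’s partition node; the 30% holdout fraction is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Hold out 30% of the sample; it plays no part in training the tree.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # accuracy on the training sample
test_acc = tree.score(X_test, y_test)     # accuracy on the holdout sample
```

The gap between the two accuracies is the practical symptom of overfitting.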

12
Q

We should put a lot of work in pre-selecting the main predictors (as explanatory candidate variables) before launching a TREE analysis.

A

FALSE. A TREE algorithm can select the best predictors among a long list of candidates. This is, in fact, one of the advantages of this type of algorithm.

13
Q

TREES are part of the family of algorithms called “RULE INDUCTION models”

A

TRUE. That name comes from the idea that these models derive a set of rules that describe distinct segments within the data in relation to the target. The model’s output shows the reasoning for each rule and can therefore be used to understand the decision-making process that drives a particular outcome.

14
Q

TREES are a kind of “classification” analysis

A

TRUE. It is used to predict CATEGORICAL targets.

15
Q

CRT is a bit different to other TREES algorithms because it can be used to predict SCALE targets.

A

TRUE. In fact, the name CRT comes from Classification & Regression Tree. The word “regression” (vs classification) means that it can be used to predict scale variables
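A sketch of the regression case (scikit-learn’s DecisionTreeRegressor as an illustrative stand-in for CRT; the data are invented):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, 200)  # a scale (numeric) target

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = reg.predict([[5.0]])  # the output is a number, not a class label
```

Each leaf predicts the mean of the target among its training cases, which is why the prediction is numeric rather than categorical.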

16
Q

TREES analysis can also be understood as a type of CLUSTER algorithm because, at the end, it could be used to find similar groups according to the values of a set of variables.

A

FALSE. A TREE does not find similar groups according to a set of variables. The segments identified by a TREE (NODES) consist of groups of customers that have a similar propensity in relation to A TARGET VARIABLE. In this sense, the groups are CONDITIONED on the target variable; we use this target variable as a SUPERVISOR for the result.

17
Q

TREES can be called SUPERVISED technique BUT ONLY if we build the TREE in an interactive way, “supervising” the outcome.

A

FALSE. The tag “supervised” refers to the fact that a variable is used as a SUPERVISOR, as a target, for the whole result; it has nothing to do with building the tree interactively.

18
Q

Imagine that we run a FACTOR analysis using a group of satisfaction variables for our customer dataset. Could we use the factor score(s) as input variables (maybe among others) in a TREES analysis to predict “churn”?

A

TRUE. Why not? A factor score is a metric variable and WE CAN use metric variables as input variables in a TREE. If we have several different satisfaction indicators, a FACTOR would be a good way of introducing our own “satisfaction measure” into our TREE.

19
Q

In a SPAM EMAIL FILTER exercise, a false positive means to receive a junk mail in your inbox

A

FALSE. A false positive means predicting “SPAM” for actual “HAM”, so we would move a safe email to our junk folder

20
Q

Normally, it is easy to increase TRUE POSITIVES if you are willing to accept also FALSE POSITIVES

A

TRUE. If you tend to predict POSITIVES, you will capture TRUE POSITIVES but also FALSE POSITIVES
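A small numeric sketch of this trade-off (scores and labels are invented for illustration): lowering the cutoff predicts POSITIVE more often, capturing more true positives and more false positives at the same time.

```python
import numpy as np

# Hypothetical propensity scores and true labels (1 = POSITIVE).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1])
truth = np.array([1, 1, 0, 1, 1, 0, 0, 0])

def tp_fp(cutoff):
    """Count true and false positives at a given cutoff."""
    pred = scores >= cutoff
    tp = int(np.sum(pred & (truth == 1)))
    fp = int(np.sum(pred & (truth == 0)))
    return tp, fp

strict_tp, strict_fp = tp_fp(0.75)  # predicts POSITIVE rarely: tp=2, fp=0
loose_tp, loose_fp = tp_fp(0.30)    # predicts POSITIVE often:  tp=4, fp=2
```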

21
Q

We call CLASSIFICATION techniques those used to predict or explain SCALE variables

A

FALSE. Classification techniques are used for CATEGORICAL (or ordinal) targets; SCALE targets are predicted with regression techniques

22
Q

A TREE is somewhere in the middle between pure predictive and pure explanatory techniques

A

TRUE. It can be used to predict but also to give some information about the determinants of our variable of interest

23
Q

TREES are flexible classification algorithms in the sense that they can capture complex relationships in the presence of lots of explanatory variables

A

TRUE. The benefit of TREES is much clearer when there is not a single causation model for the whole population of interest. At the same time, the algorithms are able to discriminate between good and bad predictors.

24
Q

CHAID, CRT, C5 and QUEST are different types of TREES algorithms

A

TRUE. These are very common TREES algorithms

25
Q

An F - test is used by CHAID tree algorithm to select the best predictors

A

FALSE. It uses Chi-Square TEST

26
Q

We normally hold out a part of our dataset / sample to validate our TREES

A

TRUE. We try to prevent the algorithm from “memorizing” the analysis sample (OVERFITTING), so we test the accuracy of our TREE on a holdout sample (not used to train the model)

27
Q

In a CREDIT DEFAULT RISK exercise, a false positive means to give credit to a risky customer

A

FALSE. If it is about DEFAULT, a false positive means predicting RISK for a safe customer (and thus denying credit)

28
Q

In a SPAM EMAIL FILTER exercise, a false negative means to receive a junk mail in your inbox

A

TRUE. We predict “HAM” instead of actual SPAM and we let the junk email enter our inbox

29
Q

In a CREDIT DEFAULT RISK exercise, a false negative is much more expensive than a false positive

A

TRUE. Because a false negative means giving credit to a risky customer

30
Q

In a CHAID exercise, lower p-values from chi-squared tests are used to identify and select the best predictors

A

TRUE. A low p-value for a crosstab chi-square test means evidence of association between predictor and target
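A sketch of the selection rule: among several candidate predictors, the one whose crosstab with the target gives the lowest chi-squared p-value wins the split (scipy for illustration; predictor names and counts are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical crosstabs of two candidate predictors against the target.
tabs = {
    "age_band": np.array([[80, 20], [30, 70]]),  # strong association
    "region":   np.array([[52, 48], [50, 50]]),  # weak association
}
p_values = {name: chi2_contingency(tab)[1] for name, tab in tabs.items()}
best = min(p_values, key=p_values.get)  # CHAID-style: lowest p-value wins
```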

31
Q

Scale variables can also be used as predictors in CHAID analysis

A

TRUE. Scale variables are automatically transformed to ORDINAL by the CHAID algorithm
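A sketch of what that transformation amounts to (pandas for illustration; equal-width bins are an assumption here, since Modeler’s internal discretization is its own implementation detail):

```python
import pandas as pd

age = pd.Series([18, 22, 25, 31, 37, 44, 52, 60, 67, 71])

# Bin the scale variable into ordered bands, CHAID-style.
age_band = pd.cut(age, bins=3)  # an ordered categorical with 3 levels
```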

32
Q

An interactive session permits the user to grow a tree applying their own criteria in the selection of predictors

A

TRUE. By using this interactive way, the analyst may influence the TREE result for the sake of a better model in terms of the business goal

33
Q

The more a tree grows, the better the result we get in terms of VALIDATION

A

FALSE. Excessive growth increases the risk of overfitting (WORSE RESULTS IN TERMS OF VALIDATION)
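A sketch of the effect (scikit-learn on noisy synthetic data as an illustrative stand-in; flip_y injects label noise to invite overfitting): the unrestricted tree fits the training sample almost perfectly, while on the holdout sample its advantage typically shrinks or reverses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: 20% of the labels are flipped at random.
X, y = make_classification(n_samples=600, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unrestricted

print(shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))
```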

34
Q

We normally have to control the tree growth in order to avoid OVERFITTING

A

TRUE. We need to balance accuracy in the TRAIN and TEST samples

35
Q

A variable may appear as predictor in a TREE more than one time, in different tree levels

A

TRUE. It is possible: AGE may appear as the main predictor and then appear again within a subset of the sample

36
Q

A TREE algorithm can select the best predictors among a long list of candidates

A

TRUE. This is, in fact, one of the advantages of this type of algorithm

37
Q

For ordinal predictors only adjacent categories are compared and possibly merged in a CHAID analysis

A

TRUE. This is because, normally, it makes no sense to merge categories that are not adjacent (people below 18 and over 65, for instance)

38
Q

The GAIN of a node in a tree measures the % of “hits” in the node

A

FALSE. It measures this % of hits compared to the overall % of hits (in the whole sample)

39
Q

For categorical targets YES/NO, a classification table will always be a 2x2 table

A

TRUE. YES/NO predicted VS YES/NO observed
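A sketch of that table (scikit-learn’s confusion_matrix for illustration; the observed and predicted labels are invented):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

observed = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # YES = 1, NO = 0
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

table = confusion_matrix(observed, predicted)
# Rows = observed NO/YES, columns = predicted NO/YES: always a 2x2 table
# for a binary target.
```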

40
Q

Normally, it is easy to increase TRUE POSITIVES if you are willing to accept also FALSE POSITIVES

A

TRUE. If you tend to predict POSITIVES, you will capture TRUE POSITIVES but also FALSE POSITIVES