5. Trees Flashcards

1
Q

What is the general algorithm for creating a regression/classification tree? How are recursive binary splitting, cost complexity pruning (CCP), and CV used together?

A
  1. Grow a large tree using recursive binary splitting.
  2. Obtain a sequence of best subtrees, as a function of lambda, using cost complexity pruning.
  3. Choose the value of lambda by applying K-fold CV; select the lambda that gives the lowest CV error.
  4. The best subtree is the subtree from step 2 corresponding to the chosen lambda (see the sketch below).
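A minimal scikit-learn sketch of this workflow, as an illustration (the toy data and the use of scikit-learn are assumptions; sklearn names the cost complexity penalty ccp_alpha rather than lambda):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Steps 1-2: grow a large tree, then get the sequence of subtrees
# produced by cost complexity pruning (lambda is called ccp_alpha here).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Step 3: choose lambda by K-fold CV over the candidate ccp_alpha values.
cv = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
cv.fit(X_train, y_train)

# Step 4: the best subtree is the one refit with the chosen lambda.
best_tree = cv.best_estimator_
print(cv.best_params_, best_tree.get_n_leaves())
```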
2
Q

With recursive binary splitting, we aim to minimize _____ for regression and ______ for classification.

A
  1. Regression: the residual sum of squares (RSS), $\sum_{j}\sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$.
  2. Classification: a measure of node impurity, such as the Gini index or cross-entropy (the classification error rate is too insensitive to use for growing the tree).

3
Q

What are the 3 impurity measures for classification trees? Draw the graph of all three.

A

Gini index, cross-entropy (entropy), and classification error rate.

When plotted against the class proportion, entropy is always the greatest of the three.
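As a worked two-class example (this specific breakdown is added for illustration; natural logs are used for the cross-entropy): with proportion $p$ of class 1 in a node,

```latex
\begin{aligned}
\text{Classification error: } & E = 1 - \max(p,\, 1-p) \\
\text{Gini index: }           & G = 2p(1-p) \\
\text{Cross-entropy: }        & D = -p\log p - (1-p)\log(1-p) \\[4pt]
\text{At } p = 0.3:\quad      & E = 0.3,\quad G = 0.42,\quad D \approx 0.61,
\end{aligned}
```

so $D \ge G \ge E$ for every $p$, with all three maximized at $p = 0.5$ and equal to zero at $p \in \{0, 1\}$.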

4
Q

With cost complexity pruning, for both regression and classification, what are we trying to minimize?

A

For regression: $\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \lambda |T|$, where $|T|$ is the number of terminal nodes.

For classification, the RSS term is replaced by an impurity measure (or the classification error rate), with the same $\lambda |T|$ penalty.

5
Q

Recursive binary splitting is referred to as a _____.

A

A top-down, greedy approach.

Top-down: begins with the predictor space as one large region.
Greedy: at each split, the best split is selected at that step, without accounting for future splits that could lead to a better tree.

6
Q

Does cost complexity pruning increase or decrease the variance? How is flexibility measured for trees?

A

Decreases variance, because it decreases the number of terminal nodes, which is how flexibility is measured for trees.

7
Q

Cost complexity pruning results in a group of sub trees that are _____.

A

Nested.

8
Q

For cost complexity pruning, does increasing the tuning parameter increase or decrease the variance of the method?

A

Decreases variance, because the number of terminal nodes is a decreasing function of the tuning parameter.

9
Q

When will trees outperform linear models?

A

When the relationship between the explanatory variables and the response is far more complicated than a linear equation.

10
Q

What are 4 advantages of trees?

A
  1. Easy to interpret and explain
  2. Can be presented visually
  3. Handle categorical predictors without the need for dummy variables
  4. Mimic human decision making
11
Q

What are the 2 disadvantages of trees?

A
  1. Not robust: small changes in the data can lead to large changes in the fitted tree.
  2. Do not have the same degree of predictive accuracy as other statistical methods.

12
Q

What is the procedure for bagging? 3

A
  1. Create b bootstrap samples from the original training set.
  2. Create a decision tree for each bootstrap sample using recursive binary splitting
  3. Predict the response of a new observation by averaging the predictions (regression) or taking the most frequent category (classification) across all b trees, as in the sketch below
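A minimal sketch of this procedure using scikit-learn's BaggingRegressor, as an illustration (the toy data are an assumption; oob_score=True also gives the OOB error discussed in later cards):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

# Steps 1-2: b bootstrap samples, one unpruned decision tree per sample
# (a decision tree is the default base learner).
# Step 3: .predict() averages the b trees' predictions.
bag = BaggingRegressor(
    n_estimators=100,   # b
    bootstrap=True,
    oob_score=True,     # out-of-bag estimate of test performance (R^2 here)
    random_state=0,
).fit(X, y)

print(bag.oob_score_)       # OOB score
print(bag.predict(X[:3]))   # averaged predictions for three observations
```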
13
Q

If we increase the value of b in bagging, does this cause overfitting?

A

No

14
Q

Does bagging reduce the variance? If yes, why?

A

Yes, because the variance of the average of a set of observations is less than the variance of a single observation.
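A worked version of this standard result (assuming B predictions, each with variance $\sigma^2$ and pairwise correlation $\rho$; the correlated case is what motivates random forests in later cards):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{\,b}(x)\right)
  \;=\; \rho\,\sigma^{2} \;+\; \frac{1-\rho}{B}\,\sigma^{2}
  \;\xrightarrow[\;B \to \infty\;]{}\; \rho\,\sigma^{2}.
```

Averaging drives the second term toward zero, but highly correlated trees ($\rho$ near 1) limit the variance reduction; with independent trees ($\rho = 0$) the variance falls all the way to $\sigma^2 / B$.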

15
Q

What is a disadvantage to bagging?

A

Difficult to interpret the entire bagged model.

16
Q

The out of bag error is used as an estimate of _____.

A

The test error (the test MSE in a regression setting).

17
Q

On average, how many observations are used to train each bagged tree? What are the observations called that were not used to train the trees?

A

On average, about 2/3 of the observations. The observations not used to train a given tree are called the out-of-bag (OOB) observations.

18
Q

If the bootstrap training dataset has a very strong predictor variable, what will happen to the trees? What happens if the trees are similar to one another?

A

Most trees will use this strong predictor in their top split, so the trees will look similar to one another. If the trees are similar, their predictions will tend to be highly correlated, which limits the reduction in variance from averaging.

19
Q

Random forests aim to _______ trees.

A

Decorrelate

20
Q

In random forests, k is chosen to be _____ under regression and _______ under classification problems.

A

Regression k=p/3

Classification k=sqrt(p)
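A minimal scikit-learn sketch, as an illustration (the toy data are an assumption; sklearn exposes k through the max_features argument, as a fraction of p for regression or "sqrt" for classification):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Toy data just to make the sketch runnable (p = 6 predictors).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y_num = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=300)
y_cls = (y_num > 0).astype(int)

# Regression: k = p/3 predictors considered at each split.
rf_reg = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)
rf_reg.fit(X, y_num)

# Classification: k = sqrt(p) predictors considered at each split.
rf_cls = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf_cls.fit(X, y_cls)
```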

21
Q

What happens if we decrease k in a random forest?

A

Reduces the correlation between predictions

22
Q

When we increase b for boosting, does this cause overfitting?

A

Yes

23
Q

Does boosting reduce bias or variance?

A

Reduces bias

24
Q

How do we get a prediction using the b boosted trees?

A

Sum of the predictions from each tree
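In ISLR-style notation, with shrinkage parameter $\lambda$ and trees $\hat f^{\,1}, \dots, \hat f^{\,B}$ each fit to the residuals of the current model, the boosted prediction is

```latex
\hat{f}(x) \;=\; \sum_{b=1}^{B} \lambda\, \hat{f}^{\,b}(x).
```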

25
Q

Are the test errors for bagging and random forests similar or significantly different?

A

Similar, but random forest tends to perform better than bagging, resulting in a lower test error

26
Q

Does random forest randomly select a subset of predictors to be considered for creating each tree?

A

No, random forests randomly select a subset of predictors to be considered for creating each SPLIT.

27
Q

Does bagging require performing CV?

A

No, bagging doesn’t operate on a choice of flexibility.

28
Q

Does a boosted model become a GLM when all its trees are stumps?

A

No, it becomes an additive model.
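A short worked form of this: when every tree is a stump (a single split on one variable), each tree depends on only one predictor, say $X_{j(b)}$ for stump $b$, so the boosted model is additive in the predictors:

```latex
\hat{f}(x) \;=\; \sum_{b=1}^{B} \lambda\, \hat{f}^{\,b}\!\left(x_{j(b)}\right)
          \;=\; \sum_{j=1}^{p} f_j(x_j),
```

where $f_j$ collects all the stumps that split on $X_j$. There is no link function or linear predictor, so it is not a GLM.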

29
Q

In a random forest, p=total number of features and m=number of features selected at each split. What is the probability that a split will not consider the strongest predictor?

A

(p-m)/p

Because p-m of the p features are not among the m considered at a given split. For example, with p = 10 and m = 3, the probability is 7/10.

30
Q

In random forest, is each tree constructed independently of every other tree?

A

Yes, through the independent bootstrap samples

31
Q

What is a benefit of boosting?

A

The model learns slowly: the prediction is built up by fitting successive trees, each one fit to the residuals of the current model.

32
Q

Does including more trees in the random forest model decrease the residuals?

A

No, we do not know the optimal number of trees to be included in this model.

33
Q

Can a random forest handle both qualitative and quantitative variables?

A

Yes

34
Q

Can random forests handle non-linear relationships?

A

Yes

35
Q

When are random forests appropriate?

A

When there isn’t a clear relationship between the predictors and the response. When there are clear relationships, we usually resort to statistical methods that include those relationships.

Example: a quadratic relationship is better modelled with a linear model (including a squared term) than with a random forest.

36
Q

True or false: GLMs cannot handle polynomial relationships.

A

False. They can, e.g. by including polynomial terms of the predictors.

37
Q

Which of the following is true regarding models that use numerous decision trees to obtain a prediction function f-hat?
A. Every decision tree is usually pruned
B. OOB error helps to determine the optimal flexibility of f-hat
C. They address the lacking predictive accuracy of a single decision tree
D. f-hat is usually easier to interpret compared to a model with a single decision tree.
E. They are more suitable for a regression setting than for a classification setting.

A

A. False. Bagging, boosting and random forests are all not pruned.

B. False. OOB error can help determine an appropriate number of trees to construct, but that applies to bagging and random forests where the number of trees is not a flexibility measure.

C. True. With bagging and random forest, improvement in accuracy comes from lowering f-hat’s variance, whereas with boosting, it comes from slow learning and finding an appropriate number of trees

D. False. Usually harder

E. False. The models are equally suited to handle both types of settings

38
Q

As b increases (for bagging) does the predictive accuracy of the model increase or decrease?

A

It increases, then levels off; using a very large b does not cause overfitting.

39
Q

When predictions are highly correlated between trees, what does this mean in terms of the change in variance?

A

Since bagging aims to decrease variance, highly correlated predictions lessen this effect: averaging the trees gives a smaller reduction in variance.

40
Q

What are the 3 tuning parameters in boosting?

A

b = number of trees, this is found using CV

d = number of splits in each tree, this controls the complexity of the boosted model (this is called the interaction depth)

lambda = shrinkage parameter; the smaller lambda is, the slower the learning (a small lambda is usually preferred). See the sketch below.
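A minimal scikit-learn sketch of tuning these, as an illustration (the toy data and grid values are assumptions; GradientBoostingRegressor exposes b, d, and lambda as n_estimators, max_depth, and learning_rate):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=300)

# lambda (learning_rate) is kept small; d (max_depth) controls the interaction depth;
# b (n_estimators) is chosen by K-fold CV, since too many trees can overfit.
cv = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01, max_depth=2, random_state=0),
    param_grid={"n_estimators": [100, 500, 1000, 2000]},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)

print(cv.best_params_)   # chosen b
```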

41
Q

Are all variables used when making the splits in boosting?

A

Yes; all variables are considered at each split. It is only in random forests that a random subset of variables is considered at each split.

42
Q

Does boosting have bootstrapping? Does it have CV?

A

No bootstrapping; yes CV. CV is used to choose b, the number of trees.

43
Q

A correlation matrix of standardized predictors is given, with all correlation values positive and high. What can be assumed about the signs of the PC loadings and the signs of the PC scores?

A

Because the variables are standardized (and hence centered), the scores of each PC sum to zero, so they cannot be all positive or all negative.

The loadings of the first PC, however, will be either all positive or all negative, because all the correlation values are positive.
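A small numerical check, as a sketch (the synthetic positively correlated data are an assumption; the overall sign of a PC is arbitrary, so "all the same sign" holds up to a flip):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three positively correlated synthetic predictors.
rng = np.random.default_rng(4)
common = rng.normal(size=500)
X = np.column_stack([common + 0.3 * rng.normal(size=500) for _ in range(3)])

Z = StandardScaler().fit_transform(X)   # standardize the predictors
pca = PCA().fit(Z)
scores = pca.transform(Z)

print(pca.components_[0])           # first-PC loadings: all the same sign
print(scores[:, 0].sum().round(6))  # first-PC scores sum to ~0 (centered data)
```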

44
Q

From a biplot, when can we determine which predictor has the highest variance and how would we do this?

A

We have to use the unscaled/unstandardized biplot. The predictor whose loading vector (arrow) is longest has the highest variance.

45
Q

Correlation matrix: predictors are standardized and the correlation values in the matrix are low. They have both positive and negative values. What can be concluded from the matrix in terms of signs for the PC loading vectors and the proportion of variance explained by only one PC?

A

The loadings of the first PC will not be all positive or all negative, because the correlation values have mixed signs.

We will need more than one PC to explain most of the variability in the data set, because the predictors are not highly correlated.

46
Q

Can we deduce anything about the second PC scores/loadings from a correlation matrix?

A

No

47
Q

If X1 and X2 are positively correlated, their loadings will have _____ signs. If they are negatively correlated, their loadings will have ____ signs.

A

Positively correlated: the same signs

Negatively correlated: opposing signs

48
Q

If two variables are highly correlated, what can be said of the magnitude of their loadings? How does this connect to biplots?

A

The loadings will have similar magnitudes (and, for positive correlation, the same signs). On a biplot this shows up as loading vectors of similar length pointing in similar directions.

49
Q

What is the probability that an observation is an OOB observation?

A

(1-1/n)^n
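A worked derivation (assuming a bootstrap sample of size n drawn with replacement from n observations): each of the n draws misses a given observation with probability $1 - 1/n$, so

```latex
P(\text{observation is OOB}) \;=\; \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow[\;n \to \infty\;]{}\; e^{-1} \approx 0.368,
```

which is where the "about 1/3 OOB, about 2/3 used to train each bagged tree" rule of thumb from the earlier card comes from.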