5. Trees Flashcards

1
Q

What is the general algorithm for creating a regression/classification tree? How are recursive binary splitting, cost complexity pruning (CCP), and CV used together?

A
  1. Grow a large tree using recursive binary splitting.
  2. Obtain a sequence of best subtrees, as a function of lambda, using cost complexity pruning.
  3. Choose the value of lambda by applying K-fold CV; select the lambda that gives the lowest CV error.
  4. The best subtree is the subtree from step 2 corresponding to the chosen lambda (see the sketch below).
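A minimal scikit-learn sketch of this workflow, as an illustration (the toy data and the use of scikit-learn are assumptions; sklearn names the cost complexity penalty ccp_alpha rather than lambda):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Steps 1-2: grow a large tree, then get the sequence of subtrees
# produced by cost complexity pruning (lambda is called ccp_alpha here).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Step 3: choose lambda by K-fold CV over the candidate ccp_alpha values.
cv = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
cv.fit(X_train, y_train)

# Step 4: the best subtree is the one refit with the chosen lambda.
best_tree = cv.best_estimator_
print(cv.best_params_, best_tree.get_n_leaves())
```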
2
Q

With recursive binary splitting, we aim to minimize _____ for regression and ______ for classification.

A
  1. Regression: the residual sum of squares (RSS), $\sum_{j}\sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$.
  2. Classification: a measure of node impurity, such as the Gini index or cross-entropy (the classification error rate is too insensitive to use for growing the tree).

3
Q

What are the 3 impurity measures for classification trees? Draw the graph of all three.

A

Gini index, cross-entropy (entropy), and classification error rate.

When plotted against the class proportion, entropy is always the greatest of the three.
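As a worked two-class example (this specific breakdown is added for illustration; natural logs are used for the cross-entropy): with proportion $p$ of class 1 in a node,

```latex
\begin{aligned}
\text{Classification error: } & E = 1 - \max(p,\, 1-p) \\
\text{Gini index: }           & G = 2p(1-p) \\
\text{Cross-entropy: }        & D = -p\log p - (1-p)\log(1-p) \\[4pt]
\text{At } p = 0.3:\quad      & E = 0.3,\quad G = 0.42,\quad D \approx 0.61,
\end{aligned}
```

so $D \ge G \ge E$ for every $p$, with all three maximized at $p = 0.5$ and equal to zero at $p \in \{0, 1\}$.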

4
Q

With cost complexity pruning, for both regression and classification, what are we trying to minimize?

A

For regression: $\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \lambda |T|$, where $|T|$ is the number of terminal nodes.

For classification, the RSS term is replaced by an impurity measure (or the classification error rate), with the same $\lambda |T|$ penalty.

5
Q

Recursive binary splitting is referred to as a _____.

A

A top-down, greedy approach.

Top-down: begins with the predictor space as one large region.
Greedy: at each split, the best split is selected at that step, without accounting for future splits that could lead to a better tree.

6
Q

Does cost complexity pruning increase or decrease the variance? How is flexibility measured for trees?

A

Decreases variance, because it decreases the number of terminal nodes, which is how flexibility is measured for trees.

7
Q

Cost complexity pruning results in a group of sub trees that are _____.

A

Nested.

8
Q

For cost complexity pruning, does increasing the tuning parameter increase or decrease the variance of the method?

A

Decreases variance, because the number of terminal nodes is a decreasing function of the tuning parameter.

9
Q

When will trees outperform linear models?

A

When the relationship between the explanatory variables and the response is far more complicated than a linear equation.

10
Q

What are 4 advantages of trees?

A
  1. Easy to interpret and explain
  2. Can be presented visually
  3. Handle categorical predictors without the need for dummy variables
  4. Mimic human decision making
11
Q

What are the 2 disadvantages of trees?

A
  1. Not robust: small changes in the data can lead to large changes in the fitted tree.
  2. Do not have the same degree of predictive accuracy as other statistical methods.

12
Q

What is the procedure for bagging? 3

A
  1. Create b bootstrap samples from the original training set.
  2. Create a decision tree for each bootstrap sample using recursive binary splitting
  3. Predict the response of a new observation by averaging the predictions (regression) or taking the most frequent category (classification) across all b trees, as in the sketch below
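A minimal sketch of this procedure using scikit-learn's BaggingRegressor, as an illustration (the toy data are an assumption; oob_score=True also gives the OOB error discussed in later cards):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

# Steps 1-2: b bootstrap samples, one unpruned decision tree per sample
# (a decision tree is the default base learner).
# Step 3: .predict() averages the b trees' predictions.
bag = BaggingRegressor(
    n_estimators=100,   # b
    bootstrap=True,
    oob_score=True,     # out-of-bag estimate of test performance (R^2 here)
    random_state=0,
).fit(X, y)

print(bag.oob_score_)       # OOB score
print(bag.predict(X[:3]))   # averaged predictions for three observations
```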
13
Q

If we increase the value of b in bagging, does this cause overfitting?

A

No

14
Q

Does bagging reduce the variance? If yes, why?

A

Yes, because the variance of the average of a set of observations is less than the variance of a single observation.
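A worked version of this standard result (assuming B predictions, each with variance $\sigma^2$ and pairwise correlation $\rho$; the correlated case is what motivates random forests in later cards):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{\,b}(x)\right)
  \;=\; \rho\,\sigma^{2} \;+\; \frac{1-\rho}{B}\,\sigma^{2}
  \;\xrightarrow[\;B \to \infty\;]{}\; \rho\,\sigma^{2}.
```

Averaging drives the second term toward zero, but highly correlated trees ($\rho$ near 1) limit the variance reduction; with independent trees ($\rho = 0$) the variance falls all the way to $\sigma^2 / B$.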

15
Q

What is a disadvantage to bagging?

A

Difficult to interpret the entire bagged model.

16
Q

The out of bag error is used as an estimate of _____.

A

The test error (the test MSE in a regression setting).

17
Q

On average, how many observations are used to train each bagged tree? What are the observations called that were not used to train the trees?

A

On average, about 2/3 of the observations. The observations not used to train a given tree are called the out-of-bag (OOB) observations.

18
Q

If the bootstrap training dataset has a very strong predictor variable, what will happen to the trees? What happens if the trees are similar to one another?

A

Most trees will use this strong predictor in their top split, so the trees will look similar to one another. If the trees are similar, their predictions will tend to be highly correlated, which limits the reduction in variance from averaging.

19
Q

Random forests aim to _______ trees.

A

Decorrelate

20
Q

In random forests, k is chosen to be _____ under regression and _______ under classification problems.

A

Regression k=p/3

Classification k=sqrt(p)
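A minimal scikit-learn sketch, as an illustration (the toy data are an assumption; sklearn exposes k through the max_features argument, as a fraction of p for regression or "sqrt" for classification):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Toy data just to make the sketch runnable (p = 6 predictors).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y_num = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=300)
y_cls = (y_num > 0).astype(int)

# Regression: k = p/3 predictors considered at each split.
rf_reg = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)
rf_reg.fit(X, y_num)

# Classification: k = sqrt(p) predictors considered at each split.
rf_cls = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf_cls.fit(X, y_cls)
```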

21
Q

What happens if we decrease k in a random forest?

A

Reduces the correlation between predictions

22
Q

When we increase b for boosting, does this cause overfitting?

A

Yes

23
Q

Does boosting reduce bias or variance?

A

Reduces bias

24
Q

How do we get a prediction using the b boosted trees?

A

Sum of the predictions from each tree
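In ISLR-style notation, with shrinkage parameter $\lambda$ and trees $\hat f^{\,1}, \dots, \hat f^{\,B}$ each fit to the residuals of the current model, the boosted prediction is

```latex
\hat{f}(x) \;=\; \sum_{b=1}^{B} \lambda\, \hat{f}^{\,b}(x).
```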

25
Q

Are the test errors for bagging and random forests similar or significantly different?

A

Similar, but random forest tends to perform better than bagging, resulting in a lower test error

26
Q

Does random forest randomly select a subset of predictors to be considered for creating each tree?

A

No, random forests randomly select a subset of predictors to be considered for creating each SPLIT.

27
Q

Does bagging require performing CV?

A

No, bagging doesn’t operate on a choice of flexibility.

28
Q

Does a boosted model become a GLM when all its trees are stumps?

A

No, it becomes an additive model.
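A short worked form of this: when every tree is a stump (a single split on one variable), each tree depends on only one predictor, say $X_{j(b)}$ for stump $b$, so the boosted model is additive in the predictors:

```latex
\hat{f}(x) \;=\; \sum_{b=1}^{B} \lambda\, \hat{f}^{\,b}\!\left(x_{j(b)}\right)
          \;=\; \sum_{j=1}^{p} f_j(x_j),
```

where $f_j$ collects all the stumps that split on $X_j$. There is no link function or linear predictor, so it is not a GLM.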

29
Q

In a random forest, p=total number of features and m=number of features selected at each split. What is the probability that a split will not consider the strongest predictor?

A

(p-m)/p

Because p-m of the p features are not among the m considered at a given split. For example, with p = 10 and m = 3, the probability is 7/10.

30
Q

In random forest, is each tree constructed independently of every other tree?

A

Yes, through the independent bootstrap samples

31
Q

What is a benefit of boosting?

A

The model learns slowly: the prediction is built up by fitting successive trees, each one fit to the residuals of the current model.

32
Q

Does including more trees in the random forest model decrease the residuals?

A

No, we do not know the optimal number of trees to be included in this model.

33
Q

Can a random forest handle both qualitative and quantitative variables?

A

Yes

34
Q

Can random forests handle non-linear relationships?

A

Yes

35
Q

When are random forests appropriate?

A

When there isn’t a clear relationship between the predictors and the response. When there are clear relationships, we usually resort to statistical methods that include those relationships.

Example: a quadratic relationship is better modelled with a linear model (including a squared term) than with a random forest.

36
Q

True or false: GLMs cannot handle polynomial relationships.

A

False. They can, e.g. by including polynomial terms of the predictors.

37
Q

Which of the following is true regarding models that use numerous decision trees to obtain a prediction function f-hat?
A. Every decision tree is usually pruned
B. OOB error helps to determine the optimal flexibility of f-hat
C. They address the lacking predictive accuracy of a single decision tree
D. f-hat is usually easier to interpret compared to a model with a single decision tree.
E. They are more suitable for a regression setting than for a classification setting.

A

A. False. Bagging, boosting and random forests are all not pruned.

B. False. OOB error can help determine an appropriate number of trees to construct, but that applies to bagging and random forests where the number of trees is not a flexibility measure.

C. True. With bagging and random forest, improvement in accuracy comes from lowering f-hat’s variance, whereas with boosting, it comes from slow learning and finding an appropriate number of trees

D. False. Usually harder

E. False. The models are equally suited to handle both types of settings

38
Q

As b increases (for bagging) does the predictive accuracy of the model increase or decrease?

A

It increases, then levels off; using a very large b does not cause overfitting.

39
Q

When predictions are highly correlated between trees, what does this mean in terms of the change in variance?

A

Since bagging aims to decrease variance, highly correlated predictions lessen this effect: averaging the trees gives a smaller reduction in variance.

40
Q

What are the 3 tuning parameters in boosting?

A

b = number of trees, this is found using CV

d = number of splits in each tree, this controls the complexity of the boosted model (this is called the interaction depth)

lambda = shrinkage parameter; the smaller lambda is, the slower the learning (a small lambda is usually preferred). See the sketch below.
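A minimal scikit-learn sketch of tuning these, as an illustration (the toy data and grid values are assumptions; GradientBoostingRegressor exposes b, d, and lambda as n_estimators, max_depth, and learning_rate):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=300)

# lambda (learning_rate) is kept small; d (max_depth) controls the interaction depth;
# b (n_estimators) is chosen by K-fold CV, since too many trees can overfit.
cv = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01, max_depth=2, random_state=0),
    param_grid={"n_estimators": [100, 500, 1000, 2000]},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)

print(cv.best_params_)   # chosen b
```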

41
Q

Are all variables used when making the splits in boosting?

A

Yes; all variables are considered at each split. It is only in random forests that a random subset of variables is considered at each split.

42
Q

Does boosting have bootstrapping? Does it have CV?

A

No bootstrapping; yes CV. CV is used to choose b, the number of trees.

43
Q

A correlation matrix of standardized predictors is given, with all correlation values positive and high. What can be assumed about the signs of the PC loadings and the signs of the PC scores?

A

Because the variables are standardized (and hence centered), the scores of each PC sum to zero, so they cannot be all positive or all negative.

The loadings of the first PC, however, will be either all positive or all negative, because all the correlation values are positive.
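A small numerical check, as a sketch (the synthetic positively correlated data are an assumption; the overall sign of a PC is arbitrary, so "all the same sign" holds up to a flip):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three positively correlated synthetic predictors.
rng = np.random.default_rng(4)
common = rng.normal(size=500)
X = np.column_stack([common + 0.3 * rng.normal(size=500) for _ in range(3)])

Z = StandardScaler().fit_transform(X)   # standardize the predictors
pca = PCA().fit(Z)
scores = pca.transform(Z)

print(pca.components_[0])           # first-PC loadings: all the same sign
print(scores[:, 0].sum().round(6))  # first-PC scores sum to ~0 (centered data)
```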

44
Q

From a biplot, when can we determine which predictor has the highest variance and how would we do this?

A

We have to use the unscaled/unstandardized biplot. The predictor whose loading vector (arrow) is longest has the highest variance.

45
Q

Correlation matrix: predictors are standardized and the correlation values in the matrix are low. They have both positive and negative values. What can be concluded from the matrix in terms of signs for the PC loading vectors and the proportion of variance explained by only one PC?

A

The loadings of the first PC will not be all positive or all negative, because the correlation values have mixed signs.

We will need more than one PC to explain most of the variability in the data set, because the predictors are not highly correlated.

46
Q

Can we deduce anything about the second PC scores/loadings from a correlation matrix?

A

No

47
Q

If X1 and X2 are positively correlated, their loadings will have _____ signs. If they are negatively correlated, their loadings will have ____ signs.

A

Positively correlated: the same signs

Negatively correlated: opposing signs

48
Q

If two variables are highly correlated, what can be said of the magnitude of their loadings? How does this connect to biplots?

A

The loadings will have similar magnitudes (and, for positive correlation, the same signs). On a biplot this shows up as loading vectors of similar length pointing in similar directions.

49
Q

What is the probability that an observation is an OOB observation?

A

(1-1/n)^n
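A worked derivation (assuming a bootstrap sample of size n drawn with replacement from n observations): each of the n draws misses a given observation with probability $1 - 1/n$, so

```latex
P(\text{observation is OOB}) \;=\; \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow[\;n \to \infty\;]{}\; e^{-1} \approx 0.368,
```

which is where the "about 1/3 OOB, about 2/3 used to train each bagged tree" rule of thumb from the earlier card comes from.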