Topic 4: Machine Learning: Regularization, Regression Trees, Random Forest & Overfitting Flashcards

1
Q

Define generalization, overfitting

A

Generalization: applying the model to data not used in building the model;

overfitting means tailoring the model to the training data at the expense of generalization

2
Q

Define fitting the graph, holdout data

A

fitting the graph: shows the accuracy of a model as a function of complexity

holdout data: data for which you know the value of the target variable but which was not used in building the model (also called the test set)

3
Q

Define the sweet spot for a typical fitting graph.

A

The sweet spot is the complexity at which generalization performance on the holdout (test) data is highest; simpler models underfit, more complex ones overfit.

4
Q

Analyze overfitting for logistic regression and support vector machine.

A

Logistic regression can more easily lead to overfitting, while SVM incorporates complexity control

5
Q

Explain why overfitting should be of concern.

A

As a model gets more complex it is allowed to pick up harmful spurious correlations

6
Q

Define cross-validation and folds.

A

cross-validation: a more sophisticated holdout training-and-testing procedure for estimating generalization performance.

It performs multiple splits, systematically swapping out which samples are used for testing.

Folds are the splits (partitions) themselves; typically five or ten are used.
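The splitting-and-swapping idea can be sketched in plain Python (the sample count and fold count here are invented for illustration):

```python
# k-fold cross-validation indexing: cut the data into k folds and let
# each fold serve exactly once as the holdout (test) set.
n_samples, k = 20, 5
indices = list(range(n_samples))
fold_size = n_samples // k
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]

for test_fold in folds:
    train = [i for i in indices if i not in test_fold]
    # fit the model on `train`, score it on `test_fold`,
    # then average the k holdout scores for the final estimate
    assert len(train) + len(test_fold) == n_samples
```

Every sample lands in exactly one test fold, so no observation is ever scored by a model that trained on it.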

7
Q

Define a learning curve.

A

A plot of generalization performance vs. amount of training data

8
Q

Compare and contrast a learning curve with a fitting graph.

A

Learning curve: shows generalization performance (on holdout data) as a function of the amount of training data used. Fitting graph: shows generalization performance as a function of model complexity, using a fixed amount of training data.

9
Q

Describe the shape of learning curves for logistic regression and tree induction.

A

Steep initially, then less steep as the marginal advantage of more data decreases; sometimes it flattens out. Logistic regression often does better with small training sets, while tree induction keeps improving longer and can overtake it once there is a lot of data.

10
Q

List strategies that can be used to avoid overfitting in tree induction.

A

(i) stop growing the tree before it becomes too complex, and (ii) grow the tree until it is too large, then prune it back, reducing its size/complexity

11
Q

Describe how the minimum number of instances in a tree leaf can be used
to limit tree size.

A

It requires a minimum number of instances to be present in each leaf: the tree grows branches that have a lot of data and cuts short branches that have too little.
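As one concrete illustration (assuming scikit-learn is available; the synthetic dataset and the threshold of 20 are invented), the `min_samples_leaf` parameter enforces exactly this rule:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; min_samples_leaf=20 forbids any split that would
# leave a leaf with fewer than 20 training instances.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

t = tree.tree_
leaf_sizes = [t.n_node_samples[i] for i in range(t.node_count)
              if t.children_left[i] == -1]  # children_left == -1 marks a leaf
print(min(leaf_sizes))  # never below 20
```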

12
Q

Explain how hypothesis testing can be used to limit tree induction.

A

At every leaf, test whether the observed difference in information gain could have been due to chance; if so, stop splitting.

14
Q

Define the best subset selection

A

fit a separate least squares regression for each possible combination of the p predictors (2^p models in total) and pick the best.

15
Q

List the steps used in the best subset selection

A

Step 1. Let M0 denote the null model, which contains no predictors

Step 2. For k = 1, 2, ..., p:

(a) fit all (p choose k) models that contain exactly k predictors
(b) pick the best of them (highest R2 or lowest RSS) and call it Mk

Step 3. Select the single best model from M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2
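The steps above can be sketched with plain numpy on synthetic data (all names and numbers below are illustrative only):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)  # true subset: {0, 2}

def rss(cols):
    """Residual sum of squares of least squares on the given columns."""
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

# Step 2: for each size k, fit all (p choose k) models and keep the best M_k.
best = {k: min((rss(c), c) for c in itertools.combinations(range(p), k))
        for k in range(1, p + 1)}
# Step 3 would now compare M_1..M_p using CV error, Cp/AIC/BIC, or adjusted R2.
print(best[2][1])  # the two truly informative predictors
```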

16
Q

Define deviance.

A

A measure that plays the role of RSS for a broader class of models

17
Q

Describe forward stepwise selection and backward stepwise selection

A

Forward: start with the null model and add predictors one at a time, at each step adding the one that gives the greatest additional improvement to the fit.

Backward: starts by using all predictors and iteratively removes the least useful predictor, one-at-a-time

18
Q

List the steps used in the forward stepwise selection

A
  1. Let M0 denote the null model, which contains no predictors
  2. For k = 0, 1, ..., p-1:
    (a) consider all p - k models that augment the predictors in Mk with one additional predictor
    (b) choose the best among these p - k models and call it Mk+1 (best meaning lowest RSS or highest R2)
  3. Select the single best model from among M0, ..., Mp
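A pure-numpy sketch of these steps on invented data (the column indices and coefficients are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] + X[:, 3] + rng.normal(scale=0.1, size=n)  # informative: 1, 3

def rss(cols):
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

selected, remaining = [], list(range(p))
for _ in range(p):  # builds M_1, ..., M_p
    # steps (a)/(b): among the remaining predictors, add the one whose
    # addition yields the lowest RSS
    best_j = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best_j)
    remaining.remove(best_j)
print(selected)  # the informative predictors enter first
```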
19
Q

List the steps using the backward stepwise selection

A
  1. Let Mp denote the full model, which contains all predictors
  2. For k = p, p-1, ..., 1:
    (a) consider all k models that contain all but one of the predictors in Mk, for a total of k-1 predictors
    (b) choose the best among these k models, and call it Mk-1
  3. Select the single best model from among M0, ..., Mp
20
Q

Cp approach to variable selection

A

Cp = 1/n (RSS + 2dσ^2)

it adds a 2dσ^2 penalty to the training RSS to adjust for the fact that training error underestimates test error as the number of predictors increases (choose the model with the lowest Cp value).

21
Q

Akaike information criterion (AIC) approach to variable selection

A

AIC = 1/n (RSS + 2dσ^2)

(for least squares models, AIC is proportional to Cp, so the two criteria select the same model)

22
Q

Bayesian information criterion (BIC) approach to variable selection

A

BIC = 1/n (RSS + log(n)dσ^2)

where n is the number of observations; because log(n) > 2 for n > 7, BIC penalizes complexity more heavily than Cp and thus favors smaller models.
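A tiny numeric sketch (numpy only; the fit statistics below are invented) comparing the Cp and BIC penalties:

```python
import numpy as np

# Hypothetical fit statistics for one candidate model
n, d, RSS, sigma2 = 100, 5, 42.0, 0.9

cp = (RSS + 2 * d * sigma2) / n           # Cp's penalty factor is 2
bic = (RSS + np.log(n) * d * sigma2) / n  # BIC's penalty factor is log(n)
print(cp, bic)  # bic > cp here, since log(100) > 2
```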

23
Q

Adjusted R2 approach to variable selection

A

Adjusted R2 = 1 - [(RSS/(n-d-1)) / (TSS/(n-1))]

here a large value of adjusted R2 indicates a model with small test error
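The formula as a small function (pure Python; the inputs are invented). Unlike plain R2, it can fall when a useless predictor is added, because d appears in the denominator:

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - (RSS/(n-d-1)) / (TSS/(n-1))."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Adding useless predictors raises d but barely lowers RSS,
# so adjusted R^2 drops even though plain R^2 cannot.
print(adjusted_r2(rss=10, tss=100, n=20, d=3))   # 0.88125
print(adjusted_r2(rss=9.9, tss=100, n=20, d=10))
```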

24
Q

Define ridge regression, tuning parameter, and shrinkage penalty.

A

ridge regression: fits a line that does not fit the training data quite as well, trading a small increase in bias for a larger decrease in variance. It minimizes RSS + λ × slope²

tuning parameter: λ (lambda) determines the severity of the penalty added to the RSS (larger λ makes the fitted line less sensitive to the X variable)

shrinkage penalty: λ × slope², which shrinks the parameter estimates toward 0
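A closed-form sketch with numpy on synthetic data (not sklearn; everything below is illustrative): the minimizer of RSS + λ‖β‖² is β = (XᵀX + λI)⁻¹Xᵀy, and its size shrinks as λ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

def ridge(lam):
    """Closed-form ridge coefficients: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = [float(np.linalg.norm(ridge(lam))) for lam in (0.0, 10.0, 100.0)]
print(norms)  # coefficient size shrinks toward 0 as lambda grows
```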

25
Q

Define l2 norm

A

l2 norm: the distance of the slope (coefficient vector) from zero; its square is the ridge regression (squared) penalty.

26
Q

Define standardizing the predictors

A

Converting all predictors to the same scale (e.g. dividing each by its standard deviation) so that the shrinkage penalty treats them equally.

27
Q

Describe the bias-variance tradeoff

A

As lambda increases bias increases and variance decreases.

28
Q

Describe the ridge regression.

A

reducing the MSE by increasing lambda (works well when the least squares estimates have high variance, e.g. when the number of predictors is large relative to the number of observations)

29
Q

Describe the advantage of Lasso over ridge regression

A
  • Ridge regression will not result in exclusion of variables, where lasso will force some of the coefficient estimates to be exactly zero (and thus excluding them).
  • Lasso is easier to interpret, because it produces sparse models (a subset of variables).
  • Neither ridge regression nor the lasso will universally dominate the other.
30
Q

Describe how to select the tuning parameter (lambda)

A
  1. Choose a grid of lambda values,
  2. compute the cross-validation error for each value of lambda, and
  3. pick the lambda with the smallest cross-validation error.
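These three steps sketched as a manual grid search with numpy (the data, grid values, and fold count are all invented; closed-form ridge is used as the model):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

def ridge_fit(Xtr, ytr, lam):
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]        # step 1: grid of lambdas
folds = np.array_split(np.arange(n), 5)
cv_err = []
for lam in grid:                            # step 2: CV error per lambda
    errs = []
    for f in folds:
        mask = np.ones(n, dtype=bool)
        mask[f] = False                     # hold fold f out
        beta = ridge_fit(X[mask], y[mask], lam)
        errs.append(float(np.mean((y[f] - X[f] @ beta) ** 2)))
    cv_err.append(np.mean(errs))
best_lam = grid[int(np.argmin(cv_err))]     # step 3: smallest CV error wins
print(best_lam)
```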
31
Q

Interpret as well as predict using a given decision tree.

A

Regression trees predict a quantitative response (the mean response of the training observations in a leaf); classification trees predict a qualitative response (the most common class in a leaf).

32
Q

Describe the advantages and disadvantages of decision trees compared to other
classification and regression methods.

A
  1. easy to explain to people
  2. more closely mirrors human decision-making (compared to regression)
  3. can be displayed graphically
  4. can easily handle qualitative predictors
  5. (disadvantages) generally lower predictive accuracy than other regression/classification methods, and non-robust: small changes in the data can produce a very different tree
33
Q

Describe recursive binary (greedy) splitting for constructing regression trees

A

the best split at that particular step is chosen without taking future splits into account; each resulting region is then split again in the same greedy way

34
Q

Describe tree pruning, specifically cost complexity (weakest link) pruning

A

the main idea behind pruning is to prevent overfitting the training data so that the regression tree will do better with the testing data.

Cost complexity pruning helps determine which tree to use. You can calculate a tree score per tree: SSR + α × number of terminal nodes (leaves). The α term is a penalty for complexity.

LOWEST TREE SCORE IS BEST
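A sketch of the same idea using scikit-learn's `ccp_alpha` (sklearn assumed available; the data is synthetic): a larger per-leaf penalty means the winning subtree has fewer terminal nodes.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# Larger alpha = larger per-leaf charge in SSR + alpha * (#leaves),
# so the lowest-scoring subtree shrinks as alpha grows.
leaves = [DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
          for a in (0.0, 50.0, 500.0)]
print(leaves)  # leaf counts are non-increasing in alpha
```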

35
Q

Describe the construction of classification trees using the classification error rate, Gini index, and entropy.

A

Similar to regression tree, but predicts a qualitative response rather than a quantitative one.

  1. Classification error rate - fraction of training observations not belonging to the most common class -> (0,0,1,1,1,1 = 2/6)
  2. Gini Index - calculates node purity - the lower the value the more observations from a single class
  3. Entropy - calculates node impurity
36
Q

Calculate the Gini Index.

A

1- (probability “Yes”)^2 - (probability of “No”)^2
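The binary Gini index above as a tiny function (pure Python):

```python
def gini(p_yes):
    """Binary Gini index: 1 - P(yes)^2 - P(no)^2."""
    return 1 - p_yes ** 2 - (1 - p_yes) ** 2

print(gini(0.5))  # 0.5 -> maximally impure 50/50 node
print(gini(1.0))  # 0.0 -> perfectly pure node
```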

37
Q

Describe bagging and out-of-bag error estimation

A

Bagging is a procedure to reduce the variance of a statistical learning method.

Out-of-bag observations are those that did not make it into a given bootstrapped dataset (about one third of them); predicting each observation using only the trees for which it was out-of-bag yields a valid estimate of the test error.
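One bootstrap draw and its out-of-bag set sketched in plain Python (the row count is invented); on average about 1/e ≈ 36.8% of the rows are left out:

```python
import random

random.seed(0)
n = 1000
# One bootstrap sample: n draws with replacement from n rows.
sample = [random.randrange(n) for _ in range(n)]
oob = set(range(n)) - set(sample)  # rows that were never drawn
print(len(oob) / n)  # roughly 1/e ~ 0.368
```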

38
Q

Describe how variable importance measures can be created using the Gini index.

A

Larger total decrease in the Gini index from splits on a variable (averaged over all trees) = larger importance

39
Q

Describe random forests.

A

Build a number of decision trees on bootstrapped training samples. At each split, consider only a random subset of the predictors (columns), typically m ≈ √p of them.
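A sketch with scikit-learn (sklearn assumed available; the data is synthetic): `max_features="sqrt"` is the random-subset-of-predictors rule that distinguishes a random forest from plain bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped trees
    max_features="sqrt",  # random subset of predictors at each split
    oob_score=True,       # score each row with trees it was left out of
    random_state=0,
).fit(X, y)
print(round(forest.oob_score_, 2))
```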

40
Q

Compare and contrast random forests to bagging

A

Bagging alone does not lead to a substantial reduction in variance, because every tree tends to split on the same strong predictors, making the trees highly correlated. Random forests decorrelate the trees by restricting each split to a random subset of the predictors.

41
Q

Describe boosting as an approach for improving the prediction results from decision trees.

A

each tree is grown sequentially, using information from previously grown trees to correct their errors (not the case in bagging or random forests, where trees are grown independently)
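A sequential-growing sketch with scikit-learn's gradient boosting (sklearn assumed available; data synthetic): `staged_predict` exposes the prediction after each tree, so you can watch training error fall as later trees correct earlier ones.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=0)
boost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  max_depth=2, random_state=0).fit(X, y)

# Training MSE after each sequential tree
errors = [float(((y - pred) ** 2).mean()) for pred in boost.staged_predict(X)]
print(errors[0] > errors[-1])  # True: later trees reduce training error
```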