Topic 4: Machine Learning: Regularization, Regression Trees, Random Forest & Overfitting Flashcards

1
Q

Define generalization, overfitting

A

Generalization: applying the model to data not used in building the model;

overfitting means tailoring the model to the training data at the expense of generalization

2
Q

Define fitting the graph, holdout data

A

fitting the graph: shows the accuracy of a model as a function of complexity

holdout data: data for which you know the value of the target variable but which was not used in building the model (also called the test set)

3
Q

Define the sweet spot for a typical fitting graph.

A

The sweet spot is the complexity at which generalization performance on the holdout (test) data is highest; simpler models underfit, more complex ones overfit.

4
Q

Analyze overfitting for logistic regression and support vector machine.

A

Logistic regression can more easily lead to overfitting, while SVM incorporates complexity control

5
Q

Explain why overfitting should be of concern.

A

As a model gets more complex it is allowed to pick up harmful spurious correlations

6
Q

Define cross-validation and folds.

A

cross-validation: a more sophisticated holdout training-and-testing procedure for estimating generalization performance.

It performs multiple splits, systematically swapping out which samples are used for testing.

Folds are the splits (partitions) themselves; typically five or ten are used.
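The splitting-and-swapping idea can be sketched in plain Python (the sample count and fold count here are invented for illustration):

```python
# k-fold cross-validation indexing: cut the data into k folds and let
# each fold serve exactly once as the holdout (test) set.
n_samples, k = 20, 5
indices = list(range(n_samples))
fold_size = n_samples // k
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]

for test_fold in folds:
    train = [i for i in indices if i not in test_fold]
    # fit the model on `train`, score it on `test_fold`,
    # then average the k holdout scores for the final estimate
    assert len(train) + len(test_fold) == n_samples
```

Every sample lands in exactly one test fold, so no observation is ever scored by a model that trained on it.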

7
Q

Define a learning curve.

A

A plot of generalization performance vs. amount of training data

8
Q

Compare and contrast a learning curve with a fitting graph.

A

Learning curve: shows generalization performance (on holdout data) as a function of the amount of training data used. Fitting graph: shows generalization performance as a function of model complexity, using a fixed amount of training data.

9
Q

Describe the shape of learning curves for logistic regression and tree induction.

A

Steep initially, then less steep as the marginal advantage of more data decreases; sometimes it flattens out. Logistic regression often does better with small training sets, while tree induction keeps improving longer and can overtake it once there is a lot of data.

10
Q

List strategies that can be used to avoid overfitting in tree induction.

A

(i) stop growing the tree before it becomes too complex, and (ii) grow the tree until it is too large, then prune it back, reducing its size/complexity

11
Q

Describe how the minimum number of instances in a tree leaf can be used
to limit tree size.

A

It requires a minimum number of instances to be present in each leaf: the tree grows branches that have a lot of data and cuts short branches that have too little.
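As one concrete illustration (assuming scikit-learn is available; the synthetic dataset and the threshold of 20 are invented), the `min_samples_leaf` parameter enforces exactly this rule:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; min_samples_leaf=20 forbids any split that would
# leave a leaf with fewer than 20 training instances.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

t = tree.tree_
leaf_sizes = [t.n_node_samples[i] for i in range(t.node_count)
              if t.children_left[i] == -1]  # children_left == -1 marks a leaf
print(min(leaf_sizes))  # never below 20
```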

12
Q

Explain how hypothesis testing can be used to limit tree induction.

A

At every leaf, test whether the observed difference in information gain could have been due to chance; if so, stop splitting.

14
Q

Define the best subset selection

A

fit a separate least squares regression for each possible combination of the p predictors (2^p models in total) and pick the best.

15
Q

List the steps used in the best subset selection

A

Step 1. Let M0 denote the null model, which contains no predictors

Step 2. For k = 1, 2, ..., p:

(a) fit all (p choose k) models that contain exactly k predictors
(b) pick the best of them (highest R2 or lowest RSS) and call it Mk

Step 3. Select the single best model from M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2
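The steps above can be sketched with plain numpy on synthetic data (all names and numbers below are illustrative only):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)  # true subset: {0, 2}

def rss(cols):
    """Residual sum of squares of least squares on the given columns."""
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

# Step 2: for each size k, fit all (p choose k) models and keep the best M_k.
best = {k: min((rss(c), c) for c in itertools.combinations(range(p), k))
        for k in range(1, p + 1)}
# Step 3 would now compare M_1..M_p using CV error, Cp/AIC/BIC, or adjusted R2.
print(best[2][1])  # the two truly informative predictors
```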

16
Q

Define deviance.

A

A measure that plays the role of RSS for a broader class of models

17
Q

Describe forward stepwise selection and backward stepwise selection

A

Forward: start with the null model and add predictors one at a time, at each step adding the one that gives the greatest additional improvement to the fit.

Backward: starts by using all predictors and iteratively removes the least useful predictor, one-at-a-time

18
Q

List the steps used in the forward stepwise selection

A
  1. Let M0 denote the null model, which contains no predictors
  2. For k = 0, 1, ..., p-1:
    (a) consider all p - k models that augment the predictors in Mk with one additional predictor
    (b) choose the best among these p - k models and call it Mk+1 (best meaning lowest RSS or highest R2)
  3. Select the single best model from among M0, ..., Mp
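A pure-numpy sketch of these steps on invented data (the column indices and coefficients are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] + X[:, 3] + rng.normal(scale=0.1, size=n)  # informative: 1, 3

def rss(cols):
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

selected, remaining = [], list(range(p))
for _ in range(p):  # builds M_1, ..., M_p
    # steps (a)/(b): among the remaining predictors, add the one whose
    # addition yields the lowest RSS
    best_j = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best_j)
    remaining.remove(best_j)
print(selected)  # the informative predictors enter first
```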
19
Q

List the steps using the backward stepwise selection

A
  1. Let Mp denote the full model, which contains all predictors
  2. For k = p, p-1, ..., 1:
    (a) consider all k models that contain all but one of the predictors in Mk, for a total of k-1 predictors
    (b) choose the best among these k models, and call it Mk-1
  3. Select the single best model from among M0, ..., Mp
20
Q

Cp approach to variable selection

A

Cp = 1/n (RSS + 2dσ^2)

it adds a 2dσ^2 penalty to the training RSS to adjust for the fact that training error underestimates test error as the number of predictors increases (choose the model with the lowest Cp value).

21
Q

Akaike information criterion (AIC) approach to variable selection

A

AIC = 1/n (RSS + 2dσ^2)

(for least squares models, AIC is proportional to Cp, so the two criteria select the same model)

22
Q

Bayesian information criterion (BIC) approach to variable selection

A

BIC = 1/n (RSS + log(n)dσ^2)

where n is the number of observations; because log(n) > 2 for n > 7, BIC penalizes complexity more heavily than Cp and thus favors smaller models.
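A tiny numeric sketch (numpy only; the fit statistics below are invented) comparing the Cp and BIC penalties:

```python
import numpy as np

# Hypothetical fit statistics for one candidate model
n, d, RSS, sigma2 = 100, 5, 42.0, 0.9

cp = (RSS + 2 * d * sigma2) / n           # Cp's penalty factor is 2
bic = (RSS + np.log(n) * d * sigma2) / n  # BIC's penalty factor is log(n)
print(cp, bic)  # bic > cp here, since log(100) > 2
```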

23
Q

Adjusted R2 approach to variable selection

A

Adjusted R2 = 1 - [(RSS/(n-d-1)) / (TSS/(n-1))]

here a large value of adjusted R2 indicates a model with small test error
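The formula as a small function (pure Python; the inputs are invented). Unlike plain R2, it can fall when a useless predictor is added, because d appears in the denominator:

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - (RSS/(n-d-1)) / (TSS/(n-1))."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Adding useless predictors raises d but barely lowers RSS,
# so adjusted R^2 drops even though plain R^2 cannot.
print(adjusted_r2(rss=10, tss=100, n=20, d=3))   # 0.88125
print(adjusted_r2(rss=9.9, tss=100, n=20, d=10))
```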

24
Q

Define ridge regression, tuning parameter, and shrinkage penalty.

A

ridge regression: fits a line that does not fit the training data quite as well, trading a small increase in bias for a larger decrease in variance. It minimizes RSS + λ × slope²

tuning parameter: λ (lambda) determines the severity of the penalty added to the RSS (larger λ makes the fitted line less sensitive to the X variable)

shrinkage penalty: λ × slope², which shrinks the parameter estimates toward 0
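A closed-form sketch with numpy on synthetic data (not sklearn; everything below is illustrative): the minimizer of RSS + λ‖β‖² is β = (XᵀX + λI)⁻¹Xᵀy, and its size shrinks as λ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

def ridge(lam):
    """Closed-form ridge coefficients: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = [float(np.linalg.norm(ridge(lam))) for lam in (0.0, 10.0, 100.0)]
print(norms)  # coefficient size shrinks toward 0 as lambda grows
```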

25
Q

Define l2 norm

A

l2 norm: the distance of the slope (coefficient vector) from zero; its square is the ridge regression (squared) penalty.

26
Q

Define standardizing the predictors

A

Converting all predictors to the same scale (e.g. dividing each by its standard deviation) so that the shrinkage penalty treats them equally.

27
Q

Describe the bias-variance tradeoff

A

As lambda increases bias increases and variance decreases.

28
Q

Describe the ridge regression.

A

reducing the MSE by increasing lambda (works well when the least squares estimates have high variance, e.g. when the number of predictors is large relative to the number of observations)

29
Q

Describe the advantage of Lasso over ridge regression

A
  • Ridge regression will not result in exclusion of variables, where lasso will force some of the coefficient estimates to be exactly zero (and thus excluding them).
  • Lasso is easier to interpret, because it produces sparse models (a subset of variables).
  • Neither ridge regression nor the lasso will universally dominate the other.
30
Q

Describe how to select the tuning parameter (lambda)

A
  1. Choose a grid of lambda values,
  2. compute the cross-validation error for each value of lambda, and
  3. pick the lambda with the smallest cross-validation error.
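These three steps sketched as a manual grid search with numpy (the data, grid values, and fold count are all invented; closed-form ridge is used as the model):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

def ridge_fit(Xtr, ytr, lam):
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]        # step 1: grid of lambdas
folds = np.array_split(np.arange(n), 5)
cv_err = []
for lam in grid:                            # step 2: CV error per lambda
    errs = []
    for f in folds:
        mask = np.ones(n, dtype=bool)
        mask[f] = False                     # hold fold f out
        beta = ridge_fit(X[mask], y[mask], lam)
        errs.append(float(np.mean((y[f] - X[f] @ beta) ** 2)))
    cv_err.append(np.mean(errs))
best_lam = grid[int(np.argmin(cv_err))]     # step 3: smallest CV error wins
print(best_lam)
```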
31
Q

Interpret as well as predict using a given decision tree.

A

Regression trees predict a quantitative response (the mean response of the training observations in a leaf); classification trees predict a qualitative response (the most common class in a leaf).

32
Q

Describe the advantages and disadvantages of decision trees compared to other
classification and regression methods.

A
  1. easy to explain to people
  2. more closely mirrors human decision-making (compared to regression)
  3. can be displayed graphically
  4. can easily handle qualitative predictors
  5. (disadvantages) generally lower predictive accuracy than other regression/classification methods, and non-robust: small changes in the data can produce a very different tree
33
Q

Describe recursive binary (greedy) splitting for constructing regression trees

A

the best split at that particular step is chosen without taking future splits into account; each resulting region is then split again in the same greedy way

34
Q

Describe tree pruning, specifically cost complexity (weakest link) pruning

A

the main idea behind pruning is to prevent overfitting the training data so that the regression tree will do better with the testing data.

Cost complexity pruning helps determine which tree to use. You can calculate a tree score per tree: SSR + α × number of terminal nodes (leaves). The α term is a penalty for complexity.

LOWEST TREE SCORE IS BEST
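A sketch of the same idea using scikit-learn's `ccp_alpha` (sklearn assumed available; the data is synthetic): a larger per-leaf penalty means the winning subtree has fewer terminal nodes.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# Larger alpha = larger per-leaf charge in SSR + alpha * (#leaves),
# so the lowest-scoring subtree shrinks as alpha grows.
leaves = [DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
          for a in (0.0, 50.0, 500.0)]
print(leaves)  # leaf counts are non-increasing in alpha
```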

35
Q

Describe the construction of classification trees using the classification error rate, Gini index, and entropy.

A

Similar to regression tree, but predicts a qualitative response rather than a quantitative one.

  1. Classification error rate - fraction of training observations not belonging to the most common class -> (0,0,1,1,1,1 = 2/6)
  2. Gini Index - calculates node purity - the lower the value the more observations from a single class
  3. Entropy - calculates node impurity
36
Q

Calculate the Gini Index.

A

1- (probability “Yes”)^2 - (probability of “No”)^2
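The binary Gini index above as a tiny function (pure Python):

```python
def gini(p_yes):
    """Binary Gini index: 1 - P(yes)^2 - P(no)^2."""
    return 1 - p_yes ** 2 - (1 - p_yes) ** 2

print(gini(0.5))  # 0.5 -> maximally impure 50/50 node
print(gini(1.0))  # 0.0 -> perfectly pure node
```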

37
Q

Describe bagging and out-of-bag error estimation

A

Bagging is a procedure to reduce the variance of a statistical learning method.

Out-of-bag observations are those that did not make it into a given bootstrapped dataset (about one third of them); predicting each observation using only the trees for which it was out-of-bag yields a valid estimate of the test error.
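One bootstrap draw and its out-of-bag set sketched in plain Python (the row count is invented); on average about 1/e ≈ 36.8% of the rows are left out:

```python
import random

random.seed(0)
n = 1000
# One bootstrap sample: n draws with replacement from n rows.
sample = [random.randrange(n) for _ in range(n)]
oob = set(range(n)) - set(sample)  # rows that were never drawn
print(len(oob) / n)  # roughly 1/e ~ 0.368
```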

38
Q

Describe how variable importance measures can be created using the Gini index.

A

Larger total decrease in the Gini index from splits on a variable (averaged over all trees) = larger importance

39
Q

Describe random forests.

A

Build a number of decision trees on bootstrapped training samples. At each split, consider only a random subset of the predictors (columns), typically m ≈ √p of them.
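A sketch with scikit-learn (sklearn assumed available; the data is synthetic): `max_features="sqrt"` is the random-subset-of-predictors rule that distinguishes a random forest from plain bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped trees
    max_features="sqrt",  # random subset of predictors at each split
    oob_score=True,       # score each row with trees it was left out of
    random_state=0,
).fit(X, y)
print(round(forest.oob_score_, 2))
```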

40
Q

Compare and contrast random forests to bagging

A

Bagging alone does not lead to a substantial reduction in variance, because every tree tends to split on the same strong predictors, making the trees highly correlated. Random forests decorrelate the trees by restricting each split to a random subset of the predictors.

41
Q

Describe boosting as an approach for improving the prediction results from decision trees.

A

each tree is grown sequentially, using information from previously grown trees to correct their errors (not the case in bagging or random forests, where trees are grown independently)
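A sequential-growing sketch with scikit-learn's gradient boosting (sklearn assumed available; data synthetic): `staged_predict` exposes the prediction after each tree, so you can watch training error fall as later trees correct earlier ones.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=0)
boost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  max_depth=2, random_state=0).fit(X, y)

# Training MSE after each sequential tree
errors = [float(((y - pred) ** 2).mean()) for pred in boost.staged_predict(X)]
print(errors[0] > errors[-1])  # True: later trees reduce training error
```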