Topic 4: Machine Learning: Regularization, Regression Trees, Random Forest & Overfitting Flashcards
Define generalization, overfitting
Generalization applying the model to data not used in building the model;
overfitting means tailoring the model to the training data at the expense of generalization
Define fitting the graph, holdout data
fitting the graph: shows the accuracy of a model as a function of complexity
holdout data: data for which you know the value of the target value but was not used in building the model. (also called the test set)
Define the sweet spot for a typical fitting graph.
The sweet spot is where the model generalization on the test data is the highest.
Analyze overfitting for logistic regression and support vector machine.
Logistic regression can more easily lead to overfitting, while SVM incorporates complexity control
Explain why overfitting should be of concern.
As a model gets more complex it is allowed to pick up harmful spurious correlations
Define cross-validation and folds.
cross-validation is a more sophisticated holdout training and testing procedure estimating generalization performance.
It performs multiple splits and systematically swapping out samples for testing.
Folds are the number of splits (or partitions typically five or ten)
Define a learning curve.
A plot of generalization performance vs. amount of training data
Compare and contrast a learning curve with a fitting graph.
Learning curve shows the generalization performance (only on training data vs. amount of training data used). Fitting graph shows the same but plotted against model complexity and uses a fixed amount of training data.
Describe the shape of learning curves for logistic regression and tree induction.
Steep initially and less steep as the marginal advantage of having more data decreases. Sometimes it flattens out.
List strategies that can be used to avoid overfitting in tree induction.
(i) stop growing the tree before it becomes too comples, and (ii) to grow the tree until its too large, then prune it back, reducing its size/complexity
Describe how the minimum number of instances in a tree leaf can be used
to limit tree size.
It requires a minimum number instances to be present in a leaf - it will grow the branches that have a lot of data and cut short branches that have fewer data.
Explain how hypothesis testing can be used to limit tree induction.
Test every leaf to determine whether the obserfed difference in information gain could have been due to chance.
Define the best subset selection
fit a separate least squares regression for each combination of the p predictors p(p-1)/2 and pick the best.
List the steps used in the best subset selection
Step 1. denote the null model
Step 2. For k = 1, 2, p:
(a) fit all models
(b) pick the best models (highest R2 or lowest RSS) and call it Mk
Step 3. Select a single best using cross validated prediction error or Adjusted R2
Define deviance.
A measure that plays the role of RSS for a broader class of models