Chapter 5 Flashcards
What is overfitting?
The tendency of data mining procedures to tailor models to the training data at the expense of generalization to previously unseen data points
What is generalization?
The property of a model or modeling process whereby the model applies to data that were not used to build it.
Goal: build models that generalize beyond the training data to the general population.
What is a table model?
- The instances for which the target variable is true are stored in a table; the model is merely a memory of the training data.
- Remembers the training data but performs no generalization.
- When a new customer is added, the model will predict “0% likelihood of churning” since this customer is not included in the table.
- Useless model in practice.
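The table model's behavior can be sketched in a few lines of Python (a hypothetical illustration with made-up names, not from the text): the "model" is just a lookup table of training customers, so any unseen customer gets a 0% churn prediction.

```python
class TableModel:
    """Memorizes the training data; performs no generalization."""

    def __init__(self, training_data):
        # training_data: {customer_id: churn_probability}
        self.table = dict(training_data)

    def predict(self, customer_id):
        # unseen customers are simply not in the table -> 0% churn
        return self.table.get(customer_id, 0.0)

model = TableModel({"alice": 1.0, "bob": 0.0})
print(model.predict("alice"))  # memorized training case -> 1.0
print(model.predict("carol"))  # new customer -> always 0.0
```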
Why is overfitting bad?
- Overfitting reduces the performance of the model because the model picks up spurious correlations that are idiosyncratic to the training set.
- These correlations do not represent characteristics of the general population
- Spurious correlations produce incorrect generalizations in the model
- Every sample is a subset of the general population: will have variations even when there is no bias in the sampling
- There is no general way to determine in advance whether a model has overfit
- To detect and measure overfitting, evaluate the model on a holdout set
What is a fitting graph and why is it useful for overfitting analysis?
A graph that shows a model's accuracy as a function of model complexity.
The fitting graph shows the difference between a modeling procedure's accuracy on the training data and its accuracy on holdout data as model complexity changes. Generally, there will be more overfitting as the model is allowed to become more complex.
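To make the shape of a fitting graph concrete, here is a toy sketch (my own hypothetical data, not from the chapter) using 1-D k-nearest-neighbors, where a smaller k means a more complex model: at k = 1 the model memorizes the training set perfectly but does worse on holdout data, while k = 3 trades training accuracy for better holdout accuracy.

```python
def knn(train, x, k):
    # majority vote among the k nearest training examples (1-D feature)
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [y for _, y in nearest]
    return max(set(votes), key=votes.count)

def accuracy(train, data, k):
    return sum(knn(train, x, k) == y for x, y in data) / len(data)

# training data contains two noise points, (-1, 1) and (5, 0)
train = [(-4, 0), (-3, 0), (-2, 0), (-1, 1), (2, 1), (3, 1), (4, 1), (5, 0)]
holdout = [(-0.8, 0), (-3.4, 0), (2.4, 1), (4.4, 1)]

for k in (1, 3):  # k = 1 is the most complex model (pure memorization)
    print(k, accuracy(train, train, k), accuracy(train, holdout, k))
```

With k = 1, training accuracy is 1.0 but holdout accuracy is lower; with k = 3 the noise points are smoothed away, so training accuracy drops while holdout accuracy rises.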
What is the base rate?
A classifier that always selects the majority class is called a base rate classifier; its accuracy equals the base rate, i.e. the proportion of the majority class in the data.
On new cases, the table model behaves like a base rate classifier: since it predicts no churn for every new case it is presented with, it gets every no-churn case right and every churn case wrong.
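A minimal sketch (hypothetical labels) showing that a majority-class classifier's accuracy equals the base rate:

```python
from collections import Counter

def base_rate_classifier(train_labels):
    # always predict the majority class seen in training
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _: majority

labels = ["no churn"] * 9 + ["churn"]   # 90% majority class
clf = base_rate_classifier(labels)
acc = sum(clf(x) == y for x, y in zip(range(10), labels)) / len(labels)
print(acc)  # 0.9, the base rate
```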
Overfitting in Classification tree / Tree induction
- Once leaves are pure, the tree overfits, meaning it acquires details of the training set that are not characteristic of the population in general
- This means the model generalizes from fewer and fewer data points (at each leaf)
- Overfit classification trees can generalize better than table models, because every new instance will get a classification (table model: new obs. fail to be matched entirely, error rate = very high)
- The complexity of the tree depends on the number of nodes
The optimal number of nodes is a trade-off between:
- Not splitting data at all
- Building a complex tree with only pure leaves
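As a toy sketch of that trade-off (my own minimal 1-D tree inducer, not the book's algorithm): a min_leaf parameter stops splitting before leaves become too small and "too pure".

```python
def majority(labels):
    return max(set(labels), key=labels.count)

def grow_tree(xs, ys, min_leaf=2):
    # stop splitting when the node is pure or too small to split
    if len(set(ys)) == 1 or len(xs) < 2 * min_leaf:
        return ("leaf", majority(ys))
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue  # split would create a leaf with too few instances
        err = (len(left) - left.count(majority(left))
               + len(right) - right.count(majority(right)))
        if best is None or err < best[0]:
            best = (err, t)
    if best is None:
        return ("leaf", majority(ys))
    t = best[1]
    lx, ly = zip(*[(x, y) for x, y in zip(xs, ys) if x < t])
    rx, ry = zip(*[(x, y) for x, y in zip(xs, ys) if x >= t])
    return ("split", t, grow_tree(list(lx), list(ly), min_leaf),
                        grow_tree(list(rx), list(ry), min_leaf))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, t, lo, hi = tree
    return predict(lo, x) if x < t else predict(hi, x)

tree = grow_tree([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 0, 1, 1, 1, 1])
print(tree)  # a single split at 5 with two pure leaves
```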
What is overfitting in mathematical functions?
- Mathematical functions can overfit as you add more attributes and thus increase the dimensionality.
- Selection of attributes increasingly cannot be done manually (1000s of attributes to choose from)
- As you increase the dimensionality you can perfectly fit larger sets of arbitrary points
- The more attributes a function has, the more leeway the modeling procedure has to fit the training set
- With more flexibility (attributes) comes more overfitting
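A quick illustration of "enough flexibility fits arbitrary points" (hypothetical numbers): with n points, a degree-(n-1) polynomial, here evaluated via Lagrange interpolation, hits every point exactly, however noisy they are.

```python
def interpolate(points, x):
    # evaluate the degree-(n-1) Lagrange polynomial through n points
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

points = [(0, 1), (1, 3), (2, -2), (3, 7)]   # arbitrary "noisy" points
for xi, yi in points:
    print(xi, interpolate(points, xi))       # fits every point exactly
```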
What is overfitting in linear functions (SVM and logistic regression)?
- Logistic regression function
- will always find a boundary if it exists, even if that means moving the boundary to accommodate outliers
- If one observation changes the linear boundary strongly, it indicates overfitting
- One observation should not have an impact on a large dataset.
- SVM
- Less sensitive to individual examples
- In SVMs, to avoid overfitting we choose a soft margin instead of a hard one, i.e. we intentionally let some data points violate the margin (but still penalize them) so that the classifier does not overfit the training sample.
- Reason for less overfitting: training procedure incorporates complexity controls
- one way: regularisation
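One form of complexity control is regularization. Below is a from-scratch sketch (hypothetical data, plain gradient descent with an L2 penalty lam; parameter names are my own): on perfectly separable data the unregularized weight keeps growing to push the boundary harder, while the penalized weight stays bounded.

```python
import math

def train_logreg(X, y, lam=0.0, lr=0.1, steps=2000):
    # one feature + bias; the L2 penalty lam shrinks the weight w
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(steps):
        gw = gb = 0.0
        for x, t in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - t) * x
            gb += (p - t)
        w -= lr * (gw / n + lam * w)
        b -= lr * gb / n
    return w, b

X, y = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]   # perfectly separable
w_free, _ = train_logreg(X, y, lam=0.0)
w_reg, _ = train_logreg(X, y, lam=1.0)
print(abs(w_free), abs(w_reg))  # the penalized weight is smaller
```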
What is cross validation?
- A more sophisticated holdout training and testing procedure
- Computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing
- Also calculates statistics on the performance (e.g. mean, variance)
- First step in cross-validation: split the dataset into k partitions called folds
- Typical values: k = 5 or 10
- The training and testing is iterated k times
- Purpose of cross-validation is to use the original labeled data efficiently to estimate the performance of a modeling procedure
- Application: can be performed on both, trees & logistic regressions
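The iteration scheme above can be sketched in plain Python (my own minimal version; train_and_score is a hypothetical callback standing in for any modeling procedure):

```python
def kfold_indices(n, k):
    # split indices 0..n-1 into k near-equal contiguous folds
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, train_and_score):
    # each fold serves as the test set exactly once
    scores = []
    for test_idx in kfold_indices(n, k):
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        scores.append(train_and_score(train_idx, test_idx))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / k
    return mean, var

# demo with a dummy scorer that just reports the test-fold size
mean, var = cross_validate(10, 5, lambda tr, te: len(te))
print(mean, var)  # 2.0 0.0
```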
What is holdout data?
- Test set used to evaluate generalization performance
- Data for which the target value is known, but which will not be used to build the model
- “Hold out some of the data for which we know the target variable” (e.g. 20% as testing data)
- The model is built on the training data and evaluated on the held-out testing data
- Holdout sets only provide a single estimate of generalization performance
- Good performance might be due to chance: apply cross-validation
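A minimal holdout-split sketch (my own helper, stdlib only): shuffle the indices once, reserve a fraction for testing, train on the rest.

```python
import random

def holdout_split(n, test_frac=0.2, seed=0):
    # shuffle indices, hold out the first test_frac for evaluation
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * test_frac)
    return idx[cut:], idx[:cut]   # (train indices, test indices)

train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))  # 80 20
```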
What is a learning curve?
- A plot of the generalization performance (accuracy) vs the amount of training data (no. instances)
- Usually steep in the beginning: model finds most apparent regularities in the data
- Flattens out: the marginal advantage of having more data decreases at some point
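A toy learning-curve sketch (my own hypothetical 1-D data, 1-nearest-neighbor as the model): with a fixed holdout set, accuracy rises as the training set grows.

```python
def one_nn(train, x):
    # predict the label of the nearest training example (1-NN)
    return min(train, key=lambda p: abs(p[0] - x))[1]

pool = [(5, 1), (-1, 0), (-5, 0), (1, 1)]       # training data, in arrival order
holdout = [(-4, 0), (4, 1), (-0.5, 0), (0.5, 1)]  # fixed holdout set

curve = []
for n in (1, 2, 4):                             # increasing training-set sizes
    train = pool[:n]
    acc = sum(one_nn(train, x) == y for x, y in holdout) / len(holdout)
    curve.append((n, acc))
print(curve)  # accuracy rises with more training data
```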
What is the difference between a fitting graph and a learning curve?
A learning curve shows the generalization performance (the performance on testing data only) plotted against the amount of training data used.
A fitting graph shows the generalization performance as well as the performance on the training data, plotted against model complexity.
It is generally shown for a fixed amount of training data.
How do the learning curves of classification trees and logistic regression compare?
Given the same set of features, classification trees are a more flexible model representation than logistic regression.
For smaller datasets
- tree induction will tend to overfit more
- logistic regressions tend to perform better
For larger datasets
- Tree induction is often the better choice (overfitting diminishes as data grows): trees can represent substantially nonlinear relationships between the features and the target.
How can you mitigate/avoid overfitting in tree induction?
- Stop growing the tree before it gets too complex (i.e. “too pure”) by:
- Specifying a minimum number of instances
- Hypothesis testing at every leaf (p-value)
- Pruning too large tree
- Simplest method: specify a minimum number of instances that must be present in a leaf
- Tree induction will automatically grow the tree branches that have a lot of data and cut short those that have fewer data
- Pruning: Grow the tree until it is too large and then prune it back, reducing its size: cutting off leaves and branches and replacing them with leaves
- Test if replacing a set of leaves reduces accuracy in order to make decision
- Hypothesis testing: Alternative approach is to conduct a hypothesis test at every leaf to determine whether the observed difference in e.g. information gain could have been due to chance
- If the null hypothesis (the observed gain arose by chance) is rejected, tree growing continues; otherwise growth stops at that leaf
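The prune-and-test idea can be sketched as reduced-error pruning (my own minimal version on a hand-built 1-D tree, not the book's code): replace a subtree with a leaf whenever that does not lower accuracy on holdout data.

```python
def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, t, lo, hi = tree
    return predict(lo, x) if x < t else predict(hi, x)

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(tree, data):
    # reduced-error pruning: collapse a subtree into a leaf whenever
    # that is at least as accurate on the holdout data
    if tree[0] == "leaf" or not data:
        return tree
    _, t, lo, hi = tree
    lo2 = prune(lo, [(x, y) for x, y in data if x < t])
    hi2 = prune(hi, [(x, y) for x, y in data if x >= t])
    pruned = ("split", t, lo2, hi2)
    labels = [y for _, y in data]
    leaf = ("leaf", max(set(labels), key=labels.count))
    if accuracy(leaf, data) >= accuracy(pruned, data):
        return leaf
    return pruned

# overfit tree: the extra split at 3 carves out a training noise point
overfit = ("split", 5, ("split", 3, ("leaf", 0), ("leaf", 1)), ("leaf", 1))
holdout = [(1, 0), (4, 0), (2, 0), (7, 1), (8, 1)]
print(prune(overfit, holdout))  # the spurious inner split is pruned away
```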