Chapter 5 - Overfitting and its Avoidance Flashcards
What is overfitting?
The tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
What is a table model and what is the problem with it?
A table model memorizes the training data and performs NO GENERALIZATION. This is exactly its problem: the model reproduces the known data with 100% accuracy but has no predictive power on unseen data.
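A minimal sketch of the table-model idea (the function and feature names here are hypothetical, not from the chapter): memorization gives perfect recall on training examples and nothing at all on unseen ones.

```python
def fit_table_model(examples):
    """Memorize (features, label) pairs verbatim -- no generalization."""
    return {features: label for features, label in examples}

def predict(table, features, default=None):
    """Perfect recall on memorized data; unseen inputs get only a default."""
    return table.get(features, default)

# Hypothetical churn training data
train = [(("age=30", "plan=basic"), "churn"),
         (("age=45", "plan=pro"), "stay")]
model = fit_table_model(train)

print(predict(model, ("age=30", "plan=basic")))  # memorized -> "churn"
print(predict(model, ("age=50", "plan=pro")))    # unseen -> None
```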
What is generalization?
This is a property of a model or modeling process, whereby the model applies to data that was not used to build the model.
What is the fitting graph?
This is an analytical tool for recognizing overfitting. It plots the accuracy of a model as a function of model complexity.
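An illustrative sketch of the fitting-graph shape, using synthetic accuracy numbers (not real model results): training accuracy keeps rising with complexity, while holdout accuracy peaks and then degrades as the model overfits.

```python
# Synthetic, illustrative accuracies only -- chosen to show the typical shape.
complexities = range(1, 11)
train_acc = [1 - 0.5 / c for c in complexities]                   # rises monotonically
holdout_acc = [0.8 - 0.01 * (c - 5) ** 2 for c in complexities]   # peaks, then drops

# The sweet spot is where holdout (not training) accuracy is highest.
best = max(complexities, key=lambda c: holdout_acc[c - 1])
print(best)  # -> 5: intermediate complexity wins on holdout data
```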
What is holdout data?
Data not used in building the model, but for which we do know the actual value of the target variable.
How does the error rate on holdout data react to an increase in training data points?
For a pure memorizing model such as the table model, the holdout error never decreases as training data grows, because the training and holdout sets never overlap: the memorized examples simply never appear in the holdout set.
What is the base rate?
The base rate is the error rate achieved by a trivial model that always predicts the most common class. For churn, e.g., always predicting "no churn" gives an error equal to the percentage of churn cases in the population.
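A small sketch of computing the base rate, assuming a hypothetical population where 20% of customers churn: the majority-class predictor's error equals the minority-class fraction.

```python
# Hypothetical population: 80 stayers, 20 churners.
labels = ["stay"] * 80 + ["churn"] * 20

# The trivial model always predicts the most common class...
majority = max(set(labels), key=labels.count)

# ...so its error rate is the fraction of labels it gets wrong.
base_rate_error = sum(1 for y in labels if y != majority) / len(labels)
print(majority, base_rate_error)  # -> stay 0.2
```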
Why does performance degrade due to overfitting?
As models become more complex, they are allowed to pick up harmful, spurious correlations in the training data.
What is the holdout evaluation method?
This is a way of detecting overfitting by dividing the data into training data and test data (holdout data): the model is built on the training data and evaluated only on the holdout data.
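A minimal sketch of a holdout split using only the standard library (the function name and fractions are illustrative): shuffle once, carve off a test set the model never sees, and keep the two sets disjoint.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=42):
    """Shuffle, then carve off a holdout (test) set the model never sees."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = holdout_split(data)
print(len(train), len(test))        # -> 70 30
assert not set(train) & set(test)   # training and holdout sets never overlap
```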
What are the cons of a holdout set?
- It is a single estimate, so we rely too heavily on one particular split (we might simply have been lucky with a favourable test set).
What is cross-validation?
Solution to the holdout-set problem. It yields both the mean and the variance of the estimated performance. The variance is critical in assessing confidence in the performance estimate.
What are the pro’s of cross-validation?
- Makes better use of a limited dataset
- Computes estimates over all the data by performing multiple splits and swapping out samples for testing
How does cross validation work?
- Split the dataset into k partitions (typically k = 5 or 10)
- Iterate training and testing k times
- Each iteration: training data = (k-1)/k of the set, test data = 1/k
- Average the k accuracies
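The steps above can be sketched as follows, using only the standard library (the helper name is hypothetical, and a placeholder stands in for real model fitting): each of the k folds serves as the test set exactly once.

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) for k folds; each point is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Each iteration trains on (k-1)/k of the data and tests on the remaining 1/k.
scores = []
for train_idx, test_idx in kfold_indices(100, 5):
    # In practice: fit a model on train_idx, score it on test_idx.
    scores.append(len(train_idx) / 100)  # placeholder: records the split ratio

mean = sum(scores) / len(scores)
print(mean)  # -> 0.8: with k=5, each training fold holds 4/5 of the data
```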
Do you trust performance measured on a training set, and why?
No, due to the high chance of overfitting. Possible solutions: cross-validation and the holdout method.
Explain the learning curve.
This is a plot of generalization performance against the amount of training data. Learning curves are typically steep initially and flatten with bigger datasets, because the marginal advantage of additional data decreases.
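An illustrative sketch of the learning-curve shape, using a synthetic diminishing-returns formula rather than a real model: accuracy climbs steeply for small training sets and flattens as the dataset grows.

```python
import math

# Synthetic, illustrative curve only: steep early gains, then flattening.
def synthetic_accuracy(n_train):
    return 0.9 - 0.4 / math.sqrt(n_train)

for n in [10, 50, 100, 500, 1000]:
    print(n, round(synthetic_accuracy(n), 3))
```

The early jumps (10 → 50 examples) are far larger than the late ones (500 → 1000), which is the flattening the flashcard describes.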