Chapter 5 - Overfitting and its Avoidance Flashcards

1
Q

What is overfitting?

A

The tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.

2
Q

What is a table model and what is the problem with it?

A

A table model memorizes the training data and performs NO GENERALIZATION. This is exactly its problem: the model can reproduce the known training data with 100% accuracy, but it has no predictive power on unseen data.

3
Q

What is generalization?

A

This is a property of a model or modeling process, whereby the model applies to data that was not used to build the model.

4
Q

What is a fitting graph?

A

This is an analytical tool for recognizing overfitting. It plots the accuracy of a model, on both the training data and the holdout data, as a function of model complexity.
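
To make this concrete, here is a minimal sketch of drawing a fitting graph (my own illustration, assuming scikit-learn and matplotlib; the chapter itself is tool-agnostic), using tree depth as the complexity axis:

```python
# Sketch: a fitting graph, using decision-tree depth as the complexity knob.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

depths = range(1, 21)
train_acc, hold_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
    train_acc.append(tree.score(X_train, y_train))  # keeps rising with complexity
    hold_acc.append(tree.score(X_hold, y_hold))     # rises, then falls: overfitting

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, hold_acc, label="holdout accuracy")
plt.xlabel("model complexity (tree depth)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```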

5
Q

What is holdout data?

A

Data not used in building the model, but for which we do know the actual value of the target variable.

6
Q

How does the error rate of holdout data react to an increase in data points?

A

For the table model, the holdout error rate never decreases as more data points are memorized, because the training and holdout sets never overlap: no holdout instance can be looked up in the table, so holdout performance stays stuck at the base rate.

7
Q

What is the base rate?

A

The base rate is the performance of a trivial classifier that always predicts the majority class. Its error rate can be calculated as, e.g., the percentage of churn cases in the population (when churners are the minority); a table model's holdout performance is stuck at this base rate.
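
As a quick illustration (a hypothetical example, not from the book), the base rate falls out of simple counting:

```python
# Sketch: base rate of a majority-class predictor on hypothetical churn labels.
import numpy as np

y = np.array([0] * 90 + [1] * 10)           # 1 = churn (the minority class)
base_rate_accuracy = np.mean(y == 0)        # always predict "no churn"
base_rate_error = np.mean(y == 1)           # equals the share of churn cases
print(base_rate_accuracy, base_rate_error)  # 0.9 0.1
```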

8
Q

Why does performance degrade due to overfitting?

A

As a model is allowed to become more complex, it can pick up harmful spurious correlations: patterns that hold in the training data only by chance and therefore produce incorrect generalizations to new data.

9
Q

What is the holdout evaluation method?

A

This is a way of guarding against overfitting by dividing the available data into training data and test data (the holdout data): the model is built on the training data only and then evaluated on the holdout data.
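
A minimal sketch of the method (assuming scikit-learn; any train/test split works the same way):

```python
# Sketch: holdout evaluation -- fit on training data, report on holdout data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# Hold out 30% of the data; the model never sees it during training.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))  # optimistic
print("holdout accuracy: ", model.score(X_hold, y_hold))    # honest estimate
```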

10
Q

What are the cons of a holdout set?

A
It yields only a single estimate of performance, making us too dependent on one particular split; we could simply have been lucky with a favorable test set.
11
Q

What is cross-validation?

A

Cross-validation is a solution to the problems of a single holdout set. It produces not one number but the mean and the variance of the estimated performance; the variance is critical for assessing confidence in the performance estimate.
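
For instance (a sketch assuming scikit-learn's cross_val_score; the card names no tool), the mean and spread come straight from the per-fold scores:

```python
# Sketch: cross-validation yields one score per fold, so we can report
# both the mean and the spread (variance / standard deviation).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print("per-fold accuracies:", scores)
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```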

12
Q

What are the pros of cross-validation?

A
  1. Makes better use of a limited dataset.
  2. Computes its estimates over all the data, by performing multiple splits and systematically swapping out samples for testing.

13
Q

How does cross-validation work?

A
  1. Split the dataset into k partitions, called folds (typically k = 5 or 10).
  2. Iterate training and testing k times.
  3. In each iteration, use (k-1)/k of the data for training and the remaining 1/k for testing.
  4. Calculate the average accuracy over the k folds (see the sketch below).
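
A hand-rolled sketch of those four steps (scikit-learn is assumed only for the example data and the index splits):

```python
# Sketch: k-fold cross-validation spelled out step by step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

k = 5  # step 1: k partitions (folds)
accuracies = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # steps 2-3: each iteration trains on (k-1)/k and tests on the remaining 1/k
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print("average accuracy over", k, "folds:", np.mean(accuracies))  # step 4
```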
14
Q

Do you trust performance measured on a training set, and why?

A

No, due to the high chance of overfitting. Possible safeguards are cross-validation and the holdout method.

15
Q

Explain the learning curve.

A

This is a plot of generalization performance against the amount of training data used. Learning curves are typically steep initially and flatten out as the dataset grows, because the marginal advantage of additional data decreases.
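
A sketch of plotting one (assuming scikit-learn's learning_curve helper; any loop over growing training sizes would do):

```python
# Sketch: a learning curve -- generalization performance vs. training-set size.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

plt.plot(sizes, test_scores.mean(axis=1))  # typically steep at first, then flat
plt.xlabel("number of training instances")
plt.ylabel("cross-validated accuracy")
plt.show()
```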

16
Q

What is the difference between learning curves and fitting graphs?

A

A learning curve shows generalization performance ONLY ON TESTING DATA, plotted against THE AMOUNT OF TRAINING DATA.
A fitting graph shows performance on BOTH TRAINING AND TESTING DATA, plotted against model COMPLEXITY.

Fitting graphs are generally shown for a fixed amount of training data.

17
Q

Compare classification trees and linear logistic regression

A

Classification trees:
* Will overfit more on smaller datasets.
* Can work better for large datasets.
Logistic regression:
* Usually leads to better performance on smaller datasets.

18
Q

What is the deal with tree induction?

A

Main problems:

  1. The tree keeps growing to fit the training data, creating pure leaf nodes.
  2. The result is an overly complex tree that overfits the data.

Solutions:

  1. Stop growing the tree before it gets too complex.
  2. Grow the tree until it is too large, then “prune” it back to reduce its size and complexity.

Methods for these solutions:

  1. Limit tree size by specifying a minimum number of instances that must be present in a leaf.
  2. “Prune” an overly large tree (cut off leaves and branches, and replace them with leaves); see the sketch below.
    * By estimating whether replacing a set of leaves or a branch with a leaf would reduce accuracy; if not, prune.
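
Both methods map onto standard knobs of tree learners; here is a sketch assuming scikit-learn, where min_samples_leaf stops growth early and ccp_alpha applies cost-complexity pruning (one specific pruning technique) to cut a grown tree back:

```python
# Sketch: two ways to keep tree induction from overfitting.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 1. Stop growing early: require a minimum number of instances per leaf.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# 2. Grow fully, then prune back: cost-complexity pruning replaces branches
#    with leaves when doing so costs little accuracy.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print("leaves (stopped early):", stopped.get_n_leaves())
print("leaves (pruned):      ", pruned.get_n_leaves())
```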